Mohit Sabharwal created PIG-4575:
------------------------------------
Summary: Pass value to MR Partitioners in Spark engine
Key: PIG-4575
URL: https://issues.apache.org/jira/browse/PIG-4575
Project: Pig
Issue Type: Sub-task
Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
Fix For: spark-branch
Spark Partitioner#getPartition does not take 'value' as an argument.
In practice, most MR Partitioner#getPartition implementations will only use the
key and ignore the value. But not all.
In a Spark Partitioner, if the user wants to use the value, then value can made
a part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated> and then value
extracted from the key in getPartition.
One option is to add 2 extra transformations when custom partitioners are used
for a shuffle. Create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle
step (and extract value from key inside getPartition) and then transform it
back to PairRDD<Key, Value>. Doing so will increase RDD size due to duplicate
value (values tend to be large) for all cases, regardless of whether value is
used in getPartition. We could address this by only doing this if some
configuration is set (enabled by default, since null as a value is a legitimate
case which the Partitioner may be handling).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)