Mohit Sabharwal created PIG-4575:
------------------------------------

             Summary: Pass value to MR Partitioners in Spark engine
                 Key: PIG-4575
                 URL: https://issues.apache.org/jira/browse/PIG-4575
             Project: Pig
          Issue Type: Sub-task
          Components: spark
    Affects Versions: spark-branch
            Reporter: Mohit Sabharwal
            Assignee: Mohit Sabharwal
             Fix For: spark-branch


Spark Partitioner#getPartition does not take 'value' as an argument.

In practice, most MR Partitioner#getPartition implementations will only use the 
key and ignore the value. But not all.

In a Spark Partitioner, if the user wants to use the value, the value can be 
made part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated>, and then 
extracted from the key in getPartition.
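
A minimal sketch of this idea in plain Java, with no Spark or Hadoop 
dependency. MrStylePartitioner, SparkStylePartitioner, KeyWithValue and 
ValueAwareAdapter are hypothetical names that only stand in for the two 
getPartition shapes (MR: key, value, numPartitions; Spark: key only). They are 
not Pig's or Spark's actual classes.

```java
import java.util.Objects;

public class KeyWithValueSketch {

    // Stand-in for the MR shape: getPartition sees key, value and partition count.
    interface MrStylePartitioner<K, V> {
        int getPartition(K key, V value, int numPartitions);
    }

    // Stand-in for the Spark shape: getPartition sees only the key.
    interface SparkStylePartitioner {
        int getPartition(Object key);
    }

    // Hypothetical composite key that carries the value alongside the real key.
    static final class KeyWithValue<K, V> {
        final K key;
        final V value;

        KeyWithValue(K key, V value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof KeyWithValue)) return false;
            KeyWithValue<?, ?> other = (KeyWithValue<?, ?>) o;
            return Objects.equals(key, other.key) && Objects.equals(value, other.value);
        }

        @Override
        public int hashCode() {
            return Objects.hash(key, value);
        }
    }

    // Adapts an MR-style partitioner to the key-only shape by unpacking the composite key.
    static final class ValueAwareAdapter<K, V> implements SparkStylePartitioner {
        private final MrStylePartitioner<K, V> delegate;
        private final int numPartitions;

        ValueAwareAdapter(MrStylePartitioner<K, V> delegate, int numPartitions) {
            this.delegate = delegate;
            this.numPartitions = numPartitions;
        }

        @Override
        @SuppressWarnings("unchecked")
        public int getPartition(Object key) {
            KeyWithValue<K, V> kv = (KeyWithValue<K, V>) key;
            return delegate.getPartition(kv.key, kv.value, numPartitions);
        }
    }

    // Example: an MR-style partitioner that partitions on the value, tolerating null.
    static int partitionFor(String key, Integer value, int numPartitions) {
        MrStylePartitioner<String, Integer> byValue =
                (k, v, n) -> v == null ? 0 : Math.floorMod(v, n);
        return new ValueAwareAdapter<>(byValue, numPartitions)
                .getPartition(new KeyWithValue<>(key, value));
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a", 7, 4));
        System.out.println(partitionFor("a", null, 4));
    }
}
```

The adapter leaves the MR partitioner unchanged. Only the key type seen by the 
shuffle changes.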

One option is to add two extra transformations when custom partitioners are 
used for a shuffle: create a PairRDD<KeyWithValue, ValueRepeated> before the 
shuffle step (and extract the value from the key inside getPartition), then 
transform it back to PairRDD<Key, Value> afterwards. Doing so will increase RDD 
size in all cases, because the value is duplicated (and values tend to be 
large), regardless of whether the value is used in getPartition. We could 
address this by doing the extra work only if some configuration is set (enabled 
by default, since null as a value is a legitimate case which the Partitioner 
may be handling).
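
The wrap, shuffle, unwrap sequence can be sketched as below. This is a plain 
Java simulation over in-memory lists, not Spark code. The shuffle method, the 
passValueToPartitioner flag (standing in for the yet-to-be-named configuration) 
and the choice to pass null when the flag is off are all assumptions made for 
illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class ShuffleWrapSketch {

    // Stand-in for the MR shape: getPartition sees key, value and partition count.
    interface MrStylePartitioner<K, V> {
        int getPartition(K key, V value, int numPartitions);
    }

    // Simulates a shuffle and returns one bucket of records per partition.
    static <K, V> List<List<Entry<K, V>>> shuffle(
            List<Entry<K, V>> records,
            MrStylePartitioner<K, V> partitioner,
            int numPartitions,
            boolean passValueToPartitioner) {
        List<List<Entry<K, V>>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Entry<K, V> record : records) {
            if (passValueToPartitioner) {
                // Extra transformation 1: wrap as (KeyWithValue, ValueRepeated).
                // The value now appears twice, which is the size cost noted above.
                Entry<Entry<K, V>, V> wrapped = new SimpleEntry<>(record, record.getValue());
                // The partitioner only receives the composite key and unpacks it.
                Entry<K, V> compositeKey = wrapped.getKey();
                int p = partitioner.getPartition(
                        compositeKey.getKey(), compositeKey.getValue(), numPartitions);
                // Extra transformation 2: unwrap back to (Key, Value).
                buckets.get(p).add(new SimpleEntry<>(compositeKey.getKey(), wrapped.getValue()));
            } else {
                // Assumption: with the flag off, the partitioner is handed null as the value.
                int p = partitioner.getPartition(record.getKey(), null, numPartitions);
                buckets.get(p).add(record);
            }
        }
        return buckets;
    }

    // Returns the partition that a single ("a", 5) record lands in, out of 4.
    static int partitionOfSingleRecord(boolean passValueToPartitioner) {
        MrStylePartitioner<String, Integer> byValue =
                (k, v, n) -> v == null ? 0 : Math.floorMod(v, n);
        List<Entry<String, Integer>> records = new ArrayList<>();
        records.add(new SimpleEntry<>("a", 5));
        List<List<Entry<String, Integer>>> buckets =
                shuffle(records, byValue, 4, passValueToPartitioner);
        for (int i = 0; i < buckets.size(); i++) {
            if (!buckets.get(i).isEmpty()) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(partitionOfSingleRecord(true));
        System.out.println(partitionOfSingleRecord(false));
    }
}
```

With the flag off, a value-dependent partitioner cannot tell a missing value 
from a genuinely null one, which is the reason given above for enabling the 
behaviour by default.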



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
