[jira] [Created] (PIG-4575) Pass value to MR Partitioners in Spark engine

Mohit Sabharwal (JIRA) Tue, 26 May 2015 16:53:40 -0700

Mohit Sabharwal created PIG-4575:
------------------------------------

             Summary: Pass value to MR Partitioners in Spark engine
                 Key: PIG-4575
                 URL: https://issues.apache.org/jira/browse/PIG-4575
             Project: Pig
          Issue Type: Sub-task
          Components: spark
    Affects Versions: spark-branch
            Reporter: Mohit Sabharwal
            Assignee: Mohit Sabharwal
             Fix For: spark-branch



Spark Partitioner#getPartition takes only the key. Unlike MR 
Partitioner#getPartition, it does not take 'value' as an argument.

In practice, most MR Partitioner#getPartition implementations will only use the 
key and ignore the value. But not all.

In a Spark Partitioner, if the user wants to use the value, then the value can 
be made part of the key, i.e. PairRDD<KeyWithValue, ValueRepeated>, and then 
extracted from the key in getPartition.

One option is to add 2 extra transformations when custom partitioners are used 
for a shuffle: create a PairRDD<KeyWithValue, ValueRepeated> before the shuffle 
step (and extract the value from the key inside getPartition), and then 
transform it back to PairRDD<Key, Value> afterwards. Doing so increases RDD 
size in all cases, because every value is duplicated (and values tend to be 
large), regardless of whether the value is used in getPartition. We could 
address this by doing it only if some configuration is set (enabled by 
default, since null as a value is a legitimate case which the Partitioner may 
be handling).
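
To make the wrap/unwrap idea concrete, here is a minimal sketch in plain Java. 
It does not use the Spark or Hadoop classes: MRStylePartitioner, KeyWithValue, 
KeyOnlyAdapter and WrapUnwrap are hypothetical names introduced only for 
illustration, and java.util lists stand in for the PairRDDs. It is not the 
actual patch, just one way the two extra transformations could look.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Stand-in for the MR-style contract: partitioning may use key and value.
interface MRStylePartitioner<K, V> {
    int getPartition(K key, V value, int numPartitions);
}

// Hypothetical composite key that carries the value next to the original key.
final class KeyWithValue<K, V> {
    final K key;
    final V value;

    KeyWithValue(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

// Stand-in for a Spark-style partitioner, which is handed only the key.
// It unpacks the composite key and delegates to the MR-style partitioner.
final class KeyOnlyAdapter<K, V> {
    private final MRStylePartitioner<K, V> delegate;
    private final int numPartitions;

    KeyOnlyAdapter(MRStylePartitioner<K, V> delegate, int numPartitions) {
        this.delegate = delegate;
        this.numPartitions = numPartitions;
    }

    @SuppressWarnings("unchecked")
    int getPartition(Object key) {
        KeyWithValue<K, V> kv = (KeyWithValue<K, V>) key;
        return delegate.getPartition(kv.key, kv.value, numPartitions);
    }
}

final class WrapUnwrap {
    // Transformation 1 (before the shuffle): (Key, Value) -> (KeyWithValue, Value).
    static <K, V> List<Map.Entry<KeyWithValue<K, V>, V>> wrap(List<Map.Entry<K, V>> pairs) {
        List<Map.Entry<KeyWithValue<K, V>, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> p : pairs) {
            KeyWithValue<K, V> composite = new KeyWithValue<>(p.getKey(), p.getValue());
            out.add(new SimpleEntry<>(composite, p.getValue()));
        }
        return out;
    }

    // Transformation 2 (after the shuffle): (KeyWithValue, Value) -> (Key, Value).
    static <K, V> List<Map.Entry<K, V>> unwrap(List<Map.Entry<KeyWithValue<K, V>, V>> wrapped) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<KeyWithValue<K, V>, V> w : wrapped) {
            out.add(new SimpleEntry<>(w.getKey().key, w.getValue()));
        }
        return out;
    }
}
```

Note that the value appears twice in every wrapped pair, which is the size 
overhead described above, and that a null value is passed through to the 
delegate unchanged so that a Partitioner which handles null still sees it.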



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
