Mohit Sabharwal created PIG-4565:
------------------------------------
Summary: Support custom MR partitioners for Spark engine
Key: PIG-4565
URL: https://issues.apache.org/jira/browse/PIG-4565
Project: Pig
Issue Type: Sub-task
Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
Fix For: spark-branch
Shuffle operations like DISTINCT, GROUP, JOIN, CROSS allow custom MR
partitioners to be specified.
Example:
{code}
B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    //@Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        } else {
            return (key.hashCode()) % numPartitions;
        }
    }
}
{code}
Since Spark's shuffle APIs take a different partitioner class
(org.apache.spark.Partitioner) than MapReduce
(org.apache.hadoop.mapreduce.Partitioner), we need to wrap custom partitioners
written for MapReduce inside a Spark Partitioner.
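A minimal sketch of what such a wrapper could look like, not the actual patch. To keep it self-contained, MrPartitioner and SparkPartitioner are stand-ins that only mirror the method shapes of org.apache.hadoop.mapreduce.Partitioner and org.apache.spark.Partitioner. MrPartitionerWrapper and ModPartitioner are hypothetical names. Spark's partitioner is handed only the key, so the sketch assumes the wrapped partitioner ignores the value and passes null for it.

```java
// Stand-in mirroring org.apache.hadoop.mapreduce.Partitioner<K, V>.
abstract class MrPartitioner<K, V> {
    abstract int getPartition(K key, V value, int numPartitions);
}

// Stand-in mirroring org.apache.spark.Partitioner.
abstract class SparkPartitioner {
    abstract int numPartitions();
    abstract int getPartition(Object key);
}

// Hypothetical adapter: exposes an MR-style partitioner through the Spark-style API.
final class MrPartitionerWrapper<K, V> extends SparkPartitioner {
    private final MrPartitioner<K, V> delegate;
    private final int numPartitions;

    MrPartitionerWrapper(MrPartitioner<K, V> delegate, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.delegate = delegate;
        this.numPartitions = numPartitions;
    }

    @Override
    int numPartitions() {
        return numPartitions;
    }

    @Override
    @SuppressWarnings("unchecked")
    int getPartition(Object key) {
        // Assumption: the MR partitioner does not read the value, so null is passed.
        int p = delegate.getPartition((K) key, null, numPartitions);
        // Spark expects an index in [0, numPartitions); normalize negative results.
        return Math.floorMod(p, numPartitions);
    }
}

// Example MR-style partitioner with the same logic as SimpleCustomPartitioner,
// written against plain Objects instead of PigNullableWritable.
final class ModPartitioner extends MrPartitioner<Object, Object> {
    @Override
    int getPartition(Object key, Object value, int numPartitions) {
        if (key instanceof Integer) {
            return ((Integer) key) % numPartitions;
        }
        return key.hashCode() % numPartitions;
    }
}
```

The Math.floorMod call is a design choice of this sketch: the % operator in Java can return a negative number for a negative key or hash code, while a Spark partition index has to be non-negative.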