liyunzhang_intel created PIG-4504:
-------------------------------------

             Summary: Enable Secondary key sort feature in spark mode
                 Key: PIG-4504
                 URL: https://issues.apache.org/jira/browse/PIG-4504
             Project: Pig
          Issue Type: New Feature
          Components: spark
            Reporter: liyunzhang_intel
            Assignee: liyunzhang_intel


*Some knowledge about secondary key sort:*
The MapReduce framework automatically sorts the keys generated by mappers: 
before the reducers start, all intermediate (key, value) pairs are sorted by 
key, not by value. The values passed to each reducer are not sorted and can 
arrive in any order. To have the values sorted as well, turn each (key, value) 
pair into a ((key, value), null) pair. Here (key, value) is the compound key, 
key is the first key, and value is the secondary key. During the shuffle, three 
JobConf settings work together. PartitionerClass sends pairs with the same 
first key to the same partition. SortComparatorClass sorts pairs with the same 
first key by their secondary key. GroupingComparatorClass passes pairs with the 
same first key but different secondary keys to the same reduce function. 
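
The three roles described above can be sketched in plain Java, without any 
Hadoop dependency. This is only an illustration under stated assumptions, not 
Pig's or Hadoop's actual code: the class SecondarySortSketch, the CompoundKey 
type, and the partition/shuffle helpers are hypothetical names, and the shuffle 
is simulated in memory for a single partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    /** Hypothetical compound key: `first` is the first key, `second` the secondary key. */
    static final class CompoundKey {
        final String first;
        final int second;

        CompoundKey(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    /** Partitioner role: route by the first key only, ignoring the secondary key. */
    static int partition(CompoundKey k, int numPartitions) {
        return (k.first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sort comparator role: order by first key, then by secondary key. */
    static final Comparator<CompoundKey> SORT =
        Comparator.comparing((CompoundKey k) -> k.first)
                  .thenComparingInt(k -> k.second);

    /** Grouping comparator role: two keys belong together when their first keys match. */
    static final Comparator<CompoundKey> GROUP =
        Comparator.comparing((CompoundKey k) -> k.first);

    /** Simulates the shuffle for one partition: sort, then group consecutive keys. */
    static Map<String, List<Integer>> shuffle(List<CompoundKey> keys) {
        List<CompoundKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompoundKey prev = null;
        for (CompoundKey k : sorted) {
            // A new group (one reduce call) starts when the first key changes.
            if (prev == null || GROUP.compare(prev, k) != 0) {
                groups.put(k.first, new ArrayList<>());
            }
            groups.get(k.first).add(k.second);
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompoundKey> keys = Arrays.asList(
            new CompoundKey("b", 3), new CompoundKey("a", 2),
            new CompoundKey("b", 1), new CompoundKey("a", 1));
        System.out.println(shuffle(keys)); // prints {a=[1, 2], b=[1, 3]}
    }
}
```

Each group sees its secondary keys already in order, which is what a reducer 
relies on when secondary key sort is enabled.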

*How does Pig implement secondary key sort in MapReduce mode?*
In MapReduce mode, Pig implements secondary key sort by setting 
GroupingComparatorClass, PartitionerClass, and SortComparatorClass in 
[JobControlCompiler#getJob|https://github.com/kellyzly/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L915]

*An example that uses secondary key sort:*
TestAccumulator#testAccumWithSort

Currently, the secondary key sort feature is not implemented in spark mode.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
