[jira] [Updated] (PIG-4504) Enable Secondary key sort feature in spark mode

Xuefu Zhang (JIRA) Fri, 15 May 2015 06:01:28 -0700

     [ https://issues.apache.org/jira/browse/PIG-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Xuefu Zhang updated PIG-4504:
-----------------------------
       Resolution: Fixed
    Fix Version/s: spark-branch
           Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Enable Secondary key sort feature in spark mode
> -----------------------------------------------
>
>                 Key: PIG-4504
>                 URL: https://issues.apache.org/jira/browse/PIG-4504
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4504.patch, PIG-4504_2.patch, PIG-4504_3.patch, 
> PIG-4504_4.patch, PIG-4504_5.patch, PIG-4504_6.patch, PIG-4504_7.patch, 
> SecondaryKeySort_design_doc (1).docx, 
> Why_need_split_PoLocalRearrange_POGlobalRearrange_POPackage_into_two_SparkNodes_in_sparkPlan.docx
>
>
> *Some background on secondary key sort:*
> The MapReduce framework automatically sorts the keys generated by mappers. 
> This means that, before the reducers start, all intermediate (key, value) 
> pairs generated by mappers are sorted by key (and not by value). The values 
> passed to each reducer are not sorted and can arrive in any order. However, 
> if we turn each (key, value) pair into a compound key, so that (key, value) 
> becomes ((key, value), null), the framework can sort on the value as well. 
> Here (key, value) is the compound key: key is the first key and value is the 
> secondary key. During the shuffle, pairs with the same first key are assigned 
> to the same partition by setting PartitionerClass in the JobConf. Pairs with 
> the same first key but different secondary keys are sorted during the shuffle 
> by setting SortComparatorClass in the JobConf. Pairs with the same first key 
> but different secondary keys are passed to the same reduce call by setting 
> GroupingComparatorClass in the JobConf. 
> *How does Pig implement secondary key sort in MapReduce mode?*
> In MR mode, Pig implements secondary key sort by setting 
> GroupingComparatorClass, PartitionerClass, and SortComparatorClass in 
> [JobControlCompiler#getJob|https://github.com/kellyzly/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L915]
> *An example that uses secondary key sort:*
> TestAccumulator#testAccumWithSort
> Currently, the secondary key sort feature is not implemented in Spark mode.
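
The compound-key mechanism described in the quoted issue can be sketched in plain Java, without Hadoop or Pig on the classpath. This is only an illustration under stated assumptions: the class SecondarySortSketch, its CompoundKey type, and the methods partition, sameGroup, and shuffle are hypothetical names invented here, and they merely play the roles that PartitionerClass, SortComparatorClass, and GroupingComparatorClass play in a real job. It is not the code from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from Pig or Hadoop.
class SecondarySortSketch {

    // Compound key: "first" is the original key, "second" is the original value.
    static final class CompoundKey {
        final String first;
        final int second;

        CompoundKey(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    // Role of PartitionerClass: choose the partition from the first key only,
    // so every pair with the same first key lands in the same partition.
    static int partition(CompoundKey k, int numPartitions) {
        return (k.first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Role of SortComparatorClass: order by first key, then by secondary key.
    static final Comparator<CompoundKey> SORT =
            Comparator.comparing((CompoundKey k) -> k.first)
                      .thenComparingInt(k -> k.second);

    // Role of GroupingComparatorClass: two keys belong to the same reduce call
    // when their first keys match, regardless of the secondary key.
    static boolean sameGroup(CompoundKey a, CompoundKey b) {
        return a.first.equals(b.first);
    }

    // Simulated shuffle: partition, sort within each partition, then group.
    // Returns first key -> secondary keys in the order a reducer would see them.
    static Map<String, List<Integer>> shuffle(List<CompoundKey> input, int numPartitions) {
        List<List<CompoundKey>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (CompoundKey k : input) {
            partitions.get(partition(k, numPartitions)).add(k);
        }

        Map<String, List<Integer>> reduceCalls = new LinkedHashMap<>();
        for (List<CompoundKey> p : partitions) {
            p.sort(SORT);
            CompoundKey prev = null;
            List<Integer> current = null;
            for (CompoundKey k : p) {
                // A new reduce call starts whenever the grouping comparison fails.
                if (prev == null || !sameGroup(prev, k)) {
                    current = new ArrayList<>();
                    reduceCalls.put(k.first, current);
                }
                current.add(k.second);
                prev = k;
            }
        }
        return reduceCalls;
    }

    public static void main(String[] args) {
        List<CompoundKey> input = Arrays.asList(
                new CompoundKey("a", 3), new CompoundKey("b", 2),
                new CompoundKey("a", 1), new CompoundKey("b", 5),
                new CompoundKey("a", 2));
        // Each first key maps to its secondary keys in ascending order.
        System.out.println(shuffle(input, 2));
    }
}
```

In this sketch the sort happens inside the simulated shuffle, so the per-key lists arrive already ordered and the reduce side never has to buffer and sort them. That is the property the secondary key sort feature is meant to provide.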



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
