[
https://issues.apache.org/jira/browse/MAPREDUCE-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184107#comment-13184107
]
anty.rao commented on MAPREDUCE-3397:
-------------------------------------
The submitted path do not yet eliminated the shuffle phase barrier?
> Support no sort dataflow in map output and reduce merge phrase
> --------------------------------------------------------------
>
> Key: MAPREDUCE-3397
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3397
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: task
> Affects Versions: 0.20.205.0
> Reporter: Binglin Chang
> Assignee: Binglin Chang
> Attachments: MAPREDUCE-3397-nosort.v1.patch
>
>
> In our experience, many data aggregation style queries/jobs don't need to
> sort the intermediate data. In fact reducer side can use hashmap or even
> array to do application level aggregations. For example, consider computing
> CTR using display log & click log in sponsored search. Map side just emit
> (adv_id, clk_cnt, dis_cnt), reduce side aggregate clk_cnt and dis_cnt for
> every adv_id, cause adv_id is integer, we can partition adv_id by range:
> ** reduce0: 0-100000
> ** reduce1: 100000-200000
> ** ...
> ** reduceM: xxx-max adv-id
> Then the reducer can use an array(for example: int [1000000][2]) to store the
> aggregated clk_cnt & dis_cnt, and we don't need the framework to sort
> intermediate data anymore.
> By supporting no sort, we can gain a lot of performance improvements:
> # Eliminate map side sort & merge.
> KV paris need to sort by partition first, but this can be done using a
> liner time counting sort, which is much faster than quick sort.
> Just merge spill segments one by one, doesn't need to use heap merge.
> # Eliminate shuffle phrase barrier, reducer can start to processing data
> before all map output data are copied & merged.
> For most cases, memory won't be a problem, cause keys are divided to many
> partitions, each reducers only process a small subset of the global key set.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira