[jira] [Commented] (MAPREDUCE-2354) Shuffle should be optimized

Leitao Guo (JIRA) Sat, 05 Jan 2013 22:06:19 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545322#comment-13545322
 ]


Leitao Guo commented on MAPREDUCE-2354:
---------------------------------------

Meng, I'm interested in your patch on shuffle optimization,  could you upload 
the patch? thanks!
                
> Shuffle should be optimized
> ---------------------------
>
>                 Key: MAPREDUCE-2354
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2354
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.1
>            Reporter: MengWang
>              Labels: mapreduce, shuffle
>             Fix For: 0.24.0
>
>
> Our study shows that shuffle is a performance bottleneck of mapreduce 
> computing. There are some problems of shuffle:
> (1)Shuffle and reduce are tightly-coupled, usually shuffle phase doesn't 
> consume too much memory and CPU, so theoretically, reducetasks's slot can be 
> used for other computing tasks when copying data from maps. This method will 
> enhance cluster utilization. Furthermore, should shuffle be separated from 
> reduce? Then shuffle will not use reduce's slot,we need't distinguish between 
> map slots and reduce slots at all.
> (2)For large jobs, shuffle will use too many network connections, Data 
> transmitted by each network connection is very little, which is inefficient. 
> From 0.21.0 one connection can transfer several map outputs, but i think this 
> is not enough. Maybe we can use a per node shuffle client progress(like 
> tasktracker) to shuffle data for all reduce tasks on this node, then we can 
> shuffle more data trough one connection.
> (3)Too many concurrent connections will cause shuffle server do massive 
> random IO, which is inefficient. Maybe we can aggregate http request(like 
> delay scheduler), then random IO will be sequential.
> (4)How to manage memory used by shuffle efficiently. We use buddy memory 
> allocation, which will waste a considerable amount of memory.
> (5)If shuffle separated from reduce, then we must figure out how to do reduce 
> locality?
> (6)Can we store map outputs in a Storage system(like hdfs)?
> (7)Can shuffle be a general data transfer service, which not only for 
> map/reduce paradigm?
>   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-2354) Shuffle should be optimized

Reply via email to