[ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540173 ]

Devaraj Das commented on HADOOP-1965:
-------------------------------------

Doug, your comment about pushing the sorting to the maps is not clear to 
me. Each map, both in the current framework and with this issue, sorts the 
outputs that it produces for the reduces. Even if there are multiple spills, 
the final output per reduce is sorted on the map side (the maps do a final 
merge of the spills). The amount of data sorted on the map side depends on 
the value of split.getLength().
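
To make the spill-then-merge behaviour concrete, here is a rough sketch in 
Python. It is not the actual map task code: make_spill, final_merge and 
toy_partition are hypothetical names, and the partition function is a 
stand-in chosen only so that the example is deterministic.

```python
import heapq
from operator import itemgetter


def toy_partition(key, num_reduces):
    # Hypothetical stand-in for a partitioner; deterministic for the example.
    return sum(key.encode("utf-8")) % num_reduces


def make_spill(records, num_reduces):
    """Model one spill: one key-sorted segment per reduce partition."""
    spill = {p: [] for p in range(num_reduces)}
    for key, value in records:
        spill[toy_partition(key, num_reduces)].append((key, value))
    for segment in spill.values():
        segment.sort(key=itemgetter(0))
    return spill


def final_merge(spills, num_reduces):
    """Merge the already-sorted segments of every spill, per partition."""
    return {
        p: list(heapq.merge(*(s[p] for s in spills), key=itemgetter(0)))
        for p in range(num_reduces)
    }


# Two buffer-fulls of map output, spilled separately and then merged.
spill_a = make_spill([("b", 1), ("a", 2), ("d", 3)], num_reduces=2)
spill_b = make_spill([("c", 4), ("a", 5)], num_reduces=2)
merged = final_merge([spill_a, spill_b], num_reduces=2)
```

However many spills there are, each reduce still gets a single sorted run 
from the map, because the final merge only combines segments that are 
already sorted.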

My one concern with this issue is that, for a constant io.sort.mb, we 
double the number of seeks for the final merge of the spill files compared 
to the current framework. This is because we use 50% of the io.sort.mb 
space for sort/spill and the other 50% for collecting, so each spill is 
half the size and there are twice as many spill files to merge. The extra 
seeks can be avoided by keeping the spill file handles open during the 
merge, but we might then run into the issues discussed in HADOOP-874. 
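
A back-of-the-envelope version of that arithmetic, in Python. The cost 
model is my assumption, not a measurement: it charges one seek per spill 
file per reduce partition during the final merge, and the sizes (100 MB 
io.sort.mb, 800 MB of map output, 10 reduces) are made-up example values.

```python
import math


def spill_count(map_output_mb, spill_buffer_mb):
    # Each time the spill buffer fills up, one spill file is written.
    return math.ceil(map_output_mb / spill_buffer_mb)


def merge_seeks(map_output_mb, spill_buffer_mb, num_reduces):
    # Assumed model: the final merge seeks once into every spill file
    # for every reduce partition.
    return spill_count(map_output_mb, spill_buffer_mb) * num_reduces


io_sort_mb = 100      # example value only
map_output_mb = 800   # example value only
num_reduces = 10      # example value only

# Current framework: the whole io.sort.mb is filled, then sorted and spilled.
seeks_single_buffer = merge_seeks(map_output_mb, io_sort_mb, num_reduces)

# With this issue: only half of io.sort.mb is sorted and spilled at a time.
seeks_double_buffer = merge_seeks(map_output_mb, io_sort_mb / 2, num_reduces)
```

Under this model the halved spill size gives 16 spills instead of 8, and so 
160 seeks instead of 80 for the same map output.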

What do others think?

> Handle map output buffers better
> --------------------------------
>
>                 Key: HADOOP-1965
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1965
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>             Fix For: 0.16.0
>
>         Attachments: 1965_single_proc_150mb_gziped.jpeg, 
> 1965_single_proc_150mb_gziped.pdf, 1965_single_proc_150mb_gziped_breakup.png, 
> HADOOP-1965-1.patch
>
>
> Today, the map task stops calling the map method while sort/spill is using 
> the (single instance of) map output buffer. One way to improve the 
> performance of the map task is to have another buffer for writing the map 
> outputs to, while sort/spill is using the first buffer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
