[jira] Issue Comment Edited: (HADOOP-1965) Handle map output buffers better

Devaraj Das (JIRA) Mon, 05 Nov 2007 02:27:21 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540173 ]


devaraj edited comment on HADOOP-1965 at 11/5/07 2:26 AM:
--------------------------------------------------------------

Doug, your comment about pushing the sorting to the maps is not clear to me. 
Each map, both in the current framework and with this issue's patch, sorts the 
outputs that it produces for the reduces. Even if there are multiple spills, 
the final output per reduce is sorted on the map side (the maps do a final 
merge of the spills). The amount of data sorted on the map side depends on 
the value of split.getLength().

My one concern with this issue is that, for a constant io.sort.mb, the 
framework with this patch will produce roughly twice as many spills as the 
framework as it exists now, because the patch uses 50% of the io.sort.mb 
space for sort/spill and the other 50% for collecting. That will double the 
number of seeks for the final merge of the spill files. The extra seeks can 
be avoided by keeping the spill files' handles open during the merge, but we 
might then run into the issues discussed in HADOOP-874.
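
The spill arithmetic above can be sketched roughly as follows. This is only 
an illustration, not code from the patch: the class name SpillEstimate, the 
method spillCount, and the sizes used in main are hypothetical, and the model 
assumes a spill happens each time the usable part of the buffer fills, 
ignoring per-record bookkeeping overhead.

```java
public class SpillEstimate {

    // Number of spills needed to drain outputBytes through a buffer that
    // spills every time usableBytes have been collected (ceiling division).
    // Simplified model: ignores per-record metadata and any spill threshold.
    static long spillCount(long outputBytes, long usableBytes) {
        if (usableBytes <= 0) {
            throw new IllegalArgumentException("usableBytes must be positive");
        }
        return (outputBytes + usableBytes - 1) / usableBytes;
    }

    public static void main(String[] args) {
        long ioSortBytes = 100L * 1024 * 1024;      // hypothetical io.sort.mb of 100
        long mapOutputBytes = 1024L * 1024 * 1024;  // hypothetical 1 GB of map output

        // Current framework: the whole buffer is collected into, then sorted/spilled.
        long current = spillCount(mapOutputBytes, ioSortBytes);

        // With the patch: half the buffer collects while the other half sorts/spills.
        long patched = spillCount(mapOutputBytes, ioSortBytes / 2);

        System.out.println("spills, whole buffer: " + current);
        System.out.println("spills, half buffer:  " + patched);
    }
}
```

Under this model the final merge reads from about twice as many spill files 
per reduce partition, which is where the doubling of seeks comes from.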

What do others think?

      was (Author: devaraj):
    Doug, your comment about pushing the sorting to the maps is not clear to 
me. Each map, both in the current framework and with this issue's patch, sorts 
the outputs that it produces for the reduces. Even if there are multiple 
spills, the final output per reduce is sorted on the map side (the maps do a 
final merge of the spills). The amount of data sorted on the map side depends 
on the value of split.getLength().

My one concern with this issue is that, for a constant io.sort.mb, we double 
the number of seeks for the final merge of the spill files compared with the 
number of seeks in the current framework. This is because we use 50% of the 
io.sort.mb space for sort/spill and the other 50% for collecting. The extra 
seeks can be avoided by keeping the spill files' handles open during the 
merge, but we might then run into the issues discussed in HADOOP-874.

What do others think?
  
> Handle map output buffers better
> --------------------------------
>
>                 Key: HADOOP-1965
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1965
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>             Fix For: 0.16.0
>
>         Attachments: 1965_single_proc_150mb_gziped.jpeg, 
> 1965_single_proc_150mb_gziped.pdf, 1965_single_proc_150mb_gziped_breakup.png, 
> HADOOP-1965-1.patch
>
>
> Today, the map task stops calling the map method while sort/spill is using 
> the (single instance of the) map output buffer. One way to improve the 
> performance of the map task is to have a second buffer that the map outputs 
> are written to while sort/spill is using the first buffer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
