[ 
http://issues.apache.org/jira/browse/HADOOP-590?page=comments#action_12441219 ] 
            
Runping Qi commented on HADOOP-590:
-----------------------------------


I observed from my currently running job that the throughput of the sortPass 
(sorting map output files into runs) is a bit higher than that of the mergePass. 
I believe three factors contribute to that: first, the mergePass reads from one 
disk while the sortPass reads from 4 disks; second, the mergePass reads from and 
writes to the same disk; third, the pass factor for mergePass is 400, which 
may be too high.


Instead of writing the intermediate files into one big file on one disk, if 
sortPass and mergePass (other than the last pass) write d files, where d is the 
number of usable disks, then the next pass will be able to fully utilize the 
available disks.
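
A minimal sketch of the idea above, not the actual SequenceFile code: each 
non-final pass plans d output files, one per usable temp directory, so the 
next pass can read from all disks in parallel. The class name, method names, 
directory names and file naming scheme below are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: plans where a pass writes its output.
class SpreadRuns {

    // Path of the i-th intermediate file of a pass; temp dirs are chosen round-robin,
    // so consecutive files land on different disks.
    static String runPath(List<String> tempDirs, int pass, int i) {
        String dir = tempDirs.get(i % tempDirs.size());
        return dir + "/pass" + pass + "-run" + i + ".tmp";
    }

    // One output file per usable disk (d files in total) for a non-final pass.
    static List<String> planPass(List<String> tempDirs, int pass) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tempDirs.size(); i++) {
            out.add(runPath(tempDirs, pass, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed layout: four temp directories, each on its own disk.
        List<String> dirs = Arrays.asList(
                "/disk1/tmp", "/disk2/tmp", "/disk3/tmp", "/disk4/tmp");
        for (String path : planPass(dirs, 0)) {
            System.out.println(path);
        }
    }
}
```

How the records are split among the d files is left open here; the sketch 
only shows the placement of the files.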



> Reducer's pass merger should utilize temporary directories on different disks
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-590
>                 URL: http://issues.apache.org/jira/browse/HADOOP-590
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Runping Qi
>
> The current implementation of the pass merge in the SequenceFile class uses the 
> same temp directory for the input and output files of the pass merger, even 
> when multiple temp dirs are available. Thus, it cannot take full advantage of 
> multiple disks during sort.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
