[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597694#comment-13597694
 ] 

Ravi Prakash commented on MAPREDUCE-3685:
-----------------------------------------

Hi Mariappan!

Thanks a lot for your review and comments. I'm sure this is not the last JIRA 
to go into MergeManagerImpl. :-) If we spot something, let's definitely open a 
new one.

* Here's my understanding of how the merge works. Let's assume all the map 
outputs are on disk, and that io.sort.factor is set to X. Since we don't want 
to do any more merges than necessary, we try to ensure that in the last 
*final* merge there will be X streams to merge. This means that as the reducer 
fetches map outputs, we wait until there are at least 2X-1 of them (we don't 
know how many map outputs we will really get, because some maps may not have 
produced any output). Once the number reaches 2X-1, we can be sure that we 
need an intermediate merge of X streams. This leaves X-1 in onDiskMapOutputs. 
The X streams are merged into 1, so after this merge we have 
(X-1) + 1 = X streams in total. When the number of streams is > X and < 2X-1, 
we let the code go to finalMerge, which eventually calls 
{code:title=Merger.java:645|borderStyle=solid}
if (numSegments <= factor) {
  // No extra merge needed
} else {
  // Do a merge of (number of map outputs) - X
}
{code}
So from my understanding, 2X-1 seems to be the correct number. Please let me 
know if you still think it's not.

* Hmmm... I didn't know you could give a TreeSet a partial ordering and still 
get sorted output. The latest javadocs don't say anything about it, but I 
found a StackOverflow post saying it used to be the case in JDK 1.2. Do you 
know if that is still true?

* Unfortunately I haven't run any performance tests. Hopefully we will get the 
fix onto our clusters soon, and if we see incredible improvements in 
performance, I'll try to remember to report back here. This should probably 
help on large jobs with a lot of maps, and we do have quite a few of those :-)
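
To make the 2X-1 arithmetic from the first point concrete, here is a rough 
sketch of the counting only. It is not the MergeManagerImpl or Merger code; 
the class and method names are made up for illustration, and the only thing 
taken from the discussion is the numSegments <= factor check and the 2X-1 
threshold.

```java
public class MergeThresholdSketch {

    // Hypothetical helper: number of on-disk streams left after one
    // intermediate merge that fires once 2X-1 map outputs have accumulated.
    static int streamsAfterIntermediateMerge(int factor) {
        int onDisk = 2 * factor - 1; // threshold that triggers the merge
        onDisk -= factor;            // X streams are pulled out to be merged
        onDisk += 1;                 // the merged result goes back as one stream
        return onDisk;
    }

    // Hypothetical helper with the same shape as the check quoted above:
    // an extra pass is needed only when there are more segments than the factor.
    static boolean needsExtraMergePass(int numSegments, int factor) {
        return numSegments > factor;
    }

    public static void main(String[] args) {
        int factor = 10; // stands in for io.sort.factor (X)
        int remaining = streamsAfterIntermediateMerge(factor);

        System.out.println("threshold (2X-1) = " + (2 * factor - 1));
        System.out.println("streams after intermediate merge = " + remaining);
        System.out.println("extra pass needed with X streams? "
                + needsExtraMergePass(remaining, factor));
        System.out.println("extra pass needed with X+1 streams? "
                + needsExtraMergePass(factor + 1, factor));
    }
}
```

With X = 10 the threshold is 19, and the intermediate merge leaves 
9 + 1 = 10 streams, so the final merge needs no extra pass. With 11 streams 
and no intermediate merge, the check falls into the else branch.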
                
> There are some bugs in implementation of MergeManager
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-3685
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: anty.rao
>            Assignee: anty
>            Priority: Critical
>             Fix For: 0.23.7, 2.0.5-beta
>
>         Attachments: MAPREDUCE-3685-branch-0.23.1.patch, 
> MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, 
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch, 
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch, 
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, 
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, 
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
