[
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597694#comment-13597694
]
Ravi Prakash commented on MAPREDUCE-3685:
-----------------------------------------
Hi Mariappan!
Thanks a lot for your review and comments. I'm sure this is not the last JIRA
to go into MergeManagerImpl. :-) If we spot something, let's definitely open a
new one.
* Here's my understanding of how the merge works. Let's assume all the map
outputs are on disk, and that io.sort.factor is set to X. Since we don't want
to do any more merges than necessary, we try to ensure that the *final* merge
has X streams to merge. This means that as the reducer fetches map outputs, we
wait until there are at least 2X-1 of them on disk (we don't know how many map
outputs we will really get, because some maps may not have produced any
output). Once the number reaches 2X-1, we can be sure that we need an
intermediate merge of X streams. This leaves X-1 in onDiskMapOutputs. The X
streams are merged into 1, so after this merge we have (X-1) + 1 = X streams.
When the number of streams is > X and < 2X-1, we let the code go to finalMerge,
which eventually calls
{code:title=Merger.java:645|borderStyle=solid}
if (numSegments <= factor) {
  // No extra merge needed
} else {
  // Do a merge of (number of map outputs) - X
}
{code} So from my understanding, 2X-1 is the correct number. Please let me
know if you still think it's not.
* Hmmm.. I didn't know you could give a TreeSet a partial ordering and still
get sorted output. The latest javadocs don't say anything about it, but I found
a StackOverflow answer saying it used to be the case in JDK 1.2. Do you know if
that is still true?
* Unfortunately I haven't run any performance tests. Hopefully we will get the
fix onto our clusters soon, and if we see incredible improvements in
performance, I'll try to remember to report back here. This should probably
help on large jobs with a lot of maps, and we do have quite a few of those :-)
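
To make the 2X-1 arithmetic from the first point concrete, here is a small toy
model of it. It is only a sketch of the counting argument, not the
MergeManagerImpl code: the class and method names are made up for
illustration, map outputs are assumed to arrive one at a time, the merge is
assumed to finish before the next output arrives, and the threshold check is
assumed to be "at least 2X-1".

```java
// Toy model of the on-disk merge counting argument (hypothetical names,
// not the actual Hadoop implementation).
class MergeThresholdSketch {

    /**
     * Simulates map outputs landing on disk one at a time and returns how
     * many on-disk streams are left for the final merge.
     *
     * @param factor     X, the assumed value of io.sort.factor
     * @param mapOutputs number of non-empty map outputs that arrive
     */
    static int streamsLeftForFinalMerge(int factor, int mapOutputs) {
        int onDisk = 0;
        for (int i = 0; i < mapOutputs; i++) {
            onDisk++; // one more map output has been written to disk
            if (onDisk >= 2 * factor - 1) {
                // Intermediate merge: X streams become 1,
                // so (2X-1) - X + 1 = X streams remain.
                onDisk = onDisk - factor + 1;
            }
        }
        return onDisk;
    }

    public static void main(String[] args) {
        int x = 10; // 2X-1 = 19
        for (int n : new int[] {5, 19, 25, 28}) {
            System.out.println(n + " map outputs -> "
                + streamsLeftForFinalMerge(x, n) + " streams for the final merge");
        }
    }
}
```

With X = 10 in this model, 19 outputs leave exactly 10 streams, while 25
outputs leave 16, which is the "> X and < 2X-1" case where the final merge has
to do an extra pass first.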
> There are some bugs in implementation of MergeManager
> -----------------------------------------------------
>
> Key: MAPREDUCE-3685
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.1
> Reporter: anty.rao
> Assignee: anty
> Priority: Critical
> Fix For: 0.23.7, 2.0.5-beta
>
> Attachments: MAPREDUCE-3685-branch-0.23.1.patch,
> MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch,
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch,
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch,
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch
>
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira