[
https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597848#comment-13597848
]
Mariappan Asokan commented on MAPREDUCE-3685:
---------------------------------------------
Hi Ravi,
You are absolutely right that we want to make sure that in the final merge
the number of streams to merge is less than or equal to {{ioSortFactor.}} I
stand corrected on that. So I should have stated that the change is:
{code}
if (onDiskMapOutputs.size() > ioSortFactor) {
onDiskMerger.startMerge(onDiskMapOutputs);
}
{code}
Note that I changed ">=" to ">" from my previous suggestion. The rationale for
starting the merge early(instead of waiting until 2*{{ioSortFactor}} - 1 disk
files are created) is to leverage additional overlapped processing. The last
merge will end up with merging less than or equal to {{ioSortFactor}} disk
files.
For example, suppose {{io.sort.factor}} is set to 100 and there are 198 disk
files. With the code in your patch, all 198 files will be merged in the final
merge with two sequential merge passes one with 100 and the other with 99.
With my suggestion, the first 100 would have been merged overlapped with the
fetches and in-memory merges. The last merge will merge only 99 files.
By partial ordering, I did not mean unsorted order. I meant that duplicates
will be retained. Perhaps, I confused you with a mathematical term. Simply
put, in a partially ordered set, all elements
are related by "<=". When you compare two elements in the set, you want to
make sure that the smaller element is collated first. If one of them is
greater than or equal to the other, it is collated second. The Java
{{TreeSet}} implementation will keep the elements in sorted order with this
{{Comparable}} implementation. BTW, what you have in the patch will work.
What I suggested has less code with no unnecessary comparisons. For brevity,
you can even code it as:
{code}
return((this.getCompressedSize() < compPath.getCompressedSize()) ? -1 : 1);
{code}
IMHO, any performance enhancement should not result in performance regression
for any use case. I just wanted to make sure that your patch does not cause
any performance regression in some cases. If you have some setup to run
performance tests, please go ahead. Otherwise, just ignore my suggestion.
-- Asokan
> There are some bugs in implementation of MergeManager
> -----------------------------------------------------
>
> Key: MAPREDUCE-3685
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.1
> Reporter: anty.rao
> Assignee: anty
> Priority: Critical
> Fix For: 0.23.7, 2.0.5-beta
>
> Attachments: MAPREDUCE-3685-branch-0.23.1.patch,
> MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch,
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch,
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch,
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch,
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch
>
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira