[jira] [Commented] (MAPREDUCE-3685) There are some bugs in implementation of MergeManager

Mariappan Asokan (JIRA) Fri, 08 Mar 2013 21:23:20 -0800

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597848#comment-13597848 ]


Mariappan Asokan commented on MAPREDUCE-3685:
---------------------------------------------

Hi Ravi,
  You are absolutely right that we want to make sure that in the final merge 
the number of streams to merge is less than or equal to {{ioSortFactor}}.  I 
stand corrected on that.  So I should have stated that the change is:
{code}
    if (onDiskMapOutputs.size() > ioSortFactor) {
      onDiskMerger.startMerge(onDiskMapOutputs);
    }
{code}
Note that I changed ">=" to ">" from my previous suggestion.  The rationale for 
starting the merge early (instead of waiting until 2*{{ioSortFactor}} - 1 disk 
files are created) is to overlap more of the merge work with the fetches.  The 
last merge will then have at most {{ioSortFactor}} disk files to merge.

For example, suppose {{io.sort.factor}} is set to 100 and there are 198 disk 
files.  With the code in your patch, all 198 files will be merged in the final 
merge in two sequential merge passes, one with 100 streams and the other with 
99 (the 98 remaining files plus the output of the first pass).  With my 
suggestion, the first 100 would have been merged while the fetches and 
in-memory merges were still running.  The last merge will merge only 99 files.
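
To make the arithmetic in the example concrete, here is a rough sketch that 
only counts files.  It is not the MergeManager code: the class and method names 
are made up for illustration, and it assumes each early merge takes exactly 
{{ioSortFactor}} files, produces one output file, and finishes before the next 
file arrives.

```java
// Hypothetical counting model, not Hadoop code.
class EarlyMergeSketch {

    // Returns how many on-disk files are left for the final merge when an
    // early merge is triggered as soon as the count exceeds ioSortFactor.
    static int finalMergeWidth(int totalFiles, int ioSortFactor) {
        int onDisk = 0;
        for (int i = 0; i < totalFiles; i++) {
            onDisk++;                      // one more map output spilled to disk
            if (onDisk > ioSortFactor) {   // same test as in the suggested change
                // Assumption: ioSortFactor files are replaced by one merged file.
                onDisk = onDisk - ioSortFactor + 1;
            }
        }
        return onDisk;
    }

    public static void main(String[] args) {
        // 198 files with a factor of 100: prints 99
        System.out.println(finalMergeWidth(198, 100));
    }
}
```

In this model the early merge fires once, when the 101st file arrives, and the 
final merge never sees more than {{ioSortFactor}} files.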

By partial ordering, I did not mean unsorted order.  I meant that duplicates 
will be retained.  Perhaps I confused you with a mathematical term.  Simply 
put, in a partially ordered set, all elements are related by "<=".  When you 
compare two elements in the set, you want to make sure that the smaller element 
is collated first.  If one of them is greater than or equal to the other, it is 
collated second.  The Java {{TreeSet}} implementation will keep the elements in 
sorted order with this {{Comparable}} implementation.  BTW, what you have in 
the patch will work.  What I suggested has less code with no unnecessary 
comparisons.  For brevity, you can even code it as:
{code}
return (this.getCompressedSize() < compPath.getCompressedSize()) ? -1 : 1;
{code}
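
As a small standalone illustration of the point about duplicates, the sketch 
below puts two entries of equal size into a {{TreeSet}}.  The {{SizedPath}} 
class and its fields are hypothetical stand-ins, not the class from the patch; 
only the shape of {{compareTo}} is taken from the one-liner above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a path that carries a compressed size.
class SizedPath implements Comparable<SizedPath> {
    final String name;
    final long compressedSize;

    SizedPath(String name, long compressedSize) {
        this.name = name;
        this.compressedSize = compressedSize;
    }

    @Override
    public int compareTo(SizedPath other) {
        // Never returns 0, so entries of equal size are kept as separate elements.
        return (this.compressedSize < other.compressedSize) ? -1 : 1;
    }
}

class TreeSetDuplicatesSketch {

    // Inserts four entries, two of them with the same size, and returns
    // the names in the set's iteration order.
    static List<String> sortedNames() {
        TreeSet<SizedPath> set = new TreeSet<>();
        set.add(new SizedPath("a", 30));
        set.add(new SizedPath("b", 10));
        set.add(new SizedPath("c", 30)); // same size as "a", still retained
        set.add(new SizedPath("d", 20));

        List<String> names = new ArrayList<>();
        for (SizedPath p : set) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [b, d, a, c]: ascending by size, both 30-byte entries kept
        System.out.println(sortedNames());
    }
}
```

One caveat with this approach: because {{compareTo}} never returns 0, it does 
not follow the usual {{Comparable}} contract, so {{contains()}} and 
{{remove()}} on such a set will not find an element by comparison.  That is 
worth keeping in mind if the set is used for anything beyond ordered iteration.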

IMHO, any performance enhancement should not result in performance regression 
for any use case.  I just wanted to make sure that your patch does not cause 
any performance regression in some cases.  If you have some setup to run 
performance tests, please go ahead.  Otherwise, just ignore my suggestion.

-- Asokan

                
> There are some bugs in implementation of MergeManager
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-3685
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3685
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.1
>            Reporter: anty.rao
>            Assignee: anty
>            Priority: Critical
>             Fix For: 0.23.7, 2.0.5-beta
>
>         Attachments: MAPREDUCE-3685-branch-0.23.1.patch, 
> MAPREDUCE-3685-branch-0.23.1.patch, MAPREDUCE-3685-branch-0.23.1.patch, 
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch, 
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.branch-0.23.patch, 
> MAPREDUCE-3685.branch-0.23.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, 
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, MAPREDUCE-3685.patch, 
> MAPREDUCE-3685.patch, MAPREDUCE-3685.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
