[
https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511039#comment-13511039
]
Mariappan Asokan commented on MAPREDUCE-4842:
---------------------------------------------
Hi Jason, Arun, and Alejandro,
I came up with a simpler solution to this nasty problem. Instead of a
single list {{inputs}} in {{MergeThread}}, we can keep a FIFO list of these
lists. This makes sure that more than one merge can be pending. The
{{run()}} method in {{MergeThread}} will keep pulling map output lists out of
the FIFO list and merging them (this is a typical producer-consumer scenario).
I will outline the changes below:
In {{MergeThread}},
* A member {{pendingToBeMerged}} of type {{LinkedList<List<T>>}} is added and
the member {{inputs}} is removed.
* The {{isInProgress()}} method is removed.
* The {{startMerge()}} method will no longer be {{synchronized}}. It will add
the passed list to the tail of {{pendingToBeMerged}} and then call
{{notifyAll()}} on the monitor of {{pendingToBeMerged}}.
* The {{run()}} method will sit in a loop. So long as there is an item (a list
of map outputs) to be consumed, it will consume (merge) the item and remove it
from {{pendingToBeMerged}}. If {{pendingToBeMerged}} has no more items, it will
set {{inProgress}} to {{false}} and then call {{notifyAll()}} on the object's
monitor.
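To make the producer-consumer idea above concrete, here is a minimal sketch. It is not the patch and not the actual Hadoop class: the class name {{MergeThreadSketch}}, the {{Consumer}} callback standing in for the real merge, and {{waitForMerge()}} are assumptions for illustration. For simplicity the sketch also guards both the queue and {{inProgress}} with the single monitor of {{pendingToBeMerged}}, which differs from the two-monitor description above.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for MergeThread; illustrates the FIFO of pending merges only.
class MergeThreadSketch<T> extends Thread {
  // FIFO list of map output lists waiting to be merged; also used as the monitor.
  private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<>();
  // Assumption: the real merge logic is replaced by a caller-supplied callback.
  private final Consumer<List<T>> merger;
  // Guarded by pendingToBeMerged in this sketch.
  private boolean inProgress = false;

  MergeThreadSketch(Consumer<List<T>> merger) {
    this.merger = merger;
  }

  // Producer side: queue another merge without blocking on a running one.
  public void startMerge(List<T> inputs) {
    synchronized (pendingToBeMerged) {
      inProgress = true;
      pendingToBeMerged.addLast(new ArrayList<>(inputs));
      pendingToBeMerged.notifyAll();
    }
  }

  // Blocks until every queued merge has completed.
  public void waitForMerge() throws InterruptedException {
    synchronized (pendingToBeMerged) {
      while (inProgress) {
        pendingToBeMerged.wait();
      }
    }
  }

  // Consumer side: merge queued lists in FIFO order until interrupted.
  @Override
  public void run() {
    try {
      while (true) {
        List<T> inputs;
        synchronized (pendingToBeMerged) {
          while (pendingToBeMerged.isEmpty()) {
            pendingToBeMerged.wait();
          }
          inputs = pendingToBeMerged.getFirst();
        }
        // Merge outside the lock so producers can keep queuing work.
        merger.accept(inputs);
        synchronized (pendingToBeMerged) {
          pendingToBeMerged.removeFirst();
          if (pendingToBeMerged.isEmpty()) {
            inProgress = false;
            pendingToBeMerged.notifyAll();
          }
        }
      }
    } catch (InterruptedException e) {
      // Treat interruption as a shutdown request.
      Thread.currentThread().interrupt();
    }
  }
}
```

Because an item is removed from the queue only after it has been merged, a caller of the hypothetical {{waitForMerge()}} cannot observe an empty queue while a merge is still running.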
In {{MergeManager}},
* All calls to {{isInProgress()}} are removed.
* Unnecessary {{synchronized}} blocks on merge thread objects are removed,
since the methods that contain them are themselves {{synchronized}}.
I created a patch with the above changes and tested it on my laptop. The
mapreduce tests seem to run without any problem. However, I do not claim that
it is completely tested. It has to go through the rigorous testing that Jason
did.
If you are interested in taking a look at the patch, I will post it to this
Jira. I welcome your questions and suggestions on the idea of the patch.
-- Asokan
> Shuffle race can hang reducer
> -----------------------------
>
> Key: MAPREDUCE-4842
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 2.0.2-alpha, 0.23.5
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Blocker
> Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch,
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.
> It looked similar to the problem described in MAPREDUCE-3721, where the
> fetchers were all being told to WAIT by the MergeManager but no merge was
> taking place.