[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13511039#comment-13511039
 ] 

Mariappan Asokan commented on MAPREDUCE-4842:
---------------------------------------------

Hi Jason, Arun, and Alejandro,
  I came up with a simpler solution to this nasty problem.  Instead of a 
single list {{inputs}} in {{MergeThread}}, we can keep a FIFO list of these 
lists.  This will make sure that more than one merge can be pending.  The 
{{run()}} method in {{MergeThread}} will keep pulling the map output lists 
off the FIFO list and merging them (this is a typical producer-consumer 
scenario).

I will outline the changes below:

In {{MergeThread}},

* A member of type {{LinkedList<List<T>>}} ({{pendingToBeMerged}}) is added 
and the member {{inputs}} is removed.

* The {{isInProgress()}} method is removed.

* The {{startMerge()}} method will no longer be {{synchronized}}.  It will add 
the passed list to the tail of {{pendingToBeMerged}} and then call 
{{notifyAll()}} on the monitor of {{pendingToBeMerged}}.

* The {{run()}} method will sit in a tight loop.  So long as there is an 
item (a list of map outputs) to be consumed, it will consume (merge) the item 
and remove it from {{pendingToBeMerged}}.  If {{pendingToBeMerged}} has no 
more items, it will set {{inProgress}} to {{false}} and then call 
{{notifyAll()}} on the object's monitor.
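
As a rough illustration of the producer-consumer arrangement outlined above 
(this is not the actual patch), a minimal Java sketch might look like the 
following.  The class name {{MergeThreadSketch}}, the {{mergedSnapshot()}} 
helper, and the use of plain strings as map outputs are all hypothetical.  It 
also assumes a pending counter ({{numPending}}) in place of the boolean 
{{inProgress}} flag, so that a list queued while another merge is running is 
still counted; the "merge" here merely concatenates the inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for MergeThread; "merging" just concatenates strings.
class MergeThreadSketch extends Thread {
    // FIFO list of pending input lists, guarded by its own monitor.
    private final LinkedList<List<String>> pendingToBeMerged = new LinkedList<>();
    // Assumption: a counter of queued-or-running merges instead of inProgress.
    private final AtomicInteger numPending = new AtomicInteger(0);
    // Result of all merges so far, guarded by this object's monitor.
    private final List<String> merged = new ArrayList<>();

    // Producer side: not synchronized on the thread object itself.
    public void startMerge(List<String> inputs) {
        numPending.incrementAndGet();
        synchronized (pendingToBeMerged) {
            pendingToBeMerged.addLast(new ArrayList<>(inputs));
            pendingToBeMerged.notifyAll(); // wake the consumer in run()
        }
    }

    // Blocks until every queued merge has completed.
    public synchronized void waitForMerge() throws InterruptedException {
        while (numPending.get() > 0) {
            wait();
        }
    }

    public synchronized List<String> mergedSnapshot() {
        return new ArrayList<>(merged);
    }

    private synchronized void merge(List<String> inputs) {
        merged.addAll(inputs);
    }

    // Consumer side: take lists off the head of the FIFO and merge them.
    @Override
    public void run() {
        try {
            while (true) {
                List<String> inputs;
                synchronized (pendingToBeMerged) {
                    while (pendingToBeMerged.isEmpty()) {
                        pendingToBeMerged.wait();
                    }
                    inputs = pendingToBeMerged.removeFirst();
                }
                merge(inputs);
                synchronized (this) {
                    numPending.decrementAndGet();
                    notifyAll(); // wake any caller blocked in waitForMerge()
                }
            }
        } catch (InterruptedException e) {
            // Interrupted while waiting for work: treat as shutdown.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MergeThreadSketch t = new MergeThreadSketch();
        t.setDaemon(true);
        t.start();
        t.startMerge(Arrays.asList("a", "b"));
        t.startMerge(Arrays.asList("c")); // second merge queued, not dropped
        t.waitForMerge();
        System.out.println(t.mergedSnapshot());
        t.interrupt();
    }
}
```

The point of the sketch is that {{startMerge()}} never has to check whether a 
merge is already running: a second call simply appends to the FIFO list, and 
the single consumer thread drains the list in order.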

In {{MergeManager}},

* All calls to {{isInProgress()}} are removed.

* Redundant {{synchronized}} blocks on merge thread objects are removed, 
since the methods that contain them are themselves {{synchronized}}.

I created a patch with the above changes and tested it on my laptop.  The 
mapreduce tests seem to run without any problem.  However, I do not claim that 
it is completely tested.  It has to go through the rigorous testing that Jason 
did.

If you are interested in taking a look at the patch, I will post it to this 
Jira.  I welcome your questions and suggestions on the idea of the patch.

-- Asokan

                
> Shuffle race can hang reducer
> -----------------------------
>
>                 Key: MAPREDUCE-4842
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4842
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: MAPREDUCE-4842.patch, MAPREDUCE-4842.patch, 
> MAPREDUCE-4842.patch, MAPREDUCE-4842.patch
>
>
> Saw an instance where the shuffle caused multiple reducers in a job to hang.  
> It looked similar to the problem described in MAPREDUCE-3721, where the 
> fetchers were all being told to WAIT by the MergeManager but no merge was 
> taking place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
