Rajesh Balamohan created TEZ-2214:
-------------------------------------

             Summary: FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
                 Key: TEZ-2214
                 URL: https://issues.apache.org/jira/browse/TEZ-2214
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan


Scenario:
- commitMemory & usedMemory are beyond their allowed threshold.
- InMemoryMerge kicks off and starts flushing memory contents to disk.
- As it progresses, it releases memory segments, although the merge itself has not yet finished.
- Fetchers that need less memory than maxSingleShuffleLimit get scheduled.
- If the fetchers are fast, their allocations quickly add up in commitMemory & usedMemory. Since an InMemoryMerge is already in progress, this does not trigger another merge().
- Pretty soon all fetchers stall and end up in the following state:

{noformat}
Thread 9351: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame)
 - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame)
{noformat}

- Even after InMemoryMerger completes, commitMemory & usedMemory remain beyond their thresholds, and since all fetcher threads are stalled, no running fetcher is left to release memory. This causes the fetchers to wait indefinitely.
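
The accounting in the scenario above can be sketched with a toy, single-threaded model. This is not the Tez code: the class MergeStallSketch, its methods, and the limits (100 for memory, 60 for the merge threshold, 40 per segment) are all hypothetical and chosen only to keep the arithmetic small. It assumes that a merge can only be started from the commit path and that a commit arriving while a merge is running does not queue a follow-up merge.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy, single-threaded model of the scenario. Hypothetical names; not Tez code. */
class MergeStallSketch {
    // Hypothetical limits, chosen only for illustration.
    static final long MEMORY_LIMIT = 100;
    static final long MERGE_THRESHOLD = 60;

    long usedMemory;
    long commitMemory;
    boolean mergeInProgress;
    final Deque<Long> committed = new ArrayDeque<>(); // segments waiting for a merge
    final Deque<Long> merging = new ArrayDeque<>();   // segments owned by the running merge

    /** A fetcher asks for memory; false means it would have to wait. */
    boolean reserve(long size) {
        if (usedMemory > MEMORY_LIMIT) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** A fetcher finished copying. Assumption: this is the only place a merge starts. */
    void commit(long size) {
        commitMemory += size;
        committed.add(size);
        if (commitMemory >= MERGE_THRESHOLD && !mergeInProgress) {
            mergeInProgress = true;
            merging.addAll(committed);
            committed.clear();
        }
        // Otherwise the threshold was crossed during a merge and the request is lost.
    }

    /** Reserve and commit in one step; returns false if the fetcher stalls. */
    boolean fetch(long size) {
        if (!reserve(size)) {
            return false;
        }
        commit(size);
        return true;
    }

    /** The running merge releases one segment before it has finished. */
    void releaseNextMergedSegment() {
        long size = merging.removeFirst();
        usedMemory -= size;
        commitMemory -= size;
    }

    /** The merge completes, releasing whatever is left of its own segments. */
    void finishMerge() {
        while (!merging.isEmpty()) {
            releaseNextMergedSegment();
        }
        mergeInProgress = false;
    }

    /** Over the limit with no merge running: nothing is left to free memory. */
    boolean stuck() {
        return usedMemory > MEMORY_LIMIT && !mergeInProgress;
    }

    static MergeStallSketch runScenario() {
        MergeStallSketch m = new MergeStallSketch();
        m.fetch(40);
        m.fetch(40);                  // commitMemory = 80, merge starts
        m.releaseNextMergedSegment(); // usedMemory drops to 40, merge still running
        m.fetch(40);
        m.fetch(40);                  // usedMemory = 120, merge request dropped
        m.releaseNextMergedSegment(); // usedMemory drops to 80
        m.fetch(40);                  // usedMemory = 120 again
        m.finishMerge();              // nothing left for the merge to release
        return m;
    }

    public static void main(String[] args) {
        MergeStallSketch m = runScenario();
        System.out.println("usedMemory=" + m.usedMemory
                + " commitMemory=" + m.commitMemory + " stuck=" + m.stuck());
    }
}
```

In this model the run ends with usedMemory and commitMemory both at 120, no merge in progress, and every further reserve() refused, which mirrors the indefinite wait in the stack trace above. One possible direction, under the same assumptions, would be to re-check the merge threshold when a merge finishes rather than only on commit.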



