[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376763#comment-14376763 ]
Rajesh Balamohan edited comment on TEZ-2214 at 3/23/15 10:21 PM:
-----------------------------------------------------------------

[~hitesh] - In such cases, the next line "inMemoryMerger.waitForMerge()" acts as the barrier. It waits until the in-progress merge completes, and completing that merge internally releases the memory accounted in usedMemory & commitMemory.

> FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
> ------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2214
>                 URL: https://issues.apache.org/jira/browse/TEZ-2214
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2214.1.patch
>
>
> Scenario:
> - commitMemory & usedMemory exceed their allowed thresholds.
> - InMemoryMerge kicks off and is in the process of flushing memory contents to disk.
> - As it progresses, it also releases memory segments (but the merge has not finished yet).
> - Fetchers that need less memory than maxSingleShuffleLimit get scheduled.
> - If the fetchers are fast, they quickly add to commitMemory & usedMemory. Since InMemoryMerge is already in progress, this does not trigger another merge().
> - Soon all fetchers stall and end up in the following state:
> {noformat}
> Thread 9351: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
>  - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame)
>  - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame)
> {noformat}
> - Even after InMemoryMerger completes, commitMemory & usedMemory are still beyond their thresholds, and no fetcher threads are left to release memory (all of them are stalled). As a result, the fetchers wait indefinitely.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
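The scenario in the description can be reduced to a small accounting model. The Java sketch below is a hypothetical, single-threaded simplification: ShuffleMemoryModel, tryFetch, mergeProgress, recheckAfterMerge and the 100/80/40-byte figures are invented for illustration and are not the Tez MergeManager API, and the model folds memory reservation and commit into one step. It shows how, when a merge is only ever triggered from the fetch path, memory committed while a merge is running is left unmerged once that merge finishes and every fetcher is waiting. Setting recheckAfterMerge re-evaluates the merge trigger when a merge completes, which is one illustrative way to close the gap; it is not a reproduction of TEZ-2214.1.patch.

```java
/**
 * Hypothetical, single-threaded model of the shuffle memory accounting described in TEZ-2214.
 * Class, field, and method names are illustrative; this is NOT the Tez MergeManager API.
 */
public class ShuffleMemoryModel {
    private final long memoryLimit;          // fetchers wait while usedMemory exceeds this
    private final long mergeThreshold;       // commitMemory level that triggers an in-memory merge
    private final boolean recheckAfterMerge; // true = re-evaluate the merge trigger when a merge ends

    long usedMemory;           // memory held by fetched map outputs (reserve + commit folded together)
    long commitMemory;         // fetched outputs that are complete and eligible for merging
    boolean mergeInProgress;
    int mergesStarted;
    private long mergingBytes; // memory still owned by the running merge

    ShuffleMemoryModel(long memoryLimit, long mergeThreshold, boolean recheckAfterMerge) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
        this.recheckAfterMerge = recheckAfterMerge;
    }

    /** A fetcher copies an output of `size` bytes into memory; returns false if it must wait. */
    boolean tryFetch(long size) {
        if (usedMemory > memoryLimit) {
            return false; // stalled, analogous to a fetcher blocked in waitForShuffleToMergeMemory()
        }
        usedMemory += size;
        commitMemory += size;
        maybeStartMerge(); // in the unfixed model this is the only place a merge can start
        return true;
    }

    /** Starts a merge only when none is running, so commits made during a merge never start one. */
    private void maybeStartMerge() {
        if (!mergeInProgress && commitMemory >= mergeThreshold) {
            mergeInProgress = true;
            mergingBytes = commitMemory;
            mergesStarted++;
        }
    }

    /** The running merge spills `bytes` to disk and releases that memory; it ends at zero. */
    void mergeProgress(long bytes) {
        if (!mergeInProgress) {
            throw new IllegalStateException("no merge is running");
        }
        long released = Math.min(bytes, mergingBytes);
        mergingBytes -= released;
        usedMemory -= released;
        commitMemory -= released;
        if (mergingBytes == 0) {
            mergeInProgress = false;
            if (recheckAfterMerge) {
                maybeStartMerge(); // pick up memory committed while the previous merge ran
            }
        }
    }

    /** Replays the scenario from the issue description and returns the resulting state. */
    static ShuffleMemoryModel replay(boolean recheckAfterMerge) {
        ShuffleMemoryModel m = new ShuffleMemoryModel(100, 80, recheckAfterMerge);
        m.tryFetch(40);
        m.tryFetch(40);      // commitMemory reaches 80: the first merge starts
        m.mergeProgress(60); // the merge releases part of its memory as it progresses
        m.tryFetch(40);      // fast fetchers refill memory; no new merge is triggered
        m.tryFetch(40);
        m.tryFetch(40);      // usedMemory is now 140, above the 100-byte limit
        m.mergeProgress(20); // the first merge completes, leaving 120 bytes committed
        return m;
    }

    public static void main(String[] args) {
        for (boolean recheck : new boolean[] {false, true}) {
            ShuffleMemoryModel m = replay(recheck);
            System.out.printf("recheck=%b used=%d commit=%d mergeInProgress=%b merges=%d canFetch=%b%n",
                    recheck, m.usedMemory, m.commitMemory, m.mergeInProgress, m.mergesStarted,
                    m.tryFetch(40));
        }
    }
}
```

Without the re-check, the replay ends with no merge running, commitMemory (120) above the merge threshold and usedMemory above the limit, so tryFetch keeps returning false and nothing in the model can ever free memory: the indefinite wait from the issue. With the re-check, a second merge is already running at that point, and once it has spilled its 120 bytes the fetchers can proceed again.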