[
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2214:
----------------------------------
Attachment: TEZ-2214.3.patch
>>
It's possible for fetchers which already have an active list to keep going -
and get memory as it is released by the mergeThread - or just get memory
because some is available. Is this the situation which can cause the race ?
>>
Right, this is the case. While the merge is in progress, it releases memory,
which is then taken up by fetchers. By the time the existing merge completes,
commitMemory & usedMemory are already beyond the allowed threshold, and this
causes the issue.
>>
Question: This same block could just as well have been placed in the
waitForInMemoryMerge method ? Essentially, any place where it could be
triggered after a merge completes.
>>
Yes, it is possible to move the code block to waitForInMemoryMerge(). Addressed
it in the current patch (i.e. after inMemoryMerger.waitForMerge(), we double
check whether memory usage is beyond the thresholds. If so, we trigger one more
merge and block until it is done in order to release memory).
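
For illustration only, here is a minimal single-threaded sketch of that
double check. It is not the patch itself: the class MergeRecheckSketch, its
fields and its methods are hypothetical names, and the blocking wait on the
in-flight merge is left out. It only shows that re-checking the threshold after
the wait releases memory that would otherwise stay committed.

```java
// Simplified model of the re-check described above.
// All names are hypothetical; this is NOT the Tez MergeManager code.
class MergeRecheckSketch {
    private final long memoryLimit;    // total memory budget for fetched data
    private final long mergeThreshold; // committed level at which a merge is needed
    private long usedMemory;           // memory reserved by fetchers
    private long commitMemory;         // memory held by fully fetched segments
    private int mergesStarted;

    MergeRecheckSketch(long memoryLimit, long mergeThreshold) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    // Reserve and commit in one step for brevity.
    // Returns false when the fetcher would have to stall.
    boolean reserveAndCommit(long size) {
        if (usedMemory + size > memoryLimit) {
            return false;
        }
        usedMemory += size;
        commitMemory += size;
        return true;
    }

    // Stand-in for a blocking memory-to-disk merge: flushes every committed
    // segment and releases the memory it held.
    private void mergeAndWait() {
        mergesStarted++;
        usedMemory -= commitMemory;
        commitMemory = 0;
    }

    void waitForInMemoryMerge() {
        // Step 1 (not modelled here): wait for any merge already in progress.
        // Step 2: fetchers may have refilled memory while that merge ran, so
        // check the threshold again and run one more merge if needed.
        if (commitMemory >= mergeThreshold) {
            mergeAndWait();
        }
    }

    int getMergesStarted() {
        return mergesStarted;
    }

    public static void main(String[] args) {
        MergeRecheckSketch m = new MergeRecheckSketch(100, 60);
        m.reserveAndCommit(90);                     // memory refilled during a merge
        System.out.println(m.reserveAndCommit(20)); // false: fetcher would stall
        m.waitForInMemoryMerge();                   // re-check triggers another merge
        System.out.println(m.reserveAndCommit(20)); // true: memory was released
    }
}
```

Without the check in step 2, the second reserveAndCommit(20) call in this model
would keep failing, which corresponds to the stalled fetchers in the scenario.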
> FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses
> memToDiskMerging
> ------------------------------------------------------------------------------------------
>
> Key: TEZ-2214
> URL: https://issues.apache.org/jira/browse/TEZ-2214
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch, TEZ-2214.3.patch
>
>
> Scenario:
> - commitMemory & usedMemory are beyond their allowed threshold.
> - InMemoryMerge kicks off and is in the process of flushing memory contents
> to disk
> - As it progresses, it releases memory segments as well (but the merge is
> not yet complete).
> - Fetchers that need less memory than maxSingleShuffleLimit get scheduled.
> - If fetchers are fast, this quickly adds up to commitMemory & usedMemory.
> Since InMemoryMerge is already in progress, this wouldn't trigger another
> merge().
> - Pretty soon all fetchers would be stalled and get into the following state.
> {noformat}
> Thread 9351: (state = BLOCKED)
> - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
> imprecise)
> - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
> -
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
> @bci=17, line=337 (Interpreted frame)
> -
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
> @bci=34, line=157 (Interpreted frame)
> {noformat}
> - Even if InMemoryMerger completes, commitMemory & usedMemory are still
> beyond their thresholds, and no other fetcher threads (all are stalled) are
> left to release memory. This causes fetchers to wait indefinitely.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)