[
https://issues.apache.org/jira/browse/TEZ-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143321#comment-15143321
]
Jason Lowe commented on TEZ-3114:
---------------------------------
bq. TaskAttemptId are shared.
If task attempt IDs are shared then that doesn't explain how the heap dump had
900+ copies of a bunch of separate attempt IDs. Spot checking a few of the 976
copies of one of them, they were all the pathComponent of various
InputAttemptIdentifier entries in the pathToIdentifierMap in the
ShuffleScheduler. There are simply a _ton_ of strings flying around in the
shuffle phase of a task, which is why I filed TEZ-3115. Without proper flow
control we can't sustain these things indefinitely, but if we had a lot more
headroom by cleaning up all these strings then it would still buy us a lot of
time.
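To illustrate the kind of string cleanup meant here, below is a minimal sketch of canonicalizing equal strings so that many map entries share one instance. It is not a patch against Tez: the PathComponentPool class and its methods are hypothetical names, and where such a pool would be wired into the shuffle code is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Tez: returns one shared instance
// for every group of equal strings passed through it.
final class PathComponentPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the first instance seen that equals s, so later duplicates
    // become garbage instead of being retained by long-lived maps.
    String canonical(String s) {
        String previous = pool.putIfAbsent(s, s);
        return previous != null ? previous : s;
    }

    // Number of distinct strings currently held by the pool.
    int size() {
        return pool.size();
    }

    // The pool itself pins its strings, so it should be dropped
    // when the shuffle is finished.
    void clear() {
        pool.clear();
    }
}
```

With something like this, 976 equal copies of one attempt ID string would collapse to a single retained instance, at the cost of one map lookup per string.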
bq. That said, EventMetaData instances should be candidates for GC once
LogicalIOProcessorRuntimeTask is done with them - and forwarded them over to
the actual Input/Output/Processor. From the heap dump, do you know what is
holding on to these instances ?
The problem is the LogicalIOProcessorRuntimeTask is _not_ done with them. The
LinkedBlockingQueue for the LogicalIOProcessorRuntimeTask had 1712923 entries
in it. I think we were simply adding entries to the queue much faster than
they were being processed. The handler thread for that queue was blocked in
ShuffleScheduler#addKnownMapOutput, so maybe there was a lot of lock contention
on the ShuffleScheduler lock? Shuffles were progressing and the thread that
hit the OOM was actually a fetcher thread, so I don't think we were deadlocked
on that lock.
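For reference, the difference between an unbounded queue and one with flow control can be sketched as follows. This is only an illustration under assumptions: BoundedEventQueue is a hypothetical name, not the Tez event handling code, and what a producer should do when the queue is full (for example, stop requesting more events for a while) is left open.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical wrapper, not part of Tez: a fixed-capacity event queue
// that reports "full" to the producer instead of growing without limit.
final class BoundedEventQueue<E> {
    private final BlockingQueue<E> queue;

    BoundedEventQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking add: returns false when the queue is at capacity,
    // which is the signal for the producer to back off.
    boolean tryAdd(E event) {
        return queue.offer(event);
    }

    // Non-blocking remove: returns null when the queue is empty.
    E poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

A default-constructed LinkedBlockingQueue has a capacity of Integer.MAX_VALUE, so its offer() effectively never fails and the heap becomes the only limit. A fixed capacity moves the pressure back onto whoever is producing the events.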
> Shuffle OOM due to EventMetaData flood
> --------------------------------------
>
> Key: TEZ-3114
> URL: https://issues.apache.org/jira/browse/TEZ-3114
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
>
> A task encountered an OOM during shuffle, and investigation of the heap dump
> showed a lot of memory being consumed by almost 3.5 million EventMetaData
> objects. Auto-parallelism had reduced the number of tasks in the vertex to 1
> and there were 2000 upstream tasks to shuffle.