[
https://issues.apache.org/jira/browse/HADOOP-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209497#comment-15209497
]
Jason Lowe commented on HADOOP-12958:
-------------------------------------
This OOM was hit by Tez during container reuse as it transitioned between two
tasks. The previous task was using a DefaultSorter object that ran a
SpillThread inner class as a utility thread. That thread used the filesystem,
so the filesystem statistics code held a phantom reference to it. The outer
DefaultSorter object held a very large byte buffer (>50% of the heap size).
When the previous task completed, all references except the phantom reference
were released, and when the new task initialized it tried to instantiate a
new DefaultSorter with a similarly-sized large buffer.
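To make the reference chain concrete, here is a minimal sketch with
hypothetical, simplified names (the real classes are Tez's DefaultSorter and
its SpillThread, and the phantom reference is registered by the filesystem
statistics code, not by anything in this sketch):
{code:java}
// Hypothetical stand-ins for DefaultSorter/SpillThread.
class Sorter {
    final byte[] buffer;                 // >50% of the heap in the real case

    Sorter(int size) { buffer = new byte[size]; }

    // Non-static inner class: every Spiller carries an implicit Sorter.this
    // reference, so anything that keeps the thread reachable (here, the
    // phantom reference held by the filesystem statistics code) also pins
    // the huge buffer.
    class Spiller extends Thread {
        @Override
        public void run() {
            // touches the FileSystem; the statistics code registers a
            // PhantomReference keyed on this thread for later cleanup
        }
    }
}
{code}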
The attempt to allocate another >50%-of-heap object triggered a full GC,
since the old DefaultSorter object hadn't been collected yet. Because the
phantom reference still referred to the old SpillThread object (and therefore
indirectly to the old DefaultSorter object), the garbage collector enqueued
the phantom reference *but could not reclaim the referent's memory*. All it
could do was place the reference on the designated queue for further
processing by application code. Therefore, after the full GC completed, the
old DefaultSorter object still held >50% of the heap while awaiting final
phantom-reference cleanup, and the attempt to initialize a new DefaultSorter
object of similar size resulted in an OOM.
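This enqueue-without-reclaim behavior is standard PhantomReference semantics
on the JVMs in question: unlike soft and weak references, a phantom reference
is not automatically cleared when enqueued (Java 9 later changed this), so
the referent's memory stays unreclaimable until the reference is cleared or
itself becomes unreachable. A minimal demo, run with a small heap such as
-Xmx256m on a pre-9 JDK:
{code:java}
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

public class PhantomPinDemo {
    public static void main(String[] args) {
        // roughly 60% of the heap, mirroring the oversized sort buffer
        int big = (int) (Runtime.getRuntime().maxMemory() / 10 * 6);

        ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
        byte[] buffer = new byte[big];
        PhantomReference<byte[]> ref = new PhantomReference<>(buffer, queue);

        buffer = null;  // task completes; the only strong reference is dropped
        System.gc();    // ref is enqueued, but on Java 8 and earlier the
                        // referent's memory is not freed until ref is cleared
                        // or ref itself becomes unreachable

        // Mirrors the new DefaultSorter allocation: this can OOM even though
        // the old buffer is unreachable from application code.
        byte[] next = new byte[big];
        System.out.println("second allocation succeeded: " + next.length);

        ref.clear();    // what queue processing eventually does (too late here)
    }
}
{code}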
Normally the prior task's DefaultSorter would be collected in a single full
GC cycle, since the Tez code makes sure it is no longer referenced. However,
the phantom reference in the filesystem statistics code causes some objects
that would normally be collected in a single full GC cycle to survive that
cycle, and that breaks any use case where an object occupies >50% of the heap
and a similarly-sized object is subsequently allocated. Tez container reuse
does exactly that, shutting down one task just before initializing another.
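For contrast, a weak reference would not exhibit this extra-cycle survival,
because the GC clears a WeakReference at the same time it determines the
referent is only weakly reachable, making the memory reclaimable in the same
cycle. The sketch below shows that general technique only; it is not a claim
about how this issue was ultimately fixed:
{code:java}
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class WeakContrastDemo {
    public static void main(String[] args) {
        int big = (int) (Runtime.getRuntime().maxMemory() / 10 * 6);

        ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
        byte[] buffer = new byte[big];
        WeakReference<byte[]> ref = new WeakReference<>(buffer, queue);

        buffer = null;
        System.gc();    // ref is cleared *before* being enqueued, so the old
                        // buffer is reclaimable in this same collection cycle

        byte[] next = new byte[big];   // succeeds; no phantom pin this time
        System.out.println("second allocation succeeded: " + next.length
                + ", ref.get() == " + ref.get());
    }
}
{code}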
Debugging the OOMs caused by this is a bit tricky, since most of the
OOM-triggered heap dumps showed plenty of free memory. It appears that by the
time the OOM dump was created, the phantom reference queue had already been
handled by the StatisticsDataReferenceCleaner, so the objects in question
appeared unreachable (and their memory free) in the dump.
> PhantomReference for filesystem statistics can trigger OOM
> ----------------------------------------------------------
>
> Key: HADOOP-12958
> URL: https://issues.apache.org/jira/browse/HADOOP-12958
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.3, 2.6.4
> Reporter: Jason Lowe
> Fix For: 2.7.3, 2.6.5
>
>
> I saw an OOM that appears to have been caused by the phantom references
> introduced for file system statistics management. I'll post details in a
> followup comment.