[ https://issues.apache.org/jira/browse/HADOOP-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209497#comment-15209497 ]

Jason Lowe commented on HADOOP-12958:
-------------------------------------

This OOM was hit by Tez during container reuse as it was transitioning between 
two tasks.  The previous task was using a DefaultSorter object that ran an 
instance of its SpillThread inner class as a utility thread.  That thread had 
used the filesystem, so the filesystem statistics code held a phantom 
reference to it.  The outer DefaultSorter object held a very large byte buffer 
(>50% of the heap size).  When the previous task completed, all references 
except the phantom reference were released, and when the new task initialized 
it tried to instantiate a new DefaultSorter with a similarly sized large 
buffer.
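
To make the reference chain concrete, here is a hypothetical sketch of the 
object graph (the class names mirror DefaultSorter/SpillThread, but the code 
is illustrative only, not the actual Tez source):

{code:java}
// Illustrative only: models the shape of the problem, not Tez itself.
class Sorter {
    static final int BUF_SIZE = 40 * 1024 * 1024;

    // The very large buffer (>50% of the heap in the failing case).
    final byte[] kvbuffer = new byte[BUF_SIZE];

    // As an inner class, every SpillThread instance holds an implicit
    // reference to its enclosing Sorter, and therefore to kvbuffer.
    class SpillThread extends Thread {
        @Override
        public void run() {
            // Using the filesystem here registers per-thread statistics,
            // which (in the affected versions) creates a PhantomReference
            // to this thread.  The resulting chain is:
            //   PhantomReference -> SpillThread -> Sorter -> kvbuffer
        }
    }
}
{code}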

The attempt to allocate another >50%-of-heap object triggered a full GC, since 
the old DefaultSorter object hadn't been collected yet.  Because the phantom 
reference still referred to the old SpillThread object (and therefore 
indirectly to the old DefaultSorter object), the garbage collector *could not 
reclaim those objects*; all it could do was enqueue the phantom reference on 
its reference queue for further processing by the application code.  Therefore 
after the full GC completed, >50% of the heap was still consumed by the old 
DefaultSorter object awaiting phantom reference cleanup, and the attempt to 
initialize a new DefaultSorter object of similar size failed with an OOM.
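
This failure mode can be reproduced in isolation with a minimal sketch like 
the one below (assuming a pre-Java 9 HotSpot JVM; since Java 9 the collector 
automatically clears phantom references as it enqueues them, so the pinning 
no longer occurs):

{code:java}
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

// Run with e.g. "java -Xmx64m PhantomPinDemo" on Java 8.  The second
// allocation fails even though the first buffer is otherwise garbage.
public class PhantomPinDemo {
    // Each buffer is >50% of the 64 MB heap.
    static final int BUF_SIZE = 40 * 1024 * 1024;

    static final ReferenceQueue<byte[]> QUEUE = new ReferenceQueue<>();
    static PhantomReference<byte[]> ref; // stays strongly reachable

    public static void main(String[] args) {
        byte[] first = new byte[BUF_SIZE];
        ref = new PhantomReference<>(first, QUEUE);
        first = null; // the buffer is now only phantom reachable

        // This triggers a full GC, which enqueues `ref` but (pre-Java 9)
        // cannot reclaim the first buffer until `ref` is cleared, so the
        // allocation fails with OutOfMemoryError: Java heap space.
        byte[] second = new byte[BUF_SIZE];
        System.out.println("allocated " + second.length + " bytes");
    }
}
{code}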

Normally the prior task's DefaultSorter would be collected in a single full GC 
cycle, since the Tez code makes sure it's no longer referenced.  However, the 
phantom reference in the filesystem statistics code causes some objects that 
would normally be collected in a single full GC cycle to survive that cycle, 
and that breaks any use case where an object is >50% of the heap and a 
similarly sized object will subsequently be allocated.  Tez container reuse 
does exactly that, shutting down one task just before it initializes another.

Debugging the OOMs caused by this is a bit tricky, since most of the 
OOM-triggered heap dumps showed plenty of memory available on the heap.  It 
appears that by the time the OOM dump was created, the phantom reference queue 
had already been processed by the StatisticsDataReferenceCleaner, so the 
objects in question were all unreachable in the dump.
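
For reference, the queue-draining side works roughly like the sketch below 
(illustrative, patterned after StatisticsDataReferenceCleaner; the real 
Hadoop code differs in detail).  The key point is that the cleaner runs 
asynchronously, after the GC that enqueued the reference, so the referent's 
memory is not available within that same collection cycle:

{code:java}
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Illustrative cleaner thread; not the actual Hadoop implementation.
public class CleanerSketch {
    static final ReferenceQueue<Thread> QUEUE = new ReferenceQueue<>();

    public static void main(String[] args) {
        Thread cleaner = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Blocks until the GC enqueues a phantom reference.
                    Reference<? extends Thread> ref = QUEUE.remove();
                    // ... release per-thread statistics data here ...
                    // Clearing the reference is what finally allows the
                    // referent's memory to be reclaimed on pre-Java 9 JVMs.
                    ref.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "StatisticsDataReferenceCleaner");
        cleaner.setDaemon(true);
        cleaner.start();
    }
}
{code}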

> PhantomReference for filesystem statistics can trigger OOM
> ----------------------------------------------------------
>
>                 Key: HADOOP-12958
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12958
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.3, 2.6.4
>            Reporter: Jason Lowe
>             Fix For: 2.7.3, 2.6.5
>
>
> I saw an OOM that appears to have been caused by the phantom references 
> introduced for file system statistics management.  I'll post details in a 
> followup comment.



