[
https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233697#comment-15233697
]
Wei Deng commented on SPARK-12511:
----------------------------------
I also ran into an OOM in the streaming driver while testing a very simple
pyspark streaming job (taking data from a direct Kafka stream) on Spark 1.6.0. I
haven't configured checkpointing yet. The driver would always crash after
running for 9+ hours, while nothing abnormal showed up on the executors. Once I
switched to Spark 1.6.1 (which should include the fix for this bug), my pyspark
streaming driver has been running for 14 hours now without any sign of a memory
leak or OOM.
[~zsxwing] Could you please confirm whether this bug can also impact a pyspark
streaming driver *without* checkpointing configured?
In case anybody is interested in the pyspark streaming code that triggered the
driver OOM under Spark 1.6.0, here it is:
https://github.com/avinashmandava/energyiot/blob/master/analytics/writemetrics.py
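For reference, the job has roughly the following shape (a minimal sketch,
assuming Spark 1.6-era APIs; the app name, broker list, topic, and processing
step are placeholders, not the actual values from writemetrics.py):
{noformat}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="writemetrics-sketch")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Direct (receiver-less) Kafka stream; broker list and topic are placeholders
stream = KafkaUtils.createDirectStream(
    ssc, ["metrics"], {"metadata.broker.list": "localhost:9092"})

# Note: no ssc.checkpoint(...) call -- checkpointing is not configured,
# matching the setup described above
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
{noformat}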
> streaming driver with checkpointing unable to finalize leading to OOM
> ---------------------------------------------------------------------
>
> Key: SPARK-12511
> URL: https://issues.apache.org/jira/browse/SPARK-12511
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Streaming
> Affects Versions: 1.5.2, 1.6.0
> Environment: pyspark 1.5.2
> yarn 2.6.0
> python 2.6
> centos 6.5
> openjdk 1.8.0
> Reporter: Antony Mayi
> Assignee: Shixiong Zhu
> Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
> Attachments: bug.py, finalizer-classes.png, finalizer-pending.png,
> finalizer-spark_assembly.png
>
>
> A Spark streaming application configured with checkpointing fills the
> driver's heap with ZipFileInputStream instances, apparently because
> spark-assembly.jar (and potentially others, for example snappy-java.jar)
> gets referenced (loaded?) repeatedly. The Java Finalizer cannot finalize
> these ZipFileInputStream instances, so they eventually consume the whole
> heap and crash the driver with an OOM.
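> A job of roughly this shape is enough to trigger it (a minimal sketch only;
> the actual script is the attached [^bug.py], and the checkpoint directory
> and socket endpoint here are placeholders):
> {noformat}
> from pyspark import SparkContext
> from pyspark.streaming import StreamingContext
> 
> sc = SparkContext(appName="bug-sketch")
> ssc = StreamingContext(sc, 1)  # 1-second batches
> ssc.checkpoint("/tmp/bug-checkpoint")  # enabling checkpointing triggers the leak
> 
> # Any input type will do; socketTextStream as in the attachment
> lines = ssc.socketTextStream("localhost", 9999)
> lines.count().pprint()
> 
> ssc.start()
> ssc.awaitTermination()
> {noformat}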
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Leave it running and monitor the driver java process heap
> ** with a heap dump you will primarily see growing counts of byte arrays
> (the accumulated zip payload of the jar references; the command used is
> shown after this list):
> {noformat}
> num #instances #bytes class name
> ----------------------------------------------
> 1: 32653 32735296 [B
> 2: 48000 5135816 [C
> 3: 41 1344144 [Lscala.concurrent.forkjoin.ForkJoinTask;
> 4: 11362 1261816 java.lang.Class
> 5: 47054 1129296 java.lang.String
> 6: 25460 1018400 java.lang.ref.Finalizer
> 7: 9802 789400 [Ljava.lang.Object;
> {noformat}
> ** with visualvm you can see:
> *** an increasing number of objects pending finalization
> !finalizer-pending.png!
> *** an increasing number of ZipFileInputStream instances related to
> spark-assembly.jar, referenced by the Finalizer
> !finalizer-spark_assembly.png!
> * Depending on heap size and running time, this will lead to a driver OOM
> crash
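> A histogram like the one above can be captured with standard JDK tooling,
> e.g. (the driver pid is a placeholder):
> {noformat}
> jmap -histo <driver-pid> | head -n 20
> {noformat}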
> h2. Comments
> * The [^bug.py] is a lightweight proof of the problem. In production I am
> experiencing a quite rapid effect - in a few hours it eats gigabytes of heap
> and kills the app.
> * If the same [^bug.py] is run without checkpointing, there is no issue
> whatsoever.
> * Not sure whether it is pyspark-specific.
> * In [^bug.py] I am using the socketTextStream input, but the problem seems
> to be independent of the input type (in production I see the same problem
> with the Kafka direct stream, and have seen it even with textFileStream).
> * It happens even if the input stream doesn't produce any data.