Hi,

I have a dataset of ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If they do, the entire column is set to missing (null) for that group.
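
To make this concrete, the per-column logic is roughly the following. This is a simplified PySpark sketch, not the exact code (that is in the Gist [0] below); df, the column names and the threshold are placeholders:

    from pyspark.sql import functions as F

    # df: DataFrame with groupID, rowID and the data columns (loaded elsewhere)
    THRESHOLD = 0.5                      # maximum tolerated fraction of nulls
    cols_to_check = ["col_a", "col_b"]   # placeholder for my ~20 columns

    for c in cols_to_check:
        # per group: fraction of null values in column c
        frac = (df.groupBy("groupID")
                  .agg((F.sum(F.col(c).isNull().cast("int"))
                        / F.count("*")).alias("null_frac")))
        # bring the offending group IDs back to the driver
        bad = [r["groupID"] for r in
               frac.filter(F.col("null_frac") > THRESHOLD).collect()]
        # null out column c for those groups
        df = df.withColumn(c,
                F.when(F.col("groupID").isin(bad), F.lit(None))
                 .otherwise(F.col(c)))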

The Problem:
The loop runs like a charm during the first iterations, but around the 6th or 7th iteration I see my CPU utilization drop (using 1 instead of 6 cores). Along with that, the execution time per iteration increases significantly. At some point, I get an OutOfMemoryError:

* spark.driver.memory < 4G: fails at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: fails at count() (FAIL 2)

Enabling a heap dump on OOM (and analyzing it with Eclipse MAT) showed two classes retaining a lot of memory:

* java.lang.Thread
      - char[] (2G)
      - scala.collection.IndexedSeqLike
          - scala.collection.mutable.WrappedArray (1G)
      - java.lang.String (1G)

* org.apache.spark.sql.execution.ui.SQLListener
      - org.apache.spark.sql.execution.ui.SQLExecutionUIData
        (several instances, up to 1G each)
          - java.lang.String
      - ...

Turning off the Spark UI and/or setting the spark.ui.retainedXXX properties to something low (e.g. 1) did not solve the issue.
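
For completeness, this is the kind of configuration I tried (standard Spark properties; the values are examples):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.ui.enabled", "false")              # disable the UI entirely,
            .set("spark.ui.retainedJobs", "1")             # or keep it on but retain
            .set("spark.ui.retainedStages", "1")           # almost no history
            .set("spark.sql.ui.retainedExecutions", "1"))  # limits SQLExecutionUIData

    spark = SparkSession.builder.config(conf=conf).getOrCreate()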

Any idea what I am doing wrong? Or is this a bug?

My code can be found as a GitHub Gist [0]. More details are in the StackOverflow question [1] I posted, which has not received any answers so far.

Thanks!

[0] https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1] http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace

PS: As a workaround, I have been calling checkpoint() after every few iterations.
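Roughly like this (a sketch; process_column is a hypothetical helper standing in for the per-column step above, and N was tuned by hand):

    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    N = 3  # checkpoint every N iterations
    for i, c in enumerate(cols_to_check):
        df = process_column(df, c)
        if (i + 1) % N == 0:
            # materializes df and truncates its lineage, so the query plan
            # (and the driver-side bookkeeping) stops growing every iteration
            df = df.checkpoint()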


--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co

