Hi,

I have a dataset of ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If they do, the entire column is set to missing (null) for that group.
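
To make this concrete, the per-column logic is roughly the following. This is a simplified PySpark sketch, not the exact code (that is in the Gist [0] below); df, the column names and the threshold are placeholders:

    from pyspark.sql import functions as F

    # df: DataFrame with groupID, rowID and the data columns (loaded elsewhere)
    THRESHOLD = 0.5                      # maximum tolerated fraction of nulls
    cols_to_check = ["col_a", "col_b"]   # placeholder for my ~20 columns

    for c in cols_to_check:
        # per group: fraction of null values in column c
        frac = (df.groupBy("groupID")
                  .agg((F.sum(F.col(c).isNull().cast("int"))
                        / F.count("*")).alias("null_frac")))
        # bring the offending group IDs back to the driver
        bad = [r["groupID"] for r in
               frac.filter(F.col("null_frac") > THRESHOLD).collect()]
        # null out column c for those groups
        df = df.withColumn(c,
                F.when(F.col("groupID").isin(bad), F.lit(None))
                 .otherwise(F.col(c)))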

The Problem:
The loop runs like a charm during the first iterations, but around the 6th or 7th iteration I see my CPU utilization drop (using 1 instead of 6 cores). Along with that, the execution time per iteration increases significantly. At some point, I get an OutOfMemoryError:

* spark.driver.memory < 4G: fails at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: fails at count() (FAIL 2)

Enabling a heap dump on OOM (and analyzing it with Eclipse MAT) showed two classes retaining a lot of memory:

* java.lang.Thread
      - char[] (2G)
      - scala.collection.IndexedSeqLike
          - scala.collection.mutable.WrappedArray (1G)
      - java.lang.String (1G)

* org.apache.spark.sql.execution.ui.SQLListener
      - org.apache.spark.sql.execution.ui.SQLExecutionUIData
        (several instances, up to 1G each)
          - java.lang.String
      - ...

Turning off the Spark UI and/or setting the spark.ui.retainedXXX properties to something low (e.g. 1) did not solve the issue.
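
For completeness, this is the kind of configuration I tried (standard Spark properties; the values are examples):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.ui.enabled", "false")              # disable the UI entirely,
            .set("spark.ui.retainedJobs", "1")             # or keep it on but retain
            .set("spark.ui.retainedStages", "1")           # almost no history
            .set("spark.sql.ui.retainedExecutions", "1"))  # limits SQLExecutionUIData

    spark = SparkSession.builder.config(conf=conf).getOrCreate()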

Any idea what I am doing wrong? Or is this a bug?

My code can be found as a GitHub Gist [0]. More details are in the StackOverflow question [1] I posted, which has not received any answers so far.

Thanks!

[0] https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1] http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace

PS: As a workaround, I have been calling checkpoint() after every few iterations.
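Roughly like this (a sketch; process_column is a hypothetical helper standing in for the per-column step above, and N was tuned by hand):

    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    N = 3  # checkpoint every N iterations
    for i, c in enumerate(cols_to_check):
        df = process_column(df, c)
        if (i + 1) % N == 0:
            # materializes df and truncates its lineage, so the query plan
            # (and the driver-side bookkeeping) stops growing every iteration
            df = df.checkpoint()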


--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co

