Hi,
I have a dataset of ~5M rows x 20 columns, containing a groupID and a
rowID. My goal is to check, per group, whether (some) columns contain
more than a fixed fraction (say, 50%) of missing (null) values. If so,
the entire column is set to missing (null) for that group.
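To make the intended semantics concrete, here is a minimal sketch of that per-group logic in plain Python (no Spark), assuming rows are dicts and nulls are represented as None; the names (null_out_columns, group_key, threshold) are illustrative, not taken from my gist:

```python
from collections import defaultdict

def null_out_columns(rows, group_key, columns, threshold=0.5):
    """If, within a group, a column has more than `threshold` nulls
    (None), set that column to None for every row of the group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    for group_rows in groups.values():
        n = len(group_rows)
        for col in columns:
            nulls = sum(1 for r in group_rows if r[col] is None)
            if nulls / n > threshold:  # strictly more than the fraction
                for r in group_rows:
                    r[col] = None
    return rows
```

In the actual job this is done per column with grouped counts on the DataFrame; the sketch only pins down what "set the column to null for that group" means.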
The Problem:
The loop runs like a charm during the first iterations, but towards the
end, around the 6th or 7th iteration, CPU utilization drops (1 core in
use instead of 6) and the execution time per iteration increases
significantly. At some point I get an OutOfMemoryError:
* spark.driver.memory < 4G: at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Enabling a HeapDump on OOM (and analyzing it with Eclipse MAT) showed
two classes taking up lots of memory:
* java.lang.Thread
- char (2G)
- scala.collection.IndexedSeqLike
- scala.collection.mutable.WrappedArray (1G)
- java.lang.String (1G)
* org.apache.spark.sql.execution.ui.SQLListener
- org.apache.spark.sql.execution.ui.SQLExecutionUIData
(several instances, up to 1G each)
- java.lang.String
- ...
Turning off the Spark UI and/or setting the spark.ui.retainedXXX
properties to a low value (e.g. 1) did not solve the issue.
Any idea what I am doing wrong? Or is this a bug?
My code is available as a GitHub Gist [0]. More details can be found in
the StackOverflow question [1] I posted, which has not received any
answers so far.
Thanks!
[0]
https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1]
http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace
PS: As a workaround, I have been calling checkpoint() every few
iterations.
--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org