I'm not sure if this has been discussed already, if so, please point me to
the thread and/or related JIRA.

I have been running with about a 1 TB data volume on a 20-node D2 cluster (255
GiB/node).
I have uniformly distributed data, so skew is not a problem.

I found that the default settings (or my incorrect settings) for driver and
executor memory caused out-of-memory exceptions during the shuffle
(subtractByKey, to be exact). This was not easy to track down, for me at least.

Once I bumped the driver up to 12G and the executors to 10G, with 300 executors
and 3000 partitions, the shuffle worked quite well (12 minutes for
subtractByKey). I'm sure there are more improvements to be made, but it's a lot
better than heap space exceptions!
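
For concreteness, here is roughly the shape of the job and the settings that
worked for me. The paths and RDD names are just placeholders, and the memory
settings are passed at submit time (e.g. via spark-submit), since driver memory
can't be changed from inside a running application:

    // Submit-time settings (sketch):
    //   spark-submit --driver-memory 12g --executor-memory 10g \
    //     --num-executors 300 ...
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD functions on older Spark versions

    val conf = new SparkConf().setAppName("SubtractByKeyExample")
    val sc = new SparkContext(conf)

    // Placeholder inputs: (key, record) pairs read from HDFS.
    val left  = sc.textFile("hdfs:///data/left").map(line => (line.split('\t')(0), line))
    val right = sc.textFile("hdfs:///data/right").map(line => (line.split('\t')(0), line))

    // 3000 shuffle partitions; combined with 10g executors this avoided the OOM.
    val diff = left.subtractByKey(right, 3000)
    diff.saveAsTextFile("hdfs:///data/diff")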

From my reading, the shuffle OOM problem is in ExternalAppendOnlyMap or a
similar disk-backed collection.
I have some familiarity with that code based on previous work with external
sorting.

Is it possible to detect the misconfiguration that leads to these OOMs and
produce a more meaningful error message? I think that would really help users
who might not understand all the inner workings and configuration of Spark
(myself included). As it is, heap space issues are a challenge to diagnose and
do not present Spark in a positive light.
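
To make the idea concrete, here is a rough sketch of what I have in mind. The
name and parameters below are made up, and this is not how Spark's shuffle code
is actually structured; the point is just to catch the OutOfMemoryError where
the disk-backed shuffle collection grows and rethrow it with a configuration
hint:

    // Hypothetical sketch only, not Spark's actual shuffle code.
    def withShuffleOomHint[T](numPartitions: Int, executorMemoryMb: Long)(body: => T): T = {
      try {
        body
      } catch {
        case oom: OutOfMemoryError =>
          val hint = new OutOfMemoryError(
            s"Shuffle ran out of memory with $numPartitions partitions and " +
            s"$executorMemoryMb MB of executor memory. Consider increasing " +
            "spark.executor.memory or the number of shuffle partitions. " +
            s"(original: ${oom.getMessage})")
          hint.initCause(oom)
          throw hint
      }
    }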

I can help with that effort if someone is willing to point me to the precise
location of memory pressure during shuffle.

Thanks!



--
Madhu
https://www.linkedin.com/in/msiddalingaiah