I'm not sure if this has been discussed already; if so, please point me to the thread and/or related JIRA.
I have been running with about a 1 TB volume on a 20 node D2 cluster (255 GiB/node). The data is uniformly distributed, so skew is not a problem. I found that the default settings (or my wrong settings) for driver and executor memory caused out of memory exceptions during shuffle (subtractByKey, to be exact). This was not easy to track down, for me at least. Once I bumped the driver up to 12G and executors to 10G, with 300 executors and 3000 partitions, the shuffle worked quite well (12 minutes for subtractByKey). I'm sure there are more improvements to be made, but it's a lot better than heap space exceptions!

From my reading, the shuffle OOM problem is in ExternalAppendOnlyMap or a similar disk-backed collection. I have some familiarity with that code based on previous work with external sorting. Is it possible to detect the misconfiguration that leads to these OOMs and produce a more meaningful error message? I think that would really help users who might not understand all the inner workings and configuration of Spark (myself included). As it is, heap space issues are a challenge and do not present Spark in a positive light.

I can help with that effort if someone is willing to point me to the precise location of memory pressure during shuffle.

Thanks!

-----
Madhu
https://www.linkedin.com/in/msiddalingaiah
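P.S. In case it helps anyone hitting the same OOMs, here is roughly the setup that worked for me, sketched in Scala. The app name and the tiny synthetic RDDs are placeholders (the real inputs were on the order of 1 TB), and driver memory has to be given on the spark-submit command line (--driver-memory 12G) since it can't be changed after the driver JVM has started:

    import org.apache.spark.{SparkConf, SparkContext}

    // Executor-side settings that worked for me; driver memory (12G) goes on
    // the spark-submit command line instead: --driver-memory 12G
    val conf = new SparkConf()
      .setAppName("subtract-by-key-sketch")       // placeholder name
      .set("spark.executor.memory", "10g")
      .set("spark.executor.instances", "300")     // on YARN; same as --num-executors 300
      .set("spark.default.parallelism", "3000")
    val sc = new SparkContext(conf)

    // Tiny synthetic pair RDDs standing in for the real ~1 TB inputs.
    val left  = sc.parallelize(1 to 1000000).map(i => (i % 100000, i))
    val right = sc.parallelize(1 to 500000).map(i => (i % 100000, i))

    // An explicit partition count (3000) keeps each reduce task's in-memory
    // map small enough to spill to disk instead of throwing heap space errors.
    val diff = left.subtractByKey(right, 3000)
    println(diff.count())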