Thanks Akhil! I suspect the root cause of the shuffle OOM I was seeing (and probably many that users will see) is individual partitions on the reduce side not fitting in memory. As a guideline, I was thinking of something like "make sure your largest partition occupies no more than 1% of executor memory", or something to that effect. I can add that documentation to the tuning page if someone can suggest the best wording and numbers. I can also add a simple Spark shell example that estimates the largest partition size, to help choose executor memory and the number of partitions.
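Something along these lines is what I had in mind for the Spark shell example (the input path, delimiter, partition count, and executor memory below are made up for illustration; string length is only a crude proxy for bytes, and a real estimate might sample records with something like SizeEstimator instead):

```scala
// Rough spark-shell sketch: estimate per-partition sizes of a shuffled
// RDD without collecting the data to the driver.
val pairs = sc.textFile("hdfs://.../input")            // hypothetical path
              .map(line => (line.split("\t")(0), line))
val grouped = pairs.groupByKey(200)  // use the same partition count as the real job

val partSizes = grouped.mapPartitionsWithIndex { (i, iter) =>
  // String length as a rough stand-in for record size in bytes.
  val bytes = iter.map { case (k, vs) =>
    k.length.toLong + vs.iterator.map(_.length.toLong).sum
  }.sum
  Iterator((i, bytes))
}.collect()

val largest = partSizes.map(_._2).max
// Guideline check: compare against a fraction of executor memory.
val executorMemBytes = 4L * 1024 * 1024 * 1024  // e.g. 4 GB executors
println(s"largest partition ~ $largest bytes " +
        s"(${100.0 * largest / executorMemBytes}% of executor memory)")
```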
One more question: I'm trying to get my head around the shuffle code. I see ShuffleManager, but that seems to be on the reduce side. Where is the code that drives the map-side writes and reduce-side reads? I think it's possible to add up reduce-side volume for a key (they are byte reads at some point) and raise an alarm if it gets too high. Even a warning on the console would be better than a catastrophic OOM.

-----
Madhu
https://www.linkedin.com/in/msiddalingaiah

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Detecting-configuration-problems-tp13980p13998.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
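P.S. The console-warning idea could be sketched roughly like this; the class, names, and threshold are hypothetical and this is not Spark's actual shuffle-read code path, just the shape of the accounting I mean:

```scala
// Hypothetical sketch: wrap the reduce-side record iterator, accumulate
// bytes read, and log a warning once the running total for the partition
// crosses a threshold (instead of silently running until OOM).
class ByteCountingIterator[T](underlying: Iterator[T],
                              sizeOf: T => Long,
                              warnThresholdBytes: Long,
                              warn: String => Unit) extends Iterator[T] {
  private var bytesSoFar = 0L
  private var warned = false

  def hasNext: Boolean = underlying.hasNext

  def next(): T = {
    val rec = underlying.next()
    bytesSoFar += sizeOf(rec)
    if (!warned && bytesSoFar > warnThresholdBytes) {
      warned = true  // warn once per partition, not per record
      warn(s"Shuffle read has pulled $bytesSoFar bytes for this partition; " +
           "it may not fit in memory")
    }
    rec
  }
}
```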