Hi all,

I'm running the TeraSort benchmark with a relatively small input set: 5 GB.
During profiling, I can see that a total of 68 GB is being used. I have a
terabyte of memory in my system, and set

spark.executor.memory 900g
spark.driver.memory 900g

and use the defaults for

spark.shuffle.memoryFraction
spark.storage.memoryFraction

I believe that this gives me 0.2 * 900 = 180 GB for shuffle and
0.6 * 900 = 540 GB for storage.
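If I read the legacy memory manager correctly, safety fractions are applied
on top of these fractions as well; the 0.8 and 0.9 values below are my
assumption of the defaults for spark.shuffle.safetyFraction and
spark.storage.safetyFraction, so treat this as a back-of-the-envelope
sketch rather than authoritative numbers:

  object MemoryPools extends App {
    val executorMemoryGb = 900.0
    val shuffleFraction  = 0.2  // spark.shuffle.memoryFraction (default)
    val storageFraction  = 0.6  // spark.storage.memoryFraction (default)
    val shuffleSafety    = 0.8  // spark.shuffle.safetyFraction (assumed default)
    val storageSafety    = 0.9  // spark.storage.safetyFraction (assumed default)

    // Effective pools are fraction * safety of the executor heap.
    val shufflePoolGb = executorMemoryGb * shuffleFraction * shuffleSafety // ~144 GB
    val storagePoolGb = executorMemoryGb * storageFraction * storageSafety // ~486 GB

    println(f"shuffle pool: $shufflePoolGb%.0f GB, storage pool: $storagePoolGb%.0f GB")
  }

If those safety fractions apply, the shuffle pool would be closer to 144 GB
than 180 GB, though either way it should be far more than a 5 GB input needs.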

I noticed a lot of variation in runtime (under the same load), and tracked
this down to the following function in
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:
  private def spillToPartitionFiles(
      collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
    spillToPartitionFiles(collection.iterator)
  }
In a slow run it loops through this function 12000 times; in a fast run,
only 700 times, even though the settings in both runs are the same and
there are no other users on the system. When I look at the function calling
this (insertAll, also in ExternalSorter), I see that spillToPartitionFiles
is called only 700 times in both fast and slow runs, meaning that the
function is, in effect, invoking itself very often. Given the function's
name, I assume the system is spilling to disk. As I have sufficient memory,
I assume that I forgot to set a certain memory setting. Does anybody have
an idea which other setting I have to change so that no data is spilled in
this scenario?
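
As an aside on the snippet above: collection.iterator is an Iterator, not a
SizeTrackingPairCollection, so the inner call must resolve to a second
overload of spillToPartitionFiles rather than being literal recursion. A
minimal, compilable sketch of how I read that overload pair (the wrapper
class and the stand-in type alias are mine, and the iterator variant's body
is abbreviated):

  class SpillSketch[K, C] {
    // Stand-in for Spark's SizeTrackingPairCollection; only .iterator matters here.
    type SizeTrackingPairCollection[A, B] = Iterable[(A, B)]

    private def spillToPartitionFiles(
        collection: SizeTrackingPairCollection[(Int, K), C]): Unit = {
      // Delegates to the iterator overload below.
      spillToPartitionFiles(collection.iterator)
    }

    private def spillToPartitionFiles(iterator: Iterator[((Int, K), C)]): Unit = {
      // In the real code each element is written to its partition's spill
      // file on disk; here the iterator is simply drained.
      iterator.foreach { case ((partitionId, key), value) => () }
    }
  }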

Thanks,

Tom


