I'm trying to determine how to bound my memory use in a job working with more data than can simultaneously fit in RAM. From reading the tuning guide, my impression is that Spark's memory usage is roughly the following:
(A) in-memory RDD storage + (B) in-memory shuffle use + (C) transient memory used by all currently running tasks

I can bound A with spark.storage.memoryFraction and B with spark.shuffle.memoryFraction, but I'm wondering how to bound C.

It's been hinted at a few times on this mailing list that you can reduce memory use by increasing the number of partitions. That leads me to believe the amount of transient memory roughly follows:

    total_data_set_size / number_of_partitions * number_of_tasks_simultaneously_running_per_machine

Does this sound right? In other words, as I increase the number of partitions, the size of each partition decreases, and since each task processes a single partition and there is a bounded number of tasks in flight, my memory use has a rough upper limit.

Keith
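
P.S. For concreteness, here's the kind of setup I'm experimenting with. It's only a rough sketch: the fraction values, the partition count (2000), and the input path are placeholder assumptions, not recommendations.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-bound-job")
      // (A) cap memory reserved for cached RDDs
      .set("spark.storage.memoryFraction", "0.4")
      // (B) cap memory reserved for shuffle buffers
      .set("spark.shuffle.memoryFraction", "0.2")
    val sc = new SparkContext(conf)

    // (C) try to shrink per-task transient memory by splitting the input into
    // more, smaller partitions, so each in-flight task holds less data at once.
    val data = sc.textFile("hdfs:///path/to/input", minPartitions = 2000)
    // or, for an existing RDD: rdd.repartition(2000)

The idea being that with the two fractions fixed and partitions made small enough, the per-machine transient use should be roughly (partition size) x (number of concurrent tasks per machine), per the formula above.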