I'm trying to determine how to bound my memory use in a job working with
more data than can simultaneously fit in RAM.  From reading the tuning
guide, my impression is that Spark's memory usage is roughly the following:

(A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory
used by all currently running tasks

I can bound A with spark.storage.memoryFraction and B with
spark.shuffle.memoryFraction, but I'm wondering how to bound C.
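
For concreteness, here's roughly how I'm setting the first two (the app
name and the fraction values below are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values -- just showing where A and B get bounded.
    val conf = new SparkConf()
      .setAppName("memory-bounding-example")       // placeholder app name
      // (A) fraction of executor heap reserved for cached/persisted RDDs
      .set("spark.storage.memoryFraction", "0.4")
      // (B) fraction of executor heap shuffles can use before spilling to disk
      .set("spark.shuffle.memoryFraction", "0.3")
    val sc = new SparkContext(conf)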

It's been hinted at a few times on this mailing list that you can reduce
memory use by increasing the number of partitions.  That leads me to
believe that the amount of transient memory roughly follows:

(total_data_set_size / number_of_partitions) *
number_of_tasks_simultaneously_running_per_machine
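
To make that concrete with made-up numbers: a 100 GB data set split into
1,000 partitions, with 8 tasks running at once per machine, would come out
to roughly 100 GB / 1,000 * 8 = ~800 MB of transient memory per machine.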

Does this sound right?  In other words, as I increase the number of
partitions, the size of each partition will decrease, and since each task
is processing a single partition and there are a bounded number of tasks in
flight, my memory use has a rough upper limit.
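
If that's right, I'd raise the partition count along the lines of the
sketch below (the input path and the count of 2000 partitions are made up,
just to illustrate the knob):

    // Continuing with the sc from above.
    val lines = sc.textFile("hdfs:///path/to/input", 2000)  // minPartitions hint

    // An existing RDD can also be reshuffled into more partitions:
    val repartitioned = lines.repartition(2000)

    // Shuffle operations take an explicit numPartitions as well:
    val counts = repartitioned
      .map(line => (line, 1))
      .reduceByKey(_ + _, 2000)

My understanding is that each of these only shrinks the per-partition size,
while the number of tasks running at once stays bounded by the cores given
to each executor.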

Keith
