Keith, do you mean "bound" as in (a) strictly control to some quantifiable
limit, or (b) try to minimize the amount used by each task?

If "a", then that is outside the scope of Spark's memory management, which
you should think of as an application-level (that is, above JVM) mechanism.
In this scope, Spark "voluntarily" tracks and limits the amount of memory
it uses for explicitly known data structures, such as RDDs. What Spark
cannot do is, e.g., control or manage the amount of JVM memory that a given
piece of user code might take up. For example, I might write some closure
code that allocates a large array of doubles unbeknownst to Spark.
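
To make that concrete, here is a minimal, self-contained sketch; the class
name, sizes, and local master are all illustrative, not from any real job:

    import org.apache.spark.{SparkConf, SparkContext}

    object HiddenAllocation {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("hidden-alloc").setMaster("local[2]"))

        val sums = sc.parallelize(1 to 4, numSlices = 4).map { i =>
          // ~80 MB allocated on the executor heap inside the closure.
          // Spark's storage/shuffle accounting never sees this array,
          // so no memoryFraction setting can bound it.
          val scratch = new Array[Double](10 * 1000 * 1000)
          scratch(i % scratch.length) = i.toDouble
          scratch.sum
        }.collect()

        println(sums.mkString(", "))
        sc.stop()
      }
    }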

If "b", then your thinking is in the right direction, although quite
imperfect, because of things like the example above. We often experience
OOME if we're not careful with job partitioning. What I think Spark needs
to evolve to is at least to include a mechanism for application-level hints
about task memory requirements. We might work on this and submit a PR for
it.
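
Concretely, plugging illustrative numbers into the formula from your
message (all figures here are assumptions, not measurements):

    // Illustrative numbers for Keith's estimate; not measurements.
    val totalDataSetBytes  = 100L * 1024 * 1024 * 1024 // 100 GB of input
    val numberOfPartitions = 1000L                     // e.g. rdd.repartition(1000)
    val tasksPerMachine    = 8L                        // concurrent tasks per machine

    val perTaskBytes     = totalDataSetBytes / numberOfPartitions // ~100 MB
    val transientPerNode = perTaskBytes * tasksPerMachine         // ~800 MB

    println(s"~${transientPerNode / (1024 * 1024)} MB transient per machine")

Any closure-level allocation, like the array example above, sits on top of
that, and no partitioning scheme can bound it.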

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Tue, May 27, 2014 at 5:33 PM, Keith Simmons <ke...@pulse.io> wrote:

> I'm trying to determine how to bound my memory use in a job working with
> more data than can simultaneously fit in RAM.  From reading the tuning
> guide, my impression is that Spark's memory usage is roughly the following:
>
> (A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory
> used by all currently running tasks
>
> I can bound A with spark.storage.memoryFraction and I can bound B with
> spark.shuffle.memoryFraction.  I'm wondering how to bound C.
>
> It's been hinted at a few times on this mailing list that you can reduce
> memory use by increasing the number of partitions.  That leads me to
> believe that the amount of transient memory roughly follows:
>
> total_data_set_size/number_of_partitions *
> number_of_tasks_simultaneously_running_per_machine
>
> Does this sound right?  In other words, as I increase the number of
> partitions, the size of each partition will decrease, and since each task
> is processing a single partition and there are a bounded number of tasks in
> flight, my memory use has a rough upper limit.
>
> Keith
>
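
P.S. For completeness, the two knobs that bound (A) and (B) are ordinary
SparkConf settings. A minimal sketch with illustrative values (IIRC the
defaults are 0.6 and 0.2, respectively):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("bounded-memory-job") // hypothetical app name
      .set("spark.storage.memoryFraction", "0.4") // bounds (A): cached RDDs
      .set("spark.shuffle.memoryFraction", "0.2") // bounds (B): shuffle buffers
    val sc = new SparkContext(conf)

(C) is then whatever heap remains, minus anything user closures allocate
on their own.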
