Thanks! Sounds like my rough understanding was roughly right :)
I definitely understand that cached RDDs can add to the memory
requirements. Luckily, as you mentioned, you can configure Spark to spill
them to disk and bound their total in-memory size via
spark.storage.memoryFraction, so I have a ...
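For concreteness, a minimal sketch of those two knobs (the app name, path,
and fraction value here are hypothetical; spark.storage.memoryFraction is
the legacy static-memory setting, superseded by unified memory management
in Spark 1.6):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Bound the fraction of executor heap reserved for cached RDDs
    // (legacy static memory management; the default was 0.6).
    val conf = new SparkConf()
      .setAppName("cache-bounding-sketch")        // hypothetical app name
      .set("spark.storage.memoryFraction", "0.3")
    val sc = new SparkContext(conf)

    // MEMORY_AND_DISK spills partitions that don't fit in the bounded
    // cache out to local disk instead of evicting and recomputing them.
    val data = sc.textFile("hdfs:///path/to/big/dataset") // hypothetical path
    data.persist(StorageLevel.MEMORY_AND_DISK)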
Keith, do you mean "bound" as in (a) strictly control to some quantifiable
limit, or (b) try to minimize the amount used by each task?
If (a), then that is outside the scope of Spark's memory management, which
you should think of as an application-level mechanism (that is, one that
sits above the JVM). In this scope, ...
A dash of both. I want to know enough that I can reason about, rather
than strictly control, the amount of memory Spark will use. If I have a
big data set, I want to understand how to design my job so that Spark's
memory consumption stays below my available resources. Or alternatively,
if it's ...
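One way to do that kind of back-of-the-envelope reasoning (a sketch of one
possible approach, not something from this thread; the Record type and the
record count are made up) is to measure a representative deserialized
record with Spark's SizeEstimator utility and scale up:

    import org.apache.spark.util.SizeEstimator

    // Hypothetical record type standing in for the big data set.
    case class Record(id: Long, payload: Array[Byte])

    // Estimate the in-memory footprint of one deserialized record.
    val sample = Record(1L, new Array[Byte](1024))
    val bytesPerRecord = SizeEstimator.estimate(sample)

    // Rough bound: the cached dataset must fit within
    // executor heap * spark.storage.memoryFraction, summed over executors.
    val numRecords = 100000000L // assumed dataset size
    val gib = (numRecords * bytesPerRecord).toDouble / (1L << 30)
    println(f"~$gib%.1f GiB needed to cache the full dataset in memory")

If the estimate comes out above what memoryFraction leaves available,
MEMORY_AND_DISK (or a serialized storage level) is the usual escape hatch.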