As I mentioned before, I am writing my master's thesis on storage and
memory usage in Spark. Right now I am looking specifically at the different
memory fractions:

I was able to find three memory regions, but they seem to leave some memory
unaccounted for:
1. spark.shuffle.memoryFraction: 20%
2. spark.storage.memoryFraction: 60%
3. spark.storage.unrollFraction: 20% of spark.storage.memoryFraction = 12%

4a. Unaccounted: 100 - (20 + 60 + 12) = 8%
or, if the unroll fraction is not merely sized relative to
spark.storage.memoryFraction but actually resides inside it:
4b. Unaccounted: 100 - (20 + 0.8*60 + 0.2*60) = 20%
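To make the arithmetic above concrete, here is a small self-contained Scala
sketch of the two interpretations (the fraction values are simply the defaults
quoted above; nothing here queries Spark itself):

```scala
// Sketch of the two interpretations of the unroll fraction.
object MemoryFractions {
  val shuffleFraction = 0.20 // spark.shuffle.memoryFraction
  val storageFraction = 0.60 // spark.storage.memoryFraction
  val unrollFraction  = 0.20 // spark.storage.unrollFraction (of storage)

  // 4a: unroll is a separate region, sized relative to storage
  val unaccountedA =
    1.0 - (shuffleFraction + storageFraction + unrollFraction * storageFraction)

  // 4b: unroll lives inside the storage region
  val unaccountedB = 1.0 - (shuffleFraction + storageFraction)

  def main(args: Array[String]): Unit = {
    println(f"4a: ${unaccountedA * 100}%.0f%% unaccounted")
    println(f"4b: ${unaccountedB * 100}%.0f%% unaccounted")
  }
}
```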

Question 1: How big is the unaccounted fraction, and what is it used for?
(Expected answer: Spark environment)

Question 2: What is stored into spark.storage.memoryFraction?
From the log messages, with all RDDs cached:
14/09/23 10:56:56 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 184.7 KB, free 47.1 GB)
14/09/23 13:13:11 INFO MemoryStore: Block rdd_1_1 stored as values in memory
(estimated size 1458.0 MB, free 47.1 GB)
Expected answer: broadcast variables, cached RDDs, and potentially unrolled
blocks (although the latter do not appear in the messages, and no
corresponding size reduction is noticeable in these log messages)
Remark: If nothing else resides in this area, then when the user does not
call .cache() or .persist(MEMORY), a lot of memory is left unused, since the
broadcast variables are comparatively small, and unrolling, if it is stored
here, takes at most 20% of the 60%, right?
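As a back-of-the-envelope illustration of that remark, here is a hedged Scala
sketch assuming a 256 GB executor heap (the size mentioned in Question 5) and
the default fractions; it deliberately ignores any internal safety margins
Spark may additionally apply, so the numbers are rough upper bounds:

```scala
// Rough region sizes on an assumed 256 GB heap with default fractions.
object RegionSizes {
  val heapGB    = 256.0
  val storageGB = heapGB * 0.60    // region for cached RDDs / broadcasts
  val unrollGB  = storageGB * 0.20 // cap for unrolling blocks
  val shuffleGB = heapGB * 0.20    // region for shuffle data structures

  def main(args: Array[String]): Unit =
    println(f"storage=$storageGB%.1f GB, unroll cap=$unrollGB%.1f GB, " +
            f"shuffle=$shuffleGB%.1f GB")
}
```

If the remark above is right, then without .cache()/.persist(MEMORY) most of
that storage region would sit idle apart from the small broadcast variables.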

Question 3: Which RDDs are not only instantiated, but also actually filled
with data?
I am trying to estimate the size of the dataset I have. I know that because
of lazy evaluation we can never be certain, but it should be possible to
estimate a minimum. Is it safe to assume that at least the RDDs that are the
output of a sort/shuffle stage, and the ones the user calls
{cache(), persist(MEMORY), collect()} on, are not only instantiated but also
filled with data? And are there any other assumptions we can make, for
instance about the other RDDs?

Question 4a: Where is intermediate data between stages stored? 
Question 4b: Where is intermediate data during stages stored? 
When I do not use rdd.cache(), I do not see the memory in
storage.memoryFraction go up. Therefore, I think we can rule this fraction
out.
The intermediate data from a sort/shuffle uses the ExternalSorter or the
ExternalAppendOnlyMap, which relates to the shuffle portion. Is this data
moved and removed at the end of the stage, or does the next stage retrieve
it from there?
Is there any more intermediate data?
If only the RDDs that relate to a sort/shuffle are filled, then I would
expect this data to live in this area, but it is also possible that it is
moved elsewhere once the particular shuffle finishes?

Question 5: If I have sufficient memory (256G), will there be a difference
in execution time between caching no RDDs and caching all RDDs?
I did not expect one, but my intermediate results show a 1.5 to 2x
difference.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-memory-regions-tp8577.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
