Hi All, I currently have 3 questions regarding memory usage:

1)
Regarding overall memory usage:
If I set SPARK_DRIVER_MEMORY to x GB, Spark reports
14/09/11 15:36:41 INFO MemoryStore: MemoryStore started with capacity ~0.55*x GB
*Question:*
Does this relate to spark.storage.memoryFraction (default 0.6)? And is the
other 0.4 split between spark.shuffle.memoryFraction (default 0.2) and Spark's
general usage (the remaining 0.2)?
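
For reference, a rough sketch of where these settings would live (assuming the
Spark 1.x property names; the driver heap itself has to be set before the JVM
starts, e.g. via SPARK_DRIVER_MEMORY or spark-submit --driver-memory):

import org.apache.spark.SparkConf;

// Hypothetical config sketch: the heap is split by these two fractions,
// with whatever is left over used for general task execution.
SparkConf conf = new SparkConf()
    .setAppName("MemoryFractions")
    .set("spark.storage.memoryFraction", "0.6")   // cached RDD blocks (the MemoryStore)
    .set("spark.shuffle.memoryFraction", "0.2");  // in-memory shuffle/aggregation buffers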

2)
Regarding RDDs and intermediate storage:
Say I have two different programs:
a)
JavaPairRDD<String, String> a = file.flatMapToPair(new SomeFlatMapFunction());
JavaPairRDD<String, String> b = a.reduceByKey(new SomeReduceFunction());
JavaPairRDD<String, String> c = b.mapValues(new SomeMapFunction());

b)
file.flatMapToPair(new SomeFlatMapFunction())
    .reduceByKey(new SomeReduceFunction())
    .mapValues(new SomeMapFunction());

I am now wondering which RDDs are actually created, and whether they are the
same in both situations.
I could see a scenario in a) in which lazy evaluation leads to a situation
similar to
int a, b, c;
a = 0;
b = a;
c = b;
where the compiler removes a and b and only stores c.

I could also see a scenario in b) in which Spark needs to create intermediate
RDDs to pass the data to the next function; they are never explicitly
mentioned, but they are still there. In that case, RDDs are created and
potentially removed.

Now when I look at the output, I see
MappedRDD[37]
but I only defined 18 RDDs in my code with JavaPairRDD.

*Question:*
When are RDDs actually created? Can I trace them one-to-one in the output?
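
One thing I considered for tracing this is printing the lineage of the last
RDD, e.g. (a sketch, assuming c from program a) above):

// Prints the chain of RDDs behind c, including intermediate ones that were
// never bound to a variable; the [n] ids should match log lines like MappedRDD[37].
System.out.println(c.toDebugString());

but I am not sure whether every id in the logs maps back to an RDD in my own code.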

3)
Regarding the size of an RDD:
When I run a program, I see the following lines:
14/09/11 15:36:44 INFO MemoryStore: ensureFreeSpace(3760) called with curMem=360852, maxMem=2899102924
14/09/11 15:36:44 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.7 KB, free 2.7 GB)
But also
14/09/11 12:57:08 INFO ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 493 MB to disk (7 times so far)
14/09/11 12:57:09 INFO ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 493 MB to disk (8 times so far)

I could see a scenario in which a shuffle uses more memory than the actual RDD
store needs, but this seems disproportionate to me.
*Question:*
Where can I see the actual size of an individual RDD? Or is there a way to
calculate it?
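
A sketch of what I was thinking of trying (assuming sc is the JavaSparkContext
and c is the final RDD from question 2; getRDDStorageInfo() is a developer API,
so this may vary between versions):

import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.storage.RDDInfo;

// Cache the RDD and force it to materialize, then ask the driver for the
// per-RDD storage numbers (the same figures shown on the web UI's "Storage" tab).
c.persist(StorageLevels.MEMORY_ONLY);
c.count();  // any action triggers computation and caching

for (RDDInfo info : sc.sc().getRDDStorageInfo()) {
  System.out.println("RDD " + info.id() + " (" + info.name() + "): "
      + info.memSize() + " bytes in memory, " + info.diskSize() + " bytes on disk");
}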

Thanks a lot for any help!!

Tom
