Hi all,

I currently have three questions regarding memory usage.

1) Regarding overall memory usage: if I set SPARK_DRIVER_MEMORY to x GB, Spark reports

    14/09/11 15:36:41 INFO MemoryStore: MemoryStore started with capacity ~0.55*x GB

*Question:* Does this relate to spark.storage.memoryFraction (default 0.6), and is the remaining 0.4 used by spark.shuffle.memoryFraction (default 0.2) plus Spark's general usage (0.2?)?
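To make explicit the arithmetic I am assuming, here is my back-of-the-envelope check (the 0.9 safety factor is only my guess from skimming the code, so please correct me if that is not where the ~0.55 comes from):

    // Back-of-the-envelope check; my assumption: capacity = maxMemory * memoryFraction * safetyFraction
    public class StorageFractionGuess {
        public static void main(String[] args) {
            double driverMemoryGb = Double.parseDouble(args[0]); // whatever SPARK_DRIVER_MEMORY is set to
            double storageMemoryFraction = 0.6;  // spark.storage.memoryFraction (default)
            double storageSafetyFraction = 0.9;  // a safety fraction? -- my assumption
            double shuffleMemoryFraction = 0.2;  // spark.shuffle.memoryFraction (default)

            // 0.6 * 0.9 = 0.54, which would roughly match the ~0.55 * x I see in the log
            System.out.printf("expected MemoryStore capacity: ~%.2f GB%n",
                    driverMemoryGb * storageMemoryFraction * storageSafetyFraction);
            System.out.printf("expected shuffle budget:       ~%.2f GB%n",
                    driverMemoryGb * shuffleMemoryFraction);
        }
    }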
2) Regarding RDDs and intermediate storage: say I have two different programs:

a)

    JavaPairRDD<String, String> a = file.flatMapToPair(new SomeClass());
    JavaPairRDD<String, String> b = a.reduceByKey(new SomeClass());
    JavaPairRDD<String, String> c = b.mapValues(new SomeClass());

b)

    file.flatMapToPair(new SomeClass())
        .reduceByKey(new SomeClass())
        .mapValues(new SomeClass());

I am now wondering which RDDs are actually created, and whether they are the same in both situations. I could see a scenario in a) in which lazy evaluation behaves similarly to

    int a, b, c;
    a = 0;
    b = a;
    c = b;

where the compiler removes a and b and only stores c. I could also see a scenario in b) in which Spark needs to create intermediate RDDs to pass the data to the next function, without them being explicitly mentioned, but still being there. In that case RDDs are created, and potentially removed. Now when I look at the output, I see

    MappedRDD[37]

but I only defined 18 RDDs in my code with JavaPairRDD.

*Question:* When are RDDs actually created? Can I trace them one-to-one in the output?

3) Regarding the size of an RDD: when I run a program, I see the following lines:

    14/09/11 15:36:44 INFO MemoryStore: ensureFreeSpace(3760) called with curMem=360852, maxMem=2899102924
    14/09/11 15:36:44 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 3.7 KB, free 2.7 GB)

But also:

    14/09/11 12:57:08 INFO ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 493 MB to disk (7 times so far)
    14/09/11 12:57:09 INFO ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 493 MB to disk (8 times so far)

I could see a scenario in which a shuffle uses more memory than the actual RDD store needs, but this seems disproportionate to me.

*Question:* Where can I see the actual size of an individual RDD? Or is there a way to calculate it?

Thanks a lot for any help!!

Tom
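PS: For question 3, here is roughly what I have been trying so far to measure a single RDD myself. I am assuming getRDDStorageInfo() (and the web UI "Storage" tab) report per-RDD sizes, which may well be my misunderstanding, hence the question:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.RDDInfo;
    import org.apache.spark.storage.StorageLevel;

    public class RddSizeCheck {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("rdd-size-check"));
            JavaRDD<String> lines = jsc.textFile(args[0]);

            lines.persist(StorageLevel.MEMORY_ONLY());
            lines.count();  // force evaluation so the blocks actually get cached

            // Dump what the driver knows about cached RDDs (I believe this is the same
            // information the web UI "Storage" tab shows).
            for (RDDInfo info : jsc.sc().getRDDStorageInfo()) {
                System.out.println("RDD " + info.id() + " (" + info.name() + "): "
                        + info.memSize() + " bytes in memory, "
                        + info.diskSize() + " bytes on disk");
            }
            jsc.stop();
        }
    }

As far as I can tell this only covers RDDs that are explicitly persisted, so it does not seem to explain the spill sizes above.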