[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318530#comment-15318530 ]

Gabor Feher commented on SPARK-15796:
-------------------------------------

MEMORY_ONLY caching works so that when a partition doesn't fit into memory, 
Spark simply doesn't store it in the memory cache region. It prints messages 
like this:
{code}
16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would 
require dropping another block from the same RDD
16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in 
memory! (computed 5.5 MB so far)
{code}
MEMORY_AND_DISK caching works so that if a partition doesn't fit into memory, 
Spark saves it to disk instead. It prints messages like this:
{code}
16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk 
instead.
{code}
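
For reference, here is a minimal sketch of the two storage levels in question 
(the app name and data sizes are placeholders):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("StorageLevelDemo"))

// MEMORY_ONLY: a partition that doesn't fit in memory is not cached at all
sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY)

// MEMORY_AND_DISK: a partition that doesn't fit in memory is written to disk
sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
{code}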

In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data, 
as you suggest, then why does Spark even bother dropping blocks from memory? If 
it's a non-goal to store oversized RDDs, it would be much simpler to just throw 
an OOM.
In the MEMORY_AND_DISK case, I can see the exact same GC issue as with 
MEMORY_ONLY. But there the whole point should be that we are caching RDDs that 
don't fit into memory, no?

So, these two behaviors made me assume that Spark will keep working even if I 
try to cache more data than fits into memory. I understand if you say that this 
is a JVM-implementation-dependent issue; I have no idea how many people use 
JVMs other than OpenJDK. But this raises the question: are there any situations 
in which it makes sense to raise "spark.memory.fraction" above the old 
generation size? For caching I can say it doesn't make sense, but maybe 
execution could use it meaningfully?
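
For concreteness, a sketch of the two workarounds I found, using the same flags 
as in the repro below (the 2/3 old generation share is an assumption based on 
OpenJDK's default NewRatio=2):
{code}
import org.apache.spark.SparkConf

// With NewRatio=2 (the OpenJDK default), the old generation is ~2/3 of the
// heap, while spark.memory.fraction defaults to 0.75 in Spark 1.6, i.e. more
// long-lived data than the old generation can hold.
val conf = new SparkConf()
  .setAppName("Cache Demo Application")
  .set("spark.memory.fraction", "0.6") // keep cache + execution within old gen
// Alternatively, keep the 0.75 default and pass -XX:NewRatio=3 to the JVM,
// which makes the old generation 3/4 of the heap.
{code}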

Maybe it is worth mentioning that my use case is not that exotic: we are 
developing a program based on Spark that works with user-provided data, so 
there is no way to tell at implementation time whether a particular RDD will 
fit into memory or not.

Speaking of storageFraction, I was not trying to say that there is a problem 
with it. But if I understand correctly, the following sentence in 
http://spark.apache.org/docs/1.6.1/tuning.html is not correct:
{quote}
In the GC stats that are printed, if the OldGen is close to being full, reduce 
the amount of memory used for caching by lowering spark.memory.storageFraction; 
it is better to cache fewer objects than to slow down task execution!
{quote}
That is because lowering storageFraction will not actually reduce the amount of 
memory used for caching unless execution needs that memory.
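
To make that concrete, a back-of-the-envelope calculation with assumed numbers 
(a 3 GB heap and the Spark 1.6 defaults, ignoring Spark's small reserved 
overhead):
{code}
val heap = 3.0                    // GB
val unified = heap * 0.75         // spark.memory.fraction = 0.75:
                                  //   ~2.25 GB, the real upper limit on cache
val storageFloor = unified * 0.5  // spark.memory.storageFraction = 0.5:
                                  //   ~1.13 GB floor protected from eviction
// Lowering storageFraction only shrinks the protected floor. If execution is
// idle, storage can still fill the whole ~2.25 GB region, so the amount of
// cached data (and thus old-gen pressure) does not go down.
{code}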

Thanks for looking into the issue! To sum up, this is at least a bug in the 
documentation:
* tuning.html should give better advice for when the OldGen is close to being 
full
* I'd prefer a mention of these GC issues somewhere near the cache docs, given 
that, I believe, many people use OpenJDK with its default settings.

> Spark 1.6 default memory settings can cause heavy GC when caching
> -----------------------------------------------------------------
>
>                 Key: SPARK-15796
>                 URL: https://issues.apache.org/jira/browse/SPARK-15796
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 1.6.0, 1.6.1
>            Reporter: Gabor Feher
>            Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro: a 
> simple program that fills the memory cache of Spark using a MEMORY_ONLY 
> cached RDD (but of course this comes up in more complex situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("Cache Demo Application")
>     val sc = new SparkContext(conf)
>     val startTime = System.currentTimeMillis()
>     val cacheFiller = sc.parallelize(1 to 500000000, 1000)
>       .mapPartitionsWithIndex {
>         case (ix, it) =>
>           println(s"CREATE DATA PARTITION ${ix}")
>           val r = new scala.util.Random(ix)
>           it.map(x => (r.nextLong, r.nextLong))
>       }
>     cacheFiller.persist(StorageLevel.MEMORY_ONLY)
>     cacheFiller.foreach(identity)
>     val finishTime = System.currentTimeMillis()
>     val elapsedTime = (finishTime - startTime) / 1000
>     println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> laptop, often pausing for slow Full GC cycles. I can also see with jvisualvm 
> (Visual GC plugin) that the JVM's old generation is 96.8% full.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the flags below, the run time drops to around 40-50 
> seconds, and the difference comes from the drop in GC time:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit on cache 
> size. It defaults to 0.6, and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper limit but more like a lower 
> limit (a floor protected from eviction) on the size of Spark's cache. The 
> real upper limit is spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.


