I would remove all the GC tuning and add it back later once you have found the 
underlying root cause. Usually more GC activity means you need to provide more 
memory, because something has changed (your application, the Spark version, etc.).
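
For instance, as a first step you could drop the extraJavaOptions line entirely, 
or keep only the GC logging flags while you investigate. A minimal sketch, 
assuming you otherwise leave your settings as they are:

# GC tuning removed; optionally keep only GC logging while investigating
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps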

We don’t have your full code to give exact advice, but you may want to rethink 
the one-core-per-executor approach and run fewer executors with more cores per 
executor. That can sometimes lead to higher heap usage per executor (especially 
if you broadcast). Keep in mind that more cores per executor usually also 
requires more memory per executor, although you need fewer executors overall. 
Similarly, the number of executor instances might simply be too high, so that 
each one does not get enough heap. You can also just increase the executor 
memory; a sketch of what that could look like follows below.
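
As a purely illustrative sketch of the fewer-but-bigger-executors idea (the 
concrete numbers are assumptions that keep roughly the same total cores and 
memory as your current settings, and would need tuning for your Dataproc 
machine types):

# fewer executors, more cores and memory per executor (illustrative values only)
spark.executor.instances=5
spark.executor.cores=4
spark.executor.memory=36g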

> On 29.07.2019, at 08:22, Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
> 
> Hi,
> 
> We were running Logistic Regression in Spark 2.2.X and then we tried to see 
> how it does in Spark 2.3.X. Now we are facing an issue while running a 
> Logistic Regression model in Spark 2.3.X on top of YARN (GCP Dataproc). The 
> treeAggregate step takes a huge amount of time due to very high GC activity. I 
> have tuned the GC, created different-sized clusters, tried a higher Spark 
> version (2.4.X) and smaller data, but nothing helps. The GC time is 100 - 1000 
> times the processing time on average per iteration. 
> 
> The strange part is that in Spark 2.2 this doesn't happen at all. Same code, 
> same cluster sizing, same data in both cases.
> 
> I was wondering if someone can explain this behaviour and help me resolve 
> it. How can the same code behave so differently in two Spark versions, 
> especially in the newer ones?
> 
> Here is the config which I used:
> 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> #GC Tuning
> spark.executor.extraJavaOptions= -XX:+UseG1GC -XX:+PrintFlagsFinal 
> -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions 
> -XX:+G1SummarizeConcMark -Xms9000m -XX:ParallelGCThreads=20 
> -XX:ConcGCThreads=5
> 
> spark.executor.instances=20
> spark.executor.cores=1
> spark.executor.memory=9010m
> 
> 
> Regards,
> Dhrub
> 
