Repository: spark
Updated Branches:
  refs/heads/master 209e1b3c0 -> 342278c09


[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6

## What changes were proposed in this pull request?

Collect the GC discussion in one section, and document findings about the G1 GC heap region size.

## How was this patch tested?

Jekyll doc build

Author: Sean Owen <so...@cloudera.com>

Closes #14732 from srowen/SPARK-16320.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/342278c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/342278c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/342278c0

Branch: refs/heads/master
Commit: 342278c09cf6e79ed4f63422988a6bbd1e7d8a91
Parents: 209e1b3
Author: Sean Owen <so...@cloudera.com>
Authored: Mon Aug 22 11:15:53 2016 -0700
Committer: Yin Huai <yh...@databricks.com>
Committed: Mon Aug 22 11:15:53 2016 -0700

----------------------------------------------------------------------
 docs/tuning.md | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/342278c0/docs/tuning.md
----------------------------------------------------------------------
diff --git a/docs/tuning.md b/docs/tuning.md
index 976f2eb..cbf3721 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -122,21 +122,8 @@ large records.
 `R` is the storage space within `M` where cached blocks are immune to being evicted by execution.
 
 The value of `spark.memory.fraction` should be set in order to fit this amount of heap space
-comfortably within the JVM's old or "tenured" generation. Otherwise, when much of this space is
-used for caching and execution, the tenured generation will be full, which causes the JVM to
-significantly increase time spent in garbage collection. See
-<a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html">Java GC sizing documentation</a>
-for more information.
-
-The tenured generation size is controlled by the JVM's `NewRatio` parameter, which defaults to 2,
-meaning that the tenured generation is 2 times the size of the new generation (the rest of the heap).
-So, by default, the tenured generation occupies 2/3 or about 0.66 of the heap. A value of
-0.6 for `spark.memory.fraction` keeps storage and execution memory within the old generation with
-room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then `NewRatio` may have to
-increase to 6 or more.
-
-`NewRatio` is set as a JVM flag for executors, which means adding
-`spark.executor.extraJavaOptions=-XX:NewRatio=x` to a Spark job's configuration.
+comfortably within the JVM's old or "tenured" generation. See the discussion of advanced GC
+tuning below for details.
 
 ## Determining Memory Consumption
 
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful
 * Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
   before a task completes, it means that there isn't enough memory available for executing tasks.
 
-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
-  memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer
-  objects than to slow down task execution!
-
 * If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
   can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden
   is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
   up by 4/3 is to account for space used by survivor regions as well.)
+
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
+  memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer
+  objects than to slow down task execution. Alternatively, consider decreasing the size of
+  the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the
+  value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation
+  occupies 2/3 of the heap. It should be large enough such that this fraction exceeds `spark.memory.fraction`.
+
+* Try the G1GC garbage collector with `-XX:+UseG1GC`. It can improve performance in some situations where
+  garbage collection is a bottleneck. Note that with large executor heap sizes, it may be important to
+  increase the [G1 region size](https://blogs.oracle.com/g1gc/entry/g1_gc_tuning_a_case)
+  with `-XX:G1HeapRegionSize`.
 
 * As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
   the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
@@ -237,6 +232,9 @@ Our experience suggests that the effect of GC tuning depends on your application
 There are [many more tuning options](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html) described online,
 but at a high level, managing how frequently full GC takes place can help in reducing the overhead.
 
+GC tuning flags for executors can be specified by setting `spark.executor.extraJavaOptions` in
+a job's configuration.
+
 # Other Considerations
 
 ## Level of Parallelism
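
As a worked example of the Eden-based sizing rule in the bullet above (set `-Xmn` to roughly 4/3 of the estimated Eden need `E`), here is a minimal sketch; the object name, helper, and example block size are illustrative only, not part of this commit.

```scala
// Sketch (not from this commit): derive the -Xmn flag from an Eden estimate E,
// per the 4/3*E rule; the extra 1/3 covers the survivor spaces.
object YoungGenSizing {
  def xmnFlag(edenEstimateMb: Long): String = {
    val youngGenMb = edenEstimateMb * 4 / 3
    s"-Xmn${youngGenMb}m"
  }

  def main(args: Array[String]): Unit = {
    // e.g. tasks reading 128 MB HDFS blocks that decompress to roughly 3x that size
    val edenEstimateMb = 128L * 3
    println(xmnFlag(edenEstimateMb))  // prints -Xmn512m
  }
}
```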

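The rewritten OldGen bullet says the tenured generation's share of the heap, which is `NewRatio / (NewRatio + 1)`, should stay above `spark.memory.fraction`. A small sketch of that arithmetic, assuming the common HotSpot default of `NewRatio=2` and Spark 2.0's default `spark.memory.fraction` of 0.6; the object and value names are illustrative only:

```scala
// Sketch (not from this commit): check that the Old generation can comfortably hold
// Spark's unified memory region (spark.memory.fraction of the heap).
object OldGenCheck {
  // With -XX:NewRatio=n, the Old generation is n/(n+1) of the heap (2/3 for the default n=2).
  def oldGenFraction(newRatio: Int): Double = newRatio.toDouble / (newRatio + 1)

  def main(args: Array[String]): Unit = {
    val sparkMemoryFraction = 0.6   // Spark 2.0 default for spark.memory.fraction
    val newRatio = 2                // common JVM default
    val oldGen = oldGenFraction(newRatio)
    println(f"Old gen = $oldGen%.2f of heap, spark.memory.fraction = $sparkMemoryFraction")
    // Raising spark.memory.fraction to, say, 0.8 would call for NewRatio of 6 or more (6/7 is about 0.86).
    require(oldGen > sparkMemoryFraction, "lower spark.memory.fraction or raise NewRatio")
  }
}
```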

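Finally, the note added at the end of the GC section routes executor GC flags through `spark.executor.extraJavaOptions`. One way that might look in application code, assuming a large executor heap where the G1 flags above apply; the app name, the 16m region size, and the GC-logging flags are placeholders rather than recommendations from this commit:

```scala
import org.apache.spark.SparkConf

// Sketch (not from this commit): pass executor GC flags through
// spark.executor.extraJavaOptions, and enable GC logging so the GC stats
// mentioned above can be collected.
object G1ExecutorConf {
  def main(args: Array[String]): Unit = {
    val gcFlags = Seq(
      "-XX:+UseG1GC",              // try the G1 collector
      "-XX:G1HeapRegionSize=16m",  // placeholder; larger regions for large executor heaps
      "-verbose:gc",               // print GC activity so full/minor GC counts can be checked
      "-XX:+PrintGCDetails",
      "-XX:+PrintGCTimeStamps"
    ).mkString(" ")

    val conf = new SparkConf()
      .setAppName("gc-tuning-sketch")  // placeholder name
      .set("spark.executor.extraJavaOptions", gcFlags)

    // Equivalent spark-submit form:
    //   --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=16m"
  }
}
```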