Github user ash211 commented on a diff in the pull request:
https://github.com/apache/spark/pull/880#discussion_r13115420
--- Diff: docs/configuration.md ---
@@ -94,49 +127,95 @@ there are at least five properties that you will commonly want to control:
comma-separated list of multiple directories on different disks.
NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or
- LOCAL_DIRS (YARN) envrionment variables set by the cluster manager.
+ LOCAL_DIRS (YARN) environment variables set by the cluster manager.
</td>
</tr>
<tr>
- <td><code>spark.cores.max</code></td>
- <td>(not set)</td>
+ <td><code>spark.logConf</code></td>
+ <td>false</td>
<td>
- When running on a <a href="spark-standalone.html">standalone deploy cluster</a> or a
- <a href="running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained"
- sharing mode</a>, the maximum amount of CPU cores to request for the application from
- across the cluster (not from each machine). If not set, the default will be
- <code>spark.deploy.defaultCores</code> on Spark's standalone cluster manager, or
- infinite (all available cores) on Mesos.
+ Logs the effective SparkConf as INFO when a SparkContext is started.
</td>
</tr>
</table>
-
Apart from these, the following properties are also available, and may be useful in some situations:
+#### Runtime Environment
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.default.parallelism</code></td>
+ <td><code>spark.executor.memory</code></td>
+ <td>512m</td>
<td>
- <ul>
- <li>Local mode: number of cores on the local machine</li>
- <li>Mesos fine grained mode: 8</li>
- <li>Others: total number of cores on all executor nodes or 2,
whichever is larger</li>
- </ul>
+ Amount of memory to use per executor process, in the same format as
JVM memory strings
+ (e.g. <code>512m</code>, <code>2g</code>).
</td>
+</tr>
+<tr>
+ <td><code>spark.executor.extraJavaOptions</code></td>
+ <td>(none)</td>
<td>
- Default number of tasks to use across the cluster for distributed shuffle operations
- (<code>groupByKey</code>, <code>reduceByKey</code>, etc) when not set by user.
+ A string of extra JVM options to pass to executors. For instance, GC settings or other
+ logging. Note that it is illegal to set Spark properties or heap size settings with this
+ option. Spark properties should be set using a SparkConf object or the
+ spark-defaults.conf file used with the spark-submit script. Heap size settings can be set
+ with spark.executor.memory.
</td>
</tr>
<tr>
- <td><code>spark.storage.memoryFraction</code></td>
- <td>0.6</td>
+ <td><code>spark.executor.extraClassPath</code></td>
+ <td>(none)</td>
<td>
- Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
- generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase
- it if you configure your own old generation size.
+ Extra classpath entries to append to the classpath of executors. This exists primarily
+ for backwards-compatibility with older versions of Spark. Users typically should not need
+ to set this option.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.executor.extraLibraryPath</code></td>
+ <td>(none)</td>
+ <td>
+ Set a special library path to use when launching executor JVM's.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.files.userClassPathFirst</code></td>
+ <td>false</td>
+ <td>
+ (Experimental) Whether to give user-added jars precedence over Spark's own jars when
+ loading classes in Executors. This feature can be used to mitigate conflicts between
+ Spark's dependencies and user dependencies. It is currently an experimental feature.
+ </td>
+</tr>
+</table>
+
+#### Shuffle Behavior
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+ <td><code>spark.shuffle.consolidateFiles</code></td>
+ <td>false</td>
+ <td>
+ If set to "true", consolidates intermediate files created during a shuffle. Creating fewer
+ files can improve filesystem performance for shuffles with large numbers of reduce tasks. It
+ is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option
+ might degrade performance on machines with many (>8) cores due to filesystem limitations.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.spill</code></td>
+ <td>true</td>
+ <td>
+ If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
+ This spilling threshold is specified by <code>spark.shuffle.memoryFraction</code>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.spill.compress</code></td>
+ <td>true</td>
+ <td>
+ Whether to compress data spilled during shuffles.
--- End diff --
What compression algorithm is used here?
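For concreteness, a minimal sketch of how the properties documented in this hunk might be set from application code. This is an illustration only: the app name, master, GC flag, and codec class below are placeholders, and the last line merely *assumes* that spilled shuffle data is compressed with whatever `spark.io.compression.codec` selects rather than a codec specific to spills, which is exactly what the question above is asking.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch only: the app name, master, GC flag, and codec choice are
// placeholders, not values taken from this PR.
val conf = new SparkConf()
  .setAppName("shuffle-config-example")
  .setMaster("local[2]")                               // placeholder master so the example runs standalone
  .set("spark.executor.memory", "2g")                  // JVM memory string (e.g. 512m, 2g)
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails") // GC/logging flags only, per the docs
  .set("spark.shuffle.consolidateFiles", "true")       // fewer intermediate shuffle files (ext4/xfs)
  .set("spark.shuffle.spill", "true")                  // spill reduce-side data to disk
  .set("spark.shuffle.spill.compress", "true")         // compress the spilled data
  // Assumption, not confirmed by this diff: spill compression presumably reuses the
  // general I/O codec selected below rather than a codec hard-wired to spills.
  .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")

val sc = new SparkContext(conf)
```

The same keys can equally go into the spark-defaults.conf file used with spark-submit, as the extraJavaOptions entry in the diff notes.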