Github user ash211 commented on a diff in the pull request:
https://github.com/apache/spark/pull/880#discussion_r13115420
--- Diff: docs/configuration.md ---
@@ -94,49 +127,95 @@ there are at least five properties that you will commonly want to control:
comma-separated list of multiple directories on different disks.
NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or
- LOCAL_DIRS (YARN) envrionment variables set by the cluster manager.
+ LOCAL_DIRS (YARN) environment variables set by the cluster manager.
</td>
</tr>
<tr>
- <td><code>spark.cores.max</code></td>
- <td>(not set)</td>
+ <td><code>spark.logConf</code></td>
+ <td>false</td>
<td>
- When running on a <a href="spark-standalone.html">standalone deploy cluster</a> or a
- <a href="running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained"
- sharing mode</a>, the maximum amount of CPU cores to request for the application from
- across the cluster (not from each machine). If not set, the default will be
- <code>spark.deploy.defaultCores</code> on Spark's standalone cluster manager, or
- infinite (all available cores) on Mesos.
+ Logs the effective SparkConf as INFO when a SparkContext is started.
</td>
</tr>
</table>
-
Apart from these, the following properties are also available, and may be useful in some situations:
+#### Runtime Environment
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.default.parallelism</code></td>
+ <td><code>spark.executor.memory</code></td>
+ <td>512m</td>
<td>
- <ul>
- <li>Local mode: number of cores on the local machine</li>
- <li>Mesos fine grained mode: 8</li>
- <li>Others: total number of cores on all executor nodes or 2,
whichever is larger</li>
- </ul>
+ Amount of memory to use per executor process, in the same format as
JVM memory strings
+ (e.g. <code>512m</code>, <code>2g</code>).
</td>
+</tr>
+<tr>
+ <td><code>spark.executor.extraJavaOptions</code></td>
+ <td>(none)</td>
<td>
- Default number of tasks to use across the cluster for distributed shuffle operations
- (<code>groupByKey</code>, <code>reduceByKey</code>, etc) when not set by user.
+ A string of extra JVM options to pass to executors. For instance, GC settings or other
+ logging. Note that it is illegal to set Spark properties or heap size settings with this
+ option. Spark properties should be set using a SparkConf object or the
+ spark-defaults.conf file used with the spark-submit script. Heap size settings can be set
+ with spark.executor.memory.
</td>
</tr>
<tr>
- <td><code>spark.storage.memoryFraction</code></td>
- <td>0.6</td>
+ <td><code>spark.executor.extraClassPath</code></td>
+ <td>(none)</td>
<td>
- Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
- generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase
- it if you configure your own old generation size.
+ Extra classpath entries to append to the classpath of executors. This exists primarily
+ for backwards-compatibility with older versions of Spark. Users typically should not need
+ to set this option.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.executor.extraLibraryPath</code></td>
+ <td>(none)</td>
+ <td>
+ Set a special library path to use when launching executor JVM's.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.files.userClassPathFirst</code></td>
+ <td>false</td>
+ <td>
+ (Experimental) Whether to give user-added jars precedence over Spark's own jars when
+ loading classes in Executors. This feature can be used to mitigate conflicts between
+ Spark's dependencies and user dependencies. It is currently an experimental feature.
+ </td>
+</tr>
+</table>
+
+#### Shuffle Behavior
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+ <td><code>spark.shuffle.consolidateFiles</code></td>
+ <td>false</td>
+ <td>
+ If set to "true", consolidates intermediate files created during a shuffle. Creating fewer
+ files can improve filesystem performance for shuffles with large numbers of reduce tasks. It
+ is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option
+ might degrade performance on machines with many (>8) cores due to filesystem limitations.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.spill</code></td>
+ <td>true</td>
+ <td>
+ If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
+ This spilling threshold is specified by <code>spark.shuffle.memoryFraction</code>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.spill.compress</code></td>
+ <td>true</td>
+ <td>
+ Whether to compress data spilled during shuffles.
--- End diff --
What compression algorithm is used here?
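For concreteness, a minimal sketch of how the properties documented in this hunk might be set from application code. This is an illustration only: the app name, master, GC flag, and codec class below are placeholders, and the last line merely *assumes* that spilled shuffle data is compressed with whatever `spark.io.compression.codec` selects rather than a codec specific to spills, which is exactly what the question above is asking.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch only: the app name, master, GC flag, and codec choice are
// placeholders, not values taken from this PR.
val conf = new SparkConf()
  .setAppName("shuffle-config-example")
  .setMaster("local[2]")                               // placeholder master so the example runs standalone
  .set("spark.executor.memory", "2g")                  // JVM memory string (e.g. 512m, 2g)
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails") // GC/logging flags only, per the docs
  .set("spark.shuffle.consolidateFiles", "true")       // fewer intermediate shuffle files (ext4/xfs)
  .set("spark.shuffle.spill", "true")                  // spill reduce-side data to disk
  .set("spark.shuffle.spill.compress", "true")         // compress the spilled data
  // Assumption, not confirmed by this diff: spill compression presumably reuses the
  // general I/O codec selected below rather than a codec hard-wired to spills.
  .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")

val sc = new SparkContext(conf)
```

The same keys can equally go into the spark-defaults.conf file used with spark-submit, as the extraJavaOptions entry in the diff notes.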