Repository: spark
Updated Branches:
  refs/heads/master ebbf85f07 -> ca9fe540f


[SPARK-10662] [DOCS] Code snippets are not properly formatted in tables

* Backticks are processed properly in Spark Properties table
* Removed unnecessary spaces
* See 
http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html

Author: Jacek Laskowski <jacek.laskow...@deepsense.io>

Closes #8795 from jaceklaskowski/docs-yarn-formatting.
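
For context, the property tables in these pages are raw HTML embedded in
Markdown, where backtick spans are generally not processed, so inline code was
showing up with literal backticks. The fix is to use explicit <code> tags
inside table cells. A representative before/after, taken from the
configuration.md hunk below:

    <!-- before: backticks left unrendered inside the HTML table cell -->
    <td>Location where Java is installed (if it's not on your default `PATH`).</td>

    <!-- after: explicit markup renders as inline code -->
    <td>Location where Java is installed (if it's not on your default <code>PATH</code>).</td>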


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca9fe540
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca9fe540
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca9fe540

Branch: refs/heads/master
Commit: ca9fe540fe04e2e230d1e76526b5502bab152914
Parents: ebbf85f
Author: Jacek Laskowski <jacek.laskow...@deepsense.io>
Authored: Mon Sep 21 19:46:39 2015 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Mon Sep 21 19:46:39 2015 +0100

----------------------------------------------------------------------
 docs/configuration.md           |  97 ++++++++++++++++----------------
 docs/programming-guide.md       | 100 ++++++++++++++++-----------------
 docs/running-on-mesos.md        |  14 ++---
 docs/running-on-yarn.md         | 106 ++++++++++++++++++-----------------
 docs/sql-programming-guide.md   |  16 +++---
 docs/submitting-applications.md |   8 +--
 6 files changed, 171 insertions(+), 170 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/ca9fe540/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index 5ec097c..b22587c 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -34,20 +34,20 @@ val conf = new SparkConf()
 val sc = new SparkContext(conf)
 {% endhighlight %}
 
-Note that we can have more than 1 thread in local mode, and in cases like 
Spark Streaming, we may 
+Note that we can have more than 1 thread in local mode, and in cases like 
Spark Streaming, we may
 actually require one to prevent any sort of starvation issues.
 
-Properties that specify some time duration should be configured with a unit of 
time. 
+Properties that specify some time duration should be configured with a unit of 
time.
 The following format is accepted:
- 
+
     25ms (milliseconds)
     5s (seconds)
     10m or 10min (minutes)
     3h (hours)
     5d (days)
     1y (years)
-    
-    
+
+
 Properties that specify a byte size should be configured with a unit of size.  
 The following format is accepted:
 
@@ -140,7 +140,7 @@ of the most common options to set are:
   <td>
     Amount of memory to use for the driver process, i.e. where SparkContext is 
initialized.
     (e.g. <code>1g</code>, <code>2g</code>).
-    
+
     <br /><em>Note:</em> In client mode, this config must not be set through 
the <code>SparkConf</code>
     directly in your application, because the driver JVM has already started 
at that point.
     Instead, please set this through the <code>--driver-memory</code> command 
line option
@@ -207,7 +207,7 @@ Apart from these, the following properties are also 
available, and may be useful
 
     <br /><em>Note:</em> In client mode, this config must not be set through 
the <code>SparkConf</code>
     directly in your application, because the driver JVM has already started 
at that point.
-    Instead, please set this through the <code>--driver-class-path</code> 
command line option or in 
+    Instead, please set this through the <code>--driver-class-path</code> 
command line option or in
     your default properties file.</td>
   </td>
 </tr>
@@ -216,10 +216,10 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>(none)</td>
   <td>
     A string of extra JVM options to pass to the driver. For instance, GC 
settings or other logging.
-    
+
     <br /><em>Note:</em> In client mode, this config must not be set through 
the <code>SparkConf</code>
     directly in your application, because the driver JVM has already started 
at that point.
-    Instead, please set this through the <code>--driver-java-options</code> 
command line option or in 
+    Instead, please set this through the <code>--driver-java-options</code> 
command line option or in
     your default properties file.</td>
   </td>
 </tr>
@@ -228,10 +228,10 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>(none)</td>
   <td>
     Set a special library path to use when launching the driver JVM.
-    
+
     <br /><em>Note:</em> In client mode, this config must not be set through 
the <code>SparkConf</code>
     directly in your application, because the driver JVM has already started 
at that point.
-    Instead, please set this through the <code>--driver-library-path</code> 
command line option or in 
+    Instead, please set this through the <code>--driver-library-path</code> 
command line option or in
     your default properties file.</td>
   </td>
 </tr>
@@ -242,7 +242,7 @@ Apart from these, the following properties are also 
available, and may be useful
     (Experimental) Whether to give user-added jars precedence over Spark's own 
jars when loading
     classes in the driver. This feature can be used to mitigate conflicts 
between Spark's
     dependencies and user dependencies. It is currently an experimental 
feature.
-    
+
     This is used in cluster mode only.
   </td>
 </tr>
@@ -250,8 +250,8 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.executor.extraClassPath</code></td>
   <td>(none)</td>
   <td>
-    Extra classpath entries to prepend to the classpath of executors. This 
exists primarily for 
-    backwards-compatibility with older versions of Spark. Users typically 
should not need to set 
+    Extra classpath entries to prepend to the classpath of executors. This 
exists primarily for
+    backwards-compatibility with older versions of Spark. Users typically 
should not need to set
     this option.
   </td>
 </tr>
@@ -259,9 +259,9 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.executor.extraJavaOptions</code></td>
   <td>(none)</td>
   <td>
-    A string of extra JVM options to pass to executors. For instance, GC 
settings or other logging. 
-    Note that it is illegal to set Spark properties or heap size settings with 
this option. Spark 
-    properties should be set using a SparkConf object or the 
spark-defaults.conf file used with the 
+    A string of extra JVM options to pass to executors. For instance, GC 
settings or other logging.
+    Note that it is illegal to set Spark properties or heap size settings with 
this option. Spark
+    properties should be set using a SparkConf object or the 
spark-defaults.conf file used with the
     spark-submit script. Heap size settings can be set with 
spark.executor.memory.
   </td>
 </tr>
@@ -305,7 +305,7 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>daily</td>
   <td>
     Set the time interval by which the executor logs will be rolled over.
-    Rolling is disabled by default. Valid values are `daily`, `hourly`, 
`minutely` or
+    Rolling is disabled by default. Valid values are <code>daily</code>, 
<code>hourly</code>, <code>minutely</code> or
     any interval in seconds. See 
<code>spark.executor.logs.rolling.maxRetainedFiles</code>
     for automatic cleaning of old logs.
   </td>
@@ -330,13 +330,13 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.python.profile</code></td>
   <td>false</td>
   <td>
-    Enable profiling in Python worker, the profile result will show up by 
`sc.show_profiles()`,
+    Enable profiling in Python worker, the profile result will show up by 
<code>sc.show_profiles()</code>,
     or it will be displayed before the driver exiting. It also can be dumped 
into disk by
-    `sc.dump_profiles(path)`. If some of the profile results had been 
displayed manually,
+    <code>sc.dump_profiles(path)</code>. If some of the profile results had 
been displayed manually,
     they will not be displayed automatically before driver exiting.
 
-    By default the `pyspark.profiler.BasicProfiler` will be used, but this can 
be overridden by
-    passing a profiler class in as a parameter to the `SparkContext` 
constructor.
+    By default the <code>pyspark.profiler.BasicProfiler</code> will be used, 
but this can be overridden by
+    passing a profiler class in as a parameter to the <code>SparkContext</code> 
constructor.
   </td>
 </tr>
 <tr>
@@ -460,11 +460,11 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.shuffle.service.enabled</code></td>
   <td>false</td>
   <td>
-    Enables the external shuffle service. This service preserves the shuffle 
files written by 
-    executors so the executors can be safely removed. This must be enabled if 
+    Enables the external shuffle service. This service preserves the shuffle 
files written by
+    executors so the executors can be safely removed. This must be enabled if
     <code>spark.dynamicAllocation.enabled</code> is "true". The external 
shuffle service
     must be set up in order to enable it. See
-    <a href="job-scheduling.html#configuration-and-setup">dynamic allocation 
+    <a href="job-scheduling.html#configuration-and-setup">dynamic allocation
     configuration and setup documentation</a> for more information.
   </td>
 </tr>
@@ -747,9 +747,9 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>1 in YARN mode, all the available cores on the worker in standalone 
mode.</td>
   <td>
     The number of cores to use on each executor. For YARN and standalone mode 
only.
-    
-    In standalone mode, setting this parameter allows an application to run 
multiple executors on 
-    the same worker, provided that there are enough cores on that worker. 
Otherwise, only one 
+
+    In standalone mode, setting this parameter allows an application to run 
multiple executors on
+    the same worker, provided that there are enough cores on that worker. 
Otherwise, only one
     executor per application will run on each worker.
   </td>
 </tr>
@@ -893,14 +893,14 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.akka.heartbeat.interval</code></td>
   <td>1000s</td>
   <td>
-    This is set to a larger value to disable the transport failure detector 
that comes built in to 
-    Akka. It can be enabled again, if you plan to use this feature (Not 
recommended). A larger 
-    interval value reduces network overhead and a smaller value ( ~ 1 s) might 
be more 
-    informative for Akka's failure detector. Tune this in combination of 
`spark.akka.heartbeat.pauses` 
-    if you need to. A likely positive use case for using failure detector 
would be: a sensistive 
-    failure detector can help evict rogue executors quickly. However this is 
usually not the case 
-    as GC pauses and network lags are expected in a real Spark cluster. Apart 
from that enabling 
-    this leads to a lot of exchanges of heart beats between nodes leading to 
flooding the network 
+    This is set to a larger value to disable the transport failure detector 
that comes built in to
+    Akka. It can be enabled again, if you plan to use this feature (Not 
recommended). A larger
+    interval value reduces network overhead and a smaller value ( ~ 1 s) might 
be more
+    informative for Akka's failure detector. Tune this in combination with 
<code>spark.akka.heartbeat.pauses</code>
+    if you need to. A likely positive use case for using the failure detector 
would be: a sensitive
+    failure detector can help evict rogue executors quickly. However, this is 
usually not the case
+    as GC pauses and network lags are expected in a real Spark cluster. Apart 
from that enabling
+    this leads to a lot of exchanges of heart beats between nodes leading to 
flooding the network
     with those.
   </td>
 </tr>
@@ -909,9 +909,9 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>6000s</td>
   <td>
      This is set to a larger value to disable the transport failure detector 
that comes built in to Akka.
-     It can be enabled again, if you plan to use this feature (Not 
recommended). Acceptable heart 
+     It can be enabled again, if you plan to use this feature (Not 
recommended). Acceptable heart
      beat pause for Akka. This can be used to control sensitivity to GC 
pauses. Tune
-     this along with `spark.akka.heartbeat.interval` if you need to.
+     this along with <code>spark.akka.heartbeat.interval</code> if you need to.
   </td>
 </tr>
 <tr>
@@ -978,7 +978,7 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.network.timeout</code></td>
   <td>120s</td>
   <td>
-    Default timeout for all network interactions. This config will be used in 
place of 
+    Default timeout for all network interactions. This config will be used in 
place of
     <code>spark.core.connection.ack.wait.timeout</code>, 
<code>spark.akka.timeout</code>,
     <code>spark.storage.blockManagerSlaveTimeoutMs</code>,
     <code>spark.shuffle.io.connectionTimeout</code>, 
<code>spark.rpc.askTimeout</code> or
@@ -991,8 +991,8 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>
     Maximum number of retries when binding to a port before giving up.
     When a port is given a specific value (non 0), each subsequent retry will
-    increment the port used in the previous attempt by 1 before retrying. This 
-    essentially allows it to try a range of ports from the start port 
specified 
+    increment the port used in the previous attempt by 1 before retrying. This
+    essentially allows it to try a range of ports from the start port specified
     to port + maxRetries.
   </td>
 </tr>
@@ -1191,7 +1191,7 @@ Apart from these, the following properties are also 
available, and may be useful
   <td><code>spark.dynamicAllocation.executorIdleTimeout</code></td>
   <td>60s</td>
   <td>
-    If dynamic allocation is enabled and an executor has been idle for more 
than this duration, 
+    If dynamic allocation is enabled and an executor has been idle for more 
than this duration,
     the executor will be removed. For more detail, see this
     <a href="job-scheduling.html#resource-allocation-policy">description</a>.
   </td>
@@ -1424,11 +1424,11 @@ Apart from these, the following properties are also 
available, and may be useful
   <td>false</td>
   <td>
     Enables or disables Spark Streaming's internal backpressure mechanism 
(since 1.5).
-    This enables the Spark Streaming to control the receiving rate based on 
the 
+    This enables the Spark Streaming to control the receiving rate based on the
     current batch scheduling delays and processing times so that the system 
receives
-    only as fast as the system can process. Internally, this dynamically sets 
the 
+    only as fast as the system can process. Internally, this dynamically sets 
the
     maximum receiving rate of receivers. This rate is upper bounded by the 
values
-    `spark.streaming.receiver.maxRate` and 
`spark.streaming.kafka.maxRatePerPartition`
+    <code>spark.streaming.receiver.maxRate</code> and 
<code>spark.streaming.kafka.maxRatePerPartition</code>
     if they are set (see below).
   </td>
 </tr>
@@ -1542,15 +1542,15 @@ The following variables can be set in `spark-env.sh`:
   <tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr>
   <tr>
     <td><code>JAVA_HOME</code></td>
-    <td>Location where Java is installed (if it's not on your default 
`PATH`).</td>
+    <td>Location where Java is installed (if it's not on your default 
<code>PATH</code>).</td>
   </tr>
   <tr>
     <td><code>PYSPARK_PYTHON</code></td>
-    <td>Python binary executable to use for PySpark in both driver and workers 
(default is `python`).</td>
+    <td>Python binary executable to use for PySpark in both driver and workers 
(default is <code>python</code>).</td>
   </tr>
   <tr>
     <td><code>PYSPARK_DRIVER_PYTHON</code></td>
-    <td>Python binary executable to use for PySpark in driver only (default is 
PYSPARK_PYTHON).</td>
+    <td>Python binary executable to use for PySpark in driver only (default is 
<code>PYSPARK_PYTHON</code>).</td>
   </tr>
   <tr>
     <td><code>SPARK_LOCAL_IP</code></td>
@@ -1580,4 +1580,3 @@ Spark uses [log4j](http://logging.apache.org/log4j/) for 
logging. You can config
 To specify a different configuration directory other than the default 
"SPARK_HOME/conf",
you can set SPARK_CONF_DIR. Spark will use the configuration files 
(spark-defaults.conf, spark-env.sh, log4j.properties, etc)
 from this directory.
-
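
As a side note on the configuration.md section above: the duration and byte-size
formats it describes are plain strings with a unit suffix. A minimal sketch (not
part of this patch; the property choices are only examples) of setting such
values from Scala:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("config-format-example")
      .setMaster("local[2]")
      .set("spark.network.timeout", "120s")   // a time duration with its unit
      .set("spark.executor.memory", "2g")     // a byte size with its unit

    // Per the note above: in client mode, driver settings such as
    // spark.driver.memory cannot be set here because the driver JVM has
    // already started; pass them via spark-submit (e.g. --driver-memory 2g).
    val sc = new SparkContext(conf)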

http://git-wip-us.apache.org/repos/asf/spark/blob/ca9fe540/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 4cf83bb..8ad2383 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -182,8 +182,8 @@ in-process.
 In the Spark shell, a special interpreter-aware SparkContext is already 
created for you, in the
 variable called `sc`. Making your own SparkContext will not work. You can set 
which master the
 context connects to using the `--master` argument, and you can add JARs to the 
classpath
-by passing a comma-separated list to the `--jars` argument. You can also add 
dependencies 
-(e.g. Spark Packages) to your shell session by supplying a comma-separated 
list of maven coordinates 
+by passing a comma-separated list to the `--jars` argument. You can also add 
dependencies
+(e.g. Spark Packages) to your shell session by supplying a comma-separated 
list of maven coordinates
 to the `--packages` argument. Any additional repositories where dependencies 
might exist (e.g. SonaType)
 can be passed to the `--repositories` argument. For example, to run 
`bin/spark-shell` on exactly
 four cores, use:
@@ -217,7 +217,7 @@ context connects to using the `--master` argument, and you 
can add Python .zip,
 to the runtime path by passing a comma-separated list to `--py-files`. You can 
also add dependencies
 (e.g. Spark Packages) to your shell session by supplying a comma-separated 
list of maven coordinates
 to the `--packages` argument. Any additional repositories where dependencies 
might exist (e.g. SonaType)
-can be passed to the `--repositories` argument. Any python dependencies a 
Spark Package has (listed in 
+can be passed to the `--repositories` argument. Any python dependencies a 
Spark Package has (listed in
 the requirements.txt of that package) must be manually installed using pip 
when necessary.
 For example, to run `bin/pyspark` on exactly four cores, use:
 
@@ -249,8 +249,8 @@ the [IPython Notebook](http://ipython.org/notebook.html) 
with PyLab plot support
 $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" 
./bin/pyspark
 {% endhighlight %}
 
-After the IPython Notebook server is launched, you can create a new "Python 2" 
notebook from 
-the "Files" tab. Inside the notebook, you can input the command `%pylab 
inline` as part of 
+After the IPython Notebook server is launched, you can create a new "Python 2" 
notebook from
+the "Files" tab. Inside the notebook, you can input the command `%pylab 
inline` as part of
 your notebook before you start to try Spark from the IPython notebook.
 
 </div>
@@ -418,9 +418,9 @@ Apart from text files, Spark's Python API also supports 
several other data forma
 
 **Writable Support**
 
-PySpark SequenceFile support loads an RDD of key-value pairs within Java, 
converts Writables to base Java types, and pickles the 
-resulting Java objects using [Pyrolite](https://github.com/irmen/Pyrolite/). 
When saving an RDD of key-value pairs to SequenceFile, 
-PySpark does the reverse. It unpickles Python objects into Java objects and 
then converts them to Writables. The following 
+PySpark SequenceFile support loads an RDD of key-value pairs within Java, 
converts Writables to base Java types, and pickles the
+resulting Java objects using [Pyrolite](https://github.com/irmen/Pyrolite/). 
When saving an RDD of key-value pairs to SequenceFile,
+PySpark does the reverse. It unpickles Python objects into Java objects and 
then converts them to Writables. The following
 Writables are automatically converted:
 
 <table class="table">
@@ -435,9 +435,9 @@ Writables are automatically converted:
 <tr><td>MapWritable</td><td>dict</td></tr>
 </table>
 
-Arrays are not handled out-of-the-box. Users need to specify custom 
`ArrayWritable` subtypes when reading or writing. When writing, 
-users also need to specify custom converters that convert arrays to custom 
`ArrayWritable` subtypes. When reading, the default 
-converter will convert custom `ArrayWritable` subtypes to Java `Object[]`, 
which then get pickled to Python tuples. To get 
+Arrays are not handled out-of-the-box. Users need to specify custom 
`ArrayWritable` subtypes when reading or writing. When writing,
+users also need to specify custom converters that convert arrays to custom 
`ArrayWritable` subtypes. When reading, the default
+converter will convert custom `ArrayWritable` subtypes to Java `Object[]`, 
which then get pickled to Python tuples. To get
 Python `array.array` for arrays of primitive types, users need to specify 
custom converters.
 
 **Saving and Loading SequenceFiles**
@@ -454,7 +454,7 @@ classes can be specified, but for standard Writables this 
is not required.
 
 **Saving and Loading Other Hadoop Input/Output Formats**
 
-PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, 
for both 'new' and 'old' Hadoop MapReduce APIs. 
+PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, 
for both 'new' and 'old' Hadoop MapReduce APIs.
 If required, a Hadoop configuration can be passed in as a Python dict. Here is 
an example using the
 Elasticsearch ESInputFormat:
 
@@ -474,15 +474,15 @@ Note that, if the InputFormat simply depends on a Hadoop 
configuration and/or in
 the key and value classes can easily be converted according to the above table,
 then this approach should work well for such cases.
 
-If you have custom serialized binary data (such as loading data from Cassandra 
/ HBase), then you will first need to 
+If you have custom serialized binary data (such as loading data from Cassandra 
/ HBase), then you will first need to
 transform that data on the Scala/Java side to something which can be handled 
by Pyrolite's pickler.
-A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) 
trait is provided 
-for this. Simply extend this trait and implement your transformation code in 
the ```convert``` 
-method. Remember to ensure that this class, along with any dependencies 
required to access your ```InputFormat```, are packaged into your Spark job jar 
and included on the PySpark 
+A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) 
trait is provided
+for this. Simply extend this trait and implement your transformation code in 
the ```convert```
+method. Remember to ensure that this class, along with any dependencies 
required to access your ```InputFormat```, are packaged into your Spark job jar 
and included on the PySpark
 classpath.
 
-See the [Python 
examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python) and 
-the [Converter 
examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters)
 
+See the [Python 
examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python) and
+the [Converter 
examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters)
 for examples of using Cassandra / HBase ```InputFormat``` and 
```OutputFormat``` with custom converters.
 
 </div>
@@ -758,7 +758,7 @@ One of the harder things about Spark is understanding the 
scope and life cycle o
 
 #### Example
 
-Consider the naive RDD element sum below, which behaves completely differently 
depending on whether execution is happening within the same JVM. A common 
example of this is when running Spark in `local` mode (`--master = local[n]`) 
versus deploying a Spark application to a cluster (e.g. via spark-submit to 
YARN): 
+Consider the naive RDD element sum below, which behaves completely differently 
depending on whether execution is happening within the same JVM. A common 
example of this is when running Spark in `local` mode (`--master = local[n]`) 
versus deploying a Spark application to a cluster (e.g. via spark-submit to 
YARN):
 
 <div class="codetabs">
 
@@ -777,7 +777,7 @@ println("Counter value: " + counter)
 <div data-lang="java"  markdown="1">
 {% highlight java %}
 int counter = 0;
-JavaRDD<Integer> rdd = sc.parallelize(data); 
+JavaRDD<Integer> rdd = sc.parallelize(data);
 
 // Wrong: Don't do this!!
 rdd.foreach(x -> counter += x);
@@ -803,7 +803,7 @@ print("Counter value: " + counter)
 
 #### Local vs. cluster modes
 
-The primary challenge is that the behavior of the above code is undefined. In 
local mode with a single JVM, the above code will sum the values within the RDD 
and store it in **counter**. This is because both the RDD and the variable 
**counter** are in the same memory space on the driver node. 
+The primary challenge is that the behavior of the above code is undefined. In 
local mode with a single JVM, the above code will sum the values within the RDD 
and store it in **counter**. This is because both the RDD and the variable 
**counter** are in the same memory space on the driver node.
 
 However, in `cluster` mode, what happens is more complicated, and the above 
may not work as intended. To execute jobs, Spark breaks up the processing of 
RDD operations into tasks - each of which is operated on by an executor. Prior 
to execution, Spark computes the **closure**. The closure is those variables 
and methods which must be visible for the executor to perform its computations 
on the RDD (in this case `foreach()`). This closure is serialized and sent to 
each executor. In `local` mode, there is only one executor, so everything 
shares the same closure. In other modes, however, this is not the case and the 
executors running on separate worker nodes each have their own copy of the 
closure.
 
@@ -813,9 +813,9 @@ To ensure well-defined behavior in these sorts of scenarios 
one should use an [`
 
 In general, closures - constructs like loops or locally defined methods, 
should not be used to mutate some global state. Spark does not define or 
guarantee the behavior of mutations to objects referenced from outside of 
closures. Some code that does this may work in local mode, but that's just by 
accident and such code will not behave as expected in distributed mode. Use an 
Accumulator instead if some global aggregation is needed.
 
-#### Printing elements of an RDD 
+#### Printing elements of an RDD
 Another common idiom is attempting to print out the elements of an RDD using 
`rdd.foreach(println)` or `rdd.map(println)`. On a single machine, this will 
generate the expected output and print all the RDD's elements. However, in 
`cluster` mode, the output to `stdout` being called by the executors is now 
writing to the executor's `stdout` instead, not the one on the driver, so 
`stdout` on the driver won't show these! To print all elements on the driver, 
one can use the `collect()` method to first bring the RDD to the driver node 
thus: `rdd.collect().foreach(println)`. This can cause the driver to run out of 
memory, though, because `collect()` fetches the entire RDD to a single machine; 
if you only need to print a few elements of the RDD, a safer approach is to use 
the `take()`: `rdd.take(100).foreach(println)`.
- 
+
 ### Working with Key-Value Pairs
 
 <div class="codetabs">
@@ -859,7 +859,7 @@ only available on RDDs of key-value pairs.
 The most common ones are distributed "shuffle" operations, such as grouping or 
aggregating the elements
 by a key.
 
-In Java, key-value pairs are represented using the 
+In Java, key-value pairs are represented using the
 
[scala.Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2)
 class
 from the Scala standard library. You can simply call `new Tuple2(a, b)` to 
create a tuple, and access
 its fields later with `tuple._1()` and `tuple._2()`.
@@ -974,7 +974,7 @@ for details.
   <td> <b>groupByKey</b>([<i>numTasks</i>]) <a name="GroupByLink"></a> </td>
   <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, 
Iterable&lt;V&gt;) pairs. <br />
     <b>Note:</b> If you are grouping in order to perform an aggregation (such 
as a sum or
-      average) over each key, using <code>reduceByKey</code> or 
<code>aggregateByKey</code> will yield much better 
+      average) over each key, using <code>reduceByKey</code> or 
<code>aggregateByKey</code> will yield much better
       performance.
     <br />
     <b>Note:</b> By default, the level of parallelism in the output depends on 
the number of partitions of the parent RDD.
@@ -1025,7 +1025,7 @@ for details.
 <tr>
   <td> <b>repartitionAndSortWithinPartitions</b>(<i>partitioner</i>) <a 
name="Repartition2Link"></a></td>
   <td> Repartition the RDD according to the given partitioner and, within each 
resulting partition,
-  sort records by their keys. This is more efficient than calling 
<code>repartition</code> and then sorting within 
+  sort records by their keys. This is more efficient than calling 
<code>repartition</code> and then sorting within
   each partition because it can push the sorting down into the shuffle 
machinery. </td>
 </tr>
 </table>
@@ -1038,7 +1038,7 @@ RDD API doc
  [Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
  [Python](api/python/pyspark.html#pyspark.RDD),
  [R](api/R/index.html))
- 
+
 and pair RDD functions doc
 ([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions),
  [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
@@ -1094,7 +1094,7 @@ for details.
 </tr>
 <tr>
   <td> <b>foreach</b>(<i>func</i>) </td>
-  <td> Run a function <i>func</i> on each element of the dataset. This is 
usually done for side effects such as updating an <a 
href="#AccumLink">Accumulator</a> or interacting with external storage systems. 
+  <td> Run a function <i>func</i> on each element of the dataset. This is 
usually done for side effects such as updating an <a 
href="#AccumLink">Accumulator</a> or interacting with external storage systems.
   <br /><b>Note</b>: modifying variables other than Accumulators outside of 
the <code>foreach()</code> may result in undefined behavior. See <a 
href="#ClosuresLink">Understanding closures </a> for more details.</td>
 </tr>
 </table>
@@ -1118,13 +1118,13 @@ co-located to compute the result.
 In Spark, data is generally not distributed across partitions to be in the 
necessary place for a
 specific operation. During computations, a single task will operate on a 
single partition - thus, to
 organize all the data for a single `reduceByKey` reduce task to execute, Spark 
needs to perform an
-all-to-all operation. It must read from all partitions to find all the values 
for all keys, 
-and then bring together values across partitions to compute the final result 
for each key - 
+all-to-all operation. It must read from all partitions to find all the values 
for all keys,
+and then bring together values across partitions to compute the final result 
for each key -
 this is called the **shuffle**.
 
 Although the set of elements in each partition of newly shuffled data will be 
deterministic, and so
-is the ordering of partitions themselves, the ordering of these elements is 
not. If one desires predictably 
-ordered data following shuffle then it's possible to use: 
+is the ordering of partitions themselves, the ordering of these elements is 
not. If one desires predictably
+ordered data following shuffle then it's possible to use:
 
 * `mapPartitions` to sort each partition using, for example, `.sorted`
 * `repartitionAndSortWithinPartitions` to efficiently sort partitions while 
simultaneously repartitioning
@@ -1141,26 +1141,26 @@ network I/O. To organize data for the shuffle, Spark 
generates sets of tasks - *
 organize the data, and a set of *reduce* tasks to aggregate it. This 
nomenclature comes from
 MapReduce and does not directly relate to Spark's `map` and `reduce` 
operations.
 
-Internally, results from individual map tasks are kept in memory until they 
can't fit. Then, these 
-are sorted based on the target partition and written to a single file. On the 
reduce side, tasks 
+Internally, results from individual map tasks are kept in memory until they 
can't fit. Then, these
+are sorted based on the target partition and written to a single file. On the 
reduce side, tasks
 read the relevant sorted blocks.
-        
-Certain shuffle operations can consume significant amounts of heap memory 
since they employ 
-in-memory data structures to organize records before or after transferring 
them. Specifically, 
-`reduceByKey` and `aggregateByKey` create these structures on the map side, 
and `'ByKey` operations 
-generate these on the reduce side. When data does not fit in memory Spark will 
spill these tables 
+
+Certain shuffle operations can consume significant amounts of heap memory 
since they employ
+in-memory data structures to organize records before or after transferring 
them. Specifically,
+`reduceByKey` and `aggregateByKey` create these structures on the map side, 
and `'ByKey` operations
+generate these on the reduce side. When data does not fit in memory Spark will 
spill these tables
 to disk, incurring the additional overhead of disk I/O and increased garbage 
collection.
 
 Shuffle also generates a large number of intermediate files on disk. As of 
Spark 1.3, these files
-are preserved until the corresponding RDDs are no longer used and are garbage 
collected. 
-This is done so the shuffle files don't need to be re-created if the lineage 
is re-computed. 
-Garbage collection may happen only after a long period time, if the 
application retains references 
-to these RDDs or if GC does not kick in frequently. This means that 
long-running Spark jobs may 
+are preserved until the corresponding RDDs are no longer used and are garbage 
collected.
+This is done so the shuffle files don't need to be re-created if the lineage 
is re-computed.
+Garbage collection may happen only after a long period of time, if the 
application retains references
+to these RDDs or if GC does not kick in frequently. This means that 
long-running Spark jobs may
 consume a large amount of disk space. The temporary storage directory is 
specified by the
 `spark.local.dir` configuration parameter when configuring the Spark context.
 
 Shuffle behavior can be tuned by adjusting a variety of configuration 
parameters. See the
-'Shuffle Behavior' section within the [Spark Configuration 
Guide](configuration.html). 
+'Shuffle Behavior' section within the [Spark Configuration 
Guide](configuration.html).
 
 ## RDD Persistence
 
@@ -1246,7 +1246,7 @@ efficiency. We recommend going through the following 
process to select one:
   This is the most CPU-efficient option, allowing operations on the RDDs to 
run as fast as possible.
 
 * If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization 
library](tuning.html) to
-make the objects much more space-efficient, but still reasonably fast to 
access. 
+make the objects much more space-efficient, but still reasonably fast to 
access.
 
 * Don't spill to disk unless the functions that computed your datasets are 
expensive, or they filter
 a large amount of the data. Otherwise, recomputing a partition may be as fast 
as reading it from
@@ -1345,7 +1345,7 @@ Accumulators are variables that are only "added" to 
through an associative opera
 therefore be efficiently supported in parallel. They can be used to implement 
counters (as in
 MapReduce) or sums. Spark natively supports accumulators of numeric types, and 
programmers
 can add support for new types. If accumulators are created with a name, they 
will be
-displayed in Spark's UI. This can be useful for understanding the progress of 
+displayed in Spark's UI. This can be useful for understanding the progress of
 running stages (NOTE: this is not yet supported in Python).
 
 An accumulator is created from an initial value `v` by calling 
`SparkContext.accumulator(v)`. Tasks
@@ -1474,8 +1474,8 @@ vecAccum = sc.accumulator(Vector(...), 
VectorAccumulatorParam())
 
 </div>
 
-For accumulator updates performed inside <b>actions only</b>, Spark guarantees 
that each task's update to the accumulator 
-will only be applied once, i.e. restarted tasks will not update the value. In 
transformations, users should be aware 
+For accumulator updates performed inside <b>actions only</b>, Spark guarantees 
that each task's update to the accumulator
+will only be applied once, i.e. restarted tasks will not update the value. In 
transformations, users should be aware
 that each task's update may be applied more than once if tasks or job 
stages are re-executed.
 
 Accumulators do not change the lazy evaluation model of Spark. If they are 
being updated within an operation on an RDD, their value is only updated once 
that RDD is computed as part of an action. Consequently, accumulator updates 
are not guaranteed to be executed when made within a lazy transformation like 
`map()`. The below code fragment demonstrates this property:
@@ -1486,7 +1486,7 @@ Accumulators do not change the lazy evaluation model of 
Spark. If they are being
 {% highlight scala %}
 val accum = sc.accumulator(0)
 data.map { x => accum += x; f(x) }
-// Here, accum is still 0 because no actions have caused the `map` to be 
computed.
+// Here, accum is still 0 because no actions have caused the map to be computed.
 {% endhighlight %}
 </div>
 
@@ -1553,7 +1553,7 @@ Several changes were made to the Java API:
   code that `extends Function` should `implement Function` instead.
 * New variants of the `map` transformations, like `mapToPair` and 
`mapToDouble`, were added to create RDDs
   of special data types.
-* Grouping operations like `groupByKey`, `cogroup` and `join` have changed 
from returning 
+* Grouping operations like `groupByKey`, `cogroup` and `join` have changed 
from returning
   `(Key, List<Value>)` pairs to `(Key, Iterable<Value>)`.
 
 </div>
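
The closure discussion above recommends an accumulator instead of mutating a
counter captured in a closure. A minimal sketch of that recommendation (not part
of this patch; the app name and data are illustrative), using the Spark 1.x
accumulator API shown in the guide:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("accumulator-example").setMaster("local[2]"))

    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))

    // An accumulator is updated from tasks and read back on the driver.
    val accum = sc.accumulator(0)
    rdd.foreach(x => accum += x)          // foreach is an action, so updates are applied
    println("Counter value: " + accum.value)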

http://git-wip-us.apache.org/repos/asf/spark/blob/ca9fe540/docs/running-on-mesos.md
----------------------------------------------------------------------
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index 330c159..460a66f 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -245,7 +245,7 @@ See the [configuration page](configuration.html) for 
information on Spark config
   <td><code>spark.mesos.coarse</code></td>
   <td>false</td>
   <td>
-    If set to "true", runs over Mesos clusters in
+    If set to <code>true</code>, runs over Mesos clusters in
     <a href="running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing 
mode</a>,
     where Spark acquires one long-lived Mesos task on each machine instead of 
one Mesos task per
     Spark task. This gives lower-latency scheduling for short queries, but 
leaves resources in use
@@ -254,16 +254,16 @@ See the [configuration page](configuration.html) for 
information on Spark config
 </tr>
 <tr>
   <td><code>spark.mesos.extra.cores</code></td>
-  <td>0</td>
+  <td><code>0</code></td>
   <td>
     Set the extra amount of cpus to request per task. This setting is only 
used for Mesos coarse grain mode.
     The total amount of cores requested per task is the number of cores in the 
offer plus the extra cores configured.
-    Note that total amount of cores the executor will request in total will 
not exceed the spark.cores.max setting.
+    Note that the total number of cores the executor will request will 
not exceed the <code>spark.cores.max</code> setting.
   </td>
 </tr>
 <tr>
   <td><code>spark.mesos.mesosExecutor.cores</code></td>
-  <td>1.0</td>
+  <td><code>1.0</code></td>
   <td>
     (Fine-grained mode only) Number of cores to give each Mesos executor. This 
does not
     include the cores used to run the Spark tasks. In other words, even if no 
Spark task
@@ -287,7 +287,7 @@ See the [configuration page](configuration.html) for 
information on Spark config
   <td>
     Set the list of volumes which will be mounted into the Docker image, which 
was set using
     <code>spark.mesos.executor.docker.image</code>. The format of this 
property is a comma-separated list of
-    mappings following the form passed to <tt>docker run -v</tt>. That is they 
take the form:
+    mappings following the form passed to <code>docker run -v</code>. That is 
they take the form:
 
     <pre>[host_path:]container_path[:ro|:rw]</pre>
   </td>
@@ -318,7 +318,7 @@ See the [configuration page](configuration.html) for 
information on Spark config
   <td>executor memory * 0.10, with minimum of 384</td>
   <td>
     The amount of additional memory, specified in MB, to be allocated per 
executor. By default,
-    the overhead will be larger of either 384 or 10% of 
`spark.executor.memory`. If it's set,
+    the overhead will be the larger of either 384 or 10% of 
<code>spark.executor.memory</code>. If set,
     the final overhead will be this value.
   </td>
 </tr>
@@ -339,7 +339,7 @@ See the [configuration page](configuration.html) for 
information on Spark config
 </tr>
 <tr>
   <td><code>spark.mesos.secret</code></td>
-  <td>(none)/td>
+  <td>(none)</td>
   <td>
     Set the secret with which Spark framework will use to authenticate with 
Mesos.
   </td>
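
For the coarse-grained mode mentioned in the table above, a minimal sketch of
the corresponding configuration (not part of this patch; the Mesos master URL
is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mesos-coarse-example")
      .setMaster("mesos://mesos-master.example.com:5050")  // placeholder master URL
      .set("spark.mesos.coarse", "true")   // one long-lived Mesos task per machine
      .set("spark.cores.max", "8")         // cap on total cores acquired
    val sc = new SparkContext(conf)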

http://git-wip-us.apache.org/repos/asf/spark/blob/ca9fe540/docs/running-on-yarn.md
----------------------------------------------------------------------
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 3a961d2..0e25ccf 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -23,7 +23,7 @@ Unlike [Spark standalone](spark-standalone.html) and 
[Mesos](running-on-mesos.ht
 To launch a Spark application in `yarn-cluster` mode:
 
     $ ./bin/spark-submit --class path.to.your.Class --master yarn-cluster 
[options] <app jar> [app options]
-    
+
 For example:
 
     $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
@@ -43,7 +43,7 @@ To launch a Spark application in `yarn-client` mode, do the 
same, but replace `y
 
 ## Adding Other JARs
 
-In `yarn-cluster` mode, the driver runs on a different machine than the 
client, so `SparkContext.addJar` won't work out of the box with files that are 
local to the client. To make files on the client available to 
`SparkContext.addJar`, include them with the `--jars` option in the launch 
command. 
+In `yarn-cluster` mode, the driver runs on a different machine than the 
client, so `SparkContext.addJar` won't work out of the box with files that are 
local to the client. To make files on the client available to 
`SparkContext.addJar`, include them with the `--jars` option in the launch 
command.
 
     $ ./bin/spark-submit --class my.main.Class \
         --master yarn-cluster \
@@ -64,16 +64,16 @@ Most of the configs are the same for Spark on YARN as for 
other deployment modes
 
 # Debugging your Application
 
-In YARN terminology, executors and application masters run inside 
"containers". YARN has two modes for handling container logs after an 
application has completed. If log aggregation is turned on (with the 
`yarn.log-aggregation-enable` config), container logs are copied to HDFS and 
deleted on the local machine. These logs can be viewed from anywhere on the 
cluster with the "yarn logs" command.
+In YARN terminology, executors and application masters run inside 
"containers". YARN has two modes for handling container logs after an 
application has completed. If log aggregation is turned on (with the 
`yarn.log-aggregation-enable` config), container logs are copied to HDFS and 
deleted on the local machine. These logs can be viewed from anywhere on the 
cluster with the `yarn logs` command.
 
     yarn logs -applicationId <app ID>
-    
+
 will print out the contents of all log files from all containers from the 
given application. You can also view the container log files directly in HDFS 
using the HDFS shell or API. The directory where they are located can be found 
by looking at your YARN configs (`yarn.nodemanager.remote-app-log-dir` and 
`yarn.nodemanager.remote-app-log-dir-suffix`). The logs are also available on 
the Spark Web UI under the Executors Tab. You need to have both the Spark 
history server and the MapReduce history server running and configure 
`yarn.log.server.url` in `yarn-site.xml` properly. The log URL on the Spark 
history server UI will redirect you to the MapReduce history server to show the 
aggregated logs.
 
 When log aggregation isn't turned on, logs are retained locally on each 
machine under `YARN_APP_LOGS_DIR`, which is usually configured to `/tmp/logs` 
or `$HADOOP_HOME/logs/userlogs` depending on the Hadoop version and 
installation. Viewing logs for a container requires going to the host that 
contains them and looking in this directory.  Subdirectories organize log files 
by application ID and container ID. The logs are also available on the Spark 
Web UI under the Executors Tab and do not require running the MapReduce 
history server.
 
 To review per-container launch environment, increase 
`yarn.nodemanager.delete.debug-delay-sec` to a
-large value (e.g. 36000), and then access the application cache through 
`yarn.nodemanager.local-dirs`
+large value (e.g. `36000`), and then access the application cache through 
`yarn.nodemanager.local-dirs`
 on the nodes on which containers are launched. This directory contains the 
launch script, JARs, and
 all environment variables used for launching each container. This process is 
useful for debugging
 classpath problems in particular. (Note that enabling this requires admin 
privileges on cluster
@@ -92,7 +92,7 @@ Note that for the first option, both executors and the 
application master will s
 log4j configuration, which may cause issues when they run on the same node 
(e.g. trying to write
 to the same log file).
 
-If you need a reference to the proper location to put log files in the YARN so 
that YARN can properly display and aggregate them, use 
`spark.yarn.app.container.log.dir` in your log4j.properties. For example, 
`log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`.
 For streaming application, configuring `RollingFileAppender` and setting file 
location to YARN's log directory will avoid disk overflow caused by large log 
file, and logs can be accessed using YARN's log utility.
+If you need a reference to the proper location to put log files in YARN so 
that YARN can properly display and aggregate them, use 
`spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, 
`log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`.
 For streaming applications, configuring `RollingFileAppender` and setting file 
location to YARN's log directory will avoid disk overflow caused by large log 
files, and logs can be accessed using YARN's log utility.
 
 #### Spark Properties
 
@@ -100,24 +100,26 @@ If you need a reference to the proper location to put log 
files in the YARN so t
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
   <td><code>spark.yarn.am.memory</code></td>
-  <td>512m</td>
+  <td><code>512m</code></td>
   <td>
     Amount of memory to use for the YARN Application Master in client mode, in 
the same format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>).
     In cluster mode, use <code>spark.driver.memory</code> instead.
+    <p/>
+    Use lower-case suffixes, e.g. <code>k</code>, <code>m</code>, 
<code>g</code>, <code>t</code>, and <code>p</code>, for kibi-, mebi-, gibi-, 
tebi-, and pebibytes, respectively.
   </td>
 </tr>
 <tr>
   <td><code>spark.driver.cores</code></td>
-  <td>1</td>
+  <td><code>1</code></td>
   <td>
     Number of cores used by the driver in YARN cluster mode.
-    Since the driver is run in the same JVM as the YARN Application Master in 
cluster mode, this also controls the cores used by the YARN AM.
-    In client mode, use <code>spark.yarn.am.cores</code> to control the number 
of cores used by the YARN AM instead.
+    Since the driver is run in the same JVM as the YARN Application Master in 
cluster mode, this also controls the cores used by the YARN Application Master.
+    In client mode, use <code>spark.yarn.am.cores</code> to control the number 
of cores used by the YARN Application Master instead.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.am.cores</code></td>
-  <td>1</td>
+  <td><code>1</code></td>
   <td>
     Number of cores to use for the YARN Application Master in client mode.
     In cluster mode, use <code>spark.driver.cores</code> instead.
@@ -125,39 +127,39 @@ If you need a reference to the proper location to put log 
files in the YARN so t
 </tr>
 <tr>
   <td><code>spark.yarn.am.waitTime</code></td>
-  <td>100s</td>
+  <td><code>100s</code></td>
   <td>
-    In `yarn-cluster` mode, time for the application master to wait for the
-    SparkContext to be initialized. In `yarn-client` mode, time for the 
application master to wait
+    In <code>yarn-cluster</code> mode, time for the YARN Application Master to 
wait for the
+    SparkContext to be initialized. In <code>yarn-client</code> mode, time for 
the YARN Application Master to wait
     for the driver to connect to it.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.submit.file.replication</code></td>
-  <td>The default HDFS replication (usually 3)</td>
+  <td>The default HDFS replication (usually <code>3</code>)</td>
   <td>
     HDFS replication level for the files uploaded into HDFS for the 
application. These include things like the Spark jar, the app jar, and any 
distributed cache files/archives.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.preserve.staging.files</code></td>
-  <td>false</td>
+  <td><code>false</code></td>
   <td>
-    Set to true to preserve the staged files (Spark jar, app jar, distributed 
cache files) at the end of the job rather than delete them.
+    Set to <code>true</code> to preserve the staged files (Spark jar, app jar, 
distributed cache files) at the end of the job rather than delete them.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.scheduler.heartbeat.interval-ms</code></td>
-  <td>3000</td>
+  <td><code>3000</code></td>
   <td>
     The interval in ms in which the Spark application master heartbeats into 
the YARN ResourceManager.
-    The value is capped at half the value of YARN's configuration for the 
expiry interval
-    (<code>yarn.am.liveness-monitor.expiry-interval-ms</code>).
+    The value is capped at half the value of YARN's configuration for the 
expiry interval, i.e.
+    <code>yarn.am.liveness-monitor.expiry-interval-ms</code>.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.scheduler.initial-allocation.interval</code></td>
-  <td>200ms</td>
+  <td><code>200ms</code></td>
   <td>
     The initial interval in which the Spark application master eagerly 
heartbeats to the YARN ResourceManager
     when there are pending container allocation requests. It should be no 
larger than
@@ -177,8 +179,8 @@ If you need a reference to the proper location to put log 
files in the YARN so t
   <td><code>spark.yarn.historyServer.address</code></td>
   <td>(none)</td>
   <td>
-    The address of the Spark history server (i.e. host.com:18080). The address 
should not contain a scheme (http://). Defaults to not being set since the 
history server is an optional service. This address is given to the YARN 
ResourceManager when the Spark application finishes to link the application 
from the ResourceManager UI to the Spark history server UI. 
-    For this property, YARN properties can be used as variables, and these are 
substituted by Spark at runtime. For eg, if the Spark history server runs on 
the same node as the YARN ResourceManager, it can be set to 
`${hadoopconf-yarn.resourcemanager.hostname}:18080`. 
+    The address of the Spark history server, e.g. <code>host.com:18080</code>. 
The address should not contain a scheme (<code>http://</code>). Defaults to not 
being set since the history server is an optional service. This address is 
given to the YARN ResourceManager when the Spark application finishes to link 
the application from the ResourceManager UI to the Spark history server UI.
+    For this property, YARN properties can be used as variables, and these are 
substituted by Spark at runtime. For example, if the Spark history server runs 
on the same node as the YARN ResourceManager, it can be set to 
<code>${hadoopconf-yarn.resourcemanager.hostname}:18080</code>.
   </td>
 </tr>
 <tr>
@@ -197,42 +199,42 @@ If you need a reference to the proper location to put log 
files in the YARN so t
 </tr>
 <tr>
  <td><code>spark.executor.instances</code></td>
-  <td>2</td>
+  <td><code>2</code></td>
   <td>
-    The number of executors. Note that this property is incompatible with 
<code>spark.dynamicAllocation.enabled</code>. If both 
<code>spark.dynamicAllocation.enabled</code> and 
<code>spark.executor.instances</code> are specified, dynamic allocation is 
turned off and the specified number of <code>spark.executor.instances</code> is 
used. 
+    The number of executors. Note that this property is incompatible with 
<code>spark.dynamicAllocation.enabled</code>. If both 
<code>spark.dynamicAllocation.enabled</code> and 
<code>spark.executor.instances</code> are specified, dynamic allocation is 
turned off and the specified number of <code>spark.executor.instances</code> is 
used.
   </td>
 </tr>
 <tr>
  <td><code>spark.yarn.executor.memoryOverhead</code></td>
   <td>executorMemory * 0.10, with minimum of 384 </td>
   <td>
-    The amount of off heap memory (in megabytes) to be allocated per executor. 
This is memory that accounts for things like VM overheads, interned strings, 
other native overheads, etc. This tends to grow with the executor size 
(typically 6-10%).
+    The amount of off-heap memory (in megabytes) to be allocated per executor. 
This is memory that accounts for things like VM overheads, interned strings, 
other native overheads, etc. This tends to grow with the executor size 
(typically 6-10%).
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.driver.memoryOverhead</code></td>
   <td>driverMemory * 0.10, with minimum of 384 </td>
   <td>
-    The amount of off heap memory (in megabytes) to be allocated per driver in 
cluster mode. This is memory that accounts for things like VM overheads, 
interned strings, other native overheads, etc. This tends to grow with the 
container size (typically 6-10%).
+    The amount of off-heap memory (in megabytes) to be allocated per driver in 
cluster mode. This is memory that accounts for things like VM overheads, 
interned strings, other native overheads, etc. This tends to grow with the 
container size (typically 6-10%).
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.am.memoryOverhead</code></td>
   <td>AM memory * 0.10, with minimum of 384 </td>
   <td>
-    Same as <code>spark.yarn.driver.memoryOverhead</code>, but for the 
Application Master in client mode.
+    Same as <code>spark.yarn.driver.memoryOverhead</code>, but for the YARN 
Application Master in client mode.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.am.port</code></td>
   <td>(random)</td>
   <td>
-    Port for the YARN Application Master to listen on. In YARN client mode, 
this is used to communicate between the Spark driver running on a gateway and 
the Application Master running on YARN. In YARN cluster mode, this is used for 
the dynamic executor feature, where it handles the kill from the scheduler 
backend.
+    Port for the YARN Application Master to listen on. In YARN client mode, 
this is used to communicate between the Spark driver running on a gateway and 
the YARN Application Master running on YARN. In YARN cluster mode, this is used 
for the dynamic executor feature, where it handles the kill from the scheduler 
backend.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.queue</code></td>
-  <td>default</td>
+  <td><code>default</code></td>
   <td>
     The name of the YARN queue to which the application is submitted.
   </td>
@@ -245,18 +247,18 @@ If you need a reference to the proper location to put log 
files in the YARN so t
     By default, Spark on YARN will use a Spark jar installed locally, but the 
Spark jar can also be
     in a world-readable location on HDFS. This allows YARN to cache it on 
nodes so that it doesn't
     need to be distributed each time an application runs. To point to a jar on 
HDFS, for example,
-    set this configuration to "hdfs:///some/path".
+    set this configuration to <code>hdfs:///some/path</code>.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.access.namenodes</code></td>
   <td>(none)</td>
   <td>
-    A list of secure HDFS namenodes your Spark application is going to access. 
For 
-    example, 
`spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032`. 
-    The Spark application must have acess to the namenodes listed and Kerberos 
must 
-    be properly configured to be able to access them (either in the same realm 
or in 
-    a trusted realm). Spark acquires security tokens for each of the namenodes 
so that 
+    A comma-separated list of secure HDFS namenodes your Spark application is 
going to access. For
+    example, 
<code>spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032</code>.
+    The Spark application must have access to the namenodes listed and 
Kerberos must
+    be properly configured to be able to access them (either in the same realm 
or in
+    a trusted realm). Spark acquires security tokens for each of the namenodes 
so that
     the Spark application can access those remote HDFS clusters.
   </td>
 </tr>
@@ -264,18 +266,18 @@ If you need a reference to the proper location to put log 
files in the YARN so t
   <td><code>spark.yarn.appMasterEnv.[EnvironmentVariableName]</code></td>
   <td>(none)</td>
   <td>
-     Add the environment variable specified by 
<code>EnvironmentVariableName</code> to the 
-     Application Master process launched on YARN. The user can specify 
multiple of 
-     these and to set multiple environment variables. In `yarn-cluster` mode 
this controls 
-     the environment of the SPARK driver and in `yarn-client` mode it only 
controls 
-     the environment of the executor launcher. 
+     Add the environment variable specified by 
<code>EnvironmentVariableName</code> to the
+     Application Master process launched on YARN. The user can specify 
multiple of
+     these to set multiple environment variables. In 
<code>yarn-cluster</code> mode this controls
+     the environment of the Spark driver and in <code>yarn-client</code> mode 
it only controls
+     the environment of the executor launcher.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.containerLauncherMaxThreads</code></td>
-  <td>25</td>
+  <td><code>25</code></td>
   <td>
-    The maximum number of threads to use in the application master for 
launching executor containers.
+    The maximum number of threads to use in the YARN Application Master for 
launching executor containers.
   </td>
 </tr>
 <tr>
@@ -283,19 +285,19 @@ If you need a reference to the proper location to put log 
files in the YARN so t
   <td>(none)</td>
   <td>
   A string of extra JVM options to pass to the YARN Application Master in 
client mode.
-  In cluster mode, use `spark.driver.extraJavaOptions` instead.
+  In cluster mode, use <code>spark.driver.extraJavaOptions</code> instead.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.am.extraLibraryPath</code></td>
   <td>(none)</td>
   <td>
-    Set a special library path to use when launching the application master in 
client mode.
+    Set a special library path to use when launching the YARN Application 
Master in client mode.
   </td>
 </tr>
 <tr>
   <td><code>spark.yarn.maxAppAttempts</code></td>
-  <td>yarn.resourcemanager.am.max-attempts in YARN</td>
+  <td><code>yarn.resourcemanager.am.max-attempts</code> in YARN</td>
   <td>
   The maximum number of attempts that will be made to submit the application.
   It should be no larger than the global number of max attempts in the YARN 
configuration.
@@ -303,10 +305,10 @@ If you need a reference to the proper location to put log 
files in the YARN so t
 </tr>
 <tr>
   <td><code>spark.yarn.submit.waitAppCompletion</code></td>
-  <td>true</td>
+  <td><code>true</code></td>
   <td>
   In YARN cluster mode, controls whether the client waits to exit until the 
application completes.
-  If set to true, the client process will stay alive reporting the 
application's status.
+  If set to <code>true</code>, the client process will stay alive reporting 
the application's status.
   Otherwise, the client process will exit after submission.
   </td>
 </tr>
@@ -332,7 +334,7 @@ If you need a reference to the proper location to put log 
files in the YARN so t
   <td>(none)</td>
   <td>
   The full path to the file that contains the keytab for the principal 
specified above.
-  This keytab will be copied to the node running the Application Master via 
the Secure Distributed Cache,
+  This keytab will be copied to the node running the YARN Application Master 
via the Secure Distributed Cache,
   for renewing the login tickets and the delegation tokens periodically.
   </td>
 </tr>
@@ -371,14 +373,14 @@ If you need a reference to the proper location to put log 
files in the YARN so t
 </tr>
 <tr>
   <td><code>spark.yarn.security.tokens.${service}.enabled</code></td>
-  <td>true</td>
+  <td><code>true</code></td>
   <td>
   Controls whether to retrieve delegation tokens for non-HDFS services when 
security is enabled.
   By default, delegation tokens for all supported services are retrieved when 
those services are
   configured, but it's possible to disable that behavior if it somehow 
conflicts with the
   application being run.
   <p/>
-  Currently supported services are: hive, hbase
+  Currently supported services are: <code>hive</code>, <code>hbase</code>
   </td>
 </tr>
 </table>
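
Several of the properties above map directly onto `spark-submit` options. As a 
rough sketch only (the class name, queue, executor count, and jar path below 
are placeholders, not recommended values), a YARN submission combining a few 
of them might look like this:

{% highlight bash %}
# Hypothetical yarn-cluster submission; adjust every value for your cluster.
./bin/spark-submit \
  --class org.example.MyApp \
  --master yarn-cluster \
  --queue default \
  --num-executors 4 \
  --conf spark.yarn.executor.memoryOverhead=512 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  /path/to/my-app.jar
{% endhighlight %}

Here `--num-executors` and `--queue` are simply the command-line forms of 
`spark.executor.instances` and `spark.yarn.queue`.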
@@ -387,5 +389,5 @@ If you need a reference to the proper location to put log 
files in the YARN so t
 
 - Whether core requests are honored in scheduling decisions depends on which 
scheduler is in use and how it is configured.
 - In `yarn-cluster` mode, the local directories used by the Spark executors 
and the Spark driver will be the local directories configured for YARN (Hadoop 
YARN config `yarn.nodemanager.local-dirs`). If the user specifies 
`spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark 
executors will use the local directories configured for YARN while the Spark 
driver will use those defined in `spark.local.dir`. This is because the Spark 
driver does not run on the YARN cluster in `yarn-client` mode, only the Spark 
executors do.
-- The `--files` and `--archives` options support specifying file names with 
the # similar to Hadoop. For example you can specify: `--files 
localtest.txt#appSees.txt` and this will upload the file you have locally named 
localtest.txt into HDFS but this will be linked to by the name `appSees.txt`, 
and your application should use the name as `appSees.txt` to reference it when 
running on YARN.
+- The `--files` and `--archives` options support specifying file names with 
`#`, similar to Hadoop. For example, you can specify `--files 
localtest.txt#appSees.txt`; this uploads the file you have locally named 
`localtest.txt` into HDFS, but it is linked to by the name `appSees.txt`, so 
your application should use the name `appSees.txt` to reference it when 
running on YARN (see the example after this list).
 - The `--jars` option allows the `SparkContext.addJar` function to work if you 
are using it with local files and running in `yarn-cluster` mode. It does not 
need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
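
As a concrete sketch of the `--files` renaming note above (the class name and 
jar path are placeholders), a submission could ship a local file under a 
different link name:

{% highlight bash %}
# Hypothetical example: localtest.txt is shipped, but the application
# opens it by the link name appSees.txt.
./bin/spark-submit \
  --class org.example.MyApp \
  --master yarn-cluster \
  --files localtest.txt#appSees.txt \
  /path/to/my-app.jar
{% endhighlight %}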

http://git-wip-us.apache.org/repos/asf/spark/blob/ca9fe540/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 7ae9244..a1cbc7d 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1676,7 +1676,7 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, 
value"))
 ### Interacting with Different Versions of Hive Metastore
 
 One of the most important pieces of Spark SQL's Hive support is interaction 
with Hive metastore,
-which enables Spark SQL to access metadata of Hive tables. Starting from Spark 
1.4.0, a single binary 
+which enables Spark SQL to access metadata of Hive tables. Starting from Spark 
1.4.0, a single binary
 build of Spark SQL can be used to query different versions of Hive metastores, 
using the configuration described below.
 Note that independent of the version of Hive that is being used to talk to the 
metastore, internally Spark SQL
 will compile against Hive 1.2.1 and use those classes for internal execution 
(serdes, UDFs, UDAFs, etc).
@@ -1706,8 +1706,8 @@ The following options can be used to configure the 
version of Hive that is used
         either <code>1.2.1</code> or not defined.
         <li><code>maven</code></li>
         Use Hive jars of specified version downloaded from Maven repositories. 
 This configuration
-        is not generally recommended for production deployments. 
-        <li>A classpath in the standard format for the JVM.  This classpath 
must include all of Hive 
+        is not generally recommended for production deployments.
+        <li>A classpath in the standard format for the JVM.  This classpath 
must include all of Hive
         and its dependencies, including the correct version of Hadoop.  These 
jars only need to be
         present on the driver, but if you are running in yarn cluster mode 
then you must ensure
         they are packaged with your application.</li>
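
To make the jar-source choice concrete, purely as an illustrative sketch with 
a Hive version chosen arbitrarily, the metastore version and jar source might 
be set together at launch time:

{% highlight bash %}
# Hypothetical: talk to a Hive 0.13.1 metastore, pulling matching jars from Maven.
./bin/spark-shell \
  --conf spark.sql.hive.metastore.version=0.13.1 \
  --conf spark.sql.hive.metastore.jars=maven
{% endhighlight %}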
@@ -1806,7 +1806,7 @@ the Data Sources API.  The following options are 
supported:
 <div data-lang="scala"  markdown="1">
 
 {% highlight scala %}
-val jdbcDF = sqlContext.read.format("jdbc").options( 
+val jdbcDF = sqlContext.read.format("jdbc").options(
   Map("url" -> "jdbc:postgresql:dbserver",
   "dbtable" -> "schema.tablename")).load()
 {% endhighlight %}
@@ -2023,11 +2023,11 @@ options.
 
  - Optimized execution using manually managed memory (Tungsten) is now enabled 
by default, along with
    code generation for expression evaluation.  These features can both be 
disabled by setting
-   `spark.sql.tungsten.enabled` to `false.
- - Parquet schema merging is no longer enabled by default.  It can be 
re-enabled by setting 
+   `spark.sql.tungsten.enabled` to `false`.
+ - Parquet schema merging is no longer enabled by default.  It can be 
re-enabled by setting
    `spark.sql.parquet.mergeSchema` to `true`.
- - Resolution of strings to columns in python now supports using dots (`.`) to 
qualify the column or 
-   access nested values.  For example `df['table.column.nestedField']`.  
However, this means that if 
+ - Resolution of strings to columns in python now supports using dots (`.`) to 
qualify the column or
+   access nested values.  For example `df['table.column.nestedField']`.  
However, this means that if
    your column name contains any dots you must now escape them using backticks 
(e.g., ``table.`column.with.dots`.nested``).   
  - In-memory columnar storage partition pruning is on by default. It can be 
disabled by setting
    `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
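
As a brief, hedged illustration of flipping two of these defaults back to 
their 1.4 behavior (which settings to change is, of course, 
application-specific):

{% highlight bash %}
# Hypothetical: disable Tungsten and re-enable Parquet schema merging.
./bin/spark-shell \
  --conf spark.sql.tungsten.enabled=false \
  --conf spark.sql.parquet.mergeSchema=true
{% endhighlight %}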

http://git-wip-us.apache.org/repos/asf/spark/blob/ca9fe540/docs/submitting-applications.md
----------------------------------------------------------------------
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index 7ea4d6f..915be0f 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -103,7 +103,7 @@ run it with `--help`. Here are a few examples of common 
options:
 export HADOOP_CONF_DIR=XXX
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
-  --master yarn-cluster \  # can also be `yarn-client` for client mode
+  --master yarn-cluster \  # can also be yarn-client for client mode
   --executor-memory 20G \
   --num-executors 50 \
   /path/to/examples.jar \
@@ -174,9 +174,9 @@ This can use up a significant amount of space over time and 
will need to be clea
 is handled automatically, and with Spark standalone, automatic cleanup can be 
configured with the
 `spark.worker.cleanup.appDataTtl` property.
 
-Users may also include any other dependencies by supplying a comma-delimited 
list of maven coordinates 
-with `--packages`. All transitive dependencies will be handled when using this 
command. Additional 
-repositories (or resolvers in SBT) can be added in a comma-delimited fashion 
with the flag `--repositories`. 
+Users may also include any other dependencies by supplying a comma-delimited 
list of maven coordinates
+with `--packages`. All transitive dependencies will be handled when using this 
command. Additional
+repositories (or resolvers in SBT) can be added in a comma-delimited fashion 
with the flag `--repositories`.
 These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` 
to include Spark Packages.
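
For example, with made-up Maven coordinates and repository URL (purely a 
sketch), extra dependencies could be pulled in at launch:

{% highlight bash %}
# Hypothetical coordinates; any groupId:artifactId:version resolvable from
# Maven Central or the listed repositories will work.
./bin/spark-shell \
  --packages com.example:my-lib_2.10:1.0.0 \
  --repositories https://repo.example.com/releases
{% endhighlight %}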
 
 For Python, the equivalent `--py-files` option can be used to distribute 
`.egg`, `.zip` and `.py` libraries


