MAPREDUCE-6260. Convert site documentation to markdown (Masatake Iwasaki via aw)


Branch: refs/heads/trunk
Commit: 8b787e2fdbd0050c0345cf14b26af9d61049068f
Parents: 34b78d5
Author: Allen Wittenauer <>
Authored: Tue Feb 17 06:52:14 2015 -1000
Committer: Allen Wittenauer <>
Committed: Tue Feb 17 06:52:14 2015 -1000

 hadoop-mapreduce-project/CHANGES.txt            |    3 +
 .../src/site/apt/DistributedCacheDeploy.apt.vm  |  151 -
 .../src/site/apt/EncryptedShuffle.apt.vm        |  320 ---
 .../src/site/apt/MapReduceTutorial.apt.vm       | 1605 -----------
 ...pReduce_Compatibility_Hadoop1_Hadoop2.apt.vm |  114 -
 .../src/site/apt/MapredAppMasterRest.apt.vm     | 2709 ------------------
 .../src/site/apt/MapredCommands.apt.vm          |  233 --
 .../apt/PluggableShuffleAndPluggableSort.apt.vm |   98 -
 .../site/markdown/  |  119 +
 .../src/site/markdown/       |  255 ++
 .../src/site/markdown/      | 1156 ++++++++
 .../  |   69 +
 .../src/site/markdown/    | 2397 ++++++++++++++++
 .../src/site/markdown/         |  153 +
 .../         |   73 +
 .../src/site/apt/HistoryServerRest.apt.vm       | 2672 -----------------
 .../src/site/markdown/      | 2361 +++++++++++++++
 17 files changed, 6586 insertions(+), 7902 deletions(-)
diff --git a/hadoop-mapreduce-project/CHANGES.txt 
index 9ef7a32..aebc71e 100644
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -96,6 +96,9 @@ Trunk (Unreleased)
     MAPREDUCE-6250. deprecate sbin/ (aw)
+    MAPREDUCE-6260. Convert site documentation to markdown (Masatake Iwasaki
+    via aw)
     MAPREDUCE-6191. Improve clearing stale state of Java serialization
diff --git 
deleted file mode 100644
index 2195e10..0000000
+++ /dev/null
@@ -1,151 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-  ---
-  Hadoop Map Reduce Next Generation-${project.version} - Distributed Cache 
-  ---
-  ---
-  ${}
-Hadoop MapReduce Next Generation - Distributed Cache Deploy
-* Introduction
-  The MapReduce application framework has rudimentary support for deploying a
-  new version of the MapReduce framework via the distributed cache. By setting
-  the appropriate configuration properties, users can run a different version
-  of MapReduce than the one initially deployed to the cluster. For example,
-  cluster administrators can place multiple versions of MapReduce in HDFS and
-  configure <<<mapred-site.xml>>> to specify which version jobs will use by
-  default. This allows the administrators to perform a rolling upgrade of the
-  MapReduce framework under certain conditions.
-* Preconditions and Limitations
-  The support for deploying the MapReduce framework via the distributed cache
-  currently does not address the job client code used to submit and query
-  jobs. It also does not address the <<<ShuffleHandler>>> code that runs as an
-  auxilliary service within each NodeManager. As a result the following
-  limitations apply to MapReduce versions that can be successfully deployed via
-  the distributed cache in a rolling upgrade fashion:
-  * The MapReduce version must be compatible with the job client code used to
-    submit and query jobs. If it is incompatible then the job client must be
-    upgraded separately on any node from which jobs using the new MapReduce
-    version will be submitted or queried.
-  * The MapReduce version must be compatible with the configuration files used
-    by the job client submitting the jobs. If it is incompatible with that
-    configuration (e.g.: a new property must be set or an existing property
-    value changed) then the configuration must be updated first.
-  * The MapReduce version must be compatible with the <<<ShuffleHandler>>>
-    version running on the nodes in the cluster. If it is incompatible then the
-    new <<<ShuffleHandler>>> code must be deployed to all the nodes in the
-    cluster, and the NodeManagers must be restarted to pick up the new
-    <<<ShuffleHandler>>> code.
-* Deploying a New MapReduce Version via the Distributed Cache
-  Deploying a new MapReduce version consists of three steps:
-  [[1]] Upload the MapReduce archive to a location that can be accessed by the
-  job submission client. Ideally the archive should be on the cluster's default
-  filesystem at a publicly-readable path. See the archive location discussion
-  below for more details.
-  [[2]] Configure <<<mapreduce.application.framework.path>>> to point to the
-  location where the archive is located. As when specifying distributed cache
-  files for a job, this is a URL that also supports creating an alias for the
-  archive if a URL fragment is specified. For example,
-  will be localized as <<<mrframework>>> rather than
-  <<<hadoop-mapreduce-${project.version}.tar.gz>>>.
-  [[3]] Configure <<<mapreduce.application.classpath>>> to set the proper
-  classpath to use with the MapReduce archive configured above. NOTE: An error
-  occurs if <<<mapreduce.application.framework.path>>> is configured but
-  <<<mapreduce.application.classpath>>> does not reference the base name of the
-  archive path or the alias if an alias was specified.
-** Location of the MapReduce Archive and How It Affects Job Performance
-  Note that the location of the MapReduce archive can be critical to job
-  submission and job startup performance. If the archive is not located on the
-  cluster's default filesystem then it will be copied to the job staging
-  directory for each job and localized to each node where the job's tasks
-  run. This will slow down job submission and task startup performance.
-  If the archive is located on the default filesystem then the job client will
-  not upload the archive to the job staging directory for each job
-  submission. However if the archive path is not readable by all cluster users
-  then the archive will be localized separately for each user on each node
-  where tasks execute. This can cause unnecessary duplication in the
-  distributed cache.
-  When working with a large cluster it can be important to increase the
-  replication factor of the archive to increase its availability. This will
-  spread the load when the nodes in the cluster localize the archive for the
-  first time.
-* MapReduce Archives and Classpath Configuration
-  Setting a proper classpath for the MapReduce archive depends upon the
-  composition of the archive and whether it has any additional dependencies.
-  For example, the archive can contain not only the MapReduce jars but also the
-  necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In
-  that case, <<<mapreduce.application.classpath>>> would be configured to
-  something like the following example, where the archive basename is
-  hadoop-mapreduce-${project.version}.tar.gz and the archive is organized
-  internally similar to the standard Hadoop distribution archive:
-  Another possible approach is to have the archive consist of just the
-  MapReduce jars and have the remaining dependencies picked up from the Hadoop
-  distribution installed on the nodes.  In that case, the above example would
-  change to something like the following:
-** NOTE: 
-  If shuffle encryption is also enabled in the cluster, then we could meet the problem that MR job get failed with exception like below: 
-2014-10-10 02:17:16,600 WARN [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to junpingdu-centos5-3.cs1cloud.internal:13562 with 1 map outputs PKIX path building failed: unable to find valid certification path to requested target
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at
-    at 
-  This is because MR client (deployed from HDFS) cannot access ssl-client.xml in local FS under directory of $HADOOP_CONF_DIR. To fix the problem, we can add the directory with ssl-client.xml to the classpath of MR which is specified in "mapreduce.application.classpath" as mentioned above. To avoid MR application being affected by other local configurations, it is better to create a dedicated directory for putting ssl-client.xml, e.g. a sub-directory under 
diff --git 
deleted file mode 100644
index 1761ad8..0000000
+++ /dev/null
@@ -1,320 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-  ---
-  Hadoop Map Reduce Next Generation-${project.version} - Encrypted Shuffle
-  ---
-  ---
-  ${}
-Hadoop MapReduce Next Generation - Encrypted Shuffle
-* {Introduction}
-  The Encrypted Shuffle capability allows encryption of the MapReduce shuffle
-  using HTTPS and with optional client authentication (also known as
-  bi-directional HTTPS, or HTTPS with client certificates). It comprises:
-  * A Hadoop configuration setting for toggling the shuffle between HTTP and
-    HTTPS.
-  * A Hadoop configuration settings for specifying the keystore and truststore
-   properties (location, type, passwords) used by the shuffle service and the
-   reducers tasks fetching shuffle data.
-  * A way to re-load truststores across the cluster (when a node is added or
-    removed).
-* {Configuration}
-**  <<core-site.xml>> Properties
-  To enable encrypted shuffle, set the following properties in core-site.xml of
-  all nodes in the cluster:
-| <<Property>>                         | <<Default Value>>   | <<Explanation>> 
-| <<<hadoop.ssl.require.client.cert>>> | <<<false>>>         | Whether client certificates are required |
-| <<<hadoop.ssl.hostname.verifier>>>   | <<<DEFAULT>>>       | The hostname verifier to provide for HttpsURLConnections. Valid values are: <<DEFAULT>>, 
-| <<<hadoop.ssl.keystores.factory.class>>> | <<<>>> | The KeyStoresFactory implementation to use |
-| <<<hadoop.ssl.server.conf>>>         | <<<ssl-server.xml>>> | Resource file from which ssl server keystore information will be extracted. This file is looked up in the classpath, typically it should be in Hadoop conf/ directory |
-| <<<hadoop.ssl.client.conf>>>         | <<<ssl-client.xml>>> | Resource file from which ssl server keystore information will be extracted. This file is looked up in the classpath, typically it should be in Hadoop conf/ directory |
-| <<<hadoop.ssl.enabled.protocols>>>   | <<<TLSv1>>>         | The supported SSL protocols (JDK6 can use <<TLSv1>>, JDK7+ can use <<TLSv1,TLSv1.1,TLSv1.2>>) 
-  <<IMPORTANT:>> Currently requiring client certificates should be set to 
-  Refer the {{{ClientCertificates}Client Certificates}} section for details.
-  <<IMPORTANT:>> All these properties should be marked as final in the cluster
-  configuration files.
-*** Example:
-    ...
-    <property>
-      <name>hadoop.ssl.require.client.cert</name>
-      <value>false</value>
-      <final>true</final>
-    </property>
-    <property>
-      <name>hadoop.ssl.hostname.verifier</name>
-      <value>DEFAULT</value>
-      <final>true</final>
-    </property>
-    <property>
-      <name>hadoop.ssl.keystores.factory.class</name>
-      <value></value>
-      <final>true</final>
-    </property>
-    <property>
-      <name>hadoop.ssl.server.conf</name>
-      <value>ssl-server.xml</value>
-      <final>true</final>
-    </property>
-    <property>
-      <name>hadoop.ssl.client.conf</name>
-      <value>ssl-client.xml</value>
-      <final>true</final>
-    </property>
-    ...
-**  <<<mapred-site.xml>>> Properties
-  To enable encrypted shuffle, set the following property in mapred-site.xml
-  of all nodes in the cluster:
-| <<Property>>                         | <<Default Value>>   | <<Explanation>> 
-| <<<mapreduce.shuffle.ssl.enabled>>>  | <<<false>>>         | Whether encrypted shuffle is enabled |
-  <<IMPORTANT:>> This property should be marked as final in the cluster
-  configuration files.
-*** Example:
-    ...
-    <property>
-      <name>mapreduce.shuffle.ssl.enabled</name>
-      <value>true</value>
-      <final>true</final>
-    </property>
-    ...
-  The Linux container executor should be set to prevent job tasks from
-  reading the server keystore information and gaining access to the shuffle
-  server certificates.
-  Refer to Hadoop Kerberos configuration for details on how to do this.
-* {Keystore and Truststore Settings}
-  Currently <<<FileBasedKeyStoresFactory>>> is the only <<<KeyStoresFactory>>>
-  implementation. The <<<FileBasedKeyStoresFactory>>> implementation uses the
-  following properties, in the <<ssl-server.xml>> and <<ssl-client.xml>> files,
-  to configure the keystores and truststores.
-** <<<ssl-server.xml>>> (Shuffle server) Configuration:
-  The mapred user should own the <<ssl-server.xml>> file and have exclusive
-  read access to it.
-| <<Property>>                                | <<Default Value>>   | <<Explanation>> |
-| <<<ssl.server.keystore.type>>>              | <<<jks>>>           | Keystore file type |
-| <<<ssl.server.keystore.location>>>          | NONE                | Keystore file location. The mapred user should own this file and have exclusive read access to it. |
-| <<<ssl.server.keystore.password>>>          | NONE                | Keystore file password |
-| <<<ssl.server.truststore.type>>>            | <<<jks>>>           | Truststore file type |
-| <<<ssl.server.truststore.location>>>        | NONE                | Truststore file location. The mapred user should own this file and have exclusive read access to it. |
-| <<<ssl.server.truststore.password>>>        | NONE                | Truststore file password |
-| <<<ssl.server.truststore.reload.interval>>> | 10000               | Truststore reload interval, in milliseconds |
-*** Example:
-  <!-- Server Certificate Store -->
-  <property>
-    <name>ssl.server.keystore.type</name>
-    <value>jks</value>
-  </property>
-  <property>
-    <name>ssl.server.keystore.location</name>
-    <value>${user.home}/keystores/server-keystore.jks</value>
-  </property>
-  <property>
-    <name>ssl.server.keystore.password</name>
-    <value>serverfoo</value>
-  </property>
-  <!-- Server Trust Store -->
-  <property>
-    <name>ssl.server.truststore.type</name>
-    <value>jks</value>
-  </property>
-  <property>
-    <name>ssl.server.truststore.location</name>
-    <value>${user.home}/keystores/truststore.jks</value>
-  </property>
-  <property>
-    <name>ssl.server.truststore.password</name>
-    <value>clientserverbar</value>
-  </property>
-  <property>
-    <name>ssl.server.truststore.reload.interval</name>
-    <value>10000</value>
-  </property>
-** <<<ssl-client.xml>>> (Reducer/Fetcher) Configuration:
-  The mapred user should own the <<ssl-client.xml>> file and it should have
-  default permissions.
-| <<Property>>                                | <<Default Value>>   | <<Explanation>> |
-| <<<ssl.client.keystore.type>>>              | <<<jks>>>           | Keystore file type |
-| <<<ssl.client.keystore.location>>>          | NONE                | Keystore file location. The mapred user should own this file and it should have default permissions. |
-| <<<ssl.client.keystore.password>>>          | NONE                | Keystore file password |
-| <<<ssl.client.truststore.type>>>            | <<<jks>>>           | Truststore file type |
-| <<<ssl.client.truststore.location>>>        | NONE                | Truststore file location. The mapred user should own this file and it should have default permissions. |
-| <<<ssl.client.truststore.password>>>        | NONE                | Truststore file password |
-| <<<ssl.client.truststore.reload.interval>>> | 10000                | Truststore reload interval, in milliseconds |
-*** Example:
-  <!-- Client certificate Store -->
-  <property>
-    <name>ssl.client.keystore.type</name>
-    <value>jks</value>
-  </property>
-  <property>
-    <name>ssl.client.keystore.location</name>
-    <value>${user.home}/keystores/client-keystore.jks</value>
-  </property>
-  <property>
-    <name>ssl.client.keystore.password</name>
-    <value>clientfoo</value>
-  </property>
-  <!-- Client Trust Store -->
-  <property>
-    <name>ssl.client.truststore.type</name>
-    <value>jks</value>
-  </property>
-  <property>
-    <name>ssl.client.truststore.location</name>
-    <value>${user.home}/keystores/truststore.jks</value>
-  </property>
-  <property>
-    <name>ssl.client.truststore.password</name>
-    <value>clientserverbar</value>
-  </property>
-  <property>
-    <name>ssl.client.truststore.reload.interval</name>
-    <value>10000</value>
-  </property>
-* Activating Encrypted Shuffle
-  When you have made the above configuration changes, activate Encrypted
-  Shuffle by re-starting all NodeManagers.
-  <<IMPORTANT:>> Using encrypted shuffle will incur in a significant
-  performance impact. Users should profile this and potentially reserve
-  1 or more cores for encrypted shuffle.
-* {ClientCertificates} Client Certificates
-  Using Client Certificates does not fully ensure that the client is a
-  reducer task for the job. Currently, Client Certificates (their private key)
-  keystore files must be readable by all users submitting jobs to the cluster.
-  This means that a rogue job could read such those keystore files and use
-  the client certificates in them to establish a secure connection with a
-  Shuffle server. However, unless the rogue job has a proper JobToken, it won't
-  be able to retrieve shuffle data from the Shuffle server. A job, using its
-  own JobToken, can only retrieve shuffle data that belongs to itself.
-* Reloading Truststores
-  By default the truststores will reload their configuration every 10 seconds.
-  If a new truststore file is copied over the old one, it will be re-read,
-  and its certificates will replace the old ones. This mechanism is useful for
-  adding or removing nodes from the cluster, or for adding or removing trusted
-  clients. In these cases, the client or NodeManager certificate is added to
-  (or removed from) all the truststore files in the system, and the new
-  configuration will be picked up without you having to restart the NodeManager
-  daemons.
-* Debugging
-  <<NOTE:>> Enable debugging only for troubleshooting, and then only for jobs
-  running on small amounts of data. It is very verbose and slows down jobs by
-  several orders of magnitude. (You might need to increase mapred.task.timeout
-  to prevent jobs from failing because tasks run so slowly.)
-  To enable SSL debugging in the reducers, set <<<>>> in
-  the <<<>>> property; for example:
-  <property>
-    <name></name>
-    <value>-Xmx-200m</value>
-  </property>
-  You can do this on a per-job basis, or by means of a cluster-wide setting in
-  the <<<mapred-site.xml>>> file.
-  To set this property in NodeManager, set it in the <<<>>> file:
