http://git-wip-us.apache.org/repos/asf/hadoop/blob/8b787e2f/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm
deleted file mode 100644
index e011563..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/MapredCommands.apt.vm
+++ /dev/null
@@ -1,233 +0,0 @@
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements.  See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License.  You may obtain a copy of the License at
-~~
-~~     http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-  ---
-  MapReduce Commands Guide
-  ---
-  ---
-  ${maven.build.timestamp}
-
-MapReduce Commands Guide
-
-%{toc|section=1|fromDepth=2|toDepth=4}
-
-* Overview
-
-  MapReduce commands are invoked by the <<<bin/mapred>>> script. Running the
-  script without any arguments prints the description for all commands.
-
-   Usage: <<<mapred [--config confdir] [--loglevel loglevel] COMMAND>>>
-
-   MapReduce has an option parsing framework that employs parsing generic
-   options as well as running classes.
-
-*-------------------------+---------------------------------------------------+
-|| COMMAND_OPTIONS        || Description                                      |
-*-------------------------+---------------------------------------------------+
-| --config confdir | Overwrites the default Configuration directory. Default
-|                  | is $\{HADOOP_PREFIX\}/conf.
-*-------------------------+---------------------------------------------------+
-| --loglevel loglevel | Overwrites the log level. Valid log levels are FATAL,
-|                     | ERROR, WARN, INFO, DEBUG, and TRACE. Default is INFO.
-*-------------------------+---------------------------------------------------+
-| COMMAND COMMAND_OPTIONS | Various commands with their options are described
-|                         | in the following sections. The commands have been
-|                         | grouped into {{User Commands}} and
-|                         | {{Administration Commands}}.
-*-------------------------+---------------------------------------------------+
-
-* User Commands
-
-   Commands useful for users of a hadoop cluster.
-
-** <<<pipes>>>
-
-   Runs a pipes job.
-
-   Usage: <<<mapred pipes [-conf <path>] [-jobconf <key=value>, <key=value>,
-   ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat
-   <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer
-   <class>] [-program <executable>] [-reduces <num>]>>>
-
-*----------------------------------------+------------------------------------+
-|| COMMAND_OPTION                        || Description
-*----------------------------------------+------------------------------------+
-| -conf <path>                           | Configuration for job
-*----------------------------------------+------------------------------------+
-| -jobconf <key=value>, <key=value>, ... | Add/override configuration for job
-*----------------------------------------+------------------------------------+
-| -input <path>                          | Input directory
-*----------------------------------------+------------------------------------+
-| -output <path>                         | Output directory
-*----------------------------------------+------------------------------------+
-| -jar <jar file>                        | Jar filename
-*----------------------------------------+------------------------------------+
-| -inputformat <class>                   | InputFormat class
-*----------------------------------------+------------------------------------+
-| -map <class>                           | Java Map class
-*----------------------------------------+------------------------------------+
-| -partitioner <class>                   | Java Partitioner
-*----------------------------------------+------------------------------------+
-| -reduce <class>                        | Java Reduce class
-*----------------------------------------+------------------------------------+
-| -writer <class>                        | Java RecordWriter
-*----------------------------------------+------------------------------------+
-| -program <executable>                  | Executable URI
-*----------------------------------------+------------------------------------+
-| -reduces <num>                         | Number of reduces
-*----------------------------------------+------------------------------------+
-
-** <<<job>>>
-
-   Command to interact with Map Reduce Jobs.
-
-   Usage: <<<mapred job
-          | 
[{{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options}GENERIC_OPTIONS}}]
-          | [-submit <job-file>]
-          | [-status <job-id>]
-          | [-counter <job-id> <group-name> <counter-name>]
-          | [-kill <job-id>]
-          | [-events <job-id> <from-event-#> <#-of-events>]
-          | [-history [all] <jobOutputDir>] | [-list [all]]
-          | [-kill-task <task-id>] | [-fail-task <task-id>]
-          | [-set-priority <job-id> <priority>]>>>
-
-*------------------------------+---------------------------------------------+
-|| COMMAND_OPTION              || Description
-*------------------------------+---------------------------------------------+
-| -submit <job-file>           | Submits the job.
-*------------------------------+---------------------------------------------+
-| -status <job-id>             | Prints the map and reduce completion
-                               | percentage and all job counters.
-*------------------------------+---------------------------------------------+
-| -counter <job-id> <group-name> <counter-name> | Prints the counter value.
-*------------------------------+---------------------------------------------+
-| -kill <job-id>               | Kills the job.
-*------------------------------+---------------------------------------------+
-| -events <job-id> <from-event-#> <#-of-events> | Prints the events' details
-                               | received by jobtracker for the given range.
-*------------------------------+---------------------------------------------+
-| -history [all]<jobOutputDir> | Prints job details, failed and killed tip
-                               | details.  More details about the job such as
-                               | successful tasks and task attempts made for
-                               | each task can be viewed by specifying the
-                               | [all] option.
-*------------------------------+---------------------------------------------+
-| -list [all]                  | Displays jobs which are yet to complete.
-                               | <<<-list all>>> displays all jobs.
-*------------------------------+---------------------------------------------+
-| -kill-task <task-id>         | Kills the task. Killed tasks are NOT counted
-                               | against failed attempts.
-*------------------------------+---------------------------------------------+
-| -fail-task <task-id>         | Fails the task. Failed tasks are counted
-                               | against failed attempts.
-*------------------------------+---------------------------------------------+
-| -set-priority <job-id> <priority> | Changes the priority of the job. Allowed
-                               | priority values are VERY_HIGH, HIGH, NORMAL,
-                               | LOW, VERY_LOW
-*------------------------------+---------------------------------------------+
-
-** <<<queue>>>
-
-   command to interact and view Job Queue information
-
-   Usage: <<<mapred queue [-list] | [-info <job-queue-name> [-showJobs]]
-          | [-showacls]>>>
-
-*-----------------+-----------------------------------------------------------+
-|| COMMAND_OPTION || Description
-*-----------------+-----------------------------------------------------------+
-| -list           | Gets list of Job Queues configured in the system.
-                  | Along with scheduling information associated with the job
-                  | queues.
-*-----------------+-----------------------------------------------------------+
-| -info <job-queue-name> [-showJobs] | Displays the job queue information and
-                  | associated scheduling information of particular job queue.
-                  | If <<<-showJobs>>> options is present a list of jobs
-                  | submitted to the particular job queue is displayed.
-*-----------------+-----------------------------------------------------------+
-| -showacls       | Displays the queue name and associated queue operations
-                  | allowed for the current user. The list consists of only
-                  | those queues to which the user has access.
-*-----------------+-----------------------------------------------------------+
-
-** <<<classpath>>>
-
-   Prints the class path needed to get the Hadoop jar and the required
-   libraries.
-
-   Usage: <<<mapred classpath>>>
-
-** <<<distcp>>>
-
-   Copy file or directories recursively. More information can be found at
-   {{{./DistCp.html}Hadoop DistCp Guide}}.
-
-** <<<archive>>>
-
-   Creates a hadoop archive. More information can be found at
-   {{{./HadoopArchives.html}Hadoop Archives Guide}}.
-
-** <<<version>>>
-
-   Prints the version.
-
-   Usage: <<<mapred version>>>
-
-* Administration Commands
-
-   Commands useful for administrators of a hadoop cluster.
-
-** <<<historyserver>>>
-
-   Start JobHistoryServer.
-
-   Usage: <<<mapred historyserver>>>
-
-** <<<hsadmin>>>
-
-   Runs a MapReduce hsadmin client for execute JobHistoryServer administrative
-   commands.
-
-   Usage: <<<mapred hsadmin
-          [-refreshUserToGroupsMappings] |
-          [-refreshSuperUserGroupsConfiguration] |
-          [-refreshAdminAcls] |
-          [-refreshLoadedJobCache] |
-          [-refreshLogRetentionSettings] |
-          [-refreshJobRetentionSettings] |
-          [-getGroups [username]] | [-help [cmd]]>>>
-
-*-----------------+-----------------------------------------------------------+
-|| COMMAND_OPTION || Description
-*-----------------+-----------------------------------------------------------+
-| -refreshUserToGroupsMappings | Refresh user-to-groups mappings
-*-----------------+-----------------------------------------------------------+
-| -refreshSuperUserGroupsConfiguration| Refresh superuser proxy groups mappings
-*-----------------+-----------------------------------------------------------+
-| -refreshAdminAcls | Refresh acls for administration of Job history server
-*-----------------+-----------------------------------------------------------+
-| -refreshLoadedJobCache | Refresh loaded job cache of Job history server
-*-----------------+-----------------------------------------------------------+
-| -refreshJobRetentionSettings|Refresh job history period, job cleaner settings
-*-----------------+-----------------------------------------------------------+
-| -refreshLogRetentionSettings | Refresh log retention period and log retention
-|                              | check interval
-*-----------------+-----------------------------------------------------------+
-| -getGroups [username] | Get the groups which given user belongs to
-*-----------------+-----------------------------------------------------------+
-| -help [cmd] | Displays help for the given command or all commands if none is
-|             | specified.
-*-----------------+-----------------------------------------------------------+

http://git-wip-us.apache.org/repos/asf/hadoop/blob/8b787e2f/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm
deleted file mode 100644
index 06d8022..0000000
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/apt/PluggableShuffleAndPluggableSort.apt.vm
+++ /dev/null
@@ -1,98 +0,0 @@
-~~ Licensed under the Apache License, Version 2.0 (the "License");
-~~ you may not use this file except in compliance with the License.
-~~ You may obtain a copy of the License at
-~~
-~~   http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License. See accompanying LICENSE file.
-
-  ---
-  Hadoop Map Reduce Next Generation-${project.version} - Pluggable Shuffle and 
Pluggable Sort
-  ---
-  ---
-  ${maven.build.timestamp}
-
-Hadoop MapReduce Next Generation - Pluggable Shuffle and Pluggable Sort
-
-* Introduction
-
-  The pluggable shuffle and pluggable sort capabilities allow replacing the 
-  built in shuffle and sort logic with alternate implementations. Example use 
-  cases for this are: using a different application protocol other than HTTP 
-  such as RDMA for shuffling data from the Map nodes to the Reducer nodes; or
-  replacing the sort logic with custom algorithms that enable Hash aggregation 
-  and Limit-N query.
-
-  <<IMPORTANT:>> The pluggable shuffle and pluggable sort capabilities are 
-  experimental and unstable. This means the provided APIs may change and break 
-  compatibility in future versions of Hadoop.
-
-* Implementing a Custom Shuffle and a Custom Sort 
-
-  A custom shuffle implementation requires a
-  
<<<org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.AuxiliaryService>>>
 
-  implementation class running in the NodeManagers and a 
-  <<<org.apache.hadoop.mapred.ShuffleConsumerPlugin>>> implementation class
-  running in the Reducer tasks.
-
-  The default implementations provided by Hadoop can be used as references:
-
-    * <<<org.apache.hadoop.mapred.ShuffleHandler>>>
-    
-    * <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>
-
-  A custom sort implementation requires a 
<<<org.apache.hadoop.mapred.MapOutputCollector>>>
-  implementation class running in the Mapper tasks and (optionally, depending
-  on the sort implementation) a 
<<<org.apache.hadoop.mapred.ShuffleConsumerPlugin>>> 
-  implementation class running in the Reducer tasks.
-
-  The default implementations provided by Hadoop can be used as references:
-
-  * <<<org.apache.hadoop.mapred.MapTask$MapOutputBuffer>>>
-  
-  * <<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>
-
-* Configuration
-
-  Except for the auxiliary service running in the NodeManagers serving the 
-  shuffle (by default the <<<ShuffleHandler>>>), all the pluggable components 
-  run in the job tasks. This means, they can be configured on per job basis. 
-  The auxiliary service servicing the Shuffle must be configured in the 
-  NodeManagers configuration.
-
-** Job Configuration Properties (on per job basis):
-
-*--------------------------------------+---------------------+-----------------+
-| <<Property>>                         | <<Default Value>>   | <<Explanation>> 
|
-*--------------------------------------+---------------------+-----------------+
-| <<<mapreduce.job.reduce.shuffle.consumer.plugin.class>>> | 
<<<org.apache.hadoop.mapreduce.task.reduce.Shuffle>>>         | The 
<<<ShuffleConsumerPlugin>>> implementation to use |
-*--------------------------------------+---------------------+-----------------+
-| <<<mapreduce.job.map.output.collector.class>>>   | 
<<<org.apache.hadoop.mapred.MapTask$MapOutputBuffer>>> | The 
<<<MapOutputCollector>>> implementation(s) to use |
-*--------------------------------------+---------------------+-----------------+
-
-  These properties can also be set in the <<<mapred-site.xml>>> to change the 
default values for all jobs.
-
-  The collector class configuration may specify a comma-separated list of 
collector implementations.
-  In this case, the map task will attempt to instantiate each in turn until 
one of the
-  implementations successfully initializes. This can be useful if a given 
collector
-  implementation is only compatible with certain types of keys or values, for 
example.
-
-** NodeManager Configuration properties, <<<yarn-site.xml>>> in all nodes:
-
-*--------------------------------------+---------------------+-----------------+
-| <<Property>>                         | <<Default Value>>   | <<Explanation>> 
|
-*--------------------------------------+---------------------+-----------------+
-| <<<yarn.nodemanager.aux-services>>> | <<<...,mapreduce_shuffle>>>  | The 
auxiliary service name |
-*--------------------------------------+---------------------+-----------------+
-| <<<yarn.nodemanager.aux-services.mapreduce_shuffle.class>>>   | 
<<<org.apache.hadoop.mapred.ShuffleHandler>>> | The auxiliary service class to 
use |
-*--------------------------------------+---------------------+-----------------+
-
-  <<IMPORTANT:>> If setting an auxiliary service in addition the default 
-  <<<mapreduce_shuffle>>> service, then a new service key should be added to 
the
-  <<<yarn.nodemanager.aux-services>>> property, for example 
<<<mapred.shufflex>>>.
-  Then the property defining the corresponding class must be
-  <<<yarn.nodemanager.aux-services.mapreduce_shufflex.class>>>.

http://git-wip-us.apache.org/repos/asf/hadoop/blob/8b787e2f/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
new file mode 100644
index 0000000..36ad8fc
--- /dev/null
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/DistributedCacheDeploy.md.vm
@@ -0,0 +1,119 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+#set ( $H3 = '###' )
+#set ( $H4 = '####' )
+#set ( $H5 = '#####' )
+
+Hadoop: Distributed Cache Deploy
+================================
+
+Introduction
+------------
+
+The MapReduce application framework has rudimentary support for deploying a 
new version of the MapReduce framework via the distributed cache. By setting 
the appropriate configuration properties, users can run a different version of 
MapReduce than the one initially deployed to the cluster. For example, cluster 
administrators can place multiple versions of MapReduce in HDFS and configure 
`mapred-site.xml` to specify which version jobs will use by default. This 
allows the administrators to perform a rolling upgrade of the MapReduce 
framework under certain conditions.
+
+Preconditions and Limitations
+-----------------------------
+
+The support for deploying the MapReduce framework via the distributed cache currently does not address the job client code used to submit and query jobs. It also does not address the `ShuffleHandler` code that runs as an auxiliary service within each NodeManager. As a result, the following limitations apply to MapReduce versions that can be successfully deployed via the distributed cache in a rolling upgrade fashion:
+
+* The MapReduce version must be compatible with the job client code used to
+  submit and query jobs. If it is incompatible then the job client must be
+  upgraded separately on any node from which jobs using the new MapReduce
+  version will be submitted or queried.
+
+* The MapReduce version must be compatible with the configuration files used
+  by the job client submitting the jobs. If it is incompatible with that
+  configuration (e.g.: a new property must be set or an existing property
+  value changed) then the configuration must be updated first.
+
+* The MapReduce version must be compatible with the `ShuffleHandler`
+  version running on the nodes in the cluster. If it is incompatible then the
+  new `ShuffleHandler` code must be deployed to all the nodes in the
+  cluster, and the NodeManagers must be restarted to pick up the new
+  `ShuffleHandler` code.
+
+Deploying a New MapReduce Version via the Distributed Cache
+-----------------------------------------------------------
+
+Deploying a new MapReduce version consists of three steps:
+
+1.  Upload the MapReduce archive to a location that can be accessed by the
+    job submission client. Ideally the archive should be on the cluster's 
default
+    filesystem at a publicly-readable path. See the archive location discussion
+    below for more details.
+
+2.  Configure `mapreduce.application.framework.path` to point to the
+    location where the archive is located. As when specifying distributed cache
+    files for a job, this is a URL that also supports creating an alias for the
+    archive if a URL fragment is specified. For example,
+    
`hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework`
+    will be localized as `mrframework` rather than
+    `hadoop-mapreduce-${project.version}.tar.gz`.
+
+3.  Configure `mapreduce.application.classpath` to set the proper
+    classpath to use with the MapReduce archive configured above. NOTE: An error
+    occurs if `mapreduce.application.framework.path` is configured but
+    `mapreduce.application.classpath` does not reference the base name of the
+    archive path or the alias if an alias was specified. A configuration sketch
+    covering both properties is shown below.
+
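+Taken together, steps 2 and 3 amount to two entries in `mapred-site.xml`. The sketch below is illustrative rather than part of this change: it assumes the archive was uploaded to the hypothetical path `hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz`, that the archive unpacks to the standard distribution layout, and it abbreviates the classpath value (full classpath examples appear later in this document).
+
+```xml
+  <property>
+    <name>mapreduce.application.framework.path</name>
+    <value>hdfs:/mapred/framework/hadoop-mapreduce-${project.version}.tar.gz#mrframework</value>
+  </property>
+
+  <property>
+    <name>mapreduce.application.classpath</name>
+    <!-- Must reference the alias "mrframework" because a URL fragment was specified above;
+         entries beyond the MapReduce jars are omitted here for brevity. -->
+    <value>$HADOOP_CONF_DIR,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/mrframework/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*</value>
+  </property>
+```
+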
+$H3 Location of the MapReduce Archive and How It Affects Job Performance
+
+Note that the location of the MapReduce archive can be critical to job 
submission and job startup performance. If the archive is not located on the 
cluster's default filesystem then it will be copied to the job staging 
directory for each job and localized to each node where the job's tasks run. 
This will slow down job submission and task startup performance.
+
+If the archive is located on the default filesystem then the job client will 
not upload the archive to the job staging directory for each job submission. 
However if the archive path is not readable by all cluster users then the 
archive will be localized separately for each user on each node where tasks 
execute. This can cause unnecessary duplication in the distributed cache.
+
+When working with a large cluster it can be important to increase the 
replication factor of the archive to increase its availability. This will 
spread the load when the nodes in the cluster localize the archive for the 
first time.
+
+MapReduce Archives and Classpath Configuration
+----------------------------------------------
+
+Setting a proper classpath for the MapReduce archive depends upon the 
composition of the archive and whether it has any additional dependencies. For 
example, the archive can contain not only the MapReduce jars but also the 
necessary YARN, HDFS, and Hadoop Common jars and all other dependencies. In 
that case, `mapreduce.application.classpath` would be configured to something 
like the following example, where the archive basename is 
hadoop-mapreduce-${project.version}.tar.gz and the archive is organized 
internally similar to the standard Hadoop distribution archive:
+
+`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/common/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/yarn/lib/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/hdfs/lib/*`
+
+Another possible approach is to have the archive consist of just the MapReduce 
jars and have the remaining dependencies picked up from the Hadoop distribution 
installed on the nodes. In that case, the above example would change to 
something like the following:
+
+`$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/lib/*,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*`
+
+$H3 NOTE:
+
+If shuffle encryption is also enabled in the cluster, MapReduce jobs may fail with an exception like the one below:
+
+    2014-10-10 02:17:16,600 WARN [fetcher#1] 
org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
junpingdu-centos5-3.cs1cloud.internal:13562 with 1 map outputs
+    javax.net.ssl.SSLHandshakeException: 
sun.security.validator.ValidatorException: PKIX path building failed: 
sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
valid certification path to requested target
+        at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
+        at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1731)
+        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:241)
+        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:235)
+        at 
com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1206)
+        at 
com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:136)
+        at 
com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:593)
+        at 
com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:529)
+        at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:925)
+        at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1170)
+        at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1197)
+        at 
com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1181)
+        at 
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:434)
+        at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:81)
+        at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:61)
+        at 
sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:584)
+        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1193)
+        at 
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
+        at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
+        at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
+    ....
+
+This happens because the MR client (deployed from HDFS) cannot access ssl-client.xml on the local filesystem under $HADOOP\_CONF\_DIR. To fix the problem, add the directory containing ssl-client.xml to the MapReduce classpath, which is specified in `mapreduce.application.classpath` as mentioned above. To keep the MapReduce application from being affected by other local configurations, it is better to create a dedicated directory for ssl-client.xml, e.g. a sub-directory under $HADOOP\_CONF\_DIR such as $HADOOP\_CONF\_DIR/security.
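+
+As an illustration only (not part of this change), if the dedicated directory is `$HADOOP_CONF_DIR/security` and the remaining framework entries are abbreviated, the resulting `mapreduce.application.classpath` could look like the following sketch:
+
+```xml
+  <property>
+    <name>mapreduce.application.classpath</name>
+    <!-- Prepend the local directory that holds ssl-client.xml; the remaining
+         framework entries are abbreviated and should match the examples above. -->
+    <value>$HADOOP_CONF_DIR/security,$HADOOP_CONF_DIR,$PWD/hadoop-mapreduce-${project.version}.tar.gz/hadoop-mapreduce-${project.version}/share/hadoop/mapreduce/*,...</value>
+  </property>
+```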

http://git-wip-us.apache.org/repos/asf/hadoop/blob/8b787e2f/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md
new file mode 100644
index 0000000..58fd52a
--- /dev/null
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/EncryptedShuffle.md
@@ -0,0 +1,255 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+Hadoop: Encrypted Shuffle
+=========================
+
+Introduction
+------------
+
+The Encrypted Shuffle capability allows encryption of the MapReduce shuffle using HTTPS, with optional client authentication (also known as bi-directional HTTPS, or HTTPS with client certificates). It comprises:
+
+*   A Hadoop configuration setting for toggling the shuffle between HTTP and
+    HTTPS.
+
+*   Hadoop configuration settings for specifying the keystore and truststore
+    properties (location, type, passwords) used by the shuffle service and the
+    reducer tasks fetching shuffle data.
+
+*   A way to re-load truststores across the cluster (when a node is added or
+    removed).
+
+Configuration
+-------------
+
+### **core-site.xml** Properties
+
+To enable encrypted shuffle, set the following properties in core-site.xml of 
all nodes in the cluster:
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `hadoop.ssl.require.client.cert` | `false` | Whether client certificates are required |
+| `hadoop.ssl.hostname.verifier` | `DEFAULT` | The hostname verifier to provide for HttpsURLConnections. Valid values are: **DEFAULT**, **STRICT**, **STRICT\_IE6**, **DEFAULT\_AND\_LOCALHOST** and **ALLOW\_ALL** |
+| `hadoop.ssl.keystores.factory.class` | `org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory` | The KeyStoresFactory implementation to use |
+| `hadoop.ssl.server.conf` | `ssl-server.xml` | Resource file from which ssl server keystore information will be extracted. This file is looked up in the classpath; typically it should be in the Hadoop conf/ directory |
+| `hadoop.ssl.client.conf` | `ssl-client.xml` | Resource file from which ssl client keystore information will be extracted. This file is looked up in the classpath; typically it should be in the Hadoop conf/ directory |
+| `hadoop.ssl.enabled.protocols` | `TLSv1` | The supported SSL protocols (JDK6 can use **TLSv1**, JDK7+ can use **TLSv1,TLSv1.1,TLSv1.2**) |
+
+**IMPORTANT:** Currently `hadoop.ssl.require.client.cert` should be set to false. Refer to the [Client Certificates](#Client_Certificates) section for details.
+
+**IMPORTANT:** All these properties should be marked as final in the cluster 
configuration files.
+
+#### Example:
+
+```xml
+  <property>
+    <name>hadoop.ssl.require.client.cert</name>
+    <value>false</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.hostname.verifier</name>
+    <value>DEFAULT</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.keystores.factory.class</name>
+    <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.server.conf</name>
+    <value>ssl-server.xml</value>
+    <final>true</final>
+  </property>
+
+  <property>
+    <name>hadoop.ssl.client.conf</name>
+    <value>ssl-client.xml</value>
+    <final>true</final>
+  </property>
+```
+
+### `mapred-site.xml` Properties
+
+To enable encrypted shuffle, set the following property in mapred-site.xml of 
all nodes in the cluster:
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `mapreduce.shuffle.ssl.enabled` | `false` | Whether encrypted shuffle is enabled |
+
+**IMPORTANT:** This property should be marked as final in the cluster 
configuration files.
+
+#### Example:
+
+```xml
+  <property>
+    <name>mapreduce.shuffle.ssl.enabled</name>
+    <value>true</value>
+    <final>true</final>
+  </property>
+```
+
+The Linux container executor should be set to prevent job tasks from reading 
the server keystore information and gaining access to the shuffle server 
certificates.
+
+Refer to Hadoop Kerberos configuration for details on how to do this.
+
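+For reference, enabling the Linux container executor is itself a `yarn-site.xml` setting. The snippet below is a minimal sketch and not part of this change; the group value (`hadoop` here) is an assumed example, and the full secure-mode setup (container-executor.cfg, file permissions, banned users) is covered by the Kerberos/secure mode documentation referenced above.
+
+```xml
+  <property>
+    <name>yarn.nodemanager.container-executor.class</name>
+    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
+  </property>
+  <property>
+    <!-- Example group; must match the group configured in container-executor.cfg. -->
+    <name>yarn.nodemanager.linux-container-executor.group</name>
+    <value>hadoop</value>
+  </property>
+```
+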
+Keystore and Truststore Settings
+--------------------------------
+
+Currently `FileBasedKeyStoresFactory` is the only `KeyStoresFactory` 
implementation. The `FileBasedKeyStoresFactory` implementation uses the 
following properties, in the **ssl-server.xml** and **ssl-client.xml** files, 
to configure the keystores and truststores.
+
+### `ssl-server.xml` (Shuffle server) Configuration:
+
+The mapred user should own the **ssl-server.xml** file and have exclusive read 
access to it.
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `ssl.server.keystore.type` | `jks` | Keystore file type |
+| `ssl.server.keystore.location` | NONE | Keystore file location. The mapred user should own this file and have exclusive read access to it. |
+| `ssl.server.keystore.password` | NONE | Keystore file password |
+| `ssl.server.truststore.type` | `jks` | Truststore file type |
+| `ssl.server.truststore.location` | NONE | Truststore file location. The mapred user should own this file and have exclusive read access to it. |
+| `ssl.server.truststore.password` | NONE | Truststore file password |
+| `ssl.server.truststore.reload.interval` | 10000 | Truststore reload interval, in milliseconds |
+
+#### Example:
+
+```xml
+<configuration>
+
+  <!-- Server Certificate Store -->
+  <property>
+    <name>ssl.server.keystore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.server.keystore.location</name>
+    <value>${user.home}/keystores/server-keystore.jks</value>
+  </property>
+  <property>
+    <name>ssl.server.keystore.password</name>
+    <value>serverfoo</value>
+  </property>
+
+  <!-- Server Trust Store -->
+  <property>
+    <name>ssl.server.truststore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.server.truststore.location</name>
+    <value>${user.home}/keystores/truststore.jks</value>
+  </property>
+  <property>
+    <name>ssl.server.truststore.password</name>
+    <value>clientserverbar</value>
+  </property>
+  <property>
+    <name>ssl.server.truststore.reload.interval</name>
+    <value>10000</value>
+  </property>
+</configuration>
+```
+
+### `ssl-client.xml` (Reducer/Fetcher) Configuration:
+
+The mapred user should own the **ssl-client.xml** file and it should have 
default permissions.
+
+| **Property** | **Default Value** | **Explanation** |
+|:---- |:---- |:---- |
+| `ssl.client.keystore.type` | `jks` | Keystore file type |
+| `ssl.client.keystore.location` | NONE | Keystore file location. The mapred user should own this file and it should have default permissions. |
+| `ssl.client.keystore.password` | NONE | Keystore file password |
+| `ssl.client.truststore.type` | `jks` | Truststore file type |
+| `ssl.client.truststore.location` | NONE | Truststore file location. The mapred user should own this file and it should have default permissions. |
+| `ssl.client.truststore.password` | NONE | Truststore file password |
+| `ssl.client.truststore.reload.interval` | 10000 | Truststore reload interval, in milliseconds |
+
+#### Example:
+
+```xml
+<configuration>
+
+  <!-- Client certificate Store -->
+  <property>
+    <name>ssl.client.keystore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.client.keystore.location</name>
+    <value>${user.home}/keystores/client-keystore.jks</value>
+  </property>
+  <property>
+    <name>ssl.client.keystore.password</name>
+    <value>clientfoo</value>
+  </property>
+
+  <!-- Client Trust Store -->
+  <property>
+    <name>ssl.client.truststore.type</name>
+    <value>jks</value>
+  </property>
+  <property>
+    <name>ssl.client.truststore.location</name>
+    <value>${user.home}/keystores/truststore.jks</value>
+  </property>
+  <property>
+    <name>ssl.client.truststore.password</name>
+    <value>clientserverbar</value>
+  </property>
+  <property>
+    <name>ssl.client.truststore.reload.interval</name>
+    <value>10000</value>
+  </property>
+</configuration>
+```
+
+Activating Encrypted Shuffle
+----------------------------
+
+When you have made the above configuration changes, activate Encrypted Shuffle 
by re-starting all NodeManagers.
+
+**IMPORTANT:** Using encrypted shuffle will incur a significant performance impact. Users should profile this and potentially reserve 1 or more cores for encrypted shuffle.
+
+Client Certificates
+-------------------
+
+Using Client Certificates does not fully ensure that the client is a reducer task for the job. Currently, the keystore files containing the Client Certificates (and their private keys) must be readable by all users submitting jobs to the cluster. This means that a rogue job could read those keystore files and use the client certificates in them to establish a secure connection with a Shuffle server. However, unless the rogue job has a proper JobToken, it won't be able to retrieve shuffle data from the Shuffle server. A job, using its own JobToken, can only retrieve shuffle data that belongs to itself.
+
+Reloading Truststores
+---------------------
+
+By default the truststores will reload their configuration every 10 seconds. 
If a new truststore file is copied over the old one, it will be re-read, and 
its certificates will replace the old ones. This mechanism is useful for adding 
or removing nodes from the cluster, or for adding or removing trusted clients. 
In these cases, the client or NodeManager certificate is added to (or removed 
from) all the truststore files in the system, and the new configuration will be 
picked up without you having to restart the NodeManager daemons.
+
+Debugging
+---------
+
+**NOTE:** Enable debugging only for troubleshooting, and then only for jobs 
running on small amounts of data. It is very verbose and slows down jobs by 
several orders of magnitude. (You might need to increase mapred.task.timeout to 
prevent jobs from failing because tasks run so slowly.)
+
+To enable SSL debugging in the reducers, set `-Djavax.net.debug=all` in the 
`mapreduce.reduce.child.java.opts` property; for example:
+
+      <property>
+        <name>mapreduce.reduce.child.java.opts</name>
+        <value>-Xmx200m -Djavax.net.debug=all</value>
+      </property>
+
+You can do this on a per-job basis, or by means of a cluster-wide setting in 
the `mapred-site.xml` file.
+
+To set this property in NodeManager, set it in the `yarn-env.sh` file:
+
+      YARN_NODEMANAGER_OPTS="-Djavax.net.debug=all"
