HADOOP-11495. Convert site documentation from apt to markdown (Masatake Iwasaki via aw)
Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/e9d26fe9
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/e9d26fe9
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/e9d26fe9

Branch: refs/heads/YARN-2928
Commit: e9d26fe9eb16a0482d3581504ecad22b4cd65077
Parents: 6338ce3
Author: Allen Wittenauer <[email protected]>
Authored: Tue Feb 10 13:39:57 2015 -0800
Committer: Allen Wittenauer <[email protected]>
Committed: Tue Feb 10 13:39:57 2015 -0800

----------------------------------------------------------------------
 hadoop-common-project/hadoop-common/CHANGES.txt |   3 +
 .../src/site/apt/CLIMiniCluster.apt.vm          |  83 --
 .../src/site/apt/ClusterSetup.apt.vm            | 651 --------------
 .../src/site/apt/CommandsManual.apt.vm          | 327 -------
 .../src/site/apt/Compatibility.apt.vm           | 541 ------------
 .../src/site/apt/DeprecatedProperties.apt.vm    | 552 ------------
 .../src/site/apt/FileSystemShell.apt.vm         | 764 ----------------
 .../src/site/apt/HttpAuthentication.apt.vm      |  98 ---
 .../src/site/apt/InterfaceClassification.apt.vm | 239 -----
 .../hadoop-common/src/site/apt/Metrics.apt.vm   | 879 -------------------
 .../src/site/apt/NativeLibraries.apt.vm         | 205 -----
 .../src/site/apt/RackAwareness.apt.vm           | 140 ---
 .../src/site/apt/SecureMode.apt.vm              | 689 ---------------
 .../src/site/apt/ServiceLevelAuth.apt.vm        | 216 -----
 .../src/site/apt/SingleCluster.apt.vm           | 286 ------
 .../src/site/apt/SingleNodeSetup.apt.vm         |  24 -
 .../src/site/apt/Superusers.apt.vm              | 144 ---
 .../hadoop-common/src/site/apt/Tracing.apt.vm   | 233 -----
 .../src/site/markdown/CLIMiniCluster.md.vm      |  68 ++
 .../src/site/markdown/ClusterSetup.md           | 339 +++++++
 .../src/site/markdown/CommandsManual.md         | 227 +++++
 .../src/site/markdown/Compatibility.md          | 313 +++++++
 .../src/site/markdown/DeprecatedProperties.md   | 288 ++++++
 .../src/site/markdown/FileSystemShell.md        | 689 +++++++++++++++
 .../src/site/markdown/HttpAuthentication.md     |  58 ++
 .../site/markdown/InterfaceClassification.md    | 105 +++
 .../hadoop-common/src/site/markdown/Metrics.md  | 456 ++++++++++
 .../src/site/markdown/NativeLibraries.md.vm     | 145 +++
 .../src/site/markdown/RackAwareness.md          | 104 +++
 .../src/site/markdown/SecureMode.md             | 375 ++++++++
 .../src/site/markdown/ServiceLevelAuth.md       | 144 +++
 .../src/site/markdown/SingleCluster.md.vm       | 232 +++++
 .../src/site/markdown/SingleNodeSetup.md        |  20 +
 .../src/site/markdown/Superusers.md             | 106 +++
 .../hadoop-common/src/site/markdown/Tracing.md  | 182 ++++
 35 files changed, 3854 insertions(+), 6071 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/e9d26fe9/hadoop-common-project/hadoop-common/CHANGES.txt
----------------------------------------------------------------------
diff --git a/hadoop-common-project/hadoop-common/CHANGES.txt b/hadoop-common-project/hadoop-common/CHANGES.txt
index fadc744..1ba93e8 100644
--- a/hadoop-common-project/hadoop-common/CHANGES.txt
+++ b/hadoop-common-project/hadoop-common/CHANGES.txt
@@ -168,6 +168,9 @@ Trunk (Unreleased)
 
     HADOOP-6964. Allow compact property description in xml (Kengo Seki via aw)
 
+    HADOOP-11495. Convert site documentation from apt to markdown
+    (Masatake Iwasaki via aw)
+
     BUG FIXES
 
     HADOOP-11473.
test-patch says "-1 overall" even when all checks are +1 http://git-wip-us.apache.org/repos/asf/hadoop/blob/e9d26fe9/hadoop-common-project/hadoop-common/src/site/apt/CLIMiniCluster.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/apt/CLIMiniCluster.apt.vm b/hadoop-common-project/hadoop-common/src/site/apt/CLIMiniCluster.apt.vm deleted file mode 100644 index 2d12c39..0000000 --- a/hadoop-common-project/hadoop-common/src/site/apt/CLIMiniCluster.apt.vm +++ /dev/null @@ -1,83 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - Hadoop MapReduce Next Generation ${project.version} - CLI MiniCluster. - --- - --- - ${maven.build.timestamp} - -Hadoop MapReduce Next Generation - CLI MiniCluster. - -%{toc|section=1|fromDepth=0} - -* {Purpose} - - Using the CLI MiniCluster, users can simply start and stop a single-node - Hadoop cluster with a single command, and without the need to set any - environment variables or manage configuration files. The CLI MiniCluster - starts both a <<<YARN>>>/<<<MapReduce>>> & <<<HDFS>>> clusters. - - This is useful for cases where users want to quickly experiment with a real - Hadoop cluster or test non-Java programs that rely on significant Hadoop - functionality. - -* {Hadoop Tarball} - - You should be able to obtain the Hadoop tarball from the release. Also, you - can directly create a tarball from the source: - -+---+ -$ mvn clean install -DskipTests -$ mvn package -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip -+---+ - <<NOTE:>> You will need {{{http://code.google.com/p/protobuf/}protoc 2.5.0}} - installed. - - The tarball should be available in <<<hadoop-dist/target/>>> directory. - -* {Running the MiniCluster} - - From inside the root directory of the extracted tarball, you can start the CLI - MiniCluster using the following command: - -+---+ -$ bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-${project.version}-tests.jar minicluster -rmport RM_PORT -jhsport JHS_PORT -+---+ - - In the example command above, <<<RM_PORT>>> and <<<JHS_PORT>>> should be - replaced by the user's choice of these port numbers. If not specified, random - free ports will be used. - - There are a number of command line arguments that the users can use to control - which services to start, and to pass other configuration properties. - The available command line arguments: - -+---+ -$ -D <property=value> Options to pass into configuration object -$ -datanodes <arg> How many datanodes to start (default 1) -$ -format Format the DFS (default false) -$ -help Prints option help. 
-$ -jhsport <arg> JobHistoryServer port (default 0--we choose) -$ -namenode <arg> URL of the namenode (default is either the DFS -$ cluster or a temporary dir) -$ -nnport <arg> NameNode port (default 0--we choose) -$ -nodemanagers <arg> How many nodemanagers to start (default 1) -$ -nodfs Don't start a mini DFS cluster -$ -nomr Don't start a mini MR cluster -$ -rmport <arg> ResourceManager port (default 0--we choose) -$ -writeConfig <path> Save configuration to this XML file. -$ -writeDetails <path> Write basic information to this JSON file. -+---+ - - To display this full list of available arguments, the user can pass the - <<<-help>>> argument to the above command. http://git-wip-us.apache.org/repos/asf/hadoop/blob/e9d26fe9/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm b/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm deleted file mode 100644 index 52b0552..0000000 --- a/hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm +++ /dev/null @@ -1,651 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - Hadoop ${project.version} - Cluster Setup - --- - --- - ${maven.build.timestamp} - -%{toc|section=1|fromDepth=0} - -Hadoop Cluster Setup - -* {Purpose} - - This document describes how to install and configure - Hadoop clusters ranging from a few nodes to extremely large clusters - with thousands of nodes. To play with Hadoop, you may first want to - install it on a single machine (see {{{./SingleCluster.html}Single Node Setup}}). - - This document does not cover advanced topics such as {{{./SecureMode.html}Security}} or - High Availability. - -* {Prerequisites} - - * Install Java. See the {{{http://wiki.apache.org/hadoop/HadoopJavaVersions}Hadoop Wiki}} for known good versions. - * Download a stable version of Hadoop from Apache mirrors. - -* {Installation} - - Installing a Hadoop cluster typically involves unpacking the software on all - the machines in the cluster or installing it via a packaging system as - appropriate for your operating system. It is important to divide up the hardware - into functions. - - Typically one machine in the cluster is designated as the NameNode and - another machine the as ResourceManager, exclusively. These are the masters. Other - services (such as Web App Proxy Server and MapReduce Job History server) are usually - run either on dedicated hardware or on shared infrastrucutre, depending upon the load. - - The rest of the machines in the cluster act as both DataNode and NodeManager. - These are the slaves. - -* {Configuring Hadoop in Non-Secure Mode} - - Hadoop's Java configuration is driven by two types of important configuration files: - - * Read-only default configuration - <<<core-default.xml>>>, - <<<hdfs-default.xml>>>, <<<yarn-default.xml>>> and - <<<mapred-default.xml>>>. 
- - * Site-specific configuration - <<<etc/hadoop/core-site.xml>>>, - <<<etc/hadoop/hdfs-site.xml>>>, <<<etc/hadoop/yarn-site.xml>>> and - <<<etc/hadoop/mapred-site.xml>>>. - - - Additionally, you can control the Hadoop scripts found in the bin/ - directory of the distribution, by setting site-specific values via the - <<<etc/hadoop/hadoop-env.sh>>> and <<<etc/hadoop/yarn-env.sh>>>. - - To configure the Hadoop cluster you will need to configure the - <<<environment>>> in which the Hadoop daemons execute as well as the - <<<configuration parameters>>> for the Hadoop daemons. - - HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN damones - are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be - used, then the MapReduce Job History Server will also be running. For - large installations, these are generally running on separate hosts. - - -** {Configuring Environment of Hadoop Daemons} - - Administrators should use the <<<etc/hadoop/hadoop-env.sh>>> and optionally the - <<<etc/hadoop/mapred-env.sh>>> and <<<etc/hadoop/yarn-env.sh>>> scripts to do - site-specific customization of the Hadoop daemons' process environment. - - At the very least, you must specify the <<<JAVA_HOME>>> so that it is - correctly defined on each remote node. - - Administrators can configure individual daemons using the configuration - options shown below in the table: - -*--------------------------------------+--------------------------------------+ -|| Daemon || Environment Variable | -*--------------------------------------+--------------------------------------+ -| NameNode | HADOOP_NAMENODE_OPTS | -*--------------------------------------+--------------------------------------+ -| DataNode | HADOOP_DATANODE_OPTS | -*--------------------------------------+--------------------------------------+ -| Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS | -*--------------------------------------+--------------------------------------+ -| ResourceManager | YARN_RESOURCEMANAGER_OPTS | -*--------------------------------------+--------------------------------------+ -| NodeManager | YARN_NODEMANAGER_OPTS | -*--------------------------------------+--------------------------------------+ -| WebAppProxy | YARN_PROXYSERVER_OPTS | -*--------------------------------------+--------------------------------------+ -| Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS | -*--------------------------------------+--------------------------------------+ - - - For example, To configure Namenode to use parallelGC, the following - statement should be added in hadoop-env.sh : - ----- - export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC" ----- - - See <<<etc/hadoop/hadoop-env.sh>>> for other examples. - - Other useful configuration parameters that you can customize include: - - * <<<HADOOP_PID_DIR>>> - The directory where the - daemons' process id files are stored. - - * <<<HADOOP_LOG_DIR>>> - The directory where the - daemons' log files are stored. Log files are automatically created - if they don't exist. - - * <<<HADOOP_HEAPSIZE_MAX>>> - The maximum amount of - memory to use for the Java heapsize. Units supported by the JVM - are also supported here. If no unit is present, it will be assumed - the number is in megabytes. By default, Hadoop will let the JVM - determine how much to use. This value can be overriden on - a per-daemon basis using the appropriate <<<_OPTS>>> variable listed above. 
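Taken together, the environment variables discussed in this subsection might end up in <<<etc/hadoop/hadoop-env.sh>>> looking roughly like the sketch below; the heap size, GC flag, and directory paths are illustrative assumptions rather than recommended values:

----
# Illustrative etc/hadoop/hadoop-env.sh fragment (assumed example values)
export JAVA_HOME=/usr/lib/jvm/java                 # must be valid on every node
export HADOOP_HEAPSIZE_MAX=1g                      # default daemon heap; JVM units allowed
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"   # per-daemon override via the _OPTS variables
export HADOOP_PID_DIR=/var/run/hadoop              # should be writable only by the daemon user
export HADOOP_LOG_DIR=/var/log/hadoop              # log files are created here if absent
----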
- For example, setting <<<HADOOP_HEAPSIZE_MAX=1g>>> and - <<<HADOOP_NAMENODE_OPTS="-Xmx5g">>> will configure the NameNode with 5GB heap. - - In most cases, you should specify the <<<HADOOP_PID_DIR>>> and - <<<HADOOP_LOG_DIR>>> directories such that they can only be - written to by the users that are going to run the hadoop daemons. - Otherwise there is the potential for a symlink attack. - - It is also traditional to configure <<<HADOOP_PREFIX>>> in the system-wide - shell environment configuration. For example, a simple script inside - <<</etc/profile.d>>>: - ---- - HADOOP_PREFIX=/path/to/hadoop - export HADOOP_PREFIX ---- - -*--------------------------------------+--------------------------------------+ -|| Daemon || Environment Variable | -*--------------------------------------+--------------------------------------+ -| ResourceManager | YARN_RESOURCEMANAGER_HEAPSIZE | -*--------------------------------------+--------------------------------------+ -| NodeManager | YARN_NODEMANAGER_HEAPSIZE | -*--------------------------------------+--------------------------------------+ -| WebAppProxy | YARN_PROXYSERVER_HEAPSIZE | -*--------------------------------------+--------------------------------------+ -| Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_HEAPSIZE | -*--------------------------------------+--------------------------------------+ - -** {Configuring the Hadoop Daemons} - - This section deals with important parameters to be specified in - the given configuration files: - - * <<<etc/hadoop/core-site.xml>>> - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<fs.defaultFS>>> | NameNode URI | <hdfs://host:port/> | -*-------------------------+-------------------------+------------------------+ -| <<<io.file.buffer.size>>> | 131072 | | -| | | Size of read/write buffer used in SequenceFiles. | -*-------------------------+-------------------------+------------------------+ - - * <<<etc/hadoop/hdfs-site.xml>>> - - * Configurations for NameNode: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<dfs.namenode.name.dir>>> | | | -| | Path on the local filesystem where the NameNode stores the namespace | | -| | and transactions logs persistently. | | -| | | If this is a comma-delimited list of directories then the name table is | -| | | replicated in all of the directories, for redundancy. | -*-------------------------+-------------------------+------------------------+ -| <<<dfs.namenode.hosts>>> / <<<dfs.namenode.hosts.exclude>>> | | | -| | List of permitted/excluded DataNodes. | | -| | | If necessary, use these files to control the list of allowable | -| | | datanodes. | -*-------------------------+-------------------------+------------------------+ -| <<<dfs.blocksize>>> | 268435456 | | -| | | HDFS blocksize of 256MB for large file-systems. | -*-------------------------+-------------------------+------------------------+ -| <<<dfs.namenode.handler.count>>> | 100 | | -| | | More NameNode server threads to handle RPCs from large number of | -| | | DataNodes. 
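As a sketch of how the table entries above translate into an actual configuration file, the two <<<etc/hadoop/core-site.xml>>> parameters could be written out as follows (run from the top of the unpacked distribution; the NameNode hostname and port are placeholders). The <<<hdfs-site.xml>>>, <<<yarn-site.xml>>> and <<<mapred-site.xml>>> files use the same <<<configuration>>>/<<<property>>> layout:

----
$ cat > etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- NameNode URI; hostname and port below are placeholders -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn.example.com:8020/</value>
  </property>
  <!-- Read/write buffer size used in SequenceFiles -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
EOF
----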
| -*-------------------------+-------------------------+------------------------+ - - * Configurations for DataNode: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<dfs.datanode.data.dir>>> | | | -| | Comma separated list of paths on the local filesystem of a | | -| | <<<DataNode>>> where it should store its blocks. | | -| | | If this is a comma-delimited list of directories, then data will be | -| | | stored in all named directories, typically on different devices. | -*-------------------------+-------------------------+------------------------+ - - * <<<etc/hadoop/yarn-site.xml>>> - - * Configurations for ResourceManager and NodeManager: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.acl.enable>>> | | | -| | <<<true>>> / <<<false>>> | | -| | | Enable ACLs? Defaults to <false>. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.admin.acl>>> | | | -| | Admin ACL | | -| | | ACL to set admins on the cluster. | -| | | ACLs are of for <comma-separated-users><space><comma-separated-groups>. | -| | | Defaults to special value of <<*>> which means <anyone>. | -| | | Special value of just <space> means no one has access. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.log-aggregation-enable>>> | | | -| | <false> | | -| | | Configuration to enable or disable log aggregation | -*-------------------------+-------------------------+------------------------+ - - - * Configurations for ResourceManager: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.address>>> | | | -| | <<<ResourceManager>>> host:port for clients to submit jobs. | | -| | | <host:port>\ | -| | | If set, overrides the hostname set in <<<yarn.resourcemanager.hostname>>>. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.scheduler.address>>> | | | -| | <<<ResourceManager>>> host:port for ApplicationMasters to talk to | | -| | Scheduler to obtain resources. | | -| | | <host:port>\ | -| | | If set, overrides the hostname set in <<<yarn.resourcemanager.hostname>>>. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.resource-tracker.address>>> | | | -| | <<<ResourceManager>>> host:port for NodeManagers. | | -| | | <host:port>\ | -| | | If set, overrides the hostname set in <<<yarn.resourcemanager.hostname>>>. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.admin.address>>> | | | -| | <<<ResourceManager>>> host:port for administrative commands. | | -| | | <host:port>\ | -| | | If set, overrides the hostname set in <<<yarn.resourcemanager.hostname>>>. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.webapp.address>>> | | | -| | <<<ResourceManager>>> web-ui host:port. | | -| | | <host:port>\ | -| | | If set, overrides the hostname set in <<<yarn.resourcemanager.hostname>>>. 
| -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.hostname>>> | | | -| | <<<ResourceManager>>> host. | | -| | | <host>\ | -| | | Single hostname that can be set in place of setting all <<<yarn.resourcemanager*address>>> resources. Results in default ports for ResourceManager components. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.scheduler.class>>> | | | -| | <<<ResourceManager>>> Scheduler class. | | -| | | <<<CapacityScheduler>>> (recommended), <<<FairScheduler>>> (also recommended), or <<<FifoScheduler>>> | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.scheduler.minimum-allocation-mb>>> | | | -| | Minimum limit of memory to allocate to each container request at the <<<Resource Manager>>>. | | -| | | In MBs | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.scheduler.maximum-allocation-mb>>> | | | -| | Maximum limit of memory to allocate to each container request at the <<<Resource Manager>>>. | | -| | | In MBs | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.resourcemanager.nodes.include-path>>> / | | | -| <<<yarn.resourcemanager.nodes.exclude-path>>> | | | -| | List of permitted/excluded NodeManagers. | | -| | | If necessary, use these files to control the list of allowable | -| | | NodeManagers. | -*-------------------------+-------------------------+------------------------+ - - * Configurations for NodeManager: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.resource.memory-mb>>> | | | -| | Resource i.e. available physical memory, in MB, for given <<<NodeManager>>> | | -| | | Defines total available resources on the <<<NodeManager>>> to be made | -| | | available to running containers | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.vmem-pmem-ratio>>> | | | -| | Maximum ratio by which virtual memory usage of tasks may exceed | -| | physical memory | | -| | | The virtual memory usage of each task may exceed its physical memory | -| | | limit by this ratio. The total amount of virtual memory used by tasks | -| | | on the NodeManager may exceed its physical memory usage by this ratio. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.local-dirs>>> | | | -| | Comma-separated list of paths on the local filesystem where | | -| | intermediate data is written. || -| | | Multiple paths help spread disk i/o. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.log-dirs>>> | | | -| | Comma-separated list of paths on the local filesystem where logs | | -| | are written. | | -| | | Multiple paths help spread disk i/o. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.log.retain-seconds>>> | | | -| | <10800> | | -| | | Default time (in seconds) to retain log files on the NodeManager | -| | | Only applicable if log-aggregation is disabled. 
| -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.remote-app-log-dir>>> | | | -| | </logs> | | -| | | HDFS directory where the application logs are moved on application | -| | | completion. Need to set appropriate permissions. | -| | | Only applicable if log-aggregation is enabled. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.remote-app-log-dir-suffix>>> | | | -| | <logs> | | -| | | Suffix appended to the remote log dir. Logs will be aggregated to | -| | | $\{yarn.nodemanager.remote-app-log-dir\}/$\{user\}/$\{thisParam\} | -| | | Only applicable if log-aggregation is enabled. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.aux-services>>> | | | -| | mapreduce_shuffle | | -| | | Shuffle service that needs to be set for Map Reduce applications. | -*-------------------------+-------------------------+------------------------+ - - * Configurations for History Server (Needs to be moved elsewhere): - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.log-aggregation.retain-seconds>>> | | | -| | <-1> | | -| | | How long to keep aggregation logs before deleting them. -1 disables. | -| | | Be careful, set this too small and you will spam the name node. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.log-aggregation.retain-check-interval-seconds>>> | | | -| | <-1> | | -| | | Time between checks for aggregated log retention. If set to 0 or a | -| | | negative value then the value is computed as one-tenth of the | -| | | aggregated log retention time. | -| | | Be careful, set this too small and you will spam the name node. | -*-------------------------+-------------------------+------------------------+ - - * <<<etc/hadoop/mapred-site.xml>>> - - * Configurations for MapReduce Applications: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.framework.name>>> | | | -| | yarn | | -| | | Execution framework set to Hadoop YARN. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.map.memory.mb>>> | 1536 | | -| | | Larger resource limit for maps. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.map.java.opts>>> | -Xmx1024M | | -| | | Larger heap-size for child jvms of maps. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.reduce.memory.mb>>> | 3072 | | -| | | Larger resource limit for reduces. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.reduce.java.opts>>> | -Xmx2560M | | -| | | Larger heap-size for child jvms of reduces. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.task.io.sort.mb>>> | 512 | | -| | | Higher memory-limit while sorting data for efficiency. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.task.io.sort.factor>>> | 100 | | -| | | More streams merged at once while sorting files. 
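Using the example values from the <<<mapred-site.xml>>> table above, the MapReduce application settings could be sketched as a file in the same way; the memory and heap figures are the table's illustrative numbers rather than tuned recommendations, and the remaining reduce and sort properties follow the same pattern:

----
$ cat > etc/hadoop/mapred-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
</configuration>
EOF
----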
| -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.reduce.shuffle.parallelcopies>>> | 50 | | -| | | Higher number of parallel copies run by reduces to fetch outputs | -| | | from very large number of maps. | -*-------------------------+-------------------------+------------------------+ - - * Configurations for MapReduce JobHistory Server: - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.jobhistory.address>>> | | | -| | MapReduce JobHistory Server <host:port> | Default port is 10020. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.jobhistory.webapp.address>>> | | | -| | MapReduce JobHistory Server Web UI <host:port> | Default port is 19888. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.jobhistory.intermediate-done-dir>>> | /mr-history/tmp | | -| | | Directory where history files are written by MapReduce jobs. | -*-------------------------+-------------------------+------------------------+ -| <<<mapreduce.jobhistory.done-dir>>> | /mr-history/done| | -| | | Directory where history files are managed by the MR JobHistory Server. | -*-------------------------+-------------------------+------------------------+ - -* {Monitoring Health of NodeManagers} - - Hadoop provides a mechanism by which administrators can configure the - NodeManager to run an administrator supplied script periodically to - determine if a node is healthy or not. - - Administrators can determine if the node is in a healthy state by - performing any checks of their choice in the script. If the script - detects the node to be in an unhealthy state, it must print a line to - standard output beginning with the string ERROR. The NodeManager spawns - the script periodically and checks its output. If the script's output - contains the string ERROR, as described above, the node's status is - reported as <<<unhealthy>>> and the node is black-listed by the - ResourceManager. No further tasks will be assigned to this node. - However, the NodeManager continues to run the script, so that if the - node becomes healthy again, it will be removed from the blacklisted nodes - on the ResourceManager automatically. The node's health along with the - output of the script, if it is unhealthy, is available to the - administrator in the ResourceManager web interface. The time since the - node was healthy is also displayed on the web interface. - - The following parameters can be used to control the node health - monitoring script in <<<etc/hadoop/yarn-site.xml>>>. - -*-------------------------+-------------------------+------------------------+ -|| Parameter || Value || Notes | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.health-checker.script.path>>> | | | -| | Node health script | | -| | | Script to check for node's health status. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.health-checker.script.opts>>> | | | -| | Node health script options | | -| | | Options for script to check for node's health status. 
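To make the health-check contract concrete, a minimal script of the kind referenced by <<<yarn.nodemanager.health-checker.script.path>>> might look like the sketch below. The disk-usage check and the <<</var/log/hadoop>>> path are assumptions; the only requirement taken from the description above is that an unhealthy node prints a line starting with ERROR:

----
#!/bin/bash
# Hypothetical NodeManager health script: report unhealthy when the local
# log directory is nearly full. Only stdout lines beginning with "ERROR"
# matter to the NodeManager.
usage=$(df --output=pcent /var/log/hadoop 2>/dev/null | tail -n 1 | tr -dc '0-9')
if [ -n "$usage" ] && [ "$usage" -ge 95 ]; then
  echo "ERROR: /var/log/hadoop is ${usage}% full on $(hostname)"
else
  echo "OK"
fi
----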
| -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.health-checker.script.interval-ms>>> | | | -| | Node health script interval | | -| | | Time interval for running health script. | -*-------------------------+-------------------------+------------------------+ -| <<<yarn.nodemanager.health-checker.script.timeout-ms>>> | | | -| | Node health script timeout interval | | -| | | Timeout for health script execution. | -*-------------------------+-------------------------+------------------------+ - - The health checker script is not supposed to give ERROR if only some of the - local disks become bad. NodeManager has the ability to periodically check - the health of the local disks (specifically checks nodemanager-local-dirs - and nodemanager-log-dirs) and after reaching the threshold of number of - bad directories based on the value set for the config property - yarn.nodemanager.disk-health-checker.min-healthy-disks, the whole node is - marked unhealthy and this info is sent to resource manager also. The boot - disk is either raided or a failure in the boot disk is identified by the - health checker script. - -* {Slaves File} - - List all slave hostnames or IP addresses in your <<<etc/hadoop/slaves>>> - file, one per line. Helper scripts (described below) will use the - <<<etc/hadoop/slaves>>> file to run commands on many hosts at once. It is not - used for any of the Java-based Hadoop configuration. In order - to use this functionality, ssh trusts (via either passphraseless ssh or - some other means, such as Kerberos) must be established for the accounts - used to run Hadoop. - -* {Hadoop Rack Awareness} - - Many Hadoop components are rack-aware and take advantage of the - network topology for performance and safety. Hadoop daemons obtain the - rack information of the slaves in the cluster by invoking an administrator - configured module. See the {{{./RackAwareness.html}Rack Awareness}} - documentation for more specific information. - - It is highly recommended configuring rack awareness prior to starting HDFS. - -* {Logging} - - Hadoop uses the {{{http://logging.apache.org/log4j/2.x/}Apache log4j}} via the Apache Commons Logging framework for - logging. Edit the <<<etc/hadoop/log4j.properties>>> file to customize the - Hadoop daemons' logging configuration (log-formats and so on). - -* {Operating the Hadoop Cluster} - - Once all the necessary configuration is complete, distribute the files to the - <<<HADOOP_CONF_DIR>>> directory on all the machines. This should be the - same directory on all machines. - - In general, it is recommended that HDFS and YARN run as separate users. - In the majority of installations, HDFS processes execute as 'hdfs'. YARN - is typically using the 'yarn' account. - -** Hadoop Startup - - To start a Hadoop cluster you will need to start both the HDFS and YARN - cluster. - - The first time you bring up HDFS, it must be formatted. 
Format a new - distributed filesystem as <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name> ----- - - Start the HDFS NameNode with the following command on the - designated node as <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode ----- - - Start a HDFS DataNode with the following command on each - designated node as <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start datanode ----- - - If <<<etc/hadoop/slaves>>> and ssh trusted access is configured - (see {{{./SingleCluster.html}Single Node Setup}}), all of the - HDFS processes can be started with a utility script. As <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh ----- - - Start the YARN with the following command, run on the designated - ResourceManager as <yarn>: - ----- -[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon start resourcemanager ----- - - Run a script to start a NodeManager on each designated host as <yarn>: - ----- -[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon start nodemanager ----- - - Start a standalone WebAppProxy server. Run on the WebAppProxy - server as <yarn>. If multiple servers are used with load balancing - it should be run on each of them: - ----- -[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon start proxyserver ----- - - If <<<etc/hadoop/slaves>>> and ssh trusted access is configured - (see {{{./SingleCluster.html}Single Node Setup}}), all of the - YARN processes can be started with a utility script. As <yarn>: - ----- -[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh ----- - - Start the MapReduce JobHistory Server with the following command, run - on the designated server as <mapred>: - ----- -[mapred]$ $HADOOP_PREFIX/bin/mapred --daemon start historyserver ----- - -** Hadoop Shutdown - - Stop the NameNode with the following command, run on the designated NameNode - as <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon stop namenode ----- - - Run a script to stop a DataNode as <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon stop datanode ----- - - If <<<etc/hadoop/slaves>>> and ssh trusted access is configured - (see {{{./SingleCluster.html}Single Node Setup}}), all of the - HDFS processes may be stopped with a utility script. As <hdfs>: - ----- -[hdfs]$ $HADOOP_PREFIX/sbin/stop-dfs.sh ----- - - Stop the ResourceManager with the following command, run on the designated - ResourceManager as <yarn>: - ----- -[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon stop resourcemanager ----- - - Run a script to stop a NodeManager on a slave as <yarn>: - ----- -[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon stop nodemanager ----- - - If <<<etc/hadoop/slaves>>> and ssh trusted access is configured - (see {{{./SingleCluster.html}Single Node Setup}}), all of the - YARN processes can be stopped with a utility script. As <yarn>: - ----- -[yarn]$ $HADOOP_PREFIX/sbin/stop-yarn.sh ----- - - Stop the WebAppProxy server. Run on the WebAppProxy server as - <yarn>. 
If multiple servers are used with load balancing it - should be run on each of them: - ----- -[yarn]$ $HADOOP_PREFIX/bin/yarn stop proxyserver ----- - - Stop the MapReduce JobHistory Server with the following command, run on the - designated server as <mapred>: - ----- -[mapred]$ $HADOOP_PREFIX/bin/mapred --daemon stop historyserver ----- - -* {Web Interfaces} - - Once the Hadoop cluster is up and running check the web-ui of the - components as described below: - -*-------------------------+-------------------------+------------------------+ -|| Daemon || Web Interface || Notes | -*-------------------------+-------------------------+------------------------+ -| NameNode | http://<nn_host:port>/ | Default HTTP port is 50070. | -*-------------------------+-------------------------+------------------------+ -| ResourceManager | http://<rm_host:port>/ | Default HTTP port is 8088. | -*-------------------------+-------------------------+------------------------+ -| MapReduce JobHistory Server | http://<jhs_host:port>/ | | -| | | Default HTTP port is 19888. | -*-------------------------+-------------------------+------------------------+ - - http://git-wip-us.apache.org/repos/asf/hadoop/blob/e9d26fe9/hadoop-common-project/hadoop-common/src/site/apt/CommandsManual.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/apt/CommandsManual.apt.vm b/hadoop-common-project/hadoop-common/src/site/apt/CommandsManual.apt.vm deleted file mode 100644 index 67c8bc3..0000000 --- a/hadoop-common-project/hadoop-common/src/site/apt/CommandsManual.apt.vm +++ /dev/null @@ -1,327 +0,0 @@ -~~ Licensed to the Apache Software Foundation (ASF) under one or more -~~ contributor license agreements. See the NOTICE file distributed with -~~ this work for additional information regarding copyright ownership. -~~ The ASF licenses this file to You under the Apache License, Version 2.0 -~~ (the "License"); you may not use this file except in compliance with -~~ the License. You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. - - --- - Hadoop Commands Guide - --- - --- - ${maven.build.timestamp} - -%{toc} - -Hadoop Commands Guide - -* Overview - - All of the Hadoop commands and subprojects follow the same basic structure: - - Usage: <<<shellcommand [SHELL_OPTIONS] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]>>> - -*--------+---------+ -|| FIELD || Description -*-----------------------+---------------+ -| shellcommand | The command of the project being invoked. For example, - | Hadoop common uses <<<hadoop>>>, HDFS uses <<<hdfs>>>, - | and YARN uses <<<yarn>>>. -*---------------+-------------------+ -| SHELL_OPTIONS | Options that the shell processes prior to executing Java. -*-----------------------+---------------+ -| COMMAND | Action to perform. -*-----------------------+---------------+ -| GENERIC_OPTIONS | The common set of options supported by - | multiple commands. -*-----------------------+---------------+ -| COMMAND_OPTIONS | Various commands with their options are - | described in this documention for the - | Hadoop common sub-project. 
HDFS and YARN are - | covered in other documents. -*-----------------------+---------------+ - -** {Shell Options} - - All of the shell commands will accept a common set of options. For some commands, - these options are ignored. For example, passing <<<---hostnames>>> on a - command that only executes on a single host will be ignored. - -*-----------------------+---------------+ -|| SHELL_OPTION || Description -*-----------------------+---------------+ -| <<<--buildpaths>>> | Enables developer versions of jars. -*-----------------------+---------------+ -| <<<--config confdir>>> | Overwrites the default Configuration - | directory. Default is <<<${HADOOP_PREFIX}/conf>>>. -*-----------------------+----------------+ -| <<<--daemon mode>>> | If the command supports daemonization (e.g., - | <<<hdfs namenode>>>), execute in the appropriate - | mode. Supported modes are <<<start>>> to start the - | process in daemon mode, <<<stop>>> to stop the - | process, and <<<status>>> to determine the active - | status of the process. <<<status>>> will return - | an {{{http://refspecs.linuxbase.org/LSB_3.0.0/LSB-generic/LSB-generic/iniscrptact.html}LSB-compliant}} result code. - | If no option is provided, commands that support - | daemonization will run in the foreground. -*-----------------------+---------------+ -| <<<--debug>>> | Enables shell level configuration debugging information -*-----------------------+---------------+ -| <<<--help>>> | Shell script usage information. -*-----------------------+---------------+ -| <<<--hostnames>>> | A space delimited list of hostnames where to execute - | a multi-host subcommand. By default, the content of - | the <<<slaves>>> file is used. -*-----------------------+----------------+ -| <<<--hosts>>> | A file that contains a list of hostnames where to execute - | a multi-host subcommand. By default, the content of the - | <<<slaves>>> file is used. -*-----------------------+----------------+ -| <<<--loglevel loglevel>>> | Overrides the log level. Valid log levels are -| | FATAL, ERROR, WARN, INFO, DEBUG, and TRACE. -| | Default is INFO. -*-----------------------+---------------+ - -** {Generic Options} - - Many subcommands honor a common set of configuration options to alter their behavior: - -*------------------------------------------------+-----------------------------+ -|| GENERIC_OPTION || Description -*------------------------------------------------+-----------------------------+ -|<<<-archives \<comma separated list of archives\> >>> | Specify comma separated - | archives to be unarchived on - | the compute machines. Applies - | only to job. -*------------------------------------------------+-----------------------------+ -|<<<-conf \<configuration file\> >>> | Specify an application - | configuration file. -*------------------------------------------------+-----------------------------+ -|<<<-D \<property\>=\<value\> >>> | Use value for given property. -*------------------------------------------------+-----------------------------+ -|<<<-files \<comma separated list of files\> >>> | Specify comma separated files - | to be copied to the map - | reduce cluster. Applies only - | to job. -*------------------------------------------------+-----------------------------+ -|<<<-jt \<local\> or \<resourcemanager:port\>>>> | Specify a ResourceManager. - | Applies only to job. 
-*------------------------------------------------+-----------------------------+ -|<<<-libjars \<comma seperated list of jars\> >>>| Specify comma separated jar - | files to include in the - | classpath. Applies only to - | job. -*------------------------------------------------+-----------------------------+ - -Hadoop Common Commands - - All of these commands are executed from the <<<hadoop>>> shell command. They - have been broken up into {{User Commands}} and - {{Admininistration Commands}}. - -* User Commands - - Commands useful for users of a hadoop cluster. - -** <<<archive>>> - - Creates a hadoop archive. More information can be found at - {{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html} - Hadoop Archives Guide}}. - -** <<<checknative>>> - - Usage: <<<hadoop checknative [-a] [-h] >>> - -*-----------------+-----------------------------------------------------------+ -|| COMMAND_OPTION || Description -*-----------------+-----------------------------------------------------------+ -| -a | Check all libraries are available. -*-----------------+-----------------------------------------------------------+ -| -h | print help -*-----------------+-----------------------------------------------------------+ - - This command checks the availability of the Hadoop native code. See - {{{NativeLibraries.html}}} for more information. By default, this command - only checks the availability of libhadoop. - -** <<<classpath>>> - - Usage: <<<hadoop classpath [--glob|--jar <path>|-h|--help]>>> - -*-----------------+-----------------------------------------------------------+ -|| COMMAND_OPTION || Description -*-----------------+-----------------------------------------------------------+ -| --glob | expand wildcards -*-----------------+-----------------------------------------------------------+ -| --jar <path> | write classpath as manifest in jar named <path> -*-----------------+-----------------------------------------------------------+ -| -h, --help | print help -*-----------------+-----------------------------------------------------------+ - - Prints the class path needed to get the Hadoop jar and the required - libraries. If called without arguments, then prints the classpath set up by - the command scripts, which is likely to contain wildcards in the classpath - entries. Additional options print the classpath after wildcard expansion or - write the classpath into the manifest of a jar file. The latter is useful in - environments where wildcards cannot be used and the expanded classpath exceeds - the maximum supported command line length. - -** <<<credential>>> - - Usage: <<<hadoop credential <subcommand> [options]>>> - -*-------------------+-------------------------------------------------------+ -||COMMAND_OPTION || Description -*-------------------+-------------------------------------------------------+ -| create <alias> [-v <value>][-provider <provider-path>]| Prompts the user for - | a credential to be stored as the given alias when a value - | is not provided via <<<-v>>>. The - | <hadoop.security.credential.provider.path> within the - | core-site.xml file will be used unless a <<<-provider>>> is - | indicated. -*-------------------+-------------------------------------------------------+ -| delete <alias> [-i][-provider <provider-path>] | Deletes the credential with - | the provided alias and optionally warns the user when - | <<<--interactive>>> is used. 
- | The <hadoop.security.credential.provider.path> within the - | core-site.xml file will be used unless a <<<-provider>>> is - | indicated. -*-------------------+-------------------------------------------------------+ -| list [-provider <provider-path>] | Lists all of the credential aliases - | The <hadoop.security.credential.provider.path> within the - | core-site.xml file will be used unless a <<<-provider>>> is - | indicated. -*-------------------+-------------------------------------------------------+ - - Command to manage credentials, passwords and secrets within credential providers. - - The CredentialProvider API in Hadoop allows for the separation of applications - and how they store their required passwords/secrets. In order to indicate - a particular provider type and location, the user must provide the - <hadoop.security.credential.provider.path> configuration element in core-site.xml - or use the command line option <<<-provider>>> on each of the following commands. - This provider path is a comma-separated list of URLs that indicates the type and - location of a list of providers that should be consulted. For example, the following path: - <<<user:///,jceks://file/tmp/test.jceks,jceks://[email protected]/my/path/test.jceks>>> - - indicates that the current user's credentials file should be consulted through - the User Provider, that the local file located at <<</tmp/test.jceks>>> is a Java Keystore - Provider and that the file located within HDFS at <<<nn1.example.com/my/path/test.jceks>>> - is also a store for a Java Keystore Provider. - - When utilizing the credential command it will often be for provisioning a password - or secret to a particular credential store provider. In order to explicitly - indicate which provider store to use the <<<-provider>>> option should be used. Otherwise, - given a path of multiple providers, the first non-transient provider will be used. - This may or may not be the one that you intended. - - Example: <<<-provider jceks://file/tmp/test.jceks>>> - -** <<<distch>>> - - Usage: <<<hadoop distch [-f urilist_url] [-i] [-log logdir] path:owner:group:permissions>>> - -*-------------------+-------------------------------------------------------+ -||COMMAND_OPTION || Description -*-------------------+-------------------------------------------------------+ -| -f | List of objects to change -*----+------------+ -| -i | Ignore failures -*----+------------+ -| -log | Directory to log output -*-----+---------+ - - Change the ownership and permissions on many files at once. - -** <<<distcp>>> - - Copy file or directories recursively. More information can be found at - {{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html} - Hadoop DistCp Guide}}. - -** <<<fs>>> - - This command is documented in the {{{./FileSystemShell.html}File System Shell Guide}}. It is a synonym for <<<hdfs dfs>>> when HDFS is in use. - -** <<<jar>>> - - Usage: <<<hadoop jar <jar> [mainClass] args...>>> - - Runs a jar file. - - Use {{{../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html#jar}<<<yarn jar>>>}} - to launch YARN applications instead. - -** <<<jnipath>>> - - Usage: <<<hadoop jnipath>>> - - Print the computed java.library.path. - -** <<<key>>> - - Manage keys via the KeyProvider. - -** <<<trace>>> - - View and modify Hadoop tracing settings. See the {{{./Tracing.html}Tracing Guide}}. - -** <<<version>>> - - Usage: <<<hadoop version>>> - - Prints the version. 
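A few of the user commands above, run together, give a feel for the shell front end; the alias <<<mydb.password>>> is hypothetical, and the jceks URI is the placeholder path from the credential example earlier:

----
$ hadoop version
$ hadoop checknative -a
$ hadoop classpath --glob
$ hadoop credential create mydb.password -provider jceks://file/tmp/test.jceks
$ hadoop credential list -provider jceks://file/tmp/test.jceks
----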
- -** <<<CLASSNAME>>> - - Usage: <<<hadoop CLASSNAME>>> - - Runs the class named <<<CLASSNAME>>>. The class must be part of a package. - -* {Administration Commands} - - Commands useful for administrators of a hadoop cluster. - -** <<<daemonlog>>> - - Usage: <<<hadoop daemonlog -getlevel <host:port> <name> >>> - Usage: <<<hadoop daemonlog -setlevel <host:port> <name> <level> >>> - -*------------------------------+-----------------------------------------------------------+ -|| COMMAND_OPTION || Description -*------------------------------+-----------------------------------------------------------+ -| -getlevel <host:port> <name> | Prints the log level of the daemon running at - | <host:port>. This command internally connects - | to http://<host:port>/logLevel?log=<name> -*------------------------------+-----------------------------------------------------------+ -| -setlevel <host:port> <name> <level> | Sets the log level of the daemon - | running at <host:port>. This command internally - | connects to http://<host:port>/logLevel?log=<name> -*------------------------------+-----------------------------------------------------------+ - - Get/Set the log level for each daemon. - -* Files - -** <<etc/hadoop/hadoop-env.sh>> - - This file stores the global settings used by all Hadoop shell commands. - -** <<etc/hadoop/hadoop-user-functions.sh>> - - This file allows for advanced users to override some shell functionality. - -** <<~/.hadooprc>> - - This stores the personal environment for an individual user. It is - processed after the hadoop-env.sh and hadoop-user-functions.sh files - and can contain the same settings. http://git-wip-us.apache.org/repos/asf/hadoop/blob/e9d26fe9/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm b/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm deleted file mode 100644 index 98d1f57..0000000 --- a/hadoop-common-project/hadoop-common/src/site/apt/Compatibility.apt.vm +++ /dev/null @@ -1,541 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- -Apache Hadoop Compatibility - --- - --- - ${maven.build.timestamp} - -Apache Hadoop Compatibility - -%{toc|section=1|fromDepth=0} - -* Purpose - - This document captures the compatibility goals of the Apache Hadoop - project. The different types of compatibility between Hadoop - releases that affects Hadoop developers, downstream projects, and - end-users are enumerated. For each type of compatibility we: - - * describe the impact on downstream projects or end-users - - * where applicable, call out the policy adopted by the Hadoop - developers when incompatible changes are permitted. - -* Compatibility types - -** Java API - - Hadoop interfaces and classes are annotated to describe the intended - audience and stability in order to maintain compatibility with previous - releases. 
See {{{./InterfaceClassification.html}Hadoop Interface - Classification}} - for details. - - * InterfaceAudience: captures the intended audience, possible - values are Public (for end users and external projects), - LimitedPrivate (for other Hadoop components, and closely related - projects like YARN, MapReduce, HBase etc.), and Private (for intra component - use). - - * InterfaceStability: describes what types of interface changes are - permitted. Possible values are Stable, Evolving, Unstable, and Deprecated. - -*** Use Cases - - * Public-Stable API compatibility is required to ensure end-user programs - and downstream projects continue to work without modification. - - * LimitedPrivate-Stable API compatibility is required to allow upgrade of - individual components across minor releases. - - * Private-Stable API compatibility is required for rolling upgrades. - -*** Policy - - * Public-Stable APIs must be deprecated for at least one major release - prior to their removal in a major release. - - * LimitedPrivate-Stable APIs can change across major releases, - but not within a major release. - - * Private-Stable APIs can change across major releases, - but not within a major release. - - * Classes not annotated are implicitly "Private". Class members not - annotated inherit the annotations of the enclosing class. - - * Note: APIs generated from the proto files need to be compatible for - rolling-upgrades. See the section on wire-compatibility for more details. - The compatibility policies for APIs and wire-communication need to go - hand-in-hand to address this. - -** Semantic compatibility - - Apache Hadoop strives to ensure that the behavior of APIs remains - consistent over versions, though changes for correctness may result in - changes in behavior. Tests and javadocs specify the API's behavior. - The community is in the process of specifying some APIs more rigorously, - and enhancing test suites to verify compliance with the specification, - effectively creating a formal specification for the subset of behaviors - that can be easily tested. - -*** Policy - - The behavior of API may be changed to fix incorrect behavior, - such a change to be accompanied by updating existing buggy tests or adding - tests in cases there were none prior to the change. - -** Wire compatibility - - Wire compatibility concerns data being transmitted over the wire - between Hadoop processes. Hadoop uses Protocol Buffers for most RPC - communication. Preserving compatibility requires prohibiting - modification as described below. - Non-RPC communication should be considered as well, - for example using HTTP to transfer an HDFS image as part of - snapshotting or transferring MapTask output. The potential - communications can be categorized as follows: - - * Client-Server: communication between Hadoop clients and servers (e.g., - the HDFS client to NameNode protocol, or the YARN client to - ResourceManager protocol). - - * Client-Server (Admin): It is worth distinguishing a subset of the - Client-Server protocols used solely by administrative commands (e.g., - the HAAdmin protocol) as these protocols only impact administrators - who can tolerate changes that end users (which use general - Client-Server protocols) can not. 
- - * Server-Server: communication between servers (e.g., the protocol between - the DataNode and NameNode, or NodeManager and ResourceManager) - -*** Use Cases - - * Client-Server compatibility is required to allow users to - continue using the old clients even after upgrading the server - (cluster) to a later version (or vice versa). For example, a - Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster. - - * Client-Server compatibility is also required to allow users to upgrade the - client before upgrading the server (cluster). For example, a Hadoop 2.4.0 - client talking to a Hadoop 2.3.0 cluster. This allows deployment of - client-side bug fixes ahead of full cluster upgrades. Note that new cluster - features invoked by new client APIs or shell commands will not be usable. - YARN applications that attempt to use new APIs (including new fields in data - structures) that have not yet deployed to the cluster can expect link - exceptions. - - * Client-Server compatibility is also required to allow upgrading - individual components without upgrading others. For example, - upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce. - - * Server-Server compatibility is required to allow mixed versions - within an active cluster so the cluster may be upgraded without - downtime in a rolling fashion. - -*** Policy - - * Both Client-Server and Server-Server compatibility is preserved within a - major release. (Different policies for different categories are yet to be - considered.) - - * Compatibility can be broken only at a major release, though breaking compatibility - even at major releases has grave consequences and should be discussed in the Hadoop community. - - * Hadoop protocols are defined in .proto (ProtocolBuffers) files. - Client-Server protocols and Server-protocol .proto files are marked as stable. - When a .proto file is marked as stable it means that changes should be made - in a compatible fashion as described below: - - * The following changes are compatible and are allowed at any time: - - * Add an optional field, with the expectation that the code deals with the field missing due to communication with an older version of the code. - - * Add a new rpc/method to the service - - * Add a new optional request to a Message - - * Rename a field - - * Rename a .proto file - - * Change .proto annotations that effect code generation (e.g. name of java package) - - * The following changes are incompatible but can be considered only at a major release - - * Change the rpc/method name - - * Change the rpc/method parameter type or return type - - * Remove an rpc/method - - * Change the service name - - * Change the name of a Message - - * Modify a field type in an incompatible way (as defined recursively) - - * Change an optional field to required - - * Add or delete a required field - - * Delete an optional field as long as the optional field has reasonable defaults to allow deletions - - * The following changes are incompatible and hence never allowed - - * Change a field id - - * Reuse an old field that was previously deleted. - - * Field numbers are cheap and changing and reusing is not a good idea. - - -** Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI - - As Apache Hadoop revisions are upgraded end-users reasonably expect that - their applications should continue to work without any modifications. - This is fulfilled as a result of support API compatibility, Semantic - compatibility and Wire compatibility. 
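As a rough illustration of the Client-Server guarantee described above, an older client is expected to keep working against an already-upgraded cluster; the versions and path below are only examples, not a prescribed test:

+---+
# Example only: a 2.1.0 client reading from a cluster already upgraded to 2.3.0
$ bin/hadoop version
$ bin/hadoop fs -ls /user
+---+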
- - However, Apache Hadoop is a very complex, distributed system and services a - very wide variety of use-cases. In particular, Apache Hadoop MapReduce is a - very, very wide API; in the sense that end-users may make wide-ranging - assumptions such as layout of the local disk when their map/reduce tasks are - executing, environment variables for their tasks etc. In such cases, it - becomes very hard to fully specify, and support, absolute compatibility. - -*** Use cases - - * Existing MapReduce applications, including jars of existing packaged - end-user applications and projects such as Apache Pig, Apache Hive, - Cascading etc. should work unmodified when pointed to an upgraded Apache - Hadoop cluster within a major release. - - * Existing YARN applications, including jars of existing packaged - end-user applications and projects such as Apache Tez etc. should work - unmodified when pointed to an upgraded Apache Hadoop cluster within a - major release. - - * Existing applications which transfer data in/out of HDFS, including jars - of existing packaged end-user applications and frameworks such as Apache - Flume, should work unmodified when pointed to an upgraded Apache Hadoop - cluster within a major release. - -*** Policy - - * Existing MapReduce, YARN & HDFS applications and frameworks should work - unmodified within a major release i.e. Apache Hadoop ABI is supported. - - * A very minor fraction of applications maybe affected by changes to disk - layouts etc., the developer community will strive to minimize these - changes and will not make them within a minor version. In more egregious - cases, we will consider strongly reverting these breaking changes and - invalidating offending releases if necessary. - - * In particular for MapReduce applications, the developer community will - try our best to support provide binary compatibility across major - releases e.g. applications using org.apache.hadoop.mapred. - - * APIs are supported compatibly across hadoop-1.x and hadoop-2.x. See - {{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html} - Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x}} - for more details. - -** REST APIs - - REST API compatibility corresponds to both the request (URLs) and responses - to each request (content, which may contain other URLs). Hadoop REST APIs - are specifically meant for stable use by clients across releases, - even major releases. The following are the exposed REST APIs: - - * {{{../hadoop-hdfs/WebHDFS.html}WebHDFS}} - Stable - - * {{{../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html}ResourceManager}} - - * {{{../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html}NodeManager}} - - * {{{../../hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html}MR Application Master}} - - * {{{../../hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html}History Server}} - -*** Policy - - The APIs annotated stable in the text above preserve compatibility - across at least one major release, and maybe deprecated by a newer - version of the REST API in a major release. - -** Metrics/JMX - - While the Metrics API compatibility is governed by Java API compatibility, - the actual metrics exposed by Hadoop need to be compatible for users to - be able to automate using them (scripts etc.). Adding additional metrics - is compatible. Modifying (eg changing the unit or measurement) or removing - existing metrics breaks compatibility. 
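For example, operators often automate against the JMX JSON servlet that Hadoop daemons expose at /jmx; the host, port, and query below are placeholders, but renaming the metric or its MBean would break such scripts:

+---+
# Host and port are examples; the qry parameter names an MBean
$ curl 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
+---+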
Similarly, changes to JMX MBean - object names also break compatibility. - -*** Policy - - Metrics should preserve compatibility within the major release. - -** File formats & Metadata - - User and system level data (including metadata) is stored in files of - different formats. Changes to the metadata or the file formats used to - store data/metadata can lead to incompatibilities between versions. - -*** User-level file formats - - Changes to formats that end-users use to store their data can prevent - them for accessing the data in later releases, and hence it is highly - important to keep those file-formats compatible. One can always add a - "new" format improving upon an existing format. Examples of these formats - include har, war, SequenceFileFormat etc. - -**** Policy - - * Non-forward-compatible user-file format changes are - restricted to major releases. When user-file formats change, new - releases are expected to read existing formats, but may write data - in formats incompatible with prior releases. Also, the community - shall prefer to create a new format that programs must opt in to - instead of making incompatible changes to existing formats. - -*** System-internal file formats - - Hadoop internal data is also stored in files and again changing these - formats can lead to incompatibilities. While such changes are not as - devastating as the user-level file formats, a policy on when the - compatibility can be broken is important. - -**** MapReduce - - MapReduce uses formats like I-File to store MapReduce-specific data. - - -***** Policy - - MapReduce-internal formats like IFile maintain compatibility within a - major release. Changes to these formats can cause in-flight jobs to fail - and hence we should ensure newer clients can fetch shuffle-data from old - servers in a compatible manner. - -**** HDFS Metadata - - HDFS persists metadata (the image and edit logs) in a particular format. - Incompatible changes to either the format or the metadata prevent - subsequent releases from reading older metadata. Such incompatible - changes might require an HDFS "upgrade" to convert the metadata to make - it accessible. Some changes can require more than one such "upgrades". - - Depending on the degree of incompatibility in the changes, the following - potential scenarios can arise: - - * Automatic: The image upgrades automatically, no need for an explicit - "upgrade". - - * Direct: The image is upgradable, but might require one explicit release - "upgrade". - - * Indirect: The image is upgradable, but might require upgrading to - intermediate release(s) first. - - * Not upgradeable: The image is not upgradeable. - -***** Policy - - * A release upgrade must allow a cluster to roll-back to the older - version and its older disk format. The rollback needs to restore the - original data, but not required to restore the updated data. - - * HDFS metadata changes must be upgradeable via any of the upgrade - paths - automatic, direct or indirect. - - * More detailed policies based on the kind of upgrade are yet to be - considered. - -** Command Line Interface (CLI) - - The Hadoop command line programs may be use either directly via the - system shell or via shell scripts. Changing the path of a command, - removing or renaming command line options, the order of arguments, - or the command return code and output break compatibility and - may adversely affect users. 
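A minimal sketch of the kind of wrapper script that relies on this stability (the path is only an example); renaming an option, changing the listing format, or altering an exit code would break it:

+---+
#!/usr/bin/env bash
# Example wrapper: depends on the exit code of "hadoop fs -test"
# and on the output format of "hadoop fs -ls" staying stable.
if bin/hadoop fs -test -e /user/example/input ; then
  bin/hadoop fs -ls /user/example/input | awk 'NR>1 {print $NF}'
fi
+---+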
- -*** Policy - - CLI commands are to be deprecated (warning when used) for one - major release before they are removed or incompatibly modified in - a subsequent major release. - -** Web UI - - Web UI, particularly the content and layout of web pages, changes - could potentially interfere with attempts to screen scrape the web - pages for information. - -*** Policy - - Web pages are not meant to be scraped and hence incompatible - changes to them are allowed at any time. Users are expected to use - REST APIs to get any information. - -** Hadoop Configuration Files - - Users use (1) Hadoop-defined properties to configure and provide hints to - Hadoop and (2) custom properties to pass information to jobs. Hence, - compatibility of config properties is two-fold: - - * Modifying key-names, units of values, and default values of Hadoop-defined - properties. - - * Custom configuration property keys should not conflict with the - namespace of Hadoop-defined properties. Typically, users should - avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net, - file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn. - -*** Policy - - * Hadoop-defined properties are to be deprecated at least for one - major release before being removed. Modifying units for existing - properties is not allowed. - - * The default values of Hadoop-defined properties can - be changed across minor/major releases, but will remain the same - across point releases within a minor release. - - * Currently, there is NO explicit policy regarding when new - prefixes can be added/removed, and the list of prefixes to be - avoided for custom configuration properties. However, as noted above, - users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, - net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn. - -** Directory Structure - - Source code, artifacts (source and tests), user logs, configuration files, - output and job history are all stored on disk either local file system or - HDFS. Changing the directory structure of these user-accessible - files break compatibility, even in cases where the original path is - preserved via symbolic links (if, for example, the path is accessed - by a servlet that is configured to not follow symbolic links). - -*** Policy - - * The layout of source code and build artifacts can change - anytime, particularly so across major versions. Within a major - version, the developers will attempt (no guarantees) to preserve - the directory structure; however, individual files can be - added/moved/deleted. The best way to ensure patches stay in sync - with the code is to get them committed to the Apache source tree. - - * The directory structure of configuration files, user logs, and - job history will be preserved across minor and point releases - within a major release. - -** Java Classpath - - User applications built against Hadoop might add all Hadoop jars - (including Hadoop's library dependencies) to the application's - classpath. Adding new dependencies or updating the version of - existing dependencies may interfere with those in applications' - classpaths. - -*** Policy - - Currently, there is NO policy on when Hadoop's dependencies can - change. - -** Environment variables - - Users and related projects often utilize the exported environment - variables (eg HADOOP_CONF_DIR), therefore removing or renaming - environment variables is an incompatible change. - -*** Policy - - Currently, there is NO policy on when the environment variables - can change. 
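For instance, deployment scripts routinely export such variables before calling the Hadoop scripts (the path below is only an example); renaming a variable like HADOOP_CONF_DIR would silently break them:

+---+
# Example only: pin the configuration directory before invoking Hadoop commands
$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ bin/hadoop fs -ls /
+---+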
Developers try to limit changes to major releases. - -** Build artifacts - - Hadoop uses maven for project management and changing the artifacts - can affect existing user workflows. - -*** Policy - - * Test artifacts: The test jars generated are strictly for internal - use and are not expected to be used outside of Hadoop, similar to - APIs annotated @Private, @Unstable. - - * Built artifacts: The hadoop-client artifact (maven - groupId:artifactId) stays compatible within a major release, - while the other artifacts can change in incompatible ways. - -** Hardware/Software Requirements - - To keep up with the latest advances in hardware, operating systems, - JVMs, and other software, new Hadoop releases or some of their - features might require higher versions of the same. For a specific - environment, upgrading Hadoop might require upgrading other - dependent software components. - -*** Policies - - * Hardware - - * Architecture: The community has no plans to restrict Hadoop to - specific architectures, but can have family-specific - optimizations. - - * Minimum resources: While there are no guarantees on the - minimum resources required by Hadoop daemons, the community - attempts to not increase requirements within a minor release. - - * Operating Systems: The community will attempt to maintain the - same OS requirements (OS kernel versions) within a minor - release. Currently GNU/Linux and Microsoft Windows are the OSes officially - supported by the community while Apache Hadoop is known to work reasonably - well on other OSes such as Apple MacOSX, Solaris etc. - - * The JVM requirements will not change across point releases - within the same minor release except if the JVM version under - question becomes unsupported. Minor/major releases might require - later versions of JVM for some/all of the supported operating - systems. - - * Other software: The community tries to maintain the minimum - versions of additional software required by Hadoop. For example, - ssh, kerberos etc. - -* References - - Here are some relevant JIRAs and pages related to the topic: - - * The evolution of this document - - {{{https://issues.apache.org/jira/browse/HADOOP-9517}HADOOP-9517}} - - * Binary compatibility for MapReduce end-user applications between hadoop-1.x and hadoop-2.x - - {{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html} - MapReduce Compatibility between hadoop-1.x and hadoop-2.x}} - - * Annotations for interfaces as per interface classification - schedule - - {{{https://issues.apache.org/jira/browse/HADOOP-7391}HADOOP-7391}} - {{{./InterfaceClassification.html}Hadoop Interface Classification}} - - * Compatibility for Hadoop 1.x releases - - {{{https://issues.apache.org/jira/browse/HADOOP-5071}HADOOP-5071}} - - * The {{{http://wiki.apache.org/hadoop/Roadmap}Hadoop Roadmap}} page - that captures other release policies -
