Repository: hadoop Updated Branches: refs/heads/branch-2 efb7e287f -> 343cffb0e
http://git-wip-us.apache.org/repos/asf/hadoop/blob/343cffb0/hadoop-common-project/hadoop-common/src/site/markdown/SingleCluster.md.vm ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/markdown/SingleCluster.md.vm b/hadoop-common-project/hadoop-common/src/site/markdown/SingleCluster.md.vm new file mode 100644 index 0000000..78a08b0 --- /dev/null +++ b/hadoop-common-project/hadoop-common/src/site/markdown/SingleCluster.md.vm @@ -0,0 +1,231 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +#set ( $H3 = '###' ) +#set ( $H4 = '####' ) +#set ( $H5 = '#####' ) + +Hadoop: Setting up a Single Node Cluster. +========================================= + +* [Hadoop: Setting up a Single Node Cluster.](#Hadoop:_Setting_up_a_Single_Node_Cluster.) 
+ * [Purpose](#Purpose) + * [Prerequisites](#Prerequisites) + * [Supported Platforms](#Supported_Platforms) + * [Required Software](#Required_Software) + * [Installing Software](#Installing_Software) + * [Download](#Download) + * [Prepare to Start the Hadoop Cluster](#Prepare_to_Start_the_Hadoop_Cluster) + * [Standalone Operation](#Standalone_Operation) + * [Pseudo-Distributed Operation](#Pseudo-Distributed_Operation) + * [Configuration](#Configuration) + * [Setup passphraseless ssh](#Setup_passphraseless_ssh) + * [Execution](#Execution) + * [YARN on a Single Node](#YARN_on_a_Single_Node) + * [Fully-Distributed Operation](#Fully-Distributed_Operation) + +Purpose +------- + +This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS). + +Prerequisites +------------- + +$H3 Supported Platforms + +* GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes. + +* Windows is also a supported platform, but the following steps are for Linux only. To set up Hadoop on Windows, see the [wiki page](http://wiki.apache.org/hadoop/Hadoop2OnWindows). + +$H3 Required Software + +Required software for Linux includes: + +1. Java™ must be installed. Recommended Java versions are described at [HadoopJavaVersions](http://wiki.apache.org/hadoop/HadoopJavaVersions). + +2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons. +$H3 Installing Software + +If your cluster doesn't have the requisite software, you will need to install it. + +For example, on Ubuntu Linux: + + $ sudo apt-get install ssh + $ sudo apt-get install rsync + +Download +-------- + +To get a Hadoop distribution, download a recent stable release from one of the [Apache Download Mirrors](http://www.apache.org/dyn/closer.cgi/hadoop/common/). 
+ +Prepare to Start the Hadoop Cluster +----------------------------------- + +Unpack the downloaded Hadoop distribution. In the distribution, edit the file `etc/hadoop/hadoop-env.sh` to define some parameters as follows: + + # set to the root of your Java installation + export JAVA_HOME=/usr/java/latest + +Try the following command: + + $ bin/hadoop + +This will display the usage documentation for the hadoop script. + +Now you are ready to start your Hadoop cluster in one of the three supported modes: + +* [Local (Standalone) Mode](#Standalone_Operation) +* [Pseudo-Distributed Mode](#Pseudo-Distributed_Operation) +* [Fully-Distributed Mode](#Fully-Distributed_Operation) + +Standalone Operation +-------------------- + +By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging. + +The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory. + + $ mkdir input + $ cp etc/hadoop/*.xml input + $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+' + $ cat output/* + +Pseudo-Distributed Operation +---------------------------- + +Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. 
+ +$H3 Configuration + +Use the following: + +etc/hadoop/core-site.xml: + + <configuration> + <property> + <name>fs.defaultFS</name> + <value>hdfs://localhost:9000</value> + </property> + </configuration> + +etc/hadoop/hdfs-site.xml: + + <configuration> + <property> + <name>dfs.replication</name> + <value>1</value> + </property> + </configuration> + +$H3 Setup passphraseless ssh + +Now check that you can ssh to localhost without a passphrase: + + $ ssh localhost + +If you cannot ssh to localhost without a passphrase, execute the following commands: + + $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa + $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys + $ export HADOOP\_PREFIX=/usr/local/hadoop + +$H3 Execution + +The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see [YARN on a Single Node](#YARN_on_a_Single_Node). + +1. Format the filesystem: + + $ bin/hdfs namenode -format + +2. Start NameNode daemon and DataNode daemon: + + $ sbin/start-dfs.sh + + The hadoop daemon log output is written to the `$HADOOP_LOG_DIR` directory (defaults to `$HADOOP_HOME/logs`). + +3. Browse the web interface for the NameNode; by default it is available at: + + * NameNode - `http://localhost:50070/` + +4. Make the HDFS directories required to execute MapReduce jobs: + + $ bin/hdfs dfs -mkdir /user + $ bin/hdfs dfs -mkdir /user/<username> + +5. Copy the input files into the distributed filesystem: + + $ bin/hdfs dfs -put etc/hadoop input + +6. Run some of the examples provided: + + $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+' + +7. Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them: + + $ bin/hdfs dfs -get output output + $ cat output/* + + or + + View the output files on the distributed filesystem: + + $ bin/hdfs dfs -cat output/* + +8. 
When you're done, stop the daemons with: + + $ sbin/stop-dfs.sh + +$H3 YARN on a Single Node + +You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager and NodeManager daemons. + +The following instructions assume that steps 1 through 4 of [the above instructions](#Execution) have already been executed. + +1. Configure parameters as follows: `etc/hadoop/mapred-site.xml`: + + <configuration> + <property> + <name>mapreduce.framework.name</name> + <value>yarn</value> + </property> + </configuration> + + `etc/hadoop/yarn-site.xml`: + + <configuration> + <property> + <name>yarn.nodemanager.aux-services</name> + <value>mapreduce_shuffle</value> + </property> + </configuration> + +2. Start ResourceManager daemon and NodeManager daemon: + + $ sbin/start-yarn.sh + +3. Browse the web interface for the ResourceManager; by default it is available at: + + * ResourceManager - `http://localhost:8088/` + +4. Run a MapReduce job. + +5. When you're done, stop the daemons with: + + $ sbin/stop-yarn.sh + +Fully-Distributed Operation +--------------------------- + +For information on setting up fully-distributed, non-trivial clusters see [Cluster Setup](./ClusterSetup.html). http://git-wip-us.apache.org/repos/asf/hadoop/blob/343cffb0/hadoop-common-project/hadoop-common/src/site/markdown/SingleNodeSetup.md ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/markdown/SingleNodeSetup.md b/hadoop-common-project/hadoop-common/src/site/markdown/SingleNodeSetup.md new file mode 100644 index 0000000..fae8b5c --- /dev/null +++ b/hadoop-common-project/hadoop-common/src/site/markdown/SingleNodeSetup.md @@ -0,0 +1,20 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
+ You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +Single Node Setup +================= + +This page will be removed in the next major release. + +See [Single Cluster Setup](./SingleCluster.html) to set up and configure a single-node Hadoop installation. http://git-wip-us.apache.org/repos/asf/hadoop/blob/343cffb0/hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md b/hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md new file mode 100644 index 0000000..8c9fb72 --- /dev/null +++ b/hadoop-common-project/hadoop-common/src/site/markdown/Superusers.md @@ -0,0 +1,106 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. 
+--> + +Proxy user - Superusers Acting On Behalf Of Other Users +======================================================= + +* [Proxy user - Superusers Acting On Behalf Of Other Users](#Proxy_user_-_Superusers_Acting_On_Behalf_Of_Other_Users) + * [Introduction](#Introduction) + * [Use Case](#Use_Case) + * [Code example](#Code_example) + * [Configurations](#Configurations) + * [Caveats](#Caveats) + +Introduction +------------ + +This document describes how a superuser can submit jobs or access HDFS on behalf of another user. + +Use Case +-------- + +The code example described in the next section applies to the following use case. + +A superuser with username 'super' wants to submit a job and access HDFS on behalf of user joe. The superuser has Kerberos credentials but user joe doesn't have any. The tasks are required to run as user joe, and any file accesses on the namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super's Kerberos credentials. In other words, super is impersonating the user joe. + +Some products such as Apache Oozie need this. + +Code example +------------ + +In this example super's credentials are used for login and a proxy user ugi object is created for joe. The operations are performed within the doAs method of this proxy user ugi object. + + ... + //Create ugi for joe. The login user is 'super'. 
+ UserGroupInformation ugi = + UserGroupInformation.createProxyUser("joe", UserGroupInformation.getLoginUser()); + ugi.doAs(new PrivilegedExceptionAction<Void>() { + public Void run() throws Exception { + //Submit a job + JobClient jc = new JobClient(conf); + jc.submitJob(conf); + //OR access hdfs + FileSystem fs = FileSystem.get(conf); + fs.mkdirs(someFilePath); + return null; + } + }); + +Configurations +-------------- + +You can configure a proxy user using the property `hadoop.proxyuser.$superuser.hosts` along with either or both of `hadoop.proxyuser.$superuser.groups` and `hadoop.proxyuser.$superuser.users`. + +By specifying as below in core-site.xml, the superuser named `super` can connect only from `host1` and `host2` to impersonate a user belonging to `group1` or `group2`. + + <property> + <name>hadoop.proxyuser.super.hosts</name> + <value>host1,host2</value> + </property> + <property> + <name>hadoop.proxyuser.super.groups</name> + <value>group1,group2</value> + </property> + +If these configurations are not present, impersonation will not be allowed and the connection will fail. + +If more lax security is preferred, the wildcard value \* may be used to allow impersonation from any host or of any user. For example, by specifying as below in core-site.xml, the user named `oozie` accessing from any host can impersonate any user belonging to any group. + + <property> + <name>hadoop.proxyuser.oozie.hosts</name> + <value>*</value> + </property> + <property> + <name>hadoop.proxyuser.oozie.groups</name> + <value>*</value> + </property> + +The `hadoop.proxyuser.$superuser.hosts` property accepts a list of IP addresses, IP address ranges in CIDR format and/or host names. For example, by specifying as below, the user named `super` accessing from hosts in the range `10.222.0.0/16` (`10.222.0.0` through `10.222.255.255`) or from `10.113.221.221` can impersonate `user1` and `user2`. 
+ + <property> + <name>hadoop.proxyuser.super.hosts</name> + <value>10.222.0.0/16,10.113.221.221</value> + </property> + <property> + <name>hadoop.proxyuser.super.users</name> + <value>user1,user2</value> + </property> + +Caveats +------- + +If the cluster is running in [Secure Mode](./SecureMode.html), the superuser must have kerberos credentials to be able to impersonate another user. + +It cannot use delegation tokens for this feature. It would be wrong if superuser adds its own delegation token to the proxy user ugi, as it will allow the proxy user to connect to the service with the privileges of the superuser. + +However, if the superuser does want to give a delegation token to joe, it must first impersonate joe and get a delegation token for joe, in the same way as the code example above, and add it to the ugi of joe. In this way the delegation token will have the owner as joe. http://git-wip-us.apache.org/repos/asf/hadoop/blob/343cffb0/hadoop-common-project/hadoop-common/src/site/markdown/Tracing.md ---------------------------------------------------------------------- diff --git a/hadoop-common-project/hadoop-common/src/site/markdown/Tracing.md b/hadoop-common-project/hadoop-common/src/site/markdown/Tracing.md new file mode 100644 index 0000000..3ef35b2 --- /dev/null +++ b/hadoop-common-project/hadoop-common/src/site/markdown/Tracing.md @@ -0,0 +1,209 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. 
+--> + +Enabling Dapper-like Tracing in Hadoop +====================================== + +* [Enabling Dapper-like Tracing in Hadoop](#Enabling_Dapper-like_Tracing_in_Hadoop) + * [Dapper-like Tracing in Hadoop](#Dapper-like_Tracing_in_Hadoop) + * [HTrace](#HTrace) + * [Samplers](#Samplers) + * [SpanReceivers](#SpanReceivers) + * [Setting up ZipkinSpanReceiver](#Setting_up_ZipkinSpanReceiver) + * [Dynamic update of tracing configuration](#Dynamic_update_of_tracing_configuration) + * [Starting tracing spans by HTrace API](#Starting_tracing_spans_by_HTrace_API) + * [Sample code for tracing](#Sample_code_for_tracing) + +Dapper-like Tracing in Hadoop +----------------------------- + +### HTrace + +[HDFS-5274](https://issues.apache.org/jira/browse/HDFS-5274) added support for tracing requests through HDFS, +using the open source tracing library, +[Apache HTrace](https://git-wip-us.apache.org/repos/asf/incubator-htrace.git). +Setting up tracing is quite simple; however, it requires some very minor changes to your client code. + +### Samplers + +Configure the samplers in `core-site.xml` property: `hadoop.htrace.sampler`. +The value can be NeverSampler, AlwaysSampler or ProbabilitySampler. +NeverSampler: HTrace is OFF for all spans; +AlwaysSampler: HTrace is ON for all spans; +ProbabilitySampler: HTrace is ON for some percentage of top-level spans. + + <property> + <name>hadoop.htrace.sampler</name> + <value>NeverSampler</value> + </property> + +### SpanReceivers + +The tracing system works by collecting information in structs called 'Spans'. +It is up to you to choose how you want to receive this information +by implementing the SpanReceiver interface, which defines one method: + + public void receiveSpan(Span span); + +Configure which SpanReceivers you'd like to use +by putting a comma-separated list of the fully-qualified class names of classes implementing SpanReceiver +in `core-site.xml` property: `hadoop.htrace.spanreceiver.classes`. 
+ + <property> + <name>hadoop.htrace.spanreceiver.classes</name> + <value>org.apache.htrace.impl.LocalFileSpanReceiver</value> + </property> + <property> + <name>hadoop.htrace.local-file-span-receiver.path</name> + <value>/var/log/hadoop/htrace.out</value> + </property> + +You can omit the package name prefix if you use a span receiver bundled with HTrace. + + <property> + <name>hadoop.htrace.spanreceiver.classes</name> + <value>LocalFileSpanReceiver</value> + </property> + +### Setting up ZipkinSpanReceiver + +Instead of implementing SpanReceiver yourself, +you can use `ZipkinSpanReceiver` which uses +[Zipkin](https://github.com/twitter/zipkin) for collecting and displaying tracing data. + +In order to use `ZipkinSpanReceiver`, +you need to download and set up [Zipkin](https://github.com/twitter/zipkin) first. + +You also need to add the jar of `htrace-zipkin` to the classpath of Hadoop on each node. +Here is an example setup procedure. + + $ git clone https://github.com/cloudera/htrace + $ cd htrace/htrace-zipkin + $ mvn compile assembly:single + $ cp target/htrace-zipkin-*-jar-with-dependencies.jar $HADOOP_HOME/share/hadoop/common/lib/ + +The sample configuration for `ZipkinSpanReceiver` is shown below. +By adding these to `core-site.xml` of NameNode and DataNodes, `ZipkinSpanReceiver` is initialized at startup. +You also need this configuration on the client node in addition to the servers. + + <property> + <name>hadoop.htrace.spanreceiver.classes</name> + <value>ZipkinSpanReceiver</value> + </property> + <property> + <name>hadoop.htrace.zipkin.collector-hostname</name> + <value>192.168.1.2</value> + </property> + <property> + <name>hadoop.htrace.zipkin.collector-port</name> + <value>9410</value> + </property> + +### Dynamic update of tracing configuration + +You can use the `hadoop trace` command to view and update the tracing configuration of each server. +You must specify the IPC server address of the namenode or datanode with the `-host` option. 
+You need to run the command against all servers if you want to update the configuration of all servers. + +`hadoop trace -list` shows the list of loaded span receivers and their associated IDs. + + $ hadoop trace -list -host 192.168.56.2:9000 + ID CLASS + 1 org.apache.htrace.impl.LocalFileSpanReceiver + + $ hadoop trace -list -host 192.168.56.2:50020 + ID CLASS + 1 org.apache.htrace.impl.LocalFileSpanReceiver + +`hadoop trace -remove` removes a span receiver from the server. +The `-remove` option takes the ID of the span receiver as its argument. + + $ hadoop trace -remove 1 -host 192.168.56.2:9000 + Removed trace span receiver 1 + +`hadoop trace -add` adds a span receiver to the server. +You need to specify the class name of the span receiver as the argument of the `-class` option. +You can specify the configuration associated with the span receiver using `-Ckey=value` options. + + $ hadoop trace -add -class LocalFileSpanReceiver -Chadoop.htrace.local-file-span-receiver.path=/tmp/htrace.out -host 192.168.56.2:9000 + Added trace span receiver 2 with configuration hadoop.htrace.local-file-span-receiver.path = /tmp/htrace.out + + $ hadoop trace -list -host 192.168.56.2:9000 + ID CLASS + 2 org.apache.htrace.impl.LocalFileSpanReceiver + +### Starting tracing spans by HTrace API + +In order to trace, you will need to wrap the traced logic in a **tracing span** as shown below. +When there are running tracing spans, +the tracing information is propagated to servers along with RPC requests. + +In addition, you need to initialize `SpanReceiver` once per process. + + import org.apache.hadoop.hdfs.HdfsConfiguration; + import org.apache.hadoop.tracing.SpanReceiverHost; + import org.apache.htrace.Sampler; + import org.apache.htrace.Trace; + import org.apache.htrace.TraceScope; + + ... + + SpanReceiverHost.getInstance(new HdfsConfiguration()); + + ... + + TraceScope ts = Trace.startSpan("Gets", Sampler.ALWAYS); + try { + ... 
// traced logic + } finally { + if (ts != null) ts.close(); + } + +### Sample code for tracing + +The `TracingFsShell.java` shown below is a wrapper of FsShell +which starts a tracing span before invoking the HDFS shell command. + + import org.apache.hadoop.conf.Configuration; + import org.apache.hadoop.fs.FsShell; + import org.apache.hadoop.tracing.SpanReceiverHost; + import org.apache.hadoop.util.ToolRunner; + import org.apache.htrace.Sampler; + import org.apache.htrace.Trace; + import org.apache.htrace.TraceScope; + + public class TracingFsShell { + public static void main(String argv[]) throws Exception { + Configuration conf = new Configuration(); + FsShell shell = new FsShell(); + conf.setQuietMode(false); + shell.setConf(conf); + SpanReceiverHost.getInstance(conf); + int res = 0; + TraceScope ts = null; + try { + ts = Trace.startSpan("FsShell", Sampler.ALWAYS); + res = ToolRunner.run(shell, argv); + } finally { + shell.close(); + if (ts != null) ts.close(); + } + System.exit(res); + } + } + +You can compile and execute this code as shown below. 
+ + $ javac -cp `hadoop classpath` TracingFsShell.java + $ java -cp .:`hadoop classpath` TracingFsShell -ls / http://git-wip-us.apache.org/repos/asf/hadoop/blob/343cffb0/hadoop-project/src/site/site.xml ---------------------------------------------------------------------- diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml index 8254c9f..6aa226c 100644 --- a/hadoop-project/src/site/site.xml +++ b/hadoop-project/src/site/site.xml @@ -53,6 +53,7 @@ <item name="Hadoop Commands Reference" href="hadoop-project-dist/hadoop-common/CommandsManual.html"/> <item name="FileSystem Shell" href="hadoop-project-dist/hadoop-common/FileSystemShell.html"/> <item name="Hadoop Compatibility" href="hadoop-project-dist/hadoop-common/Compatibility.html"/> + <item name="Interface Classification" href="hadoop-project-dist/hadoop-common/InterfaceClassification.html"/> <item name="FileSystem Specification" href="hadoop-project-dist/hadoop-common/filesystem/index.html"/> </menu> @@ -61,6 +62,7 @@ <item name="CLI Mini Cluster" href="hadoop-project-dist/hadoop-common/CLIMiniCluster.html"/> <item name="Native Libraries" href="hadoop-project-dist/hadoop-common/NativeLibraries.html"/> <item name="Proxy User" href="hadoop-project-dist/hadoop-common/Superusers.html"/> + <item name="Rack Awareness" href="hadoop-project-dist/hadoop-common/RackAwareness.html"/> <item name="Secure Mode" href="hadoop-project-dist/hadoop-common/SecureMode.html"/> <item name="Service Level Authorization" href="hadoop-project-dist/hadoop-common/ServiceLevelAuth.html"/> <item name="HTTP Authentication" href="hadoop-project-dist/hadoop-common/HttpAuthentication.html"/>
