Added: 
hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml
URL: 
http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml?rev=629361&view=auto
==============================================================================
--- 
hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml 
(added)
+++ 
hadoop/core/trunk/src/docs/src/documentation/content/xdocs/hod_user_guide.xml 
Tue Feb 19 21:17:48 2008
@@ -0,0 +1,506 @@
+<?xml version="1.0"?>
+
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN"
+          "http://forrest.apache.org/dtd/document-v20.dtd">
+<document>
+  <header>
+    <title>
+      Hadoop On Demand 0.4 User Guide
+    </title>
+  </header>
+
+<body>
+  <section>
+    <title> Introduction </title><anchor id="Introduction"></anchor>
+  <p>Hadoop On Demand (HOD) is a system for provisioning virtual Hadoop 
clusters over a large physical cluster. It uses the Torque resource manager to 
do node allocation. On the allocated nodes, it can start Hadoop Map/Reduce and 
HDFS daemons. It automatically generates the appropriate configuration files 
(hadoop-site.xml) for the Hadoop daemons and client. HOD also has the 
capability to distribute Hadoop to the nodes in the virtual cluster that it 
allocates. In short, HOD makes it easy for administrators and users to quickly 
setup and use Hadoop. It is also a very useful tool for Hadoop developers and 
testers who need to share a physical cluster for testing their own Hadoop 
versions.</p>
+  <p>HOD 0.4 supports Hadoop from version 0.15 onwards.</p>
+  <p>The rest of this document comprises a quick-start guide that helps 
you get started quickly with HOD, a more detailed guide to all HOD 
features and command line options, and known issues and trouble-shooting 
information.</p>
+  </section>
+  <section>
+               <title> Getting Started Using HOD 0.4 </title><anchor 
id="Getting_Started_Using_HOD_0_4"></anchor>
+  <p>This section gives a step-by-step introduction to the most basic HOD 
operations. These steps assume that HOD 0.4 and its dependent hardware and 
software components have been set up and configured correctly, a step that is 
generally performed by the system administrators of the cluster.</p>
+  <p>The HOD 0.4 user interface is a command line utility called 
<code>hod</code>. It is driven by a configuration file that is typically set up 
for users by system administrators. Users can override this configuration when 
running <code>hod</code>, as described later in this documentation. The 
configuration file can be specified in two ways: </p>
+  <ul>
+    <li> Specify it on the command line, using the -c option, as in <code>hod -c 
path-to-the-configuration-file other-options</code></li>
+    <li> Set the environment variable <em>HOD_CONF_DIR</em> in the environment 
where <code>hod</code> will be run. It should point to a directory on the 
local file system containing a file called <em>hodrc</em>. Note that this is 
analogous to <em>HADOOP_CONF_DIR</em> and the <em>hadoop-site.xml</em> file for 
Hadoop. If no configuration file is specified on the command line, 
<code>hod</code> looks for the <em>HOD_CONF_DIR</em> environment variable 
and a <em>hodrc</em> file under that directory.</li>
+    </ul>
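+  <p>Putting the two together, a session might begin in either of the following 
equivalent ways; the paths shown are illustrative:</p>
+  <table>
+    <tr>
+      <td><code>$ hod -c ~/hod-conf-dir/hodrc -o "allocate ~/hod-clusters/test 5"</code></td>
+    </tr>
+    <tr>
+      <td><code>$ export HOD_CONF_DIR=~/hod-conf-dir</code><br />
+          <code>$ hod -o "allocate ~/hod-clusters/test 5"</code></td>
+    </tr>
+  </table>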
+  <p>In the examples listed below, we shall not explicitly show the 
configuration option, assuming it has been correctly specified.</p>
+  <p><code>hod</code> can be used in two modes, the <em>operation</em> mode 
and the <em>script</em> mode. We shall describe the two modes in detail 
below.</p>
+  <section><title> HOD <em>Operation</em> Mode </title><anchor 
id="HOD_Operation_Mode"></anchor>
+  <p>A typical session of HOD using this option will involve at least three 
steps: allocate, run hadoop jobs, deallocate. In order to use this mode, 
perform the following steps.</p>
+  <p><strong> Create a Cluster Directory </strong></p><anchor 
id="Create_a_Cluster_Directory"></anchor>
+  <p>The <em>cluster directory</em> is a directory on the local file system 
where <code>hod</code> will generate the Hadoop configuration, 
<em>hadoop-site.xml</em>, corresponding to the cluster it allocates. Create 
this directory and pass it to the <code>hod</code> operations as stated below. 
Once a cluster is allocated, a user can utilize it to run Hadoop jobs by 
specifying the cluster directory as the Hadoop --config option. </p>
+  <p><strong> Operation <em>allocate</em></strong></p><anchor 
id="Operation_allocate"></anchor>
+  <p>The <em>allocate</em> operation is used to allocate a set of nodes and 
install and provision Hadoop on them. It has the following syntax:</p>
+    <table>
+      
+        <tr>
+          <td><code>$ hod -o "allocate cluster_dir number_of_nodes"</code></td>
+        </tr>
+      
+    </table>
+  <p>If the command completes successfully, then 
<code>cluster_dir/hadoop-site.xml</code> will be generated and will contain 
information about the allocated cluster. It will also print out the information 
about the Hadoop web UIs.</p>
+  <p>An example run of this command produces the following output. Note in 
this example that <code>~/hod-clusters/test</code> is the cluster directory, 
and we are allocating 5 nodes:</p>
+  <table>
+    <tr>
+      <td><code>$ hod -o "allocate ~/hod-clusters/test 5"</code><br/>
+      <code>INFO - HDFS UI on http://foo1.bar.com:53422</code><br/>
+      <code>INFO - Mapred UI on http://foo2.bar.com:55380</code><br/></td>
+      </tr>
+   </table>
+  <p><strong> Running Hadoop jobs using the allocated cluster 
</strong></p><anchor id="Running_Hadoop_jobs_using_the_al"></anchor>
+  <p>Now, one can run Hadoop jobs using the allocated cluster in the usual 
manner. This assumes variables like <em>JAVA_HOME</em> and the path to the 
Hadoop installation are set up correctly:</p>
+    <table>
+      
+        <tr>
+          <td><code>$ hadoop --config cluster_dir hadoop_command 
hadoop_command_args</code></td>
+        </tr>
+      
+    </table>
+  <p>or</p>
+    <table>
+      
+        <tr>
+          <td><code>$ export HADOOP_CONF_DIR=cluster_dir</code> <br />
+              <code>$ hadoop hadoop_command hadoop_command_args</code></td>
+        </tr>
+      
+    </table>
+  <p>Continuing our example, the following command will run a wordcount 
example on the allocated cluster:</p>
+  <table><tr><td><code>$ hadoop --config ~/hod-clusters/test jar 
/path/to/hadoop/hadoop-examples.jar wordcount /path/to/input 
/path/to/output</code></td></tr></table>
+  <p>or</p>
+  <table><tr>
+    <td><code>$ export HADOOP_CONF_DIR=~/hod-clusters/test</code><br />
+    <code>$ hadoop jar /path/to/hadoop/hadoop-examples.jar wordcount 
/path/to/input /path/to/output</code></td>
+    </tr>
+  </table>
+  <p><strong> Operation <em>deallocate</em></strong></p><anchor 
id="Operation_deallocate"></anchor>
+  <p>The <em>deallocate</em> operation is used to release an allocated 
cluster. When finished with a cluster, deallocate must be run so that the nodes 
become free for others to use. The <em>deallocate</em> operation has the 
following syntax:</p>
+    <table>
+      
+        <tr>
+          <td><code>$ hod -o "deallocate cluster_dir"</code></td>
+        </tr>
+      
+    </table>
+  <p>Continuing our example, the following command will deallocate the 
cluster:</p>
+  <table><tr><td><code>$ hod -o "deallocate 
~/hod-clusters/test"</code></td></tr></table>
+  <p>As can be seen, when used in the <em>operation</em> mode, HOD allows 
users to allocate a cluster and use it flexibly for running Hadoop jobs. For 
example, users can run multiple jobs in parallel on the same cluster, by 
running hadoop from multiple shells pointing to the same configuration.</p>
+       </section>
+  <section><title> HOD <em>Script</em> Mode </title><anchor 
id="HOD_Script_Mode"></anchor>
+  <p>The HOD <em>script mode</em> combines the operations of allocating, using 
and deallocating a cluster into a single operation. This is very useful for 
users who want to run a script of hadoop jobs and let HOD handle the cleanup 
automatically once the script completes. In order to use <code>hod</code> in 
the script mode, do the following:</p>
+  <p><strong> Create a script file </strong></p><anchor 
id="Create_a_script_file"></anchor>
+  <p>This will be a regular shell script that will typically contain hadoop 
commands, such as:</p>
+  <table><tr><td><code>$ hadoop jar jar_file options</code></td>
+  </tr></table>
+  <p>However, the user can add any valid commands as part of the script. HOD 
will execute this script setting <em>HADOOP_CONF_DIR</em> automatically to 
point to the allocated cluster. So users do not need to worry about this. They 
also do not need to create a cluster directory as in the <em>operation</em> 
mode.</p>
+  <p><strong> Running the script </strong></p><anchor 
id="Running_the_script"></anchor>
+  <p>The syntax for the <em>script mode</em> is as follows:</p>
+    <table>
+      
+        <tr>
+          <td><code>$ hod -m number_of_nodes -z script_file</code></td>
+        </tr>
+      
+    </table>
+  <p>Note that HOD will deallocate the cluster as soon as the script 
completes, and this means that the script must not complete until the hadoop 
jobs themselves are completed. Users must take care of this while writing the 
script. </p>
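+  <p>For illustration, a complete script file might look like the following; 
the jar and HDFS paths are placeholders. Because the <code>hadoop jar</code> 
command blocks until the job finishes, a script like this does not complete 
while its jobs are still running:</p>
+  <table><tr><td><code>#!/bin/sh</code><br />
+    <code># HADOOP_CONF_DIR is set by HOD to point to the allocated cluster</code><br />
+    <code>hadoop jar /path/to/hadoop/hadoop-examples.jar wordcount 
/path/to/input /path/to/output</code><br />
+    <code>hadoop dfs -cat /path/to/output/*</code></td>
+  </tr></table>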
+   </section>
+  </section>
+  <section>
+               <title> HOD 0.4 Features </title><anchor 
id="HOD_0_4_Features"></anchor>
+  <section><title> Provisioning and Managing Hadoop Clusters </title><anchor 
id="Provisioning_and_Managing_Hadoop"></anchor>
+  <p>The primary feature of HOD is to provision Hadoop Map/Reduce and HDFS 
clusters. This is described above in the Getting Started section. Also, as long 
as nodes are available, and organizational policies allow, a user can use HOD 
to allocate multiple Map/Reduce clusters simultaneously. The user would need to 
specify different paths for the <code>cluster_dir</code> parameter mentioned 
above for each cluster he/she allocates. HOD provides the <em>list</em> and the 
<em>info</em> operations to enable managing multiple clusters.</p>
+  <p><strong> Operation <em>list</em></strong></p><anchor 
id="Operation_list"></anchor>
+  <p>The list operation lists all the clusters allocated so far by a user. For 
each cluster, it shows the cluster directory where the hadoop-site.xml is 
stored, and its status with respect to connectivity with the JobTracker and/or 
HDFS. The list operation has the following syntax:</p>
+    <table>
+      
+        <tr>
+          <td><code>$ hod -o "list"</code></td>
+        </tr>
+      
+    </table>
+  <p><strong> Operation <em>info</em></strong></p><anchor 
id="Operation_info"></anchor>
+  <p>The info operation shows information about a given cluster. The 
information shown includes the Torque job id, and locations of the important 
daemons like the HOD Ringmaster process, and the Hadoop JobTracker and NameNode 
daemons. The info operation has the following syntax:</p>
+    <table>
+      
+        <tr>
+          <td><code>$ hod -o "info cluster_dir"</code></td>
+        </tr>
+      
+    </table>
+  <p>The <code>cluster_dir</code> should be a valid cluster directory 
specified in an earlier <em>allocate</em> operation.</p>
+  </section>
+  <section><title> Using a tarball to distribute Hadoop </title><anchor 
id="Using_a_tarball_to_distribute_Ha"></anchor>
+  <p>When provisioning Hadoop, HOD can use either a pre-installed Hadoop on 
the cluster nodes or distribute and install a Hadoop tarball as part of the 
provisioning operation. If the tarball option is being used, there is no need 
to have Hadoop pre-installed on the cluster nodes; even if it is pre-installed, 
it need not be used. This is especially useful in a development / QE environment 
where individual developers may have different versions of Hadoop to test on a 
shared cluster. </p>
+  <p>In order to use a pre-installed Hadoop, you must specify, in the hodrc, 
the <code>pkgs</code> option in the <code>gridservice-hdfs</code> and 
<code>gridservice-mapred</code> sections. This must point to the path where 
Hadoop is installed on all nodes of the cluster.</p>
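+  <p>For example, the relevant sections of the hodrc might contain entries like 
the following; the installation path shown is illustrative:</p>
+  <table><tr><td><code>[gridservice-mapred]</code><br />
+    <code>pkgs = /opt/hadoop-0.15.3</code></td></tr>
+    <tr><td><code>[gridservice-hdfs]</code><br />
+    <code>pkgs = /opt/hadoop-0.15.3</code></td></tr>
+  </table>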
+  <p>The tarball option can be used in both the <em>operation</em> and 
<em>script</em> modes. </p>
+  <p>In the operation mode, the syntax is as follows:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -t hadoop_tarball_location -o "allocate cluster_dir 
number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p>For example, the following command allocates Hadoop provided by the 
tarball <code>~/share/hadoop.tar.gz</code>:</p>
+  <table><tr><td><code>$ hod -t ~/share/hadoop.tar.gz -o "allocate 
~/hadoop-cluster 10"</code></td></tr></table>
+  <p>In the script mode, the syntax is as follows:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -t hadoop_tarball_location -m number_of_nodes -z 
script_file</code></td>
+        </tr>
+    </table>
+  <p>The hadoop_tarball_location specified in the syntax above should point to 
a path on a shared file system that is accessible from all the compute nodes. 
Currently, HOD only supports NFS mounted file systems.</p>
+  <p><em>Note:</em></p>
+  <ul>
+    <li> For better distribution performance it is recommended that the Hadoop 
tarball contain only the libraries and binaries, and not the source or 
documentation.</li>
+    <li> When you want to run jobs against a cluster allocated using the 
tarball, you must use a compatible version of hadoop to submit your jobs. The 
best would be to untar and use the version that is present in the tarball 
itself.</li>
+  </ul>
+  </section>
+  <section><title> Using an external HDFS </title><anchor 
id="Using_an_external_HDFS"></anchor>
+  <p>In typical Hadoop clusters provisioned by HOD, HDFS is already set up 
statically (without using HOD). This allows data to persist in HDFS after the 
HOD provisioned clusters are deallocated. To use a statically configured HDFS, 
your hodrc must point to an external HDFS. Specifically, set the following 
options to the correct values in the section <code>gridservice-hdfs</code> of 
the hodrc:</p>
+   <table><tr><td>external = true</td></tr><tr><td>host = Hostname of the HDFS 
NameNode</td></tr><tr><td>fs_port = Port number of the HDFS 
NameNode</td></tr><tr><td>info_port = Port number of the HDFS NameNode web 
UI</td></tr></table>
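+  <p>Concretely, the section might look like the following in the hodrc; the 
host name and port numbers are illustrative:</p>
+  <table><tr><td><code>[gridservice-hdfs]</code><br />
+    <code>external = true</code><br />
+    <code>host = namenode.foo.com</code><br />
+    <code>fs_port = 50040</code><br />
+    <code>info_port = 50070</code></td></tr></table>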
+  <p><em>Note:</em> This option can also be enabled from the command line. That 
is, to use a static HDFS, you would run: <br />
+    </p>
+    <table>
+        <tr>
+          <td><code>$ hod --gridservice-hdfs.external -o "allocate cluster_dir 
number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p>HOD can be used to provision an HDFS cluster as well as a Map/Reduce 
cluster, if required. To do so, set the following option in the section 
<code>gridservice-hdfs</code> of the hodrc:</p>
+  <table><tr><td>external = false</td></tr></table>
+  </section>
+  <section><title> Options for Configuring Hadoop </title><anchor 
id="Options_for_Configuring_Hadoop"></anchor>
+  <p>HOD provides a very convenient mechanism to configure both the Hadoop 
daemons that it provisions and also the hadoop-site.xml that it generates on 
the client side. This is done by specifying Hadoop configuration parameters in 
either the HOD configuration file, or from the command line when allocating 
clusters.</p>
+  <p><strong> Configuring Hadoop Daemons </strong></p><anchor 
id="Configuring_Hadoop_Daemons"></anchor>
+  <p>For configuring the Hadoop daemons, you can do the following:</p>
+  <p>For Map/Reduce, specify the options as a comma separated list of 
key-value pairs to the <code>server-params</code> option in the 
<code>gridservice-mapred</code> section. Likewise for a dynamically provisioned 
HDFS cluster, specify the options in the <code>server-params</code> option in 
the <code>gridservice-hdfs</code> section. If these parameters should be marked 
as <em>final</em>, then include these in the <code>final-server-params</code> 
option of the appropriate section.</p>
+  <p>For example:</p>
+  <table><tr><td><code>server-params = 
mapred.reduce.parallel.copies=20,io.sort.factor=100,io.sort.mb=128,io.file.buffer.size=131072</code></td></tr><tr><td><code>final-server-params
 = 
mapred.child.java.opts=-Xmx512m,dfs.block.size=134217728,fs.inmemory.size.mb=128</code></td>
+  </tr></table>
+  <p>In order to provide the options from command line, you can use the 
following syntax:</p>
+  <p>For configuring the Map/Reduce daemons use:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -Mmapred.reduce.parallel.copies=20 
-Mio.sort.factor=100 -o "allocate cluster_dir number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p>In the example above, the <em>mapred.reduce.parallel.copies</em> 
parameter and the <em>io.sort.factor</em> parameter will be appended to the 
other <code>server-params</code>, or, if they already exist in 
<code>server-params</code>, will override them. To mark these as 
<em>final</em> parameters, you can use:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -Fmapred.reduce.parallel.copies=20 
-Fio.sort.factor=100 -o "allocate cluster_dir number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p>However, note that final parameters cannot be overridden from the command 
line. They can only be appended if not already specified.</p>
+  <p>Similar options exist for configuring dynamically provisioned HDFS 
daemons. To do so, replace -M with -H and -F with -S.</p>
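+  <p>For example, an HDFS counterpart of the commands above might be the 
following; the parameter names are standard Hadoop HDFS settings, used here 
purely for illustration:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -Hdfs.block.size=134217728 -Sdfs.replication=3 -o 
"allocate cluster_dir number_of_nodes"</code></td>
+        </tr>
+    </table>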
+  <p><strong> Configuring Hadoop Job Submission (Client) Programs 
</strong></p><anchor id="Configuring_Hadoop_Job_Submissio"></anchor>
+  <p>As mentioned above, if the allocation operation completes successfully 
then <code>cluster_dir/hadoop-site.xml</code> will be generated and will 
contain information about the allocated cluster's JobTracker and NameNode. This 
configuration is used when submitting jobs to the cluster. HOD provides an 
option to include additional Hadoop configuration parameters into this file. 
The syntax for doing so is as follows:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -Cmapred.userlog.limit.kb=200 
-Cmapred.child.java.opts=-Xmx512m -o "allocate cluster_dir 
number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p>In this example, the <em>mapred.userlog.limit.kb</em> and 
<em>mapred.child.java.opts</em> options will be included into the 
hadoop-site.xml that is generated by HOD.</p>
+  </section>
+  <section><title> Viewing Hadoop Web-UIs </title><anchor 
id="Viewing_Hadoop_Web_UIs"></anchor>
+  <p>The HOD allocation operation prints the JobTracker and NameNode web UI 
URLs. For example:</p>
+   <table><tr><td><code>$ hod -c ~/hod-conf-dir/hodrc -o "allocate 
~/hadoop-cluster 10"</code><br/>
+    <code>INFO - HDFS UI on http://host242.foo.com:55391</code><br/>
+    <code>INFO - Mapred UI on http://host521.foo.com:54874</code>
+    </td></tr></table>
+  <p>The same information is also available via the <em>info</em> operation 
described above.</p>
+  </section>
+  <section><title> Collecting and Viewing Hadoop Logs </title><anchor 
id="Collecting_and_Viewing_Hadoop_Lo"></anchor>
+  <p>To get the Hadoop logs of the daemons running on one of the allocated 
nodes: </p>
+  <ul>
+    <li> Log into the node of interest. If you want to look at the logs of the 
JobTracker or NameNode, then you can find the node running these by using the 
<em>list</em> and <em>info</em> operations mentioned above.</li>
+    <li> Get the process information of the daemon of interest (for example, 
<code>ps ux | grep TaskTracker</code>)</li>
+    <li> In the process information, search for the value of the variable 
<code>-Dhadoop.log.dir</code>. Typically this will be a descendant directory of 
the <code>hodring.temp-dir</code> value from the hod configuration file.</li>
+    <li> Change to the <code>hadoop.log.dir</code> directory to view daemon 
and user logs.</li>
+  </ul>
+  <p>HOD also provides a mechanism to collect logs when a cluster is being 
deallocated and persist them into a file system, or an externally configured 
HDFS. By doing so, these logs can be viewed after the jobs are completed and 
the nodes are released. In order to do so, configure the log-destination-uri to 
a URI as follows:</p>
+   <table><tr><td><code>log-destination-uri = 
hdfs://host123:45678/user/hod/logs</code> or</td></tr>
+    <tr><td><code>log-destination-uri = 
file://path/to/store/log/files</code></td></tr>
+    </table>
+  <p>Under the root directory specified above in the path, HOD will create a 
path user_name/torque_jobid and store gzipped log files for each node 
that was part of the job.</p>
+  <p>Note that to store the files to HDFS, you may need to configure the 
<code>hodring.pkgs</code> option with the Hadoop version that matches the HDFS 
mentioned. If not, HOD will try to use the Hadoop version that it is using to 
provision the Hadoop cluster itself.</p>
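+  <p>As a sketch, the corresponding hodrc entries might look like the 
following; the URI, path and version shown are illustrative:</p>
+  <table><tr><td><code>[hodring]</code><br />
+    <code>log-destination-uri = hdfs://host123:45678/user/hod/logs</code><br />
+    <code>pkgs = /opt/hadoop-0.15.3</code></td></tr></table>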
+  </section>
+  <section><title> Auto-deallocation of Idle Clusters </title><anchor 
id="Auto_deallocation_of_Idle_Cluste"></anchor>
+  <p>HOD automatically deallocates clusters that are not running Hadoop jobs 
for a given period of time. Each HOD allocation includes a monitoring facility 
that constantly checks for running Hadoop jobs. If it detects no running Hadoop 
jobs for a given period, it will automatically deallocate its own cluster and 
thus free up nodes which are not being used effectively.</p>
+  <p><em>Note:</em> When a cluster is automatically deallocated in this way, 
the <em>cluster directory</em> is not cleaned up. The user must still 
deallocate the cluster through the regular <em>deallocate</em> operation to 
clean it up.</p>
+       </section>
+  <section><title> Specifying Additional Job Attributes </title><anchor 
id="Specifying_Additional_Job_Attrib"></anchor>
+  <p>HOD allows the user to specify a wallclock time and a name (or title) for 
a Torque job. </p>
+  <p>The wallclock time is the estimated amount of time for which the Torque 
job will be valid. After this time has expired, Torque will automatically 
delete the job and free up the nodes. Specifying the wallclock time can also 
help the job scheduler to better schedule jobs, and help improve utilization of 
cluster resources.</p>
+  <p>To specify the wallclock time, use the following syntax:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -l time_in_seconds -o "allocate cluster_dir 
number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p>The name or title of a Torque job helps in user-friendly identification 
of the job. The string specified here will show up in all information where 
Torque job attributes are displayed, including the <code>qstat</code> 
command.</p>
+  <p>To specify the name or title, use the following syntax:</p>
+    <table>
+        <tr>
+          <td><code>$ hod -N name_of_job -o "allocate cluster_dir 
number_of_nodes"</code></td>
+        </tr>
+    </table>
+  <p><em>Note:</em> Due to a restriction in the underlying Torque resource 
manager, names that do not start with an alphabetic character, or that contain 
a space, will cause the job to fail. The failure message points to the problem 
being in the specified job name.</p>
+  </section>
+  <section><title> Capturing HOD exit codes in Torque </title><anchor 
id="Capturing_HOD_exit_codes_in_Torq"></anchor>
+  <p>HOD exit codes are captured in the Torque exit_status field. This will 
help users and system administrators to distinguish successful runs from 
unsuccessful runs of HOD. The exit codes are 0 if allocation succeeded and all 
hadoop jobs ran on the allocated cluster correctly. They are non-zero if 
allocation failed or some of the hadoop jobs failed on the allocated cluster. 
The exit codes that are possible are mentioned in the table below. <em>Note: 
Hadoop job status is captured only if the version of Hadoop used is 16 or 
above.</em></p>
+  <table>
+    
+      <tr>
+        <td> Exit Code </td>
+        <td> Meaning </td>
+      </tr>
+      <tr>
+        <td> 6 </td>
+        <td> Ringmaster failure </td>
+      </tr>
+      <tr>
+        <td> 7 </td>
+        <td> DFS failure </td>
+      </tr>
+      <tr>
+        <td> 8 </td>
+        <td> Job tracker failure </td>
+      </tr>
+      <tr>
+        <td> 10 </td>
+        <td> Cluster dead </td>
+      </tr>
+      <tr>
+        <td> 12 </td>
+        <td> Cluster already allocated </td>
+      </tr>
+      <tr>
+        <td> 13 </td>
+        <td> HDFS dead </td>
+      </tr>
+      <tr>
+        <td> 14 </td>
+        <td> Mapred dead </td>
+      </tr>
+      <tr>
+        <td> 16 </td>
+        <td> All Map/Reduce jobs that ran on the cluster failed. Refer to 
hadoop logs for more details. </td>
+      </tr>
+      <tr>
+        <td> 17 </td>
+        <td> Some of the Map/Reduce jobs that ran on the cluster failed. Refer 
to hadoop logs for more details. </td>
+      </tr>
+    
+  </table>
+  </section>
+       </section>
+  <section>
+               <title> Command Line Options </title><anchor 
id="Command_Line_Options"></anchor>
+  <p>Command line options for the <code>hod</code> command are used for two 
purposes: defining an operation that HOD must perform, and defining 
configuration options for customizing HOD that override options defined in the 
default configuration file. This section covers both types of options. </p>
+  <section><title> Options Defining Operations </title><anchor 
id="Options_Defining_Operations"></anchor>
+  <p><em>--help</em><br />
+    Prints out the help message to see the basic options.</p>
+  <p><em>--verbose-help</em><br />
+    All configuration options provided in the hodrc file can be passed on the 
command line, using the syntax <code>--section_name.option_name[=value]</code>. 
When provided this way, the value provided on command line overrides the option 
provided in hodrc. The verbose-help command lists all the available options in 
the hodrc file. This is also a nice way to see the meaning of the configuration 
options.</p>
+  <p><em>-o "operation_name options"</em><br />
+    This class of options is used to define the <em>operation</em> mode of 
HOD. <em>Note:</em> The operation_name and other options must be specified 
within double quotes.</p>
+  <p><em>-o "help"</em><br />
+    Lists the operations available in the <em>operation</em> mode.</p>
+  <p><em>-o "allocate cluster_dir number_of_nodes"</em><br />
+    Allocates a cluster on the given number of cluster nodes, and stores the 
allocation information in cluster_dir for use with subsequent 
<code>hadoop</code> commands. Note that the <code>cluster_dir</code> must exist 
before running the command.</p>
+  <p><em>-o "list"</em><br />
+    Lists the clusters allocated by this user. Information provided includes 
the Torque job id corresponding to the cluster, the cluster directory where the 
allocation information is stored, and whether the Map/Reduce daemon is still 
active or not.</p>
+  <p><em>-o "info cluster_dir"</em><br />
+    Lists information about the cluster whose allocation information is stored 
in the specified cluster directory.</p>
+  <p><em>-o "deallocate cluster_dir"</em><br />
+    Deallocates the cluster whose allocation information is stored in the 
specified cluster directory.</p>
+  <p><em>-z script_file</em><br />
+    Runs HOD in <em>script mode</em>. Provisions Hadoop on a given number of 
nodes, executes the given script from the submitting node, and deallocates the 
cluster when the script completes. Refer to option <em>-m</em>.</p>
+  </section>
+  <section><title> Options Configuring HOD </title><anchor 
id="Options_Configuring_HOD"></anchor>
+  <p>As described above, HOD is configured using a configuration file that is 
usually set up by system administrators. This is an INI-style configuration 
file, divided into sections, with options inside each section. Each section 
relates to one of the HOD processes: client, ringmaster, hodring, mapreduce or 
hdfs. Each option inside a section comprises an option name and a value. </p>
+  <p>Users can override the configuration defined in the default configuration 
in two ways: </p>
+  <ul>
+    <li> Users can supply their own configuration file to HOD in each of the 
commands, using the <code>-c</code> option.</li>
+    <li> Users can supply specific configuration options to HOD on the command 
line. Options provided on the command line <em>override</em> the values 
provided in the configuration file being used.</li>
+  </ul>
+  <p>This section describes some of the most commonly used configuration 
options. These commonly used options are provided with a <em>short</em> option 
for convenience of specification. All other options can be specified using a 
<em>long</em> option that is also described below.</p>
+  <p><em>-c config_file</em><br />
+    Provides the configuration file to use. Can be used with all other options 
of HOD. Alternatively, the <code>HOD_CONF_DIR</code> environment variable can 
be defined to specify a directory that contains a file named 
<code>hodrc</code>, alleviating the need to specify the configuration file in 
each HOD command.</p>
+  <p><em>-b 1|2|3|4</em><br />
+    Enables the given debug level. Can be used with all other options of HOD. 
4 is most verbose.</p>
+  <p><em>-t hadoop_tarball</em><br />
+    Provisions Hadoop from the given tar.gz file. This option is only 
applicable to the <em>allocate</em> operation. For better distribution 
performance it is strongly recommended that the Hadoop tarball is created 
<em>after</em> removing the source or documentation.</p>
+  <p><em>-m number_of_nodes</em><br />
+    When used in the <em>script</em> mode, this specifies the number of nodes 
to allocate. Note that this option is useful only in the script mode.</p>
+  <p><em>-N job-name</em><br />
+    The name to give to the resource manager job that HOD uses underneath. For 
example, in the case of Torque, this translates to the <code>qsub -N</code> 
option, and can be seen as the job name using the <code>qstat</code> command.</p>
+  <p><em>-l wall-clock-time</em><br />
+    The amount of time for which the user expects to have work on the 
allocated cluster. This is passed to the resource manager underneath HOD, and 
can be used in more efficient scheduling and utilization of the cluster. Note 
that in the case of Torque, the cluster is automatically deallocated after this 
time expires.</p>
+  <p><em>-j java-home</em><br />
+    Path to be set as the JAVA_HOME environment variable. This is used in the 
<em>script</em> mode. HOD sets the JAVA_HOME environment variable to this value 
and launches the user script in that environment.</p>
+  <p><em>-A account-string</em><br />
+    Accounting information to pass to the underlying resource manager.</p>
+  <p><em>-Q queue-name</em><br />
+    Name of the queue in the underlying resource manager to which the job must 
be submitted.</p>
+  <p><em>-Mkey1=value1 -Mkey2=value2</em><br />
+    Provides configuration parameters for the provisioned Map/Reduce daemons 
(JobTracker and TaskTrackers). A hadoop-site.xml is generated with these values 
on the cluster nodes. <br />
+    <em>Note:</em> Values containing any of the characters space, comma, 
equals or semicolon must be escaped with a '\' character and enclosed 
within quotes. A '\' itself can also be escaped with a '\'. </p>
+  <p><em>-Hkey1=value1 -Hkey2=value2</em><br />
+    Provides configuration parameters for the provisioned HDFS daemons 
(NameNode and DataNodes). A hadoop-site.xml is generated with these values on 
the cluster nodes <br />
+    <em>Note:</em> Values containing any of the characters space, comma, 
equals or semicolon must be escaped with a '\' character and enclosed 
within quotes. A '\' itself can also be escaped with a '\'. </p>
+  <p><em>-Ckey1=value1 -Ckey2=value2</em><br />
+    Provides configuration parameters for the client from where jobs can be 
submitted. A hadoop-site.xml is generated with these values on the submit node. 
<br />
+    <em>Note:</em> Values which have the following characters: space, comma, 
equal-to, semi-colon need to be escaped with a '\' character, and need to be 
enclosed within quotes. You can escape a '\' with a '\' too. </p>
+  <p><em>--section-name.option-name=value</em><br />
+    This is the method to provide options using the <em>long</em> format. 
For example, you could say <em>--hod.script-wait-time=20</em>.</p>
+               </section>
+       </section>
+       <section>
+         <title> Troubleshooting </title><anchor id="Troubleshooting"></anchor>
+  <p>The following section identifies some of the most likely error 
conditions users can run into when using HOD, and ways to troubleshoot 
them.</p>
+  <section><title><code>hod</code> Hangs During Allocation </title><anchor 
id="_hod_Hangs_During_Allocation"></anchor><anchor 
id="hod_Hangs_During_Allocation"></anchor>
+  <p><em>Possible Cause:</em> One of the HOD or Hadoop components has failed 
to come up. In such a case, the <code>hod</code> command will return after a 
few minutes (typically 2-3 minutes) with an error code of either 7 or 8, as 
defined in the Error Codes section. Refer to that section for further 
details.</p>
+  <p><em>Possible Cause:</em> A large allocation was fired with a tarball. 
Sometimes, due to load in the network or on the allocated nodes, the tarball 
distribution can be significantly slow and take a couple of minutes to 
complete. Wait for it to finish. Also check that the tarball does not include 
the Hadoop sources or documentation.</p>
+  <p><em>Possible Cause:</em> A Torque related problem. If the cause is Torque 
related, the <code>hod</code> command will not return for more than 5 minutes. 
Running <code>hod</code> in debug mode may show the <code>qstat</code> command 
being executed repeatedly. Executing the <code>qstat</code> command from a 
separate shell may show that the job is in the <code>Q</code> (Queued) state. 
This usually indicates a problem with Torque. Possible causes could include 
some nodes being down, or new nodes added that Torque is not aware of. 
Generally, system administrator help is needed to resolve this problem.</p>
+    </section>
+  <section><title><code>hod</code> Hangs During Deallocation </title><anchor 
id="_hod_Hangs_During_Deallocation"></anchor><anchor 
id="hod_Hangs_During_Deallocation"></anchor>
+  <p><em>Possible Cause:</em> A Torque related problem, usually load on the 
Torque server, or the allocation is very large. Generally, waiting for the 
command to complete is the only option.</p>
+  </section>
+  <section><title><code>hod</code> Fails With an error code and error message 
</title><anchor id="hod_Fails_With_an_error_code_and"></anchor><anchor 
id="_hod_Fails_With_an_error_code_an"></anchor>
+  <p>If the exit code of the <code>hod</code> command is not <code>0</code>, 
then refer to the following table of error exit codes to determine why the 
error may have occurred and how to debug the situation.</p>
+  <p><strong> Error Codes </strong></p><anchor id="Error_Codes"></anchor>
+  <table>
+    
+      <tr>
+        <th>Error Code</th>
+        <th>Meaning</th>
+        <th>Possible Causes and Remedial Actions</th>
+      </tr>
+      <tr>
+        <td> 1 </td>
+        <td> Configuration error </td>
+        <td> Incorrect configuration values specified in hodrc, or other 
errors related to HOD configuration. The error messages in this case must be 
sufficient to debug and fix the problem. </td>
+      </tr>
+      <tr>
+        <td> 2 </td>
+        <td> Invalid operation </td>
+        <td> Do <code>hod -o "help"</code> for the list of valid operations. 
</td>
+      </tr>
+      <tr>
+        <td> 3 </td>
+        <td> Invalid operation arguments </td>
+        <td> Do <code>hod -o "help"</code> for the list of valid operations. 
Note that for an <em>allocate</em> operation, the directory argument must 
specify an existing directory. </td>
+      </tr>
+      <tr>
+        <td> 4 </td>
+        <td> Scheduler failure </td>
+        <td> 1. Requested more resources than available. Run <code>checknodes 
cluster_name</code> to see if enough nodes are available. <br />
+          2. Torque is misconfigured, the path to Torque binaries is 
misconfigured, or other Torque problems. Contact system administrator. </td>
+      </tr>
+      <tr>
+        <td> 5 </td>
+        <td> Job execution failure </td>
+        <td> 1. Torque Job was deleted from outside. Execute the Torque 
<code>qstat</code> command to see if you have any jobs in the <code>R</code> 
(Running) state. If none exist, try re-executing HOD. <br />
+          2. Torque problems such as the server momentarily going down, or 
becoming unresponsive. Contact system administrator. </td>
+      </tr>
+      <tr>
+        <td> 6 </td>
+        <td> Ringmaster failure </td>
+        <td> 1. Invalid configuration in the <code>ringmaster</code> 
section.<br />
+          2. Invalid <code>pkgs</code> option in the 
<code>gridservice-mapred</code> or <code>gridservice-hdfs</code> section.<br />
+          3. An invalid Hadoop tarball.<br />
+          4. A Hadoop version mismatch between the Map/Reduce daemons and an 
external HDFS.<br />
+          The Torque <code>qstat</code> command will most likely show a job in 
the <code>C</code> (Completed) state. Refer to the section <em>Locating 
Ringmaster Logs</em> below for more information. </td>
+      </tr>
+      <tr>
+        <td> 7 </td>
+        <td> DFS failure </td>
+        <td> 1. Problem in starting Hadoop clusters. Review the Hadoop related 
configuration. Look at the Hadoop logs using information specified in 
<em>Getting Hadoop Logs</em> section above. <br />
+          2. Invalid configuration in the <code>hodring</code> section of 
hodrc. <code>ssh</code> to all allocated nodes (determined by <code>qstat -f 
torque_job_id</code>) and grep for <code>ERROR</code> or <code>CRITICAL</code> 
in hodring logs. Refer to the section <em>Locating Hodring Logs</em> below for 
more information. <br />
+          3. Invalid tarball specified which is not packaged correctly. <br />
+          4. Cannot communicate with an externally configured HDFS. </td>
+      </tr>
+      <tr>
+        <td> 8 </td>
+        <td> Job tracker failure </td>
+        <td> Similar to the causes in <em>DFS failure</em> case. </td>
+      </tr>
+      <tr>
+        <td> 10 </td>
+        <td> Cluster dead </td>
+        <td> 1. Cluster was auto-deallocated because it was idle for a long 
time. <br />
+          2. Cluster was auto-deallocated because the wallclock time specified 
by the system administrator or user was exceeded. <br />
+          3. Cannot communicate with the JobTracker and HDFS NameNode which 
were successfully allocated. Deallocate the cluster, and allocate again. </td>
+      </tr>
+      <tr>
+        <td> 12 </td>
+        <td> Cluster already allocated </td>
+        <td> The cluster directory specified has been used in a previous 
allocate operation and is not yet deallocated. Specify a different directory, 
or deallocate the previous allocation first. </td>
+      </tr>
+      <tr>
+        <td> 13 </td>
+        <td> HDFS dead </td>
+        <td> Cannot communicate with the HDFS NameNode. HDFS NameNode went 
down. </td>
+      </tr>
+      <tr>
+        <td> 14 </td>
+        <td> Mapred dead </td>
+        <td> 1. Cluster was auto-deallocated because it was idle for a long 
time. <br />
+          2. Cluster was auto-deallocated because the wallclock time specified 
by the system administrator or user was exceeded. <br />
+          3. Cannot communicate with the Map/Reduce JobTracker. JobTracker 
node went down. <br />
+          </td>
+      </tr>
+      <tr>
+        <td> 15 </td>
+        <td> Cluster not allocated </td>
+        <td> An operation which requires an allocated cluster is given a 
cluster directory with no state information. </td>
+      </tr>
+    
+  </table>
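As a sketch, a wrapper script can branch on the exit status using the codes in the table above. The function name is our own and the messages merely paraphrase the table; the `hod` command line in the comment is hypothetical.

```shell
#!/bin/sh
# Sketch: map hod's exit status to the meanings in the error-code
# table so a wrapper script can react to common failures.
hod_status_msg() {
  case "$1" in
    0)  printf 'success\n' ;;
    1)  printf 'configuration error: check hodrc\n' ;;
    4)  printf 'scheduler failure: try checknodes cluster_name\n' ;;
    7)  printf 'DFS failure: grep hodring logs for ERROR/CRITICAL\n' ;;
    8)  printf 'job tracker failure: see DFS failure causes\n' ;;
    *)  printf 'see the Error Codes table (code %s)\n' "$1" ;;
  esac
}

# Typical use after an allocation attempt (command line is hypothetical):
#   hod -o "allocate ~/hod-clusters/test 5"; hod_status_msg $?
hod_status_msg 4
```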
+    </section>
+  <section><title> Hadoop Jobs Not Running on a Successfully Allocated Cluster 
</title><anchor id="Hadoop_Jobs_Not_Running_on_a_Suc"></anchor>
+  <p>This scenario generally occurs when a cluster is allocated and left 
inactive for some time, after which Hadoop jobs are attempted on it. The 
Hadoop jobs then fail with the following exception:</p>
+  <table><tr><td><code>08/01/25 16:31:40 INFO ipc.Client: Retrying connect to 
server: foo.bar.com/1.1.1.1:53567. Already tried 1 
time(s).</code></td></tr></table>
+  <p><em>Possible Cause:</em> No Hadoop jobs were run for a significant 
portion of time. Thus the cluster would have been deallocated as described in 
the section <em>Auto-deallocation of Idle Clusters</em>. Deallocate the 
cluster and allocate it again.</p>
+  <p><em>Possible Cause:</em> The wallclock limit specified by the Torque 
administrator, or via the <code>-l</code> option described in the section 
<em>Specifying Additional Job Attributes</em>, was exceeded since allocation 
time. Thus the cluster would have been released. Deallocate the cluster and 
allocate it again.</p>
+  <p><em>Possible Cause:</em> There is a version mismatch between the version 
of Hadoop being used in provisioning (typically via the tarball option) and 
the external HDFS. Ensure compatible versions are being used.</p>
+  <p><em>Possible Cause:</em> There is a version mismatch between the version 
of the Hadoop client being used to submit jobs and the Hadoop version used in 
provisioning (typically via the tarball option). Ensure compatible versions 
are being used.</p>
+  <p><em>Possible Cause:</em> You used one of the options for specifying 
Hadoop configuration (<code>-M</code> or <code>-H</code>) with values 
containing special characters like space or comma that were not escaped 
correctly. Refer to the section <em>Options Configuring HOD</em> for how to 
specify such options correctly.</p>
+    </section>
+  <section><title> My Hadoop Job Got Killed </title><anchor 
id="My_Hadoop_Job_Got_Killed"></anchor>
+  <p><em>Possible Cause:</em> The wallclock limit specified by the Torque 
administrator, or via the <code>-l</code> option described in the section 
<em>Specifying Additional Job Attributes</em>, was exceeded since allocation 
time. Thus the cluster would have been released. Deallocate the cluster and 
allocate it again, this time with a larger wallclock time.</p>
+  <p><em>Possible Cause:</em> Problems with the JobTracker node. Refer to the 
section in <em>Collecting and Viewing Hadoop Logs</em> to get more 
information.</p>
+    </section>
+  <section><title> Hadoop Job Fails with Message: 'Job tracker still 
initializing' </title><anchor id="Hadoop_Job_Fails_with_Message_Jo"></anchor>
+  <p><em>Possible Cause:</em> The Hadoop job was run as part of the HOD 
script command, and it started before the JobTracker could come up fully. 
Allocate the cluster using a larger value for the configuration option 
<code>--hod.script-wait-time</code>. Typically a value of 120 should work, 
though it is rarely necessary to be that large.</p>
+    </section>
+  <section><title> The Exit Codes For HOD Are Not Getting Into Torque 
</title><anchor id="The_Exit_Codes_For_HOD_Are_Not_G"></anchor>
+  <p><em>Possible Cause:</em> Version 0.16 of Hadoop is required for this 
functionality to work, and the version of Hadoop being used does not match. 
Use the required version of Hadoop.</p>
+  <p><em>Possible Cause:</em> The deallocation was done without using the 
<code>hod</code> command; for example, directly using <code>qdel</code>. When 
the cluster is deallocated in this manner, the HOD processes are terminated 
using signals. This results in the exit code being based on the signal 
number, rather than the exit code of the program.</p>
+    </section>
+  <section><title> The Hadoop Logs are Not Uploaded to DFS </title><anchor 
id="The_Hadoop_Logs_are_Not_Uploaded"></anchor>
+  <p><em>Possible Cause:</em> There is a version mismatch between the version 
of Hadoop being used for uploading the logs and the external HDFS. Ensure 
that the correct version is specified in the <code>hodring.pkgs</code> 
option.</p>
+    </section>
+  <section><title> Locating Ringmaster Logs </title><anchor 
id="Locating_Ringmaster_Logs"></anchor>
+  <p>To locate the ringmaster logs, follow these steps: </p>
+  <ul>
+    <li> Execute <code>hod</code> in the debug mode using the 
<code>-b</code> option. This will print the Torque job id for the current 
run.</li>
+    <li> Execute <code>qstat -f torque_job_id</code> and look up the value of 
the <code>exec_host</code> parameter in the output. The first host in this list 
is the ringmaster node.</li>
+    <li> Login to this node.</li>
+    <li> The ringmaster log location is specified by the 
<code>ringmaster.log-dir</code> option in the hodrc. The name of the log file 
will be <code>username.torque_job_id/ringmaster-main.log</code>.</li>
+    <li> If you don't get enough information, you may want to set the 
ringmaster debug level to 4. This can be done by passing 
<code>--ringmaster.debug 4</code> to the hod command line.</li>
+  </ul>
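The steps above can be sketched in a few lines of shell. The `exec_host` value, log directory and Torque job id here are made-up examples, not output from a real cluster:

```shell
#!/bin/sh
# Sketch: parse a saved `qstat -f` exec_host value (the first host is
# the ringmaster node) and build the expected ringmaster log path.
exec_host='node1/0+node2/0+node3/0'
ringmaster_node=${exec_host%%/*}      # first host, before the first '/'
printf 'ringmaster node: %s\n' "$ringmaster_node"

# The log file is username.torque_job_id/ringmaster-main.log under the
# directory given by ringmaster.log-dir in the hodrc (path is an example):
log_dir=/var/log/hod
printf '%s\n' "$log_dir/alice.1234.torque-server/ringmaster-main.log"
```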
+  </section>
+  <section><title> Locating Hodring Logs </title><anchor 
id="Locating_Hodring_Logs"></anchor>
+  <p>To locate hodring logs, follow the steps below: </p>
+  <ul>
+    <li> Execute <code>hod</code> in the debug mode using the 
<code>-b</code> option. This will print the Torque job id for the current 
run.</li>
+    <li> Execute <code>qstat -f torque_job_id</code> and look up the value of 
the <code>exec_host</code> parameter in the output. All nodes in this list 
should have a hodring on them.</li>
+    <li> Login to any of these nodes.</li>
+    <li> The hodring log location is specified by the 
<code>hodring.log-dir</code> option in the hodrc. The name of the log file will 
be <code>username.torque_job_id/hodring-main.log</code>.</li>
+    <li> If you don't get enough information, you may want to set the hodring 
debug level to 4. This can be done by passing <code>--hodring.debug 4</code> to 
the hod command line.</li>
+  </ul>
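Since every node in the `exec_host` list runs a hodring, the steps above can be sketched as a loop over the node list. The node names, log directory and job id are made-up examples, and the `ssh` command is only printed, not executed:

```shell
#!/bin/sh
# Sketch: extract every node from a saved `qstat -f` exec_host value
# and show where each node's hodring log would be checked for errors.
exec_host='node1/0+node2/0+node3/0'
nodes=$(printf '%s\n' "$exec_host" | sed -e 's|/[0-9]*||g' -e 's/+/ /g')
for n in $nodes; do
  printf 'ssh %s grep -E "ERROR|CRITICAL" %s\n' \
    "$n" "/var/log/hod/alice.1234/hodring-main.log"
done
```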
+  </section>
+       </section>
+</body>
+</document>

Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml
URL: 
http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml?rev=629361&r1=629360&r2=629361&view=diff
==============================================================================
--- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml 
(original)
+++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/site.xml Tue Feb 
19 21:17:48 2008
@@ -41,7 +41,11 @@
     <mapred    label="Map-Reduce Tutorial" href="mapred_tutorial.html" />
     <mapred    label="Native Hadoop Libraries" href="native_libraries.html" />
     <streaming label="Streaming"          href="streaming.html" />
-    <hod       label="Hadoop On Demand"   href="hod.html" />
+    <hod       label="Hadoop On Demand" href="hod.html">
+      <hod-user-guide href="hod_user_guide.html"/>
+      <hod-admin-guide href="hod_admin_guide.html"/>
+      <hod-config-guide href="hod_config_guide.html"/>
+    </hod>
     <api       label="API Docs"           href="ext:api/index" />
     <wiki      label="Wiki"               href="ext:wiki" />
     <faq       label="FAQ"                href="ext:faq" />
@@ -63,6 +67,18 @@
     <gzip      href="http://www.gzip.org/" />
     <cygwin    href="http://www.cygwin.com/" />
     <osx       href="http://www.apple.com/macosx" />
+    <hod href="">
+      <cluster-resources href="http://www.clusterresources.com" />
+      <torque href="http://www.clusterresources.com/pages/products/torque-resource-manager.php" />
+      <torque-download href="http://www.clusterresources.com/downloads/torque/" />
+      <torque-docs href="http://www.clusterresources.com/pages/resources/documentation.php" />
+      <torque-wiki href="http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki" />
+      <torque-mailing-list href="http://www.clusterresources.com/pages/resources/mailing-lists.php" />
+      <torque-basic-config href="http://www.clusterresources.com/wiki/doku.php?id=torque:1.2_basic_configuration" />
+      <torque-advanced-config href="http://www.clusterresources.com/wiki/doku.php?id=torque:1.3_advanced_configuration" />
+      <python href="http://www.python.org" />
+      <twisted-python href="http://twistedmatrix.com/trac/" />
+    </hod>
     <api href="api/">
       <index href="index.html" />
       <org href="org/">

