svn commit: r705430 [2/4] - in /hadoop/core/trunk: docs/ src/contrib/hod/ src/docs/src/documentation/content/xdocs/

acmurthy Thu, 16 Oct 2008 17:36:21 -0700
Modified: hadoop/core/trunk/docs/hod_user_guide.html
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/hod_user_guide.html?rev=705430&r1=705429&r2=705430&view=diff
==============================================================================
--- hadoop/core/trunk/docs/hod_user_guide.html (original)
+++ hadoop/core/trunk/docs/hod_user_guide.html Thu Oct 16 17:35:57 2008
@@ -262,7 +262,11 @@
 <a href="#Hangs+During+Deallocation">hod Hangs During Deallocation </a>
 </li>
 <li>
-<a href="#Fails+With+an+error+code+and+error+message">hod Fails With an error code and error message </a>
+<a href="#Fails+With+an+Error+Code+and+Error+Message">hod Fails With an Error Code and Error Message </a>
+</li>
+<li>
+<a href="#Hadoop+DFSClient+Warns+with+a%0A++NotReplicatedYetException">Hadoop DFSClient Warns with a
+  NotReplicatedYetException</a>
 </li>
 <li>
 <a href="#Hadoop+Jobs+Not+Running+on+a+Successfully+Allocated+Cluster"> Hadoop Jobs Not Running on a Successfully Allocated Cluster </a>
@@ -277,7 +281,7 @@
 <a href="#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque"> The Exit Codes For HOD Are Not Getting Into Torque </a>
 </li>
 <li>
-<a href="#The+Hadoop+Logs+are+Not+Uploaded+to+DFS"> The Hadoop Logs are Not Uploaded to DFS </a>
+<a href="#The+Hadoop+Logs+are+Not+Uploaded+to+HDFS"> The Hadoop Logs are Not Uploaded to HDFS </a>
 </li>
 <li>
 <a href="#Locating+Ringmaster+Logs"> Locating Ringmaster Logs </a>
@@ -748,7 +752,7 @@
 <tr>
         
 <td colspan="1" rowspan="1"> 7 </td>
-        <td colspan="1" rowspan="1"> DFS failure </td>
+        <td colspan="1" rowspan="1"> HDFS failure </td>
       
 </tr>
       
@@ -966,8 +970,8 @@
 <a name="_hod_Hangs_During_Deallocation" id="_hod_Hangs_During_Deallocation"></a><a name="hod_Hangs_During_Deallocation" id="hod_Hangs_During_Deallocation"></a>
 <p>
 <em>Possible Cause:</em> A Torque related problem, usually load on the Torque server, or the allocation is very large. Generally, waiting for the command to complete is the only option.</p>
-<a name="N105C2"></a><a name="Fails+With+an+error+code+and+error+message"></a>
-<h3 class="h4">hod Fails With an error code and error message </h3>
+<a name="N105C2"></a><a name="Fails+With+an+Error+Code+and+Error+Message"></a>
+<h3 class="h4">hod Fails With an Error Code and Error Message </h3>
 <a name="hod_Fails_With_an_error_code_and" id="hod_Fails_With_an_error_code_and"></a><a name="_hod_Fails_With_an_error_code_an" id="_hod_Fails_With_an_error_code_an"></a>
 <p>If the exit code of the <span class="codefrag">hod</span> command is not <span class="codefrag">0</span>, then refer to the following table of error exit codes to determine why the code may have occurred and how to debug the situation.</p>
 <p>
@@ -1047,14 +1051,14 @@
 <tr>
         
 <td colspan="1" rowspan="1"> 7 </td>
-        <td colspan="1" rowspan="1"> DFS failure </td>
-        <td colspan="1" rowspan="1"> When HOD fails to allocate due to DFS failures (or Job tracker failures, error code 8, see below), it prints a failure message "Hodring at &lt;hostname&gt; failed with following errors:" and then gives the actual error message, which may indicate one of the following:<br>
+        <td colspan="1" rowspan="1"> HDFS failure </td>
+        <td colspan="1" rowspan="1"> When HOD fails to allocate due to HDFS failures (or Job tracker failures, error code 8, see below), it prints a failure message "Hodring at &lt;hostname&gt; failed with following errors:" and then gives the actual error message, which may indicate one of the following:<br>
           1. Problem in starting Hadoop clusters. Usually the actual cause in the error message will indicate the problem on the hostname mentioned. Also, review the Hadoop related configuration in the HOD configuration files. Look at the Hadoop logs using information specified in <em>Collecting and Viewing Hadoop Logs</em> section above. <br>
           2. Invalid configuration on the node running the hodring, specified by the hostname in the error message <br>
           3. Invalid configuration in the <span class="codefrag">hodring</span> section of hodrc. <span class="codefrag">ssh</span> to the hostname specified in the error message and grep for <span class="codefrag">ERROR</span> or <span class="codefrag">CRITICAL</span> in hodring logs. Refer to the section <em>Locating Hodring Logs</em> below for more information. <br>
           4. Invalid tarball specified which is not packaged correctly. <br>
           5. Cannot communicate with an externally configured HDFS.<br>
-          When such DFS or Job tracker failure occurs, one can login into the host with hostname mentioned in HOD failure message and debug the problem. While fixing the problem, one should also review other log messages in the ringmaster log to see which other machines also might have had problems bringing up the jobtracker/namenode, apart from the hostname that is reported in the failure message. This possibility of other machines also having problems occurs because HOD continues to try and launch hadoop daemons on multiple machines one after another depending upon the value of the configuration variable <a href="hod_config_guide.html#3.4+ringmaster+options">ringmaster.max-master-failures</a>. Refer to the section <em>Locating Ringmaster Logs</em> below to find more about ringmaster logs.
+          When such HDFS or Job tracker failure occurs, one can login into the host with hostname mentioned in HOD failure message and debug the problem. While fixing the problem, one should also review other log messages in the ringmaster log to see which other machines also might have had problems bringing up the jobtracker/namenode, apart from the hostname that is reported in the failure message. This possibility of other machines also having problems occurs because HOD continues to try and launch hadoop daemons on multiple machines one after another depending upon the value of the configuration variable <a href="hod_config_guide.html#3.4+ringmaster+options">ringmaster.max-master-failures</a>. Refer to the section <em>Locating Ringmaster Logs</em> below to find more about ringmaster logs.
           </td>
       
 </tr>
@@ -1123,7 +1127,31 @@
 </tr>
   
 </table>
-<a name="N10757"></a><a name="Hadoop+Jobs+Not+Running+on+a+Successfully+Allocated+Cluster"></a>
+<a name="N10757"></a><a name="Hadoop+DFSClient+Warns+with+a%0A++NotReplicatedYetException"></a>
+<h3 class="h4">Hadoop DFSClient Warns with a
+  NotReplicatedYetException</h3>
+<p>Sometimes, when you try to upload a file to the HDFS immediately after
+  allocating a HOD cluster, DFSClient warns with a NotReplicatedYetException. It
+  usually shows a message something like - </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+<tr>
+<td colspan="1" rowspan="1"><span class="codefrag">WARN
+  hdfs.DFSClient: NotReplicatedYetException sleeping &lt;filename&gt; retries
+  left 3</span></td>
+</tr>
+<tr>
+<td colspan="1" rowspan="1"><span class="codefrag">08/01/25 16:31:40 INFO hdfs.DFSClient:
+  org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
+  &lt;filename&gt; could only be replicated to 0 nodes, instead of
+  1</span></td>
+</tr>
+</table>
+<p> This scenario arises when you try to upload a file
+  to the HDFS while the DataNodes are still in the process of contacting the
+  NameNode. This can be resolved by waiting for some time before uploading a new
+  file to the HDFS, so that enough DataNodes start and contact the
+  NameNode.</p>
+<a name="N1076F"></a><a name="Hadoop+Jobs+Not+Running+on+a+Successfully+Allocated+Cluster"></a>
 <h3 class="h4"> Hadoop Jobs Not Running on a Successfully Allocated Cluster </h3>
 <a name="Hadoop_Jobs_Not_Running_on_a_Suc" id="Hadoop_Jobs_Not_Running_on_a_Suc"></a>
 <p>This scenario generally occurs when a cluster is allocated, and is left inactive for sometime, and then hadoop jobs are attempted to be run on them. Then Hadoop jobs fail with the following exception:</p>
@@ -1142,31 +1170,31 @@
 <em>Possible Cause:</em> There is a version mismatch between the version of the hadoop client being used to submit jobs and the hadoop used in provisioning (typically via the tarball option). Ensure compatible versions are being used.</p>
 <p>
 <em>Possible Cause:</em> You used one of the options for specifying Hadoop configuration <span class="codefrag">-M or -H</span>, which had special characters like space or comma that were not escaped correctly. Refer to the section <em>Options Configuring HOD</em> for checking how to specify such options correctly.</p>
-<a name="N10792"></a><a name="My+Hadoop+Job+Got+Killed"></a>
+<a name="N107AA"></a><a name="My+Hadoop+Job+Got+Killed"></a>
 <h3 class="h4"> My Hadoop Job Got Killed </h3>
 <a name="My_Hadoop_Job_Got_Killed" id="My_Hadoop_Job_Got_Killed"></a>
 <p>
 <em>Possible Cause:</em> The wallclock limit specified by the Torque administrator or the <span class="codefrag">-l</span> option defined in the section <em>Specifying Additional Job Attributes</em> was exceeded since allocation time. Thus the cluster would have got released. Deallocate the cluster and allocate it again, this time with a larger wallclock time.</p>
 <p>
 <em>Possible Cause:</em> Problems with the JobTracker node. Refer to the section in <em>Collecting and Viewing Hadoop Logs</em> to get more information.</p>
-<a name="N107AD"></a><a name="Hadoop+Job+Fails+with+Message%3A+%27Job+tracker+still+initializing%27"></a>
+<a name="N107C5"></a><a name="Hadoop+Job+Fails+with+Message%3A+%27Job+tracker+still+initializing%27"></a>
 <h3 class="h4"> Hadoop Job Fails with Message: 'Job tracker still initializing' </h3>
 <a name="Hadoop_Job_Fails_with_Message_Jo" id="Hadoop_Job_Fails_with_Message_Jo"></a>
 <p>
 <em>Possible Cause:</em> The hadoop job was being run as part of the HOD script command, and it started before the JobTracker could come up fully. Allocate the cluster using a large value for the configuration option <span class="codefrag">--hod.script-wait-time</span>. Typically a value of 120 should work, though it is typically unnecessary to be that large.</p>
-<a name="N107BD"></a><a name="The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque"></a>
+<a name="N107D5"></a><a name="The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque"></a>
 <h3 class="h4"> The Exit Codes For HOD Are Not Getting Into Torque </h3>
 <a name="The_Exit_Codes_For_HOD_Are_Not_G" id="The_Exit_Codes_For_HOD_Are_Not_G"></a>
 <p>
 <em>Possible Cause:</em> Version 0.16 of hadoop is required for this functionality to work. The version of Hadoop used does not match. Use the required version of Hadoop.</p>
 <p>
 <em>Possible Cause:</em> The deallocation was done without using the <span class="codefrag">hod</span> command; for e.g. directly using <span class="codefrag">qdel</span>. When the cluster is deallocated in this manner, the HOD processes are terminated using signals. This results in the exit code to be based on the signal number, rather than the exit code of the program.</p>
-<a name="N107D5"></a><a name="The+Hadoop+Logs+are+Not+Uploaded+to+DFS"></a>
-<h3 class="h4"> The Hadoop Logs are Not Uploaded to DFS </h3>
+<a name="N107ED"></a><a name="The+Hadoop+Logs+are+Not+Uploaded+to+HDFS"></a>
+<h3 class="h4"> The Hadoop Logs are Not Uploaded to HDFS </h3>
 <a name="The_Hadoop_Logs_are_Not_Uploaded" id="The_Hadoop_Logs_are_Not_Uploaded"></a>
 <p>
 <em>Possible Cause:</em> There is a version mismatch between the version of the hadoop being used for uploading the logs and the external HDFS. Ensure that the correct version is specified in the <span class="codefrag">hodring.pkgs</span> option.</p>
-<a name="N107E5"></a><a name="Locating+Ringmaster+Logs"></a>
+<a name="N107FD"></a><a name="Locating+Ringmaster+Logs"></a>
 <h3 class="h4"> Locating Ringmaster Logs </h3>
 <a name="Locating_Ringmaster_Logs" id="Locating_Ringmaster_Logs"></a>
 <p>To locate the ringmaster logs, follow these steps: </p>
@@ -1183,7 +1211,7 @@
 <li> If you don't get enough information, you may want to set the ringmaster debug level to 4. This can be done by passing <span class="codefrag">--ringmaster.debug 4</span> to the hod command line.</li>
   
 </ul>
-<a name="N10811"></a><a name="Locating+Hodring+Logs"></a>
+<a name="N10829"></a><a name="Locating+Hodring+Logs"></a>
 <h3 class="h4"> Locating Hodring Logs </h3>
 <a name="Locating_Hodring_Logs" id="Locating_Hodring_Logs"></a>
 <p>To locate hodring logs, follow the steps below: </p>