[ 
https://issues.apache.org/jira/browse/HADOOP-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604270#action_12604270
 ] 

Hemanth Yamijala commented on HADOOP-3531:
------------------------------------------

Debugging this issue, I found the cause to be a timing problem that does not 
happen on all machines.

In the HodRing, we have code that determines if a launched Hadoop command has 
exited with a non-zero error code, and in such cases an error is reported back 
to the ringmaster. The check is made soon after the command is launched. On 
some machines, the time between launching the command and its exit with the 
error code is a few hundred milliseconds. On such machines, the check runs 
before the command has exited, so the code concludes that all is fine. The 
failure is detected only later, when the check for the JobTracker's Jetty 
server fails, and about a minute is lost in the process.
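
To illustrate the race, here is a minimal sketch in Python. It is not the 
actual HodRing code: FAILING_CMD and check_immediately are hypothetical names, 
and the child process is only a stand-in for a Hadoop command that fails about 
300 ms after launch. Polling right after the launch reports "still running", 
which a naive check can mistake for success.

```python
import subprocess
import sys

# Hypothetical stand-in for a Hadoop command that exits with an error
# roughly 300 ms after it is launched.
FAILING_CMD = [
    sys.executable, "-c",
    "import sys, time; time.sleep(0.3); sys.exit(1)",
]

def check_immediately(cmd):
    """Launch cmd and poll its exit status right away, with no grace period."""
    proc = subprocess.Popen(cmd)
    early_status = proc.poll()  # None means "still running", not "succeeded"
    proc.wait()                 # reap the child so nothing is left behind
    return early_status, proc.returncode

early_status, final_status = check_immediately(FAILING_CMD)
```

Here early_status is what a check made immediately after launch would see, 
while final_status is the exit code the command eventually returns.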

If the max-master-failures variable is > 1, a second attempt is made to launch 
the JobTracker. On similar hardware and configuration, the same timing issue 
shows up. By the time 2 machines have failed, the HOD client times out waiting 
for the JobTracker URL and the cluster is deallocated by deleting the Torque 
job.

This is a fairly serious issue, because it nullifies the enhancement made in 
HADOOP-3184, as the JobTracker is not launched on enough machines to give it a 
chance of coming up on a good machine.

Introducing a minor delay of just a second in the HodRing code fixed the 
problem that is described above. It seems fair to wait a bit for the Hadoop 
command to actually exit (if there are errors) before checking for its error 
code.
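
A rough sketch of the idea in Python follows. It is an illustration under 
assumptions, not the patch itself: launch_and_check and grace_seconds are 
hypothetical names, and instead of a fixed one-second sleep it waits with a 
timeout, so it returns sooner when the command fails quickly.

```python
import subprocess
import sys

def launch_and_check(cmd, grace_seconds=1.0):
    """Launch cmd, then wait up to grace_seconds for an early exit.

    Returns (proc, status). status is the exit code if the command exited
    within the grace period, or None if it was still running.
    """
    proc = subprocess.Popen(cmd)
    try:
        status = proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        status = None  # still running after the grace period
    return proc, status

# Hypothetical stand-in for a Hadoop command that fails after about 300 ms.
failing_cmd = [
    sys.executable, "-c",
    "import sys, time; time.sleep(0.3); sys.exit(1)",
]
_, status = launch_and_check(failing_cmd)
if status:  # a non-zero exit code was seen within the grace period
    print("command failed early with exit code", status)
```

A command that is still running when the grace period ends yields a status of 
None, and the caller can then go on to the later checks, such as the one for 
the JobTracker's Jetty server.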

> Hod does not report job tracker failure on hod client side when job tracker 
> fails to come up
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3531
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3531
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.18.0
>            Reporter: Karam Singh
>            Priority: Blocker
>
> Hod does not report job tracker failure on hod client side when job tracker 
> fails to come up. 
> When max-master-failure > 1
> hod client does not properly show why job tracker failed to come up, while in 
> case namenode proper error message is displayed.
> Also in namenode failure ringmaster log contains information such as -: 
> "Detected errors (3) beyond allowed number of failures (2). Flagging error to 
> client"
> while no such information is there in ringmaster log for job tracker failures

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.