[ https://issues.apache.org/jira/browse/HADOOP-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604548#action_12604548 ]
Hemanth Yamijala commented on HADOOP-3531:
------------------------------------------
bq. Inserting delays to fix synchro problems is a perilous course. Can we do
something better, like alternate between checking for an error code, and
checking if the jetty interface came up?
Ari, excellent suggestion. I have done just that in the attached patch. Now,
even if the command's exit code is missed on the first check, the code keeps
checking the exit status while it polls the jetty interface. If the command is
found to have exited at any point, the code returns immediately with a
meaningful error message.
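For anyone following along without the patch open, the alternating check described above can be sketched roughly as below. This is only an illustration, not the code in 3531.patch: the function name wait_for_service and the callables get_exit_code and service_is_up are hypothetical, and the sketch assumes that any exit of the command before the jetty interface answers should be treated as a failure.

```python
import time


def wait_for_service(get_exit_code, service_is_up, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Alternate between checking the command's exit status and the service.

    get_exit_code() is a hypothetical callable returning None while the
    command is still running, otherwise its integer exit status.
    service_is_up() is a hypothetical callable returning True once the
    web (jetty) interface responds.
    Returns a (state, message) tuple; state is "up", "failed" or "timeout".
    """
    deadline = clock() + timeout
    while True:
        # Check the exit status on every pass, so an exit that was missed
        # earlier is still caught on a later iteration.
        code = get_exit_code()
        if code is not None:
            return ("failed",
                    "command exited with status %d before the service came up" % code)
        # Only then check whether the interface has come up.
        if service_is_up():
            return ("up", None)
        if clock() >= deadline:
            return ("timeout",
                    "service did not come up within %s seconds" % timeout)
        sleep(interval)
```

The sleep between passes is only a polling interval. The outcome is decided by the two checks, so no fixed delay is needed to make sure the exit code is seen.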
I tested this patch by introducing spurious failures on a test cluster, and I
was able to validate that the correct code paths were covered. However, further
testing in a more controlled environment, where we can control the allocated
nodes and set them up for failure, would help. Karam, can you please take this
patch and re-run your tests?
> Hod does not report job tracker failure on hod client side when job tracker
> fails to come up
> ---------------------------------------------------------------------------------------------
>
> Key: HADOOP-3531
> URL: https://issues.apache.org/jira/browse/HADOOP-3531
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hod
> Affects Versions: 0.18.0
> Reporter: Karam Singh
> Assignee: Hemanth Yamijala
> Priority: Blocker
> Fix For: 0.18.0
>
> Attachments: 3531.patch
>
>
> Hod does not report job tracker failure on the hod client side when the job
> tracker fails to come up.
> When max-master-failure > 1, the hod client does not properly show why the
> job tracker failed to come up, while in the case of a namenode failure a
> proper error message is displayed.
> Also, for a namenode failure the ringmaster log contains information such as:
> "Detected errors (3) beyond allowed number of failures (2). Flagging error to
> client"
> while no such information appears in the ringmaster log for job tracker
> failures.