[jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation

Hemanth Yamijala (JIRA) Thu, 05 Jun 2008 23:01:08 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602929#action_12602929
 ]


Hemanth Yamijala commented on HADOOP-3184:
------------------------------------------

Mahadev, thank you for the review.

bq. 1) shouldRetryMasterLaunch is defined but is not used anywhere

This is removed now.

bq. 2) you might want to wrap most of the statements, since they exceed 80 
character columns.

Also done.

bq. 3) is there something we can report back to the user on the command line 
that some machines are faulty - CRITICAL contact admin? it would be really 
helpful if we can do that.

I presume you mean the case where some machines failed but the cluster 
eventually came up, right? Otherwise (that is, when the allocation itself 
fails), we already print a report on the command line telling the user which 
machine's hodring failed and for what reason. The services folks could then 
check the ringmaster log to see what other machines failed.

If you meant the former (i.e. the case of eventual success), I agree that it 
would be a useful feature to have. However, it would take more work to build 
this functionality into the client. I propose we leave this as it is for now, 
and make the enhancement in a later release.

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>         Attachments: 3184.1.patch, 3184.2.patch
>
>
> HOD clusters sometimes fail to allocate due to a single "bad" node. During 
> ring formation, the entire ring should not be dependent upon every single 
> node being good. Instead, it should exclude any ring member that does 
> not adequately join the ring in a specified amount of time.
> This is a frequent HOD user issue (although not directly caused by HOD).
> Examples of bad nodes: Missing java, incorrect version of HOD or Hadoop, 
> local name-cache corrupt, slow network links, drives just beginning to fail, 
> etc.
> Many of these conditions are known, and we can monitor for those separately, 
> but this enhancement would shield users from unknown failure conditions that 
> we haven't yet anticipated. This way, a user will get a cluster, instead of 
> hanging indefinitely.
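
The behaviour requested in the description above (wait a bounded time for ring members, then proceed without the ones that never joined) can be sketched roughly as follows. This is only an illustration of the idea, not HOD's actual code or the contents of the attached patches: `form_ring`, `has_joined`, `timeout_s`, `poll_interval_s` and `min_nodes` are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Iterable, List, Tuple


def form_ring(
    nodes: Iterable[str],
    has_joined: Callable[[str], bool],  # hypothetical probe: True once a node has joined
    timeout_s: float,
    poll_interval_s: float = 0.1,
    min_nodes: int = 1,
) -> Tuple[List[str], List[str]]:
    """Wait up to timeout_s for nodes to join; exclude those that do not.

    Returns (joined, excluded). Raises RuntimeError if fewer than
    min_nodes joined before the deadline.
    """
    pending = list(nodes)
    joined: List[str] = []
    deadline = time.monotonic() + timeout_s

    # Poll until every node has joined or the deadline passes.
    while pending and time.monotonic() < deadline:
        still_pending = []
        for node in pending:
            if has_joined(node):
                joined.append(node)
            else:
                still_pending.append(node)
        pending = still_pending
        if pending:
            time.sleep(poll_interval_s)

    # Whatever is still pending is treated as "bad" and left out of the ring.
    if len(joined) < min_nodes:
        raise RuntimeError(
            "only %d of the required %d nodes joined" % (len(joined), min_nodes)
        )
    return joined, pending


if __name__ == "__main__":
    # One node never joins; the ring still forms from the remaining two.
    ring, excluded = form_ring(
        ["node1", "node2", "node3"],
        has_joined=lambda n: n != "node2",
        timeout_s=0.3,
        poll_interval_s=0.05,
    )
    print("ring:", ring, "excluded:", excluded)
```

The `min_nodes` check is an assumption about how one might avoid handing the user a uselessly small cluster; the issue text itself only asks that a single bad node not block allocation.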

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
