[jira] Commented: (HADOOP-5478) Provide a node health check script and run it periodically to check the node health status

Hemanth Yamijala (JIRA) Fri, 12 Jun 2009 00:30:35 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718749#action_12718749
 ]


Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

To summarize the comments above: the question under discussion is whether the 
node health checker script should be launched from a separate process, rather 
than from a thread inside the tasktracker (TT) as the current patch does. 
There are two motivations for moving it to a separate process:

- Periodically launching processes from a Java service like the TT has caused 
problems in the past; see HADOOP-5059 for an example.
- Owen also mentioned instances where they'd seen the service itself lock up 
due to the process launch (and the underlying fork()/exec()) failing.

So, the proposal is to solve this problem by running the node health checker 
script from a separate process. This process can be configured with the 
following:
- Path to a script
- An interval
- TT's address for communication.

The process would periodically run the script (as done in the patch today) and 
report the status to the TT using RPC. To keep management simple, we can, in 
the first cut, launch this process from the TT itself and stop it when the TT 
is going down. In future, it should be possible to decouple this even more and 
have them run independently. The simplicity we buy in the first iteration is 
that administrators do not have to manage this process separately for the time 
being, until we gain some experience with how the health check script behaves 
in practice.
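
As a rough illustration of the loop described above (a sketch only, not code 
from the patch): the names run_health_script, monitor and report are 
hypothetical, report stands in for the RPC call to the TT, and treating a 
non-zero exit code as "unhealthy" is an assumption, since the status 
convention has not been settled in this discussion.

```python
import subprocess
import time
from typing import Callable, List, Optional, Tuple


def run_health_script(command: List[str], timeout_s: float) -> Tuple[bool, str]:
    """Run the health script once and return (healthy, message).

    Assumption: exit code 0 means healthy, anything else means unhealthy.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A hung script is reported as unhealthy instead of blocking the loop.
        return False, "health script timed out"
    except OSError as exc:
        # The launch itself failed (missing file, no permission, fork failure).
        return False, f"could not launch health script: {exc}"
    output = result.stdout.strip()
    if result.returncode != 0:
        return False, output or f"exit code {result.returncode}"
    return True, output


def monitor(
    command: List[str],
    interval_s: float,
    report: Callable[[bool, str], None],
    timeout_s: float = 30.0,
    max_runs: Optional[int] = None,
) -> None:
    """Run the script every interval_s seconds and pass each result to report.

    report is a placeholder for the RPC to the TT; max_runs only exists so
    the loop can be stopped in a test (None means run until killed).
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        healthy, message = run_health_script(command, timeout_s)
        report(healthy, message)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_s)
```

The three parameters command, interval_s and report correspond to the script 
path, the interval and the TT's address listed above. Because the launch 
happens in this process, a failed or hung launch is only reported as an 
unhealthy status and cannot lock up the TT itself.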

Does this sound fine?


> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run some scripts/daemons 
> periodically to find the node status and take the node offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
