[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720185#action_12720185
 ] 

Hong Tang commented on HADOOP-5478:
-----------------------------------

bq. Hong, is this to check if the TT is alive? In which case, did you mean 
another signal, like -0 or kill -3. -9 is SIGKILL and would kill the TT. Also, 
in that case are you suggesting that we could keep the health checker around 
and continue trying to report after a while?

@hemanth sorry for not being clear. I gave the problem a bit more thought, 
and I think the following logic may be simpler and more robust (1 and 2 are 
the current logic, 3 is my suggestion): 

(1) periodically launch the health-checking script; 
(2) report the resulting status back to the TT (both good and bad); 
(3) if the health checker fails to receive a response from the TT, it waits 
for X seconds, does an extra kill (to ensure the TT is dead), and then quits. 

I scanned through the code, and it seems that NodeHealthChecker.stop() would 
be a good place to perform step (3).
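
A minimal sketch of steps (1)-(3), to make the proposed control flow concrete. 
This is not code from any of the attached patches: the class name, the 
functional-interface parameters, and the round limit are all hypothetical, and 
how the TT is actually killed (which signal, which pid) is left behind a 
Runnable because that detail is still under discussion here.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names come from the Hadoop patches. */
class HealthReporterSketch {
    enum Outcome { STOPPED_NORMALLY, TT_UNREACHABLE, INTERRUPTED }

    private final BooleanSupplier runHealthScript; // step 1: true means the node looks healthy
    private final Predicate<Boolean> reportToTT;   // step 2: returns false if the TT did not respond
    private final Runnable killTT;                 // step 3: assumed hook that makes sure the TT is dead
    private final long graceMillis;                // the "X seconds" wait, in milliseconds

    HealthReporterSketch(BooleanSupplier runHealthScript, Predicate<Boolean> reportToTT,
                         Runnable killTT, long graceMillis) {
        this.runHealthScript = runHealthScript;
        this.reportToTT = reportToTT;
        this.killTT = killTT;
        this.graceMillis = graceMillis;
    }

    /** Runs at most maxRounds check-and-report cycles, pausing intervalMillis between them. */
    Outcome run(int maxRounds, long intervalMillis) {
        for (int round = 0; round < maxRounds; round++) {
            boolean healthy = runHealthScript.getAsBoolean();   // (1) launch the script
            boolean acknowledged = reportToTT.test(healthy);    // (2) report good or bad status
            if (!acknowledged) {
                // (3) no response: wait, do the extra kill, then quit.
                if (!sleep(graceMillis)) {
                    return Outcome.INTERRUPTED;
                }
                killTT.run();
                return Outcome.TT_UNREACHABLE;
            }
            if (!sleep(intervalMillis)) {
                return Outcome.INTERRUPTED;
            }
        }
        return Outcome.STOPPED_NORMALLY;
    }

    /** Sleeps for the given time; returns false (and restores the flag) if interrupted. */
    private static boolean sleep(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this sketch the checker never retries after a missed response; it treats a 
single failed report as the trigger for step (3). Whether a retry is wanted 
before the kill is a separate design question.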

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run some scripts/daemons 
> periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
