[ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wendy Chien updated HADOOP-442:
-------------------------------

    Attachment: hadoop-442-11.patch

Thanks for looking over the patch, Dhruba! I updated it to incorporate Dhruba's comments.

1. TestDecommission.waitNodeState now waits for 1 second.
2. I do mean to check for DECOMMISSION_INPROGRESS to make sure the decommission began, but I want to stop it before it finishes so I can test that recommissioning a node works too.
3. refreshNodes now returns void.
4. UnregisteredDatanodeException was already there, but I also added DisallowedDatanodeException to that clause. I'm inclined to leave them together since they are similar.
6. Added synchronized to verifyNodeRegistration, and removed it from start/stopDecommission.
7. Removed the new code from pendingTransfers.
8. Moved verifyNodeShutdown to FSNamesystem.

> slaves file should include an 'exclude' section, to prevent "bad" datanodes
> and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>     Attachments: hadoop-442-10.patch, hadoop-442-11.patch, hadoop-442-8.patch
>
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and over, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file will contain an 'exclude' section, which will list nodes that shouldn't be accessed, and should be ignored if they try to connect. That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
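For readers following the thread: the patch implements the requested exclude list on the HDFS side as a separate file the namenode reads, rather than a section inside the slaves file, and the new void-returning refreshNodes call makes the namenode re-read it. A hedged sketch of how this is wired up, using the `dfs.hosts.exclude` property name that Hadoop adopted (the file path here is only an example):

```xml
<!-- hadoop-site.xml fragment: point the namenode at an exclude file.
     Property name per the HDFS host-list mechanism; path is illustrative. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/conf/excludes</value>
  <description>File listing hostnames of datanodes that are not
  permitted to connect to the namenode.</description>
</property>
```

The exclude file lists one hostname per line. After adding a bad node to it, running `bin/hadoop dfsadmin -refreshNodes` tells the namenode to re-read the list and decommission or reject the excluded datanodes, without restarting the cluster.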