[ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wendy Chien updated HADOOP-442:
-------------------------------

    Attachment: hadoop-442-11.patch

Thanks for looking over the patch, Dhruba! I updated it to incorporate Dhruba's comments.

1. TestDecommission.waitNodeState now waits for 1 second.
2. I do mean to check for DECOMMISSION_INPROGRESS to make sure the decommission began, but I want to stop it before it finishes so I can test that recommissioning a node works too.
3. refreshNodes now returns void.
4. UnregisteredDatanodeException was already there, but I also added DisallowedDatanodeException to that clause. I'm inclined to leave them together since they are similar.
6. Added synchronized to verifyNodeRegistration, and removed it from start/stopDecommission.
7. Removed the new code from pendingTransfers.
8. Moved verifyNodeShutdown to FSNamesystem.

> slaves file should include an 'exclude' section, to prevent "bad" datanodes
> and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>     Attachments: hadoop-442-10.patch, hadoop-442-11.patch, hadoop-442-8.patch
>
>
> I recently had a few nodes go bad, such that they were inaccessible to ssh but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and over, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file will contain an 'exclude' section, which will list nodes that shouldn't be accessed, and should be ignored if they try to connect. That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
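For readers following the thread: the patch implements the requested exclude list on the HDFS side as a separate file the namenode reads, rather than a section inside the slaves file, and the new void-returning refreshNodes call makes the namenode re-read it. A hedged sketch of how this is wired up, using the `dfs.hosts.exclude` property name that Hadoop adopted (the file path here is only an example):

```xml
<!-- hadoop-site.xml fragment: point the namenode at an exclude file.
     Property name per the HDFS host-list mechanism; path is illustrative. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/conf/excludes</value>
  <description>File listing hostnames of datanodes that are not
  permitted to connect to the namenode.</description>
</property>
```

The exclude file lists one hostname per line. After adding a bad node to it, running `bin/hadoop dfsadmin -refreshNodes` tells the namenode to re-read the list and decommission or reject the excluded datanodes, without restarting the cluster.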