Github user revans2 commented on the issue: https://github.com/apache/storm/pull/1674

My example was theoretical; I honestly don't know whether, in practice, nimbus would see the supervisor appear and then disappear if nimbus's NIC were bad. The more common case would be a bad ZK NIC, which could cause this. But that is beside the point. There are lots of different things that could make most of the cluster look bad, or actually go bad for real.

I am fine with detecting/handling some cases that we know about and can reproduce, but we should have some sort of default catch-all. For example, if HDFS loses too many nodes it goes into read-only mode, whereas YARN ignores the loss but uses metrics to alert the cluster owners. If we feel that, because we are primarily compute like YARN, we want a metric for how many nodes are blacklisted, that seems like a perfectly fine default. If we run into other situations and can detect/auto-correct them, even better.
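As a minimal sketch of the kind of catch-all metric described above: a gauge exposing the current count of blacklisted supervisors, which cluster owners could alert on. This assumes the Dropwizard Metrics API; the class name, metric name, and blacklist/release methods here are hypothetical illustrations, not Storm's actual scheduler code.

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: expose the number of blacklisted supervisors as a gauge so
// operators can alert on it, rather than having the scheduler auto-react.
public class BlacklistMetrics {
    // Hypothetical holder for the ids of currently blacklisted supervisors.
    private final Set<String> blacklisted = ConcurrentHashMap.newKeySet();

    public BlacklistMetrics(MetricRegistry registry) {
        // Gauge reports the live size of the blacklist on each poll.
        registry.register("scheduler.blacklisted-supervisors",
                (Gauge<Integer>) blacklisted::size);
    }

    public void blacklist(String supervisorId) {
        blacklisted.add(supervisorId);
    }

    public void release(String supervisorId) {
        blacklisted.remove(supervisorId);
    }
}
```

With this in place, an external monitoring system can page the cluster owners when the gauge crosses a threshold, leaving the decision to a human by default while still allowing specific, reproducible failure modes to be auto-corrected later.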