Derek Dagit created STORM-1383:
----------------------------------
Summary: Supervisors should not crash if nimbus is unavailable
Key: STORM-1383
URL: https://issues.apache.org/jira/browse/STORM-1383
Project: Apache Storm
Issue Type: Improvement
Components: storm-core
Affects Versions: 0.11.0
Reporter: Derek Dagit
Assignee: Derek Dagit
When nimbus nodes are down, whether for maintenance or unexpectedly, supervisors
crash in a loop. This confuses users, who see their supervisors crash
repeatedly, and admins, whose monitoring and alerting fire for the entire
cluster.
Supervisors periodically check with nimbus to synchronize blob versions, which
requires connecting to the leader nimbus daemon. Previously, supervisors did not
contact nimbus periodically, so nimbus downtime did not cascade into
cluster-wide supervisor failures.
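To make the cascade concrete, here is a minimal Java sketch under stated assumptions. The issue does not say exactly how a failed connection turns into a crash, so this assumes the blob sync runs as a periodic task whose uncaught exceptions are handed to a "kill" callback that would exit the process. `LeaderNimbus`, `blobVersions`, and `runSyncLoop` are hypothetical names, not Storm APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of a crash-prone periodic blob sync (not Storm's actual code).
public class CrashProneSync {

    // Hypothetical stand-in for a client to the leader nimbus; throws when nimbus is down.
    interface LeaderNimbus {
        List<String> blobVersions() throws Exception;
    }

    // Runs up to `rounds` sync iterations. Any exception escapes the task body and
    // is passed to `killFn`; the loop then stops, mimicking a process exit.
    static int runSyncLoop(LeaderNimbus nimbus, int rounds, Consumer<Throwable> killFn) {
        int completed = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                nimbus.blobVersions(); // contact the leader nimbus
                completed++;
            } catch (Throwable t) {
                killFn.accept(t); // assumed: a real daemon would exit here
                return completed;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Throwable> kills = new ArrayList<>();
        // Simulated nimbus that is always unreachable.
        LeaderNimbus down = () -> { throw new Exception("nimbus unavailable"); };
        int done = runSyncLoop(down, 5, kills::add);
        System.out.println("completed=" + done + " kills=" + kills.size());
    }
}
```

Running `main` prints `completed=0 kills=1`: the first failed connection ends the loop. If an external process manager restarts the supervisor while nimbus is still down, the same thing happens again, producing the crash loop described in this issue.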
It would be better to handle the case where nimbus cannot be contacted and have
the supervisor continue its normal loop.
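A minimal Java sketch of that proposal, using hypothetical names (`LeaderNimbus`, `blobVersions`, `runSyncLoop`) rather than Storm's actual supervisor code: a failed connection is logged and that round is skipped, so the loop keeps running and resumes syncing once nimbus is reachable.

```java
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch: tolerate nimbus downtime instead of crashing the supervisor.
public class ResilientSync {
    private static final Logger LOG = Logger.getLogger(ResilientSync.class.getName());

    // Hypothetical stand-in for a client to the leader nimbus; throws when nimbus is down.
    interface LeaderNimbus {
        List<String> blobVersions() throws Exception;
    }

    // Returns the number of rounds in which blob versions were fetched successfully.
    static int runSyncLoop(LeaderNimbus nimbus, int rounds) {
        int synced = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                List<String> versions = nimbus.blobVersions();
                // A real supervisor would compare `versions` with local blobs here.
                LOG.fine("fetched " + versions.size() + " blob versions");
                synced++;
            } catch (Exception e) {
                // Nimbus is down or unreachable: log, skip this round, retry next round.
                LOG.warning("Could not contact nimbus, skipping blob sync: " + e.getMessage());
            }
        }
        return synced;
    }

    public static void main(String[] args) {
        // Simulated nimbus that is unavailable for the first 3 calls, then recovers.
        int[] calls = {0};
        LeaderNimbus flaky = () -> {
            if (calls[0]++ < 3) {
                throw new Exception("nimbus unavailable");
            }
            return Collections.singletonList("blob-a:1");
        };
        System.out.println("synced rounds: " + runSyncLoop(flaky, 5));
    }
}
```

With the simulated nimbus failing its first three calls, `main` prints `synced rounds: 2` after logging three warnings, and the loop never exits early. Catching every `Exception` keeps the sketch short; a real supervisor would likely catch only the connection-failure exception type so that unrelated bugs still surface.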