Johannes Donath created STORM-3664:
--------------------------------------
Summary: Nimbus cannot recover from LocalFsBlobStore deletion
Key: STORM-3664
URL: https://issues.apache.org/jira/browse/STORM-3664
Project: Apache Storm
Issue Type: Bug
Components: blobstore, storm-server
Affects Versions: 2.1.0, 2.2.0
Reporter: Johannes Donath
When all Nimbus instances in a cluster lose access to previously stored blobs
while at least one topology is deployed, the cluster cannot recover: because of
the missing blobs, none of the nodes is ever elected as leader. Recovery is only
possible by manually removing blob and topology data from ZooKeeper.
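
The failure mode can be illustrated with a small model. This is only a sketch under the assumption that a node accepts leadership only when it holds every blob recorded for the active topologies. It is not Storm's actual code, and the function names, node names, and blob keys are all hypothetical.

```python
# Simplified model of the failure described above; NOT Storm's actual code.
# All function names, node names, and blob keys are hypothetical.

def can_accept_leadership(required_blobs, local_blobs):
    """A node may lead only if it holds every blob recorded for active topologies."""
    return set(required_blobs) <= set(local_blobs)


def elect_leader(nodes, required_blobs):
    """Return the name of the first eligible node, or None if no node qualifies."""
    for name, local_blobs in nodes.items():
        if can_accept_leadership(required_blobs, local_blobs):
            return name
    return None


# Blob keys still recorded in ZooKeeper for one deployed topology (made-up keys).
required = {"wordcount-1-stormjar.jar", "wordcount-1-stormconf.ser"}

# Every Nimbus instance restarts with an empty blob directory.
nodes_after_loss = {"nimbus-1": set(), "nimbus-2": set()}

# No node holds the required blobs, so no leader is ever chosen.
leader = elect_leader(nodes_after_loss, required)

# Clearing the recorded state models the manual ZooKeeper cleanup.
leader_after_cleanup = elect_leader(nodes_after_loss, set())
```

In this model `leader` stays `None` until the recorded blob keys are cleared, which mirrors the manual cleanup that is currently required.
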
I understand that the LocalFs blob store implementation is not particularly
suited for high availability deployments. However, this issue prevents sensible
automated disaster recovery on small deployments, where a full HDFS deployment
would provide no benefit and would simply introduce additional complexity.
h3. Reproduction Steps
# Deploy one or more Nimbus instances
# Deploy a topology (such as the WordCount example)
# Stop all Nimbus instances
# Remove all blob directories
# Start all Nimbus instances
h3. Expected Behavior
When a topology's blobs are permanently lost, the topology itself should be
marked as failed so that the cluster remains available. Currently, a single
lost topology suffices to take down the entire system.
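
A minimal sketch of the requested behavior, assuming the set of blob keys still available anywhere in the cluster can be determined at startup. The function name, topology names, and blob keys are hypothetical, and this is not a proposal for where such a check would live in Storm.

```python
# Hypothetical sketch of the expected behavior; NOT an existing Storm API.

def split_topologies(topology_blobs, available_blobs):
    """Partition topologies into (recoverable, failed) by blob availability.

    topology_blobs maps a topology name to the blob keys it needs.
    available_blobs is the set of blob keys that can still be read.
    """
    available = set(available_blobs)
    recoverable, failed = [], []
    for topology, blobs in topology_blobs.items():
        if set(blobs) <= available:
            recoverable.append(topology)
        else:
            # Blobs are gone for good: fail this topology instead of
            # blocking the whole cluster.
            failed.append(topology)
    return recoverable, failed


# Made-up state: the blobs of "wordcount-1" were lost, those of "metrics-2" survived.
topology_blobs = {
    "wordcount-1": {"wordcount-1-stormjar.jar", "wordcount-1-stormconf.ser"},
    "metrics-2": {"metrics-2-stormjar.jar"},
}
available_blobs = {"metrics-2-stormjar.jar"}

recoverable, failed = split_topologies(topology_blobs, available_blobs)
```

With this policy only `wordcount-1` is marked as failed, while `metrics-2` and the cluster as a whole stay available.
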