Johannes Donath created STORM-3664:
--------------------------------------

             Summary: Nimbus cannot recover from LocalFsBlobStore deletion
                 Key: STORM-3664
                 URL: https://issues.apache.org/jira/browse/STORM-3664
             Project: Apache Storm
          Issue Type: Bug
          Components: blobstore, storm-server
    Affects Versions: 2.1.0, 2.2.0
            Reporter: Johannes Donath


When all Nimbus instances in a cluster lose access to previously stored blobs 
while at least one topology is deployed, the cluster cannot recover: none of 
the nodes is ever elected leader because the blobs are missing. Recovery is 
only possible by manually removing the blob and topology data from ZooKeeper.

I understand that the LocalFs blob store implementation is not particularly 
suited for high availability deployments. However, this issue prevents sensible 
automated disaster recovery on small deployments where a full deployment of 
HDFS would provide no benefit and would simply introduce additional complexity.

h3. Reproduction Steps
 # Deploy one or more Nimbus instances
 # Deploy a topology (such as the WordCount example)
 # Stop all Nimbus instances
 # Remove all blob directories
 # Start all Nimbus instances

h3. Expected Behavior

When a topology's blobs are permanently lost, the topology itself should be 
marked as failed so that the cluster remains available. Currently, a single 
lost topology suffices to take down the entire system.
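
To illustrate the behavior I have in mind, here is a minimal Java sketch. It is 
not based on the actual Nimbus code: the class BlobRecoverySketch, its Result 
type, the classify method and the blob key names are all hypothetical and only 
serve to show the idea. Each topology is checked against the blobs that are 
still present; topologies with missing blobs are collected as failed instead of 
blocking the whole startup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names are part of Apache Storm. */
public class BlobRecoverySketch {

    /** Outcome of checking every topology against the blobs still present. */
    public static final class Result {
        public final List<String> recoverable = new ArrayList<>();
        public final List<String> failed = new ArrayList<>();
    }

    /**
     * Splits topologies into those whose blobs are all available and those
     * that lost at least one blob, rather than aborting on the first miss.
     */
    public static Result classify(Map<String, Set<String>> requiredBlobs,
                                  Set<String> availableBlobs) {
        Result result = new Result();
        for (Map.Entry<String, Set<String>> entry : requiredBlobs.entrySet()) {
            if (availableBlobs.containsAll(entry.getValue())) {
                result.recoverable.add(entry.getKey());
            } else {
                // Assumption: a failed topology would be flagged for the
                // operator while leader election proceeds without it.
                result.failed.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative topology id and blob keys; the blob directory is empty.
        Map<String, Set<String>> required = Map.of(
            "wordcount-1",
            Set.of("wordcount-1-stormjar.jar", "wordcount-1-stormconf.ser"));
        Result result = classify(required, Set.of());
        System.out.println("failed: " + result.failed);
        System.out.println("recoverable: " + result.recoverable);
    }
}
```

With a check along these lines, the reproduction steps above would end with the 
WordCount topology reported as failed and Nimbus still able to elect a leader.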



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
