Hello,

I think it may interest you Storm developers to learn that I am currently hitting an issue with Storm 1.1.0 that was supposed to be resolved in this release, according to https://issues.apache.org/jira/browse/STORM-1977. I can collect any additional information you would need to diagnose why this issue can still happen.
Concretely, I have a Storm UI process which cannot get any information about its Storm cluster, and nimbus.log contains many occurrences of the following exception:

2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
org.apache.storm.generated.KeyNotFoundException: null
    at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
    at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
    at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
    at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
    at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
    at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
    at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
    at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
    at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
    at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
    at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
    at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
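For reference, the failing Thrift call is getClusterInfo, the same one the UI polls. Below is a minimal standalone probe (my own sketch, compiled against storm-core 1.1.0 and meant to run on a host whose storm.yaml points at the broken cluster; the class name and output format are just for illustration) that exercises the same code path, so I can trigger the error on demand and capture fresh logs for you:

import java.util.List;
import java.util.Map;
import org.apache.storm.generated.ClusterSummary;
import org.apache.storm.generated.TopologySummary;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

public class GetClusterInfoProbe {
    public static void main(String[] args) throws Exception {
        // Read storm.yaml the same way the storm CLI does.
        Map conf = Utils.readStormConfig();
        NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
        try {
            // Same Thrift endpoint the UI calls; on the broken cluster this
            // should fail with the KeyNotFoundException shown above.
            ClusterSummary summary = nimbus.getClient().getClusterInfo();
            List<TopologySummary> topologies = summary.get_topologies();
            System.out.println("getClusterInfo OK, topologies: " + topologies.size());
            for (TopologySummary t : topologies) {
                System.out.println("  " + t.get_id() + " status=" + t.get_status());
            }
        } finally {
            nimbus.close();
        }
    }
}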
What is striking is that I hit this same issue on two Storm clusters running on different VMs; they only share the same data in their Kafka broker cluster (one is the production cluster, which was quickly fixed; the other is the "backup" cluster, kept ready for a quick return to production if the production one fails).

I have left one of these clusters in this broken state because I felt it could be useful for Storm developers to gather more information on this issue for a proper diagnosis. I can keep this cluster as-is for at most 2 days. Is there anything useful I can collect on it to help you understand the cause (and hopefully make Storm more robust)?

A few details:
* Storm 1.1.0 cluster, with Nimbus and Nimbus UI running on one VM, plus 4 supervisor VMs and 3 ZooKeeper VMs
* Running with Java Server JRE 1.8.0_121
* Running on AWS EC2 instances
* We run about 10 topologies, with automatic self-healing: if a topology stops consuming Kafka items, our self-healer calls "kill topology" and then eventually resubmits the topology
* We also have self-healing on Nimbus UI, based on calling its REST services; if it does not respond fast enough, we restart Nimbus UI (a simplified sketch of this check is in the PPS below)
* We discovered the issue because Nimbus UI was being restarted every 2 minutes
* To fix our production cluster, which had the same symptom, we had to stop all Storm processes, stop all ZooKeepers, remove all data in the ZooKeeper snapshot files, restart all ZooKeepers, restart all Storm processes, and re-submit all our topologies

Please be as explicit as possible about which commands we should run if you need more details.

Best regards,
Alexandre Vermeerbergen
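PS: Since the exception is raised in LocalFsBlobStore.getStoredBlobMeta, I suppose the interesting data point is the difference between the blob keys tracked in ZooKeeper and the blob files actually present on the Nimbus host. Here is a rough diagnostic sketch I could run on the kept cluster; note that the /storm/blobstore path and the local blob directory are my assumptions based on default settings, so please correct them if they are wrong:

import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BlobKeyDiff {
    public static void main(String[] args) throws Exception {
        // args[0]: ZooKeeper connect string, e.g. "zk1:2181,zk2:2181,zk3:2181"
        // args[1]: Nimbus local blob dir, assumed to be ${storm.local.dir}/blobs
        String zkConnect = args[0];
        File localBlobDir = new File(args[1]);

        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(zkConnect, 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Assumed layout: blob keys under <storm.zookeeper.root>/blobstore,
        // with the default root "/storm".
        List<String> zkKeys = zk.getChildren("/storm/blobstore", false);
        System.out.println("Blob keys in ZooKeeper: " + zkKeys);
        System.out.println("Files in local blob dir: "
                + Arrays.toString(localBlobDir.list()));
        zk.close();
    }
}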
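PPS: For completeness, here is approximately what our Nimbus UI self-healing check does (a simplified sketch; the host, port, and 5-second budget are example values, not our exact production settings), so you can see precisely what "not responding fast enough" means on our side:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UiHealthCheck {
    public static void main(String[] args) throws Exception {
        // Storm UI REST endpoint for the cluster summary.
        URL url = new URL("http://nimbus-ui-host:8080/api/v1/cluster/summary");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            int code = conn.getResponseCode();
            System.out.println("UI answered with HTTP " + code);
            if (code != 200) {
                System.exit(1); // our self-healer would restart Nimbus UI here
            }
        } catch (IOException timeoutOrConnectFailure) {
            System.exit(1); // too slow or unreachable: restart Nimbus UI
        } finally {
            conn.disconnect();
        }
    }
}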