Hello,

I think it might interest you Storm developers to know that I am currently
hitting an issue with Storm 1.1.0 that was supposed to be resolved in this
release according to https://issues.apache.org/jira/browse/STORM-1977, and
I can gather any further information you would need to diagnose why this
issue can still happen.

Specifically, I have a Storm UI process that cannot get any information
about its Storm cluster, and I see many occurrences of the following
exception in nimbus.log:

2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
org.apache.storm.generated.KeyNotFoundException: null
        at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
        at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
        at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
        at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
        at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
        at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
        at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
        at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
        at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
        at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
        at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]


What is striking is that I got this same issue on two Storm clusters
running on different VMs; the only thing they share is the data in their
Kafka broker cluster. (One cluster is the production one, which was quickly
fixed; the other is the "backup" cluster, to be used for a quick return to
production should the production one fail.)

I left one of these clusters in this broken state because I felt it could
be useful for Storm developers to have more information on this issue,
should you need it to diagnose it properly.

I can keep this cluster as-is for at most 2 days.

Is there anything useful I can collect from it to help Storm developers
understand the cause (and hopefully make Storm more robust)?
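
For instance, unless you tell me otherwise, I could start with something like the following (shown here as a dry run that only prints the commands; the ZooKeeper server name, the `/storm/blobstore` znode path, and the `storm.local.dir`-based blobs directory are my assumptions about the defaults, not values verified on this cluster):

```shell
# Dry-run sketch of the data I could collect; paths and host names below
# are assumed defaults (storm.local.dir=/opt/storm-data, ZK root /storm,
# ZK server zk-1), not verified on this cluster.
run() { printf '+ %s\n' "$*"; }   # print each command instead of running it

run storm blobstore list                            # keys Nimbus believes exist
run ls -l /opt/storm-data/blobs                     # blobs actually present on disk
run zkCli.sh -server zk-1:2181 ls /storm/blobstore  # keys recorded in ZooKeeper
```

Comparing the three listings should show which topology keys are referenced but missing from the local blobstore, which is what the KeyNotFoundException suggests.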

A few details:

* Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM, plus 4
supervisor VMs and 3 ZooKeeper VMs

* Running with Java Server JRE 1.8.0_121
* Running on AWS EC2 instances

* We run about 10 topologies, with automatic self-healing on them (if a
topology stops consuming Kafka items, our self-healer calls "kill
topology" and then eventually resubmits the topology)
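
For context, the self-healer's recovery step boils down to something like this (dry-run sketch that only prints the commands; the topology name, wait time, jar path, and main class are placeholders, not our real ones):

```shell
# Dry-run sketch of the self-healer's kill-and-resubmit step; the
# topology name, jar path, and main class are placeholders.
run() { printf '+ %s\n' "$*"; }   # print each command instead of running it

TOPOLOGY="exampleTopology"
run storm kill "$TOPOLOGY" -w 30   # drain in-flight tuples for 30s, then kill
run storm jar /opt/topologies/example.jar com.example.ExampleTopology "$TOPOLOGY"
```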

* We also have self-healing on Nimbus UI, based on calling its REST
services: if it does not respond fast enough, we restart Nimbus UI
* We discovered the issue because Nimbus UI was being restarted every 2
minutes
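
That Nimbus UI watchdog is essentially a timed call to the UI's REST API, along these lines (the port, timeout, and restart command are placeholders for our real supervision tooling; `/api/v1/cluster/summary` is the standard Storm UI summary endpoint):

```shell
# Sketch of the Nimbus UI watchdog; the port, timeout, and restart
# command are placeholders for our real supervision tooling.
UI_URL="http://localhost:8080/api/v1/cluster/summary"
MAX_SECS=5

check_ui() {
    if curl -sf --max-time "$MAX_SECS" "$UI_URL" >/dev/null 2>&1; then
        echo "healthy"
    else
        echo "unresponsive"   # the real watchdog would restart the UI here,
    fi                        # e.g. via our service manager (placeholder)
}
check_ui
```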

* To fix our production cluster, which had the same symptom, we had to
stop all Storm processes, stop all ZooKeepers, remove all data in the
ZooKeeper "snapshot files", restart all ZooKeepers, restart all Storm
processes, and finally re-submit all our topologies
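
In case the exact order matters for the diagnosis, the recovery we applied can be sketched as follows (dry run that only prints the steps; host names, the service-manager commands, the ZooKeeper data directory, and the resubmit command are placeholders for our actual inventory and tooling):

```shell
# Dry-run sketch of the recovery procedure we applied in production.
# Host names, systemctl unit names, the ZooKeeper data directory, and
# the resubmit command are placeholders.
run() { printf '+ %s\n' "$*"; }   # print each step instead of executing it

for h in nimbus-host sup-1 sup-2 sup-3 sup-4; do     # 1. stop all Storm daemons
    run ssh "$h" 'systemctl stop storm-nimbus storm-supervisor storm-ui'
done
for h in zk-1 zk-2 zk-3; do                          # 2. stop ZK, wipe snapshots/logs
    run ssh "$h" 'systemctl stop zookeeper && rm -rf /var/lib/zookeeper/version-2/*'
done
for h in zk-1 zk-2 zk-3; do                          # 3. restart ZooKeeper
    run ssh "$h" 'systemctl start zookeeper'
done
for h in nimbus-host sup-1 sup-2 sup-3 sup-4; do     # 4. restart Storm daemons
    run ssh "$h" 'systemctl start storm-nimbus storm-supervisor storm-ui'
done
run storm jar /opt/topologies/example.jar com.example.ExampleTopology  # 5. resubmit each topology
```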

Please be as specific as possible about which commands we should run to
provide you with more details, if needed.

Best regards,
Alexandre Vermeerbergen
