Hello,

Tomorrow I will have to restart the cluster on which I have this issue with
Storm 1.1.0.
Is anybody interested in my running some commands to get more
logs before I repair this cluster?

Best regards,
Alexandre Vermeerbergen

2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <[email protected]>:

> Hello,
>
> I think it may interest you Storm developers to learn that I currently
> have a case of an issue with Storm 1.1.0 that was supposed to be resolved
> in this release, according to
> https://issues.apache.org/jira/browse/STORM-1977, and I can look for any
> further information you need to diagnose why this issue can still happen.
>
> Indeed, I have a Storm UI process which can't get any information on its
> Storm cluster, and I see many occurrences of the following exception in
> nimbus.log:
>
> 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
> 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
> 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
> 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
> org.apache.storm.generated.KeyNotFoundException: null
>         at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
>         at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
>         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
>         at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
>         at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
>         at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
>         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
>         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
>         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
>         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
>         at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
>         at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
>         at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
>         at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
>         at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
>
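To quantify how often this error occurs before the cluster is repaired, the getClusterInfo failures in nimbus.log could be counted per day. The following is only a rough sketch in Python, not something from the original message: the function name `count_key_not_found` is hypothetical, and it assumes each log entry sits on one unwrapped line in the timestamp format shown in the excerpt above.

```python
import re
import sys
from collections import Counter

# Matches an [ERROR] entry in the format seen in the excerpt, e.g.
# "2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] ..."
ERROR_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}\.\d{3} .*\[ERROR\] (.*)$"
)
# Any line that starts a new log entry (begins with a date).
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def count_key_not_found(lines):
    """Count getClusterInfo errors followed by KeyNotFoundException, per day."""
    per_day = Counter()
    pending_day = None  # day of a getClusterInfo [ERROR] still awaiting its exception
    for line in lines:
        line = line.rstrip("\n")
        match = ERROR_LINE.match(line)
        if match:
            day, message = match.groups()
            pending_day = day if "getClusterInfo" in message else None
        elif pending_day and "KeyNotFoundException" in line:
            per_day[pending_day] += 1
            pending_day = None
        elif ENTRY_START.match(line):
            # An unrelated log entry started before any exception line appeared.
            pending_day = None
    return per_day


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is whatever your installation uses): python count_errors.py nimbus.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for day, count in sorted(count_key_not_found(log).items()):
            print(day, count)
```

A per-day count like this may help show whether the errors began at one point in time (for example after a topology kill or resubmission) or appeared gradually.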
>
> What is surprising is that I got this same issue on two Storm clusters
> running on different VMs; they only share the same data in their Kafka
> Broker cluster (one cluster is the production one, which was quickly fixed,
> and the other is the "backup" cluster, to be used for a quick "back to
> production" if the production one fails).
>
> I left one of these clusters in this state because I felt that it
> could be useful for Storm developers to have more information on this
> issue, if needed to properly diagnose it.
>
> I can keep this cluster as is for at most 2 days.
>
> Is there anything useful I can collect on it to help Storm
> developers understand the cause (and hopefully use it to make Storm more
> robust)?
>
> A few details:
>
> * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM, plus 4
> Supervisor VMs + 3 Zookeeper VMs
>
> * Running with Java Server JRE 1.8.0_121
> * Running on AWS EC2 instances
>
> * We run about 10 topologies, with automatic self-healing on them (if they
> stop consuming Kafka items, our self-healer calls "Kill topology", and then
> eventually restarts the topology)
>
> * We have a self-healing on Nimbus UI based on calling its REST services.
> If it's not responding fast enough, we restart Nimbus UI
> * We noticed the issue because Nimbus UI was being restarted every 2 minutes
>
> * To fix our production server, which had the same symptom, we had to stop
> all Storm processes, then stop all Zookeepers, then remove all data in
> Zookeeper "snapshot files", then restart all Zookeepers, then restart all
> Storm processes, then re-submit all our topologies
>
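The Nimbus UI self-healing check described in the list above (call a REST service, restart the UI if it does not answer fast enough) might look roughly like the Python sketch below. Everything specific in it is an assumption rather than a detail from the original message: the host and port (`localhost:8080`), the choice of the Storm UI endpoint `/api/v1/cluster/summary`, the 5-second budget, and the helper names `http_get` and `ui_is_healthy`. The restart action itself is deliberately left out.

```python
import http.client
import json
import time
import urllib.request

# Assumed values, not taken from the original message.
UI_URL = "http://localhost:8080/api/v1/cluster/summary"
MAX_SECONDS = 5.0


def http_get(url, timeout):
    """Return the body of a GET request; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def ui_is_healthy(url=UI_URL, max_seconds=MAX_SECONDS,
                  fetch=http_get, clock=time.monotonic):
    """Return True if the UI answers with valid JSON within max_seconds."""
    started = clock()
    try:
        json.loads(fetch(url, max_seconds))
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection failures, timeouts, HTTP error statuses
        # and bodies that are not valid JSON.
        return False
    return clock() - started <= max_seconds
```

A check of this kind cannot tell a slow UI from a Nimbus that fails every getClusterInfo call, which would be consistent with the UI being restarted every 2 minutes without any improvement.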
> Please be as clear as possible about which commands we should run to give
> you more details, if needed.
>
> Best regards,
> Alexandre Vermeerbergen
>
>
