Sorry, I meant reproduce, not replicate. :)

On Thu, Aug 24, 2017 at 8:34 PM, Jungtaek Lim <[email protected]> wrote:

> Alexandre,
>
> I found that your storm local dir is placed at "/tmp/storm", parts or all
> of which could be removed at any time. Could you move the path to a
> non-temporary place and try to replicate?
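>
> For example, something like this in storm.yaml (the path is only an
> illustration; any persistent directory owned by the storm user is fine):
>
>     # hypothetical example; use a directory that survives reboots
>     storm.local.dir: "/var/storm"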
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Aug 24, 2017 at 6:40 PM, Alexandre Vermeerbergen <[email protected]> wrote:
>
>> Hello Jungtaek,
>>
>> Thank you very much for your answer.
>>
>> Please find attached the full Nimbus log (gzipped) related to this issue.
>>
>> Please note that the last ERROR repeats forever until we "repair" Storm.
>>
>> From the logs, it could be that the issue began close to when a topology
>> was restarted (killed, then started).
>>
>> Maybe this caused a corruption in Zookeeper. Is there anything I can
>> collect on our Zookeeper nodes/logs to help the analysis?
>>
>> Best regards,
>> Alexandre
>>
>> 2017-08-24 9:29 GMT+02:00 Jungtaek Lim <[email protected]>:
>>
>>> Hi Alexandre, I missed this mail since I was on vacation.
>>>
>>> I followed the stack trace, but it is hard to analyze without context.
>>> Do you mind providing the full nimbus log?
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Wed, Aug 16, 2017 at 4:12 AM, Alexandre Vermeerbergen <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Tomorrow I will have to restart the cluster on which I have this issue
>>>> with Storm 1.1.0.
>>>> Is anybody interested in my running some commands to get more logs
>>>> before I repair this cluster?
>>>>
>>>> Best regards,
>>>> Alexandre Vermeerbergen
>>>>
>>>> 2017-08-13 16:14 GMT+02:00 Alexandre Vermeerbergen <[email protected]>:
>>>>
>>>>> Hello,
>>>>>
>>>>> I think it might be of interest for you Storm developers to learn that
>>>>> I currently have a case of an issue with Storm 1.1.0 that was supposed
>>>>> to be resolved in this release, according to
>>>>> https://issues.apache.org/jira/browse/STORM-1977, and I can collect
>>>>> any additional information you would need to diagnose why this issue
>>>>> can still happen.
>>>>>
>>>>> Indeed, I have a Storm UI process which cannot get any information on
>>>>> its Storm cluster, and I see the following exception many times in
>>>>> nimbus.log:
>>>>>
>>>>> 2017-08-02 05:11:15.971 o.a.s.d.nimbus pool-14-thread-21 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormcode.ser with id d5120ad7-a81c-4c39-afc5-a7f876b04c73
>>>>> 2017-08-02 05:11:15.978 o.a.s.d.nimbus pool-14-thread-27 [INFO] Created download session for statefulAlerting_ec2-52-51-199-56-eu-west-1-compute-amazonaws-com_defaultStormTopic-29-1501650673-stormconf.ser with id aba18011-3258-4023-bbef-14d21a7066e1
>>>>> 2017-08-02 06:20:59.208 o.a.s.d.nimbus timer [INFO] Cleaning inbox ... deleted: stormjar-fbdadeab-105d-4510-9beb-0f0d87e1a77d.jar
>>>>> 2017-08-06 03:42:02.854 o.a.s.t.ProcessFunction pool-14-thread-34 [ERROR] Internal error processing getClusterInfo
>>>>> org.apache.storm.generated.KeyNotFoundException: null
>>>>>     at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:147) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:299) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) ~[?:?]
>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_121]
>>>>>     at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_121]
>>>>>     at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.7.0.jar:?]
>>>>>     at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.7.0.jar:?]
>>>>>     at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:489) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.daemon.nimbus$get_cluster_info$iter__10687__10691$fn__10692.invoke(nimbus.clj:1550) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
>>>>>     at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
>>>>>     at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
>>>>>     at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
>>>>>     at clojure.core$dorun.invoke(core.clj:3009) ~[clojure-1.7.0.jar:?]
>>>>>     at clojure.core$doall.invoke(core.clj:3025) ~[clojure-1.7.0.jar:?]
>>>>>     at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1524) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__10782.getClusterInfo(nimbus.clj:1971) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3920) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3904) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:162) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) ~[storm-core-1.1.0.jar:1.1.0]
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
>>>>>     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
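>>>>>
>>>>> If it helps, the same RPC can be exercised outside Storm UI with a
>>>>> trivial Thrift client. A minimal sketch (assuming a storm.yaml on the
>>>>> classpath that points at this cluster; the class name is just for
>>>>> illustration):
>>>>>
>>>>>     import java.util.Map;
>>>>>     import org.apache.storm.generated.ClusterSummary;
>>>>>     import org.apache.storm.utils.NimbusClient;
>>>>>     import org.apache.storm.utils.Utils;
>>>>>
>>>>>     // Hypothetical helper to reproduce the failing call.
>>>>>     public class ClusterInfoCheck {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             // Reads storm.yaml (plus defaults) from the classpath.
>>>>>             Map conf = Utils.readStormConfig();
>>>>>             NimbusClient client = NimbusClient.getConfiguredClient(conf);
>>>>>             try {
>>>>>                 // The same Thrift call Storm UI relies on; on the broken
>>>>>                 // cluster it hits the KeyNotFoundException path above.
>>>>>                 ClusterSummary summary = client.getClient().getClusterInfo();
>>>>>                 System.out.println("Topologies: " + summary.get_topologies_size());
>>>>>             } finally {
>>>>>                 client.close();
>>>>>             }
>>>>>         }
>>>>>     }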
>>>>>
>>>>> What is surprising is that I got this same issue on two Storm clusters
>>>>> running on different VMs; they merely share the same data in their
>>>>> Kafka broker cluster (one cluster is the production one, which was
>>>>> quickly fixed, and the other is the "backup" cluster, to be used for a
>>>>> quick return to production if the production one fails).
>>>>>
>>>>> I left one of these clusters in this state because I felt it could be
>>>>> interesting for Storm developers to have more information on this
>>>>> issue, should it be needed to properly diagnose it.
>>>>>
>>>>> I can keep this cluster as is for at most 2 days.
>>>>>
>>>>> Is there anything useful I can collect on it to help Storm developers
>>>>> understand the cause (and hopefully use it to make Storm more robust)?
>>>>>
>>>>> A few details:
>>>>>
>>>>> * Storm 1.1.0 cluster with Nimbus & Nimbus UI running on one VM, plus
>>>>> 4 supervisor VMs and 3 Zookeeper VMs
>>>>> * Running with Java Server JRE 1.8.0_121
>>>>> * Running on AWS EC2 instances
>>>>> * We run about 10 topologies, with automatic self-healing on them (if
>>>>> they stop consuming Kafka items, our self-healer calls "kill topology"
>>>>> and then eventually restarts the topology)
>>>>> * We have self-healing on Nimbus UI based on calling its REST
>>>>> services; if it does not respond fast enough, we restart Nimbus UI
>>>>> (see the sketch right after this list)
>>>>> * We figured out the issue because Nimbus UI was being restarted every
>>>>> 2 minutes
>>>>> * To fix our production cluster, which had the same symptom, we had to
>>>>> stop all Storm processes, then stop all Zookeepers, then remove all
>>>>> data in the Zookeeper "snapshot files", then restart all Zookeepers,
>>>>> then restart all Storm processes, and then re-submit all our
>>>>> topologies
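>>>>>
>>>>> The UI probe is roughly the following (a simplified sketch; the URL,
>>>>> port, and timeouts are illustrative, and the real checker targets each
>>>>> cluster's actual UI address):
>>>>>
>>>>>     import java.net.HttpURLConnection;
>>>>>     import java.net.URL;
>>>>>
>>>>>     // Hypothetical probe; exits non-zero when the UI looks unhealthy.
>>>>>     public class UiHealthCheck {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             // /api/v1/cluster/summary is the Storm UI REST endpoint
>>>>>             // for cluster-wide status.
>>>>>             URL url = new URL("http://localhost:8080/api/v1/cluster/summary");
>>>>>             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>>>>             conn.setConnectTimeout(5000); // fail fast if the UI is wedged
>>>>>             conn.setReadTimeout(5000);
>>>>>             int code = conn.getResponseCode();
>>>>>             conn.disconnect();
>>>>>             // A non-zero exit status triggers the restart logic.
>>>>>             System.exit(code == 200 ? 0 : 1);
>>>>>         }
>>>>>     }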
>>>>>
>>>>> Please be as clear as possible about which commands we should run to
>>>>> give you more details, if needed.
>>>>>
>>>>> Best regards,
>>>>> Alexandre Vermeerbergen
