The exception is caused by the fix in STORM-2994, the new code should only run in AT_LEAST_ONCE mode, not in the others.
Have raised https://issues.apache.org/jira/browse/STORM-3059 to fix it. I disagree that the spout should catch and swallow unknown/unexpected exceptions. Storm is designed to be fail-fast, and to restart processes when they error out unexpectedly. I don't think the spout would work any better if it caught and ignored these exceptions. 2018-05-06 9:30 GMT+02:00 Alexandre Vermeerbergen <[email protected]> : > Hello All, > > [ ] -1 Do not release this package because storm-kafka-client spout > glitches can crash workers, leading to degraded performances. > > I have downloaded the binary artifacts of this storm 1.2.0rc2, copied > the binaries of storm-kafka-client, flux and flux-wrapper from the > Nessus staging repository quoted by Taylor, and ran tests on a > relatively modest configuration (1 VM for Nimbus, 1 VM for a > Supervisor node, 1 VM a Zookeeper node) with Java 8 update 172 on > CentOS 7. We use storm-kafka-client with Kafka 0.10.2.0 libs against a > large cluster of Kafka Brokers at version 1.0.1. We have ~15 > topologies running on this setup. > > The first glitch I noticed is a that, unlike with Storm 1.2.0 which we > use in production, we had deserialization exceptions in of our our > Spout: > - With Storm 1.2.0, these exceptions were somehow "swallowed", and > this spout would consume nothing and not crashing its worker anyway > - With Storm 1.2.2rc2, these exceptions showed up, with a crash of the > Spout's worker process. Storm restarts the worker, and then the same > crash occurs not long afte. All this leads to many Netty errors in all > our topologies' logs, which clearly means bad performances (CPU load > quite high on the Supervisor VM) > After fixing the cause of the deserialization exception (which was in > our code), this issue disappeared. > > The second glitch which I just noticed is that we have a another > "situation" which is similar but yet a little bit different : on > another of our topologies, we have a spout throwing the following > exception, also leading to its worker's crash: > > 2018-05-06 06:55:49.560 o.a.s.k.s.KafkaSpout > Thread-6-eventKafkaSpout-executor[3 3] [INFO] Initialization complete > 2018-05-06 06:55:49.636 o.a.s.util Thread-6-eventKafkaSpout-executor[3 > 3] [ERROR] Async loop died! > java.lang.NullPointerException: null > at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple( > KafkaSpout.java:507) > ~[stormjar.jar:?] > at org.apache.storm.kafka.spout.KafkaSpout. > emitIfWaitingNotEmitted(KafkaSpout.java:440) > ~[stormjar.jar:?] > at org.apache.storm.kafka.spout.KafkaSpout.nextTuple( > KafkaSpout.java:308) > ~[stormjar.jar:?] > at org.apache.storm.daemon.executor$fn__10727$fn__10742$ > fn__10773.invoke(executor.clj:654) > ~[storm-core-1.2.2.jar:1.2.2] > at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) > [storm-core-1.2.2.jar:1.2.2] > at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > 2018-05-06 06:55:49.641 o.a.s.d.executor > Thread-6-eventKafkaSpout-executor[3 3] [ERROR] > java.lang.NullPointerException: null > at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple( > KafkaSpout.java:507) > ~[stormjar.jar:?] > at org.apache.storm.kafka.spout.KafkaSpout. > emitIfWaitingNotEmitted(KafkaSpout.java:440) > ~[stormjar.jar:?] > at org.apache.storm.kafka.spout.KafkaSpout.nextTuple( > KafkaSpout.java:308) > ~[stormjar.jar:?] > at org.apache.storm.daemon.executor$fn__10727$fn__10742$ > fn__10773.invoke(executor.clj:654) > ~[storm-core-1.2.2.jar:1.2.2] > at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) > [storm-core-1.2.2.jar:1.2.2] > at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > 2018-05-06 06:55:49.702 o.a.s.util Thread-6-eventKafkaSpout-executor[3 > 3] [ERROR] Halting process: ("Worker died") > java.lang.RuntimeException: ("Worker died") > at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341) > [storm-core-1.2.2.jar:1.2.2] > at clojure.lang.RestFn.invoke(RestFn.java:423) > [clojure-1.7.0.jar:?] > at org.apache.storm.daemon.worker$fn__11404$fn__11405. > invoke(worker.clj:792) > [storm-core-1.2.2.jar:1.2.2] > at org.apache.storm.daemon.executor$mk_executor_data$fn__ > 10612$fn__10613.invoke(executor.clj:281) > [storm-core-1.2.2.jar:1.2.2] > at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:494) > [storm-core-1.2.2.jar:1.2.2] > at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > 2018-05-06 06:55:49.706 o.a.s.d.worker Thread-15 [INFO] Shutting down > worker metricAggregation_ec2-34-248-249-45-eu-west-1-compute- > amazonaws-com_defaultStormTopic-106-1525589734 > 871ced6c-14c4-4a59-a774-579bf357314f 6714 > 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Terminating > messaging context > 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Shutting down > executors > > This exception looks like > https://issues.apache.org/jira/browse/STORM-3032, but I can't tell for > sure, except that the line number of the exception corresponds to this > line of KafkaSpout.java: > > offsetManagers.get(tp).addToEmitMsgs(msgId.offset()); > > > To sum up, I am voting [-1] on this 1.2.0rc2, because I have the > feeling that exceptions in Kafka Spout should be gracefully caught and > never lead to work crashes. I understand that the root cause of these > exceptions can come from the specific code we have in our topologies, > and for the 1st case I was glad to see it because it was an easy fix > on our side, but nevertheless on a production system, one can have > sometimes exceptions and the performance pain of workers crash is > simply not affordable in production. > > I don't know if a JIRA already exists on this generic issue that kafka > spout exceptions should be gracefully catched (and maybe lead to > "Failed" tuples, so that there would be a tracking anyway? or at least > a log message in Storm UI ?) > > Please note that this JIRA would differ from STORM-3032 because > STORM-3032 seems to be specific to one case, where as in my opinion > the crash issue is more generic - as my two different cases show. > > Best regards, > Alexandre Vermeerbergen > > > 2018-05-03 19:18 GMT+02:00 P. Taylor Goetz <[email protected]>: > > CORRECTION: The Nexus staging repository for this rc is: > > > > https://repository.apache.org/content/repositories/orgapachestorm-1064 > > > > > > On May 3, 2018, at 11:42 AM, P. Taylor Goetz <[email protected]> wrote: > > > > This is a call to vote on releasing Apache Storm 1.2.2 (rc2) > > > > Full list of changes in this release: > > > > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1. > 2.2-rc2/RELEASE_NOTES.html > > > > The tag/commit to be voted upon is v1.2.2: > > > > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=tree;h= > 7cb19fb3befa65e5ff9e5e02f38e16de865982a9;hb=e001672cf0ea59fe6989b563fb6bbb > 450fe8e7e5 > > > > The source archive being voted upon can be found here: > > > > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1. > 2.2-rc2/apache-storm-1.2.2-src.tar.gz > > > > Other release files, signatures and digests can be found here: > > > > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.2.2-rc2/ > > > > The release artifacts are signed with the following key: > > > > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=blob_ > plain;f=KEYS;hb=22b832708295fa2c15c4f3c70ac0d2bc6fded4bd > > > > The Nexus staging repository for this release is: > > > > https://repository.apache.org/content/repositories/orgapachestorm-1062 > > > > Please vote on releasing this package as Apache Storm 1.2.2. > > > > When voting, please list the actions taken to verify the release. > > > > This vote will be open for 72 hours or until at least 3 PMC members vote > +1. > > > > [ ] +1 Release this package as Apache Storm 1.2.2 > > [ ] 0 No opinion > > [ ] -1 Do not release this package because... > > > > Thanks to everyone who contributed to this release. > > > > -Taylor > > > > >
