I put up what I believe should be a fix at https://github.com/apache/storm/pull/2663, would you be willing to try it out?
Regarding killing the entire worker, you are right that it can be overkill in some cases, but there's a tradeoff you have to make. Heron runs each component (spout/bolt) in independent JVMs, so if e.g. the spout crashes there then only the JVM hosting that spout will crash and restart. They pay for it by having to communicate between JVMs more, since they never have situations where a spout can send a tuple to a bolt without having to serialize it and go between processes. 2018-05-06 10:38 GMT+02:00 Alexandre Vermeerbergen <[email protected] >: > Hello Stig, > > Thank you very much for your very fast answer and for opening > https://issues.apache.org/jira/browse/STORM-3059. > > Regarding my generic concern that Kafka Spout exceptions shouldn't > kill it's worker process, I am still concerned by the scope of the > "kill/recovery". > Indeed, a worker process generally not only hosts spouts, but also bolts. > The fact that a spout occasional crash leads to the killing of > everything else running on the same worker process seems overkill (no > pun intended) to me. > > To give an analogy with a web application server, it's like if we > would agree that an exception thrown by a servlet could lead to a kill > of the application server's container process. Yeah with a cluster of > containers and a good load balancer in front this could be OK in > production, but yet... I still feel this overkill. > > Back to my precise issues, is there possibility to have > https://issues.apache.org/jira/browse/STORM-3059 in Storm 1.2.2 final > ? > > Best regards, > Alexandre Vermeerbergen > > > > 2018-05-06 9:58 GMT+02:00 Stig Rohde Døssing <[email protected]>: > > The exception is caused by the fix in STORM-2994, the new code should > only > > run in AT_LEAST_ONCE mode, not in the others. > > > > Have raised https://issues.apache.org/jira/browse/STORM-3059 to fix it. > > > > I disagree that the spout should catch and swallow unknown/unexpected > > exceptions. Storm is designed to be fail-fast, and to restart processes > > when they error out unexpectedly. I don't think the spout would work any > > better if it caught and ignored these exceptions. > > > > 2018-05-06 9:30 GMT+02:00 Alexandre Vermeerbergen < > [email protected]> > > : > > > >> Hello All, > >> > >> [ ] -1 Do not release this package because storm-kafka-client spout > >> glitches can crash workers, leading to degraded performances. > >> > >> I have downloaded the binary artifacts of this storm 1.2.0rc2, copied > >> the binaries of storm-kafka-client, flux and flux-wrapper from the > >> Nessus staging repository quoted by Taylor, and ran tests on a > >> relatively modest configuration (1 VM for Nimbus, 1 VM for a > >> Supervisor node, 1 VM a Zookeeper node) with Java 8 update 172 on > >> CentOS 7. We use storm-kafka-client with Kafka 0.10.2.0 libs against a > >> large cluster of Kafka Brokers at version 1.0.1. We have ~15 > >> topologies running on this setup. > >> > >> The first glitch I noticed is a that, unlike with Storm 1.2.0 which we > >> use in production, we had deserialization exceptions in of our our > >> Spout: > >> - With Storm 1.2.0, these exceptions were somehow "swallowed", and > >> this spout would consume nothing and not crashing its worker anyway > >> - With Storm 1.2.2rc2, these exceptions showed up, with a crash of the > >> Spout's worker process. Storm restarts the worker, and then the same > >> crash occurs not long afte. All this leads to many Netty errors in all > >> our topologies' logs, which clearly means bad performances (CPU load > >> quite high on the Supervisor VM) > >> After fixing the cause of the deserialization exception (which was in > >> our code), this issue disappeared. > >> > >> The second glitch which I just noticed is that we have a another > >> "situation" which is similar but yet a little bit different : on > >> another of our topologies, we have a spout throwing the following > >> exception, also leading to its worker's crash: > >> > >> 2018-05-06 06:55:49.560 o.a.s.k.s.KafkaSpout > >> Thread-6-eventKafkaSpout-executor[3 3] [INFO] Initialization complete > >> 2018-05-06 06:55:49.636 o.a.s.util Thread-6-eventKafkaSpout-executor[3 > >> 3] [ERROR] Async loop died! > >> java.lang.NullPointerException: null > >> at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple( > >> KafkaSpout.java:507) > >> ~[stormjar.jar:?] > >> at org.apache.storm.kafka.spout.KafkaSpout. > >> emitIfWaitingNotEmitted(KafkaSpout.java:440) > >> ~[stormjar.jar:?] > >> at org.apache.storm.kafka.spout.KafkaSpout.nextTuple( > >> KafkaSpout.java:308) > >> ~[stormjar.jar:?] > >> at org.apache.storm.daemon.executor$fn__10727$fn__10742$ > >> fn__10773.invoke(executor.clj:654) > >> ~[storm-core-1.2.2.jar:1.2.2] > >> at org.apache.storm.util$async_loop$fn__553.invoke(util.clj: > 484) > >> [storm-core-1.2.2.jar:1.2.2] > >> at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?] > >> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > >> 2018-05-06 06:55:49.641 o.a.s.d.executor > >> Thread-6-eventKafkaSpout-executor[3 3] [ERROR] > >> java.lang.NullPointerException: null > >> at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple( > >> KafkaSpout.java:507) > >> ~[stormjar.jar:?] > >> at org.apache.storm.kafka.spout.KafkaSpout. > >> emitIfWaitingNotEmitted(KafkaSpout.java:440) > >> ~[stormjar.jar:?] > >> at org.apache.storm.kafka.spout.KafkaSpout.nextTuple( > >> KafkaSpout.java:308) > >> ~[stormjar.jar:?] > >> at org.apache.storm.daemon.executor$fn__10727$fn__10742$ > >> fn__10773.invoke(executor.clj:654) > >> ~[storm-core-1.2.2.jar:1.2.2] > >> at org.apache.storm.util$async_loop$fn__553.invoke(util.clj: > 484) > >> [storm-core-1.2.2.jar:1.2.2] > >> at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?] > >> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > >> 2018-05-06 06:55:49.702 o.a.s.util Thread-6-eventKafkaSpout-executor[3 > >> 3] [ERROR] Halting process: ("Worker died") > >> java.lang.RuntimeException: ("Worker died") > >> at org.apache.storm.util$exit_process_BANG_.doInvoke(util. > clj:341) > >> [storm-core-1.2.2.jar:1.2.2] > >> at clojure.lang.RestFn.invoke(RestFn.java:423) > >> [clojure-1.7.0.jar:?] > >> at org.apache.storm.daemon.worker$fn__11404$fn__11405. > >> invoke(worker.clj:792) > >> [storm-core-1.2.2.jar:1.2.2] > >> at org.apache.storm.daemon.executor$mk_executor_data$fn__ > >> 10612$fn__10613.invoke(executor.clj:281) > >> [storm-core-1.2.2.jar:1.2.2] > >> at org.apache.storm.util$async_loop$fn__553.invoke(util.clj: > 494) > >> [storm-core-1.2.2.jar:1.2.2] > >> at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?] > >> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172] > >> 2018-05-06 06:55:49.706 o.a.s.d.worker Thread-15 [INFO] Shutting down > >> worker metricAggregation_ec2-34-248-249-45-eu-west-1-compute- > >> amazonaws-com_defaultStormTopic-106-1525589734 > >> 871ced6c-14c4-4a59-a774-579bf357314f 6714 > >> 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Terminating > >> messaging context > >> 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Shutting down > >> executors > >> > >> This exception looks like > >> https://issues.apache.org/jira/browse/STORM-3032, but I can't tell for > >> sure, except that the line number of the exception corresponds to this > >> line of KafkaSpout.java: > >> > >> offsetManagers.get(tp).addToEmitMsgs(msgId.offset()); > >> > >> > >> To sum up, I am voting [-1] on this 1.2.0rc2, because I have the > >> feeling that exceptions in Kafka Spout should be gracefully caught and > >> never lead to work crashes. I understand that the root cause of these > >> exceptions can come from the specific code we have in our topologies, > >> and for the 1st case I was glad to see it because it was an easy fix > >> on our side, but nevertheless on a production system, one can have > >> sometimes exceptions and the performance pain of workers crash is > >> simply not affordable in production. > >> > >> I don't know if a JIRA already exists on this generic issue that kafka > >> spout exceptions should be gracefully catched (and maybe lead to > >> "Failed" tuples, so that there would be a tracking anyway? or at least > >> a log message in Storm UI ?) > >> > >> Please note that this JIRA would differ from STORM-3032 because > >> STORM-3032 seems to be specific to one case, where as in my opinion > >> the crash issue is more generic - as my two different cases show. > >> > >> Best regards, > >> Alexandre Vermeerbergen > >> > >> > >> 2018-05-03 19:18 GMT+02:00 P. Taylor Goetz <[email protected]>: > >> > CORRECTION: The Nexus staging repository for this rc is: > >> > > >> > https://repository.apache.org/content/repositories/ > orgapachestorm-1064 > >> > > >> > > >> > On May 3, 2018, at 11:42 AM, P. Taylor Goetz <[email protected]> > wrote: > >> > > >> > This is a call to vote on releasing Apache Storm 1.2.2 (rc2) > >> > > >> > Full list of changes in this release: > >> > > >> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1. > >> 2.2-rc2/RELEASE_NOTES.html > >> > > >> > The tag/commit to be voted upon is v1.2.2: > >> > > >> > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=tree;h= > >> 7cb19fb3befa65e5ff9e5e02f38e16de865982a9;hb= > e001672cf0ea59fe6989b563fb6bbb > >> 450fe8e7e5 > >> > > >> > The source archive being voted upon can be found here: > >> > > >> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1. > >> 2.2-rc2/apache-storm-1.2.2-src.tar.gz > >> > > >> > Other release files, signatures and digests can be found here: > >> > > >> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.2.2-rc2/ > >> > > >> > The release artifacts are signed with the following key: > >> > > >> > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=blob_ > >> plain;f=KEYS;hb=22b832708295fa2c15c4f3c70ac0d2bc6fded4bd > >> > > >> > The Nexus staging repository for this release is: > >> > > >> > https://repository.apache.org/content/repositories/ > orgapachestorm-1062 > >> > > >> > Please vote on releasing this package as Apache Storm 1.2.2. > >> > > >> > When voting, please list the actions taken to verify the release. > >> > > >> > This vote will be open for 72 hours or until at least 3 PMC members > vote > >> +1. > >> > > >> > [ ] +1 Release this package as Apache Storm 1.2.2 > >> > [ ] 0 No opinion > >> > [ ] -1 Do not release this package because... > >> > > >> > Thanks to everyone who contributed to this release. > >> > > >> > -Taylor > >> > > >> > > >> >
