Hello Stig,

Yes, I can try your fix very quickly if you have a binary artifact (storm-kafka-client.jar, I guess) that I could download, or "copy-paste" instructions I could use to build it (I'm sorry: I tend to be slow at understanding how to retrieve specific pull requests and build artifacts from them).

Best regards,
Alexandre
2018-05-06 10:51 GMT+02:00 Stig Rohde Døssing <[email protected]>:
> I put up what I believe should be a fix at https://github.com/apache/storm/pull/2663; would you be willing to try it out?
>
> Regarding killing the entire worker, you are right that it can be overkill in some cases, but there is a tradeoff to make. Heron runs each component (spout/bolt) in independent JVMs, so if e.g. the spout crashes there, only the JVM hosting that spout will crash and restart. They pay for it by communicating between JVMs more, since a spout can never send a tuple to a bolt without serializing it and crossing process boundaries.
>
> 2018-05-06 10:38 GMT+02:00 Alexandre Vermeerbergen <[email protected]>:
>
>> Hello Stig,
>>
>> Thank you very much for your very fast answer and for opening https://issues.apache.org/jira/browse/STORM-3059.
>>
>> Regarding my generic concern that Kafka Spout exceptions shouldn't kill the spout's worker process, I am still concerned by the scope of the "kill/recovery". Indeed, a worker process generally hosts not only spouts but also bolts. The fact that an occasional spout crash leads to the killing of everything else running on the same worker process seems like overkill (no pun intended) to me.
>>
>> To give an analogy with a web application server, it's as if we agreed that an exception thrown by a servlet could kill the application server's container process. With a cluster of containers and a good load balancer in front this could be OK in production, but still... I feel this is overkill.
>>
>> Back to my specific issues: is there a possibility to have https://issues.apache.org/jira/browse/STORM-3059 in the final Storm 1.2.2?
>>
>> Best regards,
>> Alexandre Vermeerbergen
>>
>> 2018-05-06 9:58 GMT+02:00 Stig Rohde Døssing <[email protected]>:
>> > The exception is caused by the fix in STORM-2994; the new code should only run in AT_LEAST_ONCE mode, not in the others.
>> >
>> > I have raised https://issues.apache.org/jira/browse/STORM-3059 to fix it.
>> >
>> > I disagree that the spout should catch and swallow unknown/unexpected exceptions. Storm is designed to be fail-fast and to restart processes when they error out unexpectedly. I don't think the spout would work any better if it caught and ignored these exceptions.
>> >
>> > 2018-05-06 9:30 GMT+02:00 Alexandre Vermeerbergen <[email protected]>:
>> >
>> >> Hello All,
>> >>
>> >> [ ] -1 Do not release this package because storm-kafka-client spout glitches can crash workers, leading to degraded performance.
>> >>
>> >> I have downloaded the binary artifacts of this Storm 1.2.2 rc2, copied the binaries of storm-kafka-client, flux and flux-wrapper from the Nexus staging repository quoted by Taylor, and ran tests on a relatively modest configuration (1 VM for Nimbus, 1 VM for a Supervisor node, 1 VM for a Zookeeper node) with Java 8 update 172 on CentOS 7. We use storm-kafka-client with Kafka 0.10.2.0 libs against a large cluster of Kafka Brokers at version 1.0.1. We have ~15 topologies running on this setup.
>> >>
>> >> The first glitch I noticed is that, unlike with Storm 1.2.0 which we use in production, deserialization exceptions surfaced in one of our Spouts:
>> >> - With Storm 1.2.0, these exceptions were somehow "swallowed": the spout would consume nothing, but it did not crash its worker.
>> >> - With Storm 1.2.2 rc2, these exceptions showed up, with a crash of the Spout's worker process. Storm restarts the worker, and then the same crash occurs not long after. All this leads to many Netty errors in all our topologies' logs, which clearly means degraded performance (CPU load quite high on the Supervisor VM).
>> >> After fixing the cause of the deserialization exception (which was in our code), this issue disappeared.
>> >>
>> >> The second glitch, which I just noticed, is another "situation" that is similar yet a little different: on another of our topologies, we have a spout throwing the following exception, also leading to its worker's crash:
>> >>
>> >> 2018-05-06 06:55:49.560 o.a.s.k.s.KafkaSpout Thread-6-eventKafkaSpout-executor[3 3] [INFO] Initialization complete
>> >> 2018-05-06 06:55:49.636 o.a.s.util Thread-6-eventKafkaSpout-executor[3 3] [ERROR] Async loop died!
>> >> java.lang.NullPointerException: null
>> >>         at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple(KafkaSpout.java:507) ~[stormjar.jar:?]
>> >>         at org.apache.storm.kafka.spout.KafkaSpout.emitIfWaitingNotEmitted(KafkaSpout.java:440) ~[stormjar.jar:?]
>> >>         at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:308) ~[stormjar.jar:?]
>> >>         at org.apache.storm.daemon.executor$fn__10727$fn__10742$fn__10773.invoke(executor.clj:654) ~[storm-core-1.2.2.jar:1.2.2]
>> >>         at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) [storm-core-1.2.2.jar:1.2.2]
>> >>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>> >>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
>> >> 2018-05-06 06:55:49.641 o.a.s.d.executor Thread-6-eventKafkaSpout-executor[3 3] [ERROR]
>> >> java.lang.NullPointerException: null
>> >>         at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple(KafkaSpout.java:507) ~[stormjar.jar:?]
>> >>         at org.apache.storm.kafka.spout.KafkaSpout.emitIfWaitingNotEmitted(KafkaSpout.java:440) ~[stormjar.jar:?]
>> >>         at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:308) ~[stormjar.jar:?]
>> >>         at org.apache.storm.daemon.executor$fn__10727$fn__10742$fn__10773.invoke(executor.clj:654) ~[storm-core-1.2.2.jar:1.2.2]
>> >>         at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) [storm-core-1.2.2.jar:1.2.2]
>> >>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>> >>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
>> >> 2018-05-06 06:55:49.702 o.a.s.util Thread-6-eventKafkaSpout-executor[3 3] [ERROR] Halting process: ("Worker died")
>> >> java.lang.RuntimeException: ("Worker died")
>> >>         at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341) [storm-core-1.2.2.jar:1.2.2]
>> >>         at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.7.0.jar:?]
>> >>         at org.apache.storm.daemon.worker$fn__11404$fn__11405.invoke(worker.clj:792) [storm-core-1.2.2.jar:1.2.2]
>> >>         at org.apache.storm.daemon.executor$mk_executor_data$fn__10612$fn__10613.invoke(executor.clj:281) [storm-core-1.2.2.jar:1.2.2]
>> >>         at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:494) [storm-core-1.2.2.jar:1.2.2]
>> >>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>> >>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
>> >> 2018-05-06 06:55:49.706 o.a.s.d.worker Thread-15 [INFO] Shutting down worker metricAggregation_ec2-34-248-249-45-eu-west-1-compute-amazonaws-com_defaultStormTopic-106-1525589734 871ced6c-14c4-4a59-a774-579bf357314f 6714
>> >> 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Terminating messaging context
>> >> 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Shutting down executors
>> >>
>> >> This exception looks like https://issues.apache.org/jira/browse/STORM-3032, but I can't tell for sure, except that the line number of the exception corresponds to this line of KafkaSpout.java:
>> >>
>> >> offsetManagers.get(tp).addToEmitMsgs(msgId.offset());
>> >>
>> >> To sum up, I am voting [-1] on this 1.2.2 rc2, because I have the feeling that exceptions in the Kafka Spout should be gracefully caught and should never lead to worker crashes. I understand that the root cause of these exceptions can come from the specific code we have in our topologies, and for the first case I was glad to see it because it was an easy fix on our side. Nevertheless, on a production system exceptions can sometimes happen, and the performance pain of worker crashes is simply not affordable in production.
>> >>
>> >> I don't know if a JIRA already exists for this generic issue, namely that Kafka Spout exceptions should be gracefully caught (and maybe lead to "failed" tuples, so that there is still some tracking, or at least a log message in Storm UI?).
>> >>
>> >> Please note that this JIRA would differ from STORM-3032, because STORM-3032 seems to be specific to one case, whereas in my opinion the crash issue is more generic, as my two different cases show.
>> >>
>> >> Best regards,
>> >> Alexandre Vermeerbergen
>> >>
>> >> 2018-05-03 19:18 GMT+02:00 P. Taylor Goetz <[email protected]>:
>> >> > CORRECTION: The Nexus staging repository for this rc is:
>> >> >
>> >> > https://repository.apache.org/content/repositories/orgapachestorm-1064
>> >> >
>> >> > On May 3, 2018, at 11:42 AM, P. Taylor Goetz <[email protected]> wrote:
>> >> >
>> >> > This is a call to vote on releasing Apache Storm 1.2.2 (rc2)
>> >> >
>> >> > Full list of changes in this release:
>> >> >
>> >> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.2.2-rc2/RELEASE_NOTES.html
>> >> >
>> >> > The tag/commit to be voted upon is v1.2.2:
>> >> >
>> >> > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=tree;h=7cb19fb3befa65e5ff9e5e02f38e16de865982a9;hb=e001672cf0ea59fe6989b563fb6bbb450fe8e7e5
>> >> >
>> >> > The source archive being voted upon can be found here:
>> >> >
>> >> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.2.2-rc2/apache-storm-1.2.2-src.tar.gz
>> >> >
>> >> > Other release files, signatures and digests can be found here:
>> >> >
>> >> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.2.2-rc2/
>> >> >
>> >> > The release artifacts are signed with the following key:
>> >> >
>> >> > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=blob_plain;f=KEYS;hb=22b832708295fa2c15c4f3c70ac0d2bc6fded4bd
>> >> >
>> >> > The Nexus staging repository for this release is:
>> >> >
>> >> > https://repository.apache.org/content/repositories/orgapachestorm-1062
>> >> >
>> >> > Please vote on releasing this package as Apache Storm 1.2.2.
>> >> >
>> >> > When voting, please list the actions taken to verify the release.
>> >> >
>> >> > This vote will be open for 72 hours or until at least 3 PMC members vote +1.
>> >> >
>> >> > [ ] +1 Release this package as Apache Storm 1.2.2
>> >> > [ ] 0 No opinion
>> >> > [ ] -1 Do not release this package because...
>> >> >
>> >> > Thanks to everyone who contributed to this release.
>> >> >
>> >> > -Taylor
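A side note on the first glitch quoted above (deserialization exceptions in user code bringing down the spout's worker): one defensive option on the application side is to catch the failure inside your own value deserializer, so that a bad record never throws out of the Kafka consumer and up through the spout's executor thread. The following is only a minimal sketch, assuming the Deserializer interface from the 0.10.x Kafka clients and constructor injection of the real deserializer; how the wrapper is wired into the spout configuration is deliberately left out, and a real implementation would use a logger and a metric rather than System.err.

```java
import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;

/**
 * Sketch of a wrapper that turns deserialization failures into null values
 * instead of letting the exception propagate and crash the worker process.
 */
public class SafeDeserializer<T> implements Deserializer<T> {

    private final Deserializer<T> delegate;

    public SafeDeserializer(Deserializer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        delegate.configure(configs, isKey);
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return delegate.deserialize(topic, data);
        } catch (RuntimeException e) {
            // A malformed record becomes a null value; the topology can then
            // count it, log it, or fail the tuple explicitly instead of dying.
            System.err.println("Skipping undeserializable record on " + topic + ": " + e);
            return null;
        }
    }

    @Override
    public void close() {
        delegate.close();
    }
}
```

Downstream components then need to treat a null value as "record skipped", for example by emitting a metric or failing the tuple, which is close to the "failed tuples plus a log message in Storm UI" behavior suggested in the thread.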

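On the second glitch, the NullPointerException at KafkaSpout.java:507: the quoted line `offsetManagers.get(tp).addToEmitMsgs(msgId.offset());` can only throw an NPE if the map has no entry for that partition, which matches Stig's diagnosis that the STORM-2994 code path should only run in AT_LEAST_ONCE mode. Below is a minimal, self-contained sketch of that failure mode and of the kind of guard proposed; the classes here are hypothetical stand-ins, not the real KafkaSpout internals.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of how a chained map lookup turns into the NPE quoted above. */
public class OffsetTrackingSketch {

    // Hypothetical stand-ins for the spout's internal types.
    static class TopicPartition {
        final String topic;
        final int partition;
        TopicPartition(String topic, int partition) {
            this.topic = topic;
            this.partition = partition;
        }
    }

    static class OffsetManager {
        void addToEmitMsgs(long offset) {
            // track the emitted offset (details omitted)
        }
    }

    // Only populated when running with an at-least-once guarantee.
    private final Map<TopicPartition, OffsetManager> offsetManagers = new HashMap<>();
    private final boolean atLeastOnce;

    OffsetTrackingSketch(boolean atLeastOnce) {
        this.atLeastOnce = atLeastOnce;
    }

    /** Unguarded: throws NPE when the partition was never registered. */
    void recordEmittedUnguarded(TopicPartition tp, long offset) {
        offsetManagers.get(tp).addToEmitMsgs(offset);
    }

    /** Guarded, in the spirit of "only run this in AT_LEAST_ONCE mode". */
    void recordEmittedGuarded(TopicPartition tp, long offset) {
        if (!atLeastOnce) {
            return; // no offset tracking needed in the other modes
        }
        OffsetManager manager = offsetManagers.get(tp);
        if (manager != null) {
            manager.addToEmitMsgs(offset);
        }
    }

    public static void main(String[] args) {
        OffsetTrackingSketch sketch = new OffsetTrackingSketch(false);
        TopicPartition tp = new TopicPartition("defaultStormTopic", 3);
        sketch.recordEmittedGuarded(tp, 42L); // no-op, no crash
        try {
            sketch.recordEmittedUnguarded(tp, 42L); // no OffsetManager for tp
        } catch (NullPointerException e) {
            System.out.println("NPE, as in the quoted stack trace: " + e);
        }
    }
}
```

Whether the actual fix in https://github.com/apache/storm/pull/2663 takes this exact shape is best checked against the pull request itself; the sketch only illustrates why the crash disappears once the code path is restricted to AT_LEAST_ONCE.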