Hello Stig,

Thank you very much for your very fast answer and for opening
https://issues.apache.org/jira/browse/STORM-3059.

Regarding my generic concern that Kafka Spout exceptions shouldn't
kill it's worker process, I am still concerned by the scope of the
"kill/recovery".
Indeed, a worker process generally not only hosts spouts, but also bolts.
The fact that a spout occasional crash leads to the killing of
everything else running on the same worker process seems overkill (no
pun intended) to me.

To give an analogy with a web application server, it's like if we
would agree that an exception thrown by a servlet could lead to a kill
of the application server's container process. Yeah with a cluster of
containers and a good load balancer in front this could be OK in
production, but yet... I still feel this overkill.

Back to my precise issues, is there possibility to have
https://issues.apache.org/jira/browse/STORM-3059 in Storm 1.2.2 final
?

Best regards,
Alexandre Vermeerbergen



2018-05-06 9:58 GMT+02:00 Stig Rohde Døssing <[email protected]>:
> The exception is caused by the fix in STORM-2994, the new code should only
> run in AT_LEAST_ONCE mode, not in the others.
>
> Have raised https://issues.apache.org/jira/browse/STORM-3059 to fix it.
>
> I disagree that the spout should catch and swallow unknown/unexpected
> exceptions. Storm is designed to be fail-fast, and to restart processes
> when they error out unexpectedly. I don't think the spout would work any
> better if it caught and ignored these exceptions.
>
> 2018-05-06 9:30 GMT+02:00 Alexandre Vermeerbergen <[email protected]>
> :
>
>> Hello All,
>>
>> [ ] -1 Do not release this package because storm-kafka-client spout
>> glitches can crash workers, leading to degraded performances.
>>
>> I have downloaded the binary artifacts of this storm 1.2.0rc2, copied
>> the binaries of storm-kafka-client, flux and flux-wrapper from the
>> Nessus staging repository quoted by Taylor, and ran tests on a
>> relatively modest configuration (1 VM for Nimbus, 1 VM for a
>> Supervisor node, 1 VM a Zookeeper node) with Java 8 update 172 on
>> CentOS 7. We use storm-kafka-client with Kafka 0.10.2.0 libs against a
>> large cluster of Kafka Brokers at version 1.0.1. We have ~15
>> topologies running on this setup.
>>
>> The first glitch I noticed is a that, unlike with Storm 1.2.0 which we
>> use in production, we had deserialization exceptions in of our our
>> Spout:
>> - With Storm 1.2.0, these exceptions were somehow "swallowed", and
>> this spout would consume nothing and not crashing its worker anyway
>> - With Storm 1.2.2rc2, these exceptions showed up, with a crash of the
>> Spout's worker process. Storm restarts the worker, and then the same
>> crash occurs not long afte. All this leads to many Netty errors in all
>> our topologies' logs, which clearly means bad performances (CPU load
>> quite high on the Supervisor VM)
>> After fixing the cause of the deserialization exception (which was in
>> our code), this issue disappeared.
>>
>> The second glitch which I just noticed is that we have a another
>> "situation" which is similar but yet a little bit different : on
>> another of our topologies, we have a spout throwing the following
>> exception, also leading to its worker's crash:
>>
>> 2018-05-06 06:55:49.560 o.a.s.k.s.KafkaSpout
>> Thread-6-eventKafkaSpout-executor[3 3] [INFO] Initialization complete
>> 2018-05-06 06:55:49.636 o.a.s.util Thread-6-eventKafkaSpout-executor[3
>> 3] [ERROR] Async loop died!
>> java.lang.NullPointerException: null
>>         at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple(
>> KafkaSpout.java:507)
>> ~[stormjar.jar:?]
>>         at org.apache.storm.kafka.spout.KafkaSpout.
>> emitIfWaitingNotEmitted(KafkaSpout.java:440)
>> ~[stormjar.jar:?]
>>         at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(
>> KafkaSpout.java:308)
>> ~[stormjar.jar:?]
>>         at org.apache.storm.daemon.executor$fn__10727$fn__10742$
>> fn__10773.invoke(executor.clj:654)
>> ~[storm-core-1.2.2.jar:1.2.2]
>>         at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484)
>> [storm-core-1.2.2.jar:1.2.2]
>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
>> 2018-05-06 06:55:49.641 o.a.s.d.executor
>> Thread-6-eventKafkaSpout-executor[3 3] [ERROR]
>> java.lang.NullPointerException: null
>>         at org.apache.storm.kafka.spout.KafkaSpout.emitOrRetryTuple(
>> KafkaSpout.java:507)
>> ~[stormjar.jar:?]
>>         at org.apache.storm.kafka.spout.KafkaSpout.
>> emitIfWaitingNotEmitted(KafkaSpout.java:440)
>> ~[stormjar.jar:?]
>>         at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(
>> KafkaSpout.java:308)
>> ~[stormjar.jar:?]
>>         at org.apache.storm.daemon.executor$fn__10727$fn__10742$
>> fn__10773.invoke(executor.clj:654)
>> ~[storm-core-1.2.2.jar:1.2.2]
>>         at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484)
>> [storm-core-1.2.2.jar:1.2.2]
>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
>> 2018-05-06 06:55:49.702 o.a.s.util Thread-6-eventKafkaSpout-executor[3
>> 3] [ERROR] Halting process: ("Worker died")
>> java.lang.RuntimeException: ("Worker died")
>>         at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
>> [storm-core-1.2.2.jar:1.2.2]
>>         at clojure.lang.RestFn.invoke(RestFn.java:423)
>> [clojure-1.7.0.jar:?]
>>         at org.apache.storm.daemon.worker$fn__11404$fn__11405.
>> invoke(worker.clj:792)
>> [storm-core-1.2.2.jar:1.2.2]
>>         at org.apache.storm.daemon.executor$mk_executor_data$fn__
>> 10612$fn__10613.invoke(executor.clj:281)
>> [storm-core-1.2.2.jar:1.2.2]
>>         at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:494)
>> [storm-core-1.2.2.jar:1.2.2]
>>         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
>> 2018-05-06 06:55:49.706 o.a.s.d.worker Thread-15 [INFO] Shutting down
>> worker metricAggregation_ec2-34-248-249-45-eu-west-1-compute-
>> amazonaws-com_defaultStormTopic-106-1525589734
>> 871ced6c-14c4-4a59-a774-579bf357314f 6714
>> 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Terminating
>> messaging context
>> 2018-05-06 06:55:49.707 o.a.s.d.worker Thread-15 [INFO] Shutting down
>> executors
>>
>> This exception looks like
>> https://issues.apache.org/jira/browse/STORM-3032, but I can't tell for
>> sure, except that the line number of the exception corresponds to this
>> line of KafkaSpout.java:
>>
>>                 offsetManagers.get(tp).addToEmitMsgs(msgId.offset());
>>
>>
>> To sum up, I am voting [-1] on this 1.2.0rc2, because I have the
>> feeling that exceptions in Kafka Spout should be gracefully caught and
>> never lead to work crashes. I understand that the root cause of these
>> exceptions can come from the specific code we have in our topologies,
>> and for the 1st case I was glad to see it because it was an easy fix
>> on our side, but nevertheless on a production system, one can have
>> sometimes exceptions and the performance pain of workers crash is
>> simply not affordable in production.
>>
>> I don't know if a JIRA already exists on this generic issue that kafka
>> spout exceptions should be gracefully catched (and maybe lead to
>> "Failed" tuples, so that there would be a tracking anyway? or at least
>> a log message in Storm UI ?)
>>
>> Please note that this JIRA would differ from STORM-3032 because
>> STORM-3032 seems to be specific to one case, where as in my opinion
>> the crash issue is more generic - as my two different cases show.
>>
>> Best regards,
>> Alexandre Vermeerbergen
>>
>>
>> 2018-05-03 19:18 GMT+02:00 P. Taylor Goetz <[email protected]>:
>> > CORRECTION: The Nexus staging repository for this rc is:
>> >
>> > https://repository.apache.org/content/repositories/orgapachestorm-1064
>> >
>> >
>> > On May 3, 2018, at 11:42 AM, P. Taylor Goetz <[email protected]> wrote:
>> >
>> > This is a call to vote on releasing Apache Storm 1.2.2 (rc2)
>> >
>> > Full list of changes in this release:
>> >
>> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.
>> 2.2-rc2/RELEASE_NOTES.html
>> >
>> > The tag/commit to be voted upon is v1.2.2:
>> >
>> > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=tree;h=
>> 7cb19fb3befa65e5ff9e5e02f38e16de865982a9;hb=e001672cf0ea59fe6989b563fb6bbb
>> 450fe8e7e5
>> >
>> > The source archive being voted upon can be found here:
>> >
>> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.
>> 2.2-rc2/apache-storm-1.2.2-src.tar.gz
>> >
>> > Other release files, signatures and digests can be found here:
>> >
>> > https://dist.apache.org/repos/dist/dev/storm/apache-storm-1.2.2-rc2/
>> >
>> > The release artifacts are signed with the following key:
>> >
>> > https://git-wip-us.apache.org/repos/asf?p=storm.git;a=blob_
>> plain;f=KEYS;hb=22b832708295fa2c15c4f3c70ac0d2bc6fded4bd
>> >
>> > The Nexus staging repository for this release is:
>> >
>> > https://repository.apache.org/content/repositories/orgapachestorm-1062
>> >
>> > Please vote on releasing this package as Apache Storm 1.2.2.
>> >
>> > When voting, please list the actions taken to verify the release.
>> >
>> > This vote will be open for 72 hours or until at least 3 PMC members vote
>> +1.
>> >
>> > [ ] +1 Release this package as Apache Storm 1.2.2
>> > [ ]  0 No opinion
>> > [ ] -1 Do not release this package because...
>> >
>> > Thanks to everyone who contributed to this release.
>> >
>> > -Taylor
>> >
>> >
>>

Reply via email to