Hello Stig,

Thank you very much for your answer: I have tested our many
topologies using Kafka Client 2.0.1 instead of Kafka Client 2.0.0, let
them run at full load for a couple of days, and I can confirm that
this exception no longer occurs!

May I suggest that the storm-kafka-client documentation mention that
this version of the Kafka client is recommended?

Kind regards,
Alexandre Vermeerbergen

On Mon, Nov 12, 2018 at 23:17, Stig Rohde Døssing
<stigdoess...@gmail.com> wrote:
>
> I don't think beginningOffsets is null. I think it's missing one of the
> partitions, which would mean the right-hand side of that line evaluates to
> null, giving an NPE when it is assigned to a primitive long.
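>
> To illustrate (a minimal sketch, not the actual Storm code; the class
> and topic names are made up): auto-unboxing a missing map entry is all
> it takes to get this NPE.
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.kafka.common.TopicPartition;
>
>     public class UnboxingNpeDemo {
>         public static void main(String[] args) {
>             // Simulates beginningOffsets() returning a map that is
>             // missing one of the requested partitions.
>             Map<TopicPartition, Long> beginningOffsets = new HashMap<>();
>             beginningOffsets.put(new TopicPartition("events", 0), 0L);
>
>             // Partition 1 is absent, so get() returns null; assigning
>             // null to a primitive long throws a NullPointerException
>             // during auto-unboxing.
>             long earliest = beginningOffsets.get(new TopicPartition("events", 1));
>         }
>     }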
>
> I think this could be due to
> https://issues.apache.org/jira/browse/KAFKA-7044, going by the commit
> message for the fix
> https://github.com/apache/kafka/commit/e2ec2d79c8d5adefc0c764583cec47144dbc5705#diff-b45245913eaae46aa847d2615d62cde0.
> Specifically part 2 sounds a lot like what I think might be happening here.
>
> "
>
> `ConsumerGroupCommand.getLogEndOffsets()` and `getLogStartOffsets()`
> assumed that endOffsets()/beginningOffsets() which eventually call
> Fetcher.fetchOffsetsByTimes(), would return a map with all the topic
> partitions passed to endOffsets()/beginningOffsets() and that values
> are not null. Because of (1), null values were possible if some of the
> topic partitions were already known (in metadata cache) and some not
> (metadata cache did not have entries for some of the topic
> partitions). However, even with fixing (1),
> endOffsets()/beginningOffsets() may return a map with some topic
> partitions missing, when list offset request returns a non-retriable
> error.
>
> "
>
> Basically KafkaOffsetMetric also assumes that when beginningOffsets(partitions)
> is called, the returned map will contain a value for every requested partition.
> Could you try upgrading to Kafka 2.0.1?
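>
> (If the topology is built with Maven, that upgrade is a one-line
> change; the coordinates below are the standard ones for the Java
> client.)
>
>     <dependency>
>         <groupId>org.apache.kafka</groupId>
>         <artifactId>kafka-clients</artifactId>
>         <version>2.0.1</version>
>     </dependency>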
>
> If necessary we can also work around this on the Storm side by skipping the
> metrics if the requested partition isn't in the return values for
> beginningOffsets/endOffsets. Feel free to raise an issue for this.
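>
> A rough sketch of what that guard could look like (hypothetical, not an
> actual patch; earliestTimeOffset is the variable from the quoted source
> line, while endOffsets and latestTimeOffset are assumed by analogy):
>
>     Long earliest = beginningOffsets.get(topicPartition);
>     Long latest = endOffsets.get(topicPartition);
>     if (earliest == null || latest == null) {
>         // Offsets for this partition were not returned (KAFKA-7044);
>         // skip it for this metrics tick instead of NPE-ing on unboxing.
>         continue;
>     }
>     long earliestTimeOffset = earliest;
>     long latestTimeOffset = latest;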
>
> On Mon, Nov 12, 2018 at 21:56, Alexandre Vermeerbergen <
> avermeerber...@gmail.com> wrote:
>
> > Hello,
> >
> > Using a Storm 1.2.3 snapshot from November 3rd, 2018, with all libs
> > (storm-core & storm-kafka-client) taken from the same Git revision,
> > we get the following crash, caused by a NullPointerException in
> > KafkaOffsetMetric.getValueAndReset:
> >
> > 2018-11-12 19:31:30.496 o.a.s.util Thread-9-metricsFromKafka-executor[13 13] [ERROR] Async loop died!
> > java.lang.RuntimeException: java.lang.NullPointerException
> >         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:522) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:487) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:477) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.disruptor$consume_batch.invoke(disruptor.clj:70) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.daemon.executor$fn__9620$fn__9635$fn__9666.invoke(executor.clj:634) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.util$async_loop$fn__561.invoke(util.clj:484) [storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
> >         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
> > Caused by: java.lang.NullPointerException
> >         at org.apache.storm.kafka.spout.metrics.KafkaOffsetMetric.getValueAndReset(KafkaOffsetMetric.java:89) ~[stormjar.jar:?]
> >         at org.apache.storm.daemon.executor$metrics_tick$fn__9544.invoke(executor.clj:345) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at clojure.core$map$fn__4553.invoke(core.clj:2622) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> >         at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
> >         at clojure.core$filter$fn__4580.invoke(core.clj:2679) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.Cons.next(Cons.java:39) ~[clojure-1.7.0.jar:?]
> >         at clojure.lang.RT.next(RT.java:674) ~[clojure-1.7.0.jar:?]
> >         at clojure.core$next__4112.invoke(core.clj:64) ~[clojure-1.7.0.jar:?]
> >         at clojure.core.protocols$fn__6523.invoke(protocols.clj:170) ~[clojure-1.7.0.jar:?]
> >         at clojure.core.protocols$fn__6478$G__6473__6487.invoke(protocols.clj:19) ~[clojure-1.7.0.jar:?]
> >         at clojure.core.protocols$seq_reduce.invoke(protocols.clj:31) ~[clojure-1.7.0.jar:?]
> >         at clojure.core.protocols$fn__6506.invoke(protocols.clj:101) ~[clojure-1.7.0.jar:?]
> >         at clojure.core.protocols$fn__6452$G__6447__6465.invoke(protocols.clj:13) ~[clojure-1.7.0.jar:?]
> >         at clojure.core$reduce.invoke(core.clj:6519) ~[clojure-1.7.0.jar:?]
> >         at clojure.core$into.invoke(core.clj:6600) ~[clojure-1.7.0.jar:?]
> >         at org.apache.storm.daemon.executor$metrics_tick.invoke(executor.clj:349) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.daemon.executor$fn__9620$tuple_action_fn__9626.invoke(executor.clj:522) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.daemon.executor$mk_task_receiver$fn__9609.invoke(executor.clj:471) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.disruptor$clojure_handler$reify__9120.onEvent(disruptor.clj:41) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> >         ... 7 more
> >
> >
> > In the source code, the NullPointerException comes from the following
> > line of KafkaOffsetMetric.java:
> >
> > long earliestTimeOffset = beginningOffsets.get(topicPartition);
> >
> > The NullPointerException crashes the worker process hosting the
> > Spout, which leads to countless Netty error messages until the Spout
> > is restored on another worker.
> >
> > Note: We are using Storm Kafka Client with Kafka Client 2.0.0 and
> > Scala 2.12, on a cluster with 7 Supervisor nodes; the topology that
> > is getting these crashes consumes a very high volume of data from a
> > Kafka topic with 16 partitions.
> > All of this runs on Oracle Java 8 update 192 on CentOS 7.
> >
> > Any idea why beginningOffsets could be null?
> >
> > Kind regards,
> > Alexandre Vermeerbergen
> >
