It looks like KAFKA-7044 affects 1.1.0 and up, so people on earlier
versions aren't affected. I think we should either work around the
issue by skipping the metrics when the bug occurs, or add a link to
KAFKA-7044 to the documentation.
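For reference, the failure mode is an auto-unboxing NPE: `beginningOffsets.get(topicPartition)` returns null for a partition missing from the map, and assigning that to a primitive `long` throws. The skip-the-metrics workaround could look roughly like this (a minimal sketch with hypothetical names, using plain String keys instead of TopicPartition; not the actual KafkaOffsetMetric code):

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetMetricGuard {

    // Returns end - beginning for the partition, or null when either offset
    // is missing from the maps (KAFKA-7044 can leave partitions out of the
    // result of beginningOffsets()/endOffsets()). Keeping the lookup result
    // boxed as Long avoids the unboxing NPE seen in the stack trace below.
    static Long safeLag(Map<String, Long> beginningOffsets,
                        Map<String, Long> endOffsets,
                        String topicPartition) {
        Long earliest = beginningOffsets.get(topicPartition);
        Long latest = endOffsets.get(topicPartition);
        if (earliest == null || latest == null) {
            return null; // skip the metric for this partition
        }
        return latest - earliest;
    }

    public static void main(String[] args) {
        Map<String, Long> begin = new HashMap<>();
        Map<String, Long> end = new HashMap<>();
        begin.put("topic-0", 10L);
        end.put("topic-0", 50L);

        System.out.println(safeLag(begin, end, "topic-0")); // prints 40
        System.out.println(safeLag(begin, end, "topic-1")); // prints null, no NPE
    }
}
```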

On Mon, Nov 19, 2018 at 09:42 Alexandre Vermeerbergen <
[email protected]> wrote:

> Hello Stig,
>
> Thank you very much for your answer: I have tested many of our
> topologies using Kafka Client 2.0.1 instead of Kafka Client 2.0.0, let
> them run at full load for a couple of days, and I can confirm that
> this exception no longer occurs!
>
> May I suggest that the storm-kafka-client documentation mention that
> this version of the Kafka client is recommended?
>
> Kind regards,
> Alexandre Vermeerbergen
>
> On Mon, Nov 12, 2018 at 23:17, Stig Rohde Døssing
> <[email protected]> wrote:
> >
> > I don't think beginningOffsets is null. I think it's missing one of the
> > partitions, which would mean the right hand side of the line is null,
> which
> > gives an NPE when we try to assign it to a primitive long.
> >
> > I think this could be due to
> > https://issues.apache.org/jira/browse/KAFKA-7044, going by the commit
> > message for the fix
> >
> https://github.com/apache/kafka/commit/e2ec2d79c8d5adefc0c764583cec47144dbc5705#diff-b45245913eaae46aa847d2615d62cde0
> .
> > Specifically part 2 sounds a lot like what I think might be happening
> here.
> >
> > "
> >
> > `ConsumerGroupCommand.getLogEndOffsets()` and `getLogStartOffsets()`
> > assumed that endOffsets()/beginningOffsets() which eventually call
> > Fetcher.fetchOffsetsByTimes(), would return a map with all the topic
> > partitions passed to endOffsets()/beginningOffsets() and that values
> > are not null. Because of (1), null values were possible if some of the
> > topic partitions were already known (in metadata cache) and some not
> > (metadata cache did not have entries for some of the topic
> > partitions). However, even with fixing (1),
> > endOffsets()/beginningOffsets() may return a map with some topic
> > partitions missing, when list offset request returns a non-retriable
> > error.
> >
> > "
> >
> > Basically KafkaOffsetMetric also assumes that when
> beginningOffsets(topics)
> > is called, the returned map will contain a value for all requested
> topics.
> > Could you try upgrading to Kafka 2.0.1?
> >
> > If necessary we can also work around this on the Storm side by skipping
> the
> > metrics if the requested partition isn't in the return values for
> > beginningOffsets/endOffsets. Feel free to raise an issue for this.
> >
> > On Mon, Nov 12, 2018 at 21:56 Alexandre Vermeerbergen <
> > [email protected]> wrote:
> >
> > > Hello,
> > >
> > > Using Storm 1.2.3-snapshot of the 3rd of November 2018 with all libs
> > > (storm-core & storm-kafka-client) taken from same Git, we get the
> > > following crash coming from a NullPointerException in
> > > KafkaOffsetMetric.getValueAndReset :
> > >
> > > 2018-11-12 19:31:30.496 o.a.s.util Thread-9-metricsFromKafka-executor[13 13] [ERROR] Async loop died!
> > > java.lang.RuntimeException: java.lang.NullPointerException
> > >     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:522) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:487) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:477) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.disruptor$consume_batch.invoke(disruptor.clj:70) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.daemon.executor$fn__9620$fn__9635$fn__9666.invoke(executor.clj:634) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.util$async_loop$fn__561.invoke(util.clj:484) [storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
> > >     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
> > > Caused by: java.lang.NullPointerException
> > >     at org.apache.storm.kafka.spout.metrics.KafkaOffsetMetric.getValueAndReset(KafkaOffsetMetric.java:89) ~[stormjar.jar:?]
> > >     at org.apache.storm.daemon.executor$metrics_tick$fn__9544.invoke(executor.clj:345) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at clojure.core$map$fn__4553.invoke(core.clj:2622) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.RT.seq(RT.java:507) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core$seq__4128.invoke(core.clj:137) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core$filter$fn__4580.invoke(core.clj:2679) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.Cons.next(Cons.java:39) ~[clojure-1.7.0.jar:?]
> > >     at clojure.lang.RT.next(RT.java:674) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core$next__4112.invoke(core.clj:64) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core.protocols$fn__6523.invoke(protocols.clj:170) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core.protocols$fn__6478$G__6473__6487.invoke(protocols.clj:19) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core.protocols$seq_reduce.invoke(protocols.clj:31) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core.protocols$fn__6506.invoke(protocols.clj:101) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core.protocols$fn__6452$G__6447__6465.invoke(protocols.clj:13) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core$reduce.invoke(core.clj:6519) ~[clojure-1.7.0.jar:?]
> > >     at clojure.core$into.invoke(core.clj:6600) ~[clojure-1.7.0.jar:?]
> > >     at org.apache.storm.daemon.executor$metrics_tick.invoke(executor.clj:349) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.daemon.executor$fn__9620$tuple_action_fn__9626.invoke(executor.clj:522) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.daemon.executor$mk_task_receiver$fn__9609.invoke(executor.clj:471) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.disruptor$clojure_handler$reify__9120.onEvent(disruptor.clj:41) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:509) ~[storm-core-1.2.3-SNAPSHOT.jar:1.2.3-SNAPSHOT]
> > >     ... 7 more
> > >
> > >
> > > In the source code, the NullPointerException comes from the
> > > following line of KafkaOffsetMetric.java:
> > >
> > > long earliestTimeOffset = beginningOffsets.get(topicPartition);
> > >
> > > The NullPointerException crashes the worker process hosting the
> > > Spout, which leads to countless Netty error messages until the
> > > Spout is restarted on another worker.
> > >
> > > Note: We are using Storm Kafka Client with Kafka Client 2.0.0 and
> > > Scala 2.12, on a cluster with 7 Supervisor nodes; the topology that
> > > gets these crashes consumes a very high volume of data from a Kafka
> > > topic with 16 partitions.
> > > All this runs with Oracle Java 8 update 192 on CentOS 7.
> > >
> > > Any idea why beginningOffsets could be null?
> > >
> > > Kind regards,
> > > Alexandre Vermeerbergen
> > >
>
