All,

I followed the backport of storm-kafka-client from 1.1.x to get it working
on Storm 1.0.x, using the Maven override of the Timer class from the newer
Storm version.

It looks like the Metron KafkaSpoutConfig.Builder (the config build-up is a
little difficult to follow, so I might be wrong) results in a
NamedSubscription, which uses the consumer.subscribe API.

Wanted to let you know that this API isn't a good fit for Storm, since
partition reassignment is left to Kafka rather than handled by manual
assignment within Storm. What we are seeing is roughly a 1%-2% duplicate
count across two spout instances just during initial topology startup.

Consider this situation: Two Kafka partitions and two spout tasks.

Task id 1 starts first and polls Kafka, and is then assigned all of the
partitions.
Milliseconds to two seconds later, task id 2 starts up, which forces a Kafka
partition rebalance across both tasks.
Task id 1 will have already played tuples through the topology but can no
longer commit offsets for them.
Task id 2 will potentially replay the same tuples again post rebalance, with
the situation worsening as the number of spout tasks increases.

If the topology uses multiple tasks per executor (i.e., a 1:1 task/executor
mapping is not being used), there is the potential for a blocking action
that causes the other consumers in the same executor to hit their poll
timeout, and the problem cascades on from there during the partition
reassignments after startup.
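
To make the difference concrete, here is a minimal sketch at the plain Kafka
consumer level (not the spout code itself); the broker, group id, and topic
names are placeholders. With subscribe(), the group coordinator is free to
revoke this consumer's partitions the moment another task joins the group:

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SubscribeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker
        props.put("group.id", "spout-group");                 // placeholder group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);

        // subscribe() hands partition assignment to Kafka's group coordinator.
        // When a second spout task joins the same group, the coordinator revokes
        // partitions from this consumer mid-flight -- exactly the window in which
        // tuples already emitted by this task can no longer be committed.
        consumer.subscribe(Collections.singletonList("some-topic"), // placeholder topic
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Offsets for tuples still in flight in the topology are lost here.
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // The other task may now re-read records this task already emitted.
                    }
                });
    }
}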

It looks like there is a ManualPartitionNamedSubscription with a
RoundRobinManualPartitioner, but I have yet to get it to work.
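
In case it helps, this is roughly the wiring I have been attempting. I am
assuming here that the backported 1.1.x Builder has a constructor taking a
Subscription plus class-based deserializers, and that
ManualPartitionNamedSubscription takes a ManualPartitioner plus topic names;
I have not verified those exact signatures (which may be part of why it is
not working for me yet), so treat this strictly as a sketch:

import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.kafka.spout.ManualPartitionNamedSubscription;
import org.apache.storm.kafka.spout.RoundRobinManualPartitioner;
import org.apache.storm.kafka.spout.Subscription;

public class ManualSubscriptionSketch {
    public static KafkaSpout<String, String> buildSpout() {
        // Round-robin the named topic's partitions across the spout tasks so the
        // assignment is decided inside Storm rather than by Kafka's group coordinator.
        Subscription subscription = new ManualPartitionNamedSubscription(
                new RoundRobinManualPartitioner(), "some-topic");  // placeholder topic

        // Assumed Builder overload -- adjust to whatever the backport actually exposes.
        KafkaSpoutConfig<String, String> spoutConfig =
                new KafkaSpoutConfig.Builder<String, String>(
                        "localhost:9092",                           // placeholder broker list
                        StringDeserializer.class,
                        StringDeserializer.class,
                        subscription)
                .build();

        return new KafkaSpout<>(spoutConfig);
    }
}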

This JIRA captures the problem:
https://issues.apache.org/jira/browse/STORM-2542, and there are other
competing PRs/JIRAs open now as well.

Based on a current Storm dev discussion, they are looking to address this in
both master and 1.x by converting to the 'assign' API, which directs exact
partitions to consumers, and by defaulting to that Kafka API going forward.
There is some discussion there about Flux compatibility with the builder
methods, and I thought you all could use visibility on that in particular.
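
For context on what the assign-based direction looks like at the consumer
level (independent of how the spout ends up exposing it), here is a minimal
sketch in which each task claims a deterministic slice of the partitions.
The topic name and the task-index wiring are placeholders; in the real thing
taskIndex/totalTasks would come from the spout's TopologyContext:

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class AssignSketch {
    public static void assignSlice(KafkaConsumer<byte[], byte[]> consumer,
                                   String topic, int taskIndex, int totalTasks) {
        List<TopicPartition> mine = new ArrayList<>();
        for (PartitionInfo p : consumer.partitionsFor(topic)) {
            // Deterministic round-robin across tasks: no group coordinator and no
            // rebalance, so tasks starting at different times cannot trigger replays.
            if (p.partition() % totalTasks == taskIndex) {
                mine.add(new TopicPartition(topic, p.partition()));
            }
        }
        consumer.assign(mine);
    }
}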

Kris
