Stig Rohde Døssing closed STORM-1682.
Resolution: Not A Problem
We misidentified the cause of our topologies hanging as this issue, when it was
actually caused by https://issues.apache.org/jira/browse/STORM-1750. The code
change is fairly large, and its only effect is to prevent the spout from pausing
consumption from a partition if that partition temporarily has no leader. Since
the spout should resume consuming from the partition when getBrokerInfo is
called, which happens every 30 seconds, it's not really a problem.
> Kafka spout can lose partitions
> Key: STORM-1682
> URL: https://issues.apache.org/jira/browse/STORM-1682
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-kafka
> Affects Versions: 0.10.0, 1.0.0, 2.0.0
> Reporter: Stig Rohde Døssing
> Assignee: Stig Rohde Døssing
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
> The KafkaSpout can lose partitions for a period, or hang, because it may get
> a NoNodeException if there is no broker info in Zookeeper corresponding to
> the partition's leader id. When this error occurs, the spout ignores the
> partition until the next time getBrokerInfo is called, which isn't until the
> next time the spout gets an exception on fetch. If the timing is really bad,
> it might ignore all the partitions and never restart.
> As far as I'm aware, Kafka doesn't update the leader and broker info
> atomically, so it's possible to get unlucky and hit the NoNodeException when
> a broker has just died.
> I have a few suggestions for dealing with this.
> getBrokerInfo could simply retry the inner loop over partitions if it gets
> the NoNodeException (probably with a limit on attempts and a short sleep
> between them). If it fails repeatedly, the spout should be crashed.
> Alternatively the DynamicBrokersReader could instead lookup all brokers in
> Zookeeper, create a consumer and send a TopicMetadataRequest on it. The
> response contains the leader for each partition and host/port for the
> relevant brokers.
> Edit: I noticed that the spout periodically refreshes the broker info, so
> the issue isn't as bad as I thought. I still think this change has value,
> since it avoids the spout temporarily dropping a partition.
This message was sent by Atlassian JIRA