Stig Rohde Døssing created STORM-1682:
-----------------------------------------
Summary: Kafka spout can lose partitions
Key: STORM-1682
URL: https://issues.apache.org/jira/browse/STORM-1682
Project: Apache Storm
Issue Type: Bug
Components: storm-kafka
Affects Versions: 0.10.0, 1.0.0, 2.0.0
Reporter: Stig Rohde Døssing
Assignee: Stig Rohde Døssing
The KafkaSpout can lose partitions for a period, or hang, because getBrokersInfo
(https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/org/apache/storm/kafka/DynamicBrokersReader.java#L77)
may get a NoNodeException if there is no broker info in Zookeeper corresponding
to the partition's current leader id. When this error occurs, the spout ignores
the partition until the next time getBrokersInfo is called, which isn't until
the next time the spout gets an exception on fetch. With sufficiently bad
timing, it can end up ignoring all the partitions and never recovering.
As far as I'm aware, Kafka doesn't update the leader and broker info in
Zookeeper atomically, so it's possible to get unlucky and hit the
NoNodeException just after a broker has died.
I have a few suggestions for dealing with this.
getBrokersInfo could simply retry the inner loop over partitions when it gets
the NoNodeException (probably with a retry limit and a short sleep between
attempts). If it fails repeatedly, the spout should be crashed.
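The retry-with-limit idea could be sketched like this. This is a hypothetical helper, not code from storm-kafka; the RETRY_LIMIT and SLEEP_MS values are illustrative, and the real spout would catch only ZooKeeper's NoNodeException rather than all exceptions:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the suggested retry loop around the per-partition
// lookup in getBrokersInfo. Names and constants are illustrative.
class RetryingLookup {
    static final int RETRY_LIMIT = 3;
    static final long SLEEP_MS = 100;

    // Runs the lookup, retrying on failure with a short sleep between
    // attempts; rethrows after RETRY_LIMIT attempts so the spout crashes
    // instead of silently dropping the partition.
    static <T> T withRetries(Callable<T> lookup) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < RETRY_LIMIT; attempt++) {
            try {
                return lookup.call();
            } catch (Exception e) { // real code: catch NoNodeException only
                last = e;
                Thread.sleep(SLEEP_MS);
            }
        }
        throw last; // give up and crash the spout after repeated failures
    }
}
```

Crashing after the limit keeps the current "recover on restart" behavior while removing the window where a partition is silently ignored.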
Alternatively, the DynamicBrokersReader could look up all brokers in Zookeeper,
create a consumer, and send a TopicMetadataRequest over it. A minor downside is
that the API changed slightly between Kafka 0.8 and 0.9, so a little reflection
would be needed to support both versions.
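The reflection shim that second approach would need could look roughly like the sketch below. The class names are my assumptions about where the metadata-request type lives in each release (0.8's kafka.javaapi.TopicMetadataRequest vs. the newer client's org.apache.kafka.common.requests.MetadataRequest), and the sketch only resolves the class; it doesn't build or send the request:

```java
import java.util.Optional;

// Sketch: pick whichever metadata-request class is on the classpath, so one
// spout jar can work against both Kafka 0.8 and 0.9. The class names are
// illustrative assumptions, not verified against both releases.
class MetadataRequestShim {
    private static final String[] CANDIDATES = {
        "kafka.javaapi.TopicMetadataRequest",               // assumed 0.8 location
        "org.apache.kafka.common.requests.MetadataRequest"  // assumed 0.9 location
    };

    // Returns the first metadata-request class found on the classpath,
    // or empty if no Kafka version is available.
    static Optional<Class<?>> resolve() {
        for (String name : CANDIDATES) {
            try {
                return Optional.of(Class.forName(name));
            } catch (ClassNotFoundException ignored) {
                // not this version; try the next candidate
            }
        }
        return Optional.empty();
    }
}
```

The actual request construction and send would then also go through reflection (or a small per-version adapter class chosen by the same lookup).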
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)