Chris Riccomini created SAMZA-592:
-------------------------------------

             Summary: getSystemStreamMetadata loops forever when it receives bad metadata
                 Key: SAMZA-592
                 URL: https://issues.apache.org/jira/browse/SAMZA-592
             Project: Samza
          Issue Type: Bug
          Components: kafka
    Affects Versions: 0.9.0
            Reporter: Chris Riccomini
             Fix For: 0.9.0

While investigating SAMZA-576, [~ewencp] discovered a bug in KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite loop when it receives bad metadata from a broker. See [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349] comment.

We experienced this bug last week. We were running a healthy cluster with topics that have a replication factor of 2. We brought down a *single* broker, and jobs would not start while the broker was down. The containers just repeated this error message:

{noformat}
2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. Retrying.
{noformat}

Checking the cluster showed that all partitions were still available, and that bringing down the single broker resulted in proper leadership failover. Samza, however, was not able to start.

I was told by [~clarkhaskins] that it is actually safe to ignore the ReplicaNotAvailableException when fetching metadata. [~ewencp], can you confirm this?

It seems that there are two issues:

# KafkaSystemAdmin.getSystemStreamMetadata never refreshes its metadata when a metadata fetch results in an error code, so it keeps retrying against the same bad data.
# We should allow the metadata fetch to proceed, rather than throwing an exception, if there is a ReplicaNotAvailableException during metadata refreshes.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
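
To make the two issues listed in the description concrete, here is a minimal sketch of the intended retry behavior. It is written in Python for brevity and is not Samza's code (KafkaSystemAdmin is Scala). The `fetch` callable, the `fetch_with_refresh` name, and the bounded `max_attempts` are hypothetical, and treating ReplicaNotAvailable as ignorable is the assumption still awaiting confirmation in this ticket. The numeric values are the Kafka protocol error codes.

```python
from typing import Callable, Dict

# Kafka protocol error codes.
NO_ERROR = 0
LEADER_NOT_AVAILABLE = 5
REPLICA_NOT_AVAILABLE = 9

# Assumption from the ticket: ReplicaNotAvailable is safe to ignore for metadata.
IGNORABLE = {NO_ERROR, REPLICA_NOT_AVAILABLE}


def fetch_with_refresh(fetch: Callable[[], Dict[int, int]],
                       max_attempts: int = 5) -> Dict[int, int]:
    """Return the first usable {partition: error_code} result.

    `fetch` is a hypothetical callable that asks a broker for fresh metadata.
    """
    last: Dict[int, int] = {}
    for _ in range(max_attempts):
        # Issue 1: fetch again on every attempt instead of re-checking stale data.
        last = fetch()
        # Issue 2: only error codes outside IGNORABLE block progress.
        fatal = {p: code for p, code in last.items() if code not in IGNORABLE}
        if not fatal:
            return last
    raise RuntimeError(f"metadata still bad after {max_attempts} attempts: {last}")


# Usage: the first response has a real error, the second only a missing replica.
responses = iter([
    {0: LEADER_NOT_AVAILABLE, 1: NO_ERROR},
    {0: REPLICA_NOT_AVAILABLE, 1: NO_ERROR},
])
result = fetch_with_refresh(lambda: next(responses))
```

With the single-broker outage described above, the second response would be accepted and the job could start. The attempt limit is an addition for the sketch only; the existing code retries indefinitely.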