Chris Riccomini created SAMZA-592:
-------------------------------------

             Summary: getSystemStreamMetadata loops forever when it receives bad metadata
                 Key: SAMZA-592
                 URL: https://issues.apache.org/jira/browse/SAMZA-592
             Project: Samza
          Issue Type: Bug
          Components: kafka
    Affects Versions: 0.9.0
            Reporter: Chris Riccomini
             Fix For: 0.9.0


While investigating SAMZA-576, [~ewencp] discovered a bug in the 
KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite 
loop when it receives bad metadata from a broker. See 
[this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349]
 comment.

We experienced this bug last week. We were running a healthy cluster with 
topics that have a replication factor of 2. We brought down a *single* broker, 
and jobs would not start while the broker was down. The containers just 
repeated this error message:

{noformat}
  2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. Retrying.
{noformat}

Checking the cluster showed that all partitions were still available, and 
bringing down the single broker resulted in proper leadership failover. Samza, 
however, was not able to start.

I was told by [~clarkhaskins] that it was actually safe to ignore the 
ReplicaNotAvailableException when fetching metadata. [~ewencp], can you confirm 
this?

It seems that there are two issues:

# KafkaSystemAdmin.getSystemStreamMetadata never refreshes its metadata when a 
metadata fetch results in an error code.
# We should allow the metadata fetch to proceed, rather than throwing an 
exception, if there is a ReplicaNotAvailableException during metadata refreshes.
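
To make the two points above concrete, here is a rough, self-contained sketch of a retry loop that has both properties. It is not the KafkaSystemAdmin code and does not use the Kafka client API: MetadataRetrySketch, ErrorCode, PartitionMetadata, isUsable, fetchWithRefresh, and the maxAttempts bound are all hypothetical names made up for illustration. It also assumes that ReplicaNotAvailableException really is safe to ignore, which is still to be confirmed.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch only: none of these types are Samza or Kafka APIs.
final class MetadataRetrySketch {
    // Stand-in for the per-partition error a broker can report.
    enum ErrorCode { NONE, REPLICA_NOT_AVAILABLE, LEADER_NOT_AVAILABLE }

    // Stand-in for one partition's entry in a topic metadata response.
    static final class PartitionMetadata {
        final int partition;
        final ErrorCode error;

        PartitionMetadata(int partition, ErrorCode error) {
            this.partition = partition;
            this.error = error;
        }
    }

    // Issue 2: a missing replica is treated as usable (assumption: safe to
    // ignore, pending confirmation). Any other error still forces a retry.
    static boolean isUsable(PartitionMetadata p) {
        return p.error == ErrorCode.NONE
            || p.error == ErrorCode.REPLICA_NOT_AVAILABLE;
    }

    // Issue 1: call the fetcher again on every attempt, so that a bad
    // response is never re-examined forever. The attempt bound is an
    // illustrative choice; real code would likely back off between tries.
    static List<PartitionMetadata> fetchWithRefresh(
            Supplier<List<PartitionMetadata>> fetcher, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<PartitionMetadata> metadata = fetcher.get();
            if (metadata.stream().allMatch(MetadataRetrySketch::isUsable)) {
                return metadata;
            }
        }
        throw new IllegalStateException(
            "Metadata still unusable after " + maxAttempts + " attempts");
    }
}
```

With a fetcher that reports LEADER_NOT_AVAILABLE on the first call and REPLICA_NOT_AVAILABLE afterwards, fetchWithRefresh returns on the second call instead of spinning, which is the behavior we wanted when the single broker was down.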



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
