[ 
https://issues.apache.org/jira/browse/SAMZA-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Riccomini updated SAMZA-592:
----------------------------------
    Attachment: SAMZA-592-1.patch

Attaching updated patch with [~nickpan47]'s feedback on the import statement.

[~ewencp], I could use a spot check on this patch, if you've got the cycles.

> getSystemStreamMetadata loops forever when it receives bad metadata
> -------------------------------------------------------------------
>
>                 Key: SAMZA-592
>                 URL: https://issues.apache.org/jira/browse/SAMZA-592
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>             Fix For: 0.9.0
>
>         Attachments: SAMZA-592-0.patch, SAMZA-592-1.patch
>
>
> While investigating SAMZA-576, [~ewencp] discovered a bug in the 
> KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite 
> loop when it receives bad metadata from a broker. See 
> [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349]
>  comment.
> We experienced this bug last week. We were running a healthy cluster 
> with topics that have a replication factor of 2. We brought down a *single* 
> broker, and jobs would not start while the broker was down. The containers 
> just repeated this error message:
> {noformat}
>   2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets 
> for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. 
> Retrying.
> {noformat}
> Checking the cluster showed that all partitions were still available, and 
> bringing down the single broker resulted in proper leadership failover. 
> Samza, however, was not able to start.
> I was told by [~clarkhaskins] that it was actually safe to ignore the 
> ReplicaNotAvailableException when fetching metadata. [~ewencp], can you 
> confirm this?
> It seems that there are two issues:
> # KafkaSystemAdmin.getSystemStreamMetadata never refreshes data when its 
> metadata fetch results in an error code.
> # We should allow the metadata fetch to proceed, rather than throwing an 
> exception, if there is a ReplicaNotAvailableException during metadata 
> refreshes.
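
The two issues above could be addressed roughly as follows. This is only a minimal sketch and not the contents of the attached patch: the names MetadataRetrySketch, ErrorCode, PartitionMetadata, isUsable and fetchWithRetry are hypothetical stand-ins rather than Samza or Kafka APIs, and the bounded attempt count is an assumption made for illustration. The idea is to re-fetch metadata on every attempt instead of re-checking a stale response, and to treat a replica-not-available error as non-fatal.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are real Samza or Kafka classes.
final class MetadataRetrySketch {
    // Illustrative error codes, not Kafka's actual ErrorMapping values.
    enum ErrorCode { NONE, REPLICA_NOT_AVAILABLE, LEADER_NOT_AVAILABLE }

    // Stand-in for the per-partition metadata returned by a broker.
    static final class PartitionMetadata {
        final int partition;
        final ErrorCode error;

        PartitionMetadata(int partition, ErrorCode error) {
            this.partition = partition;
            this.error = error;
        }
    }

    // Issue 2: a missing replica is tolerated; any other error makes the
    // response unusable.
    static boolean isUsable(List<PartitionMetadata> metadata) {
        for (PartitionMetadata p : metadata) {
            if (p.error != ErrorCode.NONE && p.error != ErrorCode.REPLICA_NOT_AVAILABLE) {
                return false;
            }
        }
        return true;
    }

    // Issue 1: call the fetcher again on every attempt so that a bad
    // response is never reused. The attempt limit is an assumption of
    // this sketch.
    static List<PartitionMetadata> fetchWithRetry(
            Supplier<List<PartitionMetadata>> fetcher, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<PartitionMetadata> metadata = fetcher.get();
            if (isUsable(metadata)) {
                return metadata;
            }
        }
        throw new IllegalStateException(
                "Metadata still unusable after " + maxAttempts + " attempts");
    }
}
```

With a fetcher that returns a leader error on the first call and only a replica error on the second, fetchWithRetry would return the second response after two fetches rather than looping on the first one.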



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
