[ https://issues.apache.org/jira/browse/SAMZA-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Riccomini updated SAMZA-592:
----------------------------------
Attachment: SAMZA-592-1.patch
Attaching updated patch with [~nickpan47]'s feedback on the import statement.
[~ewencp], I could use a spot check on this patch, if you've got the cycles.
> getSystemStreamMetadata loops forever when it receives bad metadata
> -------------------------------------------------------------------
>
> Key: SAMZA-592
> URL: https://issues.apache.org/jira/browse/SAMZA-592
> Project: Samza
> Issue Type: Bug
> Components: kafka
> Affects Versions: 0.9.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Fix For: 0.9.0
>
> Attachments: SAMZA-592-0.patch, SAMZA-592-1.patch
>
>
> While investigating SAMZA-576, [~ewencp] discovered a bug in the
> KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite
> loop when it receives bad metadata from a broker. See
> [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349]
> comment.
> We experienced this bug last week. We were running a healthy cluster
> with topics that have a replication factor of 2. We brought down a *single*
> broker, and jobs would not start while the broker was down. The containers
> just repeated this error message:
> {noformat}
> 2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets
> for streams [some-topic] due to kafka.common.ReplicaNotAvailableException.
> Retrying.
> {noformat}
> Checking the cluster showed that all partitions were still available, and
> bringing down the single broker resulted in proper leadership failover.
> Samza, however, was not able to start.
> I was told by [~clarkhaskins] that it was actually safe to ignore the
> ReplicaNotAvailableException when fetching metadata. [~ewencp], can you
> confirm this?
> It seems that there are two issues:
> # KafkaSystemAdmin.getSystemStreamMetadata never refreshes its metadata
> when a metadata fetch returns an error code.
> # We should allow the metadata fetch to proceed, rather than throwing an
> exception, if there is a ReplicaNotAvailableException during metadata
> refreshes.
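
To make the two points above concrete, here is a minimal sketch of a retry loop that re-fetches metadata on every attempt and treats a replica-not-available error as non-fatal. It is written in Python for brevity and is not the KafkaSystemAdmin code or the attached patch. The names fetch_with_refresh, MetadataError, and IGNORABLE are hypothetical, the metadata is modeled as a plain partition-to-error-code mapping, and the bounded attempt count is an assumption of the sketch. The numeric values follow the Kafka protocol error codes (5 for leader not available, 9 for replica not available).

```python
from typing import Callable, Dict

# Kafka protocol error codes used in this sketch.
NO_ERROR = 0
LEADER_NOT_AVAILABLE = 5
REPLICA_NOT_AVAILABLE = 9

# Assumption: these codes are safe to ignore when reading topic metadata.
IGNORABLE = {NO_ERROR, REPLICA_NOT_AVAILABLE}


class MetadataError(Exception):
    """Raised when metadata is still bad after all attempts (hypothetical)."""


def fetch_with_refresh(
    fetch: Callable[[], Dict[int, int]], max_attempts: int = 5
) -> Dict[int, int]:
    """Return {partition: error_code}, calling fetch() again on every attempt."""
    bad: Dict[int, int] = {}
    for _ in range(max_attempts):
        # Issue 1: refresh inside the loop instead of re-checking a stale result.
        metadata = fetch()
        # Issue 2: only error codes outside IGNORABLE count as failures.
        bad = {p: code for p, code in metadata.items() if code not in IGNORABLE}
        if not bad:
            return metadata
    raise MetadataError(f"gave up after {max_attempts} attempts: {bad}")


# Usage: a fake broker whose first answer is bad and whose second answer
# carries only a replica-not-available code.
responses = iter([
    {0: LEADER_NOT_AVAILABLE, 1: NO_ERROR},
    {0: NO_ERROR, 1: REPLICA_NOT_AVAILABLE},
])
result = fetch_with_refresh(lambda: next(responses))
```

Here the second response is accepted even though partition 1 reports a missing replica, which mirrors the single-broker-down scenario described above. A production retry loop would likely also sleep between attempts.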
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)