[ https://issues.apache.org/jira/browse/KAFKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikumar Reddy reassigned KAFKA-2459:
--------------------------------------
Assignee: Manikumar Reddy (was: Neha Narkhede)
> Connection backoff/blackout period should start when a connection is
> disconnected, not when the connection attempt was initiated
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-2459
> URL: https://issues.apache.org/jira/browse/KAFKA-2459
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer, producer
> Affects Versions: 0.8.2.1
> Reporter: Ewen Cheslack-Postava
> Assignee: Manikumar Reddy
>
> Currently the connection code for new clients marks the time when a
> connection was initiated (NodeConnectionState.lastConnectMs) and then uses
> this to compute blackout periods for nodes, during which connections will not
> be attempted and the node is not considered a candidate for leastLoadedNode.
> However, in cases where the connection attempt takes longer than the
> blackout/backoff period (default 10ms), this results in incorrect behavior.
> If a broker is not available and does not explicitly reject the connection,
> instead leaving the client to wait for a connection timeout (e.g. due to
> firewall settings), then the backoff period will already have elapsed by the
> time the attempt fails. The node is then immediately considered ready for a
> new connection attempt and a candidate for leastLoadedNode to select for
> metadata updates. I think it should be easy to reproduce and verify this
> problem manually by using tc to introduce enough latency to make connection
> failures take > 10ms.
> The correct behavior would use the disconnection event to mark the end of the
> last connection attempt and then wait for the backoff period to elapse after
> that.
> See
> http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
> for the original description of the problem.
> This is related to KAFKA-1843 because leastLoadedNode currently will
> consistently choose the same node if this blackout period is not handled
> correctly, but is a much smaller issue.
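
To make the difference concrete, here is a minimal sketch of the two ways
the blackout period could be computed. It is not the actual Kafka client
code: the class NodeBackoffSketch, its methods, and the lastDisconnectMs
field are hypothetical names chosen for illustration. Only lastConnectMs
(modelled here as lastConnectAttemptMs) and the 10ms default backoff come
from the description above.

```java
// Hypothetical illustration only; names and structure are assumptions,
// not the real NodeConnectionState implementation.
final class NodeBackoffSketch {
    private final long reconnectBackoffMs;
    private long lastConnectAttemptMs = -1L; // plays the role of lastConnectMs
    private long lastDisconnectMs = -1L;     // assumed extra field for the fix
    private boolean disconnected = false;

    NodeBackoffSketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Record the time at which a connection attempt was initiated.
    void markConnecting(long nowMs) {
        lastConnectAttemptMs = nowMs;
        disconnected = false;
    }

    // Record the time at which the attempt failed or the connection dropped.
    void markDisconnected(long nowMs) {
        lastDisconnectMs = nowMs;
        disconnected = true;
    }

    // Behavior described in the issue: backoff measured from attempt start.
    boolean isBlackedOutFromAttempt(long nowMs) {
        return disconnected && nowMs - lastConnectAttemptMs < reconnectBackoffMs;
    }

    // Proposed behavior: backoff measured from the disconnection event.
    boolean isBlackedOutFromDisconnect(long nowMs) {
        return disconnected && nowMs - lastDisconnectMs < reconnectBackoffMs;
    }
}
```

With a 10ms backoff, an attempt started at t=0 that only fails at t=30 is
already outside the blackout window under the first check, so the node is
retried right away. Under the second check it stays blacked out until t=40.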