[ 
https://issues.apache.org/jira/browse/KAFKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewen Cheslack-Postava updated KAFKA-2459:
-----------------------------------------
    Reviewer: Guozhang Wang
      Status: Patch Available  (was: Open)

> Connection backoff/blackout period should start when a connection is 
> disconnected, not when the connection attempt was initiated
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2459
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2459
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, producer 
>    Affects Versions: 0.8.2.1
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Eno Thereska
>
> Currently the connection code for new clients marks the time when a 
> connection was initiated (NodeConnectionState.lastConnectMs) and then uses 
> this to compute blackout periods for nodes, during which connections will not 
> be attempted and the node is not considered a candidate for leastLoadedNode.
> However, in cases where the connection attempt takes longer than the 
> blackout/backoff period (default 10ms), this results in incorrect behavior. 
> If a broker is not available and, for example, the broker does not explicitly 
> reject the connection, instead waiting for a connection timeout (e.g. due to 
> firewall settings), then the backoff period will have already elapsed and the 
> node will immediately be considered ready for a new connection attempt and a 
> node to be selected by leastLoadedNode for metadata updates. I think it 
> should be easy to reproduce and verify this problem manually by using tc to 
> introduce enough latency to make connection failures take > 10ms.
> The correct behavior would use the disconnection event to mark the end of the 
> last connection attempt and then wait for the backoff period to elapse after 
> that.
> See 
> http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
>  for the original description of the problem.
> This is related to KAFKA-1843 because leastLoadedNode currently will 
> consistently choose the same node if this blackout period is not handled 
> correctly, but is a much smaller issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to