[ 
https://issues.apache.org/jira/browse/KAFKA-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918451#comment-13918451
 ] 

Jay Kreps commented on KAFKA-1286:
----------------------------------

If you are saying that we repeat the metadata request to the same (down) node 
more than once, that is a bug. I think the problem is in the 
selectMetadataDestination(Cluster cluster) method. This method attempts to be 
smart about where to direct metadata requests: specifically, it tries to prefer 
nodes for which we have an existing connection or for which a connection is in 
the process of being established. I suspect that logic fails to round-robin 
to another node when connection establishment fails.
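
To make the intended behavior concrete, here is a minimal sketch of a selector that prefers connected nodes, then connecting nodes, and otherwise falls back to round-robin so that a failed connection attempt is not retried against the same node right away. This is not the actual selectMetadataDestination code; the class name MetadataNodeSelector, the State enum, and the use of plain integer node ids are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the Kafka producer's actual implementation. */
final class MetadataNodeSelector {
    enum State { CONNECTED, CONNECTING, DISCONNECTED }

    // Connection state per node id; unknown nodes are treated as DISCONNECTED.
    private final Map<Integer, State> states = new HashMap<>();
    // Round-robin position; advanced on every call so fallbacks rotate.
    private int cursor = 0;

    void setState(int nodeId, State state) {
        states.put(nodeId, state);
    }

    /** Picks the node id that should receive the next metadata request. */
    int select(List<Integer> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes to choose from");
        }
        int size = nodes.size();
        int start = cursor % size;
        cursor = (start + 1) % size;

        // Prefer an existing connection, then one that is being established.
        for (State wanted : new State[] {State.CONNECTED, State.CONNECTING}) {
            for (int i = 0; i < size; i++) {
                int id = nodes.get((start + i) % size);
                if (states.getOrDefault(id, State.DISCONNECTED) == wanted) {
                    return id;
                }
            }
        }
        // Nothing usable: take the next node in rotation, so a node whose
        // connection attempt just failed is not picked again immediately.
        return nodes.get(start);
    }
}
```

With three disconnected nodes, successive calls return each node in turn, which is the rotation that appears to be missing when connection establishment fails.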

Let me know if you want to take a look or else I can...

> Retry Can Block 
> ----------------
>
>                 Key: KAFKA-1286
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1286
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: producer 
>            Reporter: Guozhang Wang
>
> Under the following scenario the retry logic can block:
> 1. The last broker's socket is closed, sender.handleDisconnect() is triggered, 
> and the node is marked as disconnected.
> 2. In the next sender.run(), since the node is disconnected, the partition is 
> removed from the ready set and sender.initConnection() is called, which does 
> not throw an exception.
> 3. So in this round of sends, the only request it tries to send is the 
> metadata request, to the last broker; the sender will first try to 
> connect to that broker.
> 4. In selector.poll(), the finishConnect() call throws an exception, and in 
> handleDisconnects() the inFlight request's batches will be null since it is a 
> metadata request.
> 5. Now we go back to step 1 and loop forever. Note that this infinite loop 
> can be triggered even without calling producer.close.
> Also, we need to introduce a retry backoff config; otherwise the retries 
> will be exhausted too soon (in my tests 10 retries can be exhausted in about 
> 600ms).
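
As a rough illustration of the retry backoff mentioned in the description, the sketch below gates attempts per node so that a retry is only allowed once a configured interval has passed since the previous attempt. The class name RetryBackoff, its methods, and the 100 ms value in the usage note are hypothetical; the issue does not specify how the config would be named or implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-node retry backoff gate. */
final class RetryBackoff {
    private final long backoffMs;
    // Timestamp (ms) of the most recent attempt per node id.
    private final Map<Integer, Long> lastAttemptMs = new HashMap<>();

    RetryBackoff(long backoffMs) {
        if (backoffMs < 0) {
            throw new IllegalArgumentException("backoffMs must be >= 0");
        }
        this.backoffMs = backoffMs;
    }

    /** True if the node was never tried or the backoff interval has elapsed. */
    boolean canAttempt(int nodeId, long nowMs) {
        return remainingMs(nodeId, nowMs) == 0;
    }

    /** Records that an attempt against the node started at nowMs. */
    void recordAttempt(int nodeId, long nowMs) {
        lastAttemptMs.put(nodeId, nowMs);
    }

    /** Milliseconds left before the node may be tried again (0 if ready). */
    long remainingMs(int nodeId, long nowMs) {
        Long last = lastAttemptMs.get(nodeId);
        if (last == null) {
            return 0;
        }
        return Math.max(0, backoffMs - (nowMs - last));
    }
}
```

The caller passes in the current time, which keeps the gate easy to test. With an assumed backoff of 100 ms, ten retries against one node would involve at least 10 x 100 ms = 1 s of waiting, instead of being exhausted in roughly 600 ms as reported above.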



--
This message was sent by Atlassian JIRA
(v6.2#6252)
