[ 
https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456854#comment-17456854
 ] 

David Mao edited comment on KAFKA-13388 at 12/10/21, 2:10 AM:
--------------------------------------------------------------

[~dhofftgt] 

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we 
see:

 
{code:java}
if (discoverBrokerVersions) {
 this.connectionStates.checkingApiVersions(node); 
 nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
 {code}
which is a separate queue for nodes needing to send the api versions request.

Then in 

 
{code:java}
private void handleInitiateApiVersionRequests(long now) {
    Iterator<Map.Entry<String, ApiVersionsRequest.Builder>> iter = 
nodesNeedingApiVersionsFetch.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, ApiVersionsRequest.Builder> entry = iter.next();
        String node = entry.getKey();
        if (selector.isChannelReady(node) && 
inFlightRequests.canSendMore(node)) {
            log.debug("Initiating API versions fetch from node {}.", node);
            ApiVersionsRequest.Builder apiVersionRequestBuilder = 
entry.getValue();
            ClientRequest clientRequest = newClientRequest(node, 
apiVersionRequestBuilder, now, true);
            doSend(clientRequest, true, now);
            iter.remove();
        }{code}
we only send out the api versions request if the channel is ready (TLS 
handshake complete, SASL handshake complete).

This is actually a pretty insidious bug because I think we end up in a state 
where we do not apply any request timeout to the channel if there is some delay 
in completing any of the handshaking/authentication steps, since the inflight 
requests are empty.


was (Author: david.mao):
[~dhofftgt] 

Looking at where the NetworkClient enters the CHECKING_API_VERSIONS state, we 
see:

 
{code:java}
if (discoverBrokerVersions) {
 this.connectionStates.checkingApiVersions(node); 
 nodesNeedingApiVersionsFetch.put(node, new ApiVersionsRequest.Builder());
 {code}
which is a separate queue for nodes needing to send the api versions request.

Then in 

 
{code:java}
private void handleInitiateApiVersionRequests(long now) {
    Iterator<Map.Entry<String, ApiVersionsRequest.Builder>> iter = 
nodesNeedingApiVersionsFetch.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<String, ApiVersionsRequest.Builder> entry = iter.next();
        String node = entry.getKey();
        if (selector.isChannelReady(node) && 
inFlightRequests.canSendMore(node)) {
            log.debug("Initiating API versions fetch from node {}.", node);
            ApiVersionsRequest.Builder apiVersionRequestBuilder = 
entry.getValue();
            ClientRequest clientRequest = newClientRequest(node, 
apiVersionRequestBuilder, now, true);
            doSend(clientRequest, true, now);
            iter.remove();
        }{code}
we only send out the api versions request if the channel is ready (TLS 
handshake complete, SASL handshake complete).

This is actually a pretty insidious bug because I think we end up in a state 
where we do not apply any request timeout to the channel if there is some 
problem completing any of the handshaking/authentication steps, since the 
inflight requests are empty.

> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---------------------------------------------------
>
>                 Key: KAFKA-13388
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13388
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: David Hoffman
>            Priority: Minor
>         Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png, 
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for 
> xxx-17:120002 ms has passed since batch creation
> {code}
>  I would have assumed a request timout or connection timeout should have also 
> been logged. I could not find any other associated errors. 
> I added some instrumenting to my app and have traced this down to broker 
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no 
> effective timeout for Kafka Producer broker connections in 
> CHECKING_API_VERSIONS state.-
> In the code see the after the NetworkClient connects to a broker node it 
> makes a request to check api versions, when it receives the response it marks 
> the node as ready. -I am seeing that sometimes a reply is not received for 
> the check api versions request the connection just hangs in 
> CHECKING_API_VERSIONS state until it is disposed I assume after the idle 
> connection timeout.-
> Update: not actually sure what causes the connection to get stuck in 
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should be still in play for this, 
> but it is not.- 
>  -There is a connectingNodes set that is consulted when checking timeouts and 
> the node is removed- 
>  -when ClusterConnectionStates.checkingApiVersions(String id) is called to 
> transition the node into CHECKING_API_VERSIONS-



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to