[ 
https://issues.apache.org/jira/browse/KAFKA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685353#comment-17685353
 ] 

Kirk True commented on KAFKA-14317:
-----------------------------------

[~dajac] could you take another look at the PR? Thanks!

> ProduceRequest timeouts are logged as network exceptions
> --------------------------------------------------------
>
>                 Key: KAFKA-14317
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14317
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, logging, producer 
>    Affects Versions: 3.3.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In {{{}NetworkClient.handleTimedOutRequests{}}}, we disconnect the broker 
> connection:
> {code:java}
> private void handleTimedOutRequests(List<ClientResponse> responses, long now) 
> {
>     List<String> nodeIds = 
> this.inFlightRequests.nodesWithTimedOutRequests(now);
>     for (String nodeId : nodeIds) {
>         // close connection to the node
>         this.selector.close(nodeId);
>         log.info("Disconnecting from node {} due to request timeout.", 
> nodeId);
>         processDisconnection(responses, nodeId, now, 
> ChannelState.LOCAL_CLOSE);
>     }
> }
> {code}
> This calls {{processDisconnection}} which calls 
> {{{}cancelInFlightRequests{}}}:
> {code:java}
> for (InFlightRequest request : inFlightRequests) {
>     if (log.isDebugEnabled()) {
>         log.debug("Cancelled in-flight {} request with correlation id {} due 
> to node {} being disconnected " +
>                 "(elapsed time since creation: {}ms, elapsed time since send: 
> {}ms, request timeout: {}ms): {}",
>             request.header.apiKey(), request.header.correlationId(), nodeId,
>             request.timeElapsedSinceCreateMs(now), 
> request.timeElapsedSinceSendMs(now),
>             request.requestTimeoutMs, request.request);
>     } else {
>         log.info("Cancelled in-flight {} request with correlation id {} due 
> to node {} being disconnected " +
>                 "(elapsed time since creation: {}ms, elapsed time since send: 
> {}ms, request timeout: {}ms)",
>             request.header.apiKey(), request.header.correlationId(), nodeId,
>             request.timeElapsedSinceCreateMs(now), 
> request.timeElapsedSinceSendMs(now),
>             request.requestTimeoutMs);
>     }
>     if (!request.isInternalRequest) {
>         if (responses != null)
>             responses.add(request.disconnected(now, null));
>     } else if (request.header.apiKey() == ApiKeys.METADATA) {
>         metadataUpdater.handleFailedRequest(now, Optional.empty());
>     }
> }
> {code}
> We create a new {{ClientResponse}} in which the {{disconnected}} flag is set.
> We then complete the record batch In {{Sender.handleProduceResponse}} with:
> {code:java}
> if (response.wasDisconnected()) {
>     log.trace("Cancelled request with header {} due to node {} being 
> disconnected",
>     requestHeader, response.destination());
>     for (ProducerBatch batch : batches.values())
>         completeBatch(batch, new 
> ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION, 
> String.format("Disconnected from node %s", response.destination())),
>     correlationId, now);
> }
> {code}
> This seems like it could be confusing for customers that they would see 
> network exceptions on a request timeout instead of a timeout error.
> One implication of completing the batch with a network exception is that the 
> producer will try to refresh metadata after a request timeout. I can see 
> arguments for why this is necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to