[
https://issues.apache.org/jira/browse/KAFKA-15402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036036#comment-18036036
]
Kirk True commented on KAFKA-15402:
-----------------------------------
This appears to be a case of head-of-line blocking for {{FETCH}} request
processing on the broker.
I added code in the consumer ({{AbstractFetch.createFetchRequest()}}) that
forces the max wait value to 0 in the outgoing {{FETCH}} request when the fetch
session epoch is set to {{FINAL_EPOCH}}.
However, that change doesn't fix the issue if there's already an in-flight
{{FETCH}} request with a higher max wait. The in-flight {{FETCH}} request with
the longer max wait blocks the {{FETCH}} request with the shorter max wait from
being processed until the longer one completes.
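As a rough illustration of the client-side change described above, here is a minimal sketch of the max wait override. It is not the actual patch: the class name and the {{effectiveMaxWaitMs}} helper are hypothetical, and the value of {{FINAL_EPOCH}} (-1) is assumed here rather than taken from this comment.

```java
final class FinalEpochMaxWait {
    // Assumption: sentinel epoch that marks the request closing the fetch session.
    static final int FINAL_EPOCH = -1;

    // Hypothetical helper: choose the max wait to place in the outgoing FETCH request.
    // A session-closing request has no reason to wait for data, so it uses 0.
    static int effectiveMaxWaitMs(int sessionEpoch, int configuredMaxWaitMs) {
        return sessionEpoch == FINAL_EPOCH ? 0 : configuredMaxWaitMs;
    }

    public static void main(String[] args) {
        // Ordinary fetch: keeps the configured fetch.max.wait.ms.
        System.out.println(effectiveMaxWaitMs(5, 500));
        // Session-closing fetch: max wait forced to 0.
        System.out.println(effectiveMaxWaitMs(FINAL_EPOCH, 500));
    }
}
```

Note that this only shortens the wait of the closing request itself. It has no effect on a request that is already in flight with the configured max wait.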
Here's an example timeline of events for an integration test that uses the
consumer with a {{fetch.max.wait.ms}} value of 500 ms (the default):
# Time 123: The test produces N records
# Time 234: The test reads the records in a {{Consumer.poll()}} loop, sending
{{FETCH}} requests 1-118
# Time 379: The test confirms that all N records were consumed and exits the
loop
# Time 380: The test invokes {{Consumer.close()}} (this form uses a default
close timeout of 30 seconds)
# Time 381: The broker starts processing {{FETCH}} request 118 (with a 500 ms
wait)
# Time 437: As part of its closing process, the consumer attempts to close the
broker's fetch session cache entry by sending {{FETCH}} request 119 with the
max wait forced to 0 ms (I changed this on my branch)
# Time 879: Roughly 500 ms after it was sent to the broker, the consumer
receives the response for {{FETCH}} request 118
# Time 880: The broker starts processing {{FETCH}} request 119 (with a 0 ms
wait)
# Time 902: The consumer receives the {{FETCH}} response for 119
# Time 915: {{Consumer.close()}} returns to the test, having taken
approximately 535 ms to execute
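The blocking in the timeline above can be reproduced with a small arithmetic model. This is only a sketch under stated assumptions: the broker handles the {{FETCH}} requests of one connection strictly in order, a request that finds no new data is held for its full max wait, and network and processing overhead are ignored. The class and method names are hypothetical.

```java
final class FetchHolSimulation {
    // Returns the completion time (ms) of each request, given arrival times and max waits.
    // Assumption: requests are served one at a time in arrival order, and each one
    // occupies the connection for its full max wait because no new data arrives.
    static long[] completionTimes(long[] arrivalMs, long[] maxWaitMs) {
        long[] done = new long[arrivalMs.length];
        long freeAt = 0; // time at which the previous request has completed
        for (int i = 0; i < arrivalMs.length; i++) {
            long start = Math.max(arrivalMs[i], freeAt); // wait behind the earlier request
            done[i] = start + maxWaitMs[i];
            freeAt = done[i];
        }
        return done;
    }

    public static void main(String[] args) {
        // Request 118 starts at 381 with a 500 ms wait; request 119 arrives at 437 with 0 ms.
        long[] done = completionTimes(new long[] {381, 437}, new long[] {500, 0});
        System.out.println("FETCH 118 done at " + done[0]);
        System.out.println("FETCH 119 done at " + done[1]);
        // The same request 119 with nothing queued ahead of it.
        long[] alone = completionTimes(new long[] {437}, new long[] {0});
        System.out.println("FETCH 119 alone done at " + alone[0]);
    }
}
```

In this model request 119 cannot complete before 381 + 500 = 881, close to the 879/880 observed above, even though its own max wait is 0. On its own it would complete at 437, so nearly all of the 535 ms spent in {{Consumer.close()}} is time spent waiting behind request 118.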
> Performance regression on close consumer after upgrading to 3.5.0
> -----------------------------------------------------------------
>
> Key: KAFKA-15402
> URL: https://issues.apache.org/jira/browse/KAFKA-15402
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Affects Versions: 3.5.0, 3.5.1, 3.6.0
> Reporter: Benoit Delbosc
> Assignee: Kirk True
> Priority: Major
> Fix For: 4.2.0
>
> Attachments: image-2023-08-24-18-51-21-720.png,
> image-2023-08-24-18-51-57-435.png, image-2023-08-25-10-50-28-079.png
>
>
> Hi,
> After upgrading to Kafka client version 3.5.0, we have observed a significant
> increase in the duration of our Java unit tests. These unit tests heavily
> rely on the Kafka Admin, Producer, and Consumer API.
> When using Kafka server version 3.4.1, the duration of the unit tests
> increased from 8 seconds (with Kafka client 3.4.1) to 18 seconds (with Kafka
> client 3.5.0).
> Upgrading the Kafka server to 3.5.1 shows similar results.
> I have come across the issue KAFKA-15178, which could be the culprit. I will
> attempt to test the proposed patch.
> In the meantime, if you have any ideas that could help identify and address
> the regression, please let me know.