[ https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip Nee reassigned KAFKA-14196:
----------------------------------
Assignee: Philip Nee
> Flaky OffsetValidationTest seems to indicate potential duplication issue during rebalance
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-14196
> URL: https://issues.apache.org/jira/browse/KAFKA-14196
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Affects Versions: 3.2.1
> Reporter: Philip Nee
> Assignee: Philip Nee
> Priority: Major
>
> Several flaky tests in OffsetValidationTest indicate a potential consumer
> duplication issue when auto-commit is enabled. The failure message is shown
> below:
>
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>
> After investigating the log, I found that in the failing tests the progress on
> records consumed between the start of a rebalance event and the async commit
> was lost, so those records were consumed again after reassignment. In the
> example below, the rebalance starts at around 1662054846995 (first record),
> and the async commit of offset 3739 completes at around 1662054847015 (right
> before partitions_revoked). After the partition is reassigned, consumption
> resumes at offset 3739, so offsets 3739 to 3745 are consumed twice.
>
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
> {code}
> A few things to note here:
> # This is highly flaky: about 1 in 4 runs fails the tests.
> # Manually calling commitSync in the onPartitionsRevoked callback seems to
> alleviate the issue.
> # Setting includeMetadataInTimeout to false also seems to alleviate the
> issue.
>
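The replay pattern in the log above can be illustrated with a small offset model. This is a hypothetical sketch in plain Java, not Kafka client code: the class RebalanceReplayModel and its replayedOffsets method are invented for illustration. The sketch assumes that the new owner of a partition resumes from the last committed offset, so every offset in [committed, position) is delivered again. The inputs come from the log: committed offset 3739 and a position of 3746 at revocation (one past maxOffset 3745). The second case models the assumed effect of calling KafkaConsumer.commitSync() from ConsumerRebalanceListener.onPartitionsRevoked, which lets the committed offset catch up to the position before the handover.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model (not Kafka client code) of one partition across a rebalance.
 * Assumption: after revocation, consumption resumes from the last committed offset,
 * so anything consumed at or past that offset is delivered again.
 */
public class RebalanceReplayModel {

    /** Returns the offsets that would be re-delivered after reassignment. */
    public static List<Long> replayedOffsets(long committedOffset, long positionAtRevoke) {
        List<Long> replayed = new ArrayList<>();
        // The committed offset is the next offset to read, so [committed, position) is re-read.
        for (long offset = committedOffset; offset < positionAtRevoke; offset++) {
            replayed.add(offset);
        }
        return replayed;
    }

    public static void main(String[] args) {
        // From the log: the async commit covers offset 3739, but records up to
        // 3745 were already consumed (position 3746) when the partition was revoked.
        long committed = 3739L;
        long position = 3746L;

        List<Long> asyncOnly = replayedOffsets(committed, position);
        System.out.println("replayed with last async commit: " + asyncOnly.size() + " " + asyncOnly);

        // Assumed effect of commitSync() in onPartitionsRevoked: the committed
        // offset equals the position at revocation, so nothing is replayed.
        List<Long> syncOnRevoke = replayedOffsets(position, position);
        System.out.println("replayed with commitSync on revoke: " + syncOnRevoke.size());
    }
}
```

Running main prints 7 replayed offsets (3739 through 3745) for the async-commit case and 0 when the commit catches up on revocation. The model deliberately leaves out the timing race between the async commit and the rebalance, which is the part this issue is investigating.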
--
This message was sent by Atlassian Jira
(v8.20.10#820010)