[ 
https://issues.apache.org/jira/browse/KAFKA-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600166#comment-17600166
 ] 

Philip Nee commented on KAFKA-14196:
------------------------------------

If I understand this correctly: it seems this was introduced in 
https://issues.apache.org/jira/browse/KAFKA-14024, which originated from 
https://issues.apache.org/jira/browse/KAFKA-13310.  I think the cause of the 
flakiness/duplication is that the consumer busy-waits for the prior async 
commit to complete (in order to finish the rebalance) while still fetching 
new data.  After the async commit finishes, the partition gets revoked, the 
fetch progress is lost, and that eventually causes duplicate consumption.

A few comments:
 # Do we want to continue fetching while waiting for the async commit to 
complete? I believe this is the expectation of the new poll API.
 # If we don't want to block the consumer from fetching, then we will need to 
keep committing asynchronously.  I see this could be problematic, as the 
consumer could get stuck in the poll loop, busy catching up on committing the 
fetched data, and never complete the rebalance process.  (A sketch of the 
workaround noted in the issue description is shown after this list.)
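
For reference, a minimal sketch of the workaround mentioned in note #2 of the 
issue description: committing synchronously from onPartitionsRevoked via a 
ConsumerRebalanceListener, so the consumed position is flushed before the 
partition is handed off. The bootstrap server, group id, and topic name below 
are placeholders and not taken from the failing test setup.

{code:java}
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommitOnRevokeExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection/group settings, not from the failing test.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-validation-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test_topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Flush the consumed position synchronously before the partitions
                // are given up, so the next owner resumes from the last record the
                // application actually saw rather than the last async-committed offset.
                consumer.commitSync();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Nothing to do on assignment for this sketch.
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Process the record; offsets are auto-committed asynchronously,
                // plus synchronously on revocation via the listener above.
            }
        }
    }
}
{code}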

> Flaky OffsetValidationTest seems to indicate potential duplication issue 
> during rebalance
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-14196
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14196
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 3.2.1
>            Reporter: Philip Nee
>            Assignee: Philip Nee
>            Priority: Major
>
> Several flaky tests under OffsetValidationTest indicate a potential consumer 
> duplication issue when auto-commit is enabled.  The failure message is shown 
> below:
>  
> {code:java}
> Total consumed records 3366 did not match consumed position 3331 {code}
>  
> After investigating the log, I discovered that for those failing tests the 
> data consumed between the start of a rebalance event and the async commit was 
> lost.  In the example below, the rebalance kicks in at around 1662054846995 
> (first record), and the async commit of offset 3739 completes at around 
> 1662054847015 (right before partitions_revoked).
>  
> {code:java}
> {"timestamp":1662054846995,"name":"records_consumed","count":3,"partitions":[{"topic":"test_topic","partition":0,"count":3,"minOffset":3739,"maxOffset":3741}]}
> {"timestamp":1662054846998,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3742,"maxOffset":3743}]}
> {"timestamp":1662054847008,"name":"records_consumed","count":2,"partitions":[{"topic":"test_topic","partition":0,"count":2,"minOffset":3744,"maxOffset":3745}]}
> {"timestamp":1662054847016,"name":"partitions_revoked","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847031,"name":"partitions_assigned","partitions":[{"topic":"test_topic","partition":0}]}
> {"timestamp":1662054847038,"name":"records_consumed","count":23,"partitions":[{"topic":"test_topic","partition":0,"count":23,"minOffset":3739,"maxOffset":3761}]}
>  {code}
> A few things to note here:
>  # This is highly flaky; roughly 1 in 4 runs fails the test.
>  # Manually calling commitSync in the onPartitionsRevoked callback seems to 
> alleviate the issue.
>  # Setting includeMetadataInTimeout to false also seems to alleviate the 
> issue.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
