[ 
https://issues.apache.org/jira/browse/KAFKA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021760#comment-18021760
 ] 

Sanskar Jhajharia commented on KAFKA-19332:
-------------------------------------------

After investigating the flakiness reported in this ticket, I reviewed the 
historical CI data and attempted to reproduce the issue locally. The Develocity 
report [1] shows that although the test has been flaky over the last ninety 
days, failures are very rare. On average, this test has failed {*}only once per 
month{*}, which is negligible given the overall CI volume. In practice, this 
means the flakiness has no real impact on the contributor workflow or CI 
reliability.

I also tried to reproduce the failure in {*}local environments{*}. Across 
reruns of {*}500 to 1000 iterations, the test passed every time{*}. This 
strongly suggests that the flakiness is tied to environmental factors within 
CI, such as timing or scheduling artifacts, rather than to a deterministic 
issue in the test logic or the underlying Kafka code. In other words, the 
conditions that trigger the failure appear specific to CI and could not be 
reproduced in a controlled local setting.
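
For anyone who wants to repeat the local check, the rerun loop can be sketched roughly as follows. This is a minimal illustration, not the tooling used for the runs described above: the class name RerunHarness and the method countFailures are hypothetical, and the placeholder body would need to be replaced with a call into the test under investigation.

```java
class RerunHarness {

    /**
     * Runs the given body the requested number of times and returns how many
     * runs threw. Throwable is caught because JUnit-style assertion failures
     * are AssertionError, which is an Error rather than an Exception.
     */
    public static int countFailures(int iterations, Runnable body) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                body.run();
            } catch (Throwable t) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int iterations = 1000;
        // Placeholder body (assumption): substitute the test invocation here.
        int failures = countFailures(iterations, () -> { });
        System.out.println("failures: " + failures + " / " + iterations);
    }
}
```

A count of zero from such a loop only shows that the failure did not occur under local conditions; it does not rule out a timing-dependent failure in CI.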

It is also worth noting that the failure does not occur during the core logic 
of the test [2] [3]. The intent of the test is to validate behaviour when 
switching read isolation levels, and in particular how the consumer handles 
transactional markers. However, in none of the recorded failures does the test 
reach this portion of the logic. Instead, the flake occurs earlier, when the 
consumer attempts to process the abort transaction marker offset for Message 4. 
Since this happens before the test exercises its main verification path, the 
failure points to an environmental timing artifact rather than a flaw in the 
functional behaviour being tested.

Given the rarity of the issue, the inability to reproduce it locally, and the 
fact that the flake occurs before the main test logic is even invoked, there is 
no evidence of a defect in Kafka itself or in the functionality that the test 
covers. The impact is minimal, and the test remains valuable as a safeguard for 
transactional isolation-level behaviour. On this basis, there is no actionable 
change to be made at this time, and this ticket is being closed.

We will continue to monitor CI to ensure that the failure rate does not 
increase. Should the flakiness become more frequent, or should it become 
reproducible in a local environment, the issue can be revisited with a more 
targeted investigation and fix. At present, however, the evidence indicates 
that the flakiness is rare, environment-dependent, and not disruptive enough to 
justify further intervention.

[1]: [https://develocity.apache.org/scans/tests?search.names=CI%20workflow,Git%20repository&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=github,trunk&search.tasks=test&search.timeZoneId=Asia%2FCalcutta&search.values=CI,https:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.clients.consumer.ShareConsumerTest&tests.sortField=FLAKY]

[2]: [https://github.com/apache/kafka/blob/trunk/clients/clients-integration-tests/src/test/java/org/apache/kafka/clients/consumer/ShareConsumerTest.java#L2681-L2684]

[3]: [https://github.com/apache/kafka/blob/trunk/clients/clients-integration-tests/src/test/java/org/apache/kafka/clients/consumer/ShareConsumerTest.java#L2579-L2587]

> Fix flaky test : 
> testAlterReadCommittedToReadUncommittedIsolationLevelWithReleaseAck and 
> testAlterReadCommittedToReadUncommittedIsolationLevelWithRejectAck
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19332
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19332
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Shivsundar R
>            Assignee: Sanskar Jhajharia
>            Priority: Major
>
> The test has been flaky in AK builds - 
> [https://develocity.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=Asia%2FCalcutta&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.clients.consumer.ShareConsumerTest&tests.test=testAlterReadCommittedToReadUncommittedIsolationLevelWithReleaseAck()%5B2%5D]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
