[
https://issues.apache.org/jira/browse/KAFKA-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569204#comment-17569204
]
Chris Egerton edited comment on KAFKA-14089 at 7/20/22 11:41 PM:
-----------------------------------------------------------------
Thanks [~mimaison]. We don't assert on the order of records, just that the
expected seqnos are present in any order, so the wonkiness around 65535 isn't
actually an issue (and that same interleaving even shows up in the stringified
representation of both the expected _and_ the actual seqno sets).
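Roughly, the check boils down to order-insensitive set equality per task; the
sketch below is illustrative only (a hypothetical helper, not the actual test
code).
{code:java}
import java.util.Collection;
import java.util.HashSet;

import static org.junit.Assert.assertEquals;

// Hypothetical helper (not the actual test code): compare seqnos as sets, so
// interleavings like "65535, 65537, 65536" are not treated as failures.
static void assertSameSeqnos(Collection<Long> expectedSeqnos, Collection<Long> actualSeqnos) {
    assertEquals(new HashSet<>(expectedSeqnos), new HashSet<>(actualSeqnos));
}
{code}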
After doing some Bash scrubbing on the file attached to the ticket, it looks
like seqnos start to be missing (i.e., they're in the expected set but not the
actual) between 114463 and 114754. Not every seqno in that range is missing,
but there are 105 missing in total. After that, starting at 114755, there are
105 extra (i.e., in the actual set but not the expected) seqnos.
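For reference, the comparison amounts to the two set differences below
(sketched here in Java rather than the Bash I actually used; the names are
made up for illustration):
{code:java}
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the expected-vs-actual comparison (the attached file was actually
// scrubbed with Bash; names here are invented).
static void diffSeqnos(Collection<Long> expectedSeqnos, Collection<Long> actualSeqnos) {
    Set<Long> missing = new TreeSet<>(expectedSeqnos);
    missing.removeAll(actualSeqnos);   // in expected but not actual (105 seqnos between 114463 and 114754)

    Set<Long> extra = new TreeSet<>(actualSeqnos);
    extra.removeAll(expectedSeqnos);   // in actual but not expected (105 seqnos starting at 114755)

    System.out.println("Missing: " + missing);
    System.out.println("Extra:   " + extra);
}
{code}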
Given that the issues crop up at the very end of the seqno set, it seems like
this could be caused by non-graceful shutdown of the worker after exactly-once
support is disabled, or even possibly the recently-discovered KAFKA-14079.
-It's a little worrisome, though, since the results here indicate possible data
loss.- Actually, on second thought, this is probably not data loss, since we're
reading the records that have been produced to Kafka, but not necessarily the
records whose offsets have been committed.
If this was on Jenkins, do you have a link to the CI run that caused it? Or if
it was encountered elsewhere, do you have any logs available? I'll try to kick
off some local runs, but I'm in the middle of stress-testing my laptop with the
latest KIP-618 system tests and may not be able to reproduce locally.
I suspect a fix for this would involve reading the last-committed offset for
each task, then only checking seqnos for that task up to the seqno in that
offset. But I'd like to have a better idea of what exactly is causing the
failure before pulling the trigger on that, especially if the root cause is
unclean task/worker shutdown and we can find a way to fix that instead of
adjusting our tests to handle sloppy shutdowns.
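Just to make the shape of that change concrete, something along these lines
(names and offset keys are invented, not actual Connect APIs):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed adjustment: derive, per task, the highest
// seqno whose offset was actually committed, and only verify seqnos up to that
// ceiling.
static Map<String, Long> committedSeqnoCeilings(Map<String, Map<String, Object>> committedOffsets) {
    // committedOffsets: task id -> last committed source offset for that task; the
    // offset map is assumed (for this sketch) to carry a "seqno" entry.
    Map<String, Long> ceilings = new HashMap<>();
    committedOffsets.forEach((taskId, offset) ->
            ceilings.put(taskId, ((Number) offset.get("seqno")).longValue()));
    return ceilings;
}

// Skip verification of any seqno past the committed ceiling for its task, since
// records beyond the last committed offset may legitimately be duplicated or
// missing after an unclean shutdown.
static boolean shouldVerify(String taskId, long seqno, Map<String, Long> ceilings) {
    return seqno <= ceilings.getOrDefault(taskId, Long.MAX_VALUE);
}
{code}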
> Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic
> ---------------------------------------------------------------
>
> Key: KAFKA-14089
> URL: https://issues.apache.org/jira/browse/KAFKA-14089
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.3.0
> Reporter: Mickael Maison
> Assignee: Chris Egerton
> Priority: Major
> Attachments: failure.txt
>
>
> It looks like the sequence got broken around "65535, 65537, 65536, 65539,
> 65538, 65541, 65540, 65543"