[ 
https://issues.apache.org/jira/browse/KAFKA-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569204#comment-17569204
 ] 

Chris Egerton edited comment on KAFKA-14089 at 7/20/22 11:28 PM:
-----------------------------------------------------------------

Thanks [~mimaison]. We don't assert on order of records, just that the expected 
seqnos were present in any order, so the wonkiness around 65535 isn't actually 
an issue (and it's even present in the stringified representation of both the 
expected _and_ the actual seqno sets).

After doing some Bash scrubbing on the file attached to the ticket, it looks 
like seqnos start to be missing (i.e., they're in the expected set but not the 
actual) between 114463 and 114754. Not every seqno in that range is missing, 
but there are 105 missing in total. After that, starting at 114755, there are
105 extra seqnos (i.e., in the actual set but not the expected).
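
For anyone who wants to repeat the comparison, here is a minimal Python sketch of the same missing/extra check. It is not the Bash I actually ran, and the `diff_seqnos` helper and the toy data are hypothetical; parsing the two sets out of failure.txt is left out because the file format isn't shown here.

```python
def diff_seqnos(expected, actual):
    """Return (missing, extra) seqnos as sorted lists.

    missing: in the expected set but not the actual set
    extra:   in the actual set but not the expected set
    Order of the inputs is ignored, matching the order-insensitive assertion.
    """
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)
    extra = sorted(actual_set - expected_set)
    return missing, extra


# Toy data (made up): the actual records arrive slightly out of order,
# drop two seqnos near the end, and carry two seqnos past the expected range.
expected = range(0, 10)
actual = [1, 0, 3, 2, 4, 5, 6, 8, 10, 11]

missing, extra = diff_seqnos(expected, actual)
print("missing:", missing)
print("extra:", extra)
```

Because both inputs are converted to sets, out-of-order records like the ones around 65535 do not show up as differences; only seqnos that are truly absent or unexpected do.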

Given that the issues crop up at the very end of the seqno set, it seems like 
this could be caused by non-graceful shutdown of the worker after exactly-once 
support is disabled, or even possibly the recently-discovered KAFKA-14079. It's 
a little worrisome, though, since the results here indicate possible data loss.

If this was on Jenkins, do you have a link to the CI run that caused it? Or if 
it was encountered elsewhere, do you have any logs available? I'll try to kick 
off some local runs but I'm in the middle of stress-testing my laptop with the 
latest KIP-618 system tests and may not be able to reproduce locally.


> Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic
> ---------------------------------------------------------------
>
>                 Key: KAFKA-14089
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14089
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.3.0
>            Reporter: Mickael Maison
>            Assignee: Chris Egerton
>            Priority: Major
>         Attachments: failure.txt
>
>
> It looks like the sequence got broken around "65535, 65537, 65536, 65539, 
> 65538, 65541, 65540, 65543"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
