[
https://issues.apache.org/jira/browse/KAFKA-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569204#comment-17569204
]
Chris Egerton edited comment on KAFKA-14089 at 7/20/22 11:41 PM:
-----------------------------------------------------------------
Thanks [~mimaison]. We don't assert on the order of records, just that the
expected seqnos are present in any order, so the wonkiness around 65535 isn't
actually an issue (and that same interleaving even shows up in the stringified
representation of both the expected _and_ the actual seqno sets).
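Roughly, the check boils down to order-insensitive set equality per task; the
sketch below is illustrative only (a hypothetical helper, not the actual test
code).
{code:java}
import java.util.Collection;
import java.util.HashSet;

import static org.junit.Assert.assertEquals;

// Hypothetical helper (not the actual test code): compare seqnos as sets, so
// interleavings like "65535, 65537, 65536" are not treated as failures.
static void assertSameSeqnos(Collection<Long> expectedSeqnos, Collection<Long> actualSeqnos) {
    assertEquals(new HashSet<>(expectedSeqnos), new HashSet<>(actualSeqnos));
}
{code}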
After doing some Bash scrubbing on the file attached to the ticket, it looks
like seqnos start to be missing (i.e., they're in the expected set but not the
actual) between 114463 and 114754. Not every seqno in that range is missing,
but there are 105 missing in total. After that, starting at 114755, there are
105 extra (i.e., in the actual set but not the expected) seqnos.
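For reference, the comparison amounts to the two set differences below
(sketched here in Java rather than the Bash I actually used; the names are
made up for illustration):
{code:java}
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the expected-vs-actual comparison (the attached file was actually
// scrubbed with Bash; names here are invented).
static void diffSeqnos(Collection<Long> expectedSeqnos, Collection<Long> actualSeqnos) {
    Set<Long> missing = new TreeSet<>(expectedSeqnos);
    missing.removeAll(actualSeqnos);   // in expected but not actual (105 seqnos between 114463 and 114754)

    Set<Long> extra = new TreeSet<>(actualSeqnos);
    extra.removeAll(expectedSeqnos);   // in actual but not expected (105 seqnos starting at 114755)

    System.out.println("Missing: " + missing);
    System.out.println("Extra:   " + extra);
}
{code}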
Given that the issues crop up at the very end of the seqno set, it seems like
this could be caused by non-graceful shutdown of the worker after exactly-once
support is disabled, or even possibly the recently-discovered KAFKA-14079.
-It's a little worrisome, though, since the results here indicate possible data
loss.- Actually, on second thought, this is probably not data loss, since we're
reading the records that have been produced to Kafka, but not necessarily the
records whose offsets have been committed.
If this was on Jenkins, do you have a link to the CI run that caused it? Or if
it was encountered elsewhere, do you have any logs available? I'll try to kick
off some local runs, but I'm in the middle of stress-testing my laptop with the
latest KIP-618 system tests and may not be able to reproduce locally.
I suspect a fix for this would involve reading the last-committed offset for
each task, then only checking seqnos for that task up to the seqno in that
offset. But I'd like to have a better idea of what exactly is causing the
failure before pulling the trigger on that, especially if the root cause is
unclean task/worker shutdown and we can find a way to fix that instead of
adjusting our tests to handle sloppy shutdowns.
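Just to make the shape of that change concrete, something along these lines
(names and offset keys are invented, not actual Connect APIs):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed adjustment: derive, per task, the highest
// seqno whose offset was actually committed, and only verify seqnos up to that
// ceiling.
static Map<String, Long> committedSeqnoCeilings(Map<String, Map<String, Object>> committedOffsets) {
    // committedOffsets: task id -> last committed source offset for that task; the
    // offset map is assumed (for this sketch) to carry a "seqno" entry.
    Map<String, Long> ceilings = new HashMap<>();
    committedOffsets.forEach((taskId, offset) ->
            ceilings.put(taskId, ((Number) offset.get("seqno")).longValue()));
    return ceilings;
}

// Skip verification of any seqno past the committed ceiling for its task, since
// records beyond the last committed offset may legitimately be duplicated or
// missing after an unclean shutdown.
static boolean shouldVerify(String taskId, long seqno, Map<String, Long> ceilings) {
    return seqno <= ceilings.getOrDefault(taskId, Long.MAX_VALUE);
}
{code}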
> Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic
> ---------------------------------------------------------------
>
> Key: KAFKA-14089
> URL: https://issues.apache.org/jira/browse/KAFKA-14089
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.3.0
> Reporter: Mickael Maison
> Assignee: Chris Egerton
> Priority: Major
> Attachments: failure.txt
>
>
> It looks like the sequence got broken around "65535, 65537, 65536, 65539,
> 65538, 65541, 65540, 65543"