[ 
https://issues.apache.org/jira/browse/KAFKA-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789898#comment-17789898
 ] 

A. Sophie Blee-Goldman commented on KAFKA-15798:
------------------------------------------------

Took a quick look at this in the name of running down some of the worst flaky 
tests in Streams. I think it's pretty clear that this is failing because of the 
state updater thread (see below), but it's not as clear to me whether this 
hints at a real bug with the state updater thread or whether it only broke the 
named topologies feature.

If it's the latter, we should probably block people from using named topologies 
when the state updater thread is enabled in 3.7. Although I'm actually leaning 
towards going a step further and just taking out named topologies 
altogether – we can just remove the "public" API classes for now, as extracting 
all the internal logic is somewhat of a bigger project that we shouldn't rush.

Of course, this is all assuming there is something about the state updater that 
broke named topologies – someone more familiar with the state updater should 
definitely verify that this isn't a real bug in normal Streams first! cc 
[~cadonna] [~lucasb] 

Oh, and this is how I know the state updater thread is responsible: if you look 
at [the graph of failure rates for this 
test|https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.container=org.apache.kafka.streams.integration.NamedTopologyIntegrationTest&tests.sortField=FLAKY&tests.test=shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()],
 you'll see it goes from literally zero flakiness to the 2nd most commonly 
failing test in all of Streams on Oct 4th. This is the day we turned on the 
state updater thread by default.

(It's also a bit concerning that we didn't catch this sooner. The uptick in 
failure rate of this test is actually quite sudden. Would be great if we could 
somehow manually alert on this sort of thing.)
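
To make the alerting idea a bit more concrete, here is a rough sketch of the kind of check I have in mind: flag the first day on which a test's failure rate jumps well above its recent baseline. Everything in it is an assumption for illustration only. The class name FlakinessSpikeDetector, the window and threshold parameters, and the sample rates are all made up, and nothing here is wired up to ge.apache.org or to any existing Kafka tooling.

```java
import java.util.OptionalInt;

/** Hypothetical helper: finds a sudden uptick in a test's daily failure rate. */
public final class FlakinessSpikeDetector {
    private FlakinessSpikeDetector() {}

    /**
     * Returns the index of the first day whose failure rate exceeds the mean of
     * the preceding `window` days by more than `threshold`, or empty if no day does.
     * Rates are fractions in [0, 1]; window and threshold are illustrative knobs.
     */
    public static OptionalInt firstSpike(double[] dailyFailureRates, int window, double threshold) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        for (int day = window; day < dailyFailureRates.length; day++) {
            // Baseline: average failure rate over the previous `window` days.
            double sum = 0.0;
            for (int i = day - window; i < day; i++) {
                sum += dailyFailureRates[i];
            }
            double baseline = sum / window;
            if (dailyFailureRates[day] - baseline > threshold) {
                return OptionalInt.of(day);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Made-up data: a test that never fails, then starts failing regularly.
        double[] rates = {0.0, 0.0, 0.0, 0.0, 0.25, 0.30, 0.28};
        OptionalInt spike = firstSpike(rates, 3, 0.10);
        System.out.println(spike.isPresent()
                ? "spike at day index " + spike.getAsInt()
                : "no spike");
    }
}
```

A check like this would only be a starting point. The window and threshold would need tuning against real failure data before it could page anyone.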

> Flaky Test 
> NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-15798
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15798
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams, unit tests
>            Reporter: Justine Olshan
>            Priority: Major
>              Labels: flaky-test
>
> I saw a few examples recently. 2 have the same error, but the third is 
> different
> [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14629/22/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology___2/]
> [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_21_and_Scala_2_13___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/]
>  
> The failure is like
> {code:java}
> java.lang.AssertionError: Did not receive all 5 records from topic 
> output-stream-1 within 60000 ms, currently accumulated data is [] Expected: 
> is a value equal to or greater than <5> but: <0> was less than <5>{code}
> The other failure was
> [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/]
> {code:java}
> java.lang.AssertionError: Expected: <[0, 1]> but: was <[0]>{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)