[jira] [Resolved] (KAFKA-16103) Review client logic for triggering offset commit callbacks

2024-04-22 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-16103.

Resolution: Fixed

> Review client logic for triggering offset commit callbacks
> --
>
> Key: KAFKA-16103
> URL: https://issues.apache.org/jira/browse/KAFKA-16103
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lucas Brutschy
>Priority: Critical
>  Labels: kip-848-client-support, offset
> Fix For: 3.8.0
>
>
> Review logic for triggering commit callbacks, ensuring that all callbacks are 
> triggered before returning from commitSync
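>
> A minimal sketch of the guarantee under review, assuming an already constructed consumer (illustrative only, not the client internals):
> {code:java}
> static void commitAndAwait(KafkaConsumer<String, String> consumer) {
>     // The callback registered here must have been invoked by the time the
>     // subsequent commitSync() returns.
>     consumer.commitAsync((offsets, exception) ->
>         System.out.println("async commit completed: " + offsets + ", error: " + exception));
>     consumer.commitSync(); // all pending commit callbacks have fired at this point
> }
> {code}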



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16599) Always await async commit callbacks in commitSync and close

2024-04-22 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16599:
--

 Summary: Always await async commit callbacks in commitSync and 
close
 Key: KAFKA-16599
 URL: https://issues.apache.org/jira/browse/KAFKA-16599
 Project: Kafka
  Issue Type: Task
Reporter: Lucas Brutschy






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16008) Fix PlaintextConsumerTest.testMaxPollIntervalMs

2024-02-26 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-16008.

Resolution: Duplicate

> Fix PlaintextConsumerTest.testMaxPollIntervalMs
> ---
>
> Key: KAFKA-16008
> URL: https://issues.apache.org/jira/browse/KAFKA-16008
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: Kirk True
>Assignee: Lucas Brutschy
>Priority: Critical
>  Labels: consumer-threading-refactor, integration-tests, timeout
> Fix For: 3.8.0
>
>
> The integration test {{PlaintextConsumerTest.testMaxPollIntervalMs}} is 
> failing when using the {{AsyncKafkaConsumer}}.
> The error is:
> {code}
> org.opentest4j.AssertionFailedError: Timed out before expected rebalance 
> completed
>     at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
> at org.junit.jupiter.api.Assertions.fail(Assertions.java:134)
> at 
> kafka.api.AbstractConsumerTest.awaitRebalance(AbstractConsumerTest.scala:317)
> at 
> kafka.api.PlaintextConsumerTest.testMaxPollIntervalMs(PlaintextConsumerTest.scala:194)
> {code}
> The logs include this line:
>  
> {code}
> [2023-12-13 15:11:16,134] WARN [Consumer clientId=ConsumerTestConsumer, 
> groupId=my-test] consumer poll timeout has expired. This means the time 
> between subsequent calls to poll() was longer than the configured 
> max.poll.interval.ms, which typically implies that the poll loop is spending 
> too much time processing messages. You can address this either by increasing 
> max.poll.interval.ms or by reducing the maximum size of batches returned in 
> poll() with max.poll.records. 
> (org.apache.kafka.clients.consumer.internals.HeartbeatRequestManager:188)
> {code} 
> I don't know if that's related or not.
>  
>  
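> For reference, the two configuration knobs the warning refers to look like this (values are illustrative only):
> {code:java}
> Properties props = new Properties();
> props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000); // allow more time between poll() calls
> props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);         // or hand back smaller batches per poll()
> {code}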



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16010) Fix PlaintextConsumerTest.testMultiConsumerSessionTimeoutOnStopPolling

2024-02-21 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-16010.

Resolution: Duplicate

> Fix PlaintextConsumerTest.testMultiConsumerSessionTimeoutOnStopPolling
> --
>
> Key: KAFKA-16010
> URL: https://issues.apache.org/jira/browse/KAFKA-16010
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: Kirk True
>Assignee: Lucas Brutschy
>Priority: Critical
>  Labels: consumer-threading-refactor, integration-tests, timeout
> Fix For: 3.8.0
>
>
> The integration test 
> {{PlaintextConsumerTest.testMultiConsumerSessionTimeoutOnStopPolling}} is 
> failing when using the {{AsyncKafkaConsumer}}.
> The error is:
> {code}
> org.opentest4j.AssertionFailedError: Did not get valid assignment for 
> partitions [topic1-2, topic1-4, topic-1, topic-0, topic1-5, topic1-1, 
> topic1-0, topic1-3] after one consumer left
>   at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38)
>   at org.junit.jupiter.api.Assertions.fail(Assertions.java:134)
>   at 
> kafka.api.AbstractConsumerTest.validateGroupAssignment(AbstractConsumerTest.scala:286)
>   at 
> kafka.api.PlaintextConsumerTest.runMultiConsumerSessionTimeoutTest(PlaintextConsumerTest.scala:1883)
>   at 
> kafka.api.PlaintextConsumerTest.testMultiConsumerSessionTimeoutOnStopPolling(PlaintextConsumerTest.scala:1281)
> {code}
> The logs include these lines:
>  
> {code}
> [2023-12-13 15:26:40,736] WARN [Consumer clientId=ConsumerTestConsumer, 
> groupId=my-test] consumer poll timeout has expired. This means the time 
> between subsequent calls to poll() was longer than the configured 
> max.poll.interval.ms, which typically implies that the poll loop is spending 
> too much time processing messages. You can address this either by increasing 
> max.poll.interval.ms or by reducing the maximum size of batches returned in 
> poll() with max.poll.records. 
> (org.apache.kafka.clients.consumer.internals.HeartbeatRequestManager:188)
> [2023-12-13 15:26:40,736] WARN [Consumer clientId=ConsumerTestConsumer, 
> groupId=my-test] consumer poll timeout has expired. This means the time 
> between subsequent calls to poll() was longer than the configured 
> max.poll.interval.ms, which typically implies that the poll loop is spending 
> too much time processing messages. You can address this either by increasing 
> max.poll.interval.ms or by reducing the maximum size of batches returned in 
> poll() with max.poll.records. 
> (org.apache.kafka.clients.consumer.internals.HeartbeatRequestManager:188)
> [2023-12-13 15:26:40,736] WARN [Consumer clientId=ConsumerTestConsumer, 
> groupId=my-test] consumer poll timeout has expired. This means the time 
> between subsequent calls to poll() was longer than the configured 
> max.poll.interval.ms, which typically implies that the poll loop is spending 
> too much time processing messages. You can address this either by increasing 
> max.poll.interval.ms or by reducing the maximum size of batches returned in 
> poll() with max.poll.records. 
> (org.apache.kafka.clients.consumer.internals.HeartbeatRequestManager:188)
> {code} 
> I don't know if that's related or not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16290) Investigate propagating subscription state updates via queues

2024-02-21 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16290:
--

 Summary: Investigate propagating subscription state updates via 
queues
 Key: KAFKA-16290
 URL: https://issues.apache.org/jira/browse/KAFKA-16290
 Project: Kafka
  Issue Type: Task
  Components: clients, consumer
Reporter: Lucas Brutschy


We are mostly using the queues for interaction between the application thread and 
the background thread, but the subscription object is shared between the threads 
and is updated directly without going through the queues. 

Allowing updates to the subscription state from both threads is definitely not 
right and will bring trouble. assign() is probably the most obvious example: we 
send an event to the background thread to commit, but then update the 
subscription in the foreground right away.

It seems sensible to aim for having all updates to the subscription state happen 
in the background thread, triggered from the app thread via events (I think we 
already have related events for all updates; it's just that the subscription 
state is still also updated in the app thread).
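
A rough sketch of that direction (all names here are illustrative, not the actual client internals):
{code:java}
// Application thread: assign() only enqueues an event instead of mutating the
// shared subscription state directly.
BlockingQueue<Set<TopicPartition>> assignEvents = new LinkedBlockingQueue<>();
assignEvents.add(Set.of(new TopicPartition("topic", 0)));

// Background thread: the single writer of the subscription state.
Set<TopicPartition> requested = assignEvents.poll();
if (requested != null) {
    subscriptionState.assignFromUser(requested); // hypothetical single point of mutation
}
{code}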




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16284) Performance regression in RocksDB

2024-02-20 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16284:
--

 Summary: Performance regression in RocksDB
 Key: KAFKA-16284
 URL: https://issues.apache.org/jira/browse/KAFKA-16284
 Project: Kafka
  Issue Type: Task
  Components: streams
Reporter: Lucas Brutschy


In benchmarks, we are noticing a performance regression in `RocksDBStore`.

The regression happens between those two commits:
 
{code:java}
trunk - 70c8b8d0af - regressed - 2024-01-06T14:00:20Z
trunk - d5aa341a18 - not regressed - 2023-12-31T11:47:14Z
{code}

The regression can be reproduced by the following test:
 
{code:java}
package org.apache.kafka.streams.state.internals;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.StateStoreContext;
import org.apache.kafka.test.InternalMockProcessorContext;
import org.apache.kafka.test.MockRocksDbConfigSetter;
import org.apache.kafka.test.StreamsTestUtils;
import org.apache.kafka.test.TestUtils;
import org.junit.Before;
import org.junit.Test;

import java.io.File;
import java.nio.ByteBuffer;
import java.util.Properties;

public class RocksDBStorePerfTest {

    InternalMockProcessorContext context;
    RocksDBStore rocksDBStore;
    final static String DB_NAME = "db-name";
    final static String METRICS_SCOPE = "metrics-scope";

    RocksDBStore getRocksDBStore() {
        return new RocksDBStore(DB_NAME, METRICS_SCOPE);
    }

    @Before
    public void setUp() {
        final Properties props = StreamsTestUtils.getStreamsConfig();
        props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, MockRocksDbConfigSetter.class);
        File dir = TestUtils.tempDirectory();
        context = new InternalMockProcessorContext<>(
            dir,
            Serdes.String(),
            Serdes.String(),
            new StreamsConfig(props)
        );
    }

    @Test
    public void testPerf() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            System.out.println("Iteration: " + i + " Time: " + (System.currentTimeMillis() - start));
            // Open a fresh store, write 100 small records, and close it again.
            RocksDBStore rocksDBStore = getRocksDBStore();
            rocksDBStore.init((StateStoreContext) context, rocksDBStore);
            for (int j = 0; j < 100; j++) {
                rocksDBStore.put(new Bytes(ByteBuffer.allocate(4).putInt(j).array()), "perf".getBytes());
            }
            rocksDBStore.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Time: " + (end - start));
    }
}
{code}
 
I have isolated the regression to commit 
[5bc3aa4|https://github.com/apache/kafka/commit/5bc3aa428067dff1f2b9075ff5d1351fb05d4b10].
 On my machine, the test takes ~8 seconds before 
[5bc3aa4|https://github.com/apache/kafka/commit/5bc3aa428067dff1f2b9075ff5d1351fb05d4b10]
 and ~30 seconds after 
[5bc3aa4|https://github.com/apache/kafka/commit/5bc3aa428067dff1f2b9075ff5d1351fb05d4b10].




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (KAFKA-16167) Fix PlaintextConsumerTest.testAutoCommitOnCloseAfterWakeup

2024-02-19 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy reopened KAFKA-16167:


> Fix PlaintextConsumerTest.testAutoCommitOnCloseAfterWakeup
> --
>
> Key: KAFKA-16167
> URL: https://issues.apache.org/jira/browse/KAFKA-16167
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Critical
>  Labels: consumer-threading-refactor, integration-tests, 
> kip-848-client-support
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16248) Kafka consumer should cache leader offset ranges

2024-02-13 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16248:
--

 Summary: Kafka consumer should cache leader offset ranges
 Key: KAFKA-16248
 URL: https://issues.apache.org/jira/browse/KAFKA-16248
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


We noticed a Streams application that received an OFFSET_OUT_OF_RANGE error 
following a network partition and a Streams task rebalance, and that subsequently 
reset its offsets to the beginning.

Inspecting the logs, we saw multiple consumer log messages like: 
{code:java}
Setting offset for partition tp to the committed offset 
FetchPosition{offset=1234, offsetEpoch=Optional.empty...)
{code}
Inspecting the Streams code, it looks like Kafka Streams calls `commitSync` 
passing an explicit OffsetAndMetadata object, but does not populate the offset 
leader epoch.
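
A sketch of the difference, using the consumer's public API (assuming an existing consumer; offset and epoch values are arbitrary):
{code:java}
TopicPartition tp = new TopicPartition("topic", 0);

// What Streams effectively does today: no leader epoch attached to the commit.
consumer.commitSync(Map.of(tp, new OffsetAndMetadata(50L)));

// What a fully coherent commit would carry: the epoch the records were fetched with.
consumer.commitSync(Map.of(tp, new OffsetAndMetadata(50L, Optional.of(8), "")));
{code}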

The offset leader epoch is required in the offset commit to ensure that all 
consumers in the consumer group have coherent metadata before fetching. 
Otherwise after a consumer group rebalance, a consumer may fetch with a stale 
leader epoch with respect to the committed offset and get an offset out of 
range error from a zombie partition leader.

An example of where this can cause issues:
1. We have a consumer group with consumer 1 and consumer 2. Partition P is 
assigned to consumer 1 which has up-to-date metadata for P. Consumer 2 has 
stale metadata for P.
2. Consumer 1 fetches partition P with offset 50, epoch 8, and commits offset 50 
without an epoch.
3. The consumer group rebalances and P is now assigned to consumer 2. Consumer 
2 has a stale leader epoch for P (let's say leader epoch 7). Consumer 2 will 
now try to fetch with leader epoch 7, offset 50. If we have a zombie leader due 
to a network partition, the zombie leader may accept consumer 2's fetch leader 
epoch and return an OFFSET_OUT_OF_RANGE to consumer 2.

If in step 1, consumer 1 committed the leader epoch for the message, then when 
consumer 2 receives assignment P it would force a metadata refresh to discover 
a sufficiently new leader epoch for the committed offset.

Kafka Streams cannot fully determine the leader epoch of the offsets it wants 
to commit - in EOS mode, streams commits the offset after the last control 
records (to avoid always having a lag of >0), but the leader epoch of the 
control record is not known to streams (since only non-control records are 
returned from Consumer.poll).

A fix discussed with [~hachikuji] is to have the consumer cache leader epoch 
ranges, similar to how the broker maintains a leader epoch cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16220) KTableKTableForeignKeyInnerJoinCustomPartitionerIntegrationTest#shouldThrowIllegalArgumentExceptionWhenCustomPartionerReturnsMultiplePartitions is flaky

2024-02-12 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-16220.

Resolution: Fixed

> KTableKTableForeignKeyInnerJoinCustomPartitionerIntegrationTest#shouldThrowIllegalArgumentExceptionWhenCustomPartionerReturnsMultiplePartitions
>  is flaky
> 
>
> Key: KAFKA-16220
> URL: https://issues.apache.org/jira/browse/KAFKA-16220
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Lucas Brutschy
>Assignee: Lucas Brutschy
>Priority: Major
>  Labels: flaky, flaky-test
>
> This test has seen significant flakiness
>  
> https://ge.apache.org/s/fac7lploprvuu/tests/task/:streams:test/details/org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinCustomPartitionerIntegrationTest/shouldThrowIllegalArgumentExceptionWhenCustomPartitionerReturnsMultiplePartitions()?top-execution=1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14384) Flaky Test SelfJoinUpgradeIntegrationTest.shouldUpgradeWithTopologyOptimizationOff

2024-02-05 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14384.

Resolution: Fixed

> Flaky Test 
> SelfJoinUpgradeIntegrationTest.shouldUpgradeWithTopologyOptimizationOff
> --
>
> Key: KAFKA-14384
> URL: https://issues.apache.org/jira/browse/KAFKA-14384
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Critical
>  Labels: flaky-test
>
> h3. Stacktrace
> java.lang.AssertionError: Did not receive all 5 records from topic 
> selfjoin-outputSelfJoinUpgradeIntegrationTestshouldUpgradeWithTopologyOptimizationOff
>  within 6 ms Expected: is a value equal to or greater than <5> but: <0> 
> was less than <5> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueWithTimestampRecordsReceived$2(IntegrationTestUtils.java:763)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:382)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:350)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueWithTimestampRecordsReceived(IntegrationTestUtils.java:759)
>  at 
> org.apache.kafka.streams.integration.SelfJoinUpgradeIntegrationTest.processKeyValueAndVerifyCount(SelfJoinUpgradeIntegrationTest.java:244)
>  at 
> org.apache.kafka.streams.integration.SelfJoinUpgradeIntegrationTest.shouldUpgradeWithTopologyOptimizationOff(SelfJoinUpgradeIntegrationTest.java:155)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12835/4/testReport/org.apache.kafka.streams.integration/SelfJoinUpgradeIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldUpgradeWithTopologyOptimizationOff/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14385) Flaky Test QueryableStateIntegrationTest.shouldNotMakeStoreAvailableUntilAllStoresAvailable

2024-02-05 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14385.

Resolution: Fixed

> Flaky Test 
> QueryableStateIntegrationTest.shouldNotMakeStoreAvailableUntilAllStoresAvailable
> ---
>
> Key: KAFKA-14385
> URL: https://issues.apache.org/jira/browse/KAFKA-14385
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Critical
>  Labels: flaky-test
>
> Failed twice on the same build (Java 8 & 11)
> h3. Stacktrace
> java.lang.AssertionError: KafkaStreams did not transit to RUNNING state 
> within 15000 milli seconds. Expected:  but: was  at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at 
> org.apache.kafka.test.StreamsTestUtils.startKafkaStreamsAndWaitForRunningState(StreamsTestUtils.java:134)
>  at 
> org.apache.kafka.test.StreamsTestUtils.startKafkaStreamsAndWaitForRunningState(StreamsTestUtils.java:121)
>  at 
> org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldNotMakeStoreAvailableUntilAllStoresAvailable(QueryableStateIntegrationTest.java:1038)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12836/3/testReport/org.apache.kafka.streams.integration/QueryableStateIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldNotMakeStoreAvailableUntilAllStoresAvailable/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-8691) Flakey test ProcessorContextTest#shouldNotAllowToScheduleZeroMillisecondPunctuation

2024-02-05 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-8691.
---
Resolution: Fixed

> Flakey test  
> ProcessorContextTest#shouldNotAllowToScheduleZeroMillisecondPunctuation
> 
>
> Key: KAFKA-8691
> URL: https://issues.apache.org/jira/browse/KAFKA-8691
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Boyang Chen
>Priority: Critical
>
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/6384/consoleFull]
> org.apache.kafka.streams.processor.internals.ProcessorContextTest > 
> shouldNotAllowToScheduleZeroMillisecondPunctuation PASSED*23:37:09* ERROR: 
> Failed to write output for test null.Gradle Test Executor 5*23:37:09* 
> java.lang.NullPointerException: Cannot invoke method write() on null 
> object*23:37:09*at 
> org.codehaus.groovy.runtime.NullObject.invokeMethod(NullObject.java:91)*23:37:09*
> at 
> org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:47)*23:37:09*
>  at 
> org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)*23:37:09*
>   at 
> org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:34)*23:37:09*
>at 
> org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)*23:37:09*
>   at java_io_FileOutputStream$write.call(Unknown Source)*23:37:09*
> at 
> build_5nv3fyjgqff9aim9wbxfnad9z$_run_closure5$_closure75$_closure108.doCall(/home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.12/build.gradle:244)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-9897) Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores

2024-02-05 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-9897.
---
Resolution: Fixed

> Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores
> -
>
> Key: KAFKA-9897
> URL: https://issues.apache.org/jira/browse/KAFKA-9897
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.6.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> [https://builds.apache.org/job/kafka-pr-jdk14-scala2.13/22/testReport/junit/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/shouldQuerySpecificActivePartitionStores/]
> {quote}org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get 
> state store source-table because the stream thread is PARTITIONS_ASSIGNED, 
> not RUNNING at 
> org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:85)
>  at 
> org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:61)
>  at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1183) at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores(StoreQueryIntegrationTest.java:178){quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16220) KTableKTableForeignKeyInnerJoinCustomPartitionerIntegrationTest#shouldThrowIllegalArgumentExceptionWhenCustomPartionerReturnsMultiplePartitions flaky

2024-02-02 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16220:
--

 Summary: 
KTableKTableForeignKeyInnerJoinCustomPartitionerIntegrationTest#shouldThrowIllegalArgumentExceptionWhenCustomPartionerReturnsMultiplePartitions
 flaky
 Key: KAFKA-16220
 URL: https://issues.apache.org/jira/browse/KAFKA-16220
 Project: Kafka
  Issue Type: Bug
  Components: streams, unit tests
Reporter: Lucas Brutschy
Assignee: Lucas Brutschy


This test has seen significant flakiness

 

https://ge.apache.org/s/fac7lploprvuu/tests/task/:streams:test/details/org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinCustomPartitionerIntegrationTest/shouldThrowIllegalArgumentExceptionWhenCustomPartitionerReturnsMultiplePartitions()?top-execution=1



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16198) Reconciliation may lose partitions when topic metadata is delayed

2024-01-26 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16198:
--

 Summary: Reconciliation may lose partitions when topic metadata is 
delayed
 Key: KAFKA-16198
 URL: https://issues.apache.org/jira/browse/KAFKA-16198
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Reporter: Lucas Brutschy
Assignee: Lucas Brutschy
 Fix For: 3.8.0


The current reconciliation code in the `AsyncKafkaConsumer`'s `MembershipManager` 
may lose part of the server-provided assignment when metadata is delayed. The 
reason is incorrect handling of partially resolved topic names, as in this 
example:

 * We get assigned {{T1-1}} and {{T2-1}}
 * We reconcile {{T1-1}}; {{T2-1}} remains in {{assignmentUnresolved}} since the topic id {{T2}} is not known yet
 * We get new cluster metadata, which includes {{T2}}, so {{T2-1}} is moved to {{assignmentReadyToReconcile}}
 * We call {{reconcile}} -- {{T2-1}} is now treated as the full assignment, so {{T1-1}} is being revoked
 * We end up with assignment {{T2-1}}, which is inconsistent with the broker-side target assignment.

 

Generally, this seems to be a problem around semantics of the internal 
collections `assignmentUnresolved` and `assignmentReadyToReconcile`. Absence of 
a topic in `assignmentReadyToReconcile` may mean either revocation of the topic 
partition(s), or unavailability of a topic name for the topic.

Internal state with a simpler and correct invariant could be achieved by using a 
single collection, `currentTargetAssignment`, which is based on topic IDs and 
always corresponds to the latest assignment received from the broker. During 
every attempted reconciliation, all topic IDs would be resolved from the local 
cache, which should not introduce a lot of overhead. `assignmentUnresolved` and 
`assignmentReadyToReconcile` would be removed. 
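
A rough sketch of that invariant (collection and helper names are illustrative, not actual fields):
{code:java}
// The broker's latest target, keyed by topic ID; replaced wholesale on every heartbeat.
Map<Uuid, SortedSet<Integer>> currentTargetAssignment = latestAssignmentFromHeartbeat();

// On every reconcile attempt, re-resolve names from the local metadata cache.
Set<TopicPartition> resolved = new HashSet<>();
for (Map.Entry<Uuid, SortedSet<Integer>> entry : currentTargetAssignment.entrySet()) {
    metadataCache.topicName(entry.getKey())   // may be empty while metadata is still delayed
        .ifPresent(name -> entry.getValue().forEach(p -> resolved.add(new TopicPartition(name, p))));
}
// Reconcile only what is resolvable; unresolved topics simply stay in currentTargetAssignment.
{code}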



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16169) FencedException in commitAsync not propagated without callback

2024-01-19 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16169:
--

 Summary: FencedException in commitAsync not propagated without 
callback
 Key: KAFKA-16169
 URL: https://issues.apache.org/jira/browse/KAFKA-16169
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Reporter: Lucas Brutschy


The javadocs for {{commitAsync()}} (without a callback) say:
{{@throws org.apache.kafka.common.errors.FencedInstanceIdException if this consumer instance gets fenced by broker.}}
 
If no callback is passed into {{commitAsync()}}, no offset commit callback 
invocation is submitted. However, we only check for a 
{{FencedInstanceIdException}} when we execute a callback, so it seems that with 
{{commitAsync()}} we would not throw at all when the consumer gets fenced.
In any case, we need a unit test that verifies that the 
{{FencedInstanceIdException}} is thrown for each version of {{commitAsync()}}.
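
A sketch of the two call sites such a test would have to cover (illustrative only):
{code:java}
static void commitVariants(KafkaConsumer<String, String> consumer) {
    // Per the javadoc above, this overload should also surface a
    // FencedInstanceIdException when the instance is fenced, even though
    // there is no callback to deliver it to.
    consumer.commitAsync();

    // With a callback, the exception is currently only observable here.
    consumer.commitAsync((offsets, exception) -> {
        if (exception instanceof FencedInstanceIdException) {
            // consumer instance was fenced by the broker
        }
    });
}
{code}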
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16155) Investigate testAutoCommitIntercept

2024-01-17 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16155:
--

 Summary: Investigate testAutoCommitIntercept
 Key: KAFKA-16155
 URL: https://issues.apache.org/jira/browse/KAFKA-16155
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Reporter: Lucas Brutschy


Even with KAFKA-15942, the test PlaintextConsumerTest.testAutoCommitIntercept 
flakes on the initial setup (before using interceptors, so interceptors are 
unrelated here, except for being used later in the test).

The problem is that we are seeking two topic partitions to offset 10 and 20, 
respectively, but when we commit, we seem to have lost one of the offsets, 
likely due to a race condition. 
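
Boiled down, the failing part of the test does the following with its consumer (topic names as in the log below); the next auto-commit is then expected to contain both positions:
{code:java}
consumer.seek(new TopicPartition("topic", 0), 10L);
consumer.seek(new TopicPartition("topic", 1), 20L);
// expected committed offsets: {topic-0 -> 10, topic-1 -> 20}
{code}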

When I output `subscriptionState.allConsumed` repeatedly, I get this output:

 
{code:java}
allConsumed: {topic-0=OffsetAndMetadata{offset=100, leaderEpoch=0, metadata=''}, topic-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}
seeking topic-0 to FetchPosition{offset=10, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[localhost:58298 (id: 0 rack: null)], epoch=0}}
seeking topic-1 to FetchPosition{offset=20, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[localhost:58301 (id: 1 rack: null)], epoch=0}}
allConsumed: {topic-0=OffsetAndMetadata{offset=10, leaderEpoch=null, metadata=''}, topic-1=OffsetAndMetadata{offset=20, leaderEpoch=null, metadata=''}}
allConsumed: {topic-0=OffsetAndMetadata{offset=10, leaderEpoch=null, metadata=''}}
allConsumed: {topic-0=OffsetAndMetadata{offset=10, leaderEpoch=null, metadata=''}, topic-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}
autocommit start {topic-0=OffsetAndMetadata{offset=10, leaderEpoch=null, metadata=''}, topic-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}
{code}
So after we seek to 10 / 20, we lose one of the offsets, maybe because we 
haven't reconciled the assignment yet. Later, we get the second topic partition 
assigned, but its offset is initialized to 0.

We should investigate whether this can be made more like the behavior in the 
original consumer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16133) Commits during reconciliation always time out

2024-01-15 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16133:
--

 Summary: Commits during reconciliation always time out
 Key: KAFKA-16133
 URL: https://issues.apache.org/jira/browse/KAFKA-16133
 Project: Kafka
  Issue Type: Bug
  Components: clients, consumer
Affects Versions: 3.7.0
Reporter: Lucas Brutschy


This only affects the AsyncKafkaConsumer, which is in Preview in 3.7.

In MembershipManagerImpl there is confusion between timeouts and deadlines. 
[https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/MembershipManagerImpl.java#L836C38-L836C38]

This causes all autocommits during reconciliation to immediately time out.
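
Not the actual client code, just the shape of the mix-up: a relative timeout treated as an absolute deadline is always "in the past", so the request expires immediately.
{code:java}
long timeoutMs = 30_000L;                                   // a duration
long deadlineMs = System.currentTimeMillis() + timeoutMs;   // what should be compared against the clock
boolean buggy   = System.currentTimeMillis() >= timeoutMs;  // duration misread as a deadline: always true
boolean correct = System.currentTimeMillis() >= deadlineMs; // expires only once the timeout has elapsed
{code}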



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16097) State updater removes task without pending action in EOSv2

2024-01-10 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-16097.

Resolution: Fixed

> State updater removes task without pending action in EOSv2
> --
>
> Key: KAFKA-16097
> URL: https://issues.apache.org/jira/browse/KAFKA-16097
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.8.0
>Reporter: Lucas Brutschy
>Priority: Major
>
> A long-running soak encountered the following exception:
>  
> {code:java}
> [2024-01-08 03:06:00,586] ERROR [i-081c089d2ed054443-StreamThread-3] Thread 
> encountered an error processing soak test 
> (org.apache.kafka.streams.StreamsSoakTest)
> java.lang.IllegalStateException: Got a removed task 1_0 from the state 
> updater that is not for recycle, closing, or updating input partitions; this 
> should not happen
>     at 
> org.apache.kafka.streams.processor.internals.TaskManager.handleRemovedTasksFromStateUpdater(TaskManager.java:939)
>     at 
> org.apache.kafka.streams.processor.internals.TaskManager.checkStateUpdater(TaskManager.java:788)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.checkStateUpdater(StreamThread.java:1141)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:949)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:686)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:645)
> [2024-01-08 03:06:00,587] ERROR [i-081c089d2ed054443-StreamThread-3] 
> stream-client [i-081c089d2ed054443] Encountered the following exception 
> during processing and sent shutdown request for the entire application. 
> (org.apache.kafka.streams.KafkaStreams)
> org.apache.kafka.streams.errors.StreamsException: 
> java.lang.IllegalStateException: Got a removed task 1_0 from the state 
> updater that is not for recycle, closing, or updating input partitions; this 
> should not happen
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:729)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:645)
> Caused by: java.lang.IllegalStateException: Got a removed task 1_0 from the 
> state updater that is not for recycle, closing, or updating input partitions; 
> this should not happen
>     at 
> org.apache.kafka.streams.processor.internals.TaskManager.handleRemovedTasksFromStateUpdater(TaskManager.java:939)
>     at 
> org.apache.kafka.streams.processor.internals.TaskManager.checkStateUpdater(TaskManager.java:788)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.checkStateUpdater(StreamThread.java:1141)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:949)
>     at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:686)
>     ... 1 more{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15941) Flaky test: shouldRestoreNullRecord() – org.apache.kafka.streams.integration.RestoreIntegrationTest

2024-01-09 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15941.

Resolution: Cannot Reproduce

> Flaky test: shouldRestoreNullRecord() – 
> org.apache.kafka.streams.integration.RestoreIntegrationTest
> ---
>
> Key: KAFKA-15941
> URL: https://issues.apache.org/jira/browse/KAFKA-15941
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Apoorv Mittal
>Priority: Major
>  Labels: flaky-test
>
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14699/24/tests/
> {code:java}
> org.opentest4j.AssertionFailedError: Condition not met within timeout 6. 
> Did not receive all [KeyValue(2, \x00\x00\x00)] records from topic output 
> (got []) ==> expected:  but was: 
> Stacktraceorg.opentest4j.AssertionFailedError: Condition not met 
> within timeout 6. Did not receive all [KeyValue(2, \x00\x00\x00)] records 
> from topic output (got []) ==> expected:  but was:   at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>   at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)  at 
> org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210) at 
> org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331) 
>at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328) 
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312) at 
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilFinalKeyValueRecordsReceived(IntegrationTestUtils.java:878)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilFinalKeyValueRecordsReceived(IntegrationTestUtils.java:827)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilFinalKeyValueRecordsReceived(IntegrationTestUtils.java:790)
>  at 
> org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreNullRecord(RestoreIntegrationTest.java:244)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16098) State updater may attempt to resume a task that is not assigned anymore

2024-01-09 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16098:
--

 Summary: State updater may attempt to resume a task that is not 
assigned anymore
 Key: KAFKA-16098
 URL: https://issues.apache.org/jira/browse/KAFKA-16098
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Lucas Brutschy
 Attachments: streams.log.gz

A long-running soak test brought to light this `IllegalStateException`:
{code:java}
[2024-01-07 08:54:13,688] ERROR [i-0637ca8609f50425f-StreamThread-1] Thread 
encountered an error processing soak test 
(org.apache.kafka.streams.StreamsSoakTest)
java.lang.IllegalStateException: No current assignment for partition 
network-id-repartition-1
    at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
    at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.resume(SubscriptionState.java:753)
    at 
org.apache.kafka.clients.consumer.internals.LegacyKafkaConsumer.resume(LegacyKafkaConsumer.java:963)
    at 
org.apache.kafka.clients.consumer.KafkaConsumer.resume(KafkaConsumer.java:1524)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.transitRestoredTaskToRunning(TaskManager.java:857)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.handleRestoredTasksFromStateUpdater(TaskManager.java:979)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.checkStateUpdater(TaskManager.java:791)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.checkStateUpdater(StreamThread.java:1141)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:949)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:686)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:645)
[2024-01-07 08:54:13,688] ERROR [i-0637ca8609f50425f-StreamThread-1] 
stream-client [i-0637ca8609f50425f] Encountered the following exception during 
processing and sent shutdown request for the entire application. 
(org.apache.kafka.streams.KafkaStreams)
org.apache.kafka.streams.errors.StreamsException: 
java.lang.IllegalStateException: No current assignment for partition 
network-id-repartition-1
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:729)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:645)
Caused by: java.lang.IllegalStateException: No current assignment for partition 
network-id-repartition-1
    at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
    at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.resume(SubscriptionState.java:753)
    at 
org.apache.kafka.clients.consumer.internals.LegacyKafkaConsumer.resume(LegacyKafkaConsumer.java:963)
    at 
org.apache.kafka.clients.consumer.KafkaConsumer.resume(KafkaConsumer.java:1524)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.transitRestoredTaskToRunning(TaskManager.java:857)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.handleRestoredTasksFromStateUpdater(TaskManager.java:979)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.checkStateUpdater(TaskManager.java:791)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.checkStateUpdater(StreamThread.java:1141)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:949)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:686)
    ... 1 more {code}
Log (with some common messages filtered) attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16097) State updater removes task without pending action in EOSv2

2024-01-09 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16097:
--

 Summary: State updater removes task without pending action in EOSv2
 Key: KAFKA-16097
 URL: https://issues.apache.org/jira/browse/KAFKA-16097
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Lucas Brutschy


A long-running soak encountered the following exception:

 
{code:java}
[2024-01-08 03:06:00,586] ERROR [i-081c089d2ed054443-StreamThread-3] Thread 
encountered an error processing soak test 
(org.apache.kafka.streams.StreamsSoakTest)
java.lang.IllegalStateException: Got a removed task 1_0 from the state updater 
that is not for recycle, closing, or updating input partitions; this should not 
happen
    at 
org.apache.kafka.streams.processor.internals.TaskManager.handleRemovedTasksFromStateUpdater(TaskManager.java:939)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.checkStateUpdater(TaskManager.java:788)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.checkStateUpdater(StreamThread.java:1141)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:949)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:686)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:645)
[2024-01-08 03:06:00,587] ERROR [i-081c089d2ed054443-StreamThread-3] 
stream-client [i-081c089d2ed054443] Encountered the following exception during 
processing and sent shutdown request for the entire application. 
(org.apache.kafka.streams.KafkaStreams)
org.apache.kafka.streams.errors.StreamsException: 
java.lang.IllegalStateException: Got a removed task 1_0 from the state updater 
that is not for recycle, closing, or updating input partitions; this should not 
happen
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:729)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:645)
Caused by: java.lang.IllegalStateException: Got a removed task 1_0 from the 
state updater that is not for recycle, closing, or updating input partitions; 
this should not happen
    at 
org.apache.kafka.streams.processor.internals.TaskManager.handleRemovedTasksFromStateUpdater(TaskManager.java:939)
    at 
org.apache.kafka.streams.processor.internals.TaskManager.checkStateUpdater(TaskManager.java:788)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.checkStateUpdater(StreamThread.java:1141)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:949)
    at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:686)
    ... 1 more{code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16089) Kafka Streams still leaking memory in 3.7

2024-01-08 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16089:
--

 Summary: Kafka Streams still leaking memory in 3.7
 Key: KAFKA-16089
 URL: https://issues.apache.org/jira/browse/KAFKA-16089
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 3.7.0
Reporter: Lucas Brutschy
 Fix For: 3.7.0
 Attachments: graphviz (1).svg

In 
[https://github.com/apache/kafka/commit/58d6d2e5922df60cdc5c5c1bcd1c2d82dd2b71e2]
 a leak was fixed in the release candidate for 3.7.

 

However, Kafka Streams still seems to be leaking memory (just slower) after the 
fix.

 

Attached is the `jeprof` output right before a crash after ~11 hours.

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16086) Kafka Streams has RocksDB native memory leak

2024-01-05 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16086:
--

 Summary: Kafka Streams has RocksDB native memory leak
 Key: KAFKA-16086
 URL: https://issues.apache.org/jira/browse/KAFKA-16086
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Lucas Brutschy
 Attachments: image.png

The current 3.7 and trunk versions are leaking native memory while running 
Kafka Streams over several hours. This will likely kill any real workload over 
time, so this should be treated as a blocker bug for 3.7.

This was discovered in a long-running soak test. Attached is the memory 
consumption, which steadily approaches 100%, at which point the JVM is killed.

Rerunning the same test with jemalloc native memory profiling, we see these 
allocated objects after a few hours:
 
{noformat}
(jeprof) top
Total: 13283138973 B
10296829713 77.5% 77.5% 10296829713 77.5% rocksdb::port::cacheline_aligned_alloc
2487325671 18.7% 96.2% 2487325671 18.7% rocksdb::BlockFetcher::ReadBlockContents
150937547 1.1% 97.4% 150937547 1.1% rocksdb::lru_cache::LRUHandleTable::LRUHandleTable
119591613 0.9% 98.3% 119591613 0.9% prof_backtrace_impl
47331433 0.4% 98.6% 105040933 0.8% rocksdb::BlockBasedTable::PutDataBlockToCache
32516797 0.2% 98.9% 32516797 0.2% rocksdb::Arena::AllocateNewBlock
29796095 0.2% 99.1% 30451535 0.2% Java_org_rocksdb_Options_newOptions
18172716 0.1% 99.2% 20008397 0.2% rocksdb::InternalStats::InternalStats
16032145 0.1% 99.4% 16032145 0.1% rocksdb::ColumnFamilyDescriptorJni::construct
12454120 0.1% 99.5% 12454120 0.1% std::_Rb_tree::_M_insert_unique
{noformat}
 

 

The first hypothesis is that this is caused by the leaking `Options` object 
introduced in this line:

 

[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java#L312|https://github.com/apache/kafka/pull/14852]

 

Introduced in this PR: [https://github.com/apache/kafka/pull/14852]
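
A sketch of the suspected pattern, assuming the leak really is an unclosed {{org.rocksdb.Options}}: the object wraps native (C++) memory, so it has to be closed explicitly; garbage collection alone does not release it.
{code:java}
try (final org.rocksdb.Options options = new org.rocksdb.Options().setCreateIfMissing(true)) {
    // configure / open the store with these options
} // close() frees the native allocation; without it, resident memory grows over time
{code}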



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16077) State updater fails to close task when input partitions are updated

2024-01-03 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16077:
--

 Summary: State updater fails to close task when input partitions 
are updated
 Key: KAFKA-16077
 URL: https://issues.apache.org/jira/browse/KAFKA-16077
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 3.7.0
Reporter: Lucas Brutschy


There is a race condition in the state updater that can cause the following:
 # We have an active task in the state updater
 # We get fenced. We recreate the producer, transactions now uninitialized. We 
ask the state updater to give back the task, add a pending action to close the 
task clean once it’s handed back
 # We get a new assignment with updated input partitions. The task is still 
owned by the state updater, so we ask the state updater again to hand it back 
and add a pending action to update its input partitions
 # The task is handed back by the state updater. We update its input partitions 
but forget to close it clean (the pending action was overwritten, as sketched below)
 # Now the task is in an initialized state, but the underlying producer does 
not have transactions initialized
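
A minimal sketch of the overwrite (map and method names are made up, not the actual Streams internals):
{code:java}
Map<String, Runnable> pendingActionPerTask = new HashMap<>();
pendingActionPerTask.put("1_0", () -> closeTaskCleanly("1_0"));      // step 2: close clean after being fenced
pendingActionPerTask.put("1_0", () -> updateInputPartitions("1_0")); // step 3: silently replaces the close action
// Step 4: only the last action runs when the task is handed back, so the task
// is reused while its producer's transactions are still uninitialized.
{code}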

This can lead to an exception like this:
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log-org.apache.kafka.streams.errors.StreamsException:
 Exception caught in process. taskId=1_0, processor=KSTREAM-SOURCE-05, 
topic=node-name-repartition, partition=0, offset=618798, 
stacktrace=java.lang.IllegalStateException: TransactionalId 
stream-soak-test-d647640a-12e5-4e74-a0af-e105d0d0cb67-2: Invalid transition 
attempted from state UNINITIALIZED to state IN_TRANSACTION
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:999)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:985)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.clients.producer.internals.TransactionManager.beginTransaction(TransactionManager.java:311)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.clients.producer.KafkaProducer.beginTransaction(KafkaProducer.java:660)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.StreamsProducer.maybeBeginTransaction(StreamsProducer.java:240)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.StreamsProducer.send(StreamsProducer.java:258)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:253)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:175)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:85)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:291)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:270)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:229)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.kstream.internals.KStreamKTableJoinProcessor.doJoin(KStreamKTableJoinProcessor.java:130)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.kstream.internals.KStreamKTableJoinProcessor.process(KStreamKTableJoinProcessor.java:99)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:152)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:291)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:270)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:229)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 
org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
streams-soak-3-7-eos-v2_20240103131321/18.237.32.3/streams.log- at 

[jira] [Resolved] (KAFKA-15696) Revoke partitions on Consumer.close()

2023-12-19 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15696.

Resolution: Fixed

> Revoke partitions on Consumer.close()
> -
>
> Key: KAFKA-15696
> URL: https://issues.apache.org/jira/browse/KAFKA-15696
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Kirk True
>Assignee: Philip Nee
>Priority: Major
>  Labels: consumer-threading-refactor, kip-848-client-support, 
> kip-848-e2e, kip-848-preview
>
> Upon closing of the {{Consumer}} we need to revoke assignment. This involves 
> stop fetching, committing offsets if auto-commit enabled and invoking the 
> onPartitionsRevoked callback. There is a mechanism introduced in PR 
> [14406|https://github.com/apache/kafka/pull/14406] that allows for performing 
> network I/O on shutdown. The new method 
> {{ConsumerNetworkThread.runAtClose()}} will be executed when 
> {{Consumer.close()}} is invoked.
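>
> A sketch of the application-visible expectation, assuming auto-commit is enabled (illustrative, not the implementation):
> {code:java}
> consumer.subscribe(List.of("input-topic"), new ConsumerRebalanceListener() {
>     @Override
>     public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
>         // expected to be invoked as part of close(), after fetching stops
>         // and the final auto-commit has been sent
>     }
>     @Override
>     public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
> });
> consumer.close(); // revokes the assignment before returning
> {code}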



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15548) Send GroupConsumerHeartbeatRequest on Consumer.close()

2023-12-19 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15548.

Resolution: Fixed

> Send GroupConsumerHeartbeatRequest on Consumer.close()
> --
>
> Key: KAFKA-15548
> URL: https://issues.apache.org/jira/browse/KAFKA-15548
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: consumer-threading-refactor, kip-848-client-support, 
> kip-848-e2e, kip-848-preview
>
> Upon closing of the {{Consumer}} we need to send the last 
> GroupConsumerHeartbeatRequest with epoch = -1 to leave the group (or -2 if 
> static member). There is a mechanism introduced in PR 
> [14406|https://github.com/apache/kafka/pull/14406] that allows for performing 
> network I/O on shutdown. The new method 
> {{ConsumerNetworkThread.runAtClose()}} will be executed when 
> {{Consumer.close()}} is invoked.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16001) Migrate ConsumerNetworkThreadTestBuilder away from ConsumerTestBuilder

2023-12-12 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16001:
--

 Summary: Migrate ConsumerNetworkThreadTestBuilder away from 
ConsumerTestBuilder
 Key: KAFKA-16001
 URL: https://issues.apache.org/jira/browse/KAFKA-16001
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16000) Migrate MembershipManagerImpl away from ConsumerTestBuilder

2023-12-12 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-16000:
--

 Summary: Migrate MembershipManagerImpl away from 
ConsumerTestBuilder
 Key: KAFKA-16000
 URL: https://issues.apache.org/jira/browse/KAFKA-16000
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15999) Migrate HeartbeatRequestManagerTest away from ConsumerTestBuilder

2023-12-12 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15999:
--

 Summary: Migrate HeartbeatRequestManagerTest away from 
ConsumerTestBuilder
 Key: KAFKA-15999
 URL: https://issues.apache.org/jira/browse/KAFKA-15999
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15977) DelegationTokenEndToEndAuthorizationWithOwnerTest leaks threads

2023-12-06 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15977:
--

 Summary: DelegationTokenEndToEndAuthorizationWithOwnerTest leaks 
threads
 Key: KAFKA-15977
 URL: https://issues.apache.org/jira/browse/KAFKA-15977
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


[https://ci-builds.apache.org/blue/rest/organizations/jenkins/pipelines/Kafka/pipelines/kafka-pr/branches/PR-14878/runs/8/nodes/11/steps/90/log/?start=0]

 

I had an unrelated PR fail with the following thread leak:

 

```

Gradle Test Run :core:test > Gradle Test Executor 95 > 
DelegationTokenEndToEndAuthorizationWithOwnerTest > executionError STARTED 
kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.executionError 
failed, log available in 
/home/jenkins/workspace/Kafka_kafka-pr_PR-14878/core/build/reports/testOutput/kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.executionError.test.stdout
 Gradle Test Run :core:test > Gradle Test Executor 95 > 
DelegationTokenEndToEndAuthorizationWithOwnerTest > executionError FAILED 
org.opentest4j.AssertionFailedError: Found 1 unexpected threads during 
@AfterAll: `kafka-admin-client-thread | adminclient-483` ==> expected:  
but was: 

```

 

All the tests that follow in that run fail with initialization errors, because 
the admin client thread is never closed.
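
A minimal sketch (not the test's actual code) of the pattern that avoids leaking the admin client thread:
{code:java}
// Closing the Admin client deterministically shuts down its
// "kafka-admin-client-thread" before the next test starts.
try (Admin admin = Admin.create(adminProps)) {
    // create delegation tokens, describe/alter ACLs, ...
}
{code}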

 

This is preceded by the following test failure:

 

```

Gradle Test Run :core:test > Gradle Test Executor 95 > 
DelegationTokenEndToEndAuthorizationWithOwnerTest > 
testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl(String, boolean) > [1] 
quorum=kraft, isIdempotenceEnabled=true FAILED 
org.opentest4j.AssertionFailedError: expected acls: 
(principal=User:scram-user2, host=*, operation=CREATE_TOKENS, 
permissionType=ALLOW) (principal=User:scram-user2, host=*, 
operation=DESCRIBE_TOKENS, permissionType=ALLOW) but got: at 
app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:38) at 
app//org.junit.jupiter.api.Assertions.fail(Assertions.java:134) at 
app//kafka.utils.TestUtils$.waitAndVerifyAcls(TestUtils.scala:1142) at 
app//kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.$anonfun$configureSecurityAfterServersStart$1(DelegationTokenEndToEndAuthorizationWithOwnerTest.scala:71)
 at 
app//kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.$anonfun$configureSecurityAfterServersStart$1$adapted(DelegationTokenEndToEndAuthorizationWithOwnerTest.scala:70)
 at app//scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at 
app//scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) at 
app//scala.collection.AbstractIterable.foreach(Iterable.scala:933) at 
app//kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.configureSecurityAfterServersStart(DelegationTokenEndToEndAuthorizationWithOwnerTest.scala:70)

```

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15967) Fix revocation in reconciliation logic

2023-12-04 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15967:
--

 Summary: Fix revocation in reconciliation logic
 Key: KAFKA-15967
 URL: https://issues.apache.org/jira/browse/KAFKA-15967
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy


Looks like there is a problem in the reconciliation logic.

We get 6 partitions from a heartbeat and add them to 
{{assignmentReadyToReconcile}}. On the next heartbeat we get only 4 partitions (2 are 
revoked) and also add them to {{assignmentReadyToReconcile}}, but the 2 
partitions that were supposed to be removed from the assignment are never 
removed, because they are still in {{assignmentReadyToReconcile}}.

This was discovered during integration testing of 
[https://github.com/apache/kafka/pull/14878] - part of the test 
testRemoteAssignorRange was disabled and should be re-enabled once this is 
fixed.
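
For illustration, here is a minimal sketch (not the actual MembershipManagerImpl code; all names are made up) of deriving the revocation set from the latest target assignment instead of from an accumulating set:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustrative only: compute revocations/assignments as set differences against
// the *latest* target assignment, so partitions dropped by the broker are revoked
// regardless of what earlier heartbeats put into any accumulating set.
final class ReconciliationSketch {

    static <T> Set<T> partitionsToRevoke(Set<T> currentlyOwned, Set<T> latestTargetAssignment) {
        // revoke = owned \ target: anything we own that is no longer assigned
        Set<T> revoke = new HashSet<>(currentlyOwned);
        revoke.removeAll(latestTargetAssignment);
        return revoke;
    }

    static <T> Set<T> partitionsToAssign(Set<T> currentlyOwned, Set<T> latestTargetAssignment) {
        // assign = target \ owned: anything newly assigned that we do not own yet
        Set<T> assign = new HashSet<>(latestTargetAssignment);
        assign.removeAll(currentlyOwned);
        return assign;
    }
}
{code}

Computing the difference against the latest heartbeat's assignment makes the revocation independent of what previous heartbeats contributed.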



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15690) EosIntegrationTest is flaky.

2023-12-04 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15690.

Resolution: Fixed

> EosIntegrationTest is flaky.
> 
>
> Key: KAFKA-15690
> URL: https://issues.apache.org/jira/browse/KAFKA-15690
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Calvin Liu
>Assignee: Lucas Brutschy
>Priority: Major
>  Labels: flaky-test
>
> EosIntegrationTest
> shouldNotViolateEosIfOneTaskGetsFencedUsingIsolatedAppInstances[exactly_once_v2,
>  processing threads = false]
> {code:java}
> org.junit.runners.model.TestTimedOutException: test timed out after 600 
> seconds   at 
> org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleServerDisconnect(NetworkClient.java:)
> at 
> org.apache.kafka.clients.NetworkClient.processDisconnection(NetworkClient.java:821)
>   at 
> org.apache.kafka.clients.NetworkClient.processTimeoutDisconnection(NetworkClient.java:779)
>at 
> org.apache.kafka.clients.NetworkClient.handleTimedOutRequests(NetworkClient.java:837)
> org.apache.kafka.streams.errors.StreamsException: Exception caught in 
> process. taskId=0_1, processor=KSTREAM-SOURCE-00, 
> topic=multiPartitionInputTopic, partition=1, offset=15, 
> stacktrace=java.lang.RuntimeException: Detected we've been interrupted.   
> at 
> org.apache.kafka.streams.integration.EosIntegrationTest$1$1.transform(EosIntegrationTest.java:892)
>at 
> org.apache.kafka.streams.integration.EosIntegrationTest$1$1.transform(EosIntegrationTest.java:867)
>at 
> org.apache.kafka.streams.kstream.internals.TransformerSupplierAdapter$1.transform(TransformerSupplierAdapter.java:49)
> at 
> org.apache.kafka.streams.kstream.internals.TransformerSupplierAdapter$1.transform(TransformerSupplierAdapter.java:38)
> at 
> org.apache.kafka.streams.kstream.internals.KStreamFlatTransform$KStreamFlatTransformProcessor.process(KStreamFlatTransform.java:66)
>  {code}
>   shouldWriteLatestOffsetsToCheckpointOnShutdown[exactly_once_v2, processing 
> threads = false] 
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.UnknownServerException: The server experienced 
> an unexpected error when processing the request.   at 
> org.apache.kafka.streams.integration.utils.KafkaEmbedded.deleteTopic(KafkaEmbedded.java:204)
>  at 
> org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster.deleteTopicsAndWait(EmbeddedKafkaCluster.java:286)
>at 
> org.apache.kafka.streams.integration.utils.EmbeddedKafkaCluster.deleteTopicsAndWait(EmbeddedKafkaCluster.java:274)
>at 
> org.apache.kafka.streams.integration.EosIntegrationTest.createTopics(EosIntegrationTest.java:174)
> at 
> java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> org.apache.kafka.streams.errors.StreamsException: Exception caught in 
> process. taskId=0_1, processor=KSTREAM-SOURCE-00, 
> topic=multiPartitionInputTopic, partition=1, offset=15, 
> stacktrace=java.lang.RuntimeException: Detected we've been interrupted.   
> at 
> org.apache.kafka.streams.integration.EosIntegrationTest$1$1.transform(EosIntegrationTest.java:892)
>at 
> org.apache.kafka.streams.integration.EosIntegrationTest$1$1.transform(EosIntegrationTest.java:867)
>at 
> org.apache.kafka.streams.kstream.internals.TransformerSupplierAdapter$1.transform(TransformerSupplierAdapter.java:49)
> at 
> org.apache.kafka.streams.kstream.internals.TransformerSupplierAdapter$1.transform(TransformerSupplierAdapter.java:38)
> at 
> org.apache.kafka.streams.kstream.internals.KStreamFlatTransform$KStreamFlatTransformProcessor.process(KStreamFlatTransform.java:66)
>  {code}
> shouldWriteLatestOffsetsToCheckpointOnShutdown[exactly_once, processing 
> threads = false] 
> {code:java}
> org.opentest4j.AssertionFailedError: Condition not met within timeout 6. 
> StreamsTasks did not request commit. ==> expected:  but was: 
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)   
>  at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) 
> at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
> java.lang.IllegalStateException: Replica 
> [Topic=__transaction_state,Partition=2,Replica=1] should be in the 
> OfflineReplica,ReplicaDeletionStarted states before moving to 
> ReplicaDeletionIneligible state. Instead it is in OnlineReplica state   
> at 
> 

[jira] [Resolved] (KAFKA-15887) Autocommit during close consistently fails with exception in background thread

2023-11-29 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15887.

Resolution: Fixed

> Autocommit during close consistently fails with exception in background thread
> --
>
> Key: KAFKA-15887
> URL: https://issues.apache.org/jira/browse/KAFKA-15887
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Lucas Brutschy
>Assignee: Philip Nee
>Priority: Blocker
>
> when I run {{AsyncKafkaConsumerTest}} I get this every time I call close:
> {code:java}
> java.lang.IndexOutOfBoundsException: Index: 0
>   at java.base/java.util.Collections$EmptyList.get(Collections.java:4483)
>   at 
> java.base/java.util.Collections$UnmodifiableList.get(Collections.java:1310)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.findCoordinatorSync(ConsumerNetworkThread.java:302)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.ensureCoordinatorReady(ConsumerNetworkThread.java:288)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.maybeAutoCommitAndLeaveGroup(ConsumerNetworkThread.java:276)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.cleanup(ConsumerNetworkThread.java:257)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.run(ConsumerNetworkThread.java:101)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15930) Flaky test - testWithGroupId - TransactionsBounceTest

2023-11-29 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15930:
--

 Summary: Flaky test - testWithGroupId - TransactionsBounceTest
 Key: KAFKA-15930
 URL: https://issues.apache.org/jira/browse/KAFKA-15930
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test is intermittently failing.

[https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=Europe%2FVienna=trunk=kafka.api.TransactionsBounceTest=testWithGroupId()]

 
{code:java}
org.apache.kafka.common.KafkaException: Unexpected error in 
TxnOffsetCommitResponse: The server disconnected before a response was received.
    at 
app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702)
    at 
app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236)
    at 
app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
    at 
app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:594)
    at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:586)
    at 
app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:460)
    at 
app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:337)
    at 
app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:252)
    at java.base@17.0.7/java.lang.Thread.run(Thread.java:833)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15929) Flaky test - testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress - TopicCommandIntegrationTest

2023-11-29 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15929:
--

 Summary: Flaky test - 
testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress - 
TopicCommandIntegrationTest
 Key: KAFKA-15929
 URL: https://issues.apache.org/jira/browse/KAFKA-15929
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test is intermittently failing.

https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=Europe%2FVienna=trunk=org.apache.kafka.tools.TopicCommandIntegrationTest=testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress(String)%5B1%5D
{code:java}
java.lang.NullPointerException: Cannot invoke 
"org.apache.kafka.clients.admin.PartitionReassignment.addingReplicas()" because 
"reassignments" is null
    at 
org.apache.kafka.tools.TopicCommandIntegrationTest.testDescribeUnderReplicatedPartitionsWhenReassignmentIsInProgress(TopicCommandIntegrationTest.java:800)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15928) Flaky test - testBalancePartitionLeaders - QuorumControllerTest

2023-11-29 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15928:
--

 Summary: Flaky test - testBalancePartitionLeaders - 
QuorumControllerTest
 Key: KAFKA-15928
 URL: https://issues.apache.org/jira/browse/KAFKA-15928
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test is intermittently failing.

[https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=Europe%2FVienna=trunk=org.apache.kafka.controller.QuorumControllerTest=testBalancePartitionLeaders()]

 

```
org.opentest4j.AssertionFailedError: Condition not met within timeout 1. 
Leaders were not balanced after unfencing all of the brokers ==> expected: 
 but was: 
 at 
org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
 at 
org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
 at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
 at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
 at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
 at 
org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
 at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
 at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
 at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
 at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
 at 
org.apache.kafka.controller.QuorumControllerTest.testBalancePartitionLeaders(QuorumControllerTest.java:490)
```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15927) Flaky test - testReplicateSourceDefault - MirrorConnectorsIntegrationExactlyOnceTest

2023-11-29 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15927:
--

 Summary: Flaky test - testReplicateSourceDefault - 
MirrorConnectorsIntegrationExactlyOnceTest
 Key: KAFKA-15927
 URL: https://issues.apache.org/jira/browse/KAFKA-15927
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test is intermittently failing

[https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=Europe%2FVienna=trunk=*MirrorConnectorsIntegrationExactlyOnceTest]
{code:java}
org.opentest4j.AssertionFailedError: `delete.retention.ms` should be different, 
because it's in exclude filter!  ==> expected: not equal but was: <8640>
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:152)
   at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   at 
app//org.junit.jupiter.api.AssertNotEquals.failEqual(AssertNotEquals.java:277)  
 at 
app//org.junit.jupiter.api.AssertNotEquals.assertNotEquals(AssertNotEquals.java:263)
 at app//org.junit.jupiter.api.Assertions.assertNotEquals(Assertions.java:2828) 
 at 
app//org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.lambda$testReplicateSourceDefault$9(MirrorConnectorsIntegrationBaseTest.java:826)
   at 
app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
   at 
app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
 at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)   
 at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)   
 at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)   
 at 
app//org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testReplicateSourceDefault(MirrorConnectorsIntegrationBaseTest.java:813)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15926) Flaky test - testReplicateSourceDefault - MirrorConnectorsIntegrationSSLTest

2023-11-29 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15926:
--

 Summary: Flaky test - testReplicateSourceDefault - 
MirrorConnectorsIntegrationSSLTest
 Key: KAFKA-15926
 URL: https://issues.apache.org/jira/browse/KAFKA-15926
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test is intermittently failing.

See 
[https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=Europe%2FVienna=trunk=*MirrorConnectorsIntegrationSSLTest]
 for failed runs
{code:java}
org.opentest4j.AssertionFailedError: `delete.retention.ms` should be different, 
because it's in exclude filter!  ==> expected: not equal but was: <8640>
    at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:152)
    at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at 
app//org.junit.jupiter.api.AssertNotEquals.failEqual(AssertNotEquals.java:277)
    at 
app//org.junit.jupiter.api.AssertNotEquals.assertNotEquals(AssertNotEquals.java:263)
    at 
app//org.junit.jupiter.api.Assertions.assertNotEquals(Assertions.java:2828)
    at 
app//org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.lambda$testReplicateSourceDefault$9(MirrorConnectorsIntegrationBaseTest.java:826)
    at 
app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
    at 
app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
    at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
    at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
    at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
    at 
app//org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testReplicateSourceDefault(MirrorConnectorsIntegrationBaseTest.java:813)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15925) Flaky test testReplicateSourceDefault - MirrorConnectorsIntegrationTransactionsTest

2023-11-29 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15925:
--

 Summary: Flaky test testReplicateSourceDefault - 
MirrorConnectorsIntegrationTransactionsTest
 Key: KAFKA-15925
 URL: https://issues.apache.org/jira/browse/KAFKA-15925
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test is intermittently failing. 

[https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=Europe%2FVienna=trunk=*MirrorConnectorsIntegrationTransactionsTest]

[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14842/10/tests/]

 
{code:java}
org.opentest4j.AssertionFailedError: `delete.retention.ms` should be different, 
because it's in exclude filter!  ==> expected: not equal but was: <8640>
at 
app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:152)
   at 
app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
   at 
app//org.junit.jupiter.api.AssertNotEquals.failEqual(AssertNotEquals.java:277)  
 at 
app//org.junit.jupiter.api.AssertNotEquals.assertNotEquals(AssertNotEquals.java:263)
 at app//org.junit.jupiter.api.Assertions.assertNotEquals(Assertions.java:2828) 
 at 
app//org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.lambda$testReplicateSourceDefault$9(MirrorConnectorsIntegrationBaseTest.java:826)
   at 
app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
   at 
app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
 at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)   
 at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)   
 at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)   
 at 
app//org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testReplicateSourceDefault(MirrorConnectorsIntegrationBaseTest.java:813){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15544) Enable existing client integration tests for new protocol

2023-11-28 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15544.

Resolution: Fixed

> Enable existing client integration tests for new protocol 
> --
>
> Key: KAFKA-15544
> URL: https://issues.apache.org/jira/browse/KAFKA-15544
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Andrew Schofield
>Priority: Blocker
>  Labels: kip-848, kip-848-client-support, kip-848-e2e, 
> kip-848-preview
>
> Enable and validate that the integration tests defined in `PlaintextConsumerTest` 
> that relate to consumer group functionality work with the new consumer group 
> protocol. The majority of these tests should work with both consumer group 
> protocols, so parameterizing the tests seems like the most appropriate way 
> forward.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15916) Flaky test - testClientSideTimeoutAfterFailureToReceiveResponse - KafkaAdminClientTest

2023-11-28 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15916:
--

 Summary: Flaky test - 
testClientSideTimeoutAfterFailureToReceiveResponse - KafkaAdminClientTest
 Key: KAFKA-15916
 URL: https://issues.apache.org/jira/browse/KAFKA-15916
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test intermittently gives the following result:

{code}
java.lang.IllegalStateException: No requests pending for inbound response 
MetadataResponseData(throttleTimeMs=0, 
brokers=[MetadataResponseBroker(nodeId=1, host='localhost', port=8122, 
rack=null), MetadataResponseBroker(nodeId=0, host='localhost', port=8121, 
rack=null), MetadataResponseBroker(nodeId=2, host='localhost', port=8123, 
rack=null)], clusterId='mockClusterId', controllerId=0, topics=[], 
clusterAuthorizedOperations=-2147483648)
at org.apache.kafka.clients.MockClient.respond(MockClient.java:393)
at org.apache.kafka.clients.MockClient.respond(MockClient.java:367)
at 
org.apache.kafka.clients.admin.KafkaAdminClientTest.testClientSideTimeoutAfterFailureToReceiveResponse(KafkaAdminClientTest.java:7017)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15915) Flaky test - testUnrecoverableError - ProducerIdManagerTest

2023-11-28 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15915:
--

 Summary: Flaky test - testUnrecoverableError - 
ProducerIdManagerTest
 Key: KAFKA-15915
 URL: https://issues.apache.org/jira/browse/KAFKA-15915
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


Test intermittently gives the following result:

{code}
java.lang.UnsupportedOperationException: Success.failed
at scala.util.Success.failed(Try.scala:277)
at 
kafka.coordinator.transaction.ProducerIdManagerTest.verifyFailure(ProducerIdManagerTest.scala:234)
at 
kafka.coordinator.transaction.ProducerIdManagerTest.testUnrecoverableErrors(ProducerIdManagerTest.scala:199)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15914) Flaky test - testAlterSinkConnectorOffsetsOverriddenConsumerGroupId - OffsetsApiIntegrationTest

2023-11-28 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15914:
--

 Summary: Flaky test - 
testAlterSinkConnectorOffsetsOverriddenConsumerGroupId - 
OffsetsApiIntegrationTest
 Key: KAFKA-15914
 URL: https://issues.apache.org/jira/browse/KAFKA-15914
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


```
org.opentest4j.AssertionFailedError: Condition not met within timeout 3. 
Sink connector consumer group offsets should catch up to the topic end offsets 
==> expected:  but was: 
at 
org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at 
org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
at 
org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
at 
org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:917)
at 
org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.alterAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:396)
at 
org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testAlterSinkConnectorOffsetsOverriddenConsumerGroupId(OffsetsApiIntegrationTest.java:297)
```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14624) State restoration is broken with standby tasks and cache-enabled stores in processor API

2023-11-27 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14624.

Resolution: Fixed

> State restoration is broken with standby tasks and cache-enabled stores in 
> processor API
> 
>
> Key: KAFKA-14624
> URL: https://issues.apache.org/jira/browse/KAFKA-14624
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.3.1
>Reporter: Balaji Rao
>Assignee: Lucas Brutschy
>Priority: Major
>
> I found that cache-enabled state stores in PAPI with standby tasks sometimes 
> returns stale data when a partition moves from one app instance to another 
> and back. [Here's|https://github.com/balajirrao/kafka-streams-multi-runner] a 
> small project that I used to reproduce the issue.
> I dug around a bit and it seems like it's a bug in standby task state 
> restoration when caching is enabled. If a partition moves from instance 1 to 
> 2 and then back to instance 1,  since the `CachingKeyValueStore` doesn't 
> register a restore callback, it can return potentially stale data for 
> non-dirty keys. 
> I could fix the issue by modifying the `CachingKeyValueStore` to register a 
> restore callback in which the cache restored keys are added to the cache. Is 
> this fix in the right direction?
> {code:java}
> // register the store
> context.register(
> root,
> (RecordBatchingStateRestoreCallback) records -> {
> for (final ConsumerRecord<byte[], byte[]> record : records) {
> put(Bytes.wrap(record.key()), record.value());
> }
> }
> );
> {code}
>  
> I would like to contribute a fix, if I can get some help!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15798) Flaky Test NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()

2023-11-27 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15798.

Resolution: Fixed

Test disabled in https://github.com/apache/kafka/pull/14830 since feature will 
be removed.

> Flaky Test 
> NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()
> -
>
> Key: KAFKA-15798
> URL: https://issues.apache.org/jira/browse/KAFKA-15798
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Justine Olshan
>Assignee: Lucas Brutschy
>Priority: Major
>  Labels: flaky-test
>
> I saw a few examples recently. 2 have the same error, but the third is 
> different
> [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14629/22/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology___2/]
> [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_21_and_Scala_2_13___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/]
>  
> The failure is like
> {code:java}
> java.lang.AssertionError: Did not receive all 5 records from topic 
> output-stream-1 within 6 ms, currently accumulated data is [] Expected: 
> is a value equal to or greater than <5> but: <0> was less than <5>{code}
> The other failure was
> [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/]
> {code:java}
> java.lang.AssertionError: Expected: <[0, 1]> but: was <[0]>{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14014) Flaky test NamedTopologyIntegrationTest.shouldAllowRemovingAndAddingNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets()

2023-11-27 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14014.

Resolution: Fixed

Test disabled since feature will be removed.

> Flaky test 
> NamedTopologyIntegrationTest.shouldAllowRemovingAndAddingNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets()
> 
>
> Key: KAFKA-14014
> URL: https://issues.apache.org/jira/browse/KAFKA-14014
> Project: Kafka
>  Issue Type: Test
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Matthew de Detrich
>Priority: Critical
>  Labels: flaky-test
>
> {code:java}
> java.lang.AssertionError: 
> Expected: <[KeyValue(B, 1), KeyValue(A, 2), KeyValue(C, 2)]>
>  but: was <[KeyValue(B, 1), KeyValue(A, 2), KeyValue(C, 1)]>
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.NamedTopologyIntegrationTest.shouldAllowRemovingAndAddingNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets(NamedTopologyIntegrationTest.java:540)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12310/2/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldAllowRemovingAndAddingNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets/
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12310/2/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_17_and_Scala_2_13___shouldAllowRemovingAndAddingNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13531) Flaky test NamedTopologyIntegrationTest

2023-11-27 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-13531.

Resolution: Fixed

Test disabled since feature will be removed.

> Flaky test NamedTopologyIntegrationTest
> ---
>
> Key: KAFKA-13531
> URL: https://issues.apache.org/jira/browse/KAFKA-13531
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Matthias J. Sax
>Assignee: Matthew de Detrich
>Priority: Critical
>  Labels: flaky-test
> Attachments: 
> org.apache.kafka.streams.integration.NamedTopologyIntegrationTest.shouldRemoveOneNamedTopologyWhileAnotherContinuesProcessing().test.stdout
>
>
> org.apache.kafka.streams.integration.NamedTopologyIntegrationTest.shouldRemoveNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets
> {quote}java.lang.AssertionError: Did not receive all 3 records from topic 
> output-stream-2 within 6 ms, currently accumulated data is [] Expected: 
> is a value equal to or greater than <3> but: <0> was less than <3> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:648)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:368)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:336)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:644)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:617)
>  at 
> org.apache.kafka.streams.integration.NamedTopologyIntegrationTest.shouldRemoveNamedTopologyToRunningApplicationWithMultipleNodesAndResetsOffsets(NamedTopologyIntegrationTest.java:439){quote}
> STDERR
> {quote}java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.GroupSubscribedToTopicException: Deleting 
> offsets of a topic is forbidden while the consumer group is actively 
> subscribed to it. at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
>  at 
> org.apache.kafka.streams.processor.internals.namedtopology.KafkaStreamsNamedTopologyWrapper.lambda$removeNamedTopology$3(KafkaStreamsNamedTopologyWrapper.java:213)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl.lambda$whenComplete$2(KafkaFutureImpl.java:107)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> at 
> org.apache.kafka.common.internals.KafkaCompletableFuture.kafkaComplete(KafkaCompletableFuture.java:39)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl.complete(KafkaFutureImpl.java:122)
>  at 
> org.apache.kafka.streams.processor.internals.TopologyMetadata.maybeNotifyTopologyVersionWaiters(TopologyMetadata.java:154)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.checkForTopologyUpdates(StreamThread.java:916)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:598)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:575)
>  Caused by: org.apache.kafka.common.errors.GroupSubscribedToTopicException: 
> Deleting offsets of a topic is forbidden while the consumer group is actively 
> subscribed to it. java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.GroupSubscribedToTopicException: Deleting 
> offsets of a topic is forbidden while the consumer group is actively 
> subscribed to it. at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
>  at 
> org.apache.kafka.streams.processor.internals.namedtopology.KafkaStreamsNamedTopologyWrapper.lambda$removeNamedTopology$3(KafkaStreamsNamedTopologyWrapper.java:213)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl.lambda$whenComplete$2(KafkaFutureImpl.java:107)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> 

[jira] [Created] (KAFKA-15887) Autocommit during close consistently fails with exception in background thread

2023-11-22 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15887:
--

 Summary: Autocommit during close consistently fails with exception 
in background thread
 Key: KAFKA-15887
 URL: https://issues.apache.org/jira/browse/KAFKA-15887
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy
Assignee: Philip Nee


when I run {{AsyncKafkaConsumerTest}} I get this every time I call close:

{code:java}
java.lang.IndexOutOfBoundsException: Index: 0
at java.base/java.util.Collections$EmptyList.get(Collections.java:4483)
at 
java.base/java.util.Collections$UnmodifiableList.get(Collections.java:1310)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.findCoordinatorSync(ConsumerNetworkThread.java:302)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.ensureCoordinatorReady(ConsumerNetworkThread.java:288)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.maybeAutoCommitAndLeaveGroup(ConsumerNetworkThread.java:276)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.cleanup(ConsumerNetworkThread.java:257)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkThread.run(ConsumerNetworkThread.java:101)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15833) Restrict Consumer API to be used from one thread

2023-11-15 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15833:
--

 Summary: Restrict Consumer API to be used from one thread
 Key: KAFKA-15833
 URL: https://issues.apache.org/jira/browse/KAFKA-15833
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy


The legacy consumer restricts the API to be used from one thread only. This is 
not enforced in the new consumer. To avoid inconsistencies in the behavior, we 
should enforce the same restriction in the new consumer.
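
For context, this is roughly the kind of guard the legacy consumer wraps around every public API call; the sketch below is illustrative only (the names are made up and it is not the AsyncKafkaConsumer implementation):

{code:java}
import java.util.ConcurrentModificationException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a single-thread access guard: acquire() at the start of
// every API call, release() at the end. Re-entrant calls from the owning thread
// are allowed; calls from any other thread fail fast.
final class SingleThreadAccessGuard {
    private static final long NO_CURRENT_THREAD = -1L;
    private final AtomicLong currentThread = new AtomicLong(NO_CURRENT_THREAD);
    private final AtomicInteger refCount = new AtomicInteger(0);

    void acquire() {
        final long threadId = Thread.currentThread().getId();
        if (threadId != currentThread.get()
                && !currentThread.compareAndSet(NO_CURRENT_THREAD, threadId)) {
            throw new ConcurrentModificationException(
                "The consumer is not safe for multi-threaded access");
        }
        refCount.incrementAndGet();
    }

    void release() {
        if (refCount.decrementAndGet() == 0) {
            currentThread.set(NO_CURRENT_THREAD);
        }
    }
}
{code}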



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15326) Decouple Processing Thread from Polling Thread

2023-11-02 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15326.

Resolution: Implemented

> Decouple Processing Thread from Polling Thread
> --
>
> Key: KAFKA-15326
> URL: https://issues.apache.org/jira/browse/KAFKA-15326
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Lucas Brutschy
>Assignee: Lucas Brutschy
>Priority: Critical
>
> As part of an ongoing effort to implement a better threading architecture in 
> Kafka Streams, we decouple N stream threads into N polling threads and N 
> processing threads. The effort to consolidate the N polling threads into a 
> single thread is a follow-up to this ticket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15463) StreamsException: Accessing from an unknown node

2023-09-19 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-15463.

Resolution: Not A Problem

>  StreamsException: Accessing from an unknown node
> -
>
> Key: KAFKA-15463
> URL: https://issues.apache.org/jira/browse/KAFKA-15463
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.2.1
>Reporter: Yevgeny
>Priority: Major
>
> After working fine for some time, the application started to get the exception below.
>  
> This is a Spring Boot application that runs in Kubernetes as a stateful pod.
>  
>  
>  
> {code:java}
>   Exception in thread 
> "-ddf9819f-d6c7-46ce-930e-cd923e1b3c2c-StreamThread-1" 
> org.apache.kafka.streams.errors.StreamsException: Accessing from an unknown 
> node at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.getStateStore(ProcessorContextImpl.java:162)
>  at myclass1.java:28) at myclass2.java:48) at 
> java.base/java.util.stream.MatchOps$1MatchSink.accept(MatchOps.java:90) at 
> java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1602)
>  at 
> java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:129)
>  at 
> java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:527)
>  at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:513)
>  at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>  at 
> java.base/java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
>  at 
> java.base/java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
>  at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at 
> java.base/java.util.stream.ReferencePipeline.allMatch(ReferencePipeline.java:637)
>  at myclass3.java:48) at 
> org.apache.kafka.streams.kstream.internals.TransformerSupplierAdapter$1.transform(TransformerSupplierAdapter.java:49)
>  at 
> org.apache.kafka.streams.kstream.internals.TransformerSupplierAdapter$1.transform(TransformerSupplierAdapter.java:38)
>  at 
> org.apache.kafka.streams.kstream.internals.KStreamFlatTransform$KStreamFlatTransformProcessor.process(KStreamFlatTransform.java:66)
>  at 
> org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:146)
>  at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:275)
>  at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:254)
>  at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:213)
>  at 
> org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
>  at 
> org.apache.kafka.streams.processor.internals.StreamTask.lambda$doProcess$1(StreamTask.java:780)
>  at 
> org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:809)
>  at 
> org.apache.kafka.streams.processor.internals.StreamTask.doProcess(StreamTask.java:780)
>  at 
> org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:711)
>  at 
> org.apache.kafka.streams.processor.internals.TaskExecutor.processTask(TaskExecutor.java:100)
>  at 
> org.apache.kafka.streams.processor.internals.TaskExecutor.process(TaskExecutor.java:81)
>  at 
> org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:1177)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:769)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:589)
>  at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:551)
>    {code}
>  
> stream-thread 
> [-ddf9819f-d6c7-46ce-930e-cd923e1b3c2c-StreamThread-1] State 
> transition from PENDING_SHUTDOWN to DEAD
>  
>  
> The Transformer is a prototype bean; the supplier supplies a new instance of the 
> Transformer:
>  
>  
> {code:java}
> @Override public Transformer> get() 
> {     return ctx.getBean(MyTransformer.class); }{code}
>  
>  
> The only way to recover is to delete all topics used by Kafka Streams; even if 
> the application is restarted, the same exception is thrown.
> If messages in the internal 'store-changelog' topics are deleted or offsets are 
> manipulated, can that cause the issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15378) Rolling upgrade system tests are failing

2023-08-18 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15378:
--

 Summary: Rolling upgrade system tests are failing
 Key: KAFKA-15378
 URL: https://issues.apache.org/jira/browse/KAFKA-15378
 Project: Kafka
  Issue Type: Task
  Components: streams
Affects Versions: 3.5.1
Reporter: Lucas Brutschy


The system tests are having failures for these tests:

```
kafkatest.tests.streams.streams_upgrade_test.StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces.from_version=3.1.2.to_version=3.6.0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces.from_version=3.2.3.to_version=3.6.0-SNAPSHOT
kafkatest.tests.streams.streams_upgrade_test.StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces.from_version=3.3.1.to_version=3.6.0-SNAPSHOT
```

See 
[https://jenkins.confluent.io/job/system-test-kafka-branch-builder/5801/console]
 for logs and other test data.

Note that system tests currently only run with [this 
fix](https://github.com/apache/kafka/commit/24d1780061a645bb2fbeefd8b8f50123c28ca94e); 
I think a CVE-related Python library update broke the system tests... 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15326) Decouple Processing Thread from Polling Thread

2023-08-09 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-15326:
--

 Summary: Decouple Processing Thread from Polling Thread
 Key: KAFKA-15326
 URL: https://issues.apache.org/jira/browse/KAFKA-15326
 Project: Kafka
  Issue Type: Task
  Components: streams
Reporter: Lucas Brutschy
Assignee: Lucas Brutschy


As part of an ongoing effort to implement a better threading architecture in 
Kafka Streams, we decouple N stream threads into N polling threads and N 
processing threads. The effort to consolidate the N polling threads into a 
single thread is a follow-up to this ticket.
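
For illustration only, a minimal sketch of such a split, assuming a hand-off through a bounded queue (none of the names below come from the actual Streams code):

{code:java}
import java.time.Duration;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: a polling thread hands polled record batches to a
// processing thread through a bounded queue, so slow processing
// back-pressures polling instead of blocking it in the same loop.
final class PollingProcessingSketch {

    interface RecordBatch { }                       // placeholder for a batch of polled records
    interface Poller { RecordBatch poll(Duration timeout); }
    interface Processor { void process(RecordBatch batch); }

    static void run(Poller poller, Processor processor) {
        BlockingQueue<RecordBatch> handoff = new ArrayBlockingQueue<>(64);

        Thread pollingThread = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    RecordBatch batch = poller.poll(Duration.ofMillis(100));
                    if (batch != null) {
                        handoff.put(batch);          // blocks when the queue is full
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "polling-thread");

        Thread processingThread = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    processor.process(handoff.take()); // blocks when the queue is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "processing-thread");

        pollingThread.start();
        processingThread.start();
    }
}
{code}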



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14278) Convert INVALID_PRODUCER_EPOCH into PRODUCER_FENCED TxnOffsetCommit

2023-06-15 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14278.

Resolution: Fixed

> Convert INVALID_PRODUCER_EPOCH into PRODUCER_FENCED TxnOffsetCommit
> ---
>
> Key: KAFKA-14278
> URL: https://issues.apache.org/jira/browse/KAFKA-14278
> Project: Kafka
>  Issue Type: Sub-task
>  Components: producer , streams
>Reporter: Lucas Brutschy
>Assignee: Lucas Brutschy
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

2023-04-25 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14172.

Resolution: Fixed

> bug: State stores lose state when tasks are reassigned under EOS wit…
> -
>
> Key: KAFKA-14172
> URL: https://issues.apache.org/jira/browse/KAFKA-14172
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.1
>Reporter: Martin Hørslev
>Assignee: Guozhang Wang
>Priority: Critical
> Fix For: 3.5.0
>
>
> h1. State stores lose state when tasks are reassigned under EOS with standby 
> replicas and default acceptable lag.
> I have observed that state stores used in a transform step under Exactly-Once 
> semantics end up losing state after a rebalancing event that includes 
> reassignment of tasks to a previous standby task within the acceptable standby 
> lag.
>  
> The problem is reproducible and an integration test has been created to 
> showcase the [issue|https://github.com/apache/kafka/pull/12540]. 
> A detailed description of the observed issue is provided 
> [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported to StackOverflow for example 
> [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14343) Write upgrade/downgrade tests for enabling the state updater

2022-12-20 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14343.

Resolution: Fixed

> Write upgrade/downgrade tests for enabling the state updater 
> -
>
> Key: KAFKA-14343
> URL: https://issues.apache.org/jira/browse/KAFKA-14343
> Project: Kafka
>  Issue Type: Task
>  Components: streams
>Reporter: Lucas Brutschy
>Assignee: Lucas Brutschy
>Priority: Major
>
> Write a test that verifies the upgrade from a version of Streams with state 
> updater disabled to a version with state updater enabled and vice versa, so 
> that we can offer a safe upgrade path.
>  * upgrade test from a version of Streams with state updater disabled to a 
> version with state updater enabled (probably a system test since the old code 
> path will be removed from the code base)
>  * downgrade test from a version of Streams with state updater enabled to a 
> version with state updater disabled (probably a system test since the old 
> code path will be removed from the code base)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14532) IllegalStateException when fetch failure happens after assignment changed

2022-12-19 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14532:
--

 Summary: IllegalStateException when fetch failure happens after 
assignment changed
 Key: KAFKA-14532
 URL: https://issues.apache.org/jira/browse/KAFKA-14532
 Project: Kafka
  Issue Type: Bug
  Components: clients
Reporter: Lucas Brutschy
Assignee: Lucas Brutschy


On master, all our long-running test jobs are running into this exception: 

 

```

java.lang.IllegalStateException: No current assignment for partition stream-soak-test-KSTREAM-OUTERTHIS-86-store-changelog-1
    at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:370)
    at org.apache.kafka.clients.consumer.internals.SubscriptionState.clearPreferredReadReplica(SubscriptionState.java:623)
    at java.util.LinkedHashMap$LinkedKeySet.forEach(LinkedHashMap.java:559)
    at org.apache.kafka.clients.consumer.internals.Fetcher$1.onFailure(Fetcher.java:349)
    at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:179)
    at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:149)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:613)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:427)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1307)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1243)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1216)
    at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:450)
    at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:910)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:773)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:613)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:575)
[2022-12-13 04:01:59,024] ERROR [i-016cf5d2c1889c316-StreamThread-1] stream-client [i-016cf5d2c1889c316] Encountered the following exception during processing and sent shutdown request for the entire application. (org.apache.kafka.streams.KafkaStreams)
org.apache.kafka.streams.errors.StreamsException: java.lang.IllegalStateException: No current assignment for partition stream-soak-test-KSTREAM-OUTERTHIS-86-store-changelog-1
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:653)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:575)
Caused by: java.lang.IllegalStateException: No current assignment for partition stream-soak-test-KSTREAM-OUTERTHIS-86-store-changelog-1
    at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:370)
    at org.apache.kafka.clients.consumer.internals.SubscriptionState.clearPreferredReadReplica(SubscriptionState.java:623)
    at java.util.LinkedHashMap$LinkedKeySet.forEach(LinkedHashMap.java:559)
    at org.apache.kafka.clients.consumer.internals.Fetcher$1.onFailure(Fetcher.java:349)
    at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:179)
    at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:149)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:613)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:427)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1307)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1243)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1216)
    at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:450)
    at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:910)
    at 
[jira] [Created] (KAFKA-14530) Check state updater more than once in process loops

2022-12-19 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14530:
--

 Summary: Check state updater more than once in process loops
 Key: KAFKA-14530
 URL: https://issues.apache.org/jira/browse/KAFKA-14530
 Project: Kafka
  Issue Type: Task
  Components: streams
Reporter: Lucas Brutschy


In the new state restoration code, the state updater needs to be checked 
regularly by the main thread to transfer ownership of tasks back to the main 
thread once the state of the task is restored. The more often we check this, 
the faster we can start processing the tasks.

Currently, we only check the state updater once per iteration of the outer 
processing loop. While we have not observed this to be strictly too infrequent, 
we can easily increase the number of checks by moving the check inside the 
inner processing loop. This would mean that once we have iterated over 
`numIterations` records, we can already start processing tasks that have 
finished restoration in the meantime.
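
A rough sketch of the proposed change, with placeholder names only (not the actual StreamThread code):

{code:java}
// Illustrative sketch: the state updater is checked once per outer iteration
// *and* after every inner batch of `numIterations` records, so tasks that
// finished restoration in the meantime can start processing sooner.
abstract class ProcessLoopSketch {
    protected final int numIterations = 100;        // assumed batch size, for illustration

    abstract boolean moreRecordsToProcess();
    abstract void processBatch(int maxRecords);
    abstract void transferRestoredTasksFromStateUpdater();

    void runOnce() {
        transferRestoredTasksFromStateUpdater();     // previous behaviour: once per outer iteration
        while (moreRecordsToProcess()) {
            processBatch(numIterations);
            transferRestoredTasksFromStateUpdater(); // proposed: also after each inner batch
        }
    }
}
{code}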



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14432) RocksDBStore relies on finalizers to not leak memory

2022-12-01 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14432:
--

 Summary: RocksDBStore relies on finalizers to not leak memory
 Key: KAFKA-14432
 URL: https://issues.apache.org/jira/browse/KAFKA-14432
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy
Assignee: Lucas Brutschy


Relying on finalizers in RocksDB has been deprecated for a long time, and 
starting with rocksdb 7, finalizers are removed completely (see 
[https://github.com/facebook/rocksdb/pull/9523]). 

Kafka Streams currently relies in places on finalizers to avoid leaking native memory. This 
needs to be resolved before we can upgrade to RocksDB 7.

See  [https://github.com/apache/kafka/pull/12809] .
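
The general pattern the fix requires is to close every native RocksDB object explicitly instead of relying on a finalizer. A minimal, illustrative sketch using the public RocksJava API (the path and setup below are made up, this is not the RocksDBStore code):

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Statistics;

// Illustrative only: every native handle (Statistics, Options, the DB itself)
// is closed deterministically via try-with-resources, so no finalizer is
// needed to release the native memory.
public class ExplicitCloseSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Statistics statistics = new Statistics();
             Options options = new Options()
                     .setCreateIfMissing(true)
                     .setStatistics(statistics);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-sketch")) {
            db.put("key".getBytes(), "value".getBytes());
        } // all native handles are released here, no finalizer involved
    }
}
{code}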

This is a native heap profile after running Kafka Streams without finalizers 
for a few hours:
Total: 13547.5 MB
 12936.3  95.5%  95.5%  12936.3  95.5% rocksdb::port::cacheline_aligned_alloc
   438.5   3.2%  98.7%438.5   3.2% rocksdb::BlockFetcher::ReadBlockContents
84.0   0.6%  99.3% 84.2   0.6% rocksdb::Arena::AllocateNewBlock
45.9   0.3%  99.7% 45.9   0.3% prof_backtrace_impl
 8.1   0.1%  99.7% 14.6   0.1% 
rocksdb::BlockBasedTable::PutDataBlockToCache
 6.4   0.0%  99.8%  12941.4  95.5% 
Java_org_rocksdb_Statistics_newStatistics___3BJ
 6.1   0.0%  99.8%  6.9   0.1% rocksdb::LRUCacheShard::Insert@2d8b20
 5.1   0.0%  99.9%  6.5   0.0% 
rocksdb::VersionSet::ProcessManifestWrites
 3.9   0.0%  99.9%  3.9   0.0% 
rocksdb::WritableFileWriter::WritableFileWriter
 3.2   0.0%  99.9%  3.2   0.0% std::string::_Rep::_S_create



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14299) Benchmark and stabilize state updater

2022-11-28 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14299.

  Assignee: Lucas Brutschy
Resolution: Fixed

> Benchmark and stabilize state updater
> -
>
> Key: KAFKA-14299
> URL: https://issues.apache.org/jira/browse/KAFKA-14299
> Project: Kafka
>  Issue Type: Task
>Reporter: Lucas Brutschy
>Assignee: Lucas Brutschy
>Priority: Major
>
> We need to benchmark and stabilize the separate state restoration code path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14415) `ThreadCache` is getting slower with every additional state store

2022-11-22 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14415:
--

 Summary: `ThreadCache` is getting slower with every additional 
state store
 Key: KAFKA-14415
 URL: https://issues.apache.org/jira/browse/KAFKA-14415
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


There are a few lines in `ThreadCache` that I think should be optimized. 
`sizeBytes` is called at least once, and potentially many times, in every 
`put` and is linear in the number of caches (= number of state stores, so 
typically proportional to the number of tasks). That means that with every 
additional task, every `put` gets a little slower. The throughput is 30% 
higher if we replace it with a constant-time update…
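
A hedged, simplified sketch of the idea; names and structure are illustrative 
and do not match the real `ThreadCache`. Instead of re-summing the sizes of 
all per-store caches on every call, a running total is maintained 
incrementally, which makes `sizeBytes` constant time. For brevity the sketch 
only accounts for value bytes; the real cache also accounts for keys and 
per-entry overhead.

import java.util.HashMap;
import java.util.Map;

// Illustrative only; not the actual ThreadCache implementation.
public class ConstantTimeSizeCacheSketch {

    private final Map<String, Map<String, byte[]>> cachesByStore = new HashMap<>();
    private long totalValueBytes = 0L;   // running total, updated on every put

    public void put(final String storeName, final String key, final byte[] value) {
        final Map<String, byte[]> cache =
            cachesByStore.computeIfAbsent(storeName, name -> new HashMap<>());
        final byte[] previous = cache.put(key, value);
        if (previous != null) {
            totalValueBytes -= previous.length;   // replaced value no longer counted
        }
        totalValueBytes += value.length;
    }

    // O(1): the hot path no longer iterates over all per-store caches.
    public long sizeBytes() {
        return totalValueBytes;
    }

    // The linear variant the ticket describes: cost grows with the number of
    // state stores, and it is invoked on every put.
    public long sizeBytesLinear() {
        long total = 0L;
        for (final Map<String, byte[]> cache : cachesByStore.values()) {
            for (final byte[] value : cache.values()) {
                total += value.length;
            }
        }
        return total;
    }
}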

Compare the throughput of TIME_ROCKS on trunk (green graph):

[http://kstreams-benchmark-results.s3-website-us-west-2.amazonaws.com/experiments/stateheavy-3-5-3-4-0-51b7eb7937-jenkins-20221113214104-streamsbench/]

This is the throughput of TIME_ROCKS when a constant time `sizeBytes` 
implementation is used:

[http://kstreams-benchmark-results.s3-website-us-west-2.amazonaws.com/experiments/stateheavy-3-5-LUCASCOMPARE-lucas-20221122140846-streamsbench/]

So the throughput is ~20% higher. 

The same seems to apply to the MEM backend (initial throughput >8000 instead 
of 6000); however, I cannot run the same benchmark there because the memory 
fills up too quickly.

[http://kstreams-benchmark-results.s3-website-us-west-2.amazonaws.com/experiments/stateheavy-3-5-LUCASSTATE-lucas-20221121231632-streamsbench/]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14309) Kafka Streams upgrade tests do not cover for FK-joins

2022-11-15 Thread Lucas Brutschy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Brutschy resolved KAFKA-14309.

Resolution: Fixed

> Kafka Streams upgrade tests do not cover for FK-joins
> -
>
> Key: KAFKA-14309
> URL: https://issues.apache.org/jira/browse/KAFKA-14309
> Project: Kafka
>  Issue Type: Bug
>Reporter: Lucas Brutschy
>Assignee: Lucas Brutschy
>Priority: Major
>
> The current streams upgrade system test for FK joins inserts the production 
> of foreign key data and an actual foreign key join in every version of 
> SmokeTestDriver except for the latest. The effect is that FK join upgrades 
> are not tested at all, since no FK join code is executed after the bounce in 
> the system test.
> We should enable the FK-join code in the system test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14343) Write upgrade/downgrade tests for enabling the state updater

2022-10-31 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14343:
--

 Summary: Write upgrade/downgrade tests for enabling the state 
updater 
 Key: KAFKA-14343
 URL: https://issues.apache.org/jira/browse/KAFKA-14343
 Project: Kafka
  Issue Type: Task
  Components: streams
Reporter: Lucas Brutschy
Assignee: Lucas Brutschy


Write a test that verifies the upgrade from a version of Streams with the 
state updater disabled to a version with the state updater enabled, and vice 
versa, so that we can offer a safe upgrade path.
 * upgrade test from a version of Streams with state updater disabled to a 
version with state updater enabled (probably a system test since the old code 
path will be removed from the code base)

 * downgrade test from a version of Streams with state updater enabled to a 
version with state updater disabled (probably a system test since the old code 
path will be removed from the code base)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14309) Kafka Streams upgrade tests do not cover for FK-joins

2022-10-17 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14309:
--

 Summary: Kafka Streams upgrade tests do not cover for FK-joins
 Key: KAFKA-14309
 URL: https://issues.apache.org/jira/browse/KAFKA-14309
 Project: Kafka
  Issue Type: Bug
Reporter: Lucas Brutschy


The current streams upgrade system test for FK joins inserts the production of 
foreign key data and an actual foreign key join in every version of 
SmokeTestDriver except for the latest. The effect is that FK join upgrades are 
not tested at all, since no FK join code is executed after the bounce in the 
system test.

We should enable the FK-join code in the system test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14299) Benchmark and stabilize state updater

2022-10-13 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14299:
--

 Summary: Benchmark and stabilize state updater
 Key: KAFKA-14299
 URL: https://issues.apache.org/jira/browse/KAFKA-14299
 Project: Kafka
  Issue Type: Task
Reporter: Lucas Brutschy


We need to benchmark and stabilize the separate state restoration code path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14278) Convert INVALID_PRODUCER_EPOCH into PRODUCER_FENCED TxnOffsetCommit

2022-10-04 Thread Lucas Brutschy (Jira)
Lucas Brutschy created KAFKA-14278:
--

 Summary: Convert INVALID_PRODUCER_EPOCH into PRODUCER_FENCED 
TxnOffsetCommit
 Key: KAFKA-14278
 URL: https://issues.apache.org/jira/browse/KAFKA-14278
 Project: Kafka
  Issue Type: Sub-task
Reporter: Lucas Brutschy






--
This message was sent by Atlassian Jira
(v8.20.10#820010)