[jira] [Resolved] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close
[ https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-16217. Fix Version/s: (was: 3.6.3) Resolution: Fixed > Transactional producer stuck in IllegalStateException during close > -- > > Key: KAFKA-16217 > URL: https://issues.apache.org/jira/browse/KAFKA-16217 > Project: Kafka > Issue Type: Bug > Components: clients, producer >Affects Versions: 3.7.0, 3.6.1 >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > Labels: transactions > Fix For: 3.8.0, 3.7.1 > > > The producer is stuck during the close. It keeps retrying to abort the > transaction but it never succeeds. > {code:java} > [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | > producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > org.apache.kafka.clients.producer.internals.Sender run - [Producer > clientId=producer-transaction-ben > ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, > transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > Error in kafka producer I/O thread while aborting transaction: > java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` > because the previous call to `commitTransaction` timed out and must be retried > at > org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138) > at > org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323) > at > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274) > at java.base/java.lang.Thread.run(Thread.java:1583) > at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) > {code} > With the additional log, I found the root cause. If the producer is in a bad > transaction state(in my case, the TransactionManager.pendingTransition was > set to commitTransaction and did not get cleaned), then the producer calls > close and tries to abort the existing transaction, the producer will get > stuck in the transaction abortion. It is related to the fix > [https://github.com/apache/kafka/pull/13591]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15585) DescribeTopic API
[ https://issues.apache.org/jira/browse/KAFKA-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-15585. Resolution: Fixed > DescribeTopic API > - > > Key: KAFKA-15585 > URL: https://issues.apache.org/jira/browse/KAFKA-15585 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > Adding the new DescribeTopic API + the admin client and server-side handling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16540) Update partitions when the min isr config is updated.
Calvin Liu created KAFKA-16540: -- Summary: Update partitions when the min isr config is updated. Key: KAFKA-16540 URL: https://issues.apache.org/jira/browse/KAFKA-16540 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu If the min isr config is changed, we need to update the partitions accordingly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15586) Clean shutdown detection, server side
[ https://issues.apache.org/jira/browse/KAFKA-15586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-15586. Resolution: Fixed > Clean shutdown detection, server side > - > > Key: KAFKA-15586 > URL: https://issues.apache.org/jira/browse/KAFKA-15586 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > Upon the broker registration, if the broker has an unclean shutdown, it > should be removed from all the ELRs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16479) Add pagination supported describeTopic interface
Calvin Liu created KAFKA-16479: -- Summary: Add pagination supported describeTopic interface Key: KAFKA-16479 URL: https://issues.apache.org/jira/browse/KAFKA-16479 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu During the DescribeTopicPartitions API implementations, we found it awkward to place the pagination logic within the current admin client describe topic interface. So, in order to change the interface, we may need to have a boarder discussion like creating a KIP. Or even a step forward, to discuss a general client side pagination framework. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15583) High watermark can only advance if ISR size is larger than min ISR
[ https://issues.apache.org/jira/browse/KAFKA-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-15583. Resolution: Fixed > High watermark can only advance if ISR size is larger than min ISR > -- > > Key: KAFKA-15583 > URL: https://issues.apache.org/jira/browse/KAFKA-15583 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > This is the new high watermark advancement requirement. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16250) Consumer group coordinator should perform sanity check on the offset commits.
Calvin Liu created KAFKA-16250: -- Summary: Consumer group coordinator should perform sanity check on the offset commits. Key: KAFKA-16250 URL: https://issues.apache.org/jira/browse/KAFKA-16250 Project: Kafka Issue Type: Improvement Reporter: Calvin Liu The current coordinator does not validate the offset commits before persisting it in the record. In a real case, though, I am not sure why the consumer generates the offset commits with a consumer offset valued at -2, the "illegal" consumer offset value caused confusion with the admin cli when describing the consumer group. The consumer offset field is marked "-". -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close
Calvin Liu created KAFKA-16217: -- Summary: Transactional producer stuck in IllegalStateException during close Key: KAFKA-16217 URL: https://issues.apache.org/jira/browse/KAFKA-16217 Project: Kafka Issue Type: Bug Components: clients Reporter: Calvin Liu The producer is stuck during the close. It keeps retrying to abort the transaction but it never succeeds. {code:java} [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] org.apache.kafka.clients.producer.internals.Sender run - [Producer clientId=producer-transaction-ben ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] Error in kafka producer I/O thread while aborting transaction: java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` because the previous call to `commitTransaction` timed out and must be retried at org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138) at org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274) at java.base/java.lang.Thread.run(Thread.java:1583) at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) {code} With the additional log, I found the root cause. If the producer is in a bad transaction state(in my case, the TransactionManager.pendingTransition was set to commitTransaction and did not get cleaned), before the producer calls close and tries to abort the existing transaction, the producer will get stuck in the transaction abortion. It is related to the fix [https://github.com/apache/kafka/pull/13591]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15873) Improve the performance of the DescribeTopicPartitions API
Calvin Liu created KAFKA-15873: -- Summary: Improve the performance of the DescribeTopicPartitions API Key: KAFKA-15873 URL: https://issues.apache.org/jira/browse/KAFKA-15873 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu The current API involves sorting, copying, checking topics which will be out of the response limit. We should think about how to improve the performance for this API as it will be a main API for querying partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15820) Add a metric to track the number of partitions under min ISR
Calvin Liu created KAFKA-15820: -- Summary: Add a metric to track the number of partitions under min ISR Key: KAFKA-15820 URL: https://issues.apache.org/jira/browse/KAFKA-15820 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15584) ELR leader election
[ https://issues.apache.org/jira/browse/KAFKA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-15584. Resolution: Fixed > ELR leader election > --- > > Key: KAFKA-15584 > URL: https://issues.apache.org/jira/browse/KAFKA-15584 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > With the ELR, here are the changes related to the leader election: > * ISR is allowed to be empty. > * ELR can be elected when ISR is empty > * When ISR and ELR are both empty, the lastKnownLeader can be uncleanly > elected. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15762) ClusterConnectionStatesTest.testSingleIP is flaky
Calvin Liu created KAFKA-15762: -- Summary: ClusterConnectionStatesTest.testSingleIP is flaky Key: KAFKA-15762 URL: https://issues.apache.org/jira/browse/KAFKA-15762 Project: Kafka Issue Type: Bug Components: unit tests Reporter: Calvin Liu Build / JDK 11 and Scala 2.13 / testSingleIP() – org.apache.kafka.clients.ClusterConnectionStatesTest {code:java} org.opentest4j.AssertionFailedError: expected: <1> but was: <2> at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197) at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150) at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145) at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527) at app//org.apache.kafka.clients.ClusterConnectionStatesTest.testSingleIP(ClusterConnectionStatesTest.java:267) at java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base@11.0.16.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base@11.0.16.1/java.lang.reflect.Method.invoke(Method.java:566) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15761) ConnectorRestartApiIntegrationTest.testMultiWorkerRestartOnlyConnector is flaky
Calvin Liu created KAFKA-15761: -- Summary: ConnectorRestartApiIntegrationTest.testMultiWorkerRestartOnlyConnector is flaky Key: KAFKA-15761 URL: https://issues.apache.org/jira/browse/KAFKA-15761 Project: Kafka Issue Type: Bug Components: unit tests Reporter: Calvin Liu Build / JDK 21 and Scala 2.13 / testMultiWorkerRestartOnlyConnector – org.apache.kafka.connect.integration.ConnectorRestartApiIntegrationTest {code:java} java.lang.AssertionError: Failed to stop connector and tasks within 12ms at org.junit.Assert.fail(Assert.java:89)at org.junit.Assert.assertTrue(Assert.java:42) at org.apache.kafka.connect.integration.ConnectorRestartApiIntegrationTest.runningConnectorAndTasksRestart(ConnectorRestartApiIntegrationTest.java:273) at org.apache.kafka.connect.integration.ConnectorRestartApiIntegrationTest.testMultiWorkerRestartOnlyConnector(ConnectorRestartApiIntegrationTest.java:231) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15760) org.apache.kafka.trogdor.coordinator.CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated is flaky
Calvin Liu created KAFKA-15760: -- Summary: org.apache.kafka.trogdor.coordinator.CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated is flaky Key: KAFKA-15760 URL: https://issues.apache.org/jira/browse/KAFKA-15760 Project: Kafka Issue Type: Bug Components: unit tests Reporter: Calvin Liu Build / JDK 17 and Scala 2.13 / testTaskRequestWithOldStartMsGetsUpdated() – org.apache.kafka.trogdor.coordinator.CoordinatorTest {code:java} java.util.concurrent.TimeoutException: testTaskRequestWithOldStartMsGetsUpdated() timed out after 12 milliseconds at org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29) at org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58) at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156) at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147) at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86) at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103) at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93) at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92) at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15759) DescribeClusterRequestTest is flaky
Calvin Liu created KAFKA-15759: -- Summary: DescribeClusterRequestTest is flaky Key: KAFKA-15759 URL: https://issues.apache.org/jira/browse/KAFKA-15759 Project: Kafka Issue Type: Bug Components: unit tests Reporter: Calvin Liu testDescribeClusterRequestIncludingClusterAuthorizedOperations(String).quorum=kraft – kafka.server.DescribeClusterRequestTest {code:java} org.opentest4j.AssertionFailedError: expected: but was: at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197) at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182) at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177) at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141) at kafka.server.DescribeClusterRequestTest.$anonfun$testDescribeClusterRequest$4(DescribeClusterRequestTest.scala:99) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) at kafka.server.DescribeClusterRequestTest.testDescribeClusterRequest(DescribeClusterRequestTest.scala:86) at kafka.server.DescribeClusterRequestTest.testDescribeClusterRequestIncludingClusterAuthorizedOperations(DescribeClusterRequestTest.scala:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15700) FetchFromFollowerIntegrationTest is flaky
Calvin Liu created KAFKA-15700: -- Summary: FetchFromFollowerIntegrationTest is flaky Key: KAFKA-15700 URL: https://issues.apache.org/jira/browse/KAFKA-15700 Project: Kafka Issue Type: Bug Reporter: Calvin Liu It may relate to inappropriate timeout. testRackAwareRangeAssignor(String).quorum=zk {code:java} java.util.concurrent.TimeoutException at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204) at integration.kafka.server.FetchFromFollowerIntegrationTest.$anonfun$testRackAwareRangeAssignor$13(FetchFromFollowerIntegrationTest.scala:229) at integration.kafka.server.FetchFromFollowerIntegrationTest.$anonfun$testRackAwareRangeAssignor$13$adapted(FetchFromFollowerIntegrationTest.scala:228) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15699) MirrorConnectorsIntegrationBaseTest is flaky
Calvin Liu created KAFKA-15699: -- Summary: MirrorConnectorsIntegrationBaseTest is flaky Key: KAFKA-15699 URL: https://issues.apache.org/jira/browse/KAFKA-15699 Project: Kafka Issue Type: Bug Reporter: Calvin Liu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15690) Flaky integration tests
Calvin Liu created KAFKA-15690: -- Summary: Flaky integration tests Key: KAFKA-15690 URL: https://issues.apache.org/jira/browse/KAFKA-15690 Project: Kafka Issue Type: Bug Reporter: Calvin Liu Finding the following integration tests flaky. EosIntegrationTest { * shouldNotViolateEosIfOneTaskGetsFencedUsingIsolatedAppInstances[exactly_once_v2, processing threads = false] * shouldWriteLatestOffsetsToCheckpointOnShutdown[exactly_once_v2, processing threads = false] * shouldWriteLatestOffsetsToCheckpointOnShutdown[exactly_once, processing threads = false] } MirrorConnectorsIntegrationBaseTest { * testReplicateSourceDefault() * testOffsetSyncsTopicsOnTarget() } FetchFromFollowerIntegrationTest { * testRackAwareRangeAssignor(String).quorum=zk } They are running long and may relate to timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15665) Enforce min ISR when complete partition reassignment
Calvin Liu created KAFKA-15665: -- Summary: Enforce min ISR when complete partition reassignment Key: KAFKA-15665 URL: https://issues.apache.org/jira/browse/KAFKA-15665 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu Current partition reassignment can be completed when the new ISR is under min ISR. We should fix this behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15581) Introduce ELR
[ https://issues.apache.org/jira/browse/KAFKA-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-15581. Reviewer: David Arthur Resolution: Fixed > Introduce ELR > - > > Key: KAFKA-15581 > URL: https://issues.apache.org/jira/browse/KAFKA-15581 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > Introduce the PartitionRecord, PartitionChangeRecord and the basic ELR > handling in the controller -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15590) Replica.updateFetchState should also fence updates with stale leader epoch
Calvin Liu created KAFKA-15590: -- Summary: Replica.updateFetchState should also fence updates with stale leader epoch Key: KAFKA-15590 URL: https://issues.apache.org/jira/browse/KAFKA-15590 Project: Kafka Issue Type: Bug Reporter: Calvin Liu This is a follow-up ticket for KAFKA-15221. There is another type of race that a fetch request with stale leader epoch can update the fetch state. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15586) Clean shutdown detection, server side
Calvin Liu created KAFKA-15586: -- Summary: Clean shutdown detection, server side Key: KAFKA-15586 URL: https://issues.apache.org/jira/browse/KAFKA-15586 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu Upon the broker registration, if the broker has an unclean shutdown, it should be removed from all the ELRs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15585) DescribeTopic API
Calvin Liu created KAFKA-15585: -- Summary: DescribeTopic API Key: KAFKA-15585 URL: https://issues.apache.org/jira/browse/KAFKA-15585 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu Adding the new DescribeTopic API + the admin client and server-side handling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15584) ELR leader election
Calvin Liu created KAFKA-15584: -- Summary: ELR leader election Key: KAFKA-15584 URL: https://issues.apache.org/jira/browse/KAFKA-15584 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu With the ELR, here are the changes related to the leader election: * ISR is allowed to be empty. * ELR can be elected when ISR is empty * When ISR and ELR are both empty, the lastKnownLeader can be uncleanly elected. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15583) High watermark can only advance if ISR size is larger than min ISR
Calvin Liu created KAFKA-15583: -- Summary: High watermark can only advance if ISR size is larger than min ISR Key: KAFKA-15583 URL: https://issues.apache.org/jira/browse/KAFKA-15583 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu This is the new high watermark advancement requirement. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15582) Clean shutdown detection, broker side
Calvin Liu created KAFKA-15582: -- Summary: Clean shutdown detection, broker side Key: KAFKA-15582 URL: https://issues.apache.org/jira/browse/KAFKA-15582 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu The clean shutdown file can now include the broker epoch before shutdown. During the broker start process, the broker should extract the broker epochs from the clean shutdown files. If successful, send the broker epoch through the broker registration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15581) Introduce ELR
Calvin Liu created KAFKA-15581: -- Summary: Introduce ELR Key: KAFKA-15581 URL: https://issues.apache.org/jira/browse/KAFKA-15581 Project: Kafka Issue Type: Sub-task Reporter: Calvin Liu Assignee: Calvin Liu Introduce the PartitionRecord, PartitionChangeRecord and the basic ELR handling in the controller -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15580) KIP-966: Unclean Recovery
Calvin Liu created KAFKA-15580: -- Summary: KIP-966: Unclean Recovery Key: KAFKA-15580 URL: https://issues.apache.org/jira/browse/KAFKA-15580 Project: Kafka Issue Type: New Feature Reporter: Calvin Liu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15579) KIP-966: Eligible Leader Replicas
Calvin Liu created KAFKA-15579: -- Summary: KIP-966: Eligible Leader Replicas Key: KAFKA-15579 URL: https://issues.apache.org/jira/browse/KAFKA-15579 Project: Kafka Issue Type: New Feature Reporter: Calvin Liu Assignee: Calvin Liu -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15332) Eligible Leader Replicas
Calvin Liu created KAFKA-15332: -- Summary: Eligible Leader Replicas Key: KAFKA-15332 URL: https://issues.apache.org/jira/browse/KAFKA-15332 Project: Kafka Issue Type: New Feature Reporter: Calvin Liu Assignee: Calvin Liu A root ticket for the KIP-966 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15221) Potential race condition between requests from rebooted followers
Calvin Liu created KAFKA-15221: -- Summary: Potential race condition between requests from rebooted followers Key: KAFKA-15221 URL: https://issues.apache.org/jira/browse/KAFKA-15221 Project: Kafka Issue Type: Bug Affects Versions: 3.5.0 Reporter: Calvin Liu Assignee: Calvin Liu Fix For: 3.6.0, 3.5.1 When the leader processes the fetch request, it does not acquire locks when updating the replica fetch state. Then there can be a race between the fetch requests from a rebooted follower. T0, broker 1 sends a fetch to broker 0(leader). At the moment, broker 1 is not in ISR. T1, broker 1 crashes. T2 broker 1 is back online and receives a new broker epoch. Also, it sends a new Fetch request. T3 broker 0 receives the old fetch requests and decides to expand the ISR. T4 Right before broker 0 starts to fill the AlterPartitoin request, the new fetch request comes in and overwrites the fetch state. Then broker 0 uses the new broker epoch on the AlterPartition request. In this way, the AlterPartition request can get around KIP-903 and wrongly update the ISR. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR
[ https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Calvin Liu resolved KAFKA-14139. Resolution: Fixed > Replaced disk can lead to loss of committed data even with non-empty ISR > > > Key: KAFKA-14139 > URL: https://issues.apache.org/jira/browse/KAFKA-14139 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Calvin Liu >Priority: Major > Fix For: 3.5.0 > > > We have been thinking about disk failure cases recently. Suppose that a disk > has failed and the user needs to restart the disk from an empty state. The > concern is whether this can lead to the unnecessary loss of committed data. > For normal topic partitions, removal from the ISR during controlled shutdown > buys us some protection. After the replica is restarted, it must prove its > state to the leader before it can be added back to the ISR. And it cannot > become a leader until it does so. > An obvious exception to this is when the replica is the last member in the > ISR. In this case, the disk failure itself has compromised the committed > data, so some amount of loss must be expected. > We have been considering other scenarios in which the loss of one disk can > lead to data loss even when there are replicas remaining which have all of > the committed entries. One such scenario is this: > Suppose we have a partition with two replicas: A and B. Initially A is the > leader and it is the only member of the ISR. > # Broker B catches up to A, so A attempts to send an AlterPartition request > to the controller to add B into the ISR. > # Before the AlterPartition request is received, replica B has a hard > failure. > # The current controller successfully fences broker B. It takes no action on > this partition since B is already out of the ISR. > # Before the controller receives the AlterPartition request to add B, it > also fails. > # While the new controller is initializing, suppose that replica B finishes > startup, but the disk has been replaced (all of the previous state has been > lost). > # The new controller sees the registration from broker B first. > # Finally, the AlterPartition from A arrives which adds B back into the ISR > even though it has an empty log. > (Credit for coming up with this scenario goes to [~junrao] .) > I tested this in KRaft and confirmed that this sequence is possible (even if > perhaps unlikely). There are a few ways we could have potentially detected > the issue. First, perhaps the leader should have bumped the leader epoch on > all partitions when B was fenced. Then the inflight AlterPartition would be > doomed no matter when it arrived. > Alternatively, we could have relied on the broker epoch to distinguish the > dead broker's state from that of the restarted broker. This could be done by > including the broker epoch in both the `Fetch` request and in > `AlterPartition`. > Finally, perhaps even normal kafka replication should be using a unique > identifier for each disk so that we can reliably detect when it has changed. > For example, something like what was proposed for the metadata quorum here: > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Voter+Changes]. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14617) Replicas with stale broker epoch should not be allowed to join the ISR
Calvin Liu created KAFKA-14617: -- Summary: Replicas with stale broker epoch should not be allowed to join the ISR Key: KAFKA-14617 URL: https://issues.apache.org/jira/browse/KAFKA-14617 Project: Kafka Issue Type: Improvement Reporter: Calvin Liu -- This message was sent by Atlassian Jira (v8.20.10#820010)