[jira] [Resolved] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close

2024-04-29 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-16217.

Fix Version/s: (was: 3.6.3)
   Resolution: Fixed

> Transactional producer stuck in IllegalStateException during close
> --
>
> Key: KAFKA-16217
> URL: https://issues.apache.org/jira/browse/KAFKA-16217
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>  Labels: transactions
> Fix For: 3.8.0, 3.7.1
>
>
> The producer is stuck during the close. It keeps retrying to abort the 
> transaction but it never succeeds. 
> {code:java}
> [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] org.apache.kafka.clients.producer.internals.Sender run - [Producer clientId=producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg, transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] Error in kafka producer I/O thread while aborting transaction:
> java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` because the previous call to `commitTransaction` timed out and must be retried
>     at org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138)
>     at org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323)
>     at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
>     at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66)
> {code}
> With additional logging, I found the root cause. If the producer is in a bad 
> transaction state (in my case, TransactionManager.pendingTransition was set 
> to commitTransaction and was never cleared) and the producer then calls 
> close and tries to abort the existing transaction, the producer gets stuck 
> retrying the abort. This is related to the fix in 
> [https://github.com/apache/kafka/pull/13591].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15585) DescribeTopic API

2024-04-19 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-15585.

Resolution: Fixed

> DescribeTopic API
> -
>
> Key: KAFKA-15585
> URL: https://issues.apache.org/jira/browse/KAFKA-15585
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> Adding the new DescribeTopic API + the admin client and server-side handling.





[jira] [Created] (KAFKA-16540) Update partitions when the min isr config is updated.

2024-04-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-16540:
--

 Summary: Update partitions when the min isr config is updated.
 Key: KAFKA-16540
 URL: https://issues.apache.org/jira/browse/KAFKA-16540
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


If the min isr config is changed, we need to update the partitions accordingly. 





[jira] [Resolved] (KAFKA-15586) Clean shutdown detection, server side

2024-04-11 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-15586.

Resolution: Fixed

> Clean shutdown detection, server side
> -
>
> Key: KAFKA-15586
> URL: https://issues.apache.org/jira/browse/KAFKA-15586
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> Upon the broker registration, if the broker has an unclean shutdown, it 
> should be removed from all the ELRs.





[jira] [Created] (KAFKA-16479) Add pagination supported describeTopic interface

2024-04-05 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-16479:
--

 Summary: Add pagination supported  describeTopic interface
 Key: KAFKA-16479
 URL: https://issues.apache.org/jira/browse/KAFKA-16479
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu


While implementing the DescribeTopicPartitions API, we found it awkward to 
place the pagination logic within the current admin client describe topic 
interface. Changing the interface may need a broader discussion, such as a 
KIP, or, going a step further, a discussion of a general client-side 
pagination framework. 





[jira] [Resolved] (KAFKA-15583) High watermark can only advance if ISR size is larger than min ISR

2024-04-05 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-15583.

Resolution: Fixed

> High watermark can only advance if ISR size is larger than min ISR
> --
>
> Key: KAFKA-15583
> URL: https://issues.apache.org/jira/browse/KAFKA-15583
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> This is the new high watermark advancement requirement.





[jira] [Created] (KAFKA-16250) Consumer group coordinator should perform sanity check on the offset commits.

2024-02-13 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-16250:
--

 Summary: Consumer group coordinator should perform sanity check on 
the offset commits.
 Key: KAFKA-16250
 URL: https://issues.apache.org/jira/browse/KAFKA-16250
 Project: Kafka
  Issue Type: Improvement
Reporter: Calvin Liu


The current coordinator does not validate offset commits before persisting 
them in the record.

In a real case, a consumer generated offset commits with a consumer offset of 
-2 (I am not sure why). This "illegal" consumer offset value caused confusion 
in the admin CLI when describing the consumer group: the consumer offset field 
is shown as "-".
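
The kind of check proposed here could look like the following minimal Python 
sketch. It is not the coordinator's code: `validate_offset_commit`, 
`commit_offsets`, and the rule that a committed offset must be non-negative 
are assumptions made for illustration.

```python
def validate_offset_commit(topic: str, partition: int, offset: int) -> None:
    """Reject a commit whose offset cannot be a real log position (assumed rule)."""
    if offset < 0:
        raise ValueError(
            f"invalid committed offset {offset} for {topic}-{partition}"
        )


def commit_offsets(store: dict, commits: dict) -> dict:
    """Persist valid commits into `store`; return the rejected ones with reasons."""
    rejected = {}
    for (topic, partition), offset in commits.items():
        try:
            validate_offset_commit(topic, partition, offset)
        except ValueError as err:
            # Hypothetical handling: report the bad entry instead of storing it.
            rejected[(topic, partition)] = str(err)
            continue
        store[(topic, partition)] = offset
    return rejected
```

With a check like this, a commit carrying -2 would be reported back to the 
client rather than stored and later surfaced by the CLI.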

 

 





[jira] [Created] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close

2024-02-01 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-16217:
--

 Summary: Transactional producer stuck in IllegalStateException 
during close
 Key: KAFKA-16217
 URL: https://issues.apache.org/jira/browse/KAFKA-16217
 Project: Kafka
  Issue Type: Bug
  Components: clients
Reporter: Calvin Liu


The producer is stuck during the close. It keeps retrying to abort the 
transaction but it never succeeds. 
{code:java}
[ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] org.apache.kafka.clients.producer.internals.Sender run - [Producer clientId=producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg, transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] Error in kafka producer I/O thread while aborting transaction:
java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` because the previous call to `commitTransaction` timed out and must be retried
    at org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138)
    at org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274)
    at java.base/java.lang.Thread.run(Thread.java:1583)
    at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66)
{code}
With additional logging, I found the root cause. If the producer is in a bad 
transaction state (in my case, TransactionManager.pendingTransition was set 
to commitTransaction and was never cleared) when it calls close and tries to 
abort the existing transaction, the producer gets stuck retrying the abort. 
This is related to the fix in 
[https://github.com/apache/kafka/pull/13591].
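
The failure mode can be modelled in a few lines of Python. This is a toy state 
machine, not the Kafka client: `ToyTransactionManager`, its 
`pending_transition` field, and the bounded retry loop in `close` are 
hypothetical stand-ins for the pendingTransition bookkeeping described above.

```python
class ToyTransactionManager:
    """Toy model: remembers an operation that timed out and must be retried."""

    def __init__(self):
        self.pending_transition = None  # name of the timed-out operation, if any

    def begin(self, operation: str, timed_out: bool = False) -> None:
        if self.pending_transition not in (None, operation):
            raise RuntimeError(
                f"Cannot attempt operation `{operation}` because the previous "
                f"call to `{self.pending_transition}` timed out and must be retried"
            )
        # A timed-out call stays pending; a completed call clears the slot.
        self.pending_transition = operation if timed_out else None


def close(manager: ToyTransactionManager, max_attempts: int = 3) -> int:
    """Model of close(): retry the abort, return the number of failed attempts."""
    failures = 0
    for _ in range(max_attempts):
        try:
            manager.begin("abortTransaction")
            break
        except RuntimeError:
            failures += 1  # nothing clears pending_transition, so every retry fails
    return failures
```

In the toy model the retry count is capped; without such a cap, or without 
clearing the pending operation on close, the loop would never terminate, which 
matches the symptom in the log.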
 





[jira] [Created] (KAFKA-15873) Improve the performance of the DescribeTopicPartitions API

2023-11-21 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15873:
--

 Summary: Improve the performance of the DescribeTopicPartitions API
 Key: KAFKA-15873
 URL: https://issues.apache.org/jira/browse/KAFKA-15873
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu


The current API involves sorting, copying, and checking which topics fall 
outside the response limit. We should think about how to improve the 
performance of this API, as it will be a main API for querying partitions. 





[jira] [Created] (KAFKA-15820) Add a metric to track the number of partitions under min ISR

2023-11-13 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15820:
--

 Summary: Add a metric to track the number of partitions under min 
ISR
 Key: KAFKA-15820
 URL: https://issues.apache.org/jira/browse/KAFKA-15820
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu








[jira] [Resolved] (KAFKA-15584) ELR leader election

2023-11-06 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-15584.

Resolution: Fixed

> ELR leader election
> ---
>
> Key: KAFKA-15584
> URL: https://issues.apache.org/jira/browse/KAFKA-15584
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> With the ELR, here are the changes related to the leader election:
>  * ISR is allowed to be empty.
>  * ELR can be elected when ISR is empty
>  * When ISR and ELR are both empty, the lastKnownLeader can be uncleanly 
> elected.





[jira] [Created] (KAFKA-15762) ClusterConnectionStatesTest.testSingleIP is flaky

2023-10-30 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15762:
--

 Summary: ClusterConnectionStatesTest.testSingleIP is flaky
 Key: KAFKA-15762
 URL: https://issues.apache.org/jira/browse/KAFKA-15762
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Calvin Liu


Build / JDK 11 and Scala 2.13 / testSingleIP() – 
org.apache.kafka.clients.ClusterConnectionStatesTest
{code:java}
org.opentest4j.AssertionFailedError: expected: <1> but was: <2>
    at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at app//org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
    at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
    at app//org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
    at app//org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527)
    at app//org.apache.kafka.clients.ClusterConnectionStatesTest.testSingleIP(ClusterConnectionStatesTest.java:267)
    at java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base@11.0.16.1/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base@11.0.16.1/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base@11.0.16.1/java.lang.reflect.Method.invoke(Method.java:566)
{code}





[jira] [Created] (KAFKA-15761) ConnectorRestartApiIntegrationTest.testMultiWorkerRestartOnlyConnector is flaky

2023-10-30 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15761:
--

 Summary: 
ConnectorRestartApiIntegrationTest.testMultiWorkerRestartOnlyConnector is flaky
 Key: KAFKA-15761
 URL: https://issues.apache.org/jira/browse/KAFKA-15761
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Calvin Liu


Build / JDK 21 and Scala 2.13 / testMultiWorkerRestartOnlyConnector – 
org.apache.kafka.connect.integration.ConnectorRestartApiIntegrationTest
{code:java}
java.lang.AssertionError: Failed to stop connector and tasks within 12ms
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.apache.kafka.connect.integration.ConnectorRestartApiIntegrationTest.runningConnectorAndTasksRestart(ConnectorRestartApiIntegrationTest.java:273)
    at org.apache.kafka.connect.integration.ConnectorRestartApiIntegrationTest.testMultiWorkerRestartOnlyConnector(ConnectorRestartApiIntegrationTest.java:231)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
 {code}





[jira] [Created] (KAFKA-15760) org.apache.kafka.trogdor.coordinator.CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated is flaky

2023-10-30 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15760:
--

 Summary: 
org.apache.kafka.trogdor.coordinator.CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated
 is flaky
 Key: KAFKA-15760
 URL: https://issues.apache.org/jira/browse/KAFKA-15760
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Calvin Liu


Build / JDK 17 and Scala 2.13 / testTaskRequestWithOldStartMsGetsUpdated() – 
org.apache.kafka.trogdor.coordinator.CoordinatorTest
{code:java}
java.util.concurrent.TimeoutException: testTaskRequestWithOldStartMsGetsUpdated() timed out after 12 milliseconds
    at org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29)
    at org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58)
    at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
 {code}





[jira] [Created] (KAFKA-15759) DescribeClusterRequestTest is flaky

2023-10-30 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15759:
--

 Summary: DescribeClusterRequestTest is flaky
 Key: KAFKA-15759
 URL: https://issues.apache.org/jira/browse/KAFKA-15759
 Project: Kafka
  Issue Type: Bug
  Components: unit tests
Reporter: Calvin Liu


testDescribeClusterRequestIncludingClusterAuthorizedOperations(String).quorum=kraft
 – kafka.server.DescribeClusterRequestTest
{code:java}
org.opentest4j.AssertionFailedError: expected: but was:
    at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
    at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
    at kafka.server.DescribeClusterRequestTest.$anonfun$testDescribeClusterRequest$4(DescribeClusterRequestTest.scala:99)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at kafka.server.DescribeClusterRequestTest.testDescribeClusterRequest(DescribeClusterRequestTest.scala:86)
    at kafka.server.DescribeClusterRequestTest.testDescribeClusterRequestIncludingClusterAuthorizedOperations(DescribeClusterRequestTest.scala:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 {code}





[jira] [Created] (KAFKA-15700) FetchFromFollowerIntegrationTest is flaky

2023-10-26 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15700:
--

 Summary: FetchFromFollowerIntegrationTest is flaky
 Key: KAFKA-15700
 URL: https://issues.apache.org/jira/browse/KAFKA-15700
 Project: Kafka
  Issue Type: Bug
Reporter: Calvin Liu


It may be related to an inappropriate timeout.

testRackAwareRangeAssignor(String).quorum=zk
{code:java}
java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at integration.kafka.server.FetchFromFollowerIntegrationTest.$anonfun$testRackAwareRangeAssignor$13(FetchFromFollowerIntegrationTest.scala:229)
    at integration.kafka.server.FetchFromFollowerIntegrationTest.$anonfun$testRackAwareRangeAssignor$13$adapted(FetchFromFollowerIntegrationTest.scala:228)
    at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
    at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
{code}





[jira] [Created] (KAFKA-15699) MirrorConnectorsIntegrationBaseTest is flaky

2023-10-26 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15699:
--

 Summary: MirrorConnectorsIntegrationBaseTest is flaky
 Key: KAFKA-15699
 URL: https://issues.apache.org/jira/browse/KAFKA-15699
 Project: Kafka
  Issue Type: Bug
Reporter: Calvin Liu








[jira] [Created] (KAFKA-15690) Flaky integration tests

2023-10-26 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15690:
--

 Summary: Flaky integration tests
 Key: KAFKA-15690
 URL: https://issues.apache.org/jira/browse/KAFKA-15690
 Project: Kafka
  Issue Type: Bug
Reporter: Calvin Liu


The following integration tests are flaky.

EosIntegrationTest {
 * shouldNotViolateEosIfOneTaskGetsFencedUsingIsolatedAppInstances[exactly_once_v2, processing threads = false]
 * shouldWriteLatestOffsetsToCheckpointOnShutdown[exactly_once_v2, processing threads = false]
 * shouldWriteLatestOffsetsToCheckpointOnShutdown[exactly_once, processing threads = false]
}

MirrorConnectorsIntegrationBaseTest {
 * testReplicateSourceDefault()
 * testOffsetSyncsTopicsOnTarget()
}

FetchFromFollowerIntegrationTest {
 * testRackAwareRangeAssignor(String).quorum=zk
}

They run for a long time, so the failures may be related to timeouts.





[jira] [Created] (KAFKA-15665) Enforce min ISR when complete partition reassignment

2023-10-21 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15665:
--

 Summary: Enforce min ISR when complete partition reassignment
 Key: KAFKA-15665
 URL: https://issues.apache.org/jira/browse/KAFKA-15665
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Currently, a partition reassignment can complete even when the new ISR is 
below min ISR. We should fix this behavior.
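
A minimal Python sketch of the intended gate follows. It is an assumption 
about the shape of the check, not the controller's code: 
`can_complete_reassignment` is a hypothetical helper, and it assumes the ISR 
after completion is the current ISR restricted to the target replicas.

```python
def can_complete_reassignment(target_replicas, isr, min_isr: int) -> bool:
    """Return True only if the post-reassignment ISR would satisfy min ISR."""
    # Assumed model: replicas outside the target set leave the ISR on completion.
    new_isr = set(target_replicas) & set(isr)
    return len(new_isr) >= min_isr
```

For example, with target replicas 4, 5, 6 and min ISR 2, an ISR of {1, 4} 
would keep the reassignment pending, while {1, 4, 5} would let it complete.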





[jira] [Resolved] (KAFKA-15581) Introduce ELR

2023-10-19 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-15581.

  Reviewer: David Arthur
Resolution: Fixed

> Introduce ELR
> -
>
> Key: KAFKA-15581
> URL: https://issues.apache.org/jira/browse/KAFKA-15581
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> Introduce the PartitionRecord, PartitionChangeRecord and the basic ELR 
> handling in the controller





[jira] [Created] (KAFKA-15590) Replica.updateFetchState should also fence updates with stale leader epoch

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15590:
--

 Summary: Replica.updateFetchState should also fence updates with 
stale leader epoch
 Key: KAFKA-15590
 URL: https://issues.apache.org/jira/browse/KAFKA-15590
 Project: Kafka
  Issue Type: Bug
Reporter: Calvin Liu


This is a follow-up ticket for KAFKA-15221.

There is another type of race in which a fetch request with a stale leader 
epoch can update the fetch state.





[jira] [Created] (KAFKA-15586) Clean shutdown detection, server side

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15586:
--

 Summary: Clean shutdown detection, server side
 Key: KAFKA-15586
 URL: https://issues.apache.org/jira/browse/KAFKA-15586
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Upon the broker registration, if the broker has an unclean shutdown, it should 
be removed from all the ELRs.





[jira] [Created] (KAFKA-15585) DescribeTopic API

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15585:
--

 Summary: DescribeTopic API
 Key: KAFKA-15585
 URL: https://issues.apache.org/jira/browse/KAFKA-15585
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Adding the new DescribeTopic API + the admin client and server-side handling.





[jira] [Created] (KAFKA-15584) ELR leader election

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15584:
--

 Summary: ELR leader election
 Key: KAFKA-15584
 URL: https://issues.apache.org/jira/browse/KAFKA-15584
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


With the ELR, here are the changes related to the leader election:
 * ISR is allowed to be empty.
 * ELR can be elected when ISR is empty
 * When ISR and ELR are both empty, the lastKnownLeader can be uncleanly 
elected.
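
The three rules above can be sketched as a small Python function. This is an 
illustration only, not the controller implementation: `elect_leader`, the 
`alive` set, and the `allow_unclean` flag are hypothetical names, and picking 
the first alive candidate in list order is an assumption.

```python
def elect_leader(isr, elr, last_known_leader, alive, allow_unclean=False):
    """Pick a leader from the ISR, then the ELR, then the last known leader."""
    # Clean election: prefer the ISR, fall back to the ELR when no ISR member is usable.
    for candidates in (isr, elr):
        for replica in candidates:
            if replica in alive:
                return replica
    # Unclean election: only when ISR and ELR are both empty and it is permitted.
    if not isr and not elr and allow_unclean and last_known_leader in alive:
        return last_known_leader
    return None  # no eligible leader; the partition stays offline
```

An empty ISR is a valid input here, which reflects the first rule: the 
election falls through to the ELR instead of failing.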





[jira] [Created] (KAFKA-15583) High watermark can only advance if ISR size is larger than min ISR

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15583:
--

 Summary: High watermark can only advance if ISR size is larger 
than min ISR
 Key: KAFKA-15583
 URL: https://issues.apache.org/jira/browse/KAFKA-15583
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


This is the new high watermark advancement requirement.





[jira] [Created] (KAFKA-15582) Clean shutdown detection, broker side

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15582:
--

 Summary: Clean shutdown detection, broker side
 Key: KAFKA-15582
 URL: https://issues.apache.org/jira/browse/KAFKA-15582
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


The clean shutdown file can now include the broker epoch before shutdown. 
During the broker start process, the broker should extract the broker epochs 
from the clean shutdown files. If successful, send the broker epoch through the 
broker registration.





[jira] [Created] (KAFKA-15581) Introduce ELR

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15581:
--

 Summary: Introduce ELR
 Key: KAFKA-15581
 URL: https://issues.apache.org/jira/browse/KAFKA-15581
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


Introduce the PartitionRecord, PartitionChangeRecord and the basic ELR handling 
in the controller





[jira] [Created] (KAFKA-15580) KIP-966: Unclean Recovery

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15580:
--

 Summary: KIP-966: Unclean Recovery
 Key: KAFKA-15580
 URL: https://issues.apache.org/jira/browse/KAFKA-15580
 Project: Kafka
  Issue Type: New Feature
Reporter: Calvin Liu








[jira] [Created] (KAFKA-15579) KIP-966: Eligible Leader Replicas

2023-10-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15579:
--

 Summary: KIP-966: Eligible Leader Replicas
 Key: KAFKA-15579
 URL: https://issues.apache.org/jira/browse/KAFKA-15579
 Project: Kafka
  Issue Type: New Feature
Reporter: Calvin Liu
Assignee: Calvin Liu








[jira] [Created] (KAFKA-15332) Eligible Leader Replicas

2023-08-10 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15332:
--

 Summary: Eligible Leader Replicas
 Key: KAFKA-15332
 URL: https://issues.apache.org/jira/browse/KAFKA-15332
 Project: Kafka
  Issue Type: New Feature
Reporter: Calvin Liu
Assignee: Calvin Liu


A root ticket for the KIP-966





[jira] [Created] (KAFKA-15221) Potential race condition between requests from rebooted followers

2023-07-19 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-15221:
--

 Summary: Potential race condition between requests from rebooted 
followers
 Key: KAFKA-15221
 URL: https://issues.apache.org/jira/browse/KAFKA-15221
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Calvin Liu
Assignee: Calvin Liu
 Fix For: 3.6.0, 3.5.1


When the leader processes a fetch request, it does not acquire a lock when 
updating the replica fetch state, so fetch requests from a rebooted follower 
can race with each other.

T0: broker 1 sends a fetch to broker 0 (the leader). At this moment, broker 1 
is not in the ISR.

T1: broker 1 crashes.

T2: broker 1 comes back online and receives a new broker epoch. It also sends 
a new Fetch request.

T3: broker 0 receives the old fetch request and decides to expand the ISR.

T4: right before broker 0 starts to fill in the AlterPartition request, the 
new fetch request comes in and overwrites the fetch state. Broker 0 then uses 
the new broker epoch in the AlterPartition request.

In this way, the AlterPartition request can get around KIP-903 and wrongly 
update the ISR.
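
One possible way to close this window is sketched below in Python. It is not 
necessarily what the fix for this ticket does: the `Replica` and `FetchState` 
classes are hypothetical, and the sketch assumes the leader reads the broker 
epoch and the caught-up state as one snapshot under a lock, and drops fetches 
that carry an older broker epoch than one already seen.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchState:
    broker_epoch: int
    log_end_offset: int


class Replica:
    """Leader-side view of one follower; updates and reads share one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = FetchState(broker_epoch=-1, log_end_offset=0)

    def update_fetch_state(self, broker_epoch: int, log_end_offset: int) -> bool:
        """Apply a fetch unless it carries an older broker epoch than already seen."""
        with self._lock:
            if broker_epoch < self._state.broker_epoch:
                return False  # stale request from before the follower rebooted
            self._state = FetchState(broker_epoch, log_end_offset)
            return True

    def snapshot(self) -> FetchState:
        """Return epoch and offset together so they come from the same fetch."""
        with self._lock:
            return self._state
```

If the ISR-expansion decision and the AlterPartition request are both built 
from one `snapshot()` result, a later fetch can no longer swap in a new broker 
epoch between T3 and T4.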





[jira] [Resolved] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2023-04-13 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-14139.

Resolution: Fixed

> Replaced disk can lead to loss of committed data even with non-empty ISR
> 
>
> Key: KAFKA-14139
> URL: https://issues.apache.org/jira/browse/KAFKA-14139
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Calvin Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> We have been thinking about disk failure cases recently. Suppose that a disk 
> has failed and the user needs to restart the disk from an empty state. The 
> concern is whether this can lead to the unnecessary loss of committed data.
> For normal topic partitions, removal from the ISR during controlled shutdown 
> buys us some protection. After the replica is restarted, it must prove its 
> state to the leader before it can be added back to the ISR. And it cannot 
> become a leader until it does so.
> An obvious exception to this is when the replica is the last member in the 
> ISR. In this case, the disk failure itself has compromised the committed 
> data, so some amount of loss must be expected.
> We have been considering other scenarios in which the loss of one disk can 
> lead to data loss even when there are replicas remaining which have all of 
> the committed entries. One such scenario is this:
> Suppose we have a partition with two replicas: A and B. Initially A is the 
> leader and it is the only member of the ISR.
>  # Broker B catches up to A, so A attempts to send an AlterPartition request 
> to the controller to add B into the ISR.
>  # Before the AlterPartition request is received, replica B has a hard 
> failure.
>  # The current controller successfully fences broker B. It takes no action on 
> this partition since B is already out of the ISR.
>  # Before the controller receives the AlterPartition request to add B, it 
> also fails.
>  # While the new controller is initializing, suppose that replica B finishes 
> startup, but the disk has been replaced (all of the previous state has been 
> lost).
>  # The new controller sees the registration from broker B first.
>  # Finally, the AlterPartition from A arrives which adds B back into the ISR 
> even though it has an empty log.
> (Credit for coming up with this scenario goes to [~junrao] .)
> I tested this in KRaft and confirmed that this sequence is possible (even if 
> perhaps unlikely). There are a few ways we could have potentially detected 
> the issue. First, perhaps the leader should have bumped the leader epoch on 
> all partitions when B was fenced. Then the inflight AlterPartition would be 
> doomed no matter when it arrived.
> Alternatively, we could have relied on the broker epoch to distinguish the 
> dead broker's state from that of the restarted broker. This could be done by 
> including the broker epoch in both the `Fetch` request and in 
> `AlterPartition`.
> Finally, perhaps even normal kafka replication should be using a unique 
> identifier for each disk so that we can reliably detect when it has changed. 
> For example, something like what was proposed for the metadata quorum here: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Voter+Changes].





[jira] [Created] (KAFKA-14617) Replicas with stale broker epoch should not be allowed to join the ISR

2023-01-12 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-14617:
--

 Summary: Replicas with stale broker epoch should not be allowed to 
join the ISR
 Key: KAFKA-14617
 URL: https://issues.apache.org/jira/browse/KAFKA-14617
 Project: Kafka
  Issue Type: Improvement
Reporter: Calvin Liu





