[jira] [Created] (KAFKA-16667) KRaftMigrationDriver gets stuck after successive failovers

2024-05-04 Thread David Arthur (Jira)
David Arthur created KAFKA-16667:


 Summary: KRaftMigrationDriver gets stuck after successive failovers
 Key: KAFKA-16667
 URL: https://issues.apache.org/jira/browse/KAFKA-16667
 Project: Kafka
  Issue Type: Bug
  Components: controller, migration
Reporter: David Arthur


This is a continuation of KAFKA-16171.

It turns out that the active KRaftMigrationDriver can get a stale read from ZK 
after becoming the active controller in ZK (i.e., writing to "/controller").

Because ZooKeeper only guarantees linearizability for writes (reads are not 
quorum operations and may return stale data), it is possible to get a stale read 
on the "/migration" ZNode after writing to "/controller" (and 
"/controller_epoch") when becoming active. 

 

The history looks like this:
 # Node B becomes leader in the Raft layer. KRaftLeaderEvents are enqueued on 
all KRaftMigrationDriver instances.
 # Node A writes some state to ZK, updates "/migration", and checks 
"/controller_epoch" in one transaction. This happens before B claims controller 
leadership in ZK. The "/migration" state is updated from X to Y
 # Node B claims leadership by updating "/controller" and "/controller_epoch". 
Leader B reads "/migration" state X
 # Node A tries to write some state, fails on "/controller_epoch" check op.
 # Node A processes new leader and becomes inactive

 

This does not violate consistency guarantees made by ZooKeeper.

 

> Write operations in ZooKeeper are {_}linearizable{_}. In other words, each 
> {{write}} will appear to take effect atomically at some point between when 
> the client issues the request and receives the corresponding response.

and 

> Read operations in ZooKeeper are _not linearizable_ since they can return 
> potentially stale data. This is because a {{read}} in ZooKeeper is not a 
> quorum operation and a server will respond immediately to a client that is 
> performing a {{{}read{}}}.
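
One general way to avoid this kind of stale read with the plain ZooKeeper client 
is to issue a sync() on the path before reading it, which forces the server the 
client is connected to to catch up with the quorum leader. A minimal sketch 
(illustrative only, not the actual Kafka fix):

{code:java}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LinearizableReadSketch {
    // Force the connected ZK server to catch up with the quorum leader, then read
    // the ZNode. Without the sync(), getData() may return a stale value/zkVersion.
    static byte[] syncAndRead(ZooKeeper zk, String path) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> latch.countDown(), null);
        latch.await();
        Stat stat = new Stat();
        return zk.getData(path, false, stat); // stat.getVersion() is the zkVersion
    }
}
{code}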

 

--- 

 

The impact of this stale read is the same as KAFKA-16171. The 
KRaftMigrationDriver never gets past SYNC_KRAFT_TO_ZK because it has a stale 
zkVersion for the "/migration" ZNode. The result is brokers never learn about 
the new controller and cannot update any partition state.

The workaround for this bug is to re-elect the controller by shutting down the 
active KRaft controller. 

This bug was found during a migration where the KRaft controller was rapidly 
failing over due to an excess of metadata. 





Re: [DISCUSS] KIP-1036: Extend RecordDeserializationException exception

2024-04-18 Thread David Arthur
Hi Fred, thanks for the KIP. Seems like a useful improvement.

As others have mentioned, I think we should avoid exposing Record in this
way.

Using ConsumerRecord seems okay, but maybe not the best fit for this case
(for the reasons Matthias gave).

Maybe we could create a new container interface to hold the partially
deserialized data? This could also indicate to the exception handler
whether the key, the value, or both had deserialization errors.
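
For illustration, such a container could look something like this (purely a
sketch; the names are not part of the KIP):

import java.nio.ByteBuffer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.header.Headers;

// Hypothetical container for a record that failed deserialization.
public interface FailedRecord {
  enum FailedPart { KEY, VALUE, KEY_AND_VALUE }

  FailedPart failedPart();          // which part(s) failed to deserialize
  TopicPartition topicPartition();
  long offset();
  long timestamp();
  ByteBuffer keyBytes();            // raw key bytes, possibly null
  ByteBuffer valueBytes();          // raw value bytes, possibly null
  Headers headers();
}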

Thanks,
David

On Thu, Apr 18, 2024 at 10:16 AM Frédérik Rouleau
 wrote:

> Hi,
>
> But I guess my main question is really about what metadata we really
> > want to add to `RecordDeserializationException`? `Record` expose all
> > kind of internal (serialization) metadata like `keySize()`,
> > `valueSize()` and many more. For the DLQ use-case it seems we don't
> > really want any of these? So I am wondering if just adding
> > key/value/ts/headers would be sufficient?
> >
>
> I think that key/value/ts/headers, topicPartition and offset are all we
> need. I do not see any usage for other metadata. If someone has a use case,
> I would like to know it.
>
> So in that case we can directly add the data into the exception. We can
> keep ByteBuffer for the local field instead of byte[], that will avoid
> memory allocation if users do not require it.
> I wonder if we should return the ByteBuffer or directly the byte[] (or both
> ?) which is more convenient for end users. Any thoughts?
> Then we can have something like:
>
> public RecordDeserializationException(TopicPartition partition,
>  long offset,
>  ByteBuffer key,
>  ByteBuffer value,
>  Header[] headers,
>  long timestamp,
>  String message,
>  Throwable cause);
>
> public TopicPartition topicPartition();
>
> public long offset();
>
> public long timestamp();
>
> public byte[] key(); // Will allocate the array on call
>
> public byte[] value(); // Will allocate the array on call
>
> public Header[] headers();
>
>
>
> Regards,
> Fred
>


-- 
-David


[jira] [Created] (KAFKA-16539) Can't update specific broker configs in pre-migration mode

2024-04-11 Thread David Arthur (Jira)
David Arthur created KAFKA-16539:


 Summary: Can't update specific broker configs in pre-migration mode
 Key: KAFKA-16539
 URL: https://issues.apache.org/jira/browse/KAFKA-16539
 Project: Kafka
  Issue Type: Bug
  Components: config, kraft
Affects Versions: 3.6.2, 3.6.1, 3.7.0, 3.6.0
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.8.0, 3.7.1, 3.6.3


In migration mode, ZK brokers will have a forwarding manager configured. This 
is used to forward requests to the KRaft controller once we get to that part of 
the migration. However, prior to KRaft taking over as the controller (known as 
pre-migration mode), the ZK brokers are still attempting to forward 
IncrementalAlterConfigs to the controller.

This works fine for cluster level configs (e.g., "--entity-type broker 
--entity-default"), but this fails for specific broker configs (e.g., 
"--entity-type broker --entity-id 1").

This affects BROKER and BROKER_LOGGER config types.
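
For illustration, here is the same distinction expressed through the Admin 
client (a sketch; the broker id and config are arbitrary examples):

{code:java}
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerConfigUpdateSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("log.cleaner.threads", "2"), AlterConfigOp.OpType.SET);
            // Cluster-wide default ("--entity-default"): empty resource name. Works in pre-migration mode.
            ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");
            // Specific broker ("--entity-id 1"): hits this bug in pre-migration mode.
            ConfigResource broker1 = new ConfigResource(ConfigResource.Type.BROKER, "1");
            Map<ConfigResource, Collection<AlterConfigOp>> configs = Map.of(
                clusterDefault, List.of(op),
                broker1, List.of(op));
            admin.incrementalAlterConfigs(configs).all().get();
        }
    }
}
{code}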

To work around this bug, you can either disable migrations on the brokers 
(assuming no migration has taken place), or proceed with the migration and get 
to the point where KRaft is the controller.





Re: [DISCUSS] KIP-932: Queues for Kafka

2024-04-04 Thread David Arthur
Andrew, thanks for the KIP! This is a pretty exciting effort.

I've finally made it through the KIP, still trying to grok the whole thing.
Sorry if some of my questions are basic :)


Concepts:

70. Does the Group Coordinator communicate with the Share Coordinator over
RPC or directly in-process?

71. For preventing name collisions with regular consumer groups, could we
define a reserved share group prefix? E.g., the operator defines "sg_" as a
prefix for share groups only, and if a regular consumer group tries to use
that name it fails.

72. When a consumer tries to use a share group, or a share consumer tries
to use a regular group, would INVALID_GROUP_ID make more sense
than INCONSISTENT_GROUP_PROTOCOL?



Share Group Membership:

73. What goes in the Metadata field for TargetAssignment#Member and
Assignment?

74. Under Trigger a rebalance, it says we rebalance when the partition
metadata changes. Would this be for any change, or just certain ones? For
example, if a follower drops out of the ISR and comes back, we probably
don't need to rebalance.

75. "For a share group, the group coordinator does *not* persist the
assignment" Can you explain why this is not needed?

76. " If the consumer just failed to heartbeat due to a temporary pause, it
could in theory continue to fetch and acknowledge records. When it finally
sends a heartbeat and realises it’s been kicked out of the group, it should
stop fetching records because its assignment has been revoked, and rejoin
the group."

A consumer with a long pause might still deliver some buffered records, but
if the share group coordinator has expired its session, it wouldn't accept
acknowledgments for that share consumer. In such a case, is any kind of
error raised to the application like "hey, I know we gave you these
records, but really we shouldn't have" ?


-

Record Delivery and acknowledgement

77. If we guarantee that a ShareCheckpoint is written at least every so
often, could we add a new log compactor that avoids compacting ShareDeltas
that are still "active" (i.e., not yet superseded by a new
ShareCheckpoint)? Mechanically, this could be done by keeping the LSO no
greater than the oldest "active" ShareCheckpoint. This might let us remove
the DeltaIndex thing.

78. Instead of the State in the ShareDelta/Checkpoint records, how about
MessageState? (State is kind of overloaded/ambiguous)

79. One possible limitation with the current persistence model is that all
the share state is stored in one topic. It seems like we are going to be
storing a lot more state than we do in __consumer_offsets since we're
dealing with message-level acks. With aggressive checkpointing and
compaction, we can mitigate the storage requirements, but the throughput
could be a limiting factor. Have we considered other possibilities for
persistence?


Cheers,
David


[jira] [Created] (KAFKA-16468) Listener not found error in SendRPCsToBrokersEvent

2024-04-03 Thread David Arthur (Jira)
David Arthur created KAFKA-16468:


 Summary: Listener not found error in SendRPCsToBrokersEvent
 Key: KAFKA-16468
 URL: https://issues.apache.org/jira/browse/KAFKA-16468
 Project: Kafka
  Issue Type: Bug
  Components: controller, migration
Reporter: David Arthur
 Fix For: 3.8.0


During the ZK to KRaft migration, the controller will send RPCs using the 
configured "control.plane.listener.name" or more commonly, the 
"inter.broker.listener.name". If a ZK broker did not register with this 
listener, we get an error when sending the first RPC to a broker.

{code}
[2024-04-03 09:28:59,043] ERROR Encountered nonFatalFaultHandler fault: 
Unhandled error in SendRPCsToBrokersEvent 
(org.apache.kafka.server.fault.MockFaultHandler:44)
kafka.common.BrokerEndPointNotAvailableException: End point with listener name 
EXTERNAL not found for broker 0
at kafka.cluster.Broker.$anonfun$node$1(Broker.scala:94)
at scala.Option.getOrElse(Option.scala:201)
at kafka.cluster.Broker.node(Broker.scala:93)
at 
kafka.controller.ControllerChannelManager.addNewBroker(ControllerChannelManager.scala:122)
at 
kafka.controller.ControllerChannelManager.addBroker(ControllerChannelManager.scala:105)
at 
kafka.migration.MigrationPropagator.$anonfun$publishMetadata$2(MigrationPropagator.scala:98)
at 
kafka.migration.MigrationPropagator.$anonfun$publishMetadata$2$adapted(MigrationPropagator.scala:98)
at scala.collection.immutable.Set$Set3.foreach(Set.scala:261)
at 
kafka.migration.MigrationPropagator.publishMetadata(MigrationPropagator.scala:98)
at 
kafka.migration.MigrationPropagator.sendRPCsToBrokersFromMetadataImage(MigrationPropagator.scala:219)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SendRPCsToBrokersEvent.run(KRaftMigrationDriver.java:777)
at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:128)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:211)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:182)
at java.base/java.lang.Thread.run(Thread.java:833)
{code}

At this point, the KRaft controller has already migrated the metadata. Recovery 
is possible by restarting the brokers with the correct listener names, but we 
could catch this problem much sooner in the process.

When a ZK broker registers with the KRaft controller, we should reject the 
registration if the expected listener name is not present. This will prevent 
the migration from starting.





[jira] [Created] (KAFKA-16466) QuorumController is swallowing some exception messages

2024-04-02 Thread David Arthur (Jira)
David Arthur created KAFKA-16466:


 Summary: QuorumController is swallowing some exception messages
 Key: KAFKA-16466
 URL: https://issues.apache.org/jira/browse/KAFKA-16466
 Project: Kafka
  Issue Type: Bug
  Components: controller
Affects Versions: 3.7.0
Reporter: David Arthur
 Fix For: 3.8.0, 3.7.1


In some cases in QuorumController, we throw exceptions from the control manager 
methods. Unless these are explicitly caught and handled, they will eventually 
bubble up to the ControllerReadEvent/ControllerWriteEvent and hit the generic 
error handler.

In the generic error handler of QuorumController, we examine the exception to 
determine if it is a fault or not. In the case where it is not a fault, we log 
the error like:
{code:java}
 log.info("{}: {}", name, failureMessage);
{code}
which results in messages like
{code:java}
[2024-04-02 16:08:38,078] INFO [QuorumController id=3000] registerBroker: event 
failed with UnsupportedVersionException in 167 microseconds. 
(org.apache.kafka.controller.QuorumController:544)
{code}
In this case, the exception actually has more details in its own message
{code:java}
Unable to register because the broker does not support version 8 of 
metadata.version. It wants a version between 20 and 20, inclusive.
{code}
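
One way to surface those details in the generic handler is to append the 
exception's own message to the summary line. A minimal sketch, assuming the 
handler has the name, failureMessage, and the caught exception in scope (this is 
not the actual patch):

{code:java}
// Include the underlying exception message (when present) so details like the
// UnsupportedVersionException text above are not lost at INFO level.
String detail = exception.getMessage() == null ? "" : " Exception message: " + exception.getMessage();
log.info("{}: {}{}", name, failureMessage, detail);
{code}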

This was found while writing an integration test for KRaft migration where the 
brokers and controllers have a mismatched MetadataVersion.





[jira] [Created] (KAFKA-16463) Automatically delete metadata log directory on ZK brokers

2024-04-02 Thread David Arthur (Jira)
David Arthur created KAFKA-16463:


 Summary: Automatically delete metadata log directory on ZK brokers
 Key: KAFKA-16463
 URL: https://issues.apache.org/jira/browse/KAFKA-16463
 Project: Kafka
  Issue Type: Improvement
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.8.0


Throughout a ZK to KRaft migration, the operator has the option to revert to ZK 
mode. Once this is done, a copy of the metadata log will remain on each broker 
in the cluster.

In order to re-attempt the migration in the future, this metadata log needs to 
be deleted. This can be pretty burdensome for the operator of a large cluster. 

To improve this, we can automatically delete any metadata log present during 
startup of a ZK broker. This is safe to do because the ZK broker will just 
re-replicate the metadata log from the active controller.
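
For illustration, the startup cleanup could look roughly like the following 
(assuming the metadata log lives in the standard __cluster_metadata-0 directory 
under the configured metadata log dir; this is a sketch, not the actual 
implementation):

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class MetadataLogCleanupSketch {
    // Delete a leftover KRaft metadata log directory during ZK-mode broker startup.
    static void deleteMetadataLogIfPresent(String metadataLogDir) throws IOException {
        Path metadataPartition = Paths.get(metadataLogDir, "__cluster_metadata-0");
        if (!Files.exists(metadataPartition)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(metadataPartition)) {
            // Delete children before parents.
            paths.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
        }
    }
}
{code}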





[jira] [Created] (KAFKA-16446) Log slow controller events

2024-03-28 Thread David Arthur (Jira)
David Arthur created KAFKA-16446:


 Summary: Log slow controller events
 Key: KAFKA-16446
 URL: https://issues.apache.org/jira/browse/KAFKA-16446
 Project: Kafka
  Issue Type: Improvement
Reporter: David Arthur


Occasionally, we will see very high p99 controller event processing times. 
Unless DEBUG logs are enabled, it is impossible to see which events are slow. 

Typically this happens during controller startup/failover, though it can also 
happen sporadically when the controller gets overloaded.
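
For illustration, one way to make slow events visible without DEBUG logging is 
to time each event and log a warning above some threshold. A rough sketch (not 
the actual controller code; the threshold is an example):

{code:java}
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SlowEventLoggerSketch {
    private static final Logger LOG = LoggerFactory.getLogger(SlowEventLoggerSketch.class);
    private static final long SLOW_EVENT_THRESHOLD_MS = 100; // example threshold

    // Wrap event processing with a timer and log slow events at WARN.
    public static void runTimed(String eventName, Runnable event) {
        long startNs = System.nanoTime();
        try {
            event.run();
        } finally {
            long durationMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
            if (durationMs > SLOW_EVENT_THRESHOLD_MS) {
                LOG.warn("Controller event {} took {} ms to process", eventName, durationMs);
            }
        }
    }
}
{code}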





[jira] [Resolved] (KAFKA-16180) Full metadata request sometimes fails during zk migration

2024-03-14 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-16180.
--
Resolution: Fixed

> Full metadata request sometimes fails during zk migration
> -
>
> Key: KAFKA-16180
> URL: https://issues.apache.org/jira/browse/KAFKA-16180
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Colin McCabe
>Priority: Blocker
> Fix For: 3.6.2, 3.7.0
>
>
> Example:
> {code:java}
> java.util.NoSuchElementException: topic_name
> at 
> scala.collection.mutable.AnyRefMap$ExceptionDefault.apply(AnyRefMap.scala:508)
> at 
> scala.collection.mutable.AnyRefMap$ExceptionDefault.apply(AnyRefMap.scala:507)
> at scala.collection.mutable.AnyRefMap.apply(AnyRefMap.scala:207)
> at 
> kafka.server.metadata.ZkMetadataCache$.$anonfun$maybeInjectDeletedPartitionsFromFullMetadataRequest$2(ZkMetadataCache.scala:112)
> at 
> kafka.server.metadata.ZkMetadataCache$.$anonfun$maybeInjectDeletedPartitionsFromFullMetadataRequest$2$adapted(ZkMetadataCache.scala:105)
> at scala.collection.immutable.HashSet.foreach(HashSet.scala:958)
> at 
> kafka.server.metadata.ZkMetadataCache$.maybeInjectDeletedPartitionsFromFullMetadataRequest(ZkMetadataCache.scala:105)
> at 
> kafka.server.metadata.ZkMetadataCache.$anonfun$updateMetadata$1(ZkMetadataCache.scala:506)
> at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:183)
> at 
> kafka.server.metadata.ZkMetadataCache.updateMetadata(ZkMetadataCache.scala:496)
> at 
> kafka.server.ReplicaManager.maybeUpdateMetadataCache(ReplicaManager.scala:2482)
> at 
> kafka.server.KafkaApis.handleUpdateMetadataRequest(KafkaApis.scala:733)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:349)
> at 
> kafka.server.KafkaRequestHandler.$anonfun$poll$8(KafkaRequestHandler.scala:210)
> at 
> kafka.server.KafkaRequestHandler.$anonfun$poll$8$adapted(KafkaRequestHandler.scala:210)
> at 
> io.confluent.kafka.availability.ThreadCountersManager.wrapEngine(ThreadCountersManager.java:146)
> at 
> kafka.server.KafkaRequestHandler.poll(KafkaRequestHandler.scala:210)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:151)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66)
> {code}





[jira] [Resolved] (KAFKA-16171) Controller failover during ZK migration can prevent metadata updates to ZK brokers

2024-03-13 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-16171.
--
Resolution: Fixed

> Controller failover during ZK migration can prevent metadata updates to ZK 
> brokers
> --
>
> Key: KAFKA-16171
> URL: https://issues.apache.org/jira/browse/KAFKA-16171
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, kraft, migration
>Affects Versions: 3.6.0, 3.7.0, 3.6.1
>    Reporter: David Arthur
>Assignee: David Arthur
>Priority: Blocker
> Fix For: 3.6.2, 3.7.0
>
>
> h2. Description
> During the ZK migration, after KRaft becomes the active controller we enter a 
> state called hybrid mode. This means we have a mixture of ZK and KRaft 
> brokers. The KRaft controller updates the ZK brokers using the deprecated 
> controller RPCs (LeaderAndIsr, UpdateMetadata, etc). 
>  
> A race condition exists where the KRaft controller will get stuck in a retry 
> loop while initializing itself after a failover which prevents it from 
> sending these RPCs to ZK brokers.
> h2. Impact
> Since the KRaft controller cannot send any RPCs to the ZK brokers, the ZK 
> brokers will not receive any metadata updates. The ZK brokers will be able to 
> send requests to the controller (such as AlterPartitions), but the metadata 
> updates which come as a result of those requests will never be seen. This 
> essentially looks like the controller is unavailable from the ZK brokers 
> perspective.
> h2. Detection and Mitigation
> This bug can be seen by observing failed ZK writes from a recently elected 
> controller.
> The tell-tale error message is:
> {code:java}
> Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This 
> indicates that another KRaft controller is making writes to ZooKeeper. {code}
> with a stacktrace like:
> {noformat}
> java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. 
> Expected zkVersion = 507823. This indicates that another KRaft controller is 
> making writes to ZooKeeper.
>   at 
> kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2613)
>   at 
> kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2639)
>   at 
> kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2664)
>   at 
> scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
>   at 
> scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
>   at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:43)
>   at 
> kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2664)
>   at 
> kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:158)
>   at 
> kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:141)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$27(KRaftMigrationZkWriter.java:441)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:262)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:64)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:791)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:880)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$28(KRaftMigrationZkWriter.java:438)
>   at java.base/java.lang.Iterable.forEach(Iterable.java:75)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:436)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:115)
>   at 
> org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:790)
>   at 
> org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
>   at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
>   at 
> org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
>   at java.base/java.lang.Thread.run(Thread.java:

Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2024-02-28 Thread David Arthur
Andrew/Jose, I like the suggested Flow API. It's also similar to the stream
observers in gRPC. I'm not sure we should expose something as complex as
the Flow API directly in KafkaAdminClient, but certainly we can provide a
similar interface.

---
Cancellations:

Another thing not yet discussed is how to cancel in-flight requests. For
other calls in KafkaAdminClient, we use KafkaFuture which has a "cancel"
method. With the callback approach, we need to be able to cancel the
request from within the callback as well as externally. Looking to the Flow
API again for inspiration, we could have the admin client pass an object to
the callback which can be used for cancellation. In the simple case, users
can ignore this object. In the advanced case, they can create a concrete
class for the callback and cache the cancellation object so it can be
accessed externally. This would be similar to the Subscription in the Flow
API.
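
To make that concrete, a rough sketch of the shape I have in mind (names are
purely illustrative):

interface Cancellation {
  void cancel();
}

interface AdminResultsSubscriber<T> {
  // The Cancellation can be cached by the caller and invoked externally,
  // analogous to Flow.Subscription#cancel. Simple callers can ignore it.
  void onSubscribe(Cancellation cancellation);
  void onNext(T result);
  void onComplete();
  void onError(Throwable t);
}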

---
Topics / Partitions:

For the case of topic descriptions, we actually have two data types
interleaved in one stream (topics and partitions). This means if we go with
TopicDescription in the "onNext" method, we will have a partial set of
topics in some cases. Also, we will end up calling "onNext" more than once
for each RPC in the case that a single RPC response spans multiple topics.

One alternative to a single "onNext" would be an interface more tailored to
the RPC like:

interface DescribeTopicsStreamObserver {
  // Called for each topic in the result stream.
  void onTopic(TopicInfo topic);

  // Called for each partition of the topic last handled by onTopic
  void onPartition(TopicPartitionInfo partition);

  // Called once the broker has finished streaming results to the admin
  // client. This marks the end of the stream.
  void onComplete();

  // Called if an error occurs on the underlying stream. This marks the end
  // of the stream.
  void onError(Throwable t);
}

---
Consumer API:

Offline, there was some discussion about using a simple SAM consumer-like
interface:

interface AdminResultsConsumer<T> {
  void onNext(T next, Throwable t);
}

This has the benefit of being quite simple and letting callers supply a
lambda instead of a full anonymous class definition. This would use
nullable arguments like CompletableFuture#whenComplete. We could also use
an Optional pattern here instead of nullables.
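
For example, the Optional-based variant might look like this (again just a
sketch):

import java.util.Optional;

interface AdminResultsConsumer<T> {
  // Exactly one of next / error would be present on each call.
  void onNext(Optional<T> next, Optional<Throwable> error);
}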

---
Summary:

So far, it seems like we are looking at these different options. The main
difference in terms of API design is if the user will need to implement
more than one method, or if a lambda can suffice.

1. Generic, Flow-like interface: AdminResultsSubscriber
2. DescribeTopicsStreamObserver (in this message above)
3. AdminResultsConsumer
4. AdminResultsConsumer with an Optional-like type instead of nullable
arguments



-David




On Fri, Feb 23, 2024 at 4:00 PM José Armando García Sancio
 wrote:

> Hi Calvin
>
> On Fri, Feb 23, 2024 at 9:23 AM Calvin Liu 
> wrote:
> > As we agreed to implement the pagination for the new API
> > DescribeTopicPartitions, the client side must also add a proper interface
> > to handle the pagination.
> > The current KafkaAdminClient.describeTopics returns
> > the DescribeTopicsResult which is the future for querying all the topics.
> > It is awkward to fit the pagination into it because
>
> I suggest taking a look at Java's Flow API:
>
> https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/Flow.html
> It was design for this specific use case and many libraries integrate with
> it.
>
> If the Kafka client cannot be upgraded to support the Java 9 which
> introduced that API, you can copy the same interface and semantics.
> This would allow users to easily integrate with reactive libraries
> since they all integrate with Java Flow.
>
> Thanks,
> --
> -José
>


-- 
-David


Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2024-02-23 Thread David Arthur
Thanks for raising this here, Calvin. Since this is the first "streaming
results" type API in KafkaAdminClient (as far as I know), we're treading
new ground here.

As you mentioned, we can either accept a consumer or return some iterable
result. Returning a java.util.Stream is also an option, and a bit more
modern/convenient than java.util.Iterator. Personally, I like the consumer
approach, but I'm interested in hearing other's opinions.

This actually brings up another question: Do we think it's safe to assume
that one topic's description can fit into memory? The RPC supports paging
across partitions within a single topic, so maybe the admin API should as
well?

-David

On Fri, Feb 23, 2024 at 12:22 PM Calvin Liu  wrote:

> Hey,
> As we agreed to implement the pagination for the new API
> DescribeTopicPartitions, the client side must also add a proper interface
> to handle the pagination.
> The current KafkaAdminClient.describeTopics returns
> the DescribeTopicsResult which is the future for querying all the topics.
> It is awkward to fit the pagination into it because
>
>1. Each future corresponds to a topic. We also want to have the
>pagination on huge topics for their partitions.
>2. To avoid OOM, we should only fetch the new topics when we need them
>and release the used topics. Especially the main use case of looping the
>topic list is when the client prints all the topics.
>
> So, to better serve the pagination, @David Arthur
>  suggested to add a new interface in the Admin
> client between the following 2.
>
> describeTopics(TopicCollection topics, DescribeTopicsOptions options, 
> Consumer);
>
> Iterator describeTopics(TopicCollection topics, 
> DescribeTopicsOptions options);
>
> David and I would prefer the first Consumer version, which works better for 
> streaming purposes.
>
>
> On Wed, Oct 11, 2023 at 4:28 PM Calvin Liu  wrote:
>
>> Hi David,
>> Thanks for the comment.
>> Yes, we can separate the ELR enablement from the metadata version. It is
>> also helpful to avoid blocking the following MV releases if the user is not
>> ready for ELR.
>> One thing to correct is that, the Unclean recovery is controlled
>> by unclean.recovery.manager.enabled, a separate config
>> from unclean.recovery.strategy. It determines whether unclean recovery will
>> be used in an unclean leader election.
>> Thanks
>>
>> On Wed, Oct 11, 2023 at 4:11 PM David Arthur  wrote:
>>
>>> One thing we should consider is a static config to totally enable/disable
>>> the ELR feature. If I understand the KIP correctly, we can effectively
>>> disable the unclean recovery by setting the recovery strategy config to
>>> "none".
>>>
>>> This would make development and rollout of this feature a bit smoother.
>>> Consider the case that we find bugs in ELR after a cluster has updated to
>>> its MetadataVersion. It's simpler to disable the feature through config
>>> rather than going through a MetadataVersion downgrade (once that's
>>> supported).
>>>
>>> Does that make sense?
>>>
>>> -David
>>>
>>> On Wed, Oct 11, 2023 at 1:40 PM Calvin Liu 
>>> wrote:
>>>
>>> > Hi Jun
>>> > -Good catch, yes, we don't need the -1 in the DescribeTopicRequest.
>>> > -No new value is added. The LeaderRecoveryState will still be set to 1
>>> if
>>> > we have an unclean leader election. The unclean leader election
>>> includes
>>> > the old random way and the unclean recovery. During the unclean
>>> recovery,
>>> > the LeaderRecoveryState will not change until the controller decides to
>>> > update the records with the new leader.
>>> > Thanks
>>> >
>>> > On Wed, Oct 11, 2023 at 9:02 AM Jun Rao 
>>> wrote:
>>> >
>>> > > Hi, Calvin,
>>> > >
>>> > > Another thing. Currently, when there is an unclean leader election,
>>> we
>>> > set
>>> > > the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord
>>> to
>>> > 1.
>>> > > With the KIP, will there be new values for LeaderRecoveryState? If
>>> not,
>>> > > when will LeaderRecoveryState be set to 1?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > > On Tue, Oct 10, 2023 at 4:24 PM Jun Rao  wrote:
>>> > >
>>> > > > Hi, Calvin,
>>> > > >
>>> > > > One more comment.
>&

Re: Github build queue

2024-02-09 Thread David Arthur
I tried to enable the merge queue on my public fork, but the option is not
available. I did a little searching and it looks like ASF does not allow
this feature to be used. I've filed an INFRA ticket to ask again
https://issues.apache.org/jira/browse/INFRA-25485

-David

On Fri, Feb 9, 2024 at 7:18 PM Ismael Juma  wrote:

> Also, on the mockito stubbings point, we did upgrade to Mockito 5.8 for the
> Java 11 and newer builds:
>
> https://github.com/apache/kafka/blob/trunk/gradle/dependencies.gradle#L64
>
> So, we should be good when it comes to that too.
>
> Ismael
>
> On Fri, Feb 9, 2024 at 4:15 PM Ismael Juma  wrote:
>
> > Nice!
> >
> > Ismael
> >
> > On Fri, Feb 9, 2024 at 3:43 PM Greg Harris  >
> > wrote:
> >
> >> Hey all,
> >>
> >> I implemented a fairly aggressive PR [1] to demote flaky tests to
> >> integration tests, and the end result is a much faster (10m locally,
> >> 1h on Jenkins) build which is also very reliable.
> >>
> >> I believe this would make unitTest suitable for use in the merge
> >> queue, with the caveat that it doesn't run 25k integration tests, and
> >> doesn't perform the mockito strict stubbing verification.
> >> This would still be a drastic improvement, as we would then be running
> >> the build and 87k unit tests that we aren't running today.
> >>
> >> Thanks!
> >> Greg
> >>
> >> [1] https://github.com/apache/kafka/pull/15349
> >>
> >> On Fri, Feb 9, 2024 at 9:25 AM Ismael Juma  wrote:
> >> >
> >> > Please check https://github.com/apache/kafka/pull/14186 before making
> >> the
> >> > `unitTest` and `integrationTest` split.
> >> >
> >> > Ismael
> >> >
> >> > On Fri, Feb 9, 2024 at 9:16 AM Josep Prat  >
> >> > wrote:
> >> >
> >> > > Regarding "Split our CI "test" job into unit and integration so we
> can
> >> > > start collecting data on those suites", can we run these 2 tasks in
> >> the
> >> > > same machine? So they won't need to compile classes twice for the
> same
> >> > > exact code?
> >> > >
> >> > > On Fri, Feb 9, 2024 at 6:05 PM Ismael Juma 
> wrote:
> >> > >
> >> > > > Why can't we add @Tag("integration") for all of those tests? Seems
> >> like
> >> > > > that would not be too hard.
> >> > > >
> >> > > > Ismael
> >> > > >
> >> > > > On Fri, Feb 9, 2024 at 9:03 AM Greg Harris
> >>  >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi David,
> >> > > > >
> >> > > > > +1 on that strategy.
> >> > > > >
> >> > > > > I see several flaky tests that aren't marked with
> >> @Tag("integration")
> >> > > > > or @IntegrationTest, and I think those would make using the
> >> unitTest
> >> > > > > target ineffective here. We could also start a new tag
> >> @Tag("flaky")
> >> > > > > and exclude that.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Greg
> >> > > > >
> >> > > > > On Fri, Feb 9, 2024 at 8:57 AM David Arthur 
> >> wrote:
> >> > > > > >
> >> > > > > > I do think we can add a PR to the merge queue while bypassing
> >> branch
> >> > > > > > protections (like we do for the Merge button today), but I'm
> not
> >> 100%
> >> > > > > sure.
> >> > > > > > I like the idea of running unit tests, though I don't think we
> >> have
> >> > > > data
> >> > > > > on
> >> > > > > > how long just the unit tests run on Jenkins (since we run the
> >> "test"
> >> > > > > target
> >> > > > > > which includes all tests). I'm also not sure how flaky the
> unit
> >> test
> >> > > > > suite
> >> > > > > > is alone.
> >> > > > > >
> >> > > > > > Since we already bypass the PR checks when merging, it seems
> >> that
> >> > > > adding
> >> > > > > a
> >> > > >

Re: Github build queue

2024-02-09 Thread David Arthur
> Regarding "Split our CI "test" job into unit and integration

I believe all of the "steps" inside the "stage" directive are run on the
same node sequentially. I think we could do something like

steps {
  doValidation()
  doUnitTest()
  doIntegrationTest()
  tryStreamsArchetype()
}

and it shouldn't affect the overall runtime much.


+1 to sticking with @Tag("integration") rather than adding a new tag. It
would be good to keep track of any unit tests we "downgrade" to integration
with a JIRA.
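
For reference, demoting a test is just a matter of tagging it with JUnit 5
(the class name here is made up):

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

@Tag("integration")
public class SomeFlakyTest {
  @Test
  public void testFlakyBehavior() {
    // flaky assertions would go here
  }
}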


On Fri, Feb 9, 2024 at 12:18 PM Josep Prat 
wrote:

> Regarding "Split our CI "test" job into unit and integration so we can
> start collecting data on those suites", can we run these 2 tasks in the
> same machine? So they won't need to compile classes twice for the same
> exact code?
>
> On Fri, Feb 9, 2024 at 6:05 PM Ismael Juma  wrote:
>
> > Why can't we add @Tag("integration") for all of those tests? Seems like
> > that would not be too hard.
> >
> > Ismael
> >
> > On Fri, Feb 9, 2024 at 9:03 AM Greg Harris  >
> > wrote:
> >
> > > Hi David,
> > >
> > > +1 on that strategy.
> > >
> > > I see several flaky tests that aren't marked with @Tag("integration")
> > > or @IntegrationTest, and I think those would make using the unitTest
> > > target ineffective here. We could also start a new tag @Tag("flaky")
> > > and exclude that.
> > >
> > > Thanks,
> > > Greg
> > >
> > > On Fri, Feb 9, 2024 at 8:57 AM David Arthur  wrote:
> > > >
> > > > I do think we can add a PR to the merge queue while bypassing branch
> > > > protections (like we do for the Merge button today), but I'm not 100%
> > > sure.
> > > > I like the idea of running unit tests, though I don't think we have
> > data
> > > on
> > > > how long just the unit tests run on Jenkins (since we run the "test"
> > > target
> > > > which includes all tests). I'm also not sure how flaky the unit test
> > > suite
> > > > is alone.
> > > >
> > > > Since we already bypass the PR checks when merging, it seems that
> > adding
> > > a
> > > > required compile/check step before landing on trunk is strictly an
> > > > improvement.
> > > >
> > > > What about this as a short term plan:
> > > >
> > > > 1) Add the merge queue, only run compile/check
> > > > 2) Split our CI "test" job into unit and integration so we can start
> > > > collecting data on those suites
> > > > 3) Add "unitTest" to merge queue job once we're satisfied it won't
> > cause
> > > > disruption
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Feb 9, 2024 at 11:43 AM Josep Prat
>  > >
> > > > wrote:
> > > >
> > > > > Hi David,
> > > > > I like the idea, it will solve the problem we've seen a couple of
> > > times in
> > > > > the last 2 weeks where compilation for some Scala version failed,
> it
> > > was
> > > > > probably overlooked during the PR build because of the flakiness of
> > > tests
> > > > > and the compilation failure was buried among the amount of failed
> > > tests.
> > > > >
> > > > > Regarding the type of check, I'm not sure what's best, have a real
> > > quick
> > > > > check or a longer one including unit tests. A full test suite will
> > run
> > > per
> > > > > each commit in each PR (these we have definitely more than 8 per
> day)
> > > and
> > > > > this should be used to ensure changes are safe and sound. I'm not
> > sure
> > > if
> > > > > having unit tests run as well before the merge itself would cause
> too
> > > much
> > > > > of an extra load on the CI machines.
> > > > > We can go with `gradlew unitTest` and see if this takes too long or
> > > causes
> > > > > too many delays with the normal pipeline.
> > > > >
> > > > > Best,
> > > > >
> > > > > On Fri, Feb 9, 2024 at 4:16 PM Ismael Juma 
> > wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > I think this is a helpful thing (and something I hoped we would
> us

Re: Github build queue

2024-02-09 Thread David Arthur
I do think we can add a PR to the merge queue while bypassing branch
protections (like we do for the Merge button today), but I'm not 100% sure.
I like the idea of running unit tests, though I don't think we have data on
how long just the unit tests run on Jenkins (since we run the "test" target
which includes all tests). I'm also not sure how flaky the unit test suite
is alone.

Since we already bypass the PR checks when merging, it seems that adding a
required compile/check step before landing on trunk is strictly an
improvement.

What about this as a short term plan:

1) Add the merge queue, only run compile/check
2) Split our CI "test" job into unit and integration so we can start
collecting data on those suites
3) Add "unitTest" to merge queue job once we're satisfied it won't cause
disruption




On Fri, Feb 9, 2024 at 11:43 AM Josep Prat 
wrote:

> Hi David,
> I like the idea, it will solve the problem we've seen a couple of times in
> the last 2 weeks where compilation for some Scala version failed, it was
> probably overlooked during the PR build because of the flakiness of tests
> and the compilation failure was buried among the amount of failed tests.
>
> Regarding the type of check, I'm not sure what's best, have a real quick
> check or a longer one including unit tests. A full test suite will run per
> each commit in each PR (these we have definitely more than 8 per day) and
> this should be used to ensure changes are safe and sound. I'm not sure if
> having unit tests run as well before the merge itself would cause too much
> of an extra load on the CI machines.
> We can go with `gradlew unitTest` and see if this takes too long or causes
> too many delays with the normal pipeline.
>
> Best,
>
> On Fri, Feb 9, 2024 at 4:16 PM Ismael Juma  wrote:
>
> > Hi David,
> >
> > I think this is a helpful thing (and something I hoped we would use when
> I
> > learned about it), but it does require the validation checks to be
> reliable
> > (or else the PR won't be merged). Sounds like you are suggesting to skip
> > the tests for the merge queue validation. Could we perhaps include the
> unit
> > tests as well? That would incentivize us to ensure the unit tests are
> fast
> > and reliable. Getting the integration tests to the same state will be a
> > longer journey.
> >
> > Ismael
> >
> > On Fri, Feb 9, 2024 at 7:04 AM David Arthur  wrote:
> >
> > > Hey folks,
> > >
> > > I recently learned about Github's Merge Queue feature, and I think it
> > could
> > > help us out.
> > >
> > > Essentially, when you hit the Merge button on a PR, it will add the PR
> > to a
> > > queue and let you run a CI job before merging. Just something simple
> like
> > > compile + static analysis would probably save us from a lot of
> headaches
> > on
> > > trunk.
> > >
> > > I can think of two situations this would help us avoid:
> > > * Two valid PRs are merged near one another, but they create a code
> > > breakage (rare)
> > > * A quick little "fixup" commit on a PR actually breaks something (less
> > > rare)
> > >
> > > Looking at our Github stats, we are averaging under 40 commits per
> week.
> > > Assuming those primarily come in on weekdays, that's 8 commits per day.
> > If
> > > we just run "gradlew check -x tests" for the merge queue job, I don't
> > think
> > > we'd get backlogged.
> > >
> > > Thoughts?
> > > David
> > >
> > >
> > >
> > >
> > > --
> > > David Arthur
> > >
> >
>
>
> --
> [image: Aiven] <https://www.aiven.io>
>
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io <https://www.aiven.io>   |   <https://www.facebook.com/aivencloud
> >
>   <https://www.linkedin.com/company/aiven/>   <
> https://twitter.com/aiven_io>
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B
>


-- 
David Arthur


Github build queue

2024-02-09 Thread David Arthur
Hey folks,

I recently learned about Github's Merge Queue feature, and I think it could
help us out.

Essentially, when you hit the Merge button on a PR, it will add the PR to a
queue and let you run a CI job before merging. Just something simple like
compile + static analysis would probably save us from a lot of headaches on
trunk.

I can think of two situations this would help us avoid:
* Two valid PRs are merged near one another, but they create a code
breakage (rare)
* A quick little "fixup" commit on a PR actually breaks something (less
rare)

Looking at our Github stats, we are averaging under 40 commits per week.
Assuming those primarily come in on weekdays, that's 8 commits per day. If
we just run "gradlew check -x tests" for the merge queue job, I don't think
we'd get backlogged.

Thoughts?
David




-- 
David Arthur


[jira] [Created] (KAFKA-16206) ZkConfigMigrationClient tries to delete topic configs twice

2024-01-29 Thread David Arthur (Jira)
David Arthur created KAFKA-16206:


 Summary: ZkConfigMigrationClient tries to delete topic configs 
twice
 Key: KAFKA-16206
 URL: https://issues.apache.org/jira/browse/KAFKA-16206
 Project: Kafka
  Issue Type: Bug
  Components: migration, kraft
Reporter: David Arthur


When deleting a topic, we see spurious ERROR logs from 
kafka.zk.migration.ZkConfigMigrationClient:
 
{code:java}
Did not delete ConfigResource(type=TOPIC, name='xxx') since the node did not 
exist. {code}

This seems to happen because ZkTopicMigrationClient#deleteTopic is deleting the 
topic, partitions, and config ZNodes in one shot. Subsequent calls from 
KRaftMigrationZkWriter to delete the config encounter a NO_NODE since the ZNode 
is already gone.






[jira] [Created] (KAFKA-16205) Reduce number of metadata requests during hybrid mode

2024-01-29 Thread David Arthur (Jira)
David Arthur created KAFKA-16205:


 Summary: Reduce number of metadata requests during hybrid mode
 Key: KAFKA-16205
 URL: https://issues.apache.org/jira/browse/KAFKA-16205
 Project: Kafka
  Issue Type: Improvement
  Components: controller, kraft
Affects Versions: 3.6.0, 3.5.0, 3.4.0, 3.7.0
Reporter: David Arthur


When migrating a cluster with a high number of brokers and partitions, it is 
possible for the controller channel manager queue to get backed up. This can 
happen when many small RPCs are generated in response to several small 
MetadataDeltas being handled by MigrationPropagator.

 

In the ZK controller, various optimizations have been made over the years to 
reduce the number of UMR and LISR sent during controlled shutdown or other 
large metadata events. For the ZK to KRaft migration, we use the MetadataLoader 
infrastructure to learn about and propagate metadata to ZK brokers.

 

We need to improve the batching in MigrationPropagator to avoid performance 
issues during the migration of large clusters.





[jira] [Created] (KAFKA-16171) Controller failover during ZK migration can lead to controller unavailability for ZK brokers

2024-01-19 Thread David Arthur (Jira)
David Arthur created KAFKA-16171:


 Summary: Controller failover during ZK migration can lead to 
controller unavailability for ZK brokers
 Key: KAFKA-16171
 URL: https://issues.apache.org/jira/browse/KAFKA-16171
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur
Assignee: David Arthur


h2. Description

During the ZK migration, after KRaft becomes the active controller we enter a 
state called hybrid mode. This means we have a mixture of ZK and KRaft brokers. 
The KRaft controller updates the ZK brokers using the deprecated controller 
RPCs (LeaderAndIsr, UpdateMetadata, etc). 

 

A race condition exists where the KRaft controller will get stuck in a retry 
loop while initializing itself after a failover which prevents it from sending 
these RPCs to ZK brokers.
h2. Impact

Since the KRaft controller cannot send any RPCs to the ZK brokers, the ZK 
brokers will not receive any metadata updates. The ZK brokers will be able to 
send requests to the controller (such as AlterPartitions), but the metadata 
updates which come as a result of those requests will never be seen. 
h2. Detection and Mitigation

This bug can be seen by observing failed ZK writes from a recently elected 
controller.

The tell-tale error message is:
{code:java}
Check op on KRaft Migration ZNode failed. Expected zkVersion = 507823. This 
indicates that another KRaft controller is making writes to ZooKeeper. {code}
with a stacktrace like:
{noformat}
java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. Expected 
zkVersion = 507823. This indicates that another KRaft controller is making 
writes to ZooKeeper.
at 
kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2613)
at 
kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2639)
at 
kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2664)
at 
scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
at 
scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:43)
at 
kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2664)
at 
kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:158)
at 
kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:141)
at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$27(KRaftMigrationZkWriter.java:441)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:262)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:64)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:791)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:880)
at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$28(KRaftMigrationZkWriter.java:438)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:436)
at 
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:115)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:790)
at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
at java.base/java.lang.Thread.run(Thread.java:1583)
at 
org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66){noformat}
To mitigate this problem, a new KRaft controller should be elected. This can be 
done by restarting the problematic active controller. To verify that the new 
controller does not encounter the race condition, look for 
{code:java}
[KRaftMigrationDriver id=9991] 9991 transitioning from SYNC_KRAFT_TO_ZK to 
KRAFT_CONTROLLER_TO_BROKER_COMM state {code}
 
h2. Details

Controller A loses leadership via Raft event (e.g., from a timeout in the Raft 
layer). A KRaftLeaderEvent is added to KRaftMigrationDriver event queue behind 
any pending MetadataChangeEvents. 

 

Controller B is elected and a KRaftLeaderEvent is added to 
KRaftMigrationDriver's queue. Since this controller is inactive, it processes 
the event immediately

Re: [VOTE] KIP-1013: Drop broker and tools support for Java 11 in Kafka 4.0 (deprecate in 3.7)

2024-01-08 Thread David Arthur
+1 binding

Thanks!
David

On Wed, Jan 3, 2024 at 8:19 PM Ismael Juma  wrote:

> Hi Mickael,
>
> Good catch. I fixed that and one other (similar) case (they were remnants
> of an earlier version of the proposal).
>
> Ismael
>
> On Wed, Jan 3, 2024 at 8:59 AM Mickael Maison 
> wrote:
>
> > Hi Ismael,
> >
> > I'm +1 (binding) too.
> >
> > One small typo, the KIP states "The remaining modules (clients,
> > streams, connect, tools, etc.) will continue to support Java 11.". I
> > think we want to remove support for Java 11 in the tools module so it
> > shouldn't be listed here.
> >
> > Thanks,
> > Mickael
> >
> > On Wed, Jan 3, 2024 at 11:09 AM Divij Vaidya 
> > wrote:
> > >
> > > +1 (binding)
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Wed, Jan 3, 2024 at 11:06 AM Viktor Somogyi-Vass
> > >  wrote:
> > >
> > > > Hi Ismael,
> > > >
> > > > I think it's important to make this change, the youtube video you
> > posted on
> > > > the discussion thread makes very good arguments and so does the KIP.
> > Java 8
> > > > is almost a liability and Java 11 already has smaller (and
> decreasing)
> > > > adoption than 17. It's a +1 (binding) from me.
> > > >
> > > > Thanks,
> > > > Viktor
> > > >
> > > > On Wed, Jan 3, 2024 at 7:00 AM Kamal Chandraprakash <
> > > > kamal.chandraprak...@gmail.com> wrote:
> > > >
> > > > > +1 (non-binding).
> > > > >
> > > > > On Wed, Jan 3, 2024 at 8:01 AM Satish Duggana <
> > satish.dugg...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Ismael for the proposal.
> > > > > >
> > > > > > Adopting JDK 17 enhances developer productivity and has reached a
> > > > > > level of maturity that has led to its adoption by several other
> > major
> > > > > > projects, signifying its reliability and effectiveness.
> > > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > >
> > > > > > ~Satish.
> > > > > >
> > > > > > On Wed, 3 Jan 2024 at 06:59, Justine Olshan
> > > > > >  wrote:
> > > > > > >
> > > > > > > Thanks for driving this.
> > > > > > >
> > > > > > > +1 (binding) from me.
> > > > > > >
> > > > > > > Justine
> > > > > > >
> > > > > > > On Tue, Jan 2, 2024 at 4:30 PM Ismael Juma 
> > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I would like to start a vote on KIP-1013.
> > > > > > > >
> > > > > > > > As stated in the discussion thread, this KIP was proposed
> > after the
> > > > > KIP
> > > > > > > > freeze for Apache Kafka 3.7, but it is purely a documentation
> > > > update
> > > > > > (if we
> > > > > > > > decide to adopt it) and I believe it would serve our users
> > best if
> > > > we
> > > > > > > > communicate the deprecation for removal sooner (i.e. 3.7)
> > rather
> > > > than
> > > > > > later
> > > > > > > > (i.e. 3.8).
> > > > > > > >
> > > > > > > > Please take a look and cast your vote.
> > > > > > > >
> > > > > > > > Link:
> > > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789510
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > >
> > > > >
> > > >
> >
>


-- 
David Arthur


[jira] [Created] (KAFKA-16078) InterBrokerProtocolVersion defaults to non-production MetadataVersion

2024-01-03 Thread David Arthur (Jira)
David Arthur created KAFKA-16078:


 Summary: InterBrokerProtocolVersion defaults to non-production 
MetadataVersion
 Key: KAFKA-16078
 URL: https://issues.apache.org/jira/browse/KAFKA-16078
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur
Assignee: David Arthur








Re: [DISCUSS] KIP-1013: Drop broker and tools support for Java 11 in Kafka 4.0 (deprecate in 3.7)

2023-12-26 Thread David Arthur
Thanks, Ismael. I'm +1 on the proposal.

Does this KIP essentially replace KIP-750?

On Tue, Dec 26, 2023 at 3:57 PM Ismael Juma  wrote:

> Hi Colin,
>
> A couple of comments:
>
> 1. It is true that full support for OpenJDK 11 from Red Hat will end on
> October 2024 (extended life support will continue beyond that), but Temurin
> claims to continue until 2027[1].
> 2. If we set source/target/release to 11, then javac ensures compatibility
> with Java 11. In addition, we'd continue to run JUnit tests with Java 11
> for the modules that support it in CI for both PRs and master (just like we
> do today).
>
> Ismael
>
> [1] https://adoptium.net/support/
>
> On Tue, Dec 26, 2023 at 9:41 AM Colin McCabe  wrote:
>
> > Hi Ismael,
> >
> > +1 from me.
> >
> > Looking at the list of languages features for JDK17, from a developer
> > productivity standpoint, the biggest wins are probably pattern matching
> and
> > java.util.HexFormat.
> >
> > Also, Java 11 is getting long in the tooth, even though we never adopted
> > it. It was released 6 years ago, and according to wikipedia, Temurin and
> > Red Hat will stop shipping updates for JDK11 sometime next year. (This is
> > from https://en.wikipedia.org/wiki/Java_version_history .)
> >
> > It feels quite bad to "upgrade" to a 6 year old version of Java that is
> > soon to go out of support anyway. (Although a few Java distributions will
> > support JDK11 for longer, such as Amazon Corretto.)
> >
> > One thing that would be nice to add to the KIP is the mechanism that we
> > will use to ensure that the clients module stays compatible with JDK11.
> > Perhaps a nightly build of just that module with JDK11 would be a good
> > idea? I'm not sure what the easiest way to build just one module is --
> > hopefully we don't have to go through maven or something.
> >
> > best,
> > Colin
> >
> >
> > On Fri, Dec 22, 2023, at 10:39, Ismael Juma wrote:
> > > Hi all,
> > >
> > > I was watching the Java Highlights of 2023 from Nicolai Parlog[1] and
> it
> > > became clear that many projects are moving to Java 17 for its developer
> > > productivity improvements. It occurred to me that there is also an
> > > opportunity for the Apache Kafka project and I wrote a quick KIP with
> the
> > > proposal. Please take a look and let me know what you think:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789510
> > >
> > > P.S. I am aware that we're past the KIP freeze for Apache Kafka 3.7,
> but
> > > the proposed change would only change documentation and it's strictly
> > > better to share this information in 3.7 than 3.8 (if we decide to do
> it).
> > >
> > > [1] https://youtu.be/NxpHg_GzpnY?si=wA57g9kAhYulrlUO=411
> >
>


-- 
-David


Re: Kafka trunk test & build stability

2023-12-26 Thread David Arthur
S2. We’ve looked into this before, and it wasn’t possible at the time with
JUnit. We commonly set a timeout on each test class (especially integration
tests). It is probably worth looking at this again and seeing if something
has changed with JUnit (or our usage of it) that would allow a global
timeout.


S3. Dedicated infra sounds nice, if we can get it. It would at least remove
some variability between the builds, and hopefully eliminate the
infra/setup class of failures.


S4. Running tests for what has changed sounds nice, but I think it is risky
to implement broadly. As Sophie mentioned, there are probably some lines we
could draw where we feel confident that only running a subset of tests is
safe. As a start, we could probably work towards skipping CI for non-code
PRs.


---


As an aside, I experimented with build caching and running affected tests a
few months ago. I used the opportunity to play with Github Actions, and I
quite liked it. Here’s the workflow I used:
https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml. I
was trying to see if we could use a build cache to reduce the compilation
time on PRs. A nightly/periodic job would build trunk and populate a Gradle
build cache. PR builds would read from that cache which would enable them
to only compile changed code. The same idea could be extended to tests, but
I didn’t get that far.


As for Github Actions, the idea there is that ASF would provide generic
Action “runners” that would pick up jobs from the Github Action build queue
and run them. It is also possible to self-host runners to expand the build
capacity of the project (i.e., other organizations could donate
build capacity). The advantage of this is that we would have more control
over our build/reports and not be “stuck” with whatever ASF Jenkins offers.
The Actions workflows are very customizable and it would let us create our
own custom plugins. There is also a substantial marketplace of plugins. I
think it’s worth exploring this more, I just haven’t had time lately.

On Tue, Dec 26, 2023 at 3:24 PM Sophie Blee-Goldman 
wrote:

> Regarding:
>
> S-4. Separate tests ran depending on what module is changed.
> >
> - This makes sense although is tricky to implement successfully, as
> > unrelated tests may expose problems in an unrelated change (e.g changing
> > core stuff like clients, the server, etc)
>
>
> Imo this avenue could provide a massive improvement to dev productivity
> with very little effort or investment, and if we do it right, without even
> any risk. We should be able to draft a simple dependency graph between
> modules and then skip the tests for anything that is clearly, provably
> unrelated and/or upstream of the target changes. This has the potential to
> substantially speed up and improve the developer experience in modules at
> the end of the dependency graph, which I believe is worth doing even if it
> unfortunately would not benefit everyone equally.
>
> For example, we can save a lot of grief with just a simple set of rules
> that are easy to check. I'll throw out a few to start with:
>
>1. A pure docs PR (ie that only touches files under the docs/ directory)
>should be allowed to skip the tests of all modules
>2. Connect PRs (that only touch connect/) only need to run the Connect
>tests -- ie they can skip the tests for core, clients, streams, etc
>3. Similarly, Streams PRs should only need to run the Streams tests --
>but again, only if all the changes are contained within streams/
>
> I'll let others chime in on how or if we can construct some safe rules as
> to which modules can or can't be skipped between the core, clients, raft,
> storage, etc
>
> And over time we could in theory build up a literal dependency graph on a
> more granular level so that, for example, changes to the core/storage
> module are allowed to skip any Streams tests that don't use an embedded
> broker, ie all unit tests and TopologyTestDriver-based integration tests.
> The danger here would be in making sure this graph is kept up to date as
> tests are added and changed, but my point is just that there's a way to
> extend the benefit of this tactic to those who work primarily on the core
> module as well. Personally, I think we should just start out with the
> example ruleset listed above, workshop it a bit since there might be other
> obvious rules I left out, and try to implement it.
>
> Thoughts?
>
> On Tue, Dec 26, 2023 at 2:25 AM Stanislav Kozlovski
>  wrote:
>
> > Great discussion!
> >
> >
> > Greg, that was a good call out regarding the two long-running builds. I
> > missed that 90d view.
> >
> > My takeaway from that is that our average build time for tests is between
> > 3-4 hours, which in and of itself seems large.
> >
> > But then reconciling this with Sophie's statement - is it possible that
> > these timed-out 8-hour builds don't get captured in that view?
> >
> > It is weird that people are reporting these things and Gradle Enterprise

[jira] [Created] (KAFKA-16020) Time#waitForFuture should tolerate nanosecond overflow

2023-12-15 Thread David Arthur (Jira)
David Arthur created KAFKA-16020:


 Summary: Time#waitForFuture should tolerate nanosecond overflow
 Key: KAFKA-16020
 URL: https://issues.apache.org/jira/browse/KAFKA-16020
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur


Reported by [~jsancio] here 
https://github.com/apache/kafka/pull/15007#discussion_r1428359211

Time#waitForFuture should follow the JDK recommendation for comparing elapsed 
nanoseconds to a duration.

https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/System.html#nanoTime()

{quote}
For example, to measure how long some code takes to execute:

 
 long startTime = System.nanoTime();
 // ... the code being measured ...
 long elapsedNanos = System.nanoTime() - startTime;
To compare elapsed time against a timeout, use

 
 if (System.nanoTime() - startTime >= timeoutNanos) ...
instead of
 
 if (System.nanoTime() >= startTime + timeoutNanos) ...
because of the possibility of numerical overflow.
{quote}
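
As a concrete illustration (a hypothetical helper, not the actual Time#waitForFuture 
signature), the overflow-safe pattern applied to a future-wait loop looks like this:

{code}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class Deadlines {
    // Illustrative sketch: poll the future until it completes or the timeout
    // elapses, comparing elapsed nanoseconds (safe under numerical overflow)
    // rather than absolute nanoTime values.
    static <T> T awaitFuture(Future<T> future, long timeoutMs)
            throws ExecutionException, InterruptedException, TimeoutException {
        long startNanos = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            try {
                return future.get(10, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                if (System.nanoTime() - startNanos >= timeoutNanos)
                    throw e; // elapsed >= timeout: give up
                // Incorrect alternative (overflow-prone):
                // if (System.nanoTime() >= startNanos + timeoutNanos) ...
            }
        }
    }
}
{code}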



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16007) ZK migrations can be slow for large clusters

2023-12-13 Thread David Arthur (Jira)
David Arthur created KAFKA-16007:


 Summary: ZK migrations can be slow for large clusters
 Key: KAFKA-16007
 URL: https://issues.apache.org/jira/browse/KAFKA-16007
 Project: Kafka
  Issue Type: Improvement
  Components: controller, kraft
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.7.0, 3.6.2


On a large cluster with many single-partition topics, the ZK to KRaft migration 
took over half an hour (roughly 37 minutes):

{code}
[KRaftMigrationDriver id=9990] Completed migration of metadata from ZooKeeper 
to KRaft. 157396 records were generated in 2245862 ms across 67132 batches. The 
record types were {TOPIC_RECORD=66282, PARTITION_RECORD=72067, 
CONFIG_RECORD=17116, PRODUCER_IDS_RECORD=1, ACCESS_CONTROL_ENTRY_RECORD=1930}. 
The current metadata offset is now 332267 with an epoch of 19. Saw 36 brokers 
in the migrated metadata [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].
{code}

This is a result of how we generate batches of records when traversing the ZK 
tree. Since we are now using metadata transactions for the migration, we can 
re-batch these records without any consistency problems.
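
As a sketch of the re-batching idea (the names and batch limit are illustrative, 
not the actual migration code), the many small per-ZNode batches can be coalesced 
into fewer, larger batches before being committed inside the metadata transaction:

{code}
import java.util.ArrayList;
import java.util.List;

final class BatchCoalescer {
    // Hypothetical helper: merge many small record batches (e.g., one per topic
    // read from ZK) into larger batches capped at maxRecordsPerBatch. Inside a
    // metadata transaction this re-grouping does not affect atomicity.
    static <R> List<List<R>> coalesce(List<List<R>> smallBatches, int maxRecordsPerBatch) {
        List<List<R>> result = new ArrayList<>();
        List<R> current = new ArrayList<>();
        for (List<R> batch : smallBatches) {
            for (R record : batch) {
                current.add(record);
                if (current.size() >= maxRecordsPerBatch) {
                    result.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
{code}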



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] 3.6.1 RC0

2023-12-04 Thread David Arthur
I have a fix for KAFKA-15968
<https://issues.apache.org/jira/browse/KAFKA-15968> here
https://github.com/apache/kafka/pull/14919/. After a bit of digging, I
found that this behavior has existed in the KRaft controller since the
beginning, so it is not a regression.

Another thing I observed while investigating this is that MetadataLoader
*does* treat CorruptRecordExceptions as fatal, which leads to the crash we
want. RaftClient calls handleCommit serially for all its listeners, so if
QuorumController#handleCommit is called first and does not crash, the call
to MetadataLoader#handleCommit will crash.

Considering these two factors, I don't strongly feel like we need to block
the release for this fix.

-David


On Mon, Dec 4, 2023 at 10:49 AM David Arthur 
wrote:

> Mickael,
>
> I just filed https://issues.apache.org/jira/browse/KAFKA-15968 while
> investigating a log corruption issue on the controller. I'm still
> investigating the issue to see how far back this goes, but I think this
> could be a blocker.
>
> Essentially, the bug is that the controller does not treat a
> CorruptRecordException as fatal, so the process will continue running. If
> this happens on an active controller, it could corrupt the cluster's
> metadata in general (since missing a single metadata record can cause lots
> of downstream problems).
>
> I'll update this thread by the end of day with a stronger
> blocker/non-blocker opinion.
>
> Thanks,
> David
>
>
> On Mon, Dec 4, 2023 at 6:48 AM Luke Chen  wrote:
>
>> Hi Mickael:
>>
>> I did:
>>1. Validated all checksums, signatures, and hashes
>>2. Ran quick start for KRaft using scala 2.12 artifacts
>>3. Spot checked the documentation and Javadoc
>>4. Validated the licence file
>>
>> When running the validation against the Scala 2.12 package, I found these
>> libraries missing (we only include Scala 2.13 libraries in the licence file):
>> scala-java8-compat_2.12-1.0.2 is missing in license file
>> scala-library-2.12.18 is missing in license file
>> scala-logging_2.12-3.9.4 is missing in license file
>> scala-reflect-2.12.18 is missing in license file
>>
>> It looks like this issue has been there for a long time, so it won't be a
>> blocker issue for v3.6.1.
>>
>> +1 (binding) from me.
>>
>> Thank you.
>> Luke
>>
>> On Sat, Dec 2, 2023 at 5:46 AM Bill Bejeck  wrote:
>>
>> > Hi Mickael,
>> >
>> > I did the following:
>> >
>> >1. Validated all checksums, signatures, and hashes
>> >2. Built from source
>> >3. Ran all the unit tests
>> >4. Spot checked the documentation and Javadoc
>> >5. Ran the ZK, Kraft, and Kafka Streams quickstart guides
>> >
>> > I did notice that the `fillDotVersion` in `js/templateData.js` needs
>> > updating to `3.6.1`, but this is minor and should not block the release.
>> >
>> > It's a +1(binding) for me, pending the successful system test run
>> >
>> > Thanks,
>> > Bill
>> >
>> > On Fri, Dec 1, 2023 at 1:49 PM Justine Olshan
>> > > >
>> > wrote:
>> >
>> > > I've started a system test run on my end.
>> > >
>> > > Justine
>> > >
>> > > On Wed, Nov 29, 2023 at 1:55 PM Justine Olshan 
>> > > wrote:
>> > >
>> > > > I built from source and ran a simple transactional produce bench. I
>> > ran a
>> > > > handful of unit tests as well.
>> > > > I scanned the docs and everything looked reasonable.
>> > > >
>> > > > I was wondering if we got the system test results mentioned: "System
>> > > > tests: Still running. I'll post an update once they complete."
>> > > >
>> > > > Justine
>> > > >
>> > > > On Wed, Nov 29, 2023 at 6:33 AM Mickael Maison <
>> > mickael.mai...@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > >> Hi Josep,
>> > > >>
>> > > >> Good catch!
>> > > >> If it's the only issue we find, I don't think we should block the
>> > > >> release just to fix that.
>> > > >>
>> > > >> If we find another issue, I'll backport it before running another
>> RC,
>> > > >> otherwise I'll backport it once 3.6.1 is released.
>> > > >>
>> > > >> Thanks,
>> > > >> Mickael
>> > > >>
>> > > 

Re: [VOTE] 3.6.1 RC0

2023-12-04 Thread David Arthur
Mickael,

I just filed https://issues.apache.org/jira/browse/KAFKA-15968 while
investigating a log corruption issue on the controller. I'm still
investigating the issue to see how far back this goes, but I think this
could be a blocker.

Essentially, the bug is that the controller does not treat a
CorruptRecordException as fatal, so the process will continue running. If
this happens on an active controller, it could corrupt the cluster's
metadata in general (since missing a single metadata record can cause lots
of downstream problems).

I'll update this thread by the end of day with a stronger
blocker/non-blocker opinion.

Thanks,
David


On Mon, Dec 4, 2023 at 6:48 AM Luke Chen  wrote:

> Hi Mickael:
>
> I did:
>1. Validated all checksums, signatures, and hashes
>2. Ran quick start for KRaft using scala 2.12 artifacts
>3. Spot checked the documentation and Javadoc
>4. Validated the licence file
>
> When running the validation against the Scala 2.12 package, I found these
> libraries missing (we only include Scala 2.13 libraries in the licence file):
> scala-java8-compat_2.12-1.0.2 is missing in license file
> scala-library-2.12.18 is missing in license file
> scala-logging_2.12-3.9.4 is missing in license file
> scala-reflect-2.12.18 is missing in license file
>
> It looks like this issue has been there for a long time, so it won't be a
> blocker issue for v3.6.1.
>
> +1 (binding) from me.
>
> Thank you.
> Luke
>
> On Sat, Dec 2, 2023 at 5:46 AM Bill Bejeck  wrote:
>
> > Hi Mickael,
> >
> > I did the following:
> >
> >1. Validated all checksums, signatures, and hashes
> >2. Built from source
> >3. Ran all the unit tests
> >4. Spot checked the documentation and Javadoc
> >5. Ran the ZK, Kraft, and Kafka Streams quickstart guides
> >
> > I did notice that the `fillDotVersion` in `js/templateData.js` needs
> > updating to `3.6.1`, but this is minor and should not block the release.
> >
> > It's a +1(binding) for me, pending the successful system test run
> >
> > Thanks,
> > Bill
> >
> > On Fri, Dec 1, 2023 at 1:49 PM Justine Olshan
>  > >
> > wrote:
> >
> > > I've started a system test run on my end.
> > >
> > > Justine
> > >
> > > On Wed, Nov 29, 2023 at 1:55 PM Justine Olshan 
> > > wrote:
> > >
> > > > I built from source and ran a simple transactional produce bench. I
> > ran a
> > > > handful of unit tests as well.
> > > > I scanned the docs and everything looked reasonable.
> > > >
> > > > I was wondering if we got the system test results mentioned: "System
> > > > tests: Still running. I'll post an update once they complete."
> > > >
> > > > Justine
> > > >
> > > > On Wed, Nov 29, 2023 at 6:33 AM Mickael Maison <
> > mickael.mai...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> Hi Josep,
> > > >>
> > > >> Good catch!
> > > >> If it's the only issue we find, I don't think we should block the
> > > >> release just to fix that.
> > > >>
> > > >> If we find another issue, I'll backport it before running another
> RC,
> > > >> otherwise I'll backport it once 3.6.1 is released.
> > > >>
> > > >> Thanks,
> > > >> Mickael
> > > >>
> > > >> On Wed, Nov 29, 2023 at 11:55 AM Josep Prat
> >  > > >
> > > >> wrote:
> > > >> >
> > > >> > Hi Mickael,
> > > >> > This PR[1] made me realize NOTICE-binary is missing the notice for
> > > >> > commons-io. I don't know if it's a blocker or not. I can cherry
> pick
> > > the
> > > >> > commit to the 3.6 branch if you want.
> > > >> >
> > > >> > Best,
> > > >> >
> > > >> >
> > > >> > [1]: https://github.com/apache/kafka/pull/14865
> > > >> >
> > > >> > On Tue, Nov 28, 2023 at 10:25 AM Josep Prat 
> > > >> wrote:
> > > >> >
> > > >> > > Hi Mickael,
> > > >> > > Thanks for running the release. It's a +1 for me (non-binding).
> > > >> > > I did the following:
> > > >> > > - Verified artifact's signatures and hashes
> > > >> > > - Checked JavaDoc (with navigation to Oracle JavaDoc)
> > > >> > > - Compiled source code
> > > >> > > - Run unit tests and integration tests
> > > >> > > - Run getting started with ZK and KRaft
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > On Tue, Nov 28, 2023 at 8:51 AM Kamal Chandraprakash <
> > > >> > > kamal.chandraprak...@gmail.com> wrote:
> > > >> > >
> > > >> > >> +1 (non-binding)
> > > >> > >>
> > > >> > >> 1. Built the source from 3.6.1-rc0 tag in scala 2.12 and 2.13
> > > >> > >> 2. Ran all the unit and integration tests.
> > > >> > >> 3. Ran quickstart and verified the produce-consume on a 3 node
> > > >> cluster.
> > > >> > >> 4. Verified the tiered storage functionality with local-tiered
> > > >> storage.
> > > >> > >>
> > > >> > >> On Tue, Nov 28, 2023 at 12:55 AM Federico Valeri <
> > > >> fedeval...@gmail.com>
> > > >> > >> wrote:
> > > >> > >>
> > > >> > >> > Hi Mickael,
> > > >> > >> >
> > > >> > >> > - Build from source (Java 17, Scala 2.13)
> > > >> > >> > - Run unit and integration tests
> > > >> > >> > - Run custom client apps using staging artifacts
> > > >> > >> >
> > > >> > >> > +1 (non 

[jira] [Created] (KAFKA-15968) QuorumController does not treat CorruptRecordException as fatal

2023-12-04 Thread David Arthur (Jira)
David Arthur created KAFKA-15968:


 Summary: QuorumController does not treat CorruptRecordException as 
fatal
 Key: KAFKA-15968
 URL: https://issues.apache.org/jira/browse/KAFKA-15968
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.6.0, 3.7.0
Reporter: David Arthur


When QuorumController encounters a CorruptRecordException, it does not include 
the exception in the log message. Since CorruptRecordException extends 
ApiException, it gets caught by the first condition in 
EventHandlerExceptionInfo#fromInternal.

The controller treats ApiException as an expected case (for things like 
authorization errors or creating a topic that already exists), so it does not 
cause a failover. If the active controller sees a corrupt record, it should be 
a fatal error.

While we are fixing this, we should audit the subclasses of ApiException and 
make sure we are handling the fatal ones correctly.
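
A hypothetical sketch of the kind of classification this suggests (the class and 
method names are illustrative, not the actual EventHandlerExceptionInfo code):

{code}
import org.apache.kafka.common.errors.ApiException;
import org.apache.kafka.common.errors.CorruptRecordException;

final class ControllerFaultClassifier {
    // Most ApiExceptions are expected, client-facing errors (authorization
    // failures, creating a topic that already exists, ...). A
    // CorruptRecordException, however, means the metadata log itself is
    // damaged, so the active controller should treat it as fatal instead of
    // returning it to the caller.
    static boolean isFatal(Throwable t) {
        if (t instanceof CorruptRecordException)
            return true;   // damaged log: fail over / exit
        if (t instanceof ApiException)
            return false;  // expected error, surfaced to the client
        return true;       // anything unexpected is fatal by default
    }
}
{code}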



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-1001; CurrentControllerId Metric

2023-11-20 Thread David Arthur
Thanks Colin,

+1 from me

-David

On Tue, Nov 14, 2023 at 3:53 PM Colin McCabe  wrote:

> Hi all,
>
> I'd like to call a vote for KIP-1001: Add CurrentControllerId metric.
>
> Take a look here:
> https://cwiki.apache.org/confluence/x/egyZE
>
> best,
> Colin
>


-- 
-David


[jira] [Resolved] (KAFKA-15825) KRaft controller writes empty state to ZK after migration

2023-11-14 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15825.
--
Resolution: Fixed

This bug was fixed as part of KAFKA-15605

> KRaft controller writes empty state to ZK after migration
> -
>
> Key: KAFKA-15825
> URL: https://issues.apache.org/jira/browse/KAFKA-15825
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 3.6.0
>Reporter: David Arthur
>    Assignee: David Arthur
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
>
> Immediately following the ZK migration, there is a race condition where the 
> KRaftMigrationDriver can use an empty MetadataImage when performing the full 
> "SYNC_KRAFT_TO_ZK" reconciliation. 
> After the next controller failover, or when the controller loads a metadata 
> snapshot, the correct state will be written to ZK. 
> The symptom of this bug is that we see the migration complete, and then all 
> the metadata removed from ZK. For example, 
> {code}
> [KRaftMigrationDriver id=9990] Completed migration of metadata from ZooKeeper 
> to KRaft. 573 records were generated in 2204 ms across 51 batches. The record 
> types were {TOPIC_RECORD=41, PARTITION_RECORD=410, CONFIG_RECORD=121, 
> PRODUCER_IDS_RECORD=1}. The current metadata offset is now 503794 with an 
> epoch of 21. Saw 6 brokers in the migrated metadata [0, 1, 2, 3, 4, 5].
> {code}
> immediately followed by:
> {code}
> [KRaftMigrationDriver id=9990] Made the following ZK writes when reconciling 
> with KRaft state: {DeleteBrokerConfig=7, DeleteTopic=41, UpdateTopicConfig=41}
> {code}
> If affected by this, a quick workaround is to cause the controller to 
> failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15825) KRaft controller writes empty state to ZK after migration

2023-11-14 Thread David Arthur (Jira)
David Arthur created KAFKA-15825:


 Summary: KRaft controller writes empty state to ZK after migration
 Key: KAFKA-15825
 URL: https://issues.apache.org/jira/browse/KAFKA-15825
 Project: Kafka
  Issue Type: Bug
  Components: controller
Affects Versions: 3.6.0
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.7.0, 3.6.1


Immediately following the ZK migration, there is a race condition where the 
KRaftMigrationDriver can use an empty MetadataImage when performing the full 
"SYNC_KRAFT_TO_ZK" reconciliation. 

After the next controller failover, or when the controller loads a metadata 
snapshot, the correct state will be written to ZK. 

The symptom of this bug is that we see the migration complete, and then all the 
metadata removed from ZK. For example, 

{code}
[KRaftMigrationDriver id=9990] Completed migration of metadata from ZooKeeper 
to KRaft. 573 records were generated in 2204 ms across 51 batches. The record 
types were {TOPIC_RECORD=41, PARTITION_RECORD=410, CONFIG_RECORD=121, 
PRODUCER_IDS_RECORD=1}. The current metadata offset is now 503794 with an epoch 
of 21. Saw 6 brokers in the migrated metadata [0, 1, 2, 3, 4, 5].
{code}

immediately followed by:

{code}
[KRaftMigrationDriver id=9990] Made the following ZK writes when reconciling 
with KRaft state: {DeleteBrokerConfig=7, DeleteTopic=41, UpdateTopicConfig=41}
{code}

If affected by this, a quick workaround is to cause the controller to failover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15605) Topics marked for deletion in ZK are incorrectly migrated to KRaft

2023-11-14 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15605.
--
Fix Version/s: 3.7.0
   Resolution: Fixed

> Topics marked for deletion in ZK are incorrectly migrated to KRaft
> --
>
> Key: KAFKA-15605
> URL: https://issues.apache.org/jira/browse/KAFKA-15605
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, kraft
>Affects Versions: 3.6.0
>Reporter: David Arthur
>    Assignee: David Arthur
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
>
> When migrating topics from ZooKeeper, the KRaft controller reads all the 
> topic and partition metadata from ZK directly. This includes topics which 
> have been marked for deletion by the ZK controller. After being migrated to 
> KRaft, the pending topic deletions are never completed, so it is as if the 
> delete topic request never happened.
> Since the client request to delete these topics has already been returned as 
> successful, it would be confusing to the client that the topic still existed. 
> An operator or application would need to issue another topic deletion to 
> remove these topics once the controller had moved to KRaft. If they tried to 
> create a new topic with the same name, they would receive a 
> TOPIC_ALREADY_EXISTS error.
> The migration logic should carry over pending topic deletions and resolve 
> them either as part of the migration or shortly after.
> *Note to operators:*
> To determine if a migration was affected by this, an operator can check the 
> contents of {{/admin/delete_topics}} after the KRaft controller has migrated 
> the metadata. If any topics are listed under this ZNode, they were not 
> deleted and will still be present in KRaft. At this point the operator can 
> make a determination if the topics should be re-deleted (using 
> "kafka-topics.sh --delete") or left in place. In either case, the topics 
> should be removed from {{/admin/delete_topics}} to prevent unexpected topic 
> deletion in the event of a fallback to ZK.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15799) ZK brokers incorrectly handle KRaft metadata snapshots

2023-11-08 Thread David Arthur (Jira)
David Arthur created KAFKA-15799:


 Summary: ZK brokers incorrectly handle KRaft metadata snapshots
 Key: KAFKA-15799
 URL: https://issues.apache.org/jira/browse/KAFKA-15799
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.6.1


While working on the fix for KAFKA-15605, I noticed that ZK brokers are 
unconditionally merging data from UpdateMetadataRequest with their existing 
MetadataCache. This is not the correct behavior when handling a metadata 
snapshot from the KRaft controller. 

For example, if a topic was deleted in KRaft and not transmitted as part of a 
delta update (e.g., during a failover) then the ZK brokers will never remove 
the topic from their cache (until they restart and rebuild their cache).
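
An illustrative sketch of the distinction (not the actual metadata cache code): 
a full snapshot from the KRaft controller should replace the cached state, while 
a delta should be merged into it.

{code}
import java.util.HashMap;
import java.util.Map;

final class BrokerMetadataCacheSketch<TopicState> {
    private Map<String, TopicState> topics = new HashMap<>();

    // Hypothetical handler: a delta (incremental update) is merged into the
    // existing cache, but a full snapshot must replace it, otherwise topics
    // deleted while the broker missed updates are never removed.
    void handleUpdate(Map<String, TopicState> updatedTopics, boolean isFullSnapshot) {
        if (isFullSnapshot) {
            topics = new HashMap<>(updatedTopics); // replace: drops stale topics
        } else {
            topics.putAll(updatedTopics);          // merge: incremental update
        }
    }
}
{code}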



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15698) KRaft mode brokers should clean up stray partitions from migration

2023-10-26 Thread David Arthur (Jira)
David Arthur created KAFKA-15698:


 Summary: KRaft mode brokers should clean up stray partitions from 
migration
 Key: KAFKA-15698
 URL: https://issues.apache.org/jira/browse/KAFKA-15698
 Project: Kafka
  Issue Type: Improvement
Reporter: David Arthur


Follow up to KAFKA-15605. After the brokers are migrated to KRaft and the 
migration is completed, we should let the brokers clean up any partitions that 
we marked as "stray" during the migration. This would be any partition that was 
being deleted when the migration began, or any partition that was deleted, but 
not seen as deleted by StopReplica (e.g., broker down).
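
A minimal sketch of the cleanup check (hypothetical names, not the actual broker 
code): any partition present in the local log directories but absent from the 
current KRaft metadata image is a stray and can be deleted.

{code}
import java.util.HashSet;
import java.util.Set;

final class StrayPartitionSketch {
    // Hypothetical helper: compute the set of stray partitions as the local
    // on-disk partitions minus the partitions known to the metadata image.
    static Set<String> findStrays(Set<String> localLogDirPartitions,
                                  Set<String> partitionsInMetadataImage) {
        Set<String> strays = new HashSet<>(localLogDirPartitions);
        strays.removeAll(partitionsInMetadataImage);
        return strays;
    }
}
{code}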



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15648) QuorumControllerTest#testBootstrapZkMigrationRecord is flaky

2023-10-19 Thread David Arthur (Jira)
David Arthur created KAFKA-15648:


 Summary: QuorumControllerTest#testBootstrapZkMigrationRecord is 
flaky
 Key: KAFKA-15648
 URL: https://issues.apache.org/jira/browse/KAFKA-15648
 Project: Kafka
  Issue Type: Bug
  Components: controller, unit tests
Reporter: David Arthur


Noticed that this test failed on Jenkins with 

{code}
org.apache.kafka.server.fault.FaultHandlerException: fatalFaultHandler: 
exception while completing controller activation: Should not have ZK migrations 
enabled on a cluster running metadata.version 3.0-IV1
at 
app//org.apache.kafka.controller.ActivationRecordsGenerator.recordsForNonEmptyLog(ActivationRecordsGenerator.java:154)
at 
app//org.apache.kafka.controller.ActivationRecordsGenerator.generate(ActivationRecordsGenerator.java:229)
at 
app//org.apache.kafka.controller.QuorumController$CompleteActivationEvent.generateRecordsAndResult(QuorumController.java:1237)
at 
app//org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:784)
at 
app//org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
at 
app//org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
at 
app//org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Should not have ZK migrations enabled on 
a cluster running metadata.version 3.0-IV1
... 8 more
{code}

When trying to reproduce this failure locally, I ran into a separate flaky 
failure

{code}
[2023-10-19 13:42:09,442] INFO Elected new leader: 
LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1). 
(org.apache.kafka.metalog.LocalLogManager$SharedLogData:300)
[2023-10-19 13:42:09,442] DEBUG 
append(batch=LeaderChangeBatch(newLeader=LeaderAndEpoch(leaderId=OptionalInt[0],
 epoch=1)), nextEndOffset=0) 
(org.apache.kafka.metalog.LocalLogManager$SharedLogData:276)
[2023-10-19 13:42:09,442] DEBUG [LocalLogManager 0] Node 0: running log check. 
(org.apache.kafka.metalog.LocalLogManager:536)
[2023-10-19 13:42:09,442] DEBUG [LocalLogManager 0] initialized local log 
manager for node 0 (org.apache.kafka.metalog.LocalLogManager:685)
[2023-10-19 13:42:09,442] DEBUG [QuorumController id=0] Creating in-memory 
snapshot -1 (org.apache.kafka.timeline.SnapshotRegistry:203)
[2023-10-19 13:42:09,442] INFO [QuorumController id=0] Creating new 
QuorumController with clusterId K8TDRiYZQuepVQHPgwP91A. ZK migration mode is 
enabled. (org.apache.kafka.controller.QuorumController:1912)
[2023-10-19 13:42:09,442] INFO [LocalLogManager 0] Node 0: registered 
MetaLogListener 1238203422 (org.apache.kafka.metalog.LocalLogManager:703)
[2023-10-19 13:42:09,443] DEBUG [LocalLogManager 0] Node 0: running log check. 
(org.apache.kafka.metalog.LocalLogManager:536)
[2023-10-19 13:42:09,443] DEBUG [LocalLogManager 0] Node 0: Executing 
handleLeaderChange LeaderAndEpoch(leaderId=OptionalInt[0], epoch=1) 
(org.apache.kafka.metalog.LocalLogManager:578)
[2023-10-19 13:42:09,443] DEBUG [QuorumController id=0] Executing 
handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:577)
[2023-10-19 13:42:09,443] INFO [QuorumController id=0] In the new epoch 1, the 
leader is (none). (org.apache.kafka.controller.QuorumController:1179)
[2023-10-19 13:42:09,443] DEBUG [QuorumController id=0] Processed 
handleLeaderChange[1] in 25 us 
(org.apache.kafka.controller.QuorumController:510)
[2023-10-19 13:42:09,443] DEBUG [QuorumController id=0] Executing 
handleLeaderChange[1]. (org.apache.kafka.controller.QuorumController:577)
[2023-10-19 13:42:09,443] INFO [QuorumController id=0] Becoming the active 
controller at epoch 1, next write offset 1. 
(org.apache.kafka.controller.QuorumController:1175)
[2023-10-19 13:42:09,443] DEBUG [QuorumController id=0] Processed 
handleLeaderChange[1] in 34 us 
(org.apache.kafka.controller.QuorumController:510)
[2023-10-19 13:42:09,443] WARN [QuorumController id=0] Performing controller 
activation. The metadata log appears to be empty. Appending 1 bootstrap 
record(s) at metadata.version 3.4-IV0 from bootstrap source 'test'. Putting the 
controller into pre-migration mode. No metadata updates will be allowed until 
the ZK metadata has been migrated. 
(org.apache.kafka.controller.QuorumController:108)
[2023-10-19 13:42:09,443] INFO [QuorumController id=0] Replayed a 
FeatureLevelRecord setting metadata version to 3.4-IV0 
(org.apache.kafka.controller.FeatureControlManager:400)
[2023-10-19 13:42:09,443] INFO [QuorumController id=0] Replayed a 
ZkMigrationStateRecord changing the migration state from NONE to PRE_MIGRATION. 
(org.apache.kafka.controller.FeatureControlManager:421)
[2023-10-19 13:42:09,443] DEBUG append(batch=LocalRecordBatch(leaderEpoch=1

[jira] [Created] (KAFKA-15605) Topics marked for deletion are incorrectly migrated to KRaft

2023-10-13 Thread David Arthur (Jira)
David Arthur created KAFKA-15605:


 Summary: Topics marked for deletion are incorrectly migrated to 
KRaft
 Key: KAFKA-15605
 URL: https://issues.apache.org/jira/browse/KAFKA-15605
 Project: Kafka
  Issue Type: Bug
  Components: controller, kraft
Affects Versions: 3.6.0
Reporter: David Arthur
 Fix For: 3.6.1


When migrating topics from ZooKeeper, the KRaft controller reads all the topic 
and partition metadata from ZK directly. This includes topics which have been 
marked for deletion by the ZK controller. 

Since the client request to delete these topics has already been returned as 
successful, it would be confusing to the client that the topic still existed. 
An operator or application would need to issue another topic deletion to remove 
these topics once the controller had moved to KRaft. If they tried to create a 
new topic with the same name, they would receive a TOPIC_ALREADY_EXISTS error.

The migration logic should carry over pending topic deletions and resolve them 
either as part of the migration or shortly after.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-11 Thread David Arthur
One thing we should consider is a static config to totally enable/disable
the ELR feature. If I understand the KIP correctly, we can effectively
disable the unclean recovery by setting the recovery strategy config to
"none".

This would make development and rollout of this feature a bit smoother.
Consider the case where we find bugs in ELR after a cluster has already
upgraded to the new MetadataVersion. It's simpler to disable the feature
through config
rather than going through a MetadataVersion downgrade (once that's
supported).

Does that make sense?

-David

On Wed, Oct 11, 2023 at 1:40 PM Calvin Liu 
wrote:

> Hi Jun
> -Good catch, yes, we don't need the -1 in the DescribeTopicRequest.
> -No new value is added. The LeaderRecoveryState will still be set to 1 if
> we have an unclean leader election. The unclean leader election includes
> the old random way and the unclean recovery. During the unclean recovery,
> the LeaderRecoveryState will not change until the controller decides to
> update the records with the new leader.
> Thanks
>
> On Wed, Oct 11, 2023 at 9:02 AM Jun Rao  wrote:
>
> > Hi, Calvin,
> >
> > Another thing. Currently, when there is an unclean leader election, we
> set
> > the LeaderRecoveryState in PartitionRecord and PartitionChangeRecord to
> 1.
> > With the KIP, will there be new values for LeaderRecoveryState? If not,
> > when will LeaderRecoveryState be set to 1?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Oct 10, 2023 at 4:24 PM Jun Rao  wrote:
> >
> > > Hi, Calvin,
> > >
> > > One more comment.
> > >
> > > "The first partition to fetch details for. -1 means to fetch all
> > > partitions." It seems that FirstPartitionId of 0 naturally means
> fetching
> > > all partitions?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Oct 10, 2023 at 12:40 PM Calvin Liu  >
> > > wrote:
> > >
> > >> Hi Jun,
> > >> Yeah, with the current Metadata request handling, we only return
> errors
> > on
> > >> the Topic level, like topic not found. It seems that querying a
> specific
> > >> partition is not a valid use case. Will update.
> > >> Thanks
> > >>
> > >> On Tue, Oct 10, 2023 at 11:55 AM Jun Rao 
> > >> wrote:
> > >>
> > >> > Hi, Calvin,
> > >> >
> > >> > 60.  If the range query has errors for some of the partitions, do we
> > >> expect
> > >> > different responses when querying particular partitions?
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Tue, Oct 10, 2023 at 10:50 AM Calvin Liu
> >  > >> >
> > >> > wrote:
> > >> >
> > >> > > Hi Jun
> > >> > > 60. Yes, it is a good question. I was thinking the API could be
> > >> flexible
> > >> > to
> > >> > > query the particular partitions if the range query has errors for
> > >> some of
> > >> > > the partitions. Not sure whether it is a valid assumption, what do
> > you
> > >> > > think?
> > >> > >
> > >> > > 61. Good point, I will update them to partition level with the
> same
> > >> > limit.
> > >> > >
> > >> > > 62. Sure, will do.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > On Tue, Oct 10, 2023 at 10:12 AM Jun Rao  >
> > >> > wrote:
> > >> > >
> > >> > > > Hi, Calvin,
> > >> > > >
> > >> > > > A few more minor comments on your latest update.
> > >> > > >
> > >> > > > 60. DescribeTopicRequest: When will the Partitions field be
> used?
> > It
> > >> > > seems
> > >> > > > that the FirstPartitionId field is enough for AdminClient usage.
> > >> > > >
> > >> > > > 61. Could we make the limit for DescribeTopicRequest,
> > >> > > ElectLeadersRequest,
> > >> > > > GetReplicaLogInfo consistent? Currently, ElectLeadersRequest's
> > >> limit is
> > >> > > at
> > >> > > > topic level and GetReplicaLogInfo has a different partition
> level
> > >> limit
> > >> > > > from DescribeTopicRequest.
> > >> > > >
> > >> > > > 62. Should ElectLeadersRequest.DesiredLeaders be at the same
> level
> > >> as
> > >> > > > ElectLeadersRequest.TopicPartitions.Partitions? In the KIP, it
> > looks
> > >> > like
> > >> > > > it's at the same level as ElectLeadersRequest.TopicPartitions.
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jun
> > >> > > >
> > >> > > > On Wed, Oct 4, 2023 at 3:55 PM Calvin Liu
> > >> 
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi David,
> > >> > > > > Thanks for the comments.
> > >> > > > > 
> > >> > > > > I thought that a new snapshot with the downgraded MV is
> created
> > in
> > >> > this
> > >> > > > > case. Isn’t it the case?
> > >> > > > > Yes, you are right, a metadata delta will be generated after
> the
> > >> MV
> > >> > > > > downgrade. Then the user can start the software downgrade.
> > >> > > > > -
> > >> > > > > Could you also elaborate a bit more on the reasoning behind
> > adding
> > >> > the
> > >> > > > > limits to the admin RPCs? This is a new pattern in Kafka so it
> > >> would
> > >> > be
> > >> > > > > good to clear on the motivation.
> > >> > > > > Thanks to Colin for bringing it up. The current
> MetadataRequest
> > >> does
> > >> > > not
> > >> > > > > have a limit on the number 

Re: Apache Kafka 3.6.0 release

2023-10-05 Thread David Arthur
ve opened a PR here:
> > > > > https://github.com/apache/kafka/pull/14398
> > > > > > > > > and
> > > > > > > > > > > > I'll work to get it merged promptly.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Sep 18, 2023 at 11:54 AM Greg Harris <
> > > > > greg.har...@aiven.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Satish,
> > > > > > > > > > > > >
> > > > > > > > > > > > > While validating 3.6.0-rc0, I noticed this
> > regression as
> > > > > compared
> > > > > > > > > to
> > > > > > > > > > > > > 3.5.1:
> > https://issues.apache.org/jira/browse/KAFKA-15473
> > > > > > > > > > > > >
> > > > > > > > > > > > > Impact: The `connector-plugins` endpoint lists
> > duplicates
> > > > > which may
> > > > > > > > > > > > > cause confusion for users, or poor behavior in
> > clients.
> > > > > > > > > > > > > Using the other REST API endpoints appears
> > unaffected.
> > > > > > > > > > > > > I'll open a PR for this later today.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Greg
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Sep 14, 2023 at 11:56 AM Satish Duggana
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks Justine for the update. I saw in the
> > morning that
> > > > > these
> > > > > > > > > > > changes
> > > > > > > > > > > > > > are pushed to trunk and 3.6.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 14 Sept 2023 at 21:54, Justine Olshan
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Satish,
> > > > > > > > > > > > > > > We were able to merge
> > > > > > > > > > > > > > >
> > https://issues.apache.org/jira/browse/KAFKA-15459
> > > > > yesterday
> > > > > > > > > > > > > > > and pick to 3.6.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hopefully nothing more from me on this release.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Justine
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Sep 13, 2023 at 9:51 PM Satish Duggana
> <
> > > > > > > > > > > satish.dugg...@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks Luke for the update.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ~Satish.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, 14 Sept 2023 at 07:29, Luke Chen 

[jira] [Created] (KAFKA-15552) Duplicate Producer ID blocks during ZK migration

2023-10-05 Thread David Arthur (Jira)
David Arthur created KAFKA-15552:


 Summary: Duplicate Producer ID blocks during ZK migration
 Key: KAFKA-15552
 URL: https://issues.apache.org/jira/browse/KAFKA-15552
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.5.1, 3.4.1, 3.5.0, 3.4.0, 3.6.0
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.4.2, 3.5.2, 3.6.1


When migrating producer ID blocks from ZK to KRaft, we are taking the current 
producer ID block from ZK and writing its "firstProducerId" into the producer 
IDs KRaft record. However, in KRaft we store the _next_ producer ID block in 
the log rather than storing the current block like ZK does. The end result is 
that the first block given to a caller of AllocateProducerIds is a duplicate of 
the last block allocated in ZK mode.
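
A small sketch of the off-by-one-block issue (the block size constant is 
illustrative): ZK stores the block that has already been handed out, while the 
KRaft record stores the next ID to allocate, so the migrated value must be 
advanced by one block.

{code}
final class ProducerIdMigrationSketch {
    static final long BLOCK_SIZE = 1000L; // illustrative block granularity

    // Copying ZK's firstProducerId verbatim re-issues the block ZK already
    // gave out, producing duplicate producer IDs.
    static long buggyNextProducerId(long zkCurrentBlockFirstId) {
        return zkCurrentBlockFirstId;
    }

    // Advancing past the current ZK block yields the correct "next" value
    // for the KRaft producer IDs record.
    static long correctNextProducerId(long zkCurrentBlockFirstId) {
        return zkCurrentBlockFirstId + BLOCK_SIZE;
    }
}
{code}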

 

This can result in duplicate producer IDs being given to transactional or 
idempotent producers. In the case of transactional producers, this can cause 
long term problems since the producer IDs are persisted and reused for a long 
time.


This bug is possible during the window between the last producer ID block being 
allocated by the ZK controller and the point at which all brokers have been 
restarted following the metadata migration.
 

Symptoms of this bug will include ReplicaManager OutOfOrderSequenceException 
and possibly some producer epoch validation errors. To see if a cluster is 
affected by this bug, search for the offending producer ID and see if it is 
being used by more than one producer.

 

For example, the following error was observed
{code}
Out of order sequence number for producer 376000 at offset 381338 in partition 
REDACTED: 0 (incoming seq. number), 21 (current end sequence number) 
{code}

Then searching for "376000" on 
org.apache.kafka.clients.producer.internals.TransactionManager logs, two 
brokers both show the same producer ID being provisioned

{code}
Broker 0 [Producer clientId=REDACTED-0] ProducerId set to 376000 with epoch 1
Broker 5 [Producer clientId=REDACTED-1] ProducerId set to 376000 with epoch 1
{code}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-858: Handle JBOD broker disk failure in KRaft

2023-10-05 Thread David Arthur
Hey, just chiming in regarding the ZK migration piece.

Generally speaking, one of the design goals of the migration was to have
minimal changes on the ZK brokers and especially the ZK controller. Since
ZK mode is our safe/well-known fallback mode, we wanted to reduce the
chances of introducing bugs there. Following that logic, I'd prefer option
(a) since it does not involve changing any migration code or (much) ZK
broker code. Disk failures should be pretty rare, so this seems like a
reasonable option.

> a) If a migrating ZK mode broker encounters a directory failure,
>   it will shut down. While this degrades failure handling during
>   the temporary migration window, it is a useful simplification.
>   This is an attractive option, and it isn't ruled out, but it
>   is also not clear that it is necessary at this point.


If a ZK broker experiences a disk failure before the metadata is migrated,
it will prevent the migration from happening. If the metadata is already
migrated, then you simply have an offline broker.

If an operator wants to minimize the time window of the migration, they can
simply do the requisite rolling restarts one after the other.

1) Provision KRaft controllers
2) Configure ZK brokers for migration and do rolling restart (migration
happens automatically here)
3) Configure ZK brokers as KRaft and do rolling restart

This reduces the time window to essentially the time it takes to do two
rolling restarts of the cluster. Once the brokers are in KRaft mode, they
won't have the "shutdown if log dir fails" behavior.



One question with this approach is how the KRaft controller learns about
the multiple log directories after the broker is restarted in KRaft mode.
If I understand the design correctly, this would be similar to a
single-directory KRaft broker being reconfigured as a multi-directory broker.
That is, the broker sees that the PartitionRecords are missing the
directory assignments and then sends AssignReplicasToDirs to the controller.

Thanks!
David


Re: [DISCUSS] KIP-966: Eligible Leader Replicas

2023-10-03 Thread David Arthur
Calvin, thanks for the KIP!

I'm getting up to speed on the discussion. I had a few questions

57. When is the CleanShutdownFile removed? I think it probably happens
after registering with the controller, but it would be good to clarify this.

58. Since the broker epoch comes from the controller, what would go
into the CleanShutdownFile in the case of a broker being unable to register
with the controller? For example:

1) Broker A registers

2) Controller sees A, gives epoch 1

3) Broker A crashes, no CleanShutdownFile

4) Broker A starts up and shuts down before registering


During 4) is a CleanShutdownFile produced? If so, what epoch goes in it?

59. What is the expected behavior when controlled shutdown times out?
Looking at BrokerServer, I think the logs have a chance of still being
closed cleanly, so this could be a regular clean shutdown scenario.




On Tue, Oct 3, 2023 at 6:04 PM Colin McCabe  wrote:

> On Tue, Oct 3, 2023, at 10:49, Jun Rao wrote:
> > Hi, Calvin,
> >
> > Thanks for the update KIP. A few more comments.
> >
> > 41. Why would a user choose the option to select a random replica as the
> > leader instead of using unclean.recovery.strateg=Aggressive? It seems
> that
> > the latter is strictly better? If that's not the case, could we fold this
> > option under unclean.recovery.strategy instead of introducing a separate
> > config?
>
> Hi Jun,
>
> I thought the flow of control was:
>
> If there is no leader for the partition {
>   If (there are unfenced ELR members) {
> choose_an_unfenced_ELR_member
>   } else if (there are fenced ELR members AND strategy=Aggressive) {
> do_unclean_recovery
>   } else if (there are no ELR members AND strategy != None) {
> do_unclean_recovery
>   } else {
> do nothing about the missing leader
>   }
> }
>
> do_unclean_recovery() {
>if (unclean.recovery.manager.enabled) {
> use UncleanRecoveryManager
>   } else {
> choose the last known leader if that is available, or a random leader
> if not)
>   }
> }
>
> However, I think this could be clarified, especially the behavior when
> unclean.recovery.manager.enabled=false. Inuitively the goal for
> unclean.recovery.manager.enabled=false is to be "the same as now, mostly"
> but it's very underspecified in the KIP, I agree.
>
> >
> > 50. ElectLeadersRequest: "If more than 20 topics are included, only the
> > first 20 will be served. Others will be returned with DesiredLeaders."
> Hmm,
> > not sure that I understand this. ElectLeadersResponse doesn't have a
> > DesiredLeaders field.
> >
> > 51. GetReplicaLogInfo: "If more than 2000 partitions are included, only
> the
> > first 2000 will be served" Do we return an error for the remaining
> > partitions? Actually, should we include an errorCode field at the
> partition
> > level in GetReplicaLogInfoResponse to cover non-existing partitions and
> no
> > authorization, etc?
> >
> > 52. The entry should matches => The entry should match
> >
> > 53. ElectLeadersRequest.DesiredLeaders: Should it be nullable since a
> user
> > may not specify DesiredLeaders?
> >
> > 54. Downgrade: Is that indeed possible? I thought earlier you said that
> > once the new version of the records are in the metadata log, one can't
> > downgrade since the old broker doesn't know how to parse the new version
> of
> > the metadata records?
> >
>
> MetadataVersion downgrade is currently broken but we have fixing it on our
> plate for Kafka 3.7.
>
> The way downgrade works is that "new features" are dropped, leaving only
> the old ones.
>
> > 55. CleanShutdownFile: Should we add a version field for future
> extension?
> >
> > 56. Config changes are public facing. Could we have a separate section to
> > document all the config changes?
>
> +1. A separate section for this would be good.
>
> best,
> Colin
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Sep 25, 2023 at 4:29 PM Calvin Liu 
> > wrote:
> >
> >> Hi Jun
> >> Thanks for the comments.
> >>
> >> 40. If we change to None, it is not guaranteed for no data loss. For
> users
> >> who are not able to validate the data with external resources, manual
> >> intervention does not give a better result but a loss of availability.
> So
> >> practically speaking, the Balance mode would be a better default value.
> >>
> >> 41. No, it represents how we want to do the unclean leader election. If
> it
> >> is false, the unclean leader election will be the old random way.
> >> Otherwise, the unclean recovery will be used.
> >>
> >> 42. Good catch. Updated.
> >>
> >> 43. Only the first 20 topics will be served. Others will be returned
> with
> >> InvalidRequestError
> >>
> >> 44. The order matters. The desired leader entries match with the topic
> >> partition list by the index.
> >>
> >> 45. Thanks! Updated.
> >>
> >> 46. Good advice! Updated.
> >>
> >> 47.1, updated the comment. Basically it will elect the replica in the
> >> desiredLeader field to be the leader
> >>
> >> 47.2 We can let the admin client do the conversion. Using the

[jira] [Created] (KAFKA-15532) ZkWriteBehindLag should not be reported by inactive controllers

2023-10-03 Thread David Arthur (Jira)
David Arthur created KAFKA-15532:


 Summary: ZkWriteBehindLag should not be reported by inactive 
controllers
 Key: KAFKA-15532
 URL: https://issues.apache.org/jira/browse/KAFKA-15532
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.6.0
Reporter: David Arthur


Since only the active controller is performing the dual-write to ZK during a 
migration, it should be the only controller to report the ZkWriteBehindLag 
metric. 

 

Currently, if the controller fails over during a migration, the previous active 
controller will incorrectly report its last value for ZkWriteBehindLag forever. 
Instead, it should report zero.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] 3.6.0 RC0

2023-09-19 Thread David Arthur
ookeeper to 3.8.1" should probably be
> > > > > renamed to include 3.8.2 since code uses version 3.8.2 of
> Zookeeper.
> > > > >
> > > > >
> > > > > Additionally, I have verified the following:
> > > > > 1. release tag is correctly made after the latest commit on the 3.6
> > > > > branch at
> > > > >
> > > >
> https://github.com/apache/kafka/commit/193d8c5be8d79b64c6c19d281322f09e3c5fe7de
> > > > >
> > > > > 2. protocol documentation contains the newly introduced error code
> as
> > > > > part of tiered storage
> > > > >
> > > > > 3. verified that public keys for RM are available at
> > > > > https://keys.openpgp.org/
> > > > >
> > > > > 4. verified that public keys for RM are available at
> > > > > https://people.apache.org/keys/committer/
> > > > >
> > > > > --
> > > > > Divij Vaidya
> > > > >
> > > > > On Tue, Sep 19, 2023 at 12:41 PM Sagar 
> > > > wrote:
> > > > > >
> > > > > > Hey Satish,
> > > > > >
> > > > > > I have commented on KAFKA-15473. I think the changes in the PR
> look
> > > > > fine. I
> > > > > > also feel this need not be a release blocker given there are
> other
> > > > > > possibilities in which duplicates can manifest on the response
> of the
> > > > end
> > > > > > point in question (albeit we can potentially see more in number
> due to
> > > > > > this).
> > > > > >
> > > > > > Would like to hear others' thoughts as well.
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 19, 2023 at 3:14 PM Satish Duggana <
> > > > satish.dugg...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Greg,
> > > > > > > Thanks for reporting the KafkaConnect issue. I replied to this
> issue
> > > > > > > on "Apache Kafka 3.6.0 release" email thread and on
> > > > > > > https://issues.apache.org/jira/browse/KAFKA-15473.
> > > > > > >
> > > > > > > I would like to hear other KafkaConnect experts' opinions on
> whether
> > > > > > > this issue is really a release blocker.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Satish.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, 19 Sept 2023 at 00:27, Greg Harris
> > > > > 
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hey all,
> > > > > > > >
> > > > > > > > I noticed this regression in RC0:
> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-15473
> > > > > > > > I've mentioned it in the release thread, and I'm working on
> a fix.
> > > > > > > >
> > > > > > > > I'm -1 (non-binding) until we determine if this regression
> is a
> > > > > blocker.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > On Mon, Sep 18, 2023 at 10:56 AM Josep Prat
> > > > > 
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Satish,
> > > > > > > > > Thanks for running the release.
> > > > > > > > >
> > > > > > > > > I ran the following validation steps:
> > > > > > > > > - Built from source with Java 11 and Scala 2.13
> > > > > > > > > - Verified Signatures and hashes of the artifacts generated
> > > > > > > > > - Navigated through Javadoc including links to JDK classes
> > > > > > > > > - Run the unit tests
> > > > > > > > > - Run integration tests
> > > > > > > > > - Run the quickstart in KRaft and Zoo

Re: [VOTE] 3.6.0 RC0

2023-09-18 Thread David Arthur
Hey Satish, thanks for getting the RC underway!

I noticed that the PR for the 3.6 blog post is merged. This makes the blog
post live on the Kafka website https://kafka.apache.org/blog.html. The blog
post (along with other public announcements) is usually the last thing we
do as part of the release. I think we should probably take this down until
we're done with the release, otherwise users stumbling on this post could
get confused. It also contains some broken links.

Thanks!
David

On Sun, Sep 17, 2023 at 1:31 PM Satish Duggana 
wrote:

> Hello Kafka users, developers and client-developers,
>
> This is the first candidate for the release of Apache Kafka 3.6.0. Some of
> the major features include:
>
> * KIP-405 : Kafka Tiered Storage
> * KIP-868 : KRaft Metadata Transactions
> * KIP-875: First-class offsets support in Kafka Connect
> * KIP-898: Modernize Connect plugin discovery
> * KIP-938: Add more metrics for measuring KRaft performance
> * KIP-902: Upgrade Zookeeper to 3.8.1
> * KIP-917: Additional custom metadata for remote log segment
>
> Release notes for the 3.6.0 release:
> https://home.apache.org/~satishd/kafka-3.6.0-rc0/RELEASE_NOTES.html
>
> *** Please download, test and vote by Wednesday, September 21, 12pm PT
>
> Kafka's KEYS file containing PGP keys we use to sign the release:
> https://kafka.apache.org/KEYS
>
> * Release artifacts to be voted upon (source and binary):
> https://home.apache.org/~satishd/kafka-3.6.0-rc0/
>
> * Maven artifacts to be voted upon:
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
>
> * Javadoc:
> https://home.apache.org/~satishd/kafka-3.6.0-rc0/javadoc/
>
> * Tag to be voted upon (off 3.6 branch) is the 3.6.0 tag:
> https://github.com/apache/kafka/releases/tag/3.6.0-rc0
>
> * Documentation:
> https://kafka.apache.org/36/documentation.html
>
> * Protocol:
> https://kafka.apache.org/36/protocol.html
>
> * Successful Jenkins builds for the 3.6 branch:
> There are a few runs of unit/integration tests. You can see the latest at
> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.6/. We will
> continue
> running a few more iterations.
> System tests:
> We will send an update once we have the results.
>
> Thanks,
> Satish.
>


-- 
David Arthur


[jira] [Resolved] (KAFKA-15450) Disable ZK migration when JBOD configured

2023-09-12 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15450.
--
Resolution: Fixed

> Disable ZK migration when JBOD configured
> -
>
> Key: KAFKA-15450
> URL: https://issues.apache.org/jira/browse/KAFKA-15450
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.4.1, 3.6.0, 3.5.1
>    Reporter: David Arthur
>Priority: Critical
> Fix For: 3.6.0, 3.4.2, 3.5.2
>
>
> Since JBOD is not yet supported in KRaft (see 
> [KIP-858|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]),
>  we need to prevent users from starting a ZK to KRaft migration if JBOD is 
> used.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Apache Kafka 3.6.0 release

2023-09-12 Thread David Arthur
Satish,

KAFKA-15450 is merged to 3.6 (as well as trunk, 3.5, and 3.4)

Thanks!
David

On Tue, Sep 12, 2023 at 11:44 AM Ismael Juma  wrote:

> Justine,
>
> Probably best to have the conversation in the JIRA ticket vs the release
> thread. Generally, we want to only include low risk bug fixes that are
> fully compatible in patch releases.
>
> Ismael
>
> On Tue, Sep 12, 2023 at 7:16 AM Justine Olshan
> 
> wrote:
>
> > Thanks Satish. I understand.
> > Just curious, is this something that could be added to 3.6.1? It would be
> > nice to say that hanging transactions are fully covered in a 3.6 release.
> > I'm not as familiar with the rules around minor releases, but adding it
> > there would give more time to ensure stability.
> >
> > Thanks,
> > Justine
> >
> > On Tue, Sep 12, 2023 at 5:49 AM Satish Duggana  >
> > wrote:
> >
> > > Hi Justine,
> > > We can skip this change into 3.6 now as it is not a blocker or
> > > regression and it involves changes to the API implementation. Let us
> > > plan to add the gap in the release notes as you mentioned.
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Tue, 12 Sept 2023 at 04:44, Justine Olshan
> > >  wrote:
> > > >
> > > > Hey Satish,
> > > >
> > > > We just discovered a gap in KIP-890 part 1. We currently don't verify
> > on
> > > > txn offset commits, so it is still possible to have hanging
> > transactions
> > > on
> > > > the consumer offsets partitions.
> > > > I've opened a jira to wire the verification in that request.
> > > > https://issues.apache.org/jira/browse/KAFKA-15449
> > > >
> > > > This also isn't a regression, but it would be nice to have part 1
> fully
> > > > complete. I have opened a PR with the fix:
> > > > https://github.com/apache/kafka/pull/14370.
> > > >
> > > > I understand if there are concerns about last minute changes to this
> > API
> > > > and we can hold off if that makes the most sense.
> > > > If we take that route, I think we should still keep verification for
> > the
> > > > data partitions since it still provides full protection there and
> > > improves
> > > > the transactions experience. We will need to call out the gap in the
> > > > release notes for consumer offsets partitions
> > > >
> > > > Let me know what you think.
> > > > Justine
> > > >
> > > >
> > > > On Mon, Sep 11, 2023 at 12:29 PM David Arthur
> > > >  wrote:
> > > >
> > > > > Another (small) ZK migration issue was identified. This one isn't a
> > > > > regression (it has existed since 3.4), but I think it's reasonable
> to
> > > > > include. It's a small configuration check that could potentially
> save
> > > end
> > > > > users from some headaches down the line.
> > > > >
> > > > > https://issues.apache.org/jira/browse/KAFKA-15450
> > > > > https://github.com/apache/kafka/pull/14367
> > > > >
> > > > > I think we can get this one committed to trunk today.
> > > > >
> > > > > -David
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Sep 10, 2023 at 7:50 PM Ismael Juma 
> > wrote:
> > > > >
> > > > > > Hi Satish,
> > > > > >
> > > > > > That sounds great. I think we should aim to only allow blockers
> > > > > > (regressions, impactful security issues, etc.) on the 3.6 branch
> > > until
> > > > > > 3.6.0 is out.
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > >
> > > > > > On Sat, Sep 9, 2023, 12:20 AM Satish Duggana <
> > > satish.dugg...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ismael,
> > > > > > > It looks like we will publish RC0 by 14th Sep.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Satish.
> > > > > > >
> > > > > > > On Fri, 8 Sept 2023 at 19:23, Ismael Juma 
> > > wrote:
> > > > > > > >
> > > > > > > > Hi Satish,
> > > > > > > >
> > > >

Re: Apache Kafka 3.6.0 release

2023-09-11 Thread David Arthur
Another (small) ZK migration issue was identified. This one isn't a
regression (it has existed since 3.4), but I think it's reasonable to
include. It's a small configuration check that could potentially save end
users from some headaches down the line.

https://issues.apache.org/jira/browse/KAFKA-15450
https://github.com/apache/kafka/pull/14367

I think we can get this one committed to trunk today.

-David



On Sun, Sep 10, 2023 at 7:50 PM Ismael Juma  wrote:

> Hi Satish,
>
> That sounds great. I think we should aim to only allow blockers
> (regressions, impactful security issues, etc.) on the 3.6 branch until
> 3.6.0 is out.
>
> Ismael
>
>
> On Sat, Sep 9, 2023, 12:20 AM Satish Duggana 
> wrote:
>
> > Hi Ismael,
> > It looks like we will publish RC0 by 14th Sep.
> >
> > Thanks,
> > Satish.
> >
> > On Fri, 8 Sept 2023 at 19:23, Ismael Juma  wrote:
> > >
> > > Hi Satish,
> > >
> > > Do you have a sense of when we'll publish RC0?
> > >
> > > Thanks,
> > > Ismael
> > >
> > > On Fri, Sep 8, 2023 at 6:27 AM David Arthur
> > >  wrote:
> > >
> > > > Quick update on my two blockers: KAFKA-15435 is merged to trunk and
> > > > cherry-picked to 3.6. I have a PR open for KAFKA-15441 and will
> > hopefully
> > > > get it merged today.
> > > >
> > > > -David
> > > >
> > > > On Fri, Sep 8, 2023 at 5:26 AM Ivan Yurchenko 
> wrote:
> > > >
> > > > > Hi Satish and all,
> > > > >
> > > > > I wonder if https://issues.apache.org/jira/browse/KAFKA-14993
> > should be
> > > > > included in the 3.6 release plan. I'm thinking that when
> > implemented, it
> > > > > would be a small, but still a change in the RSM contract: throw an
> > > > > exception instead of returning an empty InputStream. Maybe it
> should
> > be
> > > > > included right away to save the migration later? What do you think?
> > > > >
> > > > > Best,
> > > > > Ivan
> > > > >
> > > > > On Fri, Sep 8, 2023, at 02:52, Satish Duggana wrote:
> > > > > > Hi Jose,
> > > > > > Thanks for looking into this issue and resolving it with a quick
> > fix.
> > > > > >
> > > > > > ~Satish.
> > > > > >
> > > > > > On Thu, 7 Sept 2023 at 21:40, José Armando García Sancio
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi Satish,
> > > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 4:58 PM Satish Duggana <
> > > > > satish.dugg...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi Greg,
> > > > > > > > It seems https://issues.apache.org/jira/browse/KAFKA-14273
> has
> > > > been
> > > > > > > > there in 3.5.x too.
> > > > > > >
> > > > > > > I also agree that it should be a blocker for 3.6.0. It should
> > have
> > > > > > > been a blocker for those previous releases. I didn't fix it
> > because,
> > > > > > > unfortunately, I wasn't aware of the issue and jira.
> > > > > > > I'll create a PR with a fix in case the original author doesn't
> > > > > respond in time.
> > > > > > >
> > > > > > > Satish, do you agree?
> > > > > > >
> > > > > > > Thanks!
> > > > > > > --
> > > > > > > -José
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -David
> > > >
> >
>


-- 
-David


[jira] [Created] (KAFKA-15450) Disable ZK migration when JBOD configured

2023-09-11 Thread David Arthur (Jira)
David Arthur created KAFKA-15450:


 Summary: Disable ZK migration when JBOD configured
 Key: KAFKA-15450
 URL: https://issues.apache.org/jira/browse/KAFKA-15450
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.5.1, 3.4.1, 3.6.0
Reporter: David Arthur
 Fix For: 3.6.0


Since JBOD is not yet supported in KRaft (see 
[KIP-858|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]),
 we need to prevent users from starting a ZK to KRaft migration if JBOD is used.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15441) Broker sessions can time out during ZK migration

2023-09-08 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15441.
--
Resolution: Fixed

> Broker sessions can time out during ZK migration
> 
>
> Key: KAFKA-15441
> URL: https://issues.apache.org/jira/browse/KAFKA-15441
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.6.0
>    Reporter: David Arthur
>Assignee: David Arthur
>Priority: Blocker
> Fix For: 3.6.0
>
>
> When a ZK to KRaft migration takes more than a few seconds to complete, the 
> sessions between the ZK brokers and the KRaft controller will expire. This 
> appears to be due to the heartbeat events being blocked in the purgatory on 
> the controller.
> The side effect of this expiration is that after the metadata is migrated, 
> the KRaft controller will immediately fence all of the brokers and remove 
> them from ISRs. This leads to a mass leadership change that can cause large 
> latency spikes on the brokers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Apache Kafka 3.6.0 release

2023-09-08 Thread David Arthur
Quick update on my two blockers: KAFKA-15435 is merged to trunk and
cherry-picked to 3.6. I have a PR open for KAFKA-15441 and will hopefully
get it merged today.

-David

On Fri, Sep 8, 2023 at 5:26 AM Ivan Yurchenko  wrote:

> Hi Satish and all,
>
> I wonder if https://issues.apache.org/jira/browse/KAFKA-14993 should be
> included in the 3.6 release plan. I'm thinking that when implemented, it
> would be a small, but still a change in the RSM contract: throw an
> exception instead of returning an empty InputStream. Maybe it should be
> included right away to save the migration later? What do you think?
>
> Best,
> Ivan
>
> On Fri, Sep 8, 2023, at 02:52, Satish Duggana wrote:
> > Hi Jose,
> > Thanks for looking into this issue and resolving it with a quick fix.
> >
> > ~Satish.
> >
> > On Thu, 7 Sept 2023 at 21:40, José Armando García Sancio
> >  wrote:
> > >
> > > Hi Satish,
> > >
> > > On Wed, Sep 6, 2023 at 4:58 PM Satish Duggana <
> satish.dugg...@gmail.com> wrote:
> > > >
> > > > Hi Greg,
> > > > It seems https://issues.apache.org/jira/browse/KAFKA-14273 has been
> > > > there in 3.5.x too.
> > >
> > > I also agree that it should be a blocker for 3.6.0. It should have
> > > been a blocker for those previous releases. I didn't fix it because,
> > > unfortunately, I wasn't aware of the issue and jira.
> > > I'll create a PR with a fix in case the original author doesn't
> respond in time.
> > >
> > > Satish, do you agree?
> > >
> > > Thanks!
> > > --
> > > -José
> >
>


-- 
-David


[jira] [Resolved] (KAFKA-15435) KRaft migration record counts in log message are incorrect

2023-09-08 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15435.
--
Resolution: Fixed

> KRaft migration record counts in log message are incorrect
> --
>
> Key: KAFKA-15435
> URL: https://issues.apache.org/jira/browse/KAFKA-15435
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft
>Affects Versions: 3.6.0
>Reporter: David Arthur
>    Assignee: David Arthur
>Priority: Blocker
> Fix For: 3.6.0
>
>
> The counting logic in MigrationManifest is incorrect and produces invalid 
> output. This information is critical for users wanting to validate the result 
> of a migration.
>  
> {code}
> Completed migration of metadata from ZooKeeper to KRaft. 7117 records were 
> generated in 54253 ms across 1629 batches. The record types were 
> {TOPIC_RECORD=2, CONFIG_RECORD=2, PARTITION_RECORD=2, 
> ACCESS_CONTROL_ENTRY_RECORD=2, PRODUCER_IDS_RECORD=1}. 
> {code}
> Due to the logic bug, the counts will never exceed 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Apache Kafka 3.6.0 release

2023-09-06 Thread David Arthur
Thanks, Satish! Here's another blocker
https://issues.apache.org/jira/browse/KAFKA-15441 :)

For the 3.6 release notes and announcement, I'd like to include a special
note about ZK to KRaft migrations being GA (Generally Available). We have
finished closing all the gaps from the earlier releases of ZK migrations
(e.g., ACLs, SCRAM), so it is now possible to migrate all metadata to
KRaft. We have also made the migration more reliable and fault
tolerant with the inclusion of KIP-868 transactions. I'd be happy to write
something for the release notes when the time comes, if it's helpful.

Thanks!
David

On Tue, Sep 5, 2023 at 8:13 PM Satish Duggana 
wrote:

> Hi David,
> Thanks for bringing this issue to this thread.
> I marked https://issues.apache.org/jira/browse/KAFKA-15435 as a blocker.
>
> Thanks,
> Satish.
>
> On Tue, 5 Sept 2023 at 21:29, David Arthur  wrote:
> >
> > Hi Satish. Thanks for running the release!
> >
> > I'd like to raise this as a blocker for 3.6
> > https://issues.apache.org/jira/browse/KAFKA-15435.
> >
> > It's a very quick fix, so I should be able to post a PR soon.
> >
> > Thanks!
> > David
> >
> > On Mon, Sep 4, 2023 at 11:44 PM Justine Olshan
> 
> > wrote:
> >
> > > Thanks Satish. This is done 
> > >
> > > Justine
> > >
> > > On Mon, Sep 4, 2023 at 5:16 PM Satish Duggana <
> satish.dugg...@gmail.com>
> > > wrote:
> > >
> > > > Hey Justine,
> > > > I went through KAFKA-15424 and the PR[1]. It seems there are no
> > > > dependent changes missing in 3.6 branch. They seem to be low risk as
> > > > you mentioned. Please merge it to the 3.6 branch as well.
> > > >
> > > > 1. https://github.com/apache/kafka/pull/14324.
> > > >
> > > > Thanks,
> > > > Satish.
> > > >
> > > > On Tue, 5 Sept 2023 at 05:06, Justine Olshan
> > > >  wrote:
> > > > >
> > > > > Sorry I meant to add the jira as well.
> > > > > https://issues.apache.org/jira/browse/KAFKA-15424
> > > > >
> > > > > Justine
> > > > >
> > > > > On Mon, Sep 4, 2023 at 4:34 PM Justine Olshan <
> jols...@confluent.io>
> > > > wrote:
> > > > >
> > > > > > Hey Satish,
> > > > > >
> > > > > > I was working on adding dynamic configuration for
> > > > > > transaction verification. The PR is approved and ready to merge
> into
> > > > trunk.
> > > > > > I was thinking I could also add it to 3.6 since it is fairly low
> > > risk.
> > > > > > What do you think?
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Sat, Sep 2, 2023 at 6:21 PM Sophie Blee-Goldman <
> > > > ableegold...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Thanks Satish! The fix has been merged and cherrypicked to 3.6
> > > > > >>
> > > > > >> On Sat, Sep 2, 2023 at 6:02 AM Satish Duggana <
> > > > satish.dugg...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Sophie,
> > > > > >> > Please feel free to add that to 3.6 branch as you say this is
> a
> > > > minor
> > > > > >> > change and will not cause any regressions.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Satish.
> > > > > >> >
> > > > > >> > On Sat, 2 Sept 2023 at 08:44, Sophie Blee-Goldman
> > > > > >> >  wrote:
> > > > > >> > >
> > > > > >> > > Hey Satish, someone reported a minor bug in the Streams
> > > > application
> > > > > >> > > shutdown which was a recent regression, though not strictly
> a
> > > new
> > > > one:
> > > > > >> > was
> > > > > >> > > introduced in 3.4 I believe.
> > > > > >> > >
> > > > > >> > > The fix seems to be super lightweight and low-risk so I was
> > > > hoping to
> > > > > >> > slip
> > > > > >> > > it into 3.6 if that's ok with you? They plan to have the
> patch
> > > > > >> tonight

[jira] [Created] (KAFKA-15441) Broker sessions can time out during ZK migration

2023-09-06 Thread David Arthur (Jira)
David Arthur created KAFKA-15441:


 Summary: Broker sessions can time out during ZK migration
 Key: KAFKA-15441
 URL: https://issues.apache.org/jira/browse/KAFKA-15441
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.6.0
Reporter: David Arthur
Assignee: David Arthur


When a ZK to KRaft migration takes more than a few seconds to complete, the 
sessions between the ZK brokers and the KRaft controller will expire. This 
appears to be due to the heartbeat events being blocked in the purgatory on the 
controller.

The side effect of this expiration is that after the metadata is migrated, the 
KRaft controller will immediately fence all of the brokers and remove them from 
ISRs. This leads to a mass leadership change that can cause large latency 
spikes on the brokers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Apache Kafka 3.6.0 release

2023-09-05 Thread David Arthur
for
> > > >> > > > > >> what
> > > >> > > > > >> > > early
> > > >> > > > > >> > > > > > > > access
> > > >> > > > > >> > > > > > > > > > > means.
> > > >> > > > > >> > > > > > > > > > > > > >
> > > >> > > > > >> > > > > > > > > > > > > > Does this make sense?
> > > >> > > > > >> > > > > > > > > > > > > >
> > > >> > > > > >> > > > > > > > > > > > > > Ismael
> > > >> > > > > >> > > > > > > > > > > > > >
> > > >> > > > > >> > > > > > > > > > > > > > On Thu, Jul 27, 2023 at 6:38 PM
> > Divij
> > > >> > > > Vaidya <
> > > >> > > > > >> > > > > > > > > > > divijvaidy...@gmail.com>
> > > >> > > > > >> > > > > > > > > > > > > > wro

[jira] [Created] (KAFKA-15435) KRaft migration record counts in log message are incorrect

2023-09-05 Thread David Arthur (Jira)
David Arthur created KAFKA-15435:


 Summary: KRaft migration record counts in log message are incorrect
 Key: KAFKA-15435
 URL: https://issues.apache.org/jira/browse/KAFKA-15435
 Project: Kafka
  Issue Type: Bug
  Components: kraft
Affects Versions: 3.6.0
Reporter: David Arthur


The counting logic in MigrationManifest is incorrect and produces invalid 
output. This information is critical for users wanting to validate the result 
of a migration.

 
{code}
Completed migration of metadata from ZooKeeper to KRaft. 7117 records were 
generated in 54253 ms across 1629 batches. The record types were 
{TOPIC_RECORD=2, CONFIG_RECORD=2, PARTITION_RECORD=2, 
ACCESS_CONTROL_ENTRY_RECORD=2, PRODUCER_IDS_RECORD=1}. 
{code}

Due to the logic bug, the counts will never exceed 2.
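
For illustration, a minimal sketch (hypothetical class and enum names, not the actual MigrationManifest code) of the accumulation pattern the manifest needs: each record must add to the running total for its type rather than overwrite it, otherwise the per-type counts plateau the way the log line above shows.

{code:java}
import java.util.EnumMap;
import java.util.Map;

public final class RecordTypeCounterSketch {
    // Illustrative stand-in for the metadata record types listed in the manifest.
    enum RecordType { TOPIC_RECORD, CONFIG_RECORD, PARTITION_RECORD, ACCESS_CONTROL_ENTRY_RECORD, PRODUCER_IDS_RECORD }

    private final Map<RecordType, Integer> counts = new EnumMap<>(RecordType.class);

    // Accumulate: read the running total for the type and add one. Overwriting the
    // entry with a freshly computed small value is the kind of bug that caps counts.
    void record(RecordType type) {
        counts.merge(type, 1, Integer::sum);
    }

    Map<RecordType, Integer> counts() {
        return counts;
    }
}
{code}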



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15389) MetadataLoader may publish an empty image on first start

2023-08-21 Thread David Arthur (Jira)
David Arthur created KAFKA-15389:


 Summary: MetadataLoader may publish an empty image on first start
 Key: KAFKA-15389
 URL: https://issues.apache.org/jira/browse/KAFKA-15389
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur


When first loading from an empty log, there is a case where MetadataLoader can 
publish an image before the bootstrap records are processed. This isn't exactly 
incorrect, since all components implicitly start from the empty image state, 
but it might be unexpected for some MetadataPublishers. 

 

For example, in KRaftMigrationDriver, if an old MetadataVersion is encountered, 
the driver transitions to the INACTIVE state.
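
For illustration, a minimal sketch (hypothetical names, not the actual MetadataLoader) of the kind of gating described above: hold back the first publish until the bootstrap records have been applied, so publishers such as KRaftMigrationDriver never observe the empty image.

{code:java}
public final class FirstPublishGateSketch {
    private volatile boolean bootstrapApplied = false;

    // Called once the bootstrap records (e.g. the initial metadata.version) are applied.
    void onBootstrapApplied() {
        bootstrapApplied = true;
    }

    // Publishers only see an image that already reflects the bootstrap records.
    void maybePublish(Runnable publishImage) {
        if (!bootstrapApplied) {
            return; // suppress the empty image on first start
        }
        publishImage.run();
    }
}
{code}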



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15381) Controller waiting for migration should only allow failover when transactions are supported

2023-08-18 Thread David Arthur (Jira)
David Arthur created KAFKA-15381:


 Summary: Controller waiting for migration should only allow 
failover when transactions are supported
 Key: KAFKA-15381
 URL: https://issues.apache.org/jira/browse/KAFKA-15381
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur
 Fix For: 3.6.0


After a KRaft controller starts up in migration mode, it enters the 
"pre-migration" state. Unless transactions are supported, it is not safe for 
the controller to fail over in pre-migration mode. This is because a migration 
could have been partially committed when the failover occurs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15374) ZK migration fails on configs for default broker resource

2023-08-17 Thread David Arthur (Jira)
David Arthur created KAFKA-15374:


 Summary: ZK migration fails on configs for default broker resource
 Key: KAFKA-15374
 URL: https://issues.apache.org/jira/browse/KAFKA-15374
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.5.1, 3.4.1
Reporter: David Arthur
 Fix For: 3.6.0, 3.4.2, 3.5.2


This error was seen while performing a ZK to KRaft migration on a cluster with 
configs for the default broker resource

 
{code:java}
java.lang.NumberFormatException: For input string: ""
at 
java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
at java.base/java.lang.Integer.parseInt(Integer.java:678)
at java.base/java.lang.Integer.valueOf(Integer.java:999)
at 
kafka.zk.ZkMigrationClient.$anonfun$migrateBrokerConfigs$2(ZkMigrationClient.scala:371)
at 
kafka.zk.migration.ZkConfigMigrationClient.$anonfun$iterateBrokerConfigs$1(ZkConfigMigrationClient.scala:174)
at 
kafka.zk.migration.ZkConfigMigrationClient.$anonfun$iterateBrokerConfigs$1$adapted(ZkConfigMigrationClient.scala:156)
at 
scala.collection.immutable.BitmapIndexedMapNode.foreach(HashMap.scala:1076)
at scala.collection.immutable.HashMap.foreach(HashMap.scala:1083)
at 
kafka.zk.migration.ZkConfigMigrationClient.iterateBrokerConfigs(ZkConfigMigrationClient.scala:156)
at 
kafka.zk.ZkMigrationClient.migrateBrokerConfigs(ZkMigrationClient.scala:370)
at 
kafka.zk.ZkMigrationClient.cleanAndMigrateAllMetadata(ZkMigrationClient.scala:530)
at 
org.apache.kafka.metadata.migration.KRaftMigrationDriver$MigrateMetadataEvent.run(KRaftMigrationDriver.java:618)
at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
at java.base/java.lang.Thread.run(Thread.java:833)
at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:64) 
{code}
 

This is due to not considering the default resource type when we collect the 
broker IDs in ZkMigrationClient#migrateBrokerConfigs.
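
For illustration, a minimal sketch of the guard the iteration needs (hypothetical names; the real code lives in the Scala ZkMigrationClient): the default broker resource has a non-numeric (here empty) name, so it must be filtered out before the remaining names are parsed as broker IDs.

{code:java}
import java.util.ArrayList;
import java.util.List;

public final class BrokerIdCollectorSketch {
    // Per-broker configs are keyed by numeric broker IDs; the cluster-wide default
    // resource uses a non-numeric name and must be skipped, not fed to Integer.valueOf.
    static List<Integer> brokerIds(List<String> configResourceNames) {
        List<Integer> ids = new ArrayList<>();
        for (String name : configResourceNames) {
            if (name.isEmpty() || !name.chars().allMatch(Character::isDigit)) {
                continue; // default broker resource (or otherwise non-numeric name)
            }
            ids.add(Integer.valueOf(name));
        }
        return ids;
    }
}
{code}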

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15263) KRaftMigrationDriver can run the migration twice

2023-07-28 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15263.
--
Resolution: Fixed

> KRaftMigrationDriver can run the migration twice
> 
>
> Key: KAFKA-15263
> URL: https://issues.apache.org/jira/browse/KAFKA-15263
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.5.0, 3.5.1
>    Reporter: David Arthur
>Assignee: David Arthur
>Priority: Blocker
> Fix For: 3.6.0, 3.5.2
>
>
> There is a narrow race condition in KRaftMigrationDriver where a PollEvent 
> can run that sees the internal state as ZK_MIGRATION and is immediately 
> followed by another poll event (due to external call to {{{}wakeup(){}}}) 
> that results in two MigrateMetadataEvent being enqueued. 
> Since MigrateMetadataEvent lacks a check on the internal state, this causes 
> the metadata migration to occur twice. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15263) KRaftMigrationDriver can run the migration twice

2023-07-27 Thread David Arthur (Jira)
David Arthur created KAFKA-15263:


 Summary: KRaftMigrationDriver can run the migration twice
 Key: KAFKA-15263
 URL: https://issues.apache.org/jira/browse/KAFKA-15263
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur
Assignee: David Arthur


There is a narrow race condition in KRaftMigrationDriver where a PollEvent can 
run that sees the internal state as ZK_MIGRATION and is immediately followed by 
another poll event (due to an external call to {{{}wakeup(){}}}), resulting in 
two MigrateMetadataEvent-s being enqueued. 

Since MigrateMetadataEvent lacks a check on the internal state, this causes the 
metadata migration to occur twice. 
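
For illustration, a minimal sketch (hypothetical names and states, not the actual KRaftMigrationDriver) of the guard the ticket describes: the event atomically checks and advances the driver state, so a duplicate enqueue from a second poll or wakeup() becomes a no-op instead of a second migration.

{code:java}
import java.util.concurrent.atomic.AtomicReference;

public final class MigrationEventGuardSketch {
    enum State { ZK_MIGRATION, MIGRATION_RUNNING, DUAL_WRITE }

    private final AtomicReference<State> state = new AtomicReference<>(State.ZK_MIGRATION);

    // Enqueued by each PollEvent/wakeup(); only the first invocation runs the migration.
    void runMigrateMetadataEvent(Runnable migrateMetadata) {
        if (!state.compareAndSet(State.ZK_MIGRATION, State.MIGRATION_RUNNING)) {
            return; // a previous event already started (or finished) the migration
        }
        migrateMetadata.run();
        state.set(State.DUAL_WRITE);
    }
}
{code}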



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-919: Allow AdminClient to Talk Directly with the KRaft Controller Quorum and add Controller Registration

2023-07-26 Thread David Arthur
Thanks for driving this KIP, Colin!

+1 binding

-David

On Wed, Jul 26, 2023 at 8:58 AM Divij Vaidya 
wrote:

> +1 (binding)
>
> --
> Divij Vaidya
>
>
> On Wed, Jul 26, 2023 at 2:56 PM ziming deng 
> wrote:
> >
> > +1 (binding) from me.
> >
> > Thanks for the KIP!
> >
> > --
> > Ziming
> >
> > > On Jul 26, 2023, at 20:18, Luke Chen  wrote:
> > >
> > > +1 (binding) from me.
> > >
> > > Thanks for the KIP!
> > >
> > > Luke
> > >
> > > On Tue, Jul 25, 2023 at 1:24 AM Colin McCabe 
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'd like to start the vote for KIP-919: Allow AdminClient to Talk
> Directly
> > >> with the KRaft Controller Quorum and add Controller Registration.
> > >>
> > >> The KIP is here: https://cwiki.apache.org/confluence/x/Owo0Dw
> > >>
> > >> Thanks to everyone who reviewed the proposal.
> > >>
> > >> best,
> > >> Colin
> > >>
> >
>


-- 
-David


Re: KIP-919: Allow AdminClient to Talk Directly with the KRaft Controller Quorum

2023-07-21 Thread David Arthur
Hey Colin, thanks for the KIP! Some questions

1) "This registration will include information about the endpoints which
they possess". Will this include all endpoints, or only those configured in
"advertised.listeners"?

2) "Periodically, each controller will check that the controller
registration for its ID is as expected."  Does this need to be a periodic
check? Since the controller registration state will be in the log, can't
the follower just react to unexpected incarnation IDs (after it's caught
up)?

3) ControllerRegistrationRequest has a typo in the listeners section (it
mentions "broker")

4) Since we can't rely on the ApiVersions data, should we remove the field
we added to ApiVersionsResponse in KIP-866?

5)I filed https://issues.apache.org/jira/browse/KAFKA-15230 for the issues
mentioned under "Controller Changes" in case you want to link it

6) I don't see it explicitly mentioned, but I think it's the case that the
active controller must accept and persist any controller registration it
receives. This is unlike the behavior of broker registrations where we can
reject brokers we don't want. For controllers, I don't think we have that
option unless we go for some tighter Raft integration. Since the followers
must be participating in Raft to learn about the leader (and therefore,
will have replayed the full log), we can't really say "no" at that point.


Cheers,
David


On Thu, Jul 20, 2023 at 7:23 PM Colin McCabe  wrote:

> On Tue, Jul 18, 2023, at 09:30, Mickael Maison wrote:
> > H Colin,
> >
> > Thanks for the KIP.
> >
> > Just a few points:
> > 1. As Tom mentioned it would be good to clarify the APIs we expect
> > available on controllers. I assume we want to add DESCRIBE_CONFIGS as
> > part of this KIP.
>
> Hi Mickael,
>
> Yes, this is a good point. I added a table describing the APIs that will
> now be added.
>
> > 2. Currently we have no way of retrieving the list of configs that
> > apply to controllers. It would be good to have an object, so we can
> > add that to the docs but also use that in kafka-configs.
>
> I think this is out of scope.
>
> > 3. Should we have a new entity-type in kafka-configs for setting
> > controller configs?
>
> The BROKER entity type already applies to controllers. It probably needs a
> new name (NODE would be better) but that's out of scope for this KIP, I
> think.
>
> best,
> Colin
>
>
> >
> > Thanks,
> > Mickael
> >
> > On Tue, Jul 4, 2023 at 2:20 PM Luke Chen  wrote:
> >>
> >> Hi Colin,
> >>
> >> Thanks for the answers to my previous questions.
> >>
> >> > Yes, the common thread here is that all of these shell commands
> perform
> >> operations can be done without the broker. So it's reasonable to allow
> them
> >> to be done without going through the broker. I don't know if we need a
> >> separate note for each since the rationale is really the same for all
> (is
> >> it reasonable? if so allow it.)
> >>
> >> Yes, it makes sense. Could we make a note about the main rationale for
> >> selecting these command-line tools in the KIP to make it clear?
> >> Ex: The following command-line tools will get a new
> --bootstrap-controllers
> >> argument (because these shell commands perform operations can be done
> >> without the broker):
> >>
> >> > kafka-reassign-partitions.sh cannot be used to move the
> >> __cluster_metadata topic. However, it can be used to move partitions
> that
> >> reside on the brokers, even when using --bootstrap-controllers to talk
> >> directly to the quorum.
> >>
> >> Fair enough.
> >>
> >>
> >> 4. Does all the command-line tools with `--bootstrap-controllers`
> support
> >> all the options in the tool?
> >> For example, kafka-configs.sh, In addition to the `--alter` option you
> >> mentioned in the example, do we also support `--describe` or `--delete`
> >> option?
> >> If so, do we also support setting "quota" for users/clients/topics...
> via
> >> `--bootstrap-controllers`? (not intuitive, but maybe we just directly
> >> commit the change into the metadata from controller?)
> >>
> >> 5. Do we have any plan for this feature to be completed? v3.6.0?
> >>
> >>
> >> Thank you.
> >> Luke
> >>
> >>
> >> On Fri, Apr 28, 2023 at 1:42 AM Colin McCabe 
> wrote:
> >>
> >> > On Wed, Apr 26, 2023, at 22:08, Luke Chen wrote:
> >> > > Hi Colin,
> >> > >
> >> > > Some comments:
> >> > > 1. I agree we should set "top-level" errors for metadata response
> >> > >
> >> > > 2. In the "brokers" field of metadata response from controller,
> it'll
> >> > > respond with "Controller endpoint information as given in
> >> > > controller.quorum.voters", instead of the "alive"
> controllers(voters).
> >> > That
> >> > > will break the existing admin client because in admin client, we'll
> rely
> >> > on
> >> > > the metadata response to build the "current alive brokers" list, and
> >> > choose
> >> > > one from them to connect (either least load or other criteria). That
> >> > means,
> >> > > if now, we return the value in `controller.quorum.voters`, but one
> of
> 

[jira] [Created] (KAFKA-15230) ApiVersions data between controllers is not reliable

2023-07-21 Thread David Arthur (Jira)
David Arthur created KAFKA-15230:


 Summary: ApiVersions data between controllers is not reliable
 Key: KAFKA-15230
 URL: https://issues.apache.org/jira/browse/KAFKA-15230
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur


While testing ZK migrations, I noticed a case where the controller was not 
starting the migration due to the missing ApiVersions data from other 
controllers. This was unexpected because the quorum was running and the 
followers were replicating the metadata log as expected. After examining a heap 
dump of the leader, it was in fact the case that the ApiVersions map of 
NodeApiVersions was empty.

 

After further investigation and offline discussion with [~jsancio], we realized 
that after the initial leader election, the connection from the Raft leader to 
the followers will become idle and eventually timeout and close. This causes 
NetworkClient to purge the NodeApiVersions data for the closed connections.

 

There are two main side effects of this behavior: 

1) If the migration is not started within the idle timeout period (10 minutes 
by default), it cannot be started afterwards. After this timeout period, I was 
unable to restart the controllers in such a way that the leader had active 
connections with all followers.

2) Dynamically updating features, such as "metadata.version", is not guaranteed 
to be safe

 

There is a partial workaround for the migration issue. If we set 
"connections.max.idle.ms" to -1, the Raft leader will never disconnect from the 
followers. However, if a follower restarts, the leader will not re-establish a 
connection.
 
The feature update issue has no safe workarounds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15196) Additional ZK migration metrics

2023-07-17 Thread David Arthur (Jira)
David Arthur created KAFKA-15196:


 Summary: Additional ZK migration metrics
 Key: KAFKA-15196
 URL: https://issues.apache.org/jira/browse/KAFKA-15196
 Project: Kafka
  Issue Type: Sub-task
Reporter: David Arthur
Assignee: David Arthur


This issue is to track the remaining metrics defined in KIP-866. So far, we 
have ZkMigrationState and ZkWriteBehindLag.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15137) Don't log the entire request in KRaftControllerChannelManager

2023-06-30 Thread David Arthur (Jira)
David Arthur created KAFKA-15137:


 Summary: Don't log the entire request in 
KRaftControllerChannelManager
 Key: KAFKA-15137
 URL: https://issues.apache.org/jira/browse/KAFKA-15137
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.5.0, 3.6.0
Reporter: David Arthur
Assignee: Alyssa Huang
 Fix For: 3.5.1


While debugging some junit tests, I noticed some really long log lines in 
KRaftControllerChannelManager. When the broker is down, we log a WARN that 
includes the entire UpdateMetadataRequest or LeaderAndIsrRequest. For large 
clusters, these can be really large requests, so this could potentially cause 
excessive output in the log4j logs.
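
For illustration, a minimal sketch of the direction of the fix (hypothetical names and summary fields, not the actual channel manager code): log a compact summary of the dropped request instead of its full toString() output.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class RequestWarnLoggingSketch {
    private static final Logger log = LoggerFactory.getLogger(RequestWarnLoggingSketch.class);

    // 'requestType' and 'partitionCount' stand in for whatever compact summary the
    // real request classes can provide; the point is to avoid logging the whole request.
    static void warnBrokerUnreachable(int brokerId, String requestType, int partitionCount) {
        log.warn("Broker {} is not reachable; not sending {} covering {} partitions.",
            brokerId, requestType, partitionCount);
    }
}
{code}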



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15098) KRaft migration does not proceed and broker dies if authorizer.class.name is set

2023-06-22 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15098.
--
Resolution: Fixed

> KRaft migration does not proceed and broker dies if authorizer.class.name is 
> set
> 
>
> Key: KAFKA-15098
> URL: https://issues.apache.org/jira/browse/KAFKA-15098
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft
>Affects Versions: 3.5.0
>Reporter: Ron Dagostino
>Assignee: David Arthur
>Priority: Blocker
> Fix For: 3.6.0, 3.5.1
>
>
> [ERROR] 2023-06-16 20:14:14,298 [main] kafka.Kafka$ - Exiting Kafka due to 
> fatal exception
> java.lang.IllegalArgumentException: requirement failed: ZooKeeper migration 
> does not yet support authorizers. Remove authorizer.class.name before 
> performing a migration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15109) ISR shrink/expand issues on ZK brokers during migration

2023-06-22 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-15109.
--
Resolution: Fixed

> ISR shrink/expand issues on ZK brokers during migration
> ---
>
> Key: KAFKA-15109
> URL: https://issues.apache.org/jira/browse/KAFKA-15109
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft, replication
>Affects Versions: 3.6.0
>Reporter: David Arthur
>    Assignee: David Arthur
>Priority: Critical
> Fix For: 3.6.0
>
>
> KAFKA-15021 introduced a new controller behavior that avoids increasing the 
> leader epoch during the controlled shutdown scenario. This prevents some 
> unnecessary thrashing of metadata and threads on the brokers and clients. 
> While a cluster is in a KIP-866 migration and has a KRaft controller with ZK 
> brokers, we cannot employ this leader epoch bump avoidance. The ZK brokers 
> must have the leader epoch bump in order for ReplicaManager to react to the 
> LeaderAndIsrRequest.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15109) ISR not expanding on ZK brokers during migration

2023-06-20 Thread David Arthur (Jira)
David Arthur created KAFKA-15109:


 Summary: ISR not expanding on ZK brokers during migration
 Key: KAFKA-15109
 URL: https://issues.apache.org/jira/browse/KAFKA-15109
 Project: Kafka
  Issue Type: Bug
  Components: kraft, replication
Affects Versions: 3.5.0
Reporter: David Arthur


KAFKA-15021 introduced a new controller behavior that avoids increasing the 
leader epoch during the controlled shutdown scenario. This prevents some 
unnecessary thrashing of metadata and threads on the brokers and clients. 

While a cluster is in a KIP-866 migration and has a KRaft controller with ZK 
brokers, we cannot employ this leader epoch bump avoidance. The ZK brokers must 
have the leader epoch bump in order for the ISR expansion to complete.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-14 Thread David Arthur
>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>

I was thinking more about integration regressions than about flaky
failures. However, if we're running a module's tests as well as the
"affected modules" tests, then I guess we should
be running the correct integration tests for each PR build. I guess this
gets back to Chris's point about how this approach favors the downstream
modules. Maybe that's unavoidable.

Come to think of it, a similar problem exists with system tests. We don't
run these for each PR (or each trunk commit, or even nightly AFAIK) since
they are prohibitively costly/lengthy to run.
However, they do sometimes find integration regressions that the junit
suite missed. In these cases we only have the choice to fix-forward.


On Tue, Jun 13, 2023 at 12:19 PM Greg Harris 
wrote:

> David,
>
> Thanks for finding that gradle plugin. The `changedModules` mode is
> exactly what I had in mind for fairness to modules earlier in the
> dependency graph.
>
> > if we moved to a policy where PRs only need some of the tests to pass
> > to merge, when would we run the full CI? On each trunk commit (i.e., PR
> > merge)?
>
> In a world where the PR runs includes only the changed modules and
> their dependencies, the full suite should be run for each commit on
> trunk and on release branches. I don't think that optimizing the trunk
> build runtime is of great benefit, and the current behavior seems
> reasonable to continue.
>
> > How would we handle commits that break the integration tests? Would
> > we revert commits on trunk, or fix-forward?
>
> This is currently up to committer discretion, and I don't think that
> would change if we were to re-tool the PR builds. In the presence of
> flaky failures, we can't reliably blame failures on particular commits
> without running much more expensive statistical tests.
>
> One place that I often see flakiness present is in new tests, where
> someone has chosen timeouts which work for them locally and in the PR
> build. After some 10s or 100s of runs, the flakiness becomes evident
> and someone looks into a fix-forward.
> I don't necessarily think I would advocate for a hard revert of an
> entire feature if one of the added tests is flaky, but that's my
> discretion. We can adopt a project policy of reverting whatever we
> can, but I don't think that's a more welcoming or productive project
> than we have now.
>
> Greg
>
> On Tue, Jun 13, 2023 at 7:24 AM David Arthur  wrote:
> >
> > Hey folks, interesting discussion.
> >
> > I came across a Gradle plugin that calculates a DAG of modules based on
> the
> > diff and can run only the affected module's tests or the affected +
> > downstream tests.
> >
> > https://github.com/dropbox/AffectedModuleDetector
> >
> > I tested it out locally, and it seems to work as advertised.
> >
> > Greg, if we moved to a policy where PRs only need some of the tests to
> pass
> > to merge, when would we run the full CI? On each trunk commit (i.e., PR
> > merge)? How would we handle commits that break the integration tests?
> Would
> > we revert commits on trunk, or fix-forward?
> >
> > -David
> >
> >
> > On Thu, Jun 8, 2023 at 2:02 PM Greg Harris  >
> > wrote:
> >
> > > Gaurav,
> > >
> > > The target-determinator is certainly the "off-the-shelf" solution I
> > > expected would be out there. If the project migrates to Bazel I think
> > > that would make the partial builds much easier to implement.
> > > I think we should look into the other benefits of migrating to bazel
> > > to see if it is worth it even if the partial builds feature is decided
> > > against, or after it is reverted.
> > >
> > > Chris,
> > >
> > > > Do you think we should aim to disable
> > > > merges without a full suite of passing CI runs (allowing for
> > > administrative
> > > > override in an emergency)? If so, what would the path be from our
> current
> > > > state to there? What can we do to ensure that we don't get stuck
> relying
> > > on
> > > > a once-temporary aid that becomes effectively permanent?
> > >
> > > Yes I think it would be nice to require a green build to merge,
> > > witho

Re: [ANNOUNCE] New committer: Divij Vaidya

2023-06-13 Thread David Arthur
Congrats Divij!

On Tue, Jun 13, 2023 at 12:34 PM Igor Soarez  wrote:

> Congratulations Divij!
>
> --
> Igor
>


-- 
-David


Re: [DISCUSS] Partial CI builds - Reducing flakiness with fewer tests

2023-06-13 Thread David Arthur
; > > correctness of the targets that need to be built and tested?
> > >
> > > Thanks,
> > > Gaurav
> > >
> > > [1]: https://bazel.build
> > > [2]: https://github.com/bazel-contrib/target-determinator
> > > [3]: https://bazel.build/remote/rbe
> > > [4]: https://bazel.build/remote/caching
> > >
> > > On 2023/06/05 17:47:07 Greg Harris wrote:
> > > > Hey all,
> > > >
> > > > I've been working on test flakiness recently, and I've been trying to
> > > > come up with ways to tackle the issue top-down as well as bottom-up,
> > > > and I'm interested to hear your thoughts on an idea.
> > > >
> > > > In addition to the current full-suite runs, can we in parallel
> trigger
> > > > a smaller test run which has only a relevant subset of tests? For
> > > > example, if someone is working on one sub-module, the CI would only
> > > > run tests in that module.
> > > >
> > > > I think this would be more likely to pass than the full suite due to
> > > > the fewer tests failing probabilistically, and would improve the
> > > > signal-to-noise ratio of the summary pass/fail marker on GitHub. This
> > > > should also be shorter to execute than the full suite, allowing for
> > > > faster cycle-time than the current full suite encourages.
> > > >
> > > > This would also strengthen the incentive for contributors
> specializing
> > > > in a module to de-flake tests, as they are rewarded with a tangible
> > > > improvement within their area of the project. Currently, even the
> > > > modules with the most reliable tests receive consistent CI failures
> > > > from other less reliable modules.
> > > >
> > > > I believe this is possible, even if there isn't an off-the-shelf
> > > > solution for it. We can learn of the changed files via a git diff,
> map
> > > > that to modules containing those files, and then execute the tests
> > > > just for those modules with gradle. GitHub also permits showing
> > > > multiple "checks" so that we can emit both the full-suite and partial
> > > > test results.
> > > >
> > > > Thanks,
> > > > Greg
> > > >
>


-- 
David Arthur


Re: [VOTE] 3.5.0 RC1

2023-06-12 Thread David Arthur
t; >> > > - KIP-900: KRaft kafka-storage.sh API additions to support SCRAM
> > for
> > > >> > > Kafka Brokers
> > > >> > >
> > > >> > > Release notes for the 3.5.0 release:
> > > >> > >
> > https://home.apache.org/~mimaison/kafka-3.5.0-rc1/RELEASE_NOTES.html
> > > >> > >
> > > >> > > *** Please download, test and vote by Friday June 9, 5pm PT
> > > >> > >
> > > >> > > Kafka's KEYS file containing PGP keys we use to sign the
> release:
> > > >> > > https://kafka.apache.org/KEYS
> > > >> > >
> > > >> > > * Release artifacts to be voted upon (source and binary):
> > > >> > > https://home.apache.org/~mimaison/kafka-3.5.0-rc1/
> > > >> > >
> > > >> > > * Maven artifacts to be voted upon:
> > > >> > >
> > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > > >> > >
> > > >> > > * Javadoc:
> > > >> > > https://home.apache.org/~mimaison/kafka-3.5.0-rc1/javadoc/
> > > >> > >
> > > >> > > * Tag to be voted upon (off 3.5 branch) is the 3.5.0 tag:
> > > >> > > https://github.com/apache/kafka/releases/tag/3.5.0-rc1
> > > >> > >
> > > >> > > * Documentation:
> > > >> > > https://kafka.apache.org/35/documentation.html
> > > >> > >
> > > >> > > * Protocol:
> > > >> > > https://kafka.apache.org/35/protocol.html
> > > >> > >
> > > >> > > * Successful Jenkins builds for the 3.5 branch:
> > > >> > > Unit/integration tests: I'm struggling to get all tests to pass
> > in the
> > > >> > > same build. I'll run a few more builds to ensure each test pass
> at
> > > >> > > least once in the CI. All tests passed locally.
> > > >> > > System tests: The build is still running, I'll send an update
> > once I
> > > >> > > have the results.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Mickael
> > > >> > >
> > > >>
> > > >
> > > >
> > > > --
> > > > [image: Aiven] <https://www.aiven.io>
> > > >
> > > > *Josep Prat*
> > > > Open Source Engineering Director, *Aiven*
> > > > josep.p...@aiven.io   |   +491715557497
> > > > aiven.io <https://www.aiven.io>   |   <
> > https://www.facebook.com/aivencloud>
> > > >   <https://www.linkedin.com/company/aiven/>   <
> > https://twitter.com/aiven_io>
> > > > *Aiven Deutschland GmbH*
> > > > Alexanderufer 3-7, 10117 Berlin
> > > > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > > > Amtsgericht Charlottenburg, HRB 209739 B
> >
>


-- 
David Arthur


Re: [DISCUSS] Regarding Old PRs

2023-06-09 Thread David Arthur
Hey all, I just wanted to bump this one more time before I merge this PR
(thanks for the review, Josep!). I'll merge it at the end of the day today
unless anyone has more feedback.

Thanks!
David

On Wed, Jun 7, 2023 at 8:50 PM David Arthur  wrote:

> I filed KAFKA-15073 for this. Here is a patch
> https://github.com/apache/kafka/pull/13827. This simply adds a "stale"
> label to PRs with no activity in the last 90 days. I figure that's a good
> starting point.
>
> As for developer workflow, the "stale" action is quite flexible in how it
> finds candidate PRs to mark as stale. For example, we can exclude PRs that
> have an Assignee, or a particular set of labels. Docs are here
> https://github.com/actions/stale
>
> -David
>
>
> On Wed, Jun 7, 2023 at 2:36 PM Josep Prat 
> wrote:
>
> > Thanks David!
> >
> > ———
> > Josep Prat
> >
> > Aiven Deutschland GmbH
> >
> > Alexanderufer 3-7, 10117 Berlin
> >
> > Amtsgericht Charlottenburg, HRB 209739 B
> >
> > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> >
> > m: +491715557497
> >
> > w: aiven.io
> >
> > e: josep.p...@aiven.io
> >
> > On Wed, Jun 7, 2023, 20:28 David Arthur  > .invalid>
> > wrote:
> >
> > > Hey all, I started poking around at Github actions on my fork.
> > >
> > > https://github.com/mumrah/kafka/actions
> > >
> > > I'll post a PR if I get it working and we can discuss what kind of
> > settings
> > > we want (or if we want it all)
> > >
> > > -David
> > >
> > > On Tue, Jun 6, 2023 at 1:18 PM Chris Egerton 
> > > wrote:
> > >
> > > > Hi Josep,
> > > >
> > > > Thanks for bringing this up! Will try to keep things brief.
> > > >
> > > > I'm generally in favor of this initiative. A couple ideas that I
> really
> > > > liked: requiring a component label (producer, consumer, connect,
> > streams,
> > > > etc.) before closing, and disabling auto-close (i.e., automatically
> > > tagging
> > > > PRs as stale, but leaving it to a human being to actually close
> them).
> > > >
> > > > We might replace the "stale" label with a "close-by-" label so
> > that
> > > > it becomes even easier for us to find the PRs that are ready to be
> > closed
> > > > (as opposed to the ones that have just been labeled as stale without
> > > giving
> > > > the contributor enough time to respond).
> > > >
> > > > I've also gone ahead and closed some of my stale PRs. Others I've
> > > > downgraded to draft to signal that I'd like to continue to pursue
> them,
> > > but
> > > > have to iron out merge conflicts first. For the last ones, I'll ping
> > for
> > > > review.
> > > >
> > > > One question that came to mind--do we want to distinguish between
> > regular
> > > > and draft PRs? I'm guessing not, since they still add up to the total
> > PR
> > > > count against the project, but since they do also implicitly signal
> > that
> > > > they're not intended for review (yet) it may be friendlier to leave
> > them
> > > > up.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Tue, Jun 6, 2023 at 10:18 AM Mickael Maison <
> > mickael.mai...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Josep,
> > > > >
> > > > > Thanks for looking into this. This is clearly one aspect where we
> > need
> > > > > to improve.
> > > > >
> > > > > We had a similar initiative last year
> > > > > (https://lists.apache.org/thread/66yj9m6tcyz8zqb3lqlbnr386bqwsopt)
> > and
> > > > > we closed many PRs. Unfortunately we did not follow up with a
> process
> > > > > or automation and we are back to the same situation.
> > > > >
> > > > > Manually reviewing all these PRs is a huge task, so I think we
> should
> > > > > at least partially automate it. I'm not sure if we should manually
> > > > > review the oldest PRs (pre 2020). There's surely many interesting
> > > > > things but I wonder if we should instead focus on the more recent
> > ones
> > > > > as they have a higher chance of 1) still making sense, 2) gett

Re: [VOTE] KIP-938: Add more metrics for measuring KRaft performance

2023-06-08 Thread David Arthur
Ok, thanks for the explanations. +1 binding from me

-David

On Thu, Jun 8, 2023 at 6:08 PM Colin McCabe  wrote:

> One note. I added ForwardingManager queue metrics. This should be the last
> addition!
>
> best,
> Colin
>
> On Thu, Jun 8, 2023, at 14:47, Colin McCabe wrote:
> > On Thu, Jun 8, 2023, at 10:00, David Arthur wrote:
> >> Colin, thanks for the KIP! These all seem like pretty useful additions.
> A
> >> few quick questions
> >>
> >> 1) Will the value of TimedOutBrokerHeartbeatCount be zero for inactive
> >> controllers?
> >
> > No. It will just stay at whatever it was when the controller became
> > inactive (or 0 if it was never active).
> >
> >> Will the value reset to zero after an election, or only
> >> process restart?
> >
> > Only process restart.
> >
> > The reason is that if the metric resets on a new controller being
> > elected, a lot of metrics collection systems will potentially miss it.
> > After all, one pattern we've seen is excessive load leading to
> > excessive controller elections.
> >
> >> 2) Does HandleLoadSnapshotCount include the initial load?
> >
> > It does count the initial load.
> >
> >> I.e., will it always be at least 1
> >
> > Well, there isn't an initial load, if the cluster is brand new. :)
> >
> > But yes, in general a value of 1 is expected.
> >
> > best,
> > Colin
> >
> >>
> >> Thanks!
> >> David
> >>
> >> On Wed, Jun 7, 2023 at 11:17 PM Luke Chen  wrote:
> >>
> >>> Hi Colin,
> >>>
> >>> Thanks for the response.
> >>> I have no more comments.
> >>> +1 (binding)
> >>>
> >>> Luke
> >>>
> >>> On Thu, Jun 8, 2023 at 6:02 AM Colin McCabe 
> wrote:
> >>> >
> >>> > Hi Luke,
> >>> >
> >>> > Thanks for the review and the suggestion.
> >>> >
> >>> > I think we will add more "handling time" metrics later, but for now I
> >>> don't want to make this KIP any bigger than it is already...
> >>> >
> >>> > best,
> >>> > Colin
> >>> >
> >>> >
> >>> > On Wed, Jun 7, 2023, at 03:12, Luke Chen wrote:
> >>> > > Hi Colin,
> >>> > >
> >>> > > One comment:
> >>> > > Should we add a metric to record the snapshot handling time?
> >>> > > Since we know the snapshot loading might take long if the size is
> huge.
> >>> > > We might want to know how much time it is processed. WDYT?
> >>> > >
> >>> > > No matter you think we need it or not, the KIP LGTM.
> >>> > > +1 from me.
> >>> > >
> >>> > >
> >>> > > Thank you.
> >>> > > Luke
> >>> > >
> >>> > > On Wed, Jun 7, 2023 at 1:33 PM Colin McCabe 
> >>> wrote:
> >>> > >>
> >>> > >> Hi all,
> >>> > >>
> >>> > >> I added two new metrics to the list:
> >>> > >>
> >>> > >> * LatestSnapshotGeneratedBytes
> >>> > >> * LatestSnapshotGeneratedAgeMs
> >>> > >>
> >>> > >> These will help monitor the period snapshot generation process.
> >>> > >>
> >>> > >> best,
> >>> > >> Colin
> >>> > >>
> >>> > >>
> >>> > >> On Tue, Jun 6, 2023, at 22:21, Colin McCabe wrote:
> >>> > >> > Hi Divij,
> >>> > >> >
> >>> > >> > Yes, I am referring to the feature level. I changed the
> description
> >>> of
> >>> > >> > CurrentMetadataVersion to reference the feature level
> specifically.
> >>> > >> >
> >>> > >> > best,
> >>> > >> > Colin
> >>> > >> >
> >>> > >> >
> >>> > >> > On Tue, Jun 6, 2023, at 05:56, Divij Vaidya wrote:
> >>> > >> >> "Each metadata version has a corresponding integer in the
> >>> > >> >> MetadataVersion.java file."
> >>> > >> >>
> >>> > >> >> Please correct me if I'm wrong, but are you referring to
> >>> "featureLevel"
> >>> > >> >> in
> >>> > >> >> the enum at
> >>> > >> >>
> >>>
> https://github.com/apache/kafka/blob/trunk/server-common/src/main/java/org/apache/kafka/server/common/MetadataVersion.java#L45
> >>> > >> >> ? Is yes, can we please update the description of the metric to
> >>> make it
> >>> > >> >> easier for the users to understand this? For example, we can
> say,
> >>> > >> >> "Represents the current metadata version as an integer value.
> See
> >>> > >> >> MetadataVersion (hyperlink) for a mapping between string and
> >>> integer
> >>> > >> >> formats of metadata version".
> >>> > >> >>
> >>> > >> >> --
> >>> > >> >> Divij Vaidya
> >>> > >> >>
> >>> > >> >>
> >>> > >> >>
> >>> > >> >> On Tue, Jun 6, 2023 at 1:51 PM Ron Dagostino <
> rndg...@gmail.com>
> >>> wrote:
> >>> > >> >>
> >>> > >> >>> Thanks again for the KIP, Colin.  +1 (binding).
> >>> > >> >>>
> >>> > >> >>> Ron
> >>> > >> >>>
> >>> > >> >>> > On Jun 6, 2023, at 7:02 AM, Igor Soarez
> >>> 
> >>> > >> >>> wrote:
> >>> > >> >>> >
> >>> > >> >>> > Thanks for the KIP.
> >>> > >> >>> >
> >>> > >> >>> > Seems straightforward, LGTM.
> >>> > >> >>> > Non binding +1.
> >>> > >> >>> >
> >>> > >> >>> > --
> >>> > >> >>> > Igor
> >>> > >> >>> >
> >>> > >> >>>
> >>>
>


-- 
-David


Re: [VOTE] KIP-938: Add more metrics for measuring KRaft performance

2023-06-08 Thread David Arthur
Colin, thanks for the KIP! These all seem like pretty useful additions. A
few quick questions

1) Will the value of TimedOutBrokerHeartbeatCount be zero for inactive
controllers? Will the value reset to zero after an election, or only
process restart?
2) Does HandleLoadSnapshotCount include the initial load? I.e., will it
always be at least 1?

Thanks!
David

On Wed, Jun 7, 2023 at 11:17 PM Luke Chen  wrote:

> Hi Colin,
>
> Thanks for the response.
> I have no more comments.
> +1 (binding)
>
> Luke
>
> On Thu, Jun 8, 2023 at 6:02 AM Colin McCabe  wrote:
> >
> > Hi Luke,
> >
> > Thanks for the review and the suggestion.
> >
> > I think we will add more "handling time" metrics later, but for now I
> don't want to make this KIP any bigger than it is already...
> >
> > best,
> > Colin
> >
> >
> > On Wed, Jun 7, 2023, at 03:12, Luke Chen wrote:
> > > Hi Colin,
> > >
> > > One comment:
> > > Should we add a metric to record the snapshot handling time?
> > > Since we know the snapshot loading might take long if the size is huge.
> > > We might want to know how much time it is processed. WDYT?
> > >
> > > No matter you think we need it or not, the KIP LGTM.
> > > +1 from me.
> > >
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Wed, Jun 7, 2023 at 1:33 PM Colin McCabe 
> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I added two new metrics to the list:
> > >>
> > >> * LatestSnapshotGeneratedBytes
> > >> * LatestSnapshotGeneratedAgeMs
> > >>
> > >> These will help monitor the period snapshot generation process.
> > >>
> > >> best,
> > >> Colin
> > >>
> > >>
> > >> On Tue, Jun 6, 2023, at 22:21, Colin McCabe wrote:
> > >> > Hi Divij,
> > >> >
> > >> > Yes, I am referring to the feature level. I changed the description
> of
> > >> > CurrentMetadataVersion to reference the feature level specifically.
> > >> >
> > >> > best,
> > >> > Colin
> > >> >
> > >> >
> > >> > On Tue, Jun 6, 2023, at 05:56, Divij Vaidya wrote:
> > >> >> "Each metadata version has a corresponding integer in the
> > >> >> MetadataVersion.java file."
> > >> >>
> > >> >> Please correct me if I'm wrong, but are you referring to
> "featureLevel"
> > >> >> in
> > >> >> the enum at
> > >> >>
> https://github.com/apache/kafka/blob/trunk/server-common/src/main/java/org/apache/kafka/server/common/MetadataVersion.java#L45
> > >> >> ? Is yes, can we please update the description of the metric to
> make it
> > >> >> easier for the users to understand this? For example, we can say,
> > >> >> "Represents the current metadata version as an integer value. See
> > >> >> MetadataVersion (hyperlink) for a mapping between string and
> integer
> > >> >> formats of metadata version".
> > >> >>
> > >> >> --
> > >> >> Divij Vaidya
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Tue, Jun 6, 2023 at 1:51 PM Ron Dagostino 
> wrote:
> > >> >>
> > >> >>> Thanks again for the KIP, Colin.  +1 (binding).
> > >> >>>
> > >> >>> Ron
> > >> >>>
> > >> >>> > On Jun 6, 2023, at 7:02 AM, Igor Soarez
> 
> > >> >>> wrote:
> > >> >>> >
> > >> >>> > Thanks for the KIP.
> > >> >>> >
> > >> >>> > Seems straightforward, LGTM.
> > >> >>> > Non binding +1.
> > >> >>> >
> > >> >>> > --
> > >> >>> > Igor
> > >> >>> >
> > >> >>>
>


Re: [DISCUSS] Regarding Old PRs

2023-06-07 Thread David Arthur
I filed KAFKA-15073 for this. Here is a patch
https://github.com/apache/kafka/pull/13827. This simply adds a "stale"
label to PRs with no activity in the last 90 days. I figure that's a good
starting point.

As for developer workflow, the "stale" action is quite flexible in how it
finds candidate PRs to mark as stale. For example, we can exclude PRs that
have an Assignee, or a particular set of labels. Docs are here
https://github.com/actions/stale

-David


On Wed, Jun 7, 2023 at 2:36 PM Josep Prat 
wrote:

> Thanks David!
>
> ———
> Josep Prat
>
> Aiven Deutschland GmbH
>
> Alexanderufer 3-7, 10117 Berlin
>
> Amtsgericht Charlottenburg, HRB 209739 B
>
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
>
> m: +491715557497
>
> w: aiven.io
>
> e: josep.p...@aiven.io
>
> On Wed, Jun 7, 2023, 20:28 David Arthur  .invalid>
> wrote:
>
> > Hey all, I started poking around at Github actions on my fork.
> >
> > https://github.com/mumrah/kafka/actions
> >
> > I'll post a PR if I get it working and we can discuss what kind of
> settings
> > we want (or if we want it all)
> >
> > -David
> >
> > On Tue, Jun 6, 2023 at 1:18 PM Chris Egerton 
> > wrote:
> >
> > > Hi Josep,
> > >
> > > Thanks for bringing this up! Will try to keep things brief.
> > >
> > > I'm generally in favor of this initiative. A couple ideas that I really
> > > liked: requiring a component label (producer, consumer, connect,
> streams,
> > > etc.) before closing, and disabling auto-close (i.e., automatically
> > tagging
> > > PRs as stale, but leaving it to a human being to actually close them).
> > >
> > > We might replace the "stale" label with a "close-by-<date>" label so
> that
> > > it becomes even easier for us to find the PRs that are ready to be
> closed
> > > (as opposed to the ones that have just been labeled as stale without
> > giving
> > > the contributor enough time to respond).
> > >
> > > I've also gone ahead and closed some of my stale PRs. Others I've
> > > downgraded to draft to signal that I'd like to continue to pursue them,
> > but
> > > have to iron out merge conflicts first. For the last ones, I'll ping
> for
> > > review.
> > >
> > > One question that came to mind--do we want to distinguish between
> regular
> > > and draft PRs? I'm guessing not, since they still add up to the total
> PR
> > > count against the project, but since they do also implicitly signal
> that
> > > they're not intended for review (yet) it may be friendlier to leave
> them
> > > up.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Tue, Jun 6, 2023 at 10:18 AM Mickael Maison <
> mickael.mai...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Josep,
> > > >
> > > > Thanks for looking into this. This is clearly one aspect where we
> need
> > > > to improve.
> > > >
> > > > We had a similar initiative last year
> > > > (https://lists.apache.org/thread/66yj9m6tcyz8zqb3lqlbnr386bqwsopt)
> and
> > > > we closed many PRs. Unfortunately we did not follow up with a process
> > > > or automation and we are back to the same situation.
> > > >
> > > > Manually reviewing all these PRs is a huge task, so I think we should
> > > > at least partially automate it. I'm not sure if we should manually
> > > > review the oldest PRs (pre 2020). There's surely many interesting
> > > > things but I wonder if we should instead focus on the more recent
> ones
> > > > as they have a higher chance of 1) still making sense, 2) getting
> > > > updates from their authors, 3) needing less rebasing. If something
> has
> > > > been broken since 2016 but we never bothered to fix the PR it means
> it
> > > > can't be anything critical!
> > > >
> > > > Finally as Colin mentioned, it looks like a non negligible chunk of
> > > > stale PRs comes from committers and regular contributors. So I'd
> > > > suggest we each try to clean our own backlog too.
> > > >
> > > > I wonder if we also need to do something in JIRA. Querying for
> > > > unresolved tickets returns over 4000 items. Considering we're not
> > > > quite at KAFKA-15000 yet, that's a lot.
> > > >
> > > > Thanks,
> > > > Mickael
> > > >
>

[jira] [Created] (KAFKA-15073) Automation for old/inactive PRs

2023-06-07 Thread David Arthur (Jira)
David Arthur created KAFKA-15073:


 Summary: Automation for old/inactive PRs
 Key: KAFKA-15073
 URL: https://issues.apache.org/jira/browse/KAFKA-15073
 Project: Kafka
  Issue Type: Improvement
  Components: build
Reporter: David Arthur


Following from a discussion on the mailing list. It would be nice to 
automatically triage inactive PRs. There are currently over 1000 open PRs. Most 
likely a majority of these will not ever be merged and should be closed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Regarding Old PRs

2023-06-07 Thread David Arthur
Hey all, I started poking around at Github actions on my fork.

https://github.com/mumrah/kafka/actions

I'll post a PR if I get it working and we can discuss what kind of settings
we want (or if we want it all)

-David

On Tue, Jun 6, 2023 at 1:18 PM Chris Egerton 
wrote:

> Hi Josep,
>
> Thanks for bringing this up! Will try to keep things brief.
>
> I'm generally in favor of this initiative. A couple ideas that I really
> liked: requiring a component label (producer, consumer, connect, streams,
> etc.) before closing, and disabling auto-close (i.e., automatically tagging
> PRs as stale, but leaving it to a human being to actually close them).
>
> We might replace the "stale" label with a "close-by-<date>" label so that
> it becomes even easier for us to find the PRs that are ready to be closed
> (as opposed to the ones that have just been labeled as stale without giving
> the contributor enough time to respond).
>
> I've also gone ahead and closed some of my stale PRs. Others I've
> downgraded to draft to signal that I'd like to continue to pursue them, but
> have to iron out merge conflicts first. For the last ones, I'll ping for
> review.
>
> One question that came to mind--do we want to distinguish between regular
> and draft PRs? I'm guessing not, since they still add up to the total PR
> count against the project, but since they do also implicitly signal that
> they're not intended for review (yet) it may be friendlier to leave them
> up.
>
> Cheers,
>
> Chris
>
> On Tue, Jun 6, 2023 at 10:18 AM Mickael Maison 
> wrote:
>
> > Hi Josep,
> >
> > Thanks for looking into this. This is clearly one aspect where we need
> > to improve.
> >
> > We had a similar initiative last year
> > (https://lists.apache.org/thread/66yj9m6tcyz8zqb3lqlbnr386bqwsopt) and
> > we closed many PRs. Unfortunately we did not follow up with a process
> > or automation and we are back to the same situation.
> >
> > Manually reviewing all these PRs is a huge task, so I think we should
> > at least partially automate it. I'm not sure if we should manually
> > review the oldest PRs (pre 2020). There's surely many interesting
> > things but I wonder if we should instead focus on the more recent ones
> > as they have a higher chance of 1) still making sense, 2) getting
> > updates from their authors, 3) needing less rebasing. If something has
> > been broken since 2016 but we never bothered to fix the PR it means it
> > can't be anything critical!
> >
> > Finally as Colin mentioned, it looks like a non negligible chunk of
> > stale PRs comes from committers and regular contributors. So I'd
> > suggest we each try to clean our own backlog too.
> >
> > I wonder if we also need to do something in JIRA. Querying for
> > unresolved tickets returns over 4000 items. Considering we're not
> > quite at KAFKA-15000 yet, that's a lot.
> >
> > Thanks,
> > Mickael
> >
> >
> > On Tue, Jun 6, 2023 at 11:35 AM Josep Prat 
> > wrote:
> > >
> > > Hi Devs,
> > > I would say we can split the problem in 2.
> > >
> > > Waiting for Author feedback:
> > > We could have a bot that would ping authors for the cases where we have
> > PRs
> > > that are stalled and have either:
> > > - Merge conflict
> > > - Unaddressed reviews
> > >
> > > Waiting for reviewers:
> > > For the PRs where we have no reviewers and there are no conflicts, I
> > think
> > > we would need some human interaction to determine modules (maybe this
> can
> > > be automated) and ping people who could review.
> > >
> > > What do you think?
> > >
> > > Best,
> > >
> > > On Tue, Jun 6, 2023 at 11:30 AM Josep Prat 
> wrote:
> > >
> > > > Hi Nikolay,
> > > >
> > > > With a bot it will be complicated to determine what to do when the PR
> > > > author is waiting for a reviewer. If a person goes over them, they can
> > > > check if
> > > > they are waiting for reviews and tag the PR accordingly and maybe
> ping
> > a
> > > > maintainer.
> > > > If you look at my last email I described a flow (but AFAIU it will
> work
> > > > only if a human executes it) where the situation you point out would
> be
> > > > covered.
> > > >
> > > > ———
> > > > Josep Prat
> > > >
> > > > Aiven Deutschland GmbH
> > > >
> > > > Alexanderufer 3-7, 10117 Berlin
> > > >
> > > > Amtsgeric

Re: [DISCUSS] Regarding Old PRs

2023-06-02 Thread David Arthur
I think this is a great idea. If we don’t want the auto-close
functionality, we can set it to -1

I realize this isn’t a vote, but I’m +1 on this

-David

On Fri, Jun 2, 2023 at 15:34 Colin McCabe  wrote:

> That should read "30 days without activity"
>
> (I am assuming we have the ability to determine when a PR was last updated
> on GH)
>
> best,
> Colin
>
> On Fri, Jun 2, 2023, at 12:32, Colin McCabe wrote:
> > Hi all,
> >
> > Looking at GitHub, I have a bunch of Kafka PRs of my own that I've
> > allowed to become stale, and I guess are pushing up these numbers!
> > Overall I support the goal of putting a time limit on PRs, just so that
> > we can focus our review bandwidth.
> >
> > It may be best to start with a simple approach where we mark PRs as
> > stale after 30 days and email the submitter at that time. And then
> > delete after 60 days. (Of course the exact time periods might be
> > something other than 30/60 but this is just an initial suggestion)
> >
> > best,
> > Colin
> >
> >
> > On Fri, Jun 2, 2023, at 00:37, Josep Prat wrote:
> >> Hi all,
> >>
> >> I want to say that in my experience, I always felt better as a
> contributor
> >> when a person told me something than when a bot did. That being said,
> I'm
> >> not against bots, and I think they might be a great solution once we
> have a
> >> manageable number of open PRs.
> >>
> >> Another great question that adding a bot poses is types of staleness
> >> detection. How do we distinguish between staleness from the author's
> side
> >> or from the lack of reviewers/maintainers' side? That's why I started
> with
> >> a human approach to be able to distinguish between these 2 cases. Both
> >> situations should have different messages and actually different
> intended
> >> receivers. In case of staleness because of author inactivity, the
> message
> >> should encourage the author to update the PR with the requested changes
> or
> >> resolve the conflicts. But In many cases, PRs are stale because of lack
> of
> >> reviewers. This would need a different message, targeting maintainers.
> >>
> >> Ideally (with bot or not) I believe the process should be as follows:
> >> - Check PRs that are stale.
> >> - See if they have labels, if not add proper labels (connect, core,
> >> clients...)
> >> -  PR has merge conflicts
> >> -- Merge conflicts exist and target files that still exist, ping the
> author
> >> asking for conflict resolution and add some additional label like
> `stale`.
> >> -- Merge conflicts exist and target files that do not exist anymore, let
> >> the author know that this PR is obsolete, label the PR as 'obsolete' and
> >> close the PR.
> >> - PR is mergeable, check whose action is needed (author or reviewers)
> >> -- Author: let the author know that there are pending comments to
> address.
> >> Add some additional label, maybe `stale` again
> >> -- Reviewer: ping some reviewers that have experience or lately touched
> >> this piece of the codebase, add a label `reviewer needed` or something
> >> similar
> >> - PRs that have `stale` label after X days, will be closed.
> >>
> >> Regarding the comments about only committers and collaborators being
> able
> >> to label PRs, this is true, not everyone can do this. However, this
> could
> >> be a great opportunity for the newly appointed contributors to exercise
> >> their new permissions :)
> >>
> >> Let me know if it makes sense to you all.
> >>
> >> Best,
>
-- 
David Arthur


Re: [VOTE] 3.5.0 RC0

2023-06-02 Thread David Arthur
Mickael, all of our migration fixes are in.

Thanks!
David

On Thu, Jun 1, 2023 at 6:10 PM Colin McCabe  wrote:

> Hi Mickael,
>
> Can you start the new RC tomorrow? There's one last PR we'd like to get in.
>
> If we can't get it in by tomorrow then let's go ahead anyway.
>
> Thanks very much,
> Colin
>
>
> On Thu, Jun 1, 2023, at 14:15, Mickael Maison wrote:
> > Hi David,
> >
> > The PR you mentioned is merged now. Can I start working on a new RC or is
> > there more work needed? It seems the associated ticket is still open.
> >
> > Thanks,
> > Mickael
> >
> > On Wed, 31 May 2023, 23:52 Justine Olshan,  >
> > wrote:
> >
> >> Hey Mickael --
> >> This is done. Thanks!
> >>
> >> On Wed, May 31, 2023 at 11:24 AM Mickael Maison <
> mickael.mai...@gmail.com>
> >> wrote:
> >>
> >> > Hi Justine,
> >> >
> >> > Yes you can merge that into 3.5.
> >> >
> >> > Thanks,
> >> > Mickael
> >> >
> >> > On Wed, May 31, 2023 at 7:56 PM Justine Olshan
> >> >  wrote:
> >> > >
> >> > > FYI -- I just saw this PR regarding a dependency for ARM. We may
> want
> >> to
> >> > > get this in for 3.5 as well. It should be quick.
> >> > >
> >> > > https://issues.apache.org/jira/browse/KAFKA-15044
> >> > > https://github.com/apache/kafka/pull/13786
> >> > >
> >> > > Justine
> >> > >
> >> > > On Wed, May 31, 2023 at 9:28 AM David Arthur
> >> > >  wrote:
> >> > >
> >> > > > Mickael,
> >> > > >
> >> > > > Colin has approved my patch for KAFKA-15010, I'm just waiting on a
> >> > build
> >> > > > before merging. I'll go ahead and backport the other fixes that
> need
> >> to
> >> > > > precede this one into 3.5.
> >> > > >
> >> > > > -David
> >> > > >
> >> > > > On Wed, May 31, 2023 at 11:52 AM Mickael Maison <
> >> > mickael.mai...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > The issue mentioned by Greg has been fixed. As soon as the fix
> for
> >> > > > > KAFKA-15010 is merged I'll build another RC.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Mickael
> >> > > > >
> >> > > > > On Tue, May 30, 2023 at 10:33 AM Mickael Maison
> >> > > > >  wrote:
> >> > > > > >
> >> > > > > > Hi David,
> >> > > > > >
> >> > > > > > Feel free to backport the necessary fixes to 3.5.
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > Mickael
> >> > > > > >
> >> > > > > > On Tue, May 30, 2023 at 10:32 AM Mickael Maison
> >> > > > > >  wrote:
> >> > > > > > >
> >> > > > > > > Hi Greg,
> >> > > > > > >
> >> > > > > > > Thanks for the heads up, this indeed looks like something we
> >> > want in
> >> > > > > > > 3.5. I've replied in the PR.
> >> > > > > > >
> >> > > > > > > Mickael
> >> > > > > > >
> >> > > > > > > On Sat, May 27, 2023 at 11:44 PM David Arthur
> >> > > > > > >  wrote:
> >> > > > > > > >
> >> > > > > > > > Mickael, after looking more closely, I definitely think
> >> > KAFKA-15010
> >> > > > > is a
> >> > > > > > > > blocker. It creates the case where the controller can
> totally
> >> > miss
> >> > > > a
> >> > > > > > > > metadata update and not write it back to ZK. Since things
> >> like
> >> > > > > dynamic
> >> > > > > > > > configs and ACLs are only read from ZK by the ZK brokers,
> we
> >> > could
> >> > > > > have
> >> > > > > > > > significant problems while the broke

Re: [VOTE] 3.5.0 RC0

2023-05-31 Thread David Arthur
Mickael,

Colin has approved my patch for KAFKA-15010, I'm just waiting on a build
before merging. I'll go ahead and backport the other fixes that need to
precede this one into 3.5.

-David

On Wed, May 31, 2023 at 11:52 AM Mickael Maison 
wrote:

> Hi,
>
> The issue mentioned by Greg has been fixed. As soon as the fix for
> KAFKA-15010 is merged I'll build another RC.
>
> Thanks,
> Mickael
>
> On Tue, May 30, 2023 at 10:33 AM Mickael Maison
>  wrote:
> >
> > Hi David,
> >
> > Feel free to backport the necessary fixes to 3.5.
> >
> > Thanks,
> > Mickael
> >
> > On Tue, May 30, 2023 at 10:32 AM Mickael Maison
> >  wrote:
> > >
> > > Hi Greg,
> > >
> > > Thanks for the heads up, this indeed looks like something we want in
> > > 3.5. I've replied in the PR.
> > >
> > > Mickael
> > >
> > > On Sat, May 27, 2023 at 11:44 PM David Arthur
> > >  wrote:
> > > >
> > > > Mickael, after looking more closely, I definitely think KAFKA-15010
> is a
> > > > blocker. It creates the case where the controller can totally miss a
> > > > metadata update and not write it back to ZK. Since things like
> dynamic
> > > > configs and ACLs are only read from ZK by the ZK brokers, we could
> have
> > > > significant problems while the brokers are being migrated (when some
> are
> > > > KRaft and some are ZK). E.g., ZK brokers could be totally unaware of
> an ACL
> > > > change while the KRaft brokers have it. I have a fix ready here
> > > > https://github.com/apache/kafka/pull/13758. I think we can get it
> committed
> > > > soon.
> > > >
> > > > Another blocker is KAFKA-15004 which was just merged to trunk. This
> is
> > > > another dual-write bug where new topic/broker configs will not be
> written
> > > > back to ZK by the controller.
> > > >
> > > > The fix for KAFKA-15010 has a few dependencies on fixes we made this
> past
> > > > week, so we'll need to cherry-pick a few commits. The changes are
> totally
> > > > contained within the migration area of code, so I think the risk in
> > > > including them is fairly low.
> > > >
> > > > -David
> > > >
> > > > On Thu, May 25, 2023 at 2:15 PM Greg Harris
> 
> > > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > A contributor just pointed out a small but noticeable flaw in the
> > > > > implementation of KIP-581
> > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-581%3A+Value+of+optional+null+field+which+has+default+value
> > > > > which is planned for this release.
> > > > > Impact: the feature works for root values in a record, but does not
> > > > > work for any fields within structs. Fields within structs will
> > > > > continue to have their previous, backwards-compatible behavior.
> > > > > The contributor has submitted a bug-fix PR which reports the
> problem
> > > > > and does not yet have a merge-able solution, but they are actively
> > > > > responding and interested in having this fixed:
> > > > > https://github.com/apache/kafka/pull/13748
> > > > > The overall fix should be a one-liner + some unit tests. While
> this is
> > > > > not a regression, it does make the feature largely useless, as the
> > > > > majority of use-cases will be for struct fields.
> > > > >
> > > > > Thanks!
> > > > > Greg Harris
> > > > >
> > > > > On Wed, May 24, 2023 at 7:05 PM Ismael Juma 
> wrote:
> > > > > >
> > > > > > I agree the migration should be functional - it wasn't obvious
> if the
> > > > > > migration issues are edge cases or not. If they are edge cases,
> I think
> > > > > > 3.5.1 would be fine given the preview status.
> > > > > >
> > > > > > I understand that a new RC is needed, but that doesn't mean we
> should let
> > > > > > everything in. Each change carries some risk. And if we don't
> agree on
> > > > > the
> > > > > > bar for the migration work, we may be having the same discussion
> next
> > > > > week.
> > > > > > :)
> > > > > >
> > > > > > Ismael
> >

Re: [VOTE] 3.5.0 RC0

2023-05-27 Thread David Arthur
 > >
> > > > > So unfortunately I have to leave a -1 here for RC0. Let's aim for
> > > another
> > > > > RC next week.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Wed, May 24, 2023, at 07:05, Mickael Maison wrote:
> > > > > > Hi David,
> > > > > >
> > > > > > We're already quite a bit behind schedule. If you think these
> fixes
> > > > > > are really important and can be ready in the next couple of
> days, I'm
> > > > > > open to backport them and build another release candidate. Let me
> > > know
> > > > > > once you've investigated the severity of KAFKA-15010.
> > > > > >
> > > > > > Thanks,
> > > > > > Mickael
> > > > > >
> > > > > >
> > > > > > On Tue, May 23, 2023 at 6:34 PM David Arthur
> > > > > >  wrote:
> > > > > >>
> > > > > >> Mickael, we have some migration fixes on trunk, is it okay to
> > > > > cherry-pick
> > > > > >> these to 3.5?
> > > > > >>
> > > > > >> KAFKA-15007 Use the correct MetadataVersion in
> MigrationPropagator
> > > > > >> KAFKA-15009 Handle new ACLs in KRaft snapshot during migration
> > > > > >>
> > > > > >> There is another issue KAFKA-15010 that I'm also investigating
> to
> > > > > determine
> > > > > >> the impact and likelihood of seeing it in practice. This one
> may be
> > > a
> > > > > >> significant migration blocker
> > > > > >>
> > > > > >> Cheers,
> > > > > >> David
> > > > > >>
> > > > > >> On Tue, May 23, 2023 at 9:57 AM Mickael Maison <
> > > > > mickael.mai...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Christo,
> > > > > >> >
> > > > > >> > Yes this is expected. This happens when nested fields also
> accept
> > > > > >> > optional tagged fields. The tables list all fields, so they
> may
> > > > > >> > include _tagged_fields multiple times.
> > > > > >> > Clearly the layout of this page could be improved, if you have
> > > ideas
> > > > > >> > how to describe the protocol in a better way, feel free to
> share
> > > > them.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Mickael
> > > > > >> >
> > > > > >> > On Tue, May 23, 2023 at 3:50 PM Mickael Maison <
> > > > > mickael.mai...@gmail.com>
> > > > > >> > wrote:
> > > > > >> > >
> > > > > >> > > Hi Josep,
> > > > > >> > >
> > > > > >> > > Good catch! I opened a PR to fix this:
> > > > > >> > > https://github.com/apache/kafka-site/pull/514
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Mickael
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Tue, May 23, 2023 at 3:36 PM Christo Lolov <
> > > > > christolo...@gmail.com>
> > > > > >> > wrote:
> > > > > >> > > >
> > > > > >> > > > Hey Mickael!
> > > > > >> > > >
> > > > > >> > > > I am giving a +1 (non-binding) for this candidate release.
> > > > > >> > > >
> > > > > >> > > > * Built from the binary tar.gz source with Java 17 and
> Scala
> > > > 2.13
> > > > > on
> > > > > >> > Intel
> > > > > >> > > > (m5.4xlarge) and ARM (m6g.4xlarge) machines.
> > > > > >> > > > * Ran unit and integration tests on Intel and ARM
> machines.
> > > > > >> > > > * Ran the Quickstart in both Zookeeper and KRaft modes on
> > > Intel
> > > > > and ARM
> > > > > >> > &

Re: [VOTE] 3.5.0 RC0

2023-05-23 Thread David Arthur
Mickael, we have some migration fixes on trunk, is it okay to cherry-pick
these to 3.5?

KAFKA-15007 Use the correct MetadataVersion in MigrationPropagator
KAFKA-15009 Handle new ACLs in KRaft snapshot during migration

There is another issue KAFKA-15010 that I'm also investigating to determine
the impact and likelihood of seeing it in practice. This one may be a
significant migration blocker

Cheers,
David

On Tue, May 23, 2023 at 9:57 AM Mickael Maison 
wrote:

> Hi Christo,
>
> Yes this is expected. This happens when nested fields also accept
> optional tagged fields. The tables list all fields, so they may
> include _tagged_fields multiple times.
> Clearly the layout of this page could be improved, if you have ideas
> how to describe the protocol in a better way, feel free to share them.
>
> Thanks,
> Mickael
>
> On Tue, May 23, 2023 at 3:50 PM Mickael Maison 
> wrote:
> >
> > Hi Josep,
> >
> > Good catch! I opened a PR to fix this:
> > https://github.com/apache/kafka-site/pull/514
> >
> > Thanks,
> > Mickael
> >
> >
> > On Tue, May 23, 2023 at 3:36 PM Christo Lolov 
> wrote:
> > >
> > > Hey Mickael!
> > >
> > > I am giving a +1 (non-binding) for this candidate release.
> > >
> > > * Built from the binary tar.gz source with Java 17 and Scala 2.13 on
> Intel
> > > (m5.4xlarge) and ARM (m6g.4xlarge) machines.
> > > * Ran unit and integration tests on Intel and ARM machines.
> > > * Ran the Quickstart in both Zookeeper and KRaft modes on Intel and ARM
> > > machines.
> > >
> > > Question:
> > > * I went through https://kafka.apache.org/35/protocol.html and there
> are
> > > quite a few repetitive __tagged_fields fields within the same
> structures -
> > > is this expected?
> > >
> > > On Tue, 23 May 2023 at 12:01, Josep Prat 
> > > wrote:
> > >
> > > > Hi Mickael,
> > > > I just wanted to point out that I think the documentation you
> recently
> > > > merged on Kafka site regarding the 3.5.0 version has a problem when
> it
> > > > states the version number and the sub-menu that links to previous
> versions.
> > > > Left a comment here:
> > > >
> https://github.com/apache/kafka-site/pull/513#pullrequestreview-1438927939
> > > >
> > > > Best,
> > > >
> > > > On Tue, May 23, 2023 at 9:29 AM Josep Prat 
> wrote:
> > > >
> > > > > Hi Mickael,
> > > > >
> > > > > I can +1 this candidate. I verified the following:
> > > > > - Built from source with Java 17 and Scala 2.13
> > > > > - Signatures and hashes of the artifacts generated
> > > > > - Navigated through Javadoc including links to JDK classes
> > > > > - Run the unit tests
> > > > > - Run integration tests
> > > > > - Run the quickstart in KRaft and Zookeeper mode
> > > > >
> > > > > Best,
> > > > >
> > > > > On Mon, May 22, 2023 at 5:30 PM Mickael Maison <
> mimai...@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Hello Kafka users, developers and client-developers,
> > > > >>
> > > > >> This is the first candidate for release of Apache Kafka 3.5.0.
> Some of
> > > > the
> > > > >> major features include:
> > > > >> - KIP-710: Full support for distributed mode in dedicated
> MirrorMaker
> > > > >> 2.0 clusters
> > > > >> - KIP-881: Rack-aware Partition Assignment for Kafka Consumers
> > > > >> - KIP-887: Add ConfigProvider to make use of environment variables
> > > > >> - KIP-889: Versioned State Stores
> > > > >> - KIP-894: Use incrementalAlterConfig for syncing topic
> configurations
> > > > >> - KIP-900: KRaft kafka-storage.sh API additions to support SCRAM
> for
> > > > >> Kafka Brokers
> > > > >>
> > > > >> Release notes for the 3.5.0 release:
> > > > >>
> https://home.apache.org/~mimaison/kafka-3.5.0-rc0/RELEASE_NOTES.html
> > > > >>
> > > > >> *** Please download, test and vote by Friday, May 26, 5pm PT
> > > > >>
> > > > >> Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > >> https://kafka.apache.org/KEYS
> > > > >>
> > > > >> * Release artifacts to be voted upon (source and binary):
> > > > >> https://home.apache.org/~mimaison/kafka-3.5.0-rc0/
> > > > >>
> > > > >> * Maven artifacts to be voted upon:
> > > > >>
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > > > >>
> > > > >> * Javadoc:
> > > > >> https://home.apache.org/~mimaison/kafka-3.5.0-rc0/javadoc/
> > > > >>
> > > > >> * Tag to be voted upon (off 3.5 branch) is the 3.5.0 tag:
> > > > >> https://github.com/apache/kafka/releases/tag/3.5.0-rc0
> > > > >>
> > > > >> The PR adding the 35 documentation is not merged yet
> > > > >> (https://github.com/apache/kafka-site/pull/513)
> > > > >> * Documentation:
> > > > >> https://kafka.apache.org/35/documentation.html
> > > > >> * Protocol:
> > > > >> https://kafka.apache.org/35/protocol.html
> > > > >>
> > > > >> * Successful Jenkins builds for the 3.5 branch:
> > > > >> Unit/integration tests: Jenkins is not detecting the 3.5 branch,
> > > > >> working with INFRA to sort it out:
> > > > >> https://issues.apache.org/jira/browse/INFRA-24577
> > > > >> System tests: The build is still running, 

[jira] [Created] (KAFKA-15017) New ClientQuotas are not written to ZK from snapshot

2023-05-23 Thread David Arthur (Jira)
David Arthur created KAFKA-15017:


 Summary: New ClientQuotas are not written to ZK from snapshot 
 Key: KAFKA-15017
 URL: https://issues.apache.org/jira/browse/KAFKA-15017
 Project: Kafka
  Issue Type: Bug
  Components: kraft
Affects Versions: 3.5.0
Reporter: David Arthur


Similar issue to KAFKA-15009



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Migration from zookeeper to kraft not working

2023-05-18 Thread David Arthur
Elena,

Did you provision a KRaft controller quorum before restarting the brokers?

If you don't mind, could you create a JIRA and attach the config files used
for the brokers before/after the migration along with the controller
configs? Please include the sequence of steps you took in the JIRA as well.

Here is our JIRA project:
https://issues.apache.org/jira/projects/KAFKA/issues, and general info on
filing issues
https://cwiki.apache.org/confluence/display/KAFKA/Reporting+Issues+in+Apache+Kafka

Thanks!
David



On Tue, May 16, 2023 at 2:54 AM Elena Batranu
 wrote:

> Hello! I have a problem with my kafka configuration (kafka 3.4). I'm
> trying to migrate from zookeeper to kraft. I have 3 brokers, on one of them
> was also the zookeeper. I want to restart my brokers one by one, without
> having downtime. I started with putting the configuration with also kraft
> and zookeeper, to do the migration gradually. In this step my nodes are up,
> but i have the following error in the logs from kraft.
> [2023-05-16 06:35:19,485] DEBUG [BrokerToControllerChannelManager broker=0 name=quorum]: No controller provided, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
> [2023-05-16 06:35:19,585] DEBUG [BrokerToControllerChannelManager broker=0 name=quorum]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
> [2023-05-16 06:35:19,586] DEBUG [BrokerToControllerChannelManager broker=0 name=quorum]: No controller provided, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
> [2023-05-16 06:35:19,624] INFO [RaftManager nodeId=0] Node 3002 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2023-05-16 06:35:19,624] WARN [RaftManager nodeId=0] Connection to node 3002 (/192.168.25.172:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2023-05-16 06:35:19,642] INFO [RaftManager nodeId=0] Node 3001 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2023-05-16 06:35:19,642] WARN [RaftManager nodeId=0] Connection to node 3001 (/192.168.25.232:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2023-05-16 06:35:19,643] INFO [RaftManager nodeId=0] Node 3000 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2023-05-16 06:35:19,643] WARN [RaftManager nodeId=0] Connection to node 3000 (/192.168.25.146:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> I configured the controller on each broker, the file looks like this:
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #    http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
> # This configuration file is intended for use in KRaft mode, where
> # Apache ZooKeeper is not present.  See config/kraft/README.md for details.
> #
> # Server Basics #
>
> # The role of this server. Setting this puts us in KRaft mode
> process.roles=controller
>
> # The node id associated with this instance's roles
> node.id=3000
>
> # The connect string for the controller quorum
> #controller.quorum.voters=3000@localhost:9093
> controller.quorum.voters=3000@192.168.25.146:9093,3001@192.168.25.232:9093,3002@192.168.25.172:9093
>
> # Socket Server Settings #
>
> # The address the socket server listens on.
> # Note that only the controller listeners are allowed here when `process.roles=controller`,
> # and this listener should be consistent with `controller.quorum.voters` value.
> #   FORMAT:
> #     listeners = listener_name://host_name:port
> #   EXAMPLE:
> #     listeners = PLAINTEXT://your.host.name:9092
> listeners=CONTROLLER://:9093
>
> # A comma-separated list of the names of the listeners used by the controller.
> # This is required if running in KRaft mode.
> controller.listener.names=CONTROLLER
>
> # Maps listener names to security protocols, the default is for them to be
> # the same. See the config documentation for more details
> #listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
>
> # The number 

Re: [DISCUSS] Apache Kafka 3.5.0 release

2023-05-06 Thread David Arthur
I resolved these three:
* KAFKA-14840 is merged to trunk and 3.5. I removed the 3.4.1 fix version
* KAFKA-14805 is merged to trunk and 3.5
* KAFKA-14918 is merged to trunk and 3.5

KAFKA-14692 (docs issue) is still not done

Looks like KAFKA-14084 is now resolved as well (it's in trunk and 3.5).

I'll try to find out about KAFKA-14698, I think it's likely a WONTFIX.

-David

On Fri, May 5, 2023 at 10:43 AM Mickael Maison 
wrote:

> Hi David,
>
> Thanks for the update!
> You still own 4 other tickets targeting 3.5: KAFKA-14840, KAFKA-14805,
> KAFKA-14918, KAFKA-14692. Should I move all of them to the next
> release?
> Also KAFKA-14698 and KAFKA-14084 are somewhat related to the
> migration. Should I move them too?
>
> Thanks,
> Mickael
>
> On Fri, May 5, 2023 at 4:27 PM David Arthur
>  wrote:
> >
> > Hey Mickael, my two ZK migration fixes are in 3.5 now.
> >
> > Cheers,
> > David
> >
> > On Fri, May 5, 2023 at 9:37 AM Mickael Maison 
> > wrote:
> >
> > > Hi Divij,
> > >
> > > Some dependencies (ZooKeeper, Snappy, Swagger, zstd, etc) have been
> > > updated since 3.4.
> > > Regarding your PR, I would have been in favor of bringing this to 3.5
> > > a couple of weeks ago, but we're now a week past code freeze for 3.5.
> > > Apart if this fixes CVEs, or significant bugs, I think we should only
> > > merge it in trunk.
> > >
> > > Thanks,
> > > Mickael
> > >
> > > On Fri, May 5, 2023 at 1:49 PM Divij Vaidya 
> > > wrote:
> > > >
> > > > Hey Mickael
> > > >
> > > > Should we consider performing an update of the minor versions of the
> > > > dependencies in 3.5 (per https://github.com/apache/kafka/pull/13673
> )?
> > > >
> > > > --
> > > > Divij Vaidya
> > > >
> > > >
> > > >
> > > > On Tue, May 2, 2023 at 5:48 PM Mickael Maison <
> mickael.mai...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Luke,
> > > > >
> > > > > Yes I think it makes sense to backport both to 3.5.
> > > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Tue, May 2, 2023 at 11:38 AM Luke Chen 
> wrote:
> > > > > >
> > > > > > Hi Mickael,
> > > > > >
> > > > > > There are 1 bug and 1 improvement that I'd like to backport to
> 3.5.
> > > > > > 1. A small improvement for ZK migration based on KAFKA-14840
> > > (mentioned
> > > > > > above in David's mail). PR is already merged to trunk.
> > > > > > https://issues.apache.org/jira/browse/KAFKA-14909
> > > > > >
> > > > > > 2. A bug will cause the KRaft controller node to shut down
> > > unexpectedly.
> > > > > PR
> > > > > > is ready for review.
> > > > > > https://issues.apache.org/jira/browse/KAFKA-14946
> > > > > > https://github.com/apache/kafka/pull/13653
> > > > > >
> > > > > > Thanks.
> > > > > > Luke
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Apr 28, 2023 at 4:18 PM Mickael Maison <
> > > mickael.mai...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Yes you can backport these to 3.5. Let me know when you are
> done.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Thu, Apr 27, 2023 at 9:02 PM David Arthur
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hey Mickael,
> > > > > > > >
> > > > > > > > I have one major ZK migration improvement (KAFKA-14805) that
> > > landed
> > > > > in
> > > > > > > > trunk this week that I'd like to merge to 3.5 (once we fix
> some
> > > test
> > > > > > > > failures it introduced). After that, I have another PR for
> > > > > KAFKA-14840
> > > > > > > > which is essentially a huge bug in the ZK migration logic
> that
> > > needs
> > > > > to
> 

[jira] [Resolved] (KAFKA-14805) KRaft Controller shouldn't allow metadata updates before migration starts

2023-05-06 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-14805.
--
Resolution: Fixed

> KRaft Controller shouldn't allow metadata updates before migration starts
> -
>
> Key: KAFKA-14805
> URL: https://issues.apache.org/jira/browse/KAFKA-14805
> Project: Kafka
>  Issue Type: Sub-task
>  Components: kraft
>    Reporter: David Arthur
>Assignee: David Arthur
>Priority: Critical
> Fix For: 3.5.0
>
>
> When starting a ZK to KRaft migration, the new KRaft quorum should not accept 
> external metadata updates from things like CreateTopics or 
> AllocateProducerIds. Having metadata present in the log prior to the 
> migration can lead to undefined state, which is not great.
> This is currently causing test failures on trunk because some producer is 
> allocating a producer ID between the time the KRaft quorum starts and the 
> migration starts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14840) Handle KRaft snapshots in dual-write mode

2023-05-06 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-14840.
--
Fix Version/s: (was: 3.4.1)
   Resolution: Fixed

> Handle KRaft snapshots in dual-write mode
> -
>
> Key: KAFKA-14840
> URL: https://issues.apache.org/jira/browse/KAFKA-14840
> Project: Kafka
>  Issue Type: Sub-task
>  Components: kraft
>Affects Versions: 3.4.0
>Reporter: David Arthur
>    Assignee: David Arthur
>Priority: Blocker
> Fix For: 3.5.0
>
>
> While the KRaft controller is making writes back to ZK during the migration, 
> we need to handle the case when a snapshot is loaded. This can happen for a 
> number of reasons in KRaft.
> The difficulty here is we will need to compare the loaded snapshot with the 
> entire state in ZK. Most likely, this will be a very expensive operation.
> Without this, dual-write mode cannot safely tolerate a snapshot being loaded, 
> so marking this as a 3.5 blocker.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Apache Kafka 3.5.0 release

2023-05-05 Thread David Arthur
Hey Mickael, my two ZK migration fixes are in 3.5 now.

Cheers,
David

On Fri, May 5, 2023 at 9:37 AM Mickael Maison 
wrote:

> Hi Divij,
>
> Some dependencies (ZooKeeper, Snappy, Swagger, zstd, etc) have been
> updated since 3.4.
> Regarding your PR, I would have been in favor of bringing this to 3.5
> a couple of weeks ago, but we're now a week past code freeze for 3.5.
> Apart if this fixes CVEs, or significant bugs, I think we should only
> merge it in trunk.
>
> Thanks,
> Mickael
>
> On Fri, May 5, 2023 at 1:49 PM Divij Vaidya 
> wrote:
> >
> > Hey Mickael
> >
> > Should we consider performing an update of the minor versions of the
> > dependencies in 3.5 (per https://github.com/apache/kafka/pull/13673)?
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Tue, May 2, 2023 at 5:48 PM Mickael Maison 
> > wrote:
> >
> > > Hi Luke,
> > >
> > > Yes I think it makes sense to backport both to 3.5.
> > >
> > > Thanks,
> > > Mickael
> > >
> > > On Tue, May 2, 2023 at 11:38 AM Luke Chen  wrote:
> > > >
> > > > Hi Mickael,
> > > >
> > > > There are 1 bug and 1 improvement that I'd like to backport to 3.5.
> > > > 1. A small improvement for ZK migration based on KAFKA-14840
> (mentioned
> > > > above in David's mail). PR is already merged to trunk.
> > > > https://issues.apache.org/jira/browse/KAFKA-14909
> > > >
> > > > 2. A bug will cause the KRaft controller node to shut down
> unexpectedly.
> > > PR
> > > > is ready for review.
> > > > https://issues.apache.org/jira/browse/KAFKA-14946
> > > > https://github.com/apache/kafka/pull/13653
> > > >
> > > > Thanks.
> > > > Luke
> > > >
> > > >
> > > >
> > > > On Fri, Apr 28, 2023 at 4:18 PM Mickael Maison <
> mickael.mai...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Yes you can backport these to 3.5. Let me know when you are done.
> > > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Thu, Apr 27, 2023 at 9:02 PM David Arthur
> > > > >  wrote:
> > > > > >
> > > > > > Hey Mickael,
> > > > > >
> > > > > > I have one major ZK migration improvement (KAFKA-14805) that
> landed
> > > in
> > > > > > trunk this week that I'd like to merge to 3.5 (once we fix some
> test
> > > > > > failures it introduced). After that, I have another PR for
> > > KAFKA-14840
> > > > > > which is essentially a huge bug in the ZK migration logic that
> needs
> > > to
> > > > > > land in 3.5.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/KAFKA-14805 (done)
> > > > > > https://issues.apache.org/jira/browse/KAFKA-14840 (nearly done)
> > > > > >
> > > > > > I just wanted to check with you before cherry-picking these to
> 3.5
> > > > > >
> > > > > > David
> > > > > >
> > > > > >
> > > > > > On Mon, Apr 24, 2023 at 1:18 PM Mickael Maison <
> > > mickael.mai...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Justine,
> > > > > > >
> > > > > > > That makes sense. Feel free to revert that commit in 3.5.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Mon, Apr 24, 2023 at 7:16 PM Mickael Maison <
> > > > > mickael.mai...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Josep,
> > > > > > > >
> > > > > > > > Thanks for letting me know!
> > > > > > > >
> > > > > > > > On Mon, Apr 24, 2023 at 6:58 PM Justine Olshan
> > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Hey Mickael,
> > > > > > > > >
> > > > > > > > > I've just o

Re: Question: CI Pipeline in Kafka GitHub Pull Requests & Github Actions Usage

2023-05-04 Thread David Arthur
Hello, Aaron and welcome to the project!

If you look towards the bottom of a Pull Request there will be a section
reporting the status of the Checks for the latest commit.
There is a "Details" link that takes you to the Jenkins job for that PR.

The default Jenkins view is the new UI (called Blue Ocean I think?). The
"Classic" Jenkins view is also available by clicking the button near the
top:

[image: image.png]

From the Classic view it's pretty straightforward to download the console
log for the job to see what went wrong.

Our build steps are defined in the Jenkinsfile checked into the repository.
We do have a compile + static analysis step that happens before
running tests.

-David

On Thu, May 4, 2023 at 5:34 AM aaron ai  wrote:

> Hey folks,
>
> I hope this email finds you well. My name is Aaron Ai, and I'm a beginner
> interested in the Kafka community. I am eager to contribute to Kafka's
> development and be part of this amazing project.
>
> However, I have noticed some confusion surrounding the CI pipeline in Kafka
> GitHub pull requests. It appears that the builds often fail, and the
> results are not very clear to analyse. How can I determine if a pull
> request build is okay for now, given the current CI pipeline situation?
> Furthermore, is it a good idea to use Github Action to help with some
> static checks? If so, I would be more than happy to contribute. By the way,
> is there any plan to migrate to Github Actions in the near future?
>
> I eagerly await your response and appreciate any feedback you might have on
> this matter. Thank you for your time and consideration.
>
> Best regards,
> Aaron
>


-- 
-David


[jira] [Created] (KAFKA-14964) ClientQuotaMetadataManager should not suppress exceptions

2023-05-03 Thread David Arthur (Jira)
David Arthur created KAFKA-14964:


 Summary: ClientQuotaMetadataManager should not suppress exceptions
 Key: KAFKA-14964
 URL: https://issues.apache.org/jira/browse/KAFKA-14964
 Project: Kafka
  Issue Type: Bug
Reporter: David Arthur


As MetadataLoader calls each MetadataPublisher upon receiving new records from 
the controller, it surrounds the call with a try-catch block in order to pass 
exceptions to a FaultHandler. The FaultHandler used by MetadataLoader is 
essential for us to learn about metadata errors on the broker since it 
increments the metadata loader error JMX metric.

ClientQuotaMetadataManager is in the update path for ClientQuota metadata 
updates and is capturing exceptions. This means validation errors (like invalid 
quotas) will not be seen by the FaultHandler, and the JMX metric will not get 
incremented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Adding non-committers as Github collaborators

2023-04-28 Thread David Arthur
Just to clarify, the re-triggering of Jenkins jobs would be controlled via
the "jenkins: github_whitelist:" YAML section which we already made use of.
I definitely agree that we should leverage this more to make it easy for
collaborators to trigger builds.

Regarding the 20 collaborators limit, the ASF Infra team has indicated that
this is an ASF limit. Perhaps if we make it to 20 active collaborators and
need to add more, we could consult with the Infra team to figure out a
solution. In other words, maybe that's a problem for tomorrow :)

Here's what we have currently in trunk:
https://github.com/apache/kafka/blob/trunk/.asf.yaml#LL24-L28

jenkins:
  github_whitelist:
- ConcurrencyPractitioner
- ableegoldman
- cadonna
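
If we do adopt the collaborators feature, the two sections would probably end up
mirroring each other, something like this (the usernames here are placeholders,
not a proposal for who to add):

github:
  collaborators:
    - contributorA
    - contributorB

jenkins:
  github_whitelist:
    - contributorA
    - contributorB

i.e., listing the same people in both places so they get the GitHub triage role and
can also re-trigger Jenkins builds.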


-David

On Fri, Apr 28, 2023 at 12:34 PM Philip Nee  wrote:

> Hi all, opinion from a regular person here: I think being able to
> re-trigger jenkin's test and tagging can be very helpful for a lot of
> times.
>
> On Fri, Apr 28, 2023 at 9:31 AM John Roesler  wrote:
>
> > Hi all,
> >
> > This is a great suggestion! It seems like a really good way to make the
> > Apache Kafka project more efficient in general and also smooth the path
> to
> > committership.
> >
> > I've brought the topic up with the Apache Kafka PMC to consider adopting
> a
> > policy around the "collaborator" rule.
> >
> > In the mean time, it would be great to hear from the broader community
> > what your thoughts are around this capability.
> >
> > Thanks,
> > -John
> >
> > On Fri, Apr 28, 2023, at 10:50, Justine Olshan wrote:
> > > I'm also a bit concerned by the 20 active collaborators rule. How do we
> > > pick the 20 people?
> > >
> > > Justine
> > >
> > > On Fri, Apr 28, 2023 at 8:36 AM Matthias J. Sax 
> > wrote:
> > >
> > >> In general I am +1
> > >>
> > >> The only question I have is about
> > >>
> > >> > You may only have 20 active collaborators at any given time per
> > >> repository.
> > >>
> > >> Not sure if this is a concern or not? I would assume not, but wanted
> to
> > >> bring it to everyone's attention.
> > >>
> > >> There is actually also a way to allow people to re-trigger Jenkins
> jobs:
> > >> https://github.com/apache/kafka/pull/13578
> > >>
> > >> Retriggering test is a little bit more sensitive as our resources are
> > >> limited, and we should avoid overwhelming Jenkins even more.
> > >>
> > >>
> > >> -Matthias
> > >>
> > >>
> > >> On 4/27/23 11:45 AM, David Arthur wrote:
> > >> > Hey folks,
> > >> >
> > >> > I stumbled across this wiki page from the infra team that describes
> > the
> > >> > various features supported in the ".asf.yaml" file:
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
> > >> >
> > >> > One section that looked particularly interesting was
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub
> > >> >
> > >> > github:
> > >> >collaborators:
> > >> >  - userA
> > >> >  - userB
> > >> >
> > >> > This would allow us to define non-committers as collaborators on the
> > >> Github
> > >> > project. Concretely, this means they would receive the "triage"
> Github
> > >> role
> > >> > (defined here
> > >> >
> > >>
> >
> https://docs.github.com/en/organizations/managing-user-access-to-your-organizations-repositories/repository-roles-for-an-organization#permissions-for-each-role
> > >> ).
> > >> > Practically, this means we could let non-committers do things like
> > assign
> > >> > labels and reviewers on Pull Requests.
> > >> >
> > >> > I wanted to see what the committer group thought about this
> feature. I
> > >> > think it could be useful.
> > >> >
> > >> > Cheers,
> > >> > David
> > >> >
> > >>
> >
>


-- 
-David


Re: [DISCUSS] Recommendations for managing long-running projects

2023-04-28 Thread David Arthur
Kirk,
1) I would check out the project management features built into GitHub,
e.g., labels, milestones, and projects
https://docs.github.com/en/issues/planning-and-tracking-with-projects/learning-about-projects/about-projects
.

2) Ultimately, the contributions will need to follow the normal PR
workflow. A committer will have to review and approve the changes. For big
changes, it's probably best to get more than one committer to approve. Do
you expect the changes to come in incrementally, or as a few large patches?
For major efforts in the past, we have taken the approach of using a
non-trunk feature branch where we land and stabilize new code. The downside
here is the effort involved in integrating new changes from upstream. If
it's mostly greenfield work (i.e., new classes), this might not be a big
problem. Another downside of this approach is the effort involved in
reviewing a massive PR to trunk that is bringing in a large code base.

Here's a possible workflow:
* Create a feature branch on a fork
* Non-committers can commit directly to this branch, or use a PR workflow
using the feature branch as a base
* Stabilize the feature branch
* Break the changeset into a few sensible PRs for merging into trunk. This
could be something like interfaces and configs first followed by
implementation and tests.


To answer this more directly
> Is there a recommended path to collaborate for non-committers?
The normal collaboration path for non-committers is to submit Pull Requests
against trunk. Non-committers can review PRs, they just can't merge them
or +1 them.

HTH,
David

On Thu, Apr 27, 2023 at 8:15 PM Kirk True  wrote:

> Hi all,
>
> A handful of engineers are collaborating on a fairly sizable project to
> improve the Java consumer client [1]. We are using as many ASF tools as
> possible for the work (wiki, Jira, mailing list, and Slack thus far).
>
> There are yet two areas where we need recommendations:
>
> 1. Project management tools. What is the recommended tool for
> communicating project scheduling, milestones, etc.?
>
> 2. Shared code collaboration. Since none of the engineers on the project
> are committers, we can't collaborate by reviewing and merging our changes
> into trunk. Is there a recommended path to collaborate for non-committers?
>
> Thanks,
> Kirk
>
> [1]
> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+project+overview



-- 
-David


Re: [DISCUSS] Apache Kafka 3.5.0 release

2023-04-27 Thread David Arthur
Hey Mickael,

I have one major ZK migration improvement (KAFKA-14805) that landed in
trunk this week that I'd like to merge to 3.5 (once we fix some test
failures it introduced). After that, I have another PR for KAFKA-14840
which is essentially a huge bug in the ZK migration logic that needs to
land in 3.5.

https://issues.apache.org/jira/browse/KAFKA-14805 (done)
https://issues.apache.org/jira/browse/KAFKA-14840 (nearly done)

I just wanted to check with you before cherry-picking these to 3.5

David


On Mon, Apr 24, 2023 at 1:18 PM Mickael Maison 
wrote:

> Hi Justine,
>
> That makes sense. Feel free to revert that commit in 3.5.
>
> Thanks,
> Mickael
>
> On Mon, Apr 24, 2023 at 7:16 PM Mickael Maison 
> wrote:
> >
> > Hi Josep,
> >
> > Thanks for letting me know!
> >
> > On Mon, Apr 24, 2023 at 6:58 PM Justine Olshan
>  wrote:
> > >
> > > Hey Mickael,
> > >
> > > I've just opened a blocker to revert KAFKA-14561 in 3.5. There are a
> few
> > > blocker bugs that I don't think I can fix before the code freeze, so I
> > > think for the quality of the release, we should just revert the commit.
> > >
> > > Thanks,
> > > Justine
> > >
> > > On Fri, Apr 21, 2023 at 1:23 PM Josep Prat  >
> > > wrote:
> > >
> > > > Hi Mickael,
> > > >
> > > > Greg Harris managed to fix a flaky test in
> > > > https://github.com/apache/kafka/pull/13575, I cherry-picked it to
> the 3.5
> > > > (and 3.4) branch. I updated the Jira to reflect that is now fixed on
> 3.5.0
> > > > as well as 3.6.0.
> > > > Let me know if I forgot anything.
> > > >
> > > > Best,
> > > >
> > > > On Fri, Apr 21, 2023 at 3:44 PM Mickael Maison <
> mickael.mai...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Just a quick reminder that code freeze is next week.
> > > > > We still have 27 JIRAs targeting 3.5 [0] including quite a few bugs
> > > > > and flaky test issues opened recently. If you have time, take one
> of
> > > > > these items or help with the reviews.
> > > > >
> > > > > I'll send another update next once we've entered code freeze.
> > > > >
> > > > > 0:
> > > > >
> > > >
> https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%203.5.0%20AND%20status%20not%20in%20(resolved%2C%20closed)%20ORDER%20BY%20priority%20DESC%2C%20status%20DESC%2C%20updated%20DESC
> > > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Thu, Apr 20, 2023 at 9:14 PM Mickael Maison <
> mickael.mai...@gmail.com
> > > > >
> > > > > wrote:
> > > > > >
> > > > > > Hi Ron,
> > > > > >
> > > > > > Yes feel free to merge that fix. Thanks for letting me know!
> > > > > >
> > > > > > Mickael
> > > > > >
> > > > > > On Thu, Apr 20, 2023 at 8:15 PM Ron Dagostino  >
> > > > wrote:
> > > > > > >
> > > > > > > Hi Mickael.  I would like to merge
> > > > > > > https://github.com/apache/kafka/pull/13532 (KAFKA-14887: No
> shutdown
> > > > > > > for ZK session expiration in feature processing) to the 3.5
> branch.
> > > > > > > It is a very small and focused fix that can cause unexpected
> broker
> > > > > > > shutdowns when there is instability in the connectivity to
> ZooKeeper.
> > > > > > > The risk is very low.
> > > > > > >
> > > > > > > Ron
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Apr 18, 2023 at 9:57 AM Mickael Maison <
> > > > > mickael.mai...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi David,
> > > > > > > >
> > > > > > > > Thanks for the update. I've marked KAFKA-14869 as fixed in
> 3.5.0, I
> > > > > > > > guess you'll only resolve this ticket once you merge the
> backports
> > > > to
> > > > > > > > earlier branches. The ticket will have to be resolved to run
> the
> > > > > > > > release but that should leave you enough time.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mickael
> > > > > > > >
> > > > > > > > On Tue, Apr 18, 2023 at 3:42 PM David Jacot
> > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Hi Mickael,
> > > > > > > > >
> > > > > > > > > FYI - I just merged the two PRs for KIP-915 to trunk/3.5.
> We are
> > > > > all good.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > On Mon, Apr 17, 2023 at 5:10 PM Mickael Maison <
> > > > > mickael.mai...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Chris,
> > > > > > > > > >
> > > > > > > > > > I was looking at that just now! As you said, the PRs
> merged
> > > > > provide
> > > > > > > > > > some functionality so I think it's fine to deliver the
> KIP
> > > > > across 2
> > > > > > > > > > releases.
> > > > > > > > > > I left a comment in
> > > > > https://issues.apache.org/jira/browse/KAFKA-14876
> > > > > > > > > > to document what's in 3.5.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Mickael
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Apr 17, 2023 at 5:05 PM Chris Egerton
> > > > > 
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Mickael,
> > > > > > > 

[DISCUSS] Adding non-committers as Github collaborators

2023-04-27 Thread David Arthur
Hey folks,

I stumbled across this wiki page from the infra team that describes the
various features supported in the ".asf.yaml" file:
https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features

One section that looked particularly interesting was
https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub

github:
  collaborators:
- userA
- userB

This would allow us to define non-committers as collaborators on the Github
project. Concretely, this means they would receive the "triage" Github role
(defined here
https://docs.github.com/en/organizations/managing-user-access-to-your-organizations-repositories/repository-roles-for-an-organization#permissions-for-each-role).
Practically, this means we could let non-committers do things like assign
labels and reviewers on Pull Requests.

I wanted to see what the committer group thought about this feature. I
think it could be useful.

Cheers,
David


Re: Adding reviewers with Github actions

2023-04-27 Thread David Arthur
I just merged the "reviewers" script I wrote a while ago:
https://github.com/apache/kafka/pull/11096

It works by finding previous occurrences of "Reviewers: ...", so it only
works for people who have reviewed something before. I do suspect this is
largely the common case.

E.g., searching for "Ismael" gives:

Possible matches (in order of most recent):
[1] Ismael Juma ism...@juma.me.uk (1514)
[2] Ismael Juma ij...@apache.org (3)
[3] Ismael Juma mli...@juma.me.uk (4)
[4] Ismael Juma ism...@confluent.io (19)
[5] Ismael Juma git...@juma.me.uk (7)

It shows them in order of most recent occurrence, along with the number of
occurrences. Now that it's merged, it should be easier for folks to try it
out.
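
For anyone curious about the approach without opening the PR, here is a
minimal, illustrative sketch of the idea (not the committed script; the
function and variable names are made up for this example). It scans the git
log for "Reviewers:" trailers, then ranks the matching entries by most
recent occurrence along with how often each appears:

#!/usr/bin/env python3
# Illustrative sketch only: scan commit messages for "Reviewers:" trailers
# and rank the entries that match a search string.
import re
import subprocess
from collections import Counter

def find_reviewers(query):
    # git log prints newest commits first, so the first hit per reviewer
    # is also the most recent one.
    log = subprocess.run(
        ["git", "log", "--format=%B"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter()
    first_seen = {}
    for position, line in enumerate(log.splitlines()):
        match = re.match(r"\s*Reviewers:\s*(.+)", line)
        if not match:
            continue
        for reviewer in match.group(1).split(","):
            reviewer = reviewer.strip()
            if query.lower() in reviewer.lower():
                counts[reviewer] += 1
                first_seen.setdefault(reviewer, position)
    # Most recently seen first; ties broken by number of occurrences.
    return sorted(counts.items(), key=lambda kv: (first_seen[kv[0]], -kv[1]))

if __name__ == "__main__":
    for index, (reviewer, count) in enumerate(find_reviewers("Ismael"), 1):
        print(f"[{index}] {reviewer} ({count})")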

Cheers,
David

On Thu, Apr 20, 2023 at 1:02 PM Justine Olshan 
wrote:

> I've tried the script, but it's not quite complete.
> I've had issues finding folks -- if they haven't reviewed in Kafka, we
> cannot find an email for them. I also had some issues with finding folks who
> had reviewed before.
>
> Right now, my strategy is to use GitHub to search previous commits for
> folks' emails, but that isn't an ideal solution -- especially if the
> reviewer has no public email.
> I do think it is useful to have in the commit though, so if anyone has some
> ideas on how to improve, I'd be happy to hear.
>
> Justine
>
> On Wed, Apr 19, 2023 at 6:53 AM Ismael Juma  wrote:
>
> > It's a lot more convenient to have it in the commit than having to follow
> > links, etc.
> >
> > David Arthur also wrote a script to help with this step, I believe.
> >
> > Ismael
> >
> > On Tue, Apr 18, 2023, 9:29 AM Divij Vaidya 
> > wrote:
> >
> > > Do we even need a manual attribution for a reviewer in the commit
> > message?
> > > GitHub automatically marks the folks as "reviewers" who have used the
> > > "review-changes" button on the top left corner and left feedback.
> GitHub
> > > also has searchability for such reviews done by a particular person
> using
> > > the following link:
> > >
> > > https://github.com/search?q=is%3Apr+reviewed-by%3A
> > >
> >
> +repo%3Aapache%2Fkafka+repo%3Aapache%2Fkafka-site=issues
> > >
> > > (replace  with the GitHub username)
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Tue, Apr 18, 2023 at 4:09 PM Viktor Somogyi-Vass
> > >  wrote:
> > >
> > > > I'm not that familiar with Actions either, it just seemed like a tool
> > for
> > > > this purpose. :)
> > > > I did some digging and what I have in mind is that on pull request
> > review
> > > > it can trigger a workflow:
> > > >
> > > >
> > >
> >
> https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_review
> > > >
> > > > We could in theory use Github CLI to edit the description of the PR
> > when
> > > > someone gives a review (or we could perhaps enable this to simply
> > comment
> > > > too):
> > > >
> > > >
> > >
> >
> https://docs.github.com/en/actions/using-workflows/using-github-cli-in-workflows
> > > >
> > > > So the action definition would look something like this below. Note
> > that
> > > > the "run" part is very basic, it's just here for the idea. We'll
> > probably
> > > > need a shell script instead of that line to format it better. But the
> > > point
> > > > is that it edits the PR and adds the reviewer:
> > > >
> > > > name: Add reviewer
> > > > on:
> > > >   pull_request_review:
> > > >     types:
> > > >       - submitted
> > > > jobs:
> > > >   comment:
> > > >     runs-on: ubuntu-latest
> > > >     steps:
> > > >       - run: gh pr edit $PR_ID --title "$PR_TITLE" --body "$PR_BODY\n\nReviewers: $SENDER"
> > > >         env:
> > > >           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
> > > >           PR_ID: ${{ github.event.pull_request.number }}
> > > >           PR_TITLE: ${{ github.event.pull_request.title }}
> > > >           PR_BODY: ${{ github.event.pull_request.body }}
> > > >           SENDER: ${{ github.event.sender.login }}
> > > >
> > > > I'll take a look if I can try this out on my fork and get back if it
> > > leads
> > > > to anything.
> > > >
> > > > Viktor
> > > >
> > >

[jira] [Created] (KAFKA-14939) Only expose ZkMigrationState metric if metadata.version supports it

2023-04-26 Thread David Arthur (Jira)
David Arthur created KAFKA-14939:


 Summary: Only expose ZkMigrationState metric if metadata.version 
supports it
 Key: KAFKA-14939
 URL: https://issues.apache.org/jira/browse/KAFKA-14939
 Project: Kafka
  Issue Type: Sub-task
Affects Versions: 3.5.0
Reporter: David Arthur


We should only expose the KafkaController.ZkMigrationState JMX metric if the 
cluster is running on a metadata.version that supports migrations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14934) KafkaClusterTestKit makes FaultHandler accessible

2023-04-25 Thread David Arthur (Jira)
David Arthur created KAFKA-14934:


 Summary: KafkaClusterTestKit makes FaultHandler accessible
 Key: KAFKA-14934
 URL: https://issues.apache.org/jira/browse/KAFKA-14934
 Project: Kafka
  Issue Type: Improvement
  Components: unit tests
Reporter: David Arthur


In KafkaClusterTestKit, we use a mock fault handler to avoid exiting the 
process during tests. It would be useful to expose this fault handler so tests 
could verify certain fault conditions (like a broker/controller failing to 
start)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14918) KRaft controller sending ZK controller RPCs to KRaft brokers

2023-04-18 Thread David Arthur (Jira)
David Arthur created KAFKA-14918:


 Summary: KRaft controller sending ZK controller RPCs to KRaft 
brokers
 Key: KAFKA-14918
 URL: https://issues.apache.org/jira/browse/KAFKA-14918
 Project: Kafka
  Issue Type: Sub-task
Reporter: David Arthur
Assignee: David Arthur


During the migration, when upgrading a ZK broker to KRaft, the controller is 
incorrectly sending UpdateMetadata requests to the KRaft controller. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14796) Migrate ZK ACLs to KRaft

2023-04-17 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-14796.
--
Resolution: Fixed

Removed the 3.4.1 fix version since we're probably not back-porting this.

> Migrate ZK ACLs to KRaft
> 
>
> Key: KAFKA-14796
> URL: https://issues.apache.org/jira/browse/KAFKA-14796
> Project: Kafka
>  Issue Type: Sub-task
>  Components: kraft
>    Reporter: David Arthur
>Assignee: David Arthur
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

