Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #16

2023-06-01 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 468240 lines...]
[2023-06-02T05:03:47.068Z] 
[2023-06-02T05:03:47.068Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldFailWithIllegalArgumentExceptionWhenIQPartitionerReturnsMultiplePartitions()
 STARTED
[2023-06-02T05:03:48.094Z] 
[2023-06-02T05:03:48.094Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldFailWithIllegalArgumentExceptionWhenIQPartitionerReturnsMultiplePartitions()
 PASSED
[2023-06-02T05:03:48.094Z] 
[2023-06-02T05:03:48.094Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQueryAllStalePartitionStores() STARTED
[2023-06-02T05:03:52.346Z] 
[2023-06-02T05:03:52.346Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQueryAllStalePartitionStores() PASSED
[2023-06-02T05:03:52.346Z] 
[2023-06-02T05:03:52.346Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreads() STARTED
[2023-06-02T05:03:56.709Z] 
[2023-06-02T05:03:56.709Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreads() PASSED
[2023-06-02T05:03:56.709Z] 
[2023-06-02T05:03:56.709Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStores() STARTED
[2023-06-02T05:04:01.324Z] 
[2023-06-02T05:04:01.324Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStores() PASSED
[2023-06-02T05:04:01.324Z] 
[2023-06-02T05:04:01.324Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQueryOnlyActivePartitionStoresByDefault() STARTED
[2023-06-02T05:04:05.080Z] 
[2023-06-02T05:04:05.080Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQueryOnlyActivePartitionStoresByDefault() PASSED
[2023-06-02T05:04:05.080Z] 
[2023-06-02T05:04:05.080Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQueryStoresAfterAddingAndRemovingStreamThread() STARTED
[2023-06-02T05:04:10.313Z] 
[2023-06-02T05:04:10.313Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > HighAvailabilityTaskAssignorIntegrationTest > 
shouldScaleOutWithWarmupTasksAndInMemoryStores(TestInfo) PASSED
[2023-06-02T05:04:10.313Z] 
[2023-06-02T05:04:10.313Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > KStreamAggregationDedupIntegrationTest > 
shouldReduce(TestInfo) STARTED
[2023-06-02T05:04:11.756Z] 
[2023-06-02T05:04:11.756Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQueryStoresAfterAddingAndRemovingStreamThread() PASSED
[2023-06-02T05:04:11.756Z] 
[2023-06-02T05:04:11.756Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreadsNamedTopology() STARTED
[2023-06-02T05:04:13.870Z] 
[2023-06-02T05:04:13.870Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > KStreamAggregationDedupIntegrationTest > 
shouldReduce(TestInfo) PASSED
[2023-06-02T05:04:13.870Z] 
[2023-06-02T05:04:13.870Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > KStreamAggregationDedupIntegrationTest > 
shouldGroupByKey(TestInfo) STARTED
[2023-06-02T05:04:15.513Z] 
[2023-06-02T05:04:15.513Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 178 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreadsNamedTopology() PASSED
[2023-06-02T05:04:16.709Z] streams-0: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-2: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-3: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-6: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-3: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-7: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-1: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-2: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-5: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-1: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-0: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-5: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-4: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:16.709Z] streams-4: SMOKE-TEST-CLIENT-CLOSED
[2023-06-02T05:04:18.462Z] 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #15

2023-06-01 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 468828 lines...]
[2023-06-02T02:13:49.444Z] 
[2023-06-02T02:13:49.444Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KStreamAggregationDedupIntegrationTest > 
shouldReduceWindowed(TestInfo) STARTED
[2023-06-02T02:13:49.975Z] 
[2023-06-02T02:13:49.975Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 180 > SmokeTestDriverIntegrationTest > 
shouldWorkWithRebalance(boolean) > 
org.apache.kafka.streams.integration.SmokeTestDriverIntegrationTest.shouldWorkWithRebalance(boolean)[1]
 STARTED
[2023-06-02T02:13:54.479Z] 
[2023-06-02T02:13:54.479Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KStreamAggregationDedupIntegrationTest > 
shouldReduceWindowed(TestInfo) PASSED
[2023-06-02T02:13:57.116Z] 
[2023-06-02T02:13:57.116Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KStreamKStreamIntegrationTest > shouldOuterJoin() STARTED
[2023-06-02T02:14:08.132Z] 
[2023-06-02T02:14:08.132Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KStreamKStreamIntegrationTest > shouldOuterJoin() PASSED
[2023-06-02T02:14:10.102Z] 
[2023-06-02T02:14:10.102Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicNotWrittenToDuringRestoration() STARTED
[2023-06-02T02:14:15.536Z] 
[2023-06-02T02:14:15.536Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicNotWrittenToDuringRestoration() PASSED
[2023-06-02T02:14:15.536Z] 
[2023-06-02T02:14:15.536Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosAlphaEnabled()
 STARTED
[2023-06-02T02:14:20.018Z] 
[2023-06-02T02:14:20.018Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosAlphaEnabled()
 PASSED
[2023-06-02T02:14:20.018Z] 
[2023-06-02T02:14:20.018Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosDisabled() 
STARTED
[2023-06-02T02:14:25.176Z] 
[2023-06-02T02:14:25.176Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosDisabled() 
PASSED
[2023-06-02T02:14:25.176Z] 
[2023-06-02T02:14:25.176Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosV2Enabled() 
STARTED
[2023-06-02T02:14:29.470Z] 
[2023-06-02T02:14:29.470Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > KTableSourceTopicRestartIntegrationTest > 
shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosV2Enabled() 
PASSED
[2023-06-02T02:14:32.754Z] 
[2023-06-02T02:14:32.754Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > RestoreIntegrationTest > 
shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(boolean)[1]
 STARTED
[2023-06-02T02:15:25.820Z] 
[2023-06-02T02:15:25.820Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > RestoreIntegrationTest > 
shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(boolean)[1]
 PASSED
[2023-06-02T02:15:25.820Z] 
[2023-06-02T02:15:25.820Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 182 > RestoreIntegrationTest > 
shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(boolean)[2]
 STARTED
[2023-06-02T02:15:58.776Z] 
[2023-06-02T02:15:58.776Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 180 > SmokeTestDriverIntegrationTest > 
shouldWorkWithRebalance(boolean) > 
org.apache.kafka.streams.integration.SmokeTestDriverIntegrationTest.shouldWorkWithRebalance(boolean)[1]
 PASSED
[2023-06-02T02:15:58.776Z] 
[2023-06-02T02:15:58.776Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 180 > SmokeTestDriverIntegrationTest > 
shouldWorkWithRebalance(boolean) > 

[jira] [Created] (KAFKA-15048) Improve handling of non-fatal quorum controller errors

2023-06-01 Thread Colin McCabe (Jira)
Colin McCabe created KAFKA-15048:


 Summary: Improve handling of non-fatal quorum controller errors
 Key: KAFKA-15048
 URL: https://issues.apache.org/jira/browse/KAFKA-15048
 Project: Kafka
  Issue Type: Bug
Reporter: Colin McCabe






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

2023-06-01 Thread Colin McCabe
Hi all,

I posted a KIP to add some more metrics for measuring KRaft performance. Take a 
look at: https://cwiki.apache.org/confluence/x/gBU0Dw

best,
Colin


Re: [VOTE] 3.5.0 RC0

2023-06-01 Thread Colin McCabe
Hi Mickael,

Can you start the new RC tomorrow? There's one last PR we'd like to get in.

If we can't get it in by tomorrow then let's go ahead anyway.

Thanks very much,
Colin


On Thu, Jun 1, 2023, at 14:15, Mickael Maison wrote:
> Hi David,
>
> The PR you mentioned is merged now. Can I start working on a new RC or is
> there more work needed? It seems the associated ticket is still open.
>
> Thanks,
> Mickael
>
> On Wed, 31 May 2023, 23:52 Justine Olshan, 
> wrote:
>
>> Hey Mickael --
>> This is done. Thanks!
>>
>> On Wed, May 31, 2023 at 11:24 AM Mickael Maison 
>> wrote:
>>
>> > Hi Justine,
>> >
>> > Yes you can merge that into 3.5.
>> >
>> > Thanks,
>> > Mickael
>> >
>> > On Wed, May 31, 2023 at 7:56 PM Justine Olshan
>> >  wrote:
>> > >
>> > > FYI -- I just saw this PR regarding a dependency for ARM. We may want
>> to
>> > > get this in for 3.5 as well. It should be quick.
>> > >
>> > > https://issues.apache.org/jira/browse/KAFKA-15044
>> > > https://github.com/apache/kafka/pull/13786
>> > >
>> > > Justine
>> > >
>> > > On Wed, May 31, 2023 at 9:28 AM David Arthur
>> > >  wrote:
>> > >
>> > > > Mickael,
>> > > >
>> > > > Colin has approved my patch for KAFKA-15010, I'm just waiting on a
>> > build
>> > > > before merging. I'll go ahead and backport the other fixes that need
>> to
>> > > > precede this one into 3.5.
>> > > >
>> > > > -David
>> > > >
>> > > > On Wed, May 31, 2023 at 11:52 AM Mickael Maison <
>> > mickael.mai...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > The issue mentioned by Greg has been fixed. As soon as the fix for
>> > > > > KAFKA-15010 is merged I'll build another RC.
>> > > > >
>> > > > > Thanks,
>> > > > > Mickael
>> > > > >
>> > > > > On Tue, May 30, 2023 at 10:33 AM Mickael Maison
>> > > > >  wrote:
>> > > > > >
>> > > > > > Hi David,
>> > > > > >
>> > > > > > Feel free to backport the necessary fixes to 3.5.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Mickael
>> > > > > >
>> > > > > > On Tue, May 30, 2023 at 10:32 AM Mickael Maison
>> > > > > >  wrote:
>> > > > > > >
>> > > > > > > Hi Greg,
>> > > > > > >
>> > > > > > > Thanks for the heads up, this indeed looks like something we
>> > want in
>> > > > > > > 3.5. I've replied in the PR.
>> > > > > > >
>> > > > > > > Mickael
>> > > > > > >
>> > > > > > > On Sat, May 27, 2023 at 11:44 PM David Arthur
>> > > > > > >  wrote:
>> > > > > > > >
>> > > > > > > > Mickael, after looking more closely, I definitely think
>> > KAFKA-15010
>> > > > > is a
>> > > > > > > > blocker. It creates the case where the controller can totally
>> > miss
>> > > > a
>> > > > > > > > metadata update and not write it back to ZK. Since things
>> like
>> > > > > dynamic
>> > > > > > > > configs and ACLs are only read from ZK by the ZK brokers, we
>> > could
>> > > > > have
>> > > > > > > > significant problems while the brokers are being migrated
>> (when
>> > > > some
>> > > > > are
>> > > > > > > > KRaft and some are ZK). E.g., ZK brokers could be totally
>> > unaware
>> > > > of
>> > > > > an ACL
>> > > > > > > > change while the KRaft brokers have it. I have a fix ready
>> here
>> > > > > > > > https://github.com/apache/kafka/pull/13758. I think we can
>> > get it
>> > > > > committed
>> > > > > > > > soon.
>> > > > > > > >
>> > > > > > > > Another blocker is KAFKA-15004 which was just merged to
>> trunk.
>> > This
>> > > > > is
>> > > > > > > > another dual-write bug where new topic/broker configs will
>> not
>> > be
>> > > > > written
>> > > > > > > > back to ZK by the controller.
>> > > > > > > >
>> > > > > > > > The fix for KAFKA-15010 has a few dependencies on fixes we
>> made
>> > > > this
>> > > > > past
>> > > > > > > > week, so we'll need to cherry-pick a few commits. The changes
>> > are
>> > > > > totally
>> > > > > > > > contained within the migration area of code, so I think the
>> > risk in
>> > > > > > > > including them is fairly low.
>> > > > > > > >
>> > > > > > > > -David
>> > > > > > > >
>> > > > > > > > On Thu, May 25, 2023 at 2:15 PM Greg Harris
>> > > > > 
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hey all,
>> > > > > > > > >
>> > > > > > > > > A contributor just pointed out a small but noticeable flaw
>> > in the
>> > > > > > > > > implementation of KIP-581
>> > > > > > > > >
>> > > > > > > > >
>> > > > >
>> > > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-581%3A+Value+of+optional+null+field+which+has+default+value
>> > > > > > > > > which is planned for this release.
>> > > > > > > > > Impact: the feature works for root values in a record, but
>> > does
>> > > > not
>> > > > > > > > > work for any fields within structs. Fields within structs
>> > will
>> > > > > > > > > continue to have their previous, backwards-compatible
>> > behavior.
>> > > > > > > > > The contributor has submitted a bug-fix PR which reports
>> the
>> > > > > problem
>> > > > > > > > > and does not yet have a merge-able solution, but they are
>> > > > actively
>> > > > > > > > > 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #14

2023-06-01 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 562608 lines...]
[2023-06-01T22:04:02.871Z] > Task :connect:api:publishToMavenLocal
[2023-06-01T22:04:04.589Z] > Task :streams:javadoc
[2023-06-01T22:04:05.505Z] > Task :streams:javadocJar
[2023-06-01T22:04:07.265Z] 
[2023-06-01T22:04:07.265Z] > Task :clients:javadoc
[2023-06-01T22:04:07.265Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.5@2/clients/src/main/java/org/apache/kafka/clients/admin/ScramMechanism.java:32:
 warning - Tag @see: missing final '>': "https://cwiki.apache.org/confluence/display/KAFKA/KIP-554%3A+Add+Broker-side+SCRAM+Config+API;>KIP-554:
 Add Broker-side SCRAM Config API
[2023-06-01T22:04:07.265Z] 
[2023-06-01T22:04:07.265Z]  This code is duplicated in 
org.apache.kafka.common.security.scram.internals.ScramMechanism.
[2023-06-01T22:04:07.265Z]  The type field in both files must match and must 
not change. The type field
[2023-06-01T22:04:07.265Z]  is used both for passing ScramCredentialUpsertion 
and for the internal
[2023-06-01T22:04:07.265Z]  UserScramCredentialRecord. Do not change the type 
field."
[2023-06-01T22:04:08.063Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.5@2/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/secured/package-info.java:21:
 warning - Tag @link: reference not found: 
org.apache.kafka.common.security.oauthbearer
[2023-06-01T22:04:08.979Z] 2 warnings
[2023-06-01T22:04:09.896Z] 
[2023-06-01T22:04:09.896Z] > Task :clients:javadocJar
[2023-06-01T22:04:10.813Z] > Task :clients:srcJar
[2023-06-01T22:04:11.729Z] > Task :clients:testJar
[2023-06-01T22:04:12.645Z] > Task :clients:testSrcJar
[2023-06-01T22:04:12.645Z] > Task 
:clients:publishMavenJavaPublicationToMavenLocal
[2023-06-01T22:04:12.645Z] > Task :clients:publishToMavenLocal
[2023-06-01T22:04:27.061Z] > Task :core:compileScala
[2023-06-01T22:05:24.468Z] > Task :core:classes
[2023-06-01T22:05:24.468Z] > Task :core:compileTestJava NO-SOURCE
[2023-06-01T22:05:54.656Z] > Task :core:compileTestScala
[2023-06-01T22:06:43.627Z] > Task :core:testClasses
[2023-06-01T22:06:43.627Z] > Task :streams:compileTestJava UP-TO-DATE
[2023-06-01T22:06:43.627Z] > Task :streams:testClasses UP-TO-DATE
[2023-06-01T22:06:43.627Z] > Task :streams:testJar
[2023-06-01T22:06:43.627Z] > Task :streams:testSrcJar
[2023-06-01T22:06:43.627Z] > Task 
:streams:publishMavenJavaPublicationToMavenLocal
[2023-06-01T22:06:43.627Z] > Task :streams:publishToMavenLocal
[2023-06-01T22:06:43.627Z] 
[2023-06-01T22:06:43.627Z] Deprecated Gradle features were used in this build, 
making it incompatible with Gradle 9.0.
[2023-06-01T22:06:43.627Z] 
[2023-06-01T22:06:43.627Z] You can use '--warning-mode all' to show the 
individual deprecation warnings and determine if they come from your own 
scripts or plugins.
[2023-06-01T22:06:43.627Z] 
[2023-06-01T22:06:43.627Z] See 
https://docs.gradle.org/8.0.2/userguide/command_line_interface.html#sec:command_line_warnings
[2023-06-01T22:06:43.627Z] 
[2023-06-01T22:06:43.627Z] BUILD SUCCESSFUL in 3m 3s
[2023-06-01T22:06:43.627Z] 89 actionable tasks: 33 executed, 56 up-to-date
[Pipeline] sh
[2023-06-01T22:06:46.231Z] + grep ^version= gradle.properties
[2023-06-01T22:06:46.231Z] + cut -d= -f 2
[Pipeline] dir
[2023-06-01T22:06:46.996Z] Running in 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_3.5@2/streams/quickstart
[Pipeline] {
[Pipeline] sh
[2023-06-01T22:06:48.999Z] + mvn clean install -Dgpg.skip
[2023-06-01T22:06:51.753Z] [INFO] Scanning for projects...
[2023-06-01T22:06:51.753Z] [INFO] 

[2023-06-01T22:06:51.753Z] [INFO] Reactor Build Order:
[2023-06-01T22:06:51.753Z] [INFO] 
[2023-06-01T22:06:51.753Z] [INFO] Kafka Streams :: Quickstart   
 [pom]
[2023-06-01T22:06:51.753Z] [INFO] streams-quickstart-java   
 [maven-archetype]
[2023-06-01T22:06:51.753Z] [INFO] 
[2023-06-01T22:06:51.753Z] [INFO] < 
org.apache.kafka:streams-quickstart >-
[2023-06-01T22:06:51.753Z] [INFO] Building Kafka Streams :: Quickstart 
3.5.0-SNAPSHOT[1/2]
[2023-06-01T22:06:51.753Z] [INFO]   from pom.xml
[2023-06-01T22:06:51.753Z] [INFO] [ pom 
]-
[2023-06-01T22:06:52.672Z] [INFO] 
[2023-06-01T22:06:52.672Z] [INFO] --- clean:3.0.0:clean (default-clean) @ 
streams-quickstart ---
[2023-06-01T22:06:52.672Z] [INFO] 
[2023-06-01T22:06:52.672Z] [INFO] --- remote-resources:1.5:process 
(process-resource-bundles) @ streams-quickstart ---
[2023-06-01T22:06:54.396Z] [INFO] 
[2023-06-01T22:06:54.396Z] [INFO] --- site:3.5.1:attach-descriptor 
(attach-descriptor) @ streams-quickstart ---
[2023-06-01T22:06:56.120Z] [INFO] 
[2023-06-01T22:06:56.120Z] [INFO] --- gpg:1.6:sign (sign-artifacts) @ 
streams-quickstart ---

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

2023-06-01 Thread Walker Carlson
Hey Bruno thanks for the feedback.

1)
I will add this to the KIP, but stream time only advances when the buffer
receives a new record.

2)
You are correct; I will add a failure section to the KIP. Since the records
won't change in the buffer from when they are read from the topic, they are
already replicated.

3)
I see that I'm outvoted on dropping late records. We will pass them on and
try to join them if possible. This might cause some null results, but
increasing the table history retention should help with that.

4)
I can add some detail to the KIP, but it's pretty much directly adding
whatever the grace period is to the latency. I don't see a way around it.
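To make this concrete, here is a rough sketch of how the DSL usage could end
up looking, assuming the final API resembles the Joined#withGracePeriod(Duration)
option discussed in this thread (that method name, and the versioned-store
setup, are assumptions rather than the agreed design):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class GracePeriodJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Table side backed by a versioned store; its history retention bounds
        // how far back the buffered (delayed) stream records can still match.
        KTable<String, String> customers = builder.table(
            "customers",
            Consumed.with(Serdes.String(), Serdes.String()),
            Materialized.<String, String>as(
                Stores.persistentVersionedKeyValueStore("customers-store",
                    Duration.ofMinutes(10))));

        KStream<String, String> orders = builder.stream(
            "orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Stream records are held in the buffer for up to the grace period
        // before the lookup, so the join sees the table "as of" each record's
        // timestamp.
        KStream<String, String> enriched = orders.join(
            customers,
            (order, customer) -> order + " -> " + customer,
            Joined.<String, String, String>with(
                    Serdes.String(), Serdes.String(), Serdes.String())
                // withGracePeriod is the API proposed by KIP-923; it is an
                // assumption here, not a shipped method at the time of writing.
                .withGracePeriod(Duration.ofMinutes(1)));

        enriched.to("enriched-orders");
        builder.build();
    }
}

In this sketch the grace period is exactly what gets added to end-to-end
latency, and the table's history retention is what bounds how far back those
delayed lookups can still resolve.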

Walker

On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna  wrote:

> Hi Walker,
>
> thanks for the KIP!
>
> Here my feedback:
>
> 1.
> It is still not clear to me when stream time for the buffer advances.
> What is the event that lets the stream time advance? In the discussion, I
> do not understand what you mean by "The segment store already has an
> observed stream time, we advance based on that. That should only advance
> based on records that enter the store." Where does this segment store
> come from? Anyways, I think it would be great to also state how stream
> time advances in the KIP.
>
> 2.
> How does the buffer behave in case of a failure? I think I understand
> that the buffer will use an implementation of TimeOrderedKeyValueBuffer
> and therefore the records in the buffer will be replicated to a topic in
> Kafka, but I am not completely sure. Could you elaborate on this in the
> KIP?
>
> 3.
> I agree with Matthias about dropping late records. We use grace periods
> in scenarios where records are grouped like in windowed aggregations
> and windowed joins. The stream buffer you propose does not really group
> any records. It rather delays records and reorders them. I am not sure
> if grace period is the right naming/concept to apply here. Instead of
> dropping records that fall outside of the buffer's time interval the
> join should skip the buffer and try to join the record immediately. In
> the end, a stream-table join is an unwindowed join, i.e., no grouping is
> applied to the records.
> What do you and other folks think about this proposal?
>
> 4.
> How does the proposed buffer affect processing latency? Could you
> please add some words about this to the KIP?
>
>
> Best,
> Bruno
>
>
>
>
> On 31.05.23 01:49, Walker Carlson wrote:
> > Thanks for all the additional comments. I will either address them here
> > or update the KIP accordingly.
> >
> >
> > I mentioned a follow-up KIP to add extra features before and in the
> > responses.
> > I will try to briefly summarize what options and optimizations I plan to
> > include. If a concern is not covered in this list, I for sure talk about
> > it below.
> >
> > * Allowing non versioned tables to still use the stream buffer
> > * Automatically materializing tables instead of forcing the user to do it
> > * Configurable for in memory buffer
> > * Order the records in offset order or in time order
> > * Non memory use buffer (offset order, delayed pull from stream.)
> > * Time synced between stream and table side (maybe)
> > * Do not drop late records and process them as they come in instead.
> >
> >
> > First, Victoria.
> >
> > 1) (One of your nits covers this, but you are correct that it doesn't make
> > sense, so I removed that part of the example.)
> > For those examples with the "bad" join results, I said that without
> > buffering the stream it would look like that, but that was incomplete. If
> > the lookup was simply looking at the latest version of the table when the
> > stream records came in, then those results were possible. If we are using
> > the point-in-time lookup that versioned tables allow, then you are correct
> > that the future results are not possible.
> >
> > 2) I'll get to this later as Matthias brought up something related.
> >
> > To your additional thoughts, I agree that we need to call those things
> out
> > in the documentation. I'm writing up a follow up kip with a lot of the
> > ideas we have discussed so that we can improve this feature beyond the
> base
> > implementation if it's needed.
> >
> > I addressed the nits in the KIP. I somehow missed the stream-table join
> > processor improvement; it makes your first question make a lot more
> > sense. Table history retention is a much cleaner way to describe it.
> >
> > As to your mention of syncing the time for the table and stream:
> > Matthias mentioned that as well, so I will address both here. I plan to
> > bring that up in the future, but for now we will leave it out. I suppose
> > it will be more useful once the table history retention is separable from
> > the table grace period.
> >
> >
> > To address Matthias comments.
> >
> > You are correct in saying the in-memory store shouldn't cause any
> > semantic concerns. My concern would be more with whether we limited the
> > number of records in the buffer and 

Re: [VOTE] 3.5.0 RC0

2023-06-01 Thread Mickael Maison
Hi David,

The PR you mentioned is merged now. Can I start working on a new RC or is
there more work needed? It seems the associated ticket is still open.

Thanks,
Mickael

On Wed, 31 May 2023, 23:52 Justine Olshan, 
wrote:

> Hey Mickael --
> This is done. Thanks!
>
> On Wed, May 31, 2023 at 11:24 AM Mickael Maison 
> wrote:
>
> > Hi Justine,
> >
> > Yes you can merge that into 3.5.
> >
> > Thanks,
> > Mickael
> >
> > On Wed, May 31, 2023 at 7:56 PM Justine Olshan
> >  wrote:
> > >
> > > FYI -- I just saw this PR regarding a dependency for ARM. We may want
> to
> > > get this in for 3.5 as well. It should be quick.
> > >
> > > https://issues.apache.org/jira/browse/KAFKA-15044
> > > https://github.com/apache/kafka/pull/13786
> > >
> > > Justine
> > >
> > > On Wed, May 31, 2023 at 9:28 AM David Arthur
> > >  wrote:
> > >
> > > > Mickael,
> > > >
> > > > Colin has approved my patch for KAFKA-15010, I'm just waiting on a
> > build
> > > > before merging. I'll go ahead and backport the other fixes that need
> to
> > > > precede this one into 3.5.
> > > >
> > > > -David
> > > >
> > > > On Wed, May 31, 2023 at 11:52 AM Mickael Maison <
> > mickael.mai...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The issue mentioned by Greg has been fixed. As soon as the fix for
> > > > > KAFKA-15010 is merged I'll build another RC.
> > > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Tue, May 30, 2023 at 10:33 AM Mickael Maison
> > > > >  wrote:
> > > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > Feel free to backport the necessary fixes to 3.5.
> > > > > >
> > > > > > Thanks,
> > > > > > Mickael
> > > > > >
> > > > > > On Tue, May 30, 2023 at 10:32 AM Mickael Maison
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi Greg,
> > > > > > >
> > > > > > > Thanks for the heads up, this indeed looks like something we
> > want in
> > > > > > > 3.5. I've replied in the PR.
> > > > > > >
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Sat, May 27, 2023 at 11:44 PM David Arthur
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Mickael, after looking more closely, I definitely think
> > KAFKA-15010
> > > > > is a
> > > > > > > > blocker. It creates the case where the controller can totally
> > miss
> > > > a
> > > > > > > > metadata update and not write it back to ZK. Since things
> like
> > > > > dynamic
> > > > > > > > configs and ACLs are only read from ZK by the ZK brokers, we
> > could
> > > > > have
> > > > > > > > significant problems while the brokers are being migrated
> (when
> > > > some
> > > > > are
> > > > > > > > KRaft and some are ZK). E.g., ZK brokers could be totally
> > unaware
> > > > of
> > > > > an ACL
> > > > > > > > change while the KRaft brokers have it. I have a fix ready
> here
> > > > > > > > https://github.com/apache/kafka/pull/13758. I think we can
> > get it
> > > > > committed
> > > > > > > > soon.
> > > > > > > >
> > > > > > > > Another blocker is KAFKA-15004 which was just merged to
> trunk.
> > This
> > > > > is
> > > > > > > > another dual-write bug where new topic/broker configs will
> not
> > be
> > > > > written
> > > > > > > > back to ZK by the controller.
> > > > > > > >
> > > > > > > > The fix for KAFKA-15010 has a few dependencies on fixes we
> made
> > > > this
> > > > > past
> > > > > > > > week, so we'll need to cherry-pick a few commits. The changes
> > are
> > > > > totally
> > > > > > > > contained within the migration area of code, so I think the
> > risk in
> > > > > > > > including them is fairly low.
> > > > > > > >
> > > > > > > > -David
> > > > > > > >
> > > > > > > > On Thu, May 25, 2023 at 2:15 PM Greg Harris
> > > > > 
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey all,
> > > > > > > > >
> > > > > > > > > A contributor just pointed out a small but noticeable flaw
> > in the
> > > > > > > > > implementation of KIP-581
> > > > > > > > >
> > > > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-581%3A+Value+of+optional+null+field+which+has+default+value
> > > > > > > > > which is planned for this release.
> > > > > > > > > Impact: the feature works for root values in a record, but
> > does
> > > > not
> > > > > > > > > work for any fields within structs. Fields within structs
> > will
> > > > > > > > > continue to have their previous, backwards-compatible
> > behavior.
> > > > > > > > > The contributor has submitted a bug-fix PR which reports
> the
> > > > > problem
> > > > > > > > > and does not yet have a merge-able solution, but they are
> > > > actively
> > > > > > > > > responding and interested in having this fixed:
> > > > > > > > > https://github.com/apache/kafka/pull/13748
> > > > > > > > > The overall fix should be a one-liner + some unit tests.
> > While
> > > > > this is
> > > > > > > > > not a regression, it does make the feature largely useless,
> > as
> > > > the
> > > > > > > > > majority of use-cases will be for struct fields.
> > > > > > > > >
> > > > > > > > > 

Re: [DISCUSS] Regarding Old PRs

2023-06-01 Thread Ismael Juma
This is worthwhile, but can we use a bot for this? Many other projects do
so already.

Ismael

On Thu, Jun 1, 2023 at 9:24 AM Josep Prat 
wrote:

> Hi Kafka devs,
>
> Seeing that we have over a 1000 PRs and that we want to try to keep the
> number low. I was thinking if it would make sense to go through the oldest
> PRs and ping the authors to see if they want/can to go back to them and
> finish them. If they don't or we can't get any answer, we could close the
> PRs after 2 to 3 weeks. We can create a label to mark these PRs in case we
> want to go back to them easily. If the author, however, states that they
> are still interested in bringing the PR to completion, we the committers
> should do our best to review them.
> As an example, we currently have 344 PRs that were created before January
> 1st 2020 (not included), to put this in perspective of Kafka versions,
> these happened before or at the time of 2.4.0.
> If this is something that people would generally agree to, we could
> organize a task force to go over these PRs created before January 1st 2020
> and ping the authors. I would volunteer to be part of this task.
>
> Best,
>
> --
> [image: Aiven] 
>
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io    |    >
>      <
> https://twitter.com/aiven_io>
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B
>


Re: [DISCUSS] Regarding Old PRs

2023-06-01 Thread Philip Nee
Hey Josep,

2-3 weeks is pretty optimistic, but anything over a year can probably be
closed if there's no response.
> If they don't or we can't get any answer, we could close the
> PRs after 2 to 3 weeks.

Regular contributors can't label their own PRs though - it requires a
committer/collaborator to mark them.
> We can create a label to mark these PRs in case we
> want to go back to them easily.

Thanks!
P



On Thu, Jun 1, 2023 at 12:24 PM Kirk True  wrote:

> Hi Josep,
>
> Sounds like an important task for the community & project.
>
> I would be happy to help in reaching out to the PR authors.
>
> Thanks,
> Kirk
>
> > On Jun 1, 2023, at 9:24 AM, Josep Prat 
> wrote:
> >
> > Hi Kafka devs,
> >
> > Seeing that we have over a 1000 PRs and that we want to try to keep the
> > number low. I was thinking if it would make sense to go through the
> oldest
> > PRs and ping the authors to see if they want/can to go back to them and
> > finish them. If they don't or we can't get any answer, we could close the
> > PRs after 2 to 3 weeks. We can create a label to mark these PRs in case
> we
> > want to go back to them easily. If the author, however, states that they
> > are still interested in bringing the PR to completion, we the committers
> > should do our best to review them.
> > As an example, we currently have 344 PRs that were created before January
> > 1st 2020 (not included), to put this in perspective of Kafka versions,
> > these happened before or at the time of 2.4.0.
> > If this is something that people would generally agree to, we could
> > organize a task force to go over these PRs created before January 1st
> 2020
> > and ping the authors. I would volunteer to be part of this task.
> >
> > Best,
> >
> > --
> > [image: Aiven] 
> >
> > *Josep Prat*
> > Open Source Engineering Director, *Aiven*
> > josep.p...@aiven.io   |   +491715557497
> > aiven.io    |   <
> https://www.facebook.com/aivencloud>
> >     <
> https://twitter.com/aiven_io>
> > *Aiven Deutschland GmbH*
> > Alexanderufer 3-7, 10117 Berlin
> > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > Amtsgericht Charlottenburg, HRB 209739 B
>
>


Re: [DISCUSS] Regarding Old PRs

2023-06-01 Thread Kirk True
Hi Josep,

Sounds like an important task for the community & project.

I would be happy to help in reaching out to the PR authors.

Thanks,
Kirk

> On Jun 1, 2023, at 9:24 AM, Josep Prat  wrote:
> 
> Hi Kafka devs,
> 
> Seeing that we have over a 1000 PRs and that we want to try to keep the
> number low. I was thinking if it would make sense to go through the oldest
> PRs and ping the authors to see if they want/can to go back to them and
> finish them. If they don't or we can't get any answer, we could close the
> PRs after 2 to 3 weeks. We can create a label to mark these PRs in case we
> want to go back to them easily. If the author, however, states that they
> are still interested in bringing the PR to completion, we the committers
> should do our best to review them.
> As an example, we currently have 344 PRs that were created before January
> 1st 2020 (not included), to put this in perspective of Kafka versions,
> these happened before or at the time of 2.4.0.
> If this is something that people would generally agree to, we could
> organize a task force to go over these PRs created before January 1st 2020
> and ping the authors. I would volunteer to be part of this task.
> 
> Best,
> 
> -- 
> [image: Aiven] 
> 
> *Josep Prat*
> Open Source Engineering Director, *Aiven*
> josep.p...@aiven.io   |   +491715557497
> aiven.io    |   
>     
> *Aiven Deutschland GmbH*
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B



Re: [DISCUSS] KIP-925: rack aware task assignment in Kafka Streams

2023-06-01 Thread Hao Li
Hi Colt,

Thanks for the feedback.

> most deployments have three racks, RF=3, and one replica for
each partition in each rack

This KIP is mainly targeting the case where the client won't always be in
the same rack as any replica. There are some proposals to make RF=2 and use
tiered storage as the backup. If all TopicPartitions have replicas in the
same rack as the clients, this KIP will not take effect and rack-aware
assignment will be turned off.

As for setting the leader of replicas, I think this is something clients
don't have control over. Also as you mentioned, if there are multiple
clients, there's no "set the leader and fit all" solution.
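
For anyone following along, here is a minimal sketch of the client-side setup
this KIP builds on. The client.rack consumer config is an existing setting;
the rack.aware.assignment.strategy name and its value below are assumptions
until the KIP text is finalized:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class RackAwareAssignmentConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rack-aware-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        // Existing consumer config: advertises which rack/AZ this instance
        // runs in, forwarded to the embedded consumers via the consumer prefix.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.CLIENT_RACK_CONFIG),
            "use1-az1");

        // Proposed by KIP-925; the config name and value are assumptions until
        // the KIP is accepted.
        props.put("rack.aware.assignment.strategy", "min_traffic");
    }
}

If every involved partition already has a replica in the same rack as the
client, the assignor would leave the assignment unchanged, as described above.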


On Thu, Jun 1, 2023 at 10:20 AM Colt McNealy  wrote:

> Hi all,
>
> I've got a rather naive question here. The spiritual goal of this KIP is to
> reduce cross-rack traffic (normally, this manifests itself in terms of a
> higher AWS/Azure bill as cloud providers charge for cross-AZ traffic).
>
> To generalize, most deployments have three racks, RF=3, and one replica for
> each partition in each rack. Therefore, in the steady state (absent any
> cluster anomalies such as broker failure, etc) we are pretty confident that
> there should be a replica for every partition (input, changelog,
> repartition, output topic) on the same rack as a given Streams instance.
>
> Why not just let Sophie's High-Availability Task Assignor do its thing, and
> then *set the preferred leader* for each replica to a broker in the same
> rack/AZ as the Active Task? This would solve two problems:
>
> 1. The current KIP can't make any improvements in the case where a Task has
> three involved partitions (eg. input, changelog, output) and the leader for
> each partition is in a different rack. With this approach, we could get
> pretty close to having zero cross-AZ traffic in a healthy cluster.
> 2. There needs to be a lot of work done to balance availability, data
> movement, and cross-AZ traffic in the current proposal. My proposal doesn't
> actually involve any additional data movement; simply reassignment of
> partition leadership.
>
> The biggest argument against this proposal is that there could be two
> Streams apps using the same topic, which would cause some bickering.
> Secondly, some have observed that changing partition leadership can trigger
> ProducerFencedExceptions in EOS, which causes a state restoration.
>
> Colt McNealy
>
> *Founder, LittleHorse.dev*
>
>
> On Thu, Jun 1, 2023 at 10:02 AM Hao Li  wrote:
>
> > Hi Bruno,
> >
> > dropping config rack.aware.assignment.enabled
> > and add value NONE to the enum for the possible values of config
> > rack.aware.assignment.strategy sounds good to me.
> >
> > On Thu, Jun 1, 2023 at 12:39 AM Bruno Cadonna 
> wrote:
> >
> > > Hi Hao,
> > >
> > > Thanks for the updates!
> > >
> > > What do you think about dropping config rack.aware.assignment.enabled
> > > and add value NONE to the enum for the possible values of config
> > > rack.aware.assignment.strategy?
> > >
> > > Best,
> > > Bruno
> > >
> > > On 31.05.23 23:31, Hao Li wrote:
> > > > Hi all,
> > > >
> > > > I've updated the KIP based on the feedback. Major changes I made:
> > > > 1. Add rack aware assignment to `StickyTaskAssignor`
> > > > 2. Reject `Prefer reliability and then find optimal cost` option in
> > > standby
> > > > task assignment.
> > > >
> > > >
> > > > On Wed, May 31, 2023 at 12:09 PM Hao Li  wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> Thanks for the feedback! I will update the KIP accordingly.
> > > >>
> > > >> *For Sophie's comments:*
> > > >>
> > > >> 1 and 2. Good catch. Fixed these.
> > > >>
> > > >> 3 and 4 Yes. We can make this public config and call out the
> > > >> clientConsumer config users need to set.
> > > >>
> > > >> 5. It's ideal to take the previous assignment in HAAssignor into
> > > >> consideration when we compute our target assignment, the
> complications
> > > come
> > > >> with making sure the assignment can eventually converge and we don't
> > do
> > > >> probing rebalance infinitely. It's not only about storing the
> previous
> > > >> assignment or get it somehow. We can actually get the previous
> > > assignment
> > > >> now like we do in StickyAssignor. But the previous assignment will
> > > change
> > > >> in each round of probing rebalance. The proposal which added some
> > > weight to
> > > >> make the rack aware assignment lean towards the original HAA's
> target
> > > >> assignment will add benefits of stability in some corner cases in
> case
> > > of
> > > >> tie in cross rack traffic cost. But it's not sticky. But the bottom
> > > line is
> > > >> it won't be worse than current HAA's stickiness.
> > > >>
> > > >> 6. I'm fine with changing the assignor config to public. Actually, I
> > > think
> > > >> we can min-cost algorithm with StickyAssignor as well to mitigate
> the
> > > >> problem of 5. So we can have one public config to choose an assignor
> > and
> > > >> one public config to enable the rack aware 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #13

2023-06-01 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 472017 lines...]
[2023-06-01T18:18:26.112Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[1]
 STARTED
[2023-06-01T18:18:26.112Z] 
[2023-06-01T18:18:26.112Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[1]
 PASSED
[2023-06-01T18:18:26.112Z] 
[2023-06-01T18:18:26.112Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[2]
 STARTED
[2023-06-01T18:18:26.112Z] 
[2023-06-01T18:18:26.112Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[2]
 PASSED
[2023-06-01T18:18:26.112Z] 
[2023-06-01T18:18:26.112Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldSuccessfullyStartWhenLoggingDisabled(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldSuccessfullyStartWhenLoggingDisabled(boolean)[1]
 STARTED
[2023-06-01T18:18:28.291Z] 
[2023-06-01T18:18:28.291Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldSuccessfullyStartWhenLoggingDisabled(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldSuccessfullyStartWhenLoggingDisabled(boolean)[1]
 PASSED
[2023-06-01T18:18:28.291Z] 
[2023-06-01T18:18:28.291Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldSuccessfullyStartWhenLoggingDisabled(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldSuccessfullyStartWhenLoggingDisabled(boolean)[2]
 STARTED
[2023-06-01T18:18:31.189Z] 
[2023-06-01T18:18:31.189Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldSuccessfullyStartWhenLoggingDisabled(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldSuccessfullyStartWhenLoggingDisabled(boolean)[2]
 PASSED
[2023-06-01T18:18:31.189Z] 
[2023-06-01T18:18:31.189Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromChangelogTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromChangelogTopic(boolean)[1]
 STARTED
[2023-06-01T18:18:34.441Z] 
[2023-06-01T18:18:34.441Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromChangelogTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromChangelogTopic(boolean)[1]
 PASSED
[2023-06-01T18:18:34.441Z] 
[2023-06-01T18:18:34.441Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromChangelogTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromChangelogTopic(boolean)[2]
 STARTED
[2023-06-01T18:18:38.261Z] 
[2023-06-01T18:18:38.261Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > shouldRestoreNullRecord() PASSED
[2023-06-01T18:18:38.261Z] 
[2023-06-01T18:18:38.261Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[1]
 STARTED
[2023-06-01T18:18:39.741Z] 
[2023-06-01T18:18:39.741Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[1]
 PASSED
[2023-06-01T18:18:39.741Z] 
[2023-06-01T18:18:39.741Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromSourceTopic(boolean) > 
org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRestoreStateFromSourceTopic(boolean)[2]
 STARTED
[2023-06-01T18:18:40.442Z] 
[2023-06-01T18:18:40.443Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 184 > RestoreIntegrationTest > 
shouldRestoreStateFromChangelogTopic(boolean) > 

Re: [DISCUSS] KIP-925: rack aware task assignment in Kafka Streams

2023-06-01 Thread Colt McNealy
Hi all,

I've got a rather naive question here. The spiritual goal of this KIP is to
reduce cross-rack traffic (normally, this manifests itself in terms of a
higher AWS/Azure bill as cloud providers charge for cross-AZ traffic).

To generalize, most deployments have three racks, RF=3, and one replica for
each partition in each rack. Therefore, in the steady state (absent any
cluster anomalies such as broker failure, etc) we are pretty confident that
there should be a replica for every partition (input, changelog,
repartition, output topic) on the same rack as a given Streams instance.

Why not just let Sophie's High-Availability Task Assignor do its thing, and
then *set the preferred leader* for each replica to a broker in the same
rack/AZ as the Active Task? This would solve two problems:

1. The current KIP can't make any improvements in the case where a Task has
three involved partitions (eg. input, changelog, output) and the leader for
each partition is in a different rack. With this approach, we could get
pretty close to having zero cross-AZ traffic in a healthy cluster.
2. There needs to be a lot of work done to balance availability, data
movement, and cross-AZ traffic in the current proposal. My proposal doesn't
actually involve any additional data movement; simply reassignment of
partition leadership.
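
For concreteness, the leadership steering described above is already
expressible with the Admin client; here is a rough sketch, assuming a
three-replica partition where broker 3 sits in the same AZ as the active task
(the topic name and broker ids are made up for illustration):

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferredLeaderSteeringSketch {
    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        try (Admin admin = Admin.create(conf)) {
            TopicPartition tp = new TopicPartition("orders", 0);

            // Reorder the replica list so broker 3 becomes the preferred leader
            // (no data movement if 3 is already a replica, only a new ordering).
            admin.alterPartitionReassignments(Map.of(
                    tp, Optional.of(new NewPartitionReassignment(List.of(3, 1, 2)))))
                .all().get();

            // Ask the controller to move leadership to the preferred replica.
            admin.electLeaders(ElectionType.PREFERRED, Set.of(tp)).partitions().get();
        }
    }
}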

The biggest argument against this proposal is that there could be two
Streams apps using the same topic, which would cause some bickering.
Secondly, some have observed that changing partition leadership can trigger
ProducerFencedExceptions in EOS, which causes a state restoration.

Colt McNealy

*Founder, LittleHorse.dev*


On Thu, Jun 1, 2023 at 10:02 AM Hao Li  wrote:

> Hi Bruno,
>
> dropping config rack.aware.assignment.enabled
> and add value NONE to the enum for the possible values of config
> rack.aware.assignment.strategy sounds good to me.
>
> On Thu, Jun 1, 2023 at 12:39 AM Bruno Cadonna  wrote:
>
> > Hi Hao,
> >
> > Thanks for the updates!
> >
> > What do you think about dropping config rack.aware.assignment.enabled
> > and add value NONE to the enum for the possible values of config
> > rack.aware.assignment.strategy?
> >
> > Best,
> > Bruno
> >
> > On 31.05.23 23:31, Hao Li wrote:
> > > Hi all,
> > >
> > > I've updated the KIP based on the feedback. Major changes I made:
> > > 1. Add rack aware assignment to `StickyTaskAssignor`
> > > 2. Reject `Prefer reliability and then find optimal cost` option in
> > standby
> > > task assignment.
> > >
> > >
> > > On Wed, May 31, 2023 at 12:09 PM Hao Li  wrote:
> > >
> > >> Hi all,
> > >>
> > >> Thanks for the feedback! I will update the KIP accordingly.
> > >>
> > >> *For Sophie's comments:*
> > >>
> > >> 1 and 2. Good catch. Fixed these.
> > >>
> > >> 3 and 4 Yes. We can make this public config and call out the
> > >> clientConsumer config users need to set.
> > >>
> > >> 5. It's ideal to take the previous assignment in HAAssignor into
> > >> consideration when we compute our target assignment, the complications
> > come
> > >> with making sure the assignment can eventually converge and we don't
> do
> > >> probing rebalance infinitely. It's not only about storing the previous
> > >> assignment or get it somehow. We can actually get the previous
> > assignment
> > >> now like we do in StickyAssignor. But the previous assignment will
> > change
> > >> in each round of probing rebalance. The proposal which added some
> > weight to
> > >> make the rack aware assignment lean towards the original HAA's target
> > >> assignment will add benefits of stability in some corner cases in case
> > of
> > >> tie in cross rack traffic cost. But it's not sticky. But the bottom
> > line is
> > >> it won't be worse than current HAA's stickiness.
> > >>
> > >> 6. I'm fine with changing the assignor config to public. Actually, I
> > think
> > >> we can min-cost algorithm with StickyAssignor as well to mitigate the
> > >> problem of 5. So we can have one public config to choose an assignor
> and
> > >> one public config to enable the rack aware assignment.
> > >>
> > >> *For Bruno's comments:*
> > >>
> > >> The proposal was to implement all the options and use configs to
> choose
> > >> them during runtime. We can make those configs public as suggested.
> > >> 1, 2, 3, 4, 5: agree and will fix those.
> > >> 6: subscription protocol is not changed.
> > >> 7: yeah. Let me fix the notations.
> > >> 8: It meant clients. In the figure, it maps to `c1_1`, `c1_2`, `c1_3`
> > etc.
> > >> 9: I'm also ok with just optimizing reliability for standby tasks. Or
> we
> > >> could simply run the "balance reliability over cost" greedy algorithm
> to
> > >> see if any cost could be reduced.
> > >> 10: Make sense. Will fix the wording.
> > >> 11: Make sense. Will update the test part.
> > >>
> > >> *For Walker's comments:*
> > >> 1. Stability for HAA is an issue. See my comments for Sophie's
> feedback
> > 5
> > >> and 6. I think we could use the rack aware assignment for
> > 

Re: [DISCUSS] KIP-925: rack aware task assignment in Kafka Streams

2023-06-01 Thread Hao Li
Hi Bruno,

Dropping the config rack.aware.assignment.enabled and adding the value NONE
to the enum of possible values for the config rack.aware.assignment.strategy
sounds good to me.

On Thu, Jun 1, 2023 at 12:39 AM Bruno Cadonna  wrote:

> Hi Hao,
>
> Thanks for the updates!
>
> What do you think about dropping config rack.aware.assignment.enabled
> and add value NONE to the enum for the possible values of config
> rack.aware.assignment.strategy?
>
> Best,
> Bruno
>
> On 31.05.23 23:31, Hao Li wrote:
> > Hi all,
> >
> > I've updated the KIP based on the feedback. Major changes I made:
> > 1. Add rack aware assignment to `StickyTaskAssignor`
> > 2. Reject `Prefer reliability and then find optimal cost` option in
> standby
> > task assignment.
> >
> >
> > On Wed, May 31, 2023 at 12:09 PM Hao Li  wrote:
> >
> >> Hi all,
> >>
> >> Thanks for the feedback! I will update the KIP accordingly.
> >>
> >> *For Sophie's comments:*
> >>
> >> 1 and 2. Good catch. Fixed these.
> >>
> >> 3 and 4 Yes. We can make this public config and call out the
> >> clientConsumer config users need to set.
> >>
> >> 5. It's ideal to take the previous assignment in HAAssignor into
> >> consideration when we compute our target assignment, the complications
> come
> >> with making sure the assignment can eventually converge and we don't do
> >> probing rebalance infinitely. It's not only about storing the previous
> >> assignment or get it somehow. We can actually get the previous
> assignment
> >> now like we do in StickyAssignor. But the previous assignment will
> change
> >> in each round of probing rebalance. The proposal which added some
> weight to
> >> make the rack aware assignment lean towards the original HAA's target
> >> assignment will add benefits of stability in some corner cases in case
> of
> >> tie in cross rack traffic cost. But it's not sticky. But the bottom
> line is
> >> it won't be worse than current HAA's stickiness.
> >>
> >> 6. I'm fine with changing the assignor config to public. Actually, I
> think
> >> we can min-cost algorithm with StickyAssignor as well to mitigate the
> >> problem of 5. So we can have one public config to choose an assignor and
> >> one public config to enable the rack aware assignment.
> >>
> >> *For Bruno's comments:*
> >>
> >> The proposal was to implement all the options and use configs to choose
> >> them during runtime. We can make those configs public as suggested.
> >> 1, 2, 3, 4, 5: agree and will fix those.
> >> 6: subscription protocol is not changed.
> >> 7: yeah. Let me fix the notations.
> >> 8: It meant clients. In the figure, it maps to `c1_1`, `c1_2`, `c1_3`
> etc.
> >> 9: I'm also ok with just optimizing reliability for standby tasks. Or we
> >> could simply run the "balance reliability over cost" greedy algorithm to
> >> see if any cost could be reduced.
> >> 10: Make sense. Will fix the wording.
> >> 11: Make sense. Will update the test part.
> >>
> >> *For Walker's comments:*
> >> 1. Stability for HAA is an issue. See my comments for Sophie's feedback
> 5
> >> and 6. I think we could use the rack aware assignment for
> StickyAssignor as
> >> well. For HAA assignments, it's less sticky and we can only shoot for
> >> minimizing the cross rack traffic eventually when everything is stable.
> >> 2. Yeah. This is a good point and we can also turn it on for
> >> StickyAssignor.
> >>
> >> Thanks,
> >> Hao
> >>
> >>
> >> On Tue, May 30, 2023 at 2:28 PM Sophie Blee-Goldman <
> >> ableegold...@gmail.com> wrote:
> >>
> >>> Hey Hao, thanks for the KIP!
> >>>
> >>> 1. There's a typo in the "internal.rack.aware.assignment.strategry"
> >>> config,
> >>> this
> >>> should be internal.rack.aware.assignment.strategy.
> >>>
> >>> 2.
> >>>
>    For O(E^2 * (CU)) complexity, C and U can be viewed as constant.
> >>> Number of
>  edges E is T * N where T is the number of clients and N is the number
> of
>  Tasks. This is because a task can be assigned to any client so there
> >>> will
>  be an edge between every task and every client. The total complexity
> >>> would
>  be O(T * N) if we want to be more specific.
> >>>
> >>> I feel like I'm missing something here, but if E = T * N and the
> >>> complexity
> >>> is ~O(E^2), doesn't
> >>> this make the total complexity order of O(T^2 * N^2)?
> >>>
> >>> 3.
> >>>
>  Since 3.C.I and 3.C.II have different tradeoffs and work better in
>  different workloads etc, we
> >>>
> >>> could add an internal configuration to choose one of them at runtime.
> 
> >>> Why only an internal configuration? Same goes for
> >>> internal.rack.aware.assignment.standby.strategry (which also has the
> typo)
> >>>
> >>> 4.
> >>>
>    There are no changes in public interfaces.
> >>>
> >>> I think it would be good to explicitly call out that users can utilize
> >>> this
> >>> new feature by setting the
> >>> ConsumerConfig's CLIENT_RACK_CONFIG, possibly with a brief example
> >>>
> >>> 5.
> >>>
>  The idea is that if we always try to make it 

[DISCUSS] Regarding Old PRs

2023-06-01 Thread Josep Prat
Hi Kafka devs,

Seeing that we have over 1000 PRs and that we want to keep that number low,
I was thinking it would make sense to go through the oldest PRs and ping the
authors to see whether they want to (and can) go back to them and finish
them. If they don't, or we can't get any answer, we could close the PRs
after 2 to 3 weeks. We can create a label to mark these PRs in case we want
to go back to them easily. If the author, however, states that they are
still interested in bringing the PR to completion, we the committers should
do our best to review them.
As an example, we currently have 344 PRs that were created before January
1st, 2020 (not included); to put this in perspective of Kafka versions,
these happened before or at the time of 2.4.0.
If this is something that people would generally agree to, we could
organize a task force to go over these PRs created before January 1st 2020
and ping the authors. I would volunteer to be part of this task.

Best,

-- 
*Josep Prat*
Open Source Engineering Director, *Aiven*
josep.p...@aiven.io   |   +491715557497
aiven.io
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B


[GitHub] [kafka-site] mimaison opened a new pull request, #518: MINOR: Update upgrade to 3.5 section

2023-06-01 Thread via GitHub


mimaison opened a new pull request, #518:
URL: https://github.com/apache/kafka-site/pull/518

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (KAFKA-15017) New ClientQuotas are not written to ZK from snapshot

2023-06-01 Thread Proven Provenzano (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Proven Provenzano resolved KAFKA-15017.
---
Resolution: Fixed

> New ClientQuotas are not written to ZK from snapshot 
> -
>
> Key: KAFKA-15017
> URL: https://issues.apache.org/jira/browse/KAFKA-15017
> Project: Kafka
>  Issue Type: Bug
>  Components: kraft
>Affects Versions: 3.5.0
>Reporter: David Arthur
>Assignee: Proven Provenzano
>Priority: Critical
>
> Similar issue to KAFKA-15009



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.5 #12

2023-06-01 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 467270 lines...]
[2023-06-01T12:36:51.459Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificActivePartitionStores() PASSED
[2023-06-01T12:36:51.459Z] 
[2023-06-01T12:36:51.459Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldFailWithIllegalArgumentExceptionWhenIQPartitionerReturnsMultiplePartitions()
 STARTED
[2023-06-01T12:36:54.988Z] 
[2023-06-01T12:36:54.988Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldFailWithIllegalArgumentExceptionWhenIQPartitionerReturnsMultiplePartitions()
 PASSED
[2023-06-01T12:36:54.988Z] 
[2023-06-01T12:36:54.988Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQueryAllStalePartitionStores() STARTED
[2023-06-01T12:37:01.148Z] 
[2023-06-01T12:37:01.148Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQueryAllStalePartitionStores() PASSED
[2023-06-01T12:37:01.148Z] 
[2023-06-01T12:37:01.148Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreads() STARTED
[2023-06-01T12:37:05.235Z] 
[2023-06-01T12:37:05.235Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreads() PASSED
[2023-06-01T12:37:05.235Z] 
[2023-06-01T12:37:05.235Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStores() STARTED
[2023-06-01T12:37:10.762Z] 
[2023-06-01T12:37:10.762Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStores() PASSED
[2023-06-01T12:37:10.762Z] 
[2023-06-01T12:37:10.762Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQueryOnlyActivePartitionStoresByDefault() STARTED
[2023-06-01T12:37:15.685Z] 
[2023-06-01T12:37:15.685Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQueryOnlyActivePartitionStoresByDefault() PASSED
[2023-06-01T12:37:15.685Z] 
[2023-06-01T12:37:15.685Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQueryStoresAfterAddingAndRemovingStreamThread() STARTED
[2023-06-01T12:37:27.413Z] 
[2023-06-01T12:37:27.413Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQueryStoresAfterAddingAndRemovingStreamThread() PASSED
[2023-06-01T12:37:27.413Z] 
[2023-06-01T12:37:27.413Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreadsNamedTopology() STARTED
[2023-06-01T12:37:31.761Z] 
[2023-06-01T12:37:31.761Z] Gradle Test Run :streams:integrationTest > Gradle 
Test Executor 177 > StoreQueryIntegrationTest > 
shouldQuerySpecificStalePartitionStoresMultiStreamThreadsNamedTopology() PASSED
[2023-06-01T12:37:34.267Z] streams-2: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-0: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-4: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-6: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-3: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-4: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-1: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-5: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-3: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-2: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-1: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:34.267Z] streams-0: SMOKE-TEST-CLIENT-CLOSED
[2023-06-01T12:37:41.574Z] 
[2023-06-01T12:37:41.574Z] Deprecated Gradle features were used in this build, 
making it incompatible with Gradle 9.0.
[2023-06-01T12:37:41.574Z] 
[2023-06-01T12:37:41.574Z] You can use '--warning-mode all' to show the 
individual deprecation warnings and determine if they come from your own 
scripts or plugins.
[2023-06-01T12:37:41.574Z] 
[2023-06-01T12:37:41.574Z] See 
https://docs.gradle.org/8.0.2/userguide/command_line_interface.html#sec:command_line_warnings
[2023-06-01T12:37:41.574Z] 
[2023-06-01T12:37:41.574Z] BUILD SUCCESSFUL in 2h 52m 51s
[2023-06-01T12:37:41.574Z] 230 actionable tasks: 124 executed, 106 up-to-date
[2023-06-01T12:37:41.574Z] 
[2023-06-01T12:37:41.574Z] See the profiling report at: 

[jira] [Created] (KAFKA-15047) Handle rolling segments when the active segment's retention is breached in case tiered storage is enabled.

2023-06-01 Thread Satish Duggana (Jira)
Satish Duggana created KAFKA-15047:
--

 Summary: Handle rolling segments when the active segment's 
retention is breached in case tiered storage is enabled.
 Key: KAFKA-15047
 URL: https://issues.apache.org/jira/browse/KAFKA-15047
 Project: Kafka
  Issue Type: Sub-task
Reporter: Satish Duggana
Assignee: Kamal Chandraprakash






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-932: Queues for Kafka

2023-06-01 Thread Dániel Urbán
Hi Andrew,

Thank you for the KIP, exciting work you are doing :)
I have 2 questions:
1. I understand that EOS won't be supported for share-groups (yet), but
read_committed fetch will still be possible, correct?

2. I have a very broad question about the proposed solution: why not let
the share-group coordinator manage the states of the in-flight records?
I'm asking this because it seems to me that using the same pattern as the
existing group coordinator would
a, solve the durability of the message state storage (same method as the
one used by the current group coordinator)

b, pave the way for EOS with share-groups (same method as the one used by
the current group coordinator)

c, allow follower-fetching
I saw your point about this: "FFF gives freedom to fetch records from a
nearby broker, but it does not also give the ability to commit offsets to a
nearby broker"
But does it matter if message acknowledgement is not "local"? Supposedly,
fetching is the actual hard work which benefits from follower fetching, not
the group related requests.

The only problem I see with the share-group coordinator managing the
in-flight message state is that the coordinator is not aware of the exact
available offsets of a partition, nor how the messages are batched. For
this problem, maybe the share group coordinator could use some form of
"logical" addresses, such as "the next 2 batches after offset X", or "after
offset X, skip 2 batches, fetch next 2". Acknowledgements always contain
the exact offset, but for the "unknown" sections of a partition, these
logical addresses would be used. The coordinator could keep track of
message states with a mix of offsets and these batch based addresses. The
partition leader could support "skip X, fetch Y batches" fetch requests.
This solution would need changes in the Fetch API to allow such batch based
addresses, but I assume that fetch protocol changes will be needed
regardless of the specific solution.
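
To make the idea concrete, here is a rough, purely hypothetical sketch
(the type and field names are mine and are not part of KIP-932) of how
such mixed exact/logical addresses could be modeled:

    // Hypothetical illustration of the "logical address" idea above; none of
    // these types exist in Kafka or in KIP-932.
    public class LogicalAddresses {
        interface Address {}

        // Exact record offset, used once a record has been fetched/acknowledged.
        record ExactOffset(long offset) implements Address {}

        // "After offset X, skip S batches, then fetch the next N batches" --
        // used for sections of the partition the coordinator has not seen yet.
        record BatchRange(long afterOffset, int skipBatches, int fetchBatches) implements Address {}

        public static void main(String[] args) {
            Address acked = new ExactOffset(1042L);
            Address pending = new BatchRange(1042L, 2, 2); // after 1042, skip 2 batches, fetch next 2
            System.out.println(acked + " / " + pending);
        }
    }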

Thanks,
Daniel

Andrew Schofield  wrote (on Tue, 30 May 2023 at 18:15):

> Yes, that’s it. I imagine something similar to KIP-848 for managing the
> share group
> membership, and consumers that fetch records from their assigned
> partitions and
> acknowledge when delivery completes.
>
> Thanks,
> Andrew
>
> > On 30 May 2023, at 16:52, Adam Warski  wrote:
> >
> > Thanks for the explanation!
> >
> > So effectively, a share group is subscribed to each partition - but the
> data is not pushed to the consumer, but only sent on demand. And when
> demand is signalled, a batch of messages is sent?
> > Hence it would be up to the consumer to prefetch a sufficient number of
> batches to ensure, that it will never be "bored"?
> >
> > Adam
> >
> >> On 30 May 2023, at 15:25, Andrew Schofield 
> wrote:
> >>
> >> Hi Adam,
> >> Thanks for your question.
> >>
> >> With a share group, each fetch is able to grab available records from
> any partition. So, it alleviates
> >> the “head-of-line” blocking problem where a slow consumer gets in the
> way. There’s no actual
> >> stealing from a slow consumer, but it can be overtaken and must
> complete its processing within
> >> the timeout.
> >>
> >> The way I see this working is that when a consumer joins a share group,
> it receives a set of
> >> assigned share-partitions. To start with, every consumer will be
> assigned all partitions. We
> >> can be smarter than that, but I think that’s really a question of
> writing a smarter assignor
> >> just as has occurred over the years with consumer groups.
> >>
> >> Only a small proportion of Kafka workloads are super high throughput.
> Share groups would
> >> struggle with those I’m sure. Share groups do not diminish the value of
> consumer groups
> >> for streaming. They just give another option for situations where a
> different style of
> >> consumption is more appropriate.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 29 May 2023, at 17:18, Adam Warski  wrote:
> >>>
> >>> Hello,
> >>>
> >>> thank you for the proposal! A very interesting read.
> >>>
> >>> I do have one question, though. When you subscribe to a topic using
> consumer groups, it might happen that one consumer has processed all
> messages from its partitions, while another one still has a lot of work to
> do (this might be due to unbalanced partitioning, long processing times
> etc.). In a message-queue approach, it would be great to solve this problem
> - so that a consumer that is free can steal work from other consumers. Is
> this somehow covered by share groups?
> >>>
> >>> Maybe this is planned as "further work", as indicated here:
> >>>
> >>> "
> >>> It manages the topic-partition assignments for the share-group
> members. An initial, trivial implementation would be to give each member
> the list of all topic-partitions which matches its subscriptions and then
> use the pull-based protocol to fetch records from all partitions. A more
> sophisticated implementation could use topic-partition load and lag metrics
> to distribute 

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

2023-06-01 Thread Bruno Cadonna

Hi Walker,

thanks for the KIP!

Here my feedback:

1.
It is still not clear to me when stream time for the buffer advances. 
What is the event that lets the stream time advance? In the discussion, I 
do not understand what you mean by "The segment store already has an 
observed stream time, we advance based on that. That should only advance 
based on records that enter the store." Where does this segment store 
come from? Anyway, I think it would be great to also state how stream 
time advances in the KIP.


2.
How does the buffer behave in case of a failure? I think I understand 
that the buffer will use an implementation of TimeOrderedKeyValueBuffer 
and therefore the records in the buffer will be replicated to a topic in 
Kafka, but I am not completely sure. Could you elaborate on this in the KIP?


3.
I agree with Matthias about dropping late records. We use grace periods 
in scenarios where records are grouped, like in windowed aggregations 
and windowed joins. The stream buffer you propose does not really group 
any records; it rather delays records and reorders them. I am not sure 
grace period is the right naming/concept to apply here. Instead of 
dropping records that fall outside of the buffer's time interval, the 
join should skip the buffer and try to join the record immediately. In 
the end, a stream-table join is an unwindowed join, i.e., no grouping is 
applied to the records.
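
To illustrate, here is a minimal sketch of the alternative I have in mind 
(the names below are placeholders, not taken from the KIP):

    // Sketch only: a late record bypasses the buffer and is joined right away,
    // instead of being dropped.
    import java.util.function.Consumer;

    public class LateRecordHandling {
        interface Record { long timestamp(); }

        static void handle(Record record, long observedStreamTime, long gracePeriodMs,
                           Consumer<Record> joinImmediately, Consumer<Record> buffer) {
            if (record.timestamp() < observedStreamTime - gracePeriodMs) {
                joinImmediately.accept(record); // late: skip the buffer, join against the current table
            } else {
                buffer.accept(record);          // within grace: delay/reorder in the buffer
            }
        }
    }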

What do you and other folks think about this proposal?

4.
How does the proposed buffer affect processing latency? Could you 
please add some words about this to the KIP?



Best,
Bruno




On 31.05.23 01:49, Walker Carlson wrote:

Thanks for all the additional comments. I will either address them here or
update the kip accordingly.


I mentioned a follow-up KIP to add extra features, both earlier and in the
responses below. I will try to briefly summarize which options and
optimizations I plan to include. If a concern is not covered in this list,
I definitely address it below.

* Allowing non versioned tables to still use the stream buffer
* Automatically materializing tables instead of forcing the user to do it
* Configurable for in memory buffer
* Order the records in offset order or in time order
* Non memory use buffer (offset order, delayed pull from stream.)
* Time synced between stream and table side (maybe)
* Do not drop late records and process them as they come in instead.


First, Victoria.

1) (One of your nits covers this, but you are correct it doesn't make
sense, so I removed that part of the example.)
For those examples with the "bad" join results, I said that without buffering
the stream it would look like that, but that was incomplete. If the lookup
simply used the latest version of the table when the stream records
came in, then those results were possible. If we use the point-in-time
lookup that versioned tables give us, then you are correct that the future
results are not possible.

2) I'll get to this later as Matthias brought up something related.

To your additional thoughts, I agree that we need to call those things out
in the documentation. I'm writing up a follow up kip with a lot of the
ideas we have discussed so that we can improve this feature beyond the base
implementation if it's needed.

I addressed the nits in the KIP. I somehow missed the stream table
join processor improvement; it makes your first question make a lot more
sense. Table history retention is a much cleaner way to describe it.

As to your mention of the syncing the time for the table and stream.
Matthias mentioned that as well. I will address both here. I plan to bring
that up in the future, but for now we will leave it out. I suppose it will
be more useful after the table history retention is separable from the
table grace period.


To address Matthias comments.

You are correct in saying the in-memory store shouldn't cause any semantic
concerns. My concern is more about what happens if we limit the number of
records in the buffer and then hit that limit (emitting those
records might be an issue; throwing an error and halting would not). I
think we can leave this discussion to the follow-up KIP along with a few
other options.

I will go through your proposals now.

   - don't support non-versioned KTables

Sure, we can always expand this later on. Will include as part of the of
the improvement kip

   - if a grace period is added, users need to explicitly materialize the
table as versioned (either directly, or upstream. Upstream only works if
downstream tables "inherit" versioned semantics -- cf KIP-914)

again, that works for me for now, if we find a use we can always add later.

   - the table's history retention time must be larger than the grace
period (should be easy to check at runtime, when we build the topology)

agreed

   - because switching from non-versioned to versioned stores is not
backward compatible (cf KIP-914), users need to take care of this
themselves, and this also implies that adding 

Re: [DISCUSS] KIP-852 Optimize calculation of size for log in remote tier

2023-06-01 Thread Divij Vaidya
Satish / Jun

Do you have any thoughts on this?

--
Divij Vaidya



On Tue, Feb 14, 2023 at 4:15 PM Divij Vaidya 
wrote:

> Hey Jun
>
> It has been a while since this KIP got some attention. While we wait for
> Satish to chime in here, perhaps I can answer your question.
>
> > Could you explain how you exposed the log size in your KIP-405
> implementation?
>
> The APIs available in RLMM as per KIP-405
> are addRemoteLogSegmentMetadata(), updateRemoteLogSegmentMetadata(),
> remoteLogSegmentMetadata(), highestOffsetForEpoch(),
> putRemotePartitionDeleteMetadata(), listRemoteLogSegments(),
> onPartitionLeadershipChanges() and onStopPartitions().
> None of these APIs allows us to expose the log size; hence, the only
> option that remains is to list all segments using listRemoteLogSegments()
> and aggregate them every time we need to calculate the size. Based on our
> prior discussion, this requires reading all segment metadata, which won't
> work for non-local RLMM implementations.
> Satish's implementation also performs a full scan and calculates the
> aggregate. see:
> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/remote/RemoteLogManager.scala#L619
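
For illustration, a sketch of that full-scan aggregation (the method names
follow the KIP-405 interface listed above, but treat the exact signatures
as assumptions, not a verified implementation):

    import java.util.Iterator;
    import org.apache.kafka.common.TopicIdPartition;
    import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
    import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
    import org.apache.kafka.server.log.remote.storage.RemoteStorageException;

    public class RemoteLogSizeSketch {
        // Aggregates segment sizes by scanning all remote segment metadata --
        // the per-call full scan this KIP wants to avoid.
        public static long remoteLogSize(RemoteLogMetadataManager rlmm, TopicIdPartition tp)
                throws RemoteStorageException {
            long totalBytes = 0L;
            Iterator<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(tp);
            while (segments.hasNext()) {
                totalBytes += segments.next().segmentSizeInBytes();
            }
            return totalBytes;
        }
    }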
>
>
> Does this answer your question?
>
> --
> Divij Vaidya
>
>
>
> On Tue, Dec 20, 2022 at 8:40 PM Jun Rao  wrote:
>
>> Hi, Divij,
>>
>> Thanks for the explanation.
>>
>> Good question.
>>
>> Hi, Satish,
>>
>> Could you explain how you exposed the log size in your KIP-405
>> implementation?
>>
>> Thanks,
>>
>> Jun
>>
>> On Tue, Dec 20, 2022 at 4:59 AM Divij Vaidya 
>> wrote:
>>
>> > Hey Jun
>> >
>> > Yes, it is possible to maintain the log size in the cache (see rejected
>> > alternative#3 in the KIP) but I did not understand how it is possible to
>> > retrieve it without the new API. The log size could be calculated on
>> > startup by scanning through the segments (though I would disagree that
>> this
>> > is the right approach since scanning itself takes order of minutes and
>> > hence delay the start of archive process), and incrementally maintained
>> > afterwards, even then, we would need an API in RemoteLogMetadataManager
>> so
>> > that RLM could fetch the cached size!
>> >
>> > If we wish to cache the size without adding a new API, then we need to
>> > cache the size in RLM itself (instead of RLMM implementation) and
>> > incrementally manage it. The downside of longer archive time at startup
>> > (due to initial scale) still remains valid in this situation.
>> >
>> > --
>> > Divij Vaidya
>> >
>> >
>> >
>> > On Fri, Dec 16, 2022 at 12:43 AM Jun Rao 
>> wrote:
>> >
>> > > Hi, Divij,
>> > >
>> > > Thanks for the explanation.
>> > >
>> > > If there is in-memory cache, could we maintain the log size in the
>> cache
>> > > with the existing API? For example, a replica could make a
>> > > listRemoteLogSegments(TopicIdPartition topicIdPartition) call on
>> startup
>> > to
>> > > get the remote segment size before the current leaderEpoch. The leader
>> > > could then maintain the size incrementally afterwards. On leader
>> change,
>> > > other replicas can make a listRemoteLogSegments(TopicIdPartition
>> > > topicIdPartition, int leaderEpoch) call to get the size of newly
>> > generated
>> > > segments.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > > On Wed, Dec 14, 2022 at 3:27 AM Divij Vaidya > >
>> > > wrote:
>> > >
>> > > > > Is the new method enough for doing size-based retention?
>> > > >
>> > > > Yes. You are right in assuming that this API only provides the
>> Remote
>> > > > storage size (for current epoch chain). We would use this API for
>> size
>> > > > based retention along with a value of localOnlyLogSegmentSize which
>> is
>> > > > computed as Log.sizeInBytes(logSegments.filter(_.baseOffset >
>> > > > highestOffsetWithRemoteIndex)). Hence, (total_log_size =
>> > > > remoteLogSizeBytes + log.localOnlyLogSegmentSize). I have updated
>> the
>> > KIP
>> > > > with this information. You can also check an example implementation
>> at
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/satishd/kafka/blob/2.8.x-tiered-storage/core/src/main/scala/kafka/log/Log.scala#L2077
>> > > >
>> > > >
>> > > > > Do you imagine all accesses to remote metadata will be across the
>> > > network
>> > > > or will there be some local in-memory cache?
>> > > >
>> > > > I would expect a disk-less implementation to maintain a finite
>> > in-memory
>> > > > cache for segment metadata to optimize the number of network calls
>> made
>> > > to
>> > > > fetch the data. In future, we can think about bringing this finite
>> size
>> > > > cache into RLM itself but that's probably a conversation for a
>> > different
>> > > > KIP. There are many other things we would like to do to optimize the
>> > > Tiered
>> > > > storage interface such as introducing a circular buffer / streaming
>> > > > interface from RSM (so that we don't have to wait to fetch the
>> entire
>> > > > segment before starting to send records to the 

Re: [VOTE] 3.4.1 RC3

2023-06-01 Thread Luke Chen
Hi all,

Thanks to everyone who has tested and voted for the RC3 so far!
Currently, I've got 2 binding votes and 3 non-binding votes:

Binding +1 PMC votes:
* Chris Egerton
* Mickael Maison

Non-binding votes:
* Federico Valeri
* Jakub Scholz
* Josep Prat

If anyone is available (especially PMC members :)), please help verify the
RC build.

Thank you.
Luke

On Wed, May 31, 2023 at 1:53 AM Chris Egerton 
wrote:

> Hi Luke,
>
> Many thanks for your continued work on this release!
>
> To verify, I:
> - Built from source using Java 11 with both:
> - - the 3.4.1-rc3 tag on GitHub
> - - the kafka-3.4.1-src.tgz artifact from
> https://home.apache.org/~showuon/kafka-3.4.1-rc3/
> - Checked signatures and checksums
> - Ran the quickstart using the kafka_2.13-3.4.1.tgz artifact from
> https://home.apache.org/~showuon/kafka-3.4.1-rc3/ with Java 11 and Scala
> 2.13 in KRaft mode
> - Ran all unit tests
> - Ran all integration tests for Connect and MM2
>
> +1 (binding)
>
> Cheers,
>
> Chris
>
> On Tue, May 30, 2023 at 11:16 AM Mickael Maison 
> wrote:
>
> > Hi Luke,
> >
> > I built from source with Java 11 and Scala 2.13 and ran the unit and
> > integration tests. It took a few retries to get some of them to pass.
> > I verified signatures and hashes and also ran the zookeeper quickstart.
> >
> > +1 (binding)
> >
> > Thanks,
> > Mickael
> >
> > On Sat, May 27, 2023 at 12:58 PM Jakub Scholz  wrote:
> > >
> > > +1 (non-binding) ... I used the staged binaries and Maven artifacts to
> > run
> > > my tests and all seems to work fine.
> > >
> > > Thanks for running the release.
> > >
> > > Jakub
> > >
> > > On Fri, May 26, 2023 at 9:34 AM Luke Chen  wrote:
> > >
> > > > Hello Kafka users, developers and client-developers,
> > > >
> > > > This is the 4th candidate for release of Apache Kafka 3.4.1.
> > > >
> > > > This is a bugfix release with several fixes since the release of
> > 3.4.0. A
> > > > few of the major issues include:
> > > > - core
> > > > KAFKA-14644 
> > Process
> > > > should stop after failure in raft IO thread
> > > > KAFKA-14946 
> KRaft
> > > > controller node shutting down while renouncing leadership
> > > > KAFKA-14887  ZK
> > session
> > > > timeout can cause broker to shutdown
> > > > - client
> > > > KAFKA-14639 
> Kafka
> > > > CooperativeStickyAssignor revokes/assigns partition in one rebalance
> > cycle
> > > > - connect
> > > > KAFKA-12558  MM2
> > may
> > > > not
> > > > sync partition offsets correctly
> > > > KAFKA-14666  MM2
> > should
> > > > translate consumer group offsets behind replication flow
> > > > - stream
> > > > KAFKA-14172  bug:
> > State
> > > > stores lose state when tasks are reassigned under EOS
> > > >
> > > >
> > > > Release notes for the 3.4.1 release:
> > > > https://home.apache.org/~showuon/kafka-3.4.1-rc3/RELEASE_NOTES.html
> > > >
> > > > *** Please download, test and vote by Jun 2, 2023
> > > >
> > > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > > https://kafka.apache.org/KEYS
> > > >
> > > > * Release artifacts to be voted upon (source and binary):
> > > > https://home.apache.org/~showuon/kafka-3.4.1-rc3/
> > > >
> > > > * Maven artifacts to be voted upon:
> > > >
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > > >
> > > > * Javadoc:
> > > > https://home.apache.org/~showuon/kafka-3.4.1-rc3/javadoc/
> > > >
> > > > * Tag to be voted upon (off 3.4 branch) is the 3.4.1 tag:
> > > > https://github.com/apache/kafka/releases/tag/3.4.1-rc3
> > > >
> > > > * Documentation: (will be updated after released)
> > > > https://kafka.apache.org/34/documentation.html
> > > >
> > > > * Protocol: (will be updated after released)
> > > > https://kafka.apache.org/34/protocol.html
> > > >
> > > > The most recent build has had test failures. These all appear to be
> > due to
> > > > flakiness, but it would be nice if someone more familiar with the
> > failed
> > > > tests could confirm this. I may update this thread with passing build
> > links
> > > > if I can get one, or start a new release vote thread if test failures
> > must
> > > > be addressed beyond re-running builds until they pass.
> > > >
> > > > Unit/integration tests:
> > > > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.4/141/
> > > >
> > > > System tests:
> > > > Will update the results later
> > > >
> > > > Thank you
> > > > Luke
> > > >
> >
>


Re: [DISCUSS] KIP-925: rack aware task assignment in Kafka Streams

2023-06-01 Thread Bruno Cadonna

Hi Hao,

Thanks for the updates!

What do you think about dropping config rack.aware.assignment.enabled 
and add value NONE to the enum for the possible values of config 
rack.aware.assignment.strategy?
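
For example, something along these lines (a sketch only; the non-NONE value
name is a placeholder, not taken from the KIP):

    // Possible values of rack.aware.assignment.strategy if the separate
    // "enabled" flag is dropped; NONE means rack awareness is turned off.
    public enum RackAwareAssignmentStrategy {
        NONE,
        MIN_TRAFFIC // placeholder for the cost-minimizing strategy described in the KIP
    }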


Best,
Bruno

On 31.05.23 23:31, Hao Li wrote:

Hi all,

I've updated the KIP based on the feedback. Major changes I made:
1. Add rack aware assignment to `StickyTaskAssignor`
2. Reject `Prefer reliability and then find optimal cost` option in standby
task assignment.


On Wed, May 31, 2023 at 12:09 PM Hao Li  wrote:


Hi all,

Thanks for the feedback! I will update the KIP accordingly.

*For Sophie's comments:*

1 and 2. Good catch. Fixed these.

3 and 4 Yes. We can make this public config and call out the
clientConsumer config users need to set.

5. It would be ideal to take the previous assignment in HAAssignor into
consideration when we compute our target assignment; the complications come
from making sure the assignment can eventually converge and we don't do
probing rebalances infinitely. It's not only about storing the previous
assignment or getting it somehow. We can actually get the previous assignment
now, like we do in StickyAssignor, but the previous assignment will change
in each round of probing rebalance. The proposal that adds some weight to
make the rack aware assignment lean towards the original HAA's target
assignment will add stability in some corner cases where there is a
tie in cross rack traffic cost. But it's not sticky. The bottom line is
it won't be worse than the current HAA's stickiness.

6. I'm fine with changing the assignor config to public. Actually, I think
we can use the min-cost algorithm with StickyAssignor as well to mitigate the
problem in 5. So we can have one public config to choose an assignor and
one public config to enable the rack aware assignment.

*For Bruno's comments:*

The proposal was to implement all the options and use configs to choose
them during runtime. We can make those configs public as suggested.
1, 2, 3, 4, 5: agree and will fix those.
6: subscription protocol is not changed.
7: yeah. Let me fix the notations.
8: It meant clients. In the figure, it maps to `c1_1`, `c1_2`, `c1_3` etc.
9: I'm also ok with just optimizing reliability for standby tasks. Or we
could simply run the "balance reliability over cost" greedy algorithm to
see if any cost could be reduced.
10: Make sense. Will fix the wording.
11: Make sense. Will update the test part.

*For Walker's comments:*
1. Stability for HAA is an issue. See my comments for Sophie's feedback 5
and 6. I think we could use the rack aware assignment for StickyAssignor as
well. For HAA assignments, it's less sticky and we can only shoot for
minimizing the cross rack traffic eventually when everything is stable.
2. Yeah. This is a good point and we can also turn it on for
StickyAssignor.

Thanks,
Hao


On Tue, May 30, 2023 at 2:28 PM Sophie Blee-Goldman <
ableegold...@gmail.com> wrote:


Hey Hao, thanks for the KIP!

1. There's a typo in the "internal.rack.aware.assignment.strategry" config;
this should be internal.rack.aware.assignment.strategy.

2.


  For O(E^2 * (CU)) complexity, C and U can be viewed as constant. Number of
edges E is T * N where T is the number of clients and N is the number of
Tasks. This is because a task can be assigned to any client so there will
be an edge between every task and every client. The total complexity would
be O(T * N) if we want to be more specific.


I feel like I'm missing something here, but if E = T * N and the complexity
is ~O(E^2), doesn't this make the total complexity order of O(T^2 * N^2)?
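(Spelling out the substitution: with E = T * N, O(E^2 * (CU)) =
O((T * N)^2 * (CU)) = O(T^2 * N^2) once C and U are treated as constants.)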

3.


Since 3.C.I and 3.C.II have different tradeoffs and work better in
different workloads etc., we
could add an internal configuration to choose one of them at runtime.



Why only an internal configuration? Same goes for
internal.rack.aware.assignment.standby.strategry (which also has the typo)

4.


  There are no changes in public interfaces.


I think it would be good to explicitly call out that users can utilize
this
new feature by setting the
ConsumerConfig's CLIENT_RACK_CONFIG, possibly with a brief example
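
For instance, a minimal sketch of what such an example might look like
(passing the rack through the consumer-prefixed client.rack config is an
assumption about how Streams forwards consumer configs, and "us-east-1a"
is just a made-up rack id):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class RackAwareStreamsConfig {
        public static Properties buildConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rack-aware-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            // Tell Streams which rack/AZ this client runs in so the assignor
            // can try to minimize cross-rack traffic.
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.CLIENT_RACK_CONFIG), "us-east-1a");
            return props;
        }
    }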

5.


The idea is that if we always try to make it overlap as much with
HAAssignor’s target assignment, at least there’s a higher chance that
tasks won’t be shuffled a lot if the clients remain the same across
rebalances.



This line definitely gave me some pause -- if there was one major takeaway
I had after KIP-441,
one thing that most limited the feature's success, it was our assumption
that clients are relatively
stable across rebalances. This was mostly true at limited scale or for
on-prem setups, but
unsurprisingly broke down in cloud environments or larger clusters. Not
only do clients naturally
fall in and out of the group, autoscaling is becoming more and more of a
thing.

Lastly, and this is more easily solved but still worth calling out, an
assignment is only deterministic
as long as the client.id is persisted. Currently in Streams, we only
write

Re: [DISCUSS] solutions for broker OOM caused by many producer IDs

2023-06-01 Thread Luke Chen
Hi Omnia,

Thanks for putting this together into the KIP!
I'll have a look when available.

Thanks.
Luke

On Thu, Jun 1, 2023 at 1:50 AM Justine Olshan  wrote:

> Hey Omnia,
>
> I was doing a bit of snooping (I actually just get updates for the KIP
> page) and I saw this draft was in progress. I shared it with some of my
> colleagues as well who I previously discussed the issue with.
>
> The initial look was pretty promising to me. I appreciate the detailing of
> the rejected options since we had quite a few we worked through :)
>
> One question I have is how will we handle a scenario where potentially
> each new client has a new Kafka Principal? Is that simply not covered by
> throttling?
>
> Thanks,
> Justine
>
> On Wed, May 31, 2023 at 10:08 AM Omnia Ibrahim 
> wrote:
>
>> Hi Justine and Luke,
>>
>> I started a KIP draft here
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-936%3A+Throttle+number+of+active+PIDs
>> for a proposal would appreciate it if you could provide any initial
>> feedback before opening a broader discussion.
>>
>> Thanks
>>
>> On Wed, Feb 22, 2023 at 4:35 PM Omnia Ibrahim 
>> wrote:
>>
>>>
> >>> Hi Justine,
> >>>
> >>> My initial thought of throttling the initProducerId was to cut the
> >>> problem off at the source (the client that creates too many PIDs) and
> >>> fail faster, but if having this at the produce request level is easier,
> >>> that should be fine. I am guessing it will go in the same direction as
> >>> the ClientQuotaManager for Produce throttling, with a different quota
> >>> window than `quota.window.size.seconds`.
> >>>
> >>> If this is good as an initial solution, I can start a KIP and see
> >>> what the wider community feels about this.
> >>>
> >>> Also, I noticed that at some point one of us hit "Reply" instead of
> >>> "Reply to All" :)  So here are the previous conversations:
>>>
>>> *On Wed, Feb 15, 2023 at 12:20 AM Justine Olshan >> > wrote:*
>>>
 Hey Omnia,

 Thanks for the response. I think I understand your explanations here
 with respect to principal and clientId usage.

 For the throttling -- handleInitProducerIdRequest will allocate the ID
 to the producer, but we don't actually store it on the broker or increment
 our metric until the first produce request for that producer is sent (or
 sent again after previously expiring). Would you consider throttling the
 produce request instead? It may be hard to get any metrics from the
 transaction coordinator where the initProducerId request is handled.

 Justine
>>>
>>>
>>> *On Tue, Feb 14, 2023 at 9:29 AM Omnia Ibrahim >> > wrote:*
>>>
 Hey Justine,
 > If I understand your message correctly, there are issues with
 identifying the source of the rogue clients? So you propose to add a new
 metric for that?
 > And also proposing to throttle based on clientId as a potential
 follow up?
 I want to identify rogue clients by KafkaPrincipal (and/or clientId)
 similarly to how we identify clients in Fetch/Produce/Request
 QuotaManagers. Using KafkaPrincipal should give cluster admin the ability
 to throttle later based on principal which is most likely to be a smaller
 set than clientIds. My initial thought was to add a metric that represents
 how many InitProducerIdRequests are sent by KafkaPrincipal (and/or clientId),
 similar to the Fetch/Produce QuotaManagers.
 Then as a follow-up, we can throttle based on either KafkaPrincipal or
 clientId (maybe default as well to align this with other QuotaManagers in
 Kafka).
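
 A rough sketch of what such a per-principal/per-clientId rate metric could
 look like using Kafka's common Metrics classes (the metric group, names and
 tags below are illustrative assumptions, not part of any proposal):

    import java.util.Map;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;

    public class InitProducerIdRateMetrics {
        private final Metrics metrics = new Metrics();

        // One sensor per (principal, clientId); call record() on it for every
        // InitProducerId request received from that pair.
        public Sensor sensorFor(String principal, String clientId) {
            String name = "init-producer-id:" + principal + ":" + clientId;
            Sensor sensor = metrics.getSensor(name);
            if (sensor == null) {
                sensor = metrics.sensor(name);
                sensor.add(metrics.metricName("init-producer-id-rate", "pid-metrics",
                                "Rate of InitProducerId requests",
                                Map.of("principal", principal, "client-id", clientId)),
                        new Rate());
            }
            return sensor;
        }
    }

 A quota check could then compare the sensor's measured rate against a
 configured limit, analogous to the existing produce/fetch quota managers.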

 >1. Do we rely on the client using the same ID? What if there are
 many clients that all use different client IDs?
 This is why I want to use the combination of KafkaPrincipal or clientId,
 similar to some other quotas we have in Kafka already. This will be a
 similar risk to the Fetch/Produce quotas in Kafka, which also rely on the client
 to use the same clientId and KafkaPrincipal.

 >2. Are there places where high cardinality of this metric is a
 concern? I can imagine many client IDs in the system. Would we treat this
 as a rate metric (ie, when we get an init producer ID and return a new
 producer ID we emit a count for that client id?) Or something else?
 My initial thought here was to follow the steps of ClientQuotaManager
 and ClientRequestQuotaManager and use a rate metric. However, I think we
 can emit it either

1. when we return the new PID. However, I have concerns that we may
circle back to the previous concerns about OOM due to keeping track of
ACTIVE PIDs per KafkaPrincipal (and/or clientId) in the future. Also, this
would be the first time Kafka throttles IDs for any client.
2. or once we receive the InitProducerIdRequest and throttle before
even hitting `handleInitProducerIdRequest`. Going this direction, we may
need to throttle it within a different quota window than `