Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #1632

2023-02-28 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #1631

2023-02-28 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 537266 lines...]
[2023-03-01T04:54:39.150Z] > Task :clients:compileTestJava UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :clients:testClasses UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :storage:api:compileTestJava UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :storage:api:testClasses UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :connect:json:compileTestJava UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :connect:json:testClasses UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :streams:generateMetadataFileForMavenJavaPublication
[2023-03-01T04:54:39.150Z] > Task :raft:compileTestJava UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :raft:testClasses UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :clients:generateMetadataFileForMavenJavaPublication
[2023-03-01T04:54:39.150Z] > Task :connect:json:testJar
[2023-03-01T04:54:39.150Z] > Task :connect:json:testSrcJar
[2023-03-01T04:54:39.150Z] > Task :server-common:compileTestJava UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :server-common:testClasses UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :group-coordinator:compileTestJava UP-TO-DATE
[2023-03-01T04:54:39.150Z] > Task :group-coordinator:testClasses UP-TO-DATE
[2023-03-01T04:54:41.093Z] > Task :metadata:compileTestJava UP-TO-DATE
[2023-03-01T04:54:41.093Z] > Task :metadata:testClasses UP-TO-DATE
[2023-03-01T04:54:42.280Z] 
[2023-03-01T04:54:42.280Z] > Task :connect:api:javadoc
[2023-03-01T04:54:42.280Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk_2/connect/api/src/main/java/org/apache/kafka/connect/source/SourceRecord.java:44: warning - Tag @link: reference not found: org.apache.kafka.connect.data
[2023-03-01T04:54:45.447Z] 1 warning
[2023-03-01T04:54:45.447Z] 
[2023-03-01T04:54:45.447Z] > Task :connect:api:copyDependantLibs UP-TO-DATE
[2023-03-01T04:54:45.447Z] > Task :connect:api:jar UP-TO-DATE
[2023-03-01T04:54:45.447Z] > Task :connect:api:generateMetadataFileForMavenJavaPublication
[2023-03-01T04:54:45.447Z] > Task :connect:json:copyDependantLibs UP-TO-DATE
[2023-03-01T04:54:45.447Z] > Task :connect:json:jar UP-TO-DATE
[2023-03-01T04:54:45.447Z] > Task :connect:json:generateMetadataFileForMavenJavaPublication
[2023-03-01T04:54:45.447Z] > Task :connect:api:javadocJar
[2023-03-01T04:54:45.447Z] > Task :connect:api:compileTestJava UP-TO-DATE
[2023-03-01T04:54:45.447Z] > Task :connect:api:testClasses UP-TO-DATE
[2023-03-01T04:54:45.447Z] > Task :connect:json:publishMavenJavaPublicationToMavenLocal
[2023-03-01T04:54:45.447Z] > Task :connect:json:publishToMavenLocal
[2023-03-01T04:54:45.447Z] > Task :connect:api:testJar
[2023-03-01T04:54:45.447Z] > Task :connect:api:testSrcJar
[2023-03-01T04:54:45.447Z] > Task :connect:api:publishMavenJavaPublicationToMavenLocal
[2023-03-01T04:54:45.447Z] > Task :connect:api:publishToMavenLocal
[2023-03-01T04:54:49.019Z] > Task :streams:javadoc
[2023-03-01T04:54:49.019Z] > Task :streams:javadocJar
[2023-03-01T04:54:49.935Z] 
[2023-03-01T04:54:49.935Z] > Task :clients:javadoc
[2023-03-01T04:54:49.935Z] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka_trunk_2/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/secured/package-info.java:21: warning - Tag @link: reference not found: org.apache.kafka.common.security.oauthbearer
[2023-03-01T04:54:52.488Z] 1 warning
[2023-03-01T04:54:52.488Z] 
[2023-03-01T04:54:52.488Z] > Task :clients:javadocJar
[2023-03-01T04:54:54.612Z] > Task :clients:srcJar
[2023-03-01T04:54:54.613Z] > Task :clients:testJar
[2023-03-01T04:54:54.613Z] > Task :clients:testSrcJar
[2023-03-01T04:54:54.613Z] > Task :clients:publishMavenJavaPublicationToMavenLocal
[2023-03-01T04:54:54.613Z] > Task :clients:publishToMavenLocal
[2023-03-01T04:55:09.380Z] > Task :core:compileScala
[2023-03-01T04:56:53.372Z] > Task :core:classes
[2023-03-01T04:56:53.372Z] > Task :core:compileTestJava NO-SOURCE
[2023-03-01T04:57:15.218Z] > Task :core:compileTestScala
[2023-03-01T04:58:41.826Z] > Task :core:testClasses
[2023-03-01T04:58:41.826Z] > Task :streams:compileTestJava UP-TO-DATE
[2023-03-01T04:58:41.826Z] > Task :streams:testClasses UP-TO-DATE
[2023-03-01T04:58:41.826Z] > Task :streams:testJar
[2023-03-01T04:58:41.826Z] > Task :streams:testSrcJar
[2023-03-01T04:58:41.826Z] > Task :streams:publishMavenJavaPublicationToMavenLocal
[2023-03-01T04:58:41.826Z] > Task :streams:publishToMavenLocal
[2023-03-01T04:58:41.826Z] 
[2023-03-01T04:58:41.826Z] Deprecated Gradle features were used in this build, making it incompatible with Gradle 9.0.
[2023-03-01T04:58:41.826Z] 
[2023-03-01T04:58:41.826Z] You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
[2023-03-01T04:58:41.826Z] 
[2023-03-01T04:58:41.826Z] See https://docs.gradle.org/8.0.1/userguide/command_line_interface.html#sec:command_line_warnings

[jira] [Resolved] (KAFKA-14371) quorum-state file contains empty/unused clusterId field

2023-02-28 Thread Luke Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen resolved KAFKA-14371.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> quorum-state file contains empty/unused clusterId field
> ---
>
> Key: KAFKA-14371
> URL: https://issues.apache.org/jira/browse/KAFKA-14371
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ron Dagostino
>Assignee: Gantigmaa Selenge
>Priority: Minor
> Fix For: 3.5.0
>
>
> The KRaft controller's quorum-state file 
> `$LOG_DIR/__cluster_metadata-0/quorum-state` contains an empty clusterId 
> value.  This value is never non-empty, and it is never used after it is 
> written and then subsequently read.  This is a cosmetic issue; it would be 
> best if this value did not exist there.  The cluster ID already exists in the 
> `$LOG_DIR/meta.properties` file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #1630

2023-02-28 Thread Apache Jenkins Server
See 




[jira] [Resolved] (KAFKA-12639) AbstractCoordinator ignores backoff timeout when joining the consumer group

2023-02-28 Thread Guozhang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang resolved KAFKA-12639.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> AbstractCoordinator ignores backoff timeout when joining the consumer group
> ---
>
> Key: KAFKA-12639
> URL: https://issues.apache.org/jira/browse/KAFKA-12639
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 2.7.0
>Reporter: Matiss Gutmanis
>Assignee: Philip Nee
>Priority: Major
> Fix For: 3.5.0
>
>
> We observed heavy logging while trying to join a consumer group during partial 
> unavailability of the Kafka cluster (it's part of our testing process). It seems 
> that {{rebalanceConfig.retryBackoffMs}} used in 
> {{org.apache.kafka.clients.consumer.internals.AbstractCoordinator#joinGroupIfNeeded}}
> is not respected. Debugging revealed that the {{Timer}} instance is technically 
> expired, so a sleep of 0 milliseconds is used, which defeats the purpose of the 
> backoff timeout.
> A minimal backoff timeout should be respected.
>  
> {code:java}
> 2021-03-30 08:30:24,488 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] JoinGroup failed: Coordinator 
> 127.0.0.1:9092 (id: 2147483634 rack: null) is loading the group.
> 2021-03-30 08:30:24,488 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] Rebalance failed.
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The 
> coordinator is loading and hence can't process requests.
> 2021-03-30 08:30:24,488 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] (Re-)joining group
> 2021-03-30 08:30:24,489 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] JoinGroup failed: Coordinator 
> 127.0.0.1:9092 (id: 2147483634 rack: null) is loading the group.
> 2021-03-30 08:30:24,489 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] Rebalance failed.
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The 
> coordinator is loading and hence can't process requests.
> 2021-03-30 08:30:24,489 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] (Re-)joining group
> 2021-03-30 08:30:24,490 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] JoinGroup failed: Coordinator 
> 127.0.0.1:9092 (id: 2147483634 rack: null) is loading the group.
> 2021-03-30 08:30:24,490 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] Rebalance failed.
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The 
> coordinator is loading and hence can't process requests.
> 2021-03-30 08:30:24,490 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] (Re-)joining group
> 2021-03-30 08:30:24,491 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] JoinGroup failed: Coordinator 
> 127.0.0.1:9092 (id: 2147483634 rack: null) is loading the group.
> 2021-03-30 08:30:24,491 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] Rebalance failed.
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The 
> coordinator is loading and hence can't process requests.
> 2021-03-30 08:30:24,491 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] (Re-)joining group
> 2021-03-30 08:30:24,492 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] JoinGroup failed: Coordinator 
> 127.0.0.1:9092 (id: 2147483634 rack: null) is loading the group.
> 2021-03-30 08:30:24,492 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] Rebalance failed.
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The 
> coordinator is loading and hence can't process requests.
> 2021-03-30 08:30:24,492 INFO 
> [fs2-kafka-consumer-41][o.a.k.c.c.i.AbstractCoordinator] [Consumer 
> clientId=app_clientid, groupId=consumer-group] (Re-)joining group
> {code}
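The arithmetic behind the bug is simple enough to sketch. Below is a toy illustration (hypothetical helper names, not the actual AbstractCoordinator code) of how computing the sleep from an already-expired timer collapses to 0 ms, and what honoring a minimum backoff looks like instead:

```java
// Toy illustration of KAFKA-12639: a JoinGroup retry sleep bounded by an
// already-expired timer collapses to 0 ms; enforcing the configured minimum
// backoff restores the intended throttling. Names are hypothetical.
public class BackoffDemo {

    // Pre-fix behavior: sleep is bounded by the timer's remaining time,
    // so an expired timer (0 ms remaining) means no backoff at all.
    static long buggySleepMs(long remainingMs, long retryBackoffMs) {
        return Math.min(remainingMs, retryBackoffMs);
    }

    // Post-fix intent: always honor the minimum backoff between attempts.
    static long fixedSleepMs(long remainingMs, long retryBackoffMs) {
        return Math.max(retryBackoffMs, Math.min(remainingMs, retryBackoffMs));
    }

    public static void main(String[] args) {
        // With retry.backoff.ms = 100 and an expired timer:
        if (buggySleepMs(0L, 100L) != 0L) throw new AssertionError();
        if (fixedSleepMs(0L, 100L) != 100L) throw new AssertionError();
        System.out.println("ok");
    }
}
```

With a 0 ms sleep the client hammers the coordinator in a tight loop, which is exactly the millisecond-apart log spam shown above.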





Re: [VOTE] KIP-903: Replicas with stale broker epoch should not be allowed to join the ISR

2023-02-28 Thread Jason Gustafson
Thanks Calvin, +1 from me.

On Mon, Feb 27, 2023 at 9:41 AM Calvin Liu 
wrote:

> Hi Jason,
> Updated the fields accordingly. Also, rename the BrokerState to
> ReplicaState.
> Thanks.
>
> On Wed, Feb 22, 2023 at 4:38 PM Jason Gustafson  >
> wrote:
>
> > Hi Calvin,
> >
> > The `BrokerState` struct I suggested would replace the `BrokerId` field
> in
> > older versions.
> >
> > { "name": "ReplicaId", "type": "int32", "versions": "0-13",
> > "entityType": "brokerId",
> >   "about": "The broker ID of the follower, of -1 if this request is
> > from a consumer." },
> > { "name": "BrokerState", "type": "BrokerState", "taggedVersions":
> > "14+", "tag": 1, "fields": [
> >   { "name": "BrokerId", "type": "int32", "versions": "14+",
> > "entityType": "brokerId",
> > "about": "The broker ID of the follower, of -1 if this request is
> > from a consumer." },
> >   { "name": "BrokerEpoch", "type": "int64", "versions": "14+",
> "about":
> > "The epoch of this follower." }
> > ]},
> >
> > Note that the version range of `ReplicaId` is set to 0-13. Version 14
> > onward would not include it.
> >
> > -Jason
> >
> > On Wed, Feb 22, 2023 at 12:07 PM Calvin Liu 
> > wrote:
> >
> > > To Jose:
> > > 1. Actually, I have second thoughts about the naming ReplicaEpoch. The
> > > BrokerEpoch only applies to the replication protocol between the
> brokers.
> > > For others like the KRaft controller, this field can be ignored. So can
> > we
> > > change the name to ReplicaEpoch when we really use it in other
> scenarios?
> > >
> > > On Wed, Feb 22, 2023 at 11:08 AM Calvin Liu 
> wrote:
> > >
> > > > To Jason:
> > > > 1. Related to the Fetch Request fields change, previously you
> suggested
> > > > deprecating the ReplicaId and moving it into a BrokerState field. How
> > > about
> > > > we just make the BrokerEpoch a tag field?
> > > > - The ReplicaId is currently in use and is filled every time. So that
> > we
> > > > can keep the way simple.
> > > > - We can still make the optional BrokerEpoch out of the request when
> it
> > > is
> > > > not needed.
> > > >
> > > > On Tue, Feb 21, 2023 at 10:39 PM Calvin Liu 
> > wrote:
> > > >
> > > >> To Jason:
> > > >> 1. We can make the BrokerEpoch a tagged field. But I am not sure
> about
> > > >> your proposed metadata structure. In the BrokerState, do we need to
> > > store
> > > >> the BrokerId again? It would duplicate with ReplicaId.
> > > >> 2. Considering that the broker reboot data loss case is rare and
> Kraft
> > > is
> > > >> coming soon. Plus we need extra effort to
> > > >> - Simply asking the controller to compare the epoch with its best
> > > >> knowledge is not enough, because the ZK controller may not know the
> > > latest
> > > >> broker epoch,
> > > >> - The current design only helps with the delayed AlterPartition
> issue
> > > >> when the broker reboots. In ZK mode, we also need to cover the
> broker
> > > >> reboot + controller reboot scenario. If the reboot broker is in ISR
> > > >> already, the controller also crashes during the broker reboot, the
> new
> > > >> controller can be completely unaware of the bounced broker and
> select
> > > this
> > > >> broker as the leader.
> > > >> - Create a test framework to simulate the event sequence of broker
> > > reboot
> > > >> and registration, delayed AlterPartition request.
> > > >>
> > > >> To Jose:
> > > >> 1. Thanks for the renaming advice. I will update the KIP later.
> > > >> 2. Ack, will update.
> > > >>
> > > >>
> > > >> On Tue, Feb 21, 2023 at 2:49 PM José Armando García Sancio
> > > >>  wrote:
> > > >>
> > > >>> Hi Calvin,
> > > >>>
> > > >>> Thanks for the improvement.
> > > >>>
> > > >>> 1. In the KIP, you suggest changing the Fetch request to "Rename
> the
> > > >>> ReplicaId to BrokerId" and "Add a new Field BrokerEpoch". The Fetch
> > > >>> RPC is used by replicas that are not brokers, for example
> controllers
> > > >>> in KRaft.
> > > >>> Can we keep the name "ReplicaId" and use "ReplicaEpoch". Both KRaft
> > > >>> and ISR partitions have the concept of replica id and replica epoch
> > > >>> but not necessarily the concept of a broker.
> > > >>>
> > > >>> 2. Since the new field "BrokerEpoch '' is ignorable, should it also
> > > >>> have a default value? How about -1 since that is what you use in
> > > >>> AlterPartittion RPC.
> > > >>>
> > > >>> --
> > > >>> -José
> > > >>>
> > > >>
> > >
> >
>
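For readers following along, the version gating Jason describes in the quoted schema can be sketched as a toy model (hypothetical classes, not Kafka's generated protocol code): pre-v14 requests carry only ReplicaId, while v14+ requests move the id and epoch into the tagged BrokerState field.

```java
// Hypothetical sketch of the proposed Fetch request versioning from KIP-903:
// ReplicaId exists only for versions 0-13; v14+ uses a tagged BrokerState
// carrying both the broker id and the broker epoch.
public class FetchRequestVersioning {

    static final class FetchRequestData {
        int replicaId = -1;    // versions 0-13 only
        int brokerId = -1;     // tagged BrokerState, versions 14+
        long brokerEpoch = -1L;
    }

    // Populate whichever field exists at the negotiated request version.
    static FetchRequestData build(short version, int id, long epoch) {
        FetchRequestData data = new FetchRequestData();
        if (version >= 14) {
            data.brokerId = id;
            data.brokerEpoch = epoch;
        } else {
            data.replicaId = id; // the epoch cannot be conveyed pre-v14
        }
        return data;
    }

    public static void main(String[] args) {
        FetchRequestData old = build((short) 13, 5, 42L);
        if (old.replicaId != 5 || old.brokerEpoch != -1L) throw new AssertionError();
        FetchRequestData v14 = build((short) 14, 5, 42L);
        if (v14.brokerId != 5 || v14.brokerEpoch != 42L || v14.replicaId != -1)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```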


Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #1629

2023-02-28 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

2023-02-28 Thread Jason Gustafson
Hi Philip,

> Having an overall timeout also seems reasonable, but I wonder what should
the client do after running out of the time? Should we throw a
non-retriable exception (instead of TimeoutException to stop the client
from retrying) and alert the user to examine the config and the DNS server?

Yeah, not sure exactly. I'd probably suggest a
`BootstrapConnectionException` or something like that with a clear message
indicating the problem. What the user does with it is up to them, but at
least it gives them the option to fail their application if that is what
they prefer to do in this case. If they catch it and ignore it, I would
expect the client to just continue retrying. Logging for bootstrap
dns/connection failures will be helpful in any case.

-Jason
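A rough sketch of the bounded-bootstrap idea under discussion follows. Both the config name `bootstrap.servers.connection.timeout.ms` and `BootstrapConnectionException` are proposals from this thread, not an existing Kafka API, and DNS resolution stands in here for the full connection attempt:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch: retry bootstrap DNS resolution up to a deadline, then surface a
// dedicated exception (per the proposed BootstrapConnectionException) with a
// message pointing the user at bootstrap.servers / DNS, instead of either
// failing fast in the constructor or retrying forever.
public class BootstrapResolveDemo {

    static class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String msg, Throwable cause) { super(msg, cause); }
    }

    static InetAddress resolveWithDeadline(String host, long timeoutMs, long backoffMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        UnknownHostException last = null;
        do {
            try {
                return InetAddress.getByName(host);
            } catch (UnknownHostException e) {
                last = e; // log and back off, then retry until the deadline
                Thread.sleep(backoffMs);
            }
        } while (System.currentTimeMillis() < deadline);
        throw new BootstrapConnectionException(
            "Could not resolve bootstrap server " + host + " within " + timeoutMs
            + " ms; check bootstrap.servers and DNS", last);
    }

    public static void main(String[] args) throws Exception {
        // localhost resolves immediately; a reserved .invalid name never will.
        if (resolveWithDeadline("localhost", 1000L, 10L) == null)
            throw new AssertionError();
        try {
            resolveWithDeadline("no-such-host.invalid", 50L, 10L);
            throw new AssertionError("expected BootstrapConnectionException");
        } catch (BootstrapConnectionException expected) { }
        System.out.println("ok");
    }
}
```

Setting the timeout to 0 reproduces today's fail-fast behavior, and a very large value effectively disables it, which matches the compatibility argument quoted below.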






On Tue, Feb 28, 2023 at 11:47 AM Philip Nee  wrote:

> Jason:
> Thanks for the feedback.  Now giving it a second thought, I think your
> suggestion of logging the error might make sense, as I could imagine most
> users would just continue to retry, so it might not be necessary to throw
> an exception anyway.
> Having an overall timeout also seems reasonable, but I wonder what should
> the client do after running out of the time? Should we throw a
> non-retriable exception (instead of TimeoutException to stop the client
> from retrying) and alert the user to examine the config and the DNS server?
>
> Chris:
> I feel I still haven't answered your question about the pre-flight check,
> as it seems exposing an API might be harder to push through.
>
> Thanks!
> P
>
> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
> 
> wrote:
>
> > One more random thought I had just as I pushed send. We're currently
> > treating this problem somewhat narrowly by focusing only on the DNS
> > resolution of the bootstrap servers. Even if the servers resolve, there's
> > no guarantee that they are reachable by the client. Would it make sense
> to
> > have a timeout which bounds the total time that the client should wait to
> > connect to the bootstrap servers? Something like `
> > bootstrap.servers.connection.timeout.ms`.
> >
> > -Jason
> >
> > On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson 
> > wrote:
> >
> > > Hi Philip,
> > >
> > > An alternative is not to fail at all. Every other network error is
> caught
> > > and handled internally in the client. We see this case as different
> > because
> > > a DNS resolution error may imply misconfiguration. Could it also imply
> > that
> > > the DNS server is unavailable? I'm not sure why that case should be
> > handled
> > > differently than if the bootstrap servers themselves are unavailable.
> > Would
> > > it be enough to log a clear warning in the logs if the bootstrap
> servers
> > > could not resolve?
> > >
> > > On the whole, the current fail-fast approach probably does more good
> than
> > > bad, but it does seem somewhat inconsistent overall and my guess is
> that
> > > dynamic environments will become increasingly common. It would be nice
> to
> > > have a reasonable default behavior which could handle these cases
> > > gracefully without any additional logic. In any case, it would be nice
> to
> > > see this option in the rejected alternatives at least if we do not take
> > it.
> > >
> > > If we want to take the route of throwing an exception, then I think
> we're
> > > probably going to need a new configuration since I can't see what a
> > > reasonable timeout we would use as a default. The benefit of a
> > > configuration is that it would let us retain the current default
> behavior
> > > with timeout effectively set to 0 and it would also let users
> effectively
> > > disable the timeout by using a very large value. Otherwise, it seems
> > like a
> > > potential compatibility break to have a new exception type thrown at
> some
> > > arbitrary time without giving the user any control over it.
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > > On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton  >
> > > wrote:
> > >
> > >> Hi Philip,
> > >>
> > >> Yeah, it's basically DNS resolution we're talking about, though
> there's
> > >> some additional subtlety there with the logic introduced by KIP-235
> [1].
> > >> Essentially it should cover any scenario that causes a client
> > constructor
> > >> to fail with the current logic but would not after this KIP is
> released.
> > >>
> > >> We can generalize the Connect use case like this: a client application
> > >> that
> > >> may connect to different Kafka clusters with a public-facing,
> > easy-to-use
> > >> API for restarting failed tasks and automatic handling of retriable
> > >> exceptions. The ease with which failed tasks can be restarted is
> > >> significant because it reduces the cost of failing on non-retriable
> > >> exceptions and makes fail-fast behavior easier to work with. And, in
> > cases
> > >> like this where we can't really know whether the error we're dealing
> > with
> > >> is retriable or not, it's better IMO to continue to allow applications
> > >> like
> > 

[jira] [Resolved] (KAFKA-14769) NPE in ControllerMetricsManager when upgrading from old KRaft version

2023-02-28 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur resolved KAFKA-14769.
--
Resolution: Invalid

> NPE in ControllerMetricsManager when upgrading from old KRaft version
> -
>
> Key: KAFKA-14769
> URL: https://issues.apache.org/jira/browse/KAFKA-14769
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: David Arthur
>Assignee: Jose Armando Garcia Sancio
>Priority: Critical
> Fix For: 3.5.0, 3.4.1
>
>
> In older KRaft versions, we could see a ConfigRecord for a topic config 
> appear before the TopicRecord in a batch.
> When upgrading from an older KRaft version (e.g., 3.1), the latest code in 
> the KRaft controller hits an NPE when it encounters a ConfigRecord for a 
> topic config before the TopicRecord. This was introduced relatively recently 
> by KAFKA-14457.





[jira] [Created] (KAFKA-14769) NPE in ControllerMetricsManager when upgrading from old KRaft version

2023-02-28 Thread David Arthur (Jira)
David Arthur created KAFKA-14769:


 Summary: NPE in ControllerMetricsManager when upgrading from old 
KRaft version
 Key: KAFKA-14769
 URL: https://issues.apache.org/jira/browse/KAFKA-14769
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: David Arthur
Assignee: Jose Armando Garcia Sancio
 Fix For: 3.5.0, 3.4.1


In older KRaft versions, we could see a ConfigRecord for a topic config appear 
before the TopicRecord in a batch.

When upgrading from an older KRaft version (e.g., 3.1), the latest code in the 
KRaft controller hits an NPE when it encounters a ConfigRecord for a topic 
config before the TopicRecord. This was introduced relatively recently by 
KAFKA-14457.
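An order-tolerant replay loop is one way to picture the problem: if the controller assumes topic state exists before applying a topic's ConfigRecord, the old-KRaft ordering dereferences missing state (the NPE). This is an illustrative sketch with stand-in record types, not the actual ControllerMetricsManager code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of order-tolerant metadata replay: a ConfigRecord arriving before
// its TopicRecord (as older KRaft versions could write) must not NPE.
// TopicRecord/ConfigRecord here are hypothetical stand-ins.
public class ReplayDemo {
    record TopicRecord(String name) {}
    record ConfigRecord(String topic, String key, String value) {}

    static Map<String, Map<String, String>> replay(List<Object> batch) {
        Map<String, Map<String, String>> configs = new HashMap<>();
        for (Object rec : batch) {
            if (rec instanceof TopicRecord t) {
                configs.computeIfAbsent(t.name(), k -> new HashMap<>());
            } else if (rec instanceof ConfigRecord c) {
                // Create the topic's config map on demand instead of assuming
                // the TopicRecord was already replayed.
                configs.computeIfAbsent(c.topic(), k -> new HashMap<>())
                       .put(c.key(), c.value());
            }
        }
        return configs;
    }

    public static void main(String[] args) {
        // Old-KRaft ordering: config first, topic second -- no NPE.
        var batch = List.<Object>of(
            new ConfigRecord("foo", "cleanup.policy", "compact"),
            new TopicRecord("foo"));
        var configs = replay(batch);
        if (!"compact".equals(configs.get("foo").get("cleanup.policy")))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```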





Re: [VOTE] KIP-900: KRaft kafka-storage.sh API additions to support SCRAM for Kafka Brokers

2023-02-28 Thread Proven Provenzano
Hi all,
I am going to close the vote and start implementing.
The KIP is accepted with three binding votes from Colin, Jose, and
Manikumar.

--Proven

On Sat, Feb 25, 2023 at 1:21 AM Manikumar  wrote:

> +1 (binding)
>
> Thanks for the KIP.
>
> On Wed, Feb 22, 2023 at 3:48 AM José Armando García Sancio
>  wrote:
> >
> > LGTM Proven. Thanks for the improvements. +1 (binding)
> >
> > --
> > -José
>


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #1628

2023-02-28 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 448217 lines...]
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testDeleteNonExistentZNode() PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testExistsExistingZNode() STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testExistsExistingZNode() PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testZooKeeperStateChangeRateMetrics() 
STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testZooKeeperStateChangeRateMetrics() 
PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testZNodeChangeHandlerForDeletion() STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testZNodeChangeHandlerForDeletion() PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testGetAclNonExistentZNode() STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testGetAclNonExistentZNode() PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testStateChangeHandlerForAuthFailure() 
STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > ZooKeeperClientTest > testStateChangeHandlerForAuthFailure() 
PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > BrokerRegistrationRequestTest > 
testRegisterZkWithKRaftMigrationDisabled(ClusterInstance) > 
unit.kafka.server.BrokerRegistrationRequestTest.testRegisterZkWithKRaftMigrationDisabled(ClusterInstance)[1]
 STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > BrokerRegistrationRequestTest > 
testRegisterZkWithKRaftMigrationDisabled(ClusterInstance) > 
unit.kafka.server.BrokerRegistrationRequestTest.testRegisterZkWithKRaftMigrationDisabled(ClusterInstance)[1]
 PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > BrokerRegistrationRequestTest > 
testRegisterZkWithKRaftMigrationEnabled(ClusterInstance) > 
unit.kafka.server.BrokerRegistrationRequestTest.testRegisterZkWithKRaftMigrationEnabled(ClusterInstance)[1]
 STARTED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > BrokerRegistrationRequestTest > 
testRegisterZkWithKRaftMigrationEnabled(ClusterInstance) > 
unit.kafka.server.BrokerRegistrationRequestTest.testRegisterZkWithKRaftMigrationEnabled(ClusterInstance)[1]
 PASSED
[2023-02-28T20:04:32.224Z] 
[2023-02-28T20:04:32.224Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > BrokerRegistrationRequestTest > 
testRegisterZkWithKRaftOldMetadataVersion(ClusterInstance) > 
unit.kafka.server.BrokerRegistrationRequestTest.testRegisterZkWithKRaftOldMetadataVersion(ClusterInstance)[1]
 STARTED
[2023-02-28T20:04:42.931Z] 
[2023-02-28T20:04:42.931Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 173 > BrokerRegistrationRequestTest > 
testRegisterZkWithKRaftOldMetadataVersion(ClusterInstance) > 
unit.kafka.server.BrokerRegistrationRequestTest.testRegisterZkWithKRaftOldMetadataVersion(ClusterInstance)[1]
 PASSED
[2023-02-28T20:04:51.805Z] 
[2023-02-28T20:04:51.805Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 177 > ReassignPartitionsIntegrationTest > testCancellation(String) > 
kafka.admin.ReassignPartitionsIntegrationTest.testCancellation(String)[1] 
STARTED
[2023-02-28T20:05:11.063Z] 
[2023-02-28T20:05:11.063Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 177 > ReassignPartitionsIntegrationTest > testCancellation(String) > 
kafka.admin.ReassignPartitionsIntegrationTest.testCancellation(String)[1] PASSED
[2023-02-28T20:05:11.063Z] 
[2023-02-28T20:05:11.063Z] Gradle Test Run :core:integrationTest > Gradle Test 
Executor 177 > ReassignPartitionsIntegrationTest > testCancellation(String) > 

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

2023-02-28 Thread Philip Nee
Jason:
Thanks for the feedback.  Now giving it a second thought, I think your
suggestion of logging the error might make sense, as I could imagine most
users would just continue to retry, so it might not be necessary to throw
an exception anyway.
Having an overall timeout also seems reasonable, but I wonder what should
the client do after running out of the time? Should we throw a
non-retriable exception (instead of TimeoutException to stop the client
from retrying) and alert the user to examine the config and the DNS server?

Chris:
I feel I still haven't answered your question about the pre-flight check,
as it seems exposing an API might be harder to push through.

Thanks!
P

On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson 
wrote:

> One more random thought I had just as I pushed send. We're currently
> treating this problem somewhat narrowly by focusing only on the DNS
> resolution of the bootstrap servers. Even if the servers resolve, there's
> no guarantee that they are reachable by the client. Would it make sense to
> have a timeout which bounds the total time that the client should wait to
> connect to the bootstrap servers? Something like `
> bootstrap.servers.connection.timeout.ms`.
>
> -Jason
>
> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson 
> wrote:
>
> > Hi Philip,
> >
> > An alternative is not to fail at all. Every other network error is caught
> > and handled internally in the client. We see this case as different
> because
> > a DNS resolution error may imply misconfiguration. Could it also imply
> that
Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

2023-02-28 Thread Jason Gustafson
One more random thought I had just as I pushed send. We're currently
treating this problem somewhat narrowly by focusing only on the DNS
resolution of the bootstrap servers. Even if the servers resolve, there's
no guarantee that they are reachable by the client. Would it make sense to
have a timeout which bounds the total time that the client should wait to
connect to the bootstrap servers? Something like `
bootstrap.servers.connection.timeout.ms`.
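A bound like that could be sketched as a deadline loop around resolution attempts. Everything in this snippet is illustrative — the config name above is only a proposal, and no such helper exists in any released Kafka client:

```java
import java.net.InetAddress;
import java.time.Duration;
import java.time.Instant;

public class BootstrapDeadline {
    // Returns true if the host currently resolves via DNS.
    static boolean resolvable(String host) {
        try {
            InetAddress.getAllByName(host);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Retry resolution until the configured deadline passes, then give up.
    // This is the "bound the total time" behavior, not Kafka's actual logic.
    static boolean awaitBootstrap(String host, Duration timeout, Duration backoff)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (resolvable(host)) {
                return true;
            }
            Thread.sleep(backoff.toMillis());
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // localhost should resolve immediately on any machine.
        System.out.println(awaitBootstrap("localhost",
                Duration.ofSeconds(5), Duration.ofMillis(100)));
    }
}
```

With a timeout of 0 this degrades to today's fail-fast behavior, and a very large value effectively disables the bound — matching the compatibility argument below.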

-Jason

On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson  wrote:

> Hi Philip,
>
> An alternative is not to fail at all. Every other network error is caught
> and handled internally in the client. We see this case as different because
> a DNS resolution error may imply misconfiguration. Could it also imply that
> the DNS server is unavailable? I'm not sure why that case should be handled
> differently than if the bootstrap servers themselves are unavailable. Would
> it be enough to log a clear warning in the logs if the bootstrap servers
> could not resolve?
>
> On the whole, the current fail-fast approach probably does more good than
> bad, but it does seem somewhat inconsistent overall and my guess is that
> dynamic environments will become increasingly common. It would be nice to
> have a reasonable default behavior which could handle these cases
> gracefully without any additional logic. In any case, it would be nice to
> see this option in the rejected alternatives at least if we do not take it.
>
> If we want to take the route of throwing an exception, then I think we're
> probably going to need a new configuration since I can't see what a
> reasonable timeout we would use as a default. The benefit of a
> configuration is that it would let us retain the current default behavior
> with timeout effectively set to 0 and it would also let users effectively
> disable the timeout by using a very large value. Otherwise, it seems like a
> potential compatibility break to have a new exception type thrown at some
> arbitrary time without giving the user any control over it.
>
> Thanks,
> Jason
>
>
> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton 
> wrote:
>
>> Hi Philip,
>>
>> Yeah, it's basically DNS resolution we're talking about, though there's
>> some additional subtlety there with the logic introduced by KIP-235 [1].
>> Essentially it should cover any scenario that causes a client constructor
>> to fail with the current logic but would not after this KIP is released.
>>
>> We can generalize the Connect use case like this: a client application
>> that
>> may connect to different Kafka clusters with a public-facing, easy-to-use
>> API for restarting failed tasks and automatic handling of retriable
>> exceptions. The ease with which failed tasks can be restarted is
>> significant because it reduces the cost of failing on non-retriable
>> exceptions and makes fail-fast behavior easier to work with. And, in cases
>> like this where we can't really know whether the error we're dealing with
>> is retriable or not, it's better IMO to continue to allow applications
>> like
>> these to fail fast. I do agree that it'd be nice to get input from the
>> community, though.
>>
>> I was toying with the idea of a NetworkException subclass too. It's a
>> simpler API, but it doesn't allow for preflight validation, which can be
>> useful in scenarios where submitting new configurations for client
>> applications is expensive in terms of time or resources. Then again, I
>> don't see why the two are mutually exclusive, and we might opt to use the
>> NetworkException subclass in this KIP and pursue an opt-in validation API
>> later on. Thoughts?
>>
>> [1] -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>>
>> Cheers,
>>
>> Chris
>>
>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee  wrote:
>>
>> > Hey Chris,
>> >
>> > Thanks again for the feedback!
>> >
>> >
>> > For the preflight DNS check (are we basically trying to resolve the DNS
>> > there?): Maybe it makes more sense to add it to the Config modules? I
>> would
>> > like to hear what the community says as I'm not familiar with the
>> Connect
>> > use case.
>> >
>> > A "slower failing" alternative - I wonder if it makes sense for us to
>> > extend the NetworkException so that clients can be smarter at handling
>> > these exceptions. Of course, it is still retriable and requires polling
>> the
>> > consumer, but then we can distinguish the DNS resolution error from
>> other
>> > network errors.
>> >
>> > Thanks!
>> > P
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton 
>> > wrote:
>> >
>> > > Hi Philip,
>> > >
>> > > Yeah,  "DNS resolution should occur..." seems like a better fit. 
>> > >
>> > > One other question I have is whether we should expose some kind of
>> public
>> > > API for performing preflight validation of the bootstrap URLs. If we
>> > change
>> > > the behavior of a client configured with a silly typo (e.g.,
>> > > "loclahost 

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

2023-02-28 Thread Jason Gustafson
Hi Philip,

An alternative is not to fail at all. Every other network error is caught
and handled internally in the client. We see this case as different because
a DNS resolution error may imply misconfiguration. Could it also imply that
the DNS server is unavailable? I'm not sure why that case should be handled
differently than if the bootstrap servers themselves are unavailable. Would
it be enough to log a clear warning in the logs if the bootstrap servers
could not resolve?

On the whole, the current fail-fast approach probably does more good than
bad, but it does seem somewhat inconsistent overall and my guess is that
dynamic environments will become increasingly common. It would be nice to
have a reasonable default behavior which could handle these cases
gracefully without any additional logic. In any case, it would be nice to
see this option in the rejected alternatives at least if we do not take it.

If we want to take the route of throwing an exception, then I think we're
probably going to need a new configuration since I can't see what a
reasonable timeout we would use as a default. The benefit of a
configuration is that it would let us retain the current default behavior
with timeout effectively set to 0 and it would also let users effectively
disable the timeout by using a very large value. Otherwise, it seems like a
potential compatibility break to have a new exception type thrown at some
arbitrary time without giving the user any control over it.

Thanks,
Jason


On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton 
wrote:

> Hi Philip,
>
> Yeah, it's basically DNS resolution we're talking about, though there's
> some additional subtlety there with the logic introduced by KIP-235 [1].
> Essentially it should cover any scenario that causes a client constructor
> to fail with the current logic but would not after this KIP is released.
>
> We can generalize the Connect use case like this: a client application that
> may connect to different Kafka clusters with a public-facing, easy-to-use
> API for restarting failed tasks and automatic handling of retriable
> exceptions. The ease with which failed tasks can be restarted is
> significant because it reduces the cost of failing on non-retriable
> exceptions and makes fail-fast behavior easier to work with. And, in cases
> like this where we can't really know whether the error we're dealing with
> is retriable or not, it's better IMO to continue to allow applications like
> these to fail fast. I do agree that it'd be nice to get input from the
> community, though.
>
> I was toying with the idea of a NetworkException subclass too. It's a
> simpler API, but it doesn't allow for preflight validation, which can be
> useful in scenarios where submitting new configurations for client
> applications is expensive in terms of time or resources. Then again, I
> don't see why the two are mutually exclusive, and we might opt to use the
> NetworkException subclass in this KIP and pursue an opt-in validation API
> later on. Thoughts?
>
> [1] -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>
> Cheers,
>
> Chris
>
> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee  wrote:
>
> > Hey Chris,
> >
> > Thanks again for the feedback!
> >
> >
> > For the preflight DNS check (are we basically trying to resolve the DNS
> > there?): Maybe it makes more sense to add it to the Config modules? I
> would
> > like to hear what the community says as I'm not familiar with the Connect
> > use case.
> >
> > A "slower failing" alternative - I wonder if it makes sense for us to
> > extend the NetworkException so that clients can be smarter at handling
> > these exceptions. Of course, it is still retriable and requires polling
> the
> > consumer, but then we can distinguish the DNS resolution error from other
> > network errors.
> >
> > Thanks!
> > P
> >
> >
> >
> >
> >
> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton 
> > wrote:
> >
> > > Hi Philip,
> > >
> > > Yeah,  "DNS resolution should occur..." seems like a better fit. 
> > >
> > > One other question I have is whether we should expose some kind of
> public
> > > API for performing preflight validation of the bootstrap URLs. If we
> > change
> > > the behavior of a client configured with a silly typo (e.g.,
> > > "loclahost instead of localhost") from failing in the constructor to
> > > failing with a retriable exception, this might lead some client
> > > applications to handle that failure by, well, retrying. For reference,
> > this
> > > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
> > it'd
> > > be nice to be able to opt into keeping the current behavior so that
> > > projects like Connect could still do preflight checks of the
> > > bootstrap.servers property for connectors before starting them, and
> > report
> > > any issues by failing fast instead of continuously writing
> warning/error
> > > messages to their logs.
> > >
> > > I'm not sure about 

Creating a JIRA account

2023-02-28 Thread Przemysław Białoń
Hi,

According to the documentation, as a new contributor I should create a JIRA
account and assign the ticket I want to work on to myself. But new JIRA
account creation was recently disabled. What should I do? Should I work on a
ticket without assigning it to myself (with the risk of duplicated work)? Or
should I apply for an account? If so, where should I go? I didn't find
information about this in the contributor guide.

Regards,
Przemek


Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.4 #92

2023-02-28 Thread Apache Jenkins Server
See 




Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.3 #159

2023-02-28 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #1627

2023-02-28 Thread Apache Jenkins Server
See 




Fwd: Request for Jira Account Creation

2023-02-28 Thread Ganesh S
Hi Team,
Just following up for the JIRA user creation.

Regards
Ganesh


-- Forwarded message -
From: Ganesh S 
Date: Fri, Feb 24, 2023 at 10:53 AM
Subject: Request for Jira Account Creation
To: 


Dear Team,

I am writing to request a Jira account for the purpose of contributing to
the Apache Kafka project.

I am a beginner and very interested in contributing to this
project.

I believe that the Apache Kafka project is a valuable and important
open-source project that has the potential to revolutionize the way data is
processed and analyzed.

I have been studying Kafka and its related technologies for some time now,
and I am eager to start contributing to the project. I have already
identified some areas where I can potentially contribute, such as bug
fixes, documentation, and testing. I am committed to learning and
developing my skills further as I contribute to the project.

I would be very grateful if you could create a Jira account for me, so that
I can start contributing to the project.
I understand that there may be some guidelines or requirements that I need
to follow as a new contributor, and I am willing to abide by them.
I have also read and agree to the Apache Kafka Code of Conduct.

Thank you very much for considering my request. I look forward to
contributing to the Apache Kafka project.

My Linkedin 


Best regards,
Ganesh Sahu


Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

2023-02-28 Thread Chris Egerton
Hi Philip,

Yeah, it's basically DNS resolution we're talking about, though there's
some additional subtlety there with the logic introduced by KIP-235 [1].
Essentially it should cover any scenario that causes a client constructor
to fail with the current logic but would not after this KIP is released.

We can generalize the Connect use case like this: a client application that
may connect to different Kafka clusters with a public-facing, easy-to-use
API for restarting failed tasks and automatic handling of retriable
exceptions. The ease with which failed tasks can be restarted is
significant because it reduces the cost of failing on non-retriable
exceptions and makes fail-fast behavior easier to work with. And, in cases
like this where we can't really know whether the error we're dealing with
is retriable or not, it's better IMO to continue to allow applications like
these to fail fast. I do agree that it'd be nice to get input from the
community, though.

I was toying with the idea of a NetworkException subclass too. It's a
simpler API, but it doesn't allow for preflight validation, which can be
useful in scenarios where submitting new configurations for client
applications is expensive in terms of time or resources. Then again, I
don't see why the two are mutually exclusive, and we might opt to use the
NetworkException subclass in this KIP and pursue an opt-in validation API
later on. Thoughts?
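For what the subclass route might look like: a minimal sketch, with stand-in classes since Kafka has no such exception type today and the name here is purely an assumption:

```java
public class DnsErrorSketch {
    // Stand-in for org.apache.kafka.common.errors.NetworkException (retriable).
    static class NetworkException extends RuntimeException {
        NetworkException(String msg) { super(msg); }
    }

    // Hypothetical DNS-specific subclass floated in this thread.
    static class BootstrapResolutionException extends NetworkException {
        BootstrapResolutionException(String msg) { super(msg); }
    }

    public static void main(String[] args) {
        try {
            throw new BootstrapResolutionException(
                    "no resolvable bootstrap urls given in bootstrap.servers");
        } catch (NetworkException e) {
            // Existing retry loops that catch NetworkException keep working,
            // but callers can now single out the DNS-resolution case:
            System.out.println(e instanceof BootstrapResolutionException);
        }
    }
}
```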

[1] -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection

Cheers,

Chris

On Mon, Feb 27, 2023 at 7:06 PM Philip Nee  wrote:

> Hey Chris,
>
> Thanks again for the feedback!
>
>
> For the preflight DNS check (are we basically trying to resolve the DNS
> there?): Maybe it makes more sense to add it to the Config modules? I would
> like to hear what the community says as I'm not familiar with the Connect
> use case.
>
> A "slower failing" alternative - I wonder if it makes sense for us to
> extend the NetworkException so that clients can be smarter at handling
> these exceptions. Of course, it is still retriable and requires polling the
> consumer, but then we can distinguish the DNS resolution error from other
> network errors.
>
> Thanks!
> P
>
>
>
>
>
> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton 
> wrote:
>
> > Hi Philip,
> >
> > Yeah,  "DNS resolution should occur..." seems like a better fit. 
> >
> > One other question I have is whether we should expose some kind of public
> > API for performing preflight validation of the bootstrap URLs. If we
> change
> > the behavior of a client configured with a silly typo (e.g.,
> > "loclahost instead of localhost") from failing in the constructor to
> > failing with a retriable exception, this might lead some client
> > applications to handle that failure by, well, retrying. For reference,
> this
> > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
> it'd
> > be nice to be able to opt into keeping the current behavior so that
> > projects like Connect could still do preflight checks of the
> > bootstrap.servers property for connectors before starting them, and
> report
> > any issues by failing fast instead of continuously writing warning/error
> > messages to their logs.
> >
> > I'm not sure about where this new API could go, but a few options might
> be:
> >
> > - Expose a public variant of the existing ClientUtils class
> > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > AdminClientConfig classes
> > - Add those same static methods to the KafkaConsumer, KafkaProducer, and
> > KafkaAdminClient classes
> >
> > If this seems reasonable, we should probably also specify in the KIP that
> > Kafka Connect will leverage this preflight validation logic before
> > instantiating any Kafka clients for use by connectors or tasks, and
> > continue to fail fast if there are typos in the bootstrap.servers
> property,
> > or if temporary DNS resolution issues come up.
> >
> > [1] -
> >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > [2] -
> >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> >
> > Cheers,
> >
> > Chris
> >
> > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee  wrote:
> >
> > > Hey Chris,
> > >
> > > Thanks for the quick response, and I apologize for the unclear wording
> > > there, I guess "DNS lookup" would be a more appropriate wording here.
> So
> > > what I meant there was, to delegate the DNS lookup in the constructor
> to
> > > the network client poll, and it will happen on the very first poll.  I
> > > guess the logic could look like this:
> > >
> > > - if the client has been bootstrapped, do nothing.
> > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > address.
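The two-step logic quoted above can be sketched in a few lines; the class and field names are illustrative, not Kafka's actual NetworkClient internals:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

public class LazyBootstrap {
    private final String bootstrapHost;
    private final int port;
    private List<InetSocketAddress> resolved;  // null until first poll

    LazyBootstrap(String host, int port) {
        this.bootstrapHost = host;  // constructor no longer resolves anything
        this.port = port;
    }

    // Called from the client's poll loop.
    List<InetSocketAddress> poll() throws Exception {
        if (resolved == null) {  // not yet bootstrapped: perform the DNS lookup now
            List<InetSocketAddress> addrs = new ArrayList<>();
            for (InetAddress a : InetAddress.getAllByName(bootstrapHost)) {
                addrs.add(new InetSocketAddress(a, port));
            }
            resolved = addrs;
        }
        return resolved;  // already bootstrapped: do nothing
    }

    public static void main(String[] args) throws Exception {
        LazyBootstrap c = new LazyBootstrap("localhost", 9092);
        System.out.println(!c.poll().isEmpty());
    }
}
```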
> > >
> > > Thanks for the comment there, I'll change up the 

Re: [VOTE] KIP-898: Modernize Connect plugin discovery

2023-02-28 Thread Bill Bejeck
Thanks for the well-detailed KIP, Greg. It'll be a needed improvement.

+1(binding)

Thanks,
Bill

On Tue, Feb 28, 2023 at 9:51 AM John Roesler  wrote:

> Thanks for the KIP, Greg!
>
> I’m +1 (binding)
>
> I really appreciate all the care you took in the migration and test
> design.
>
> Thanks,
> John
>
> On Tue, Feb 28, 2023, at 04:33, Federico Valeri wrote:
> > +1 (non binding)
> >
> > Thanks
> > Fede
> >
> > On Tue, Feb 28, 2023 at 10:10 AM Mickael Maison
> >  wrote:
> >>
> >> +1 (binding)
> >>
> >> Thanks,
> >> Mickael
> >>
> >> On Mon, Feb 27, 2023 at 7:42 PM Chris Egerton 
> wrote:
> >> >
> >> > +1 (binding). Thanks for the KIP!
> >> >
> >> > On Mon, Feb 27, 2023 at 12:51 PM Greg Harris
> 
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I'd like to call a vote for KIP-898 which aims to improve the
> performance
> >> > > of Connect startup by allowing discovery of plugins via the
> ServiceLoader.
> >> > >
> >> > > KIP:
> >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-898
> >> > > %3A+Modernize+Connect+plugin+discovery
> >> > >
> >> > > Discussion thread:
> >> > > https://lists.apache.org/thread/wxh0r343w86s91py0876njzbyn5qxd8s
> >> > >
> >> > > Thanks!
> >> > >
>


Re: [VOTE] KIP-898: Modernize Connect plugin discovery

2023-02-28 Thread John Roesler
Thanks for the KIP, Greg!

I’m +1 (binding)

I really appreciate all the care you took in the migration and test design. 

Thanks,
John

On Tue, Feb 28, 2023, at 04:33, Federico Valeri wrote:
> +1 (non binding)
>
> Thanks
> Fede
>
> On Tue, Feb 28, 2023 at 10:10 AM Mickael Maison
>  wrote:
>>
>> +1 (binding)
>>
>> Thanks,
>> Mickael
>>
>> On Mon, Feb 27, 2023 at 7:42 PM Chris Egerton  
>> wrote:
>> >
>> > +1 (binding). Thanks for the KIP!
>> >
>> > On Mon, Feb 27, 2023 at 12:51 PM Greg Harris 
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I'd like to call a vote for KIP-898 which aims to improve the performance
>> > > of Connect startup by allowing discovery of plugins via the 
>> > > ServiceLoader.
>> > >
>> > > KIP:
>> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-898
>> > > %3A+Modernize+Connect+plugin+discovery
>> > >
>> > > Discussion thread:
>> > > https://lists.apache.org/thread/wxh0r343w86s91py0876njzbyn5qxd8s
>> > >
>> > > Thanks!
>> > >


[jira] [Resolved] (KAFKA-14659) source-record-write-[rate|total] metrics include filtered records

2023-02-28 Thread Chris Egerton (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Egerton resolved KAFKA-14659.
---
Resolution: Fixed

> source-record-write-[rate|total] metrics include filtered records
> -
>
> Key: KAFKA-14659
> URL: https://issues.apache.org/jira/browse/KAFKA-14659
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Reporter: Chris Beard
>Assignee: Hector Geraldino
>Priority: Minor
>
> Source tasks in Kafka connect offer two sets of metrics (documented in 
> [ConnectMetricsRegistry.java|https://github.com/apache/kafka/blob/72cfc994f5675be349d4494ece3528efed290651/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/ConnectMetricsRegistry.java#L173-L191]):
> ||Metric||Description||
> |source-record-poll-rate|The average per-second number of records 
> produced/polled (before transformation) by this task belonging to the named 
> source connector in this worker.|
> |source-record-write-rate|The average per-second number of records output 
> from the transformations and written to Kafka for this task belonging to the 
> named source connector in this worker. This is after transformations are 
> applied and excludes any records filtered out by the transformations.|
> There are also corresponding "-total" metrics that capture the total number 
> of records polled and written for the metrics above, respectively.
> In short, the "poll" metrics capture the number of messages sourced 
> pre-transformation/filtering, and the "write" metrics should capture the 
> number of messages ultimately written to Kafka post-transformation/filtering. 
> However, the implementation of the {{source-record-write-*}}  metrics 
> _includes_ records filtered out by transformations (and also records that 
> result in produce failures with the config {{{}errors.tolerance=all{}}}).
> h3. Details
> In 
> [AbstractWorkerSourceTask.java|https://github.com/apache/kafka/blob/a382acd31d1b53cd8695ff9488977566083540b1/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L397],
>  each source record is passed through the transformation chain where it is 
> potentially filtered out, checked to see if it was in fact filtered out, and 
> if so it is accounted for in the internal metrics via 
> {{{}counter.skipRecord(){}}}.
> {code:java}
> for (final SourceRecord preTransformRecord : toSend) {
>     retryWithToleranceOperator.sourceRecord(preTransformRecord);
>     final SourceRecord record = transformationChain.apply(preTransformRecord);
>     final ProducerRecord<byte[], byte[]> producerRecord = convertTransformedRecord(record);
>     if (producerRecord == null || retryWithToleranceOperator.failed()) {
>         counter.skipRecord();
>         recordDropped(preTransformRecord);
>         continue;
>     }
>     ...
> {code}
> {{SourceRecordWriteCounter.skipRecord()}} is implemented as follows:
> {code:java}
> 
> public SourceRecordWriteCounter(int batchSize, SourceTaskMetricsGroup 
> metricsGroup) {
> assert batchSize > 0;
> assert metricsGroup != null;
> this.batchSize = batchSize;
> counter = batchSize;
> this.metricsGroup = metricsGroup;
> }
> public void skipRecord() {
> if (counter > 0 && --counter == 0) {
> finishedAllWrites();
> }
> }
> 
> private void finishedAllWrites() {
> if (!completed) {
> metricsGroup.recordWrite(batchSize - counter);
> completed = true;
> }
> }
> {code}
> For example: If a batch starts with 100 records, {{batchSize}} and 
> {{counter}} will both be initialized to 100. If all 100 records get filtered 
> out, {{counter}} will be decremented 100 times, and 
> {{{}finishedAllWrites(){}}}will record the value 100 to the underlying 
> {{source-record-write-*}}  metrics rather than 0, the correct value according 
> to the documentation for these metrics.
> h3. Solutions
> Assuming the documentation correctly captures the intent of the 
> {{source-record-write-*}}  metrics, it seems reasonable to fix these metrics 
> such that filtered records do not get counted.
> It may also be useful to add additional metrics to capture the rate and total 
> number of records filtered out by transformations, which would require a KIP.
> I'm not sure what the best way of accounting for produce failures in the case 
> of {{errors.tolerance=all}} is yet. Maybe these failures deserve their own 
> new metrics?
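The arithmetic in the 100-record example above can be made concrete with a tiny model of the quoted counter (this mirrors the quoted snippet, not Kafka's full SourceRecordWriteCounter class):

```java
public class CounterModel {
    int batchSize;
    int counter;
    boolean completed;
    int recordedWrites = -1;  // value passed to metricsGroup.recordWrite(...)

    CounterModel(int batchSize) {
        this.batchSize = batchSize;
        this.counter = batchSize;
    }

    void skipRecord() {
        if (counter > 0 && --counter == 0) {
            finishedAllWrites();
        }
    }

    void finishedAllWrites() {
        if (!completed) {
            recordedWrites = batchSize - counter;
            completed = true;
        }
    }

    public static void main(String[] args) {
        CounterModel m = new CounterModel(100);
        for (int i = 0; i < 100; i++) {
            m.skipRecord();  // every record in the batch is filtered out
        }
        // All 100 records were skipped, yet 100 "writes" are recorded:
        System.out.println(m.recordedWrites);
    }
}
```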



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #1626

2023-02-28 Thread Apache Jenkins Server
See 




Re: [VOTE] KIP-898: Modernize Connect plugin discovery

2023-02-28 Thread Federico Valeri
+1 (non binding)

Thanks
Fede

On Tue, Feb 28, 2023 at 10:10 AM Mickael Maison
 wrote:
>
> +1 (binding)
>
> Thanks,
> Mickael
>
> On Mon, Feb 27, 2023 at 7:42 PM Chris Egerton  wrote:
> >
> > +1 (binding). Thanks for the KIP!
> >
> > On Mon, Feb 27, 2023 at 12:51 PM Greg Harris 
> > wrote:
> >
> > > Hi,
> > >
> > > I'd like to call a vote for KIP-898 which aims to improve the performance
> > > of Connect startup by allowing discovery of plugins via the ServiceLoader.
> > >
> > > KIP:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-898
> > > %3A+Modernize+Connect+plugin+discovery
> > >
> > > Discussion thread:
> > > https://lists.apache.org/thread/wxh0r343w86s91py0876njzbyn5qxd8s
> > >
> > > Thanks!
> > >


Re: [VOTE] KIP-898: Modernize Connect plugin discovery

2023-02-28 Thread Mickael Maison
+1 (binding)

Thanks,
Mickael

On Mon, Feb 27, 2023 at 7:42 PM Chris Egerton  wrote:
>
> +1 (binding). Thanks for the KIP!
>
> On Mon, Feb 27, 2023 at 12:51 PM Greg Harris 
> wrote:
>
> > Hi,
> >
> > I'd like to call a vote for KIP-898 which aims to improve the performance
> > of Connect startup by allowing discovery of plugins via the ServiceLoader.
> >
> > KIP:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-898
> > %3A+Modernize+Connect+plugin+discovery
> >
> > Discussion thread:
> > https://lists.apache.org/thread/wxh0r343w86s91py0876njzbyn5qxd8s
> >
> > Thanks!
> >


[jira] [Resolved] (KAFKA-13771) Support to explicitly delete delegationTokens that have expired but have not been automatically cleaned up

2023-02-28 Thread RivenSun (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RivenSun resolved KAFKA-13771.
--
Resolution: Resolved

> Support to explicitly delete delegationTokens that have expired but have not 
> been automatically cleaned up
> --
>
> Key: KAFKA-13771
> URL: https://issues.apache.org/jira/browse/KAFKA-13771
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Reporter: RivenSun
>Assignee: RivenSun
>Priority: Major
>
> Quoting the official documentation
> {quote}
> Tokens can also be cancelled explicitly. If a token is not renewed by the 
> token’s expiration time or if token is beyond the max life time, it will be 
> deleted from all broker caches as well as from zookeeper.
> {quote}
> 1. The first point above means that after the `AdminClient` initiates the 
> EXPIRE_DELEGATION_TOKEN request, in the DelegationTokenManager.expireToken() 
> method on the KafkaServer side, if the user passes in expireLifeTimeMs less 
> than 0, KafkaServer will delete the corresponding delegationToken directly.
> 2. There is a thread named "delete-expired-tokens" on the KafkaServer side, 
> which is responsible for regularly cleaning up expired tokens. The execution 
> interval is `delegation.token.expiry.check.interval.ms`, and the default 
> value is one hour.
> But a careful look at the logic in DelegationTokenManager.expireToken() shows 
> that *Kafka currently does not let users delete an expired delegationToken 
> that they no longer use or renew. A user who tries will receive a 
> DelegationTokenExpiredException.*
> In the worst case, an expired delegationToken may still be usable for up to 
> {*}an hour{*} by default, even though the broker can shorten 
> delegation.token.expiry.check.interval.ms to reduce this window.
> The solution is very simple, simply adjust the `if` order of 
> DelegationTokenManager.expireToken().
> {code:java}
> if (!allowedToRenew(principal, tokenInfo)) {
>   expireResponseCallback(Errors.DELEGATION_TOKEN_OWNER_MISMATCH, -1)
> } else if (expireLifeTimeMs < 0) { //expire immediately
>   removeToken(tokenInfo.tokenId)
>   info(s"Token expired for token: ${tokenInfo.tokenId} for owner: 
> ${tokenInfo.owner}")
>   expireResponseCallback(Errors.NONE, now)
> } else if (tokenInfo.maxTimestamp < now || tokenInfo.expiryTimestamp < now) {
>   expireResponseCallback(Errors.DELEGATION_TOKEN_EXPIRED, -1)
> } else {
>   //set expiry time stamp
>  ..
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)