Re: [VOTE] 3.7.0 RC2

2024-01-12 Thread Luke Chen
Hi Stanislav,

I commented in the "Apache Kafka 3.7.0 Release" thread, but maybe you
missed it.
Cross-posting here:

There is a bug, KAFKA-16101, reporting that "Kafka
cluster will be unavailable during KRaft migration rollback".
The impact of this issue is that if brokers try to roll back to ZK mode
during the KRaft migration process, there will be a period of time during
which the cluster is unavailable.
Since migrating from ZK to KRaft is a production-ready feature, I think
this should be addressed soon.
Do you think this is a blocker for v3.7.0?

Thanks.
Luke

On Sat, Jan 13, 2024 at 8:36 AM Chris Egerton 
wrote:

> Thanks, Kirk!
>
> @Stanislav--do you believe that this warrants a new RC?
>
> On Fri, Jan 12, 2024, 19:08 Kirk True  wrote:
>
> > Hi Chris/Stanislav,
> >
> > I'm working on the 'Unable to find FetchSessionHandler' log problem
> > (KAFKA-16029) and have put out a draft PR (
> > https://github.com/apache/kafka/pull/15186). I will use the quickstart
> > approach as a second means to reproduce/verify while I wait for the PR's
> > Jenkins job to finish.
> >
> > Thanks,
> > Kirk
> >
> > On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote:
> > > Hi Stanislav,
> > >
> > >
> > > Thanks for running this release!
> > >
> > > To verify, I:
> > > - Built from source using Java 11 with both:
> > > - - the 3.7.0-rc2 tag on GitHub
> > > - - the kafka-3.7.0-src.tgz artifact from
> > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> > > - Checked signatures and checksums
> > > - Ran the quickstart using both:
> > > - - The kafka_2.13-3.7.0.tgz artifact from
> > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ with Java
> > 11
> > > and Scala 2.13 in KRaft mode
> > > - - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
> > > - Ran all unit tests
> > > - Ran all integration tests for Connect and MM2
> > >
> > >
> > > I found two minor areas for concern:
> > >
> > > 1. (Possibly a blocker)
> > > When running the quickstart, I noticed this ERROR-level log message
> being
> > > emitted frequently (not every time) when I killed my console
> consumer
> > > via ctrl-C:
> > >
> > > > [2024-01-12 11:00:31,088] ERROR [Consumer clientId=console-consumer,
> > > groupId=console-consumer-74388] Unable to find FetchSessionHandler for
> > node
> > > 1. Ignoring fetch response
> > > (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> > >
> > > I see that this error message is already reported in
> > > https://issues.apache.org/jira/browse/KAFKA-16029. I think we should
> > > prioritize fixing it for this release. I know it's probably benign but
> > it's
> > > really not a good look for us when basic operations log error messages,
> > and
> > > it may give new users some headaches.
> > >
> > >
> > > 2. (Probably not a blocker)
> > > The following unit tests failed the first time around, and all of them
> > > passed the second time I ran them:
> > >
> > > - (clients)
> > ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
> > > - (clients) SelectorTest.testConnectionsByClientMetric()
> > > - (clients) Tls13SelectorTest.testConnectionsByClientMetric()
> > > - (connect) TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound
> (I
> > > thought I fixed this one!)
> > > - (core) ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2]
> > >
> > >
> > > Thanks again for your work on this release, and congratulations to
> Kafka
> > > Streams for having zero flaky unit tests during my highly-experimental
> > > single laptop run!
> > >
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski
> > >  wrote:
> > >
> > > > Hello Kafka users, developers, and client-developers,
> > > >
> > > > This is the first candidate for release of Apache Kafka 3.7.0.
> > > >
> > > > Note it's named "RC2" because I had a few "failed" RCs that I had
> > > > cut/uploaded but ultimately had to scrap prior to announcing due to
> new
> > > > blockers arriving before I could even announce them.
> > > >
> > > > Further - I haven't yet been able to set up the system tests
> > successfully.
> > > > And the integration/unit tests do have a few failures that I have to
> > spend
> > > > time triaging. I would appreciate any help in case anyone notices any
> > tests
> > > > failing that they're subject matter experts in. Expect me to follow
> > up in
> > > > a day or two with more detailed analysis.
> > > >
> > > > Major changes include:
> > > > - Early Access to KIP-848 - the next generation of the consumer
> > rebalance
> > > > protocol
> > > > - KIP-858: Adding JBOD support to KRaft
> > > > - KIP-714: Observability into Client metrics via a standardized
> > interface
> > > >
> > > > Check more information in the WIP blog post:
> > > > https://github.com/apache/kafka-site/pull/578
> > > >
> > > > Release notes for the 3.7.0 release:
> > > >
> > > >
> >
> 

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2569

2024-01-12 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-12 Thread Justine Olshan
(1) the prepare marker is written, but the endTxn response is not received
by the client when the server downgrades
(2)  the prepare marker is written, the endTxn response is received by the
client when the server downgrades.

I think I am still a little confused. In both of these cases, the
transaction log has the old producer ID. We don't write the new producer ID
in the prepare marker's non-tagged fields.
If the server downgrades now, it would read the records' non-tagged
fields, and the complete marker will also have the old producer ID.
(If we had used the new producer ID, we would not have transactional
correctness, since the producer ID doesn't match the transaction and the
state would not be correct on the data partition.)

In the overflow case, I'd expect the following to happen on the client side:
Case 1 -- we retry EndTxn -- it is the same producer ID and epoch - 1; this
would fence the producer.
Case 2 -- we don't retry EndTxn and use the new producer ID, which would
result in an InvalidPidMappingException.
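
For illustration, here is a minimal Java sketch of the epoch-overflow boundary
discussed above: the producer epoch is a 16-bit value, so when it can no longer
be bumped, a new producer ID is allocated and the epoch restarts at 0, while
the markers for the old transaction keep the old ID at the max epoch. The types
and the newPidSupplier below are hypothetical, not Kafka's actual coordinator code.

record ProducerIdAndEpoch(long producerId, short epoch) {}

final class EpochBumpSketch {
    // 'newPidSupplier' stands in for however fresh producer IDs are allocated;
    // it is not a real Kafka API.
    static ProducerIdAndEpoch bump(ProducerIdAndEpoch current,
                                   java.util.function.LongSupplier newPidSupplier) {
        if (current.epoch() == Short.MAX_VALUE) {
            // Overflow: the markers are written with (old ID, max epoch);
            // the next transaction starts with (new ID, epoch 0).
            return new ProducerIdAndEpoch(newPidSupplier.getAsLong(), (short) 0);
        }
        return new ProducerIdAndEpoch(current.producerId(), (short) (current.epoch() + 1));
    }
}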

Maybe we can have special handling for when a server downgrades. When it
reconnects we could get an API version request showing KIP-890 part 2 is
not supported. In that case, we can call initProducerId to abort the
transaction. (In the overflow case, this correctly gives us a new producer
ID)

I guess the corresponding case would be where the *complete marker* is
written but the endTxn is not received by the client and the server
downgrades? This would result in the transaction coordinator having the new
ID and not the old one.  If the client retries, it will receive an
InvalidPidMappingException. The InitProducerId scenario above would help
here too.

To be clear, my compatibility story is meant to support server-side
downgrades while keeping transactional correctness. Keeping the client from
fencing itself is not the priority.

Hope this helps. I can also add text in the KIP about InitProducerId if we
think that fixes some edge cases.

Justine

On Fri, Jan 12, 2024 at 4:10 PM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> I agree that we don't need to optimize for fencing during downgrades.
> Regarding consistency, there are two possible cases: (1) the prepare marker
> is written, but the endTxn response is not received by the client when the
> server downgrades; (2)  the prepare marker is written, the endTxn response
> is received by the client when the server downgrades. In (1), the client
> will have the old produce Id and in (2), the client will have the new
> produce Id. If we downgrade right after the prepare marker, we can't be
> consistent to both (1) and (2) since we can only put one value in the
> existing produce Id field. It's also not clear which case is more likely.
> So we could probably be consistent with either case. By putting the new
> producer Id in the prepare marker, we are consistent with case (2) and it
> also has the slight benefit that the produce field in the prepare and
> complete marker are consistent in the overflow case.
>
> Jun
>
> On Fri, Jan 12, 2024 at 3:11 PM Justine Olshan
> 
> wrote:
>
> > Hi Jun,
> >
> > In the case you describe, we would need to have a delayed request, send a
> > successful EndTxn, and a successful AddPartitionsToTxn and then have the
> > delayed EndTxn request go through for a given producer.
> > I'm trying to figure out if it is possible for the client to transition
> if
> > a previous request is delayed somewhere. But yes, in this case I think we
> > would fence the client.
> >
> > Not for the overflow case. In the overflow case, the producer ID and the
> > epoch are different on the marker and on the new transaction. So we want
> > the marker to use the max epoch  but the new transaction should start
> with
> > the new ID and epoch 0 in the transactional state.
> >
> > In the server downgrade case, we want to see the producer ID as that is
> > what the client will have. If we complete the commit, and the transaction
> > state is reloaded, we need the new producer ID in the state so there
> isn't
> > an invalid producer ID mapping.
> > The server downgrade cases are considering transactional correctness and
> > not regressing from previous behavior -- and are not concerned about
> > supporting the safety from fencing retries (as we have downgraded so we
> > don't need to support). Perhaps this is a trade off, but I think it is
> the
> > right one.
> >
> > (If the client downgrades, it will have restarted and it is ok for it to
> > have a new producer ID too).
> >
> > Justine
> >
> > On Fri, Jan 12, 2024 at 11:42 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the reply.
> > >
> > > 101.4 "If the marker is written by the new client, we can as I
> mentioned
> > in
> > > the last email guarantee that any EndTxn requests with the same epoch
> are
> > > from the same producer and the same transaction. Then we don't have to
> > > return a fenced error but can handle gracefully as described in the
> KIP."
> 

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2568

2024-01-12 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 461122 lines...]
Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testConnection() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testConnection() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testZNodeChangeHandlerForCreation() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testZNodeChangeHandlerForCreation() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetAclExistingZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetAclExistingZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testSessionExpiryDuringClose() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testSessionExpiryDuringClose() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testReinitializeAfterAuthFailure() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testReinitializeAfterAuthFailure() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testSetAclNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testSetAclNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testConnectionLossRequestTermination() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testConnectionLossRequestTermination() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testExistsNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testExistsNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetDataNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetDataNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testConnectionTimeout() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testConnectionTimeout() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testBlockOnRequestCompletionFromStateChangeHandler() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testBlockOnRequestCompletionFromStateChangeHandler() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testUnresolvableConnectString() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testUnresolvableConnectString() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetChildrenNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetChildrenNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testPipelinedGetData() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testPipelinedGetData() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testZNodeChildChangeHandlerForChildChange() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testZNodeChildChangeHandlerForChildChange() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetChildrenExistingZNodeWithChildren() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetChildrenExistingZNodeWithChildren() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testSetDataExistingZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testSetDataExistingZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testZNodeChildChangeHandlerForChildChangeNotTriggered() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testZNodeChildChangeHandlerForChildChangeNotTriggered() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testMixedPipeline() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testMixedPipeline() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetDataExistingZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 92 > ZooKeeperClientTest > 
testGetDataExistingZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 92 > 

Re: [VOTE] 3.7.0 RC2

2024-01-12 Thread Chris Egerton
Thanks, Kirk!

@Stanislav--do you believe that this warrants a new RC?

On Fri, Jan 12, 2024, 19:08 Kirk True  wrote:

> Hi Chris/Stanislav,
>
> I'm working on the 'Unable to find FetchSessionHandler' log problem
> (KAFKA-16029) and have put out a draft PR (
> https://github.com/apache/kafka/pull/15186). I will use the quickstart
> approach as a second means to reproduce/verify while I wait for the PR's
> Jenkins job to finish.
>
> Thanks,
> Kirk
>
> On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote:
> > Hi Stanislav,
> >
> >
> > Thanks for running this release!
> >
> > To verify, I:
> > - Built from source using Java 11 with both:
> > - - the 3.7.0-rc2 tag on GitHub
> > - - the kafka-3.7.0-src.tgz artifact from
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> > - Checked signatures and checksums
> > - Ran the quickstart using both:
> > - - The kafka_2.13-3.7.0.tgz artifact from
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ with Java
> 11
> > and Scala 2.13 in KRaft mode
> > - - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
> > - Ran all unit tests
> > - Ran all integration tests for Connect and MM2
> >
> >
> > I found two minor areas for concern:
> >
> > 1. (Possibly a blocker)
> > When running the quickstart, I noticed this ERROR-level log message being
> > emitted frequently (not every time) when I killed my console consumer
> > via ctrl-C:
> >
> > > [2024-01-12 11:00:31,088] ERROR [Consumer clientId=console-consumer,
> > groupId=console-consumer-74388] Unable to find FetchSessionHandler for
> node
> > 1. Ignoring fetch response
> > (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> >
> > I see that this error message is already reported in
> > https://issues.apache.org/jira/browse/KAFKA-16029. I think we should
> > prioritize fixing it for this release. I know it's probably benign but
> it's
> > really not a good look for us when basic operations log error messages,
> and
> > it may give new users some headaches.
> >
> >
> > 2. (Probably not a blocker)
> > The following unit tests failed the first time around, and all of them
> > passed the second time I ran them:
> >
> > - (clients)
> ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
> > - (clients) SelectorTest.testConnectionsByClientMetric()
> > - (clients) Tls13SelectorTest.testConnectionsByClientMetric()
> > - (connect) TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound (I
> > thought I fixed this one!)
> > - (core) ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2]
> >
> >
> > Thanks again for your work on this release, and congratulations to Kafka
> > Streams for having zero flaky unit tests during my highly-experimental
> > single laptop run!
> >
> >
> > Cheers,
> >
> > Chris
> >
> > On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski
> >  wrote:
> >
> > > Hello Kafka users, developers, and client-developers,
> > >
> > > This is the first candidate for release of Apache Kafka 3.7.0.
> > >
> > > Note it's named "RC2" because I had a few "failed" RCs that I had
> > > cut/uploaded but ultimately had to scrap prior to announcing due to new
> > > blockers arriving before I could even announce them.
> > >
> > > Further - I haven't yet been able to set up the system tests
> successfully.
> > > And the integration/unit tests do have a few failures that I have to
> spend
> > > time triaging. I would appreciate any help in case anyone notices any
> tests
> > > failing that they're subject matter experts in. Expect me to follow
> up in
> > > a day or two with more detailed analysis.
> > >
> > > Major changes include:
> > > - Early Access to KIP-848 - the next generation of the consumer
> rebalance
> > > protocol
> > > - KIP-858: Adding JBOD support to KRaft
> > > - KIP-714: Observability into Client metrics via a standardized
> interface
> > >
> > > Check more information in the WIP blog post:
> > > https://github.com/apache/kafka-site/pull/578
> > >
> > > Release notes for the 3.7.0 release:
> > >
> > >
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html
> > >
> > > *** Please download, test and vote by Thursday, January 18, 9am PT ***
> > >
> > > Usually these deadlines tend to be 2-3 days, but due to this being the
> > > first RC and the tests not having run yet, I am giving it a bit more
> time.
> > >
> > > Kafka's KEYS file containing PGP keys we use to sign the release:
> > > https://kafka.apache.org/KEYS
> > >
> > > * Release artifacts to be voted upon (source and binary):
> > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> > >
> > > * Docker release artifact to be voted upon:
> > > apache/kafka:3.7.0-rc2
> > >
> > > * Maven artifacts to be voted upon:
> > > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> > >
> > > * Javadoc:
> > > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/
> > >
> > > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> > > 

Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-12 Thread Jun Rao
Hi, Justine,

Thanks for the reply.

I agree that we don't need to optimize for fencing during downgrades.
Regarding consistency, there are two possible cases: (1) the prepare marker
is written, but the endTxn response is not received by the client when the
server downgrades; (2)  the prepare marker is written, the endTxn response
is received by the client when the server downgrades. In (1), the client
will have the old producer Id and in (2), the client will have the new
producer Id. If we downgrade right after the prepare marker, we can't be
consistent with both (1) and (2) since we can only put one value in the
existing producer Id field. It's also not clear which case is more likely.
So we could probably be consistent with either case. By putting the new
producer Id in the prepare marker, we are consistent with case (2) and it
also has the slight benefit that the producer Id field in the prepare and
complete marker are consistent in the overflow case.

Jun

On Fri, Jan 12, 2024 at 3:11 PM Justine Olshan 
wrote:

> Hi Jun,
>
> In the case you describe, we would need to have a delayed request, send a
> successful EndTxn, and a successful AddPartitionsToTxn and then have the
> delayed EndTxn request go through for a given producer.
> I'm trying to figure out if it is possible for the client to transition if
> a previous request is delayed somewhere. But yes, in this case I think we
> would fence the client.
>
> Not for the overflow case. In the overflow case, the producer ID and the
> epoch are different on the marker and on the new transaction. So we want
> the marker to use the max epoch  but the new transaction should start with
> the new ID and epoch 0 in the transactional state.
>
> In the server downgrade case, we want to see the producer ID as that is
> what the client will have. If we complete the commit, and the transaction
> state is reloaded, we need the new producer ID in the state so there isn't
> an invalid producer ID mapping.
> The server downgrade cases are considering transactional correctness and
> not regressing from previous behavior -- and are not concerned about
> supporting the safety from fencing retries (as we have downgraded so we
> don't need to support). Perhaps this is a trade off, but I think it is the
> right one.
>
> (If the client downgrades, it will have restarted and it is ok for it to
> have a new producer ID too).
>
> Justine
>
> On Fri, Jan 12, 2024 at 11:42 AM Jun Rao  wrote:
>
> > Hi, Justine,
> >
> > Thanks for the reply.
> >
> > 101.4 "If the marker is written by the new client, we can as I mentioned
> in
> > the last email guarantee that any EndTxn requests with the same epoch are
> > from the same producer and the same transaction. Then we don't have to
> > return a fenced error but can handle gracefully as described in the KIP."
> > When a delayed EndTxn request is processed, the txn state could be
> ongoing
> > for the next txn. I guess in this case we still return the fenced error
> for
> > the delayed request?
> >
> > 102. Sorry, my question was inaccurate. What you described is accurate.
> > "The downgrade compatibility I mention is that we keep the same producer
> ID
> > and epoch in the main (non-tagged) fields as we did before the code on
> the
> > server side." If we want to do this, it seems that we should use the
> > current producer Id and max epoch in the existing producerId and
> > producerEpoch fields for both the prepare and the complete marker, right?
> > The downgrade can happen after the complete marker is written. With what
> > you described, the downgraded coordinator will see the new producer Id
> > instead of the old one.
> >
> > Jun
> >
> > On Fri, Jan 12, 2024 at 10:44 AM Justine Olshan
> >  wrote:
> >
> > > Hi Jun,
> > >
> > > I can update the description.
> > >
> > > I believe your second point is mentioned in the KIP. I can add more
> text
> > on
> > > this if it is helpful.
> > > > The delayed message case can also violate EOS if the delayed message
> > > comes in after the next addPartitionsToTxn request comes in.
> Effectively
> > we
> > > may see a message from a previous (aborted) transaction become part of
> > the
> > > next transaction.
> > >
> > > If the marker is written by the new client, we can as I mentioned in
> the
> > > last email guarantee that any EndTxn requests with the same epoch are
> > from
> > > the same producer and the same transaction. Then we don't have to
> return
> > a
> > > fenced error but can handle gracefully as described in the KIP.
> > > I don't think a boolean is useful since it is directly encoded by the
> > > existence or lack of the tagged field being written.
> > > In the prepare marker we will have the same producer ID in the
> non-tagged
> > > field. In the Complete state we may not.
> > > I'm not sure why the ongoing state matters for this KIP. It does matter
> > for
> > > KIP-939.
> > >
> > > I'm not sure what you are referring to about writing the previous
> > producer
> > > ID in the prepare 

Re: [VOTE] 3.7.0 RC2

2024-01-12 Thread Kirk True
Hi Chris/Stanislav,

I'm working on the 'Unable to find FetchSessionHandler' log problem 
(KAFKA-16029) and have put out a draft PR 
(https://github.com/apache/kafka/pull/15186). I will use the quickstart 
approach as a second means to reproduce/verify while I wait for the PR's 
Jenkins job to finish.   

Thanks,
Kirk

On Fri, Jan 12, 2024, at 11:31 AM, Chris Egerton wrote:
> Hi Stanislav,
> 
> 
> Thanks for running this release!
> 
> To verify, I:
> - Built from source using Java 11 with both:
> - - the 3.7.0-rc2 tag on GitHub
> - - the kafka-3.7.0-src.tgz artifact from
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> - Checked signatures and checksums
> - Ran the quickstart using both:
> - - The kafka_2.13-3.7.0.tgz artifact from
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ with Java 11
> and Scala 2.13 in KRaft mode
> - - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
> - Ran all unit tests
> - Ran all integration tests for Connect and MM2
> 
> 
> I found two minor areas for concern:
> 
> 1. (Possibly a blocker)
> When running the quickstart, I noticed this ERROR-level log message being
> emitted frequently (not every time) when I killed my console consumer
> via ctrl-C:
> 
> > [2024-01-12 11:00:31,088] ERROR [Consumer clientId=console-consumer,
> groupId=console-consumer-74388] Unable to find FetchSessionHandler for node
> 1. Ignoring fetch response
> (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> 
> I see that this error message is already reported in
> https://issues.apache.org/jira/browse/KAFKA-16029. I think we should
> prioritize fixing it for this release. I know it's probably benign but it's
> really not a good look for us when basic operations log error messages, and
> it may give new users some headaches.
> 
> 
> 2. (Probably not a blocker)
> The following unit tests failed the first time around, and all of them
> passed the second time I ran them:
> 
> - (clients) ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
> - (clients) SelectorTest.testConnectionsByClientMetric()
> - (clients) Tls13SelectorTest.testConnectionsByClientMetric()
> - (connect) TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound (I
> thought I fixed this one!)
> - (core) ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2]
> 
> 
> Thanks again for your work on this release, and congratulations to Kafka
> Streams for having zero flaky unit tests during my highly-experimental
> single laptop run!
> 
> 
> Cheers,
> 
> Chris
> 
> On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski
>  wrote:
> 
> > Hello Kafka users, developers, and client-developers,
> >
> > This is the first candidate for release of Apache Kafka 3.7.0.
> >
> > Note it's named "RC2" because I had a few "failed" RCs that I had
> > cut/uploaded but ultimately had to scrap prior to announcing due to new
> > blockers arriving before I could even announce them.
> >
> > Further - I haven't yet been able to set up the system tests successfully.
> > And the integration/unit tests do have a few failures that I have to spend
> > time triaging. I would appreciate any help in case anyone notices any tests
> > failing that they're subject matter experts in. Expect me to follow up in
> > a day or two with more detailed analysis.
> >
> > Major changes include:
> > - Early Access to KIP-848 - the next generation of the consumer rebalance
> > protocol
> > - KIP-858: Adding JBOD support to KRaft
> > - KIP-714: Observability into Client metrics via a standardized interface
> >
> > Check more information in the WIP blog post:
> > https://github.com/apache/kafka-site/pull/578
> >
> > Release notes for the 3.7.0 release:
> >
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html
> >
> > *** Please download, test and vote by Thursday, January 18, 9am PT ***
> >
> > Usually these deadlines tend to be 2-3 days, but due to this being the
> > first RC and the tests not having run yet, I am giving it a bit more time.
> >
> > Kafka's KEYS file containing PGP keys we use to sign the release:
> > https://kafka.apache.org/KEYS
> >
> > * Release artifacts to be voted upon (source and binary):
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
> >
> > * Docker release artifact to be voted upon:
> > apache/kafka:3.7.0-rc2
> >
> > * Maven artifacts to be voted upon:
> > https://repository.apache.org/content/groups/staging/org/apache/kafka/
> >
> > * Javadoc:
> > https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/
> >
> > * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> > https://github.com/apache/kafka/releases/tag/3.7.0-rc2
> >
> > * Documentation:
> > https://kafka.apache.org/37/documentation.html
> >
> > * Protocol:
> > https://kafka.apache.org/37/protocol.html
> >
> > * Successful Jenkins builds for the 3.7 branch:
> > Unit/integration tests:
> > https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/
> > There are failing 

[jira] [Created] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received

2024-01-12 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16122:
--

 Summary: TransactionsBounceTest -- server disconnected before 
response was received
 Key: KAFKA-16122
 URL: https://issues.apache.org/jira/browse/KAFKA-16122
 Project: Kafka
  Issue Type: Test
Reporter: Justine Olshan


I noticed a ton of tests failing with 


{code:java}
Error  org.apache.kafka.common.KafkaException: Unexpected error in 
TxnOffsetCommitResponse: The server disconnected before a response was 
received.  {code}
{code:java}
Stacktrace  org.apache.kafka.common.KafkaException: Unexpected error in 
TxnOffsetCommitResponse: The server disconnected before a response was 
received.  at 
app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702)
  at 
app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236)
  at 
app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
  at 
app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608)
  at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600)  
at 
app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457)
  at 
app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334)
  at 
app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249)  
at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code}


The error indicates a network error which is retriable but the TxnOffsetCommit 
handler doesn't expect this. 

https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other 
requests but not this one. 
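
As a purely hypothetical illustration of the retry behaviour this report argues 
for (the names below are illustrative, not Kafka's actual TransactionManager 
internals, and this is not the eventual fix): a disconnect before the response 
arrives is a network-level, retriable condition, so the handler could re-enqueue 
the request instead of surfacing a fatal KafkaException.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

final class TxnOffsetCommitRetrySketch {
    record PendingRequest(String description) {}

    private final Queue<PendingRequest> outbox = new ArrayDeque<>();

    void onComplete(PendingRequest request, boolean disconnected) {
        if (disconnected) {
            // Retriable: the broker went away before answering, so re-enqueue
            // the TxnOffsetCommit instead of failing the whole transaction.
            outbox.add(request);
            return;
        }
        // ... otherwise inspect the TxnOffsetCommitResponse errors as before ...
    }
}
{code}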



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16051) Deadlock on connector initialization

2024-01-12 Thread Greg Harris (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Harris resolved KAFKA-16051.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

> Deadlock on connector initialization
> 
>
> Key: KAFKA-16051
> URL: https://issues.apache.org/jira/browse/KAFKA-16051
> Project: Kafka
>  Issue Type: Bug
>  Components: connect
>Affects Versions: 2.6.3, 3.6.1
>Reporter: Octavian Ciubotaru
>Assignee: Octavian Ciubotaru
>Priority: Major
> Fix For: 3.8.0
>
>
>  
> Tested with Kafka 3.6.1 and 2.6.3.
> The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4.
> Stack trace for Kafka 3.6.1:
> {noformat}
> Found one Java-level deadlock:
> =
> "pool-3-thread-1":
>   waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder),
>   which is held by "Thread-9"
> "Thread-9":
>   waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore),
>   which is held by "pool-3-thread-1"Java stack information for the threads 
> listed above:
> ===
> "pool-3-thread-1":
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516)
>     - waiting to lock <0x91002aa0> (a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>     at 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137)
>     - locked <0x9101ccd8> (a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown
>  Source)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515)
>     at 
> java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
>     at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
> "Thread-9":
>     at 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129)
>     - waiting to lock <0x9101ccd8> (a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255)
>     - locked <0x91002aa0> (a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>     at 
> org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50)
>     at 
> org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548)
>     at 
> io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86)
> Found 1 deadlock.
> {noformat}
> The jdbc source connector loads tables from the database and updates the 
> configuration once the list is available. The deadlock is very consistent in 
> my environment, probably because the database is on the same machine.
> Maybe it is possible to avoid this situation by always locking the herder 
> first and the config backing store second. From what I see, 
> updateConnectorTasks is sometimes called before locking the herder and other 
> times it is not.
>  
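
A minimal, hypothetical Java sketch of the lock-ordering idea suggested above: 
both paths take the herder lock before the config-store lock, so the cyclic wait 
shown in the stack trace cannot occur. Class and method names here are 
illustrative, not Connect's real types.

{code:java}
final class LockOrderingSketch {
    private final Object herderLock = new Object();
    private final Object configStoreLock = new Object();

    void requestTaskReconfiguration() {
        synchronized (herderLock) {            // always lock the herder first
            synchronized (configStoreLock) {   // then the config backing store
                // update task configs ...
            }
        }
    }

    void onTaskConfigUpdate() {
        synchronized (herderLock) {            // same order on every path
            synchronized (configStoreLock) {
                // notify listeners ...
            }
        }
    }
}
{code}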



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PROPOSAL] Add commercial support page on website

2024-01-12 Thread Matthias J. Sax

François,

thanks for starting this initiative. Personally, I don't think it's 
necessarily harmful for the project to add such a new page; however, I 
share the same concerns others have raised already.


I understand your motivation that people had issues finding commercial 
support, but I am not sure we can address this issue that way. I am also 
"worried" (for lack of a better word) that the page might become 
long and unwieldy. In the end, any freelancer/consultant offering Kafka 
services would be able to get on the page, so we might get hundreds of 
entries, which also makes it impossible for users to find what they are 
looking for. Also, the services of different companies might vary 
drastically; should users read all these descriptions? I can also 
imagine that some companies offer their services only in some 
countries/regions, making it even harder for users to find what they are 
looking for.


Overall, it sounds more like a search optimization problem, and thus it 
seems out of scope for what we can solve. As I said, I am not strictly 
against it, but I just don't see much value either.



-Matthias

On 1/11/24 12:55 PM, Francois Papon wrote:

Hi Justine,

You're right, Kafka is a part of my business (training, consulting, 
architecture design, SLAs...) and most of the time, users/customers said 
that it was hard for them to find commercial support (in France, in my 
case) after searching on the Kafka website (Google didn't help them).


As an ASF member and a PMC member of several ASF projects, I know that this 
kind of page exists, which is why I made this proposal for the Kafka project: 
I really think that it can help users.


As you suggest, I can submit a PR to be added to the "powered by" page.

Thanks,

François

On 11/01/2024 21:00, Justine Olshan wrote:

Hey François,

My point was that the companies on that page use Kafka as part of their
business. If you use Kafka as part of your business, feel free to submit a
PR to be added.

I second Chris's point that other projects are not enough to require Kafka
having such a support page.

Justine

On Thu, Jan 11, 2024 at 11:57 AM Chris Egerton 
wrote:


Hi François,

Is it an official policy of the ASF that projects provide a listing of
commercial support options for themselves? I understand that other projects
have chosen to provide one, but this doesn't necessarily imply that all
projects should do the same, and I can't say I find this point very
convincing as a rebuttal to some of the good-faith concerns raised by the
PMC and members of the community so far. However, if there's an official
ASF stance on this topic, then I acknowledge that Apache Kafka should align
with it.

Best,

Chris


On Thu, Jan 11, 2024, 14:50 fpapon  wrote:


Hi Justine,

I'm not sure I see the difference between "happy users" and vendors
that advertise their products in some of the company list on the
"powered by" page.

Btw, the initial purpose of my proposal was to help users find support
for production use rather than searching on Google.

I don't think this is a bad thing because this is something that already
exists in many ASF projects, for example:

https://hop.apache.org/community/commercial/
https://struts.apache.org/commercial-support.html
https://directory.apache.org/commercial-support.html
https://tomee.apache.org/commercial-support.html
https://plc4x.apache.org/users/commercial-support.html
https://camel.apache.org/community/support/
https://openmeetings.apache.org/commercial-support.html
https://guacamole.apache.org/support/



https://cwiki.apache.org/confluence/display/HADOOP2/Distributions+and+Commercial+Support
https://activemq.apache.org/support

https://netbeans.apache.org/front/main/help/commercial-support/
https://royale.apache.org/royale-commercial-support/

https://karaf.apache.org/community.html

As I understand it, for now the channels for users to find production
support are:

- The mailing list (u...@kafka.apache.org / dev@kafka.apache.org)

- The official #kafka ASF Slack channel (maybe we can add it to the
website because I didn't find it there =>
https://kafka.apache.org/contact)

- Searching on Google for commercial support only

I can update my PR to mention only the 3 points above for the "get
support" page if people think that having a support page makes sense.

regards,

François

On 11/01/2024 19:34, Justine Olshan wrote:

I think there is a difference between the "Powered by" page and a page for
vendors to advertise their products and services.

The idea is that the companies on that page are "powered by" Kafka. They
serve as examples of happy users of Kafka.
I don't think it is meant only as a place just for those companies to
advertise.

I'm a little confused by

In this case, I'm ok to say that the commercial support section in the
"Get support" is no need as we can use this page.

If you plan to submit for this page, please include a description of how
your company uses 

Re: [Kafka Streams], Processor API for KTable and KGroupedStream

2024-01-12 Thread Matthias J. Sax
`KGroupedStream` is just an "intermediate representation" to get a 
better flow in the DSL. It's not a "top level" abstraction like 
KStream/KTable.


For `KTable` there is `transformValues()` -- there is no `transform()` 
because keying must be preserved -- if you want to change the keying you 
need to use `KTable#groupBy()` (the data needs to be repartitioned if you 
change the key).
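
To make that concrete, here is a minimal sketch against the Kafka Streams DSL 
(topic names and types are made up for illustration): `transformValues()` runs 
custom per-record logic while preserving the key, and re-keying goes through 
`groupBy()` plus an aggregation, which repartitions the data.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.processor.ProcessorContext;

public class KTableTransformValuesSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> table = builder.table("input-topic");

        // Per-record value transformation with read-only access to the key.
        KTable<String, String> upper = table.transformValues(
            () -> new ValueTransformerWithKey<String, String, String>() {
                @Override public void init(ProcessorContext context) {}
                @Override public String transform(String readOnlyKey, String value) {
                    return value == null ? null : value.toUpperCase();
                }
                @Override public void close() {}
            });

        // Changing the key requires groupBy() and an aggregation (repartition).
        KTable<String, Long> countsByValue = upper
            .groupBy((key, value) -> KeyValue.pair(value, value),
                     Grouped.with(Serdes.String(), Serdes.String()))
            .count();
    }
}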


HTH.

-Matthias

On 1/12/24 3:09 AM, Igor Maznitsa wrote:

Hello

Is there any way in the Kafka Streams API to have processors for KTable and 
KGroupedStream like KStream#transform? How can I provide a complex 
processor for KTable or KGroupedStream that could provide a way to not 
forward events downstream for some business logic?





Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-12 Thread Justine Olshan
Hi Jun,

In the case you describe, we would need to have a delayed request, send a
successful EndTxn, and a successful AddPartitionsToTxn and then have the
delayed EndTxn request go through for a given producer.
I'm trying to figure out if it is possible for the client to transition if
a previous request is delayed somewhere. But yes, in this case I think we
would fence the client.

Not for the overflow case. In the overflow case, the producer ID and the
epoch are different on the marker and on the new transaction. So we want
the marker to use the max epoch  but the new transaction should start with
the new ID and epoch 0 in the transactional state.

In the server downgrade case, we want to see the producer ID as that is
what the client will have. If we complete the commit, and the transaction
state is reloaded, we need the new producer ID in the state so there isn't
an invalid producer ID mapping.
The server downgrade cases are considering transactional correctness and
not regressing from previous behavior -- and are not concerned about
supporting the safety from fencing retries (as we have downgraded so we
don't need to support). Perhaps this is a trade off, but I think it is the
right one.

(If the client downgrades, it will have restarted and it is ok for it to
have a new producer ID too).

Justine

On Fri, Jan 12, 2024 at 11:42 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply.
>
> 101.4 "If the marker is written by the new client, we can as I mentioned in
> the last email guarantee that any EndTxn requests with the same epoch are
> from the same producer and the same transaction. Then we don't have to
> return a fenced error but can handle gracefully as described in the KIP."
> When a delayed EndTxn request is processed, the txn state could be ongoing
> for the next txn. I guess in this case we still return the fenced error for
> the delayed request?
>
> 102. Sorry, my question was inaccurate. What you described is accurate.
> "The downgrade compatibility I mention is that we keep the same producer ID
> and epoch in the main (non-tagged) fields as we did before the code on the
> server side." If we want to do this, it seems that we should use the
> current producer Id and max epoch in the existing producerId and
> producerEpoch fields for both the prepare and the complete marker, right?
> The downgrade can happen after the complete marker is written. With what
> you described, the downgraded coordinator will see the new producer Id
> instead of the old one.
>
> Jun
>
> On Fri, Jan 12, 2024 at 10:44 AM Justine Olshan
>  wrote:
>
> > Hi Jun,
> >
> > I can update the description.
> >
> > I believe your second point is mentioned in the KIP. I can add more text
> on
> > this if it is helpful.
> > > The delayed message case can also violate EOS if the delayed message
> > comes in after the next addPartitionsToTxn request comes in. Effectively
> we
> > may see a message from a previous (aborted) transaction become part of
> the
> > next transaction.
> >
> > If the marker is written by the new client, we can as I mentioned in the
> > last email guarantee that any EndTxn requests with the same epoch are
> from
> > the same producer and the same transaction. Then we don't have to return
> a
> > fenced error but can handle gracefully as described in the KIP.
> > I don't think a boolean is useful since it is directly encoded by the
> > existence or lack of the tagged field being written.
> > In the prepare marker we will have the same producer ID in the non-tagged
> > field. In the Complete state we may not.
> > I'm not sure why the ongoing state matters for this KIP. It does matter
> for
> > KIP-939.
> >
> > I'm not sure what you are referring to about writing the previous
> producer
> > ID in the prepare marker. This is not in the KIP.
> > In the overflow case, we write the nextProducerId in the prepare state.
> > This is so we know what we assigned when we reload the transaction log.
> > Once we complete, we transition this ID to the main (non-tagged field)
> and
> > have the previous producer ID field filled in. This is so we can identify
> > in a retry case the operation completed successfully and we don't fence
> our
> > producer. The downgrade compatibility I mention is that we keep the same
> > producer ID and epoch in the main (non-tagged) fields as we did before
> the
> > code on the server side. If the server downgrades, we are still
> compatible.
> > This addresses both the prepare and complete state downgrades.
> >
> > Justine
> >
> > On Fri, Jan 12, 2024 at 10:21 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the reply. Sorry for the delay. I have a few more comments.
> > >
> > > 110. I think the motivation section could be improved. One of the
> > > motivations listed by the KIP is "This can happen when a message gets
> > stuck
> > > or delayed due to networking issues or a network partition, the
> > transaction
> > > aborts, and then the delayed message finally comes 

[Kafka Streams], Processor API for KTable and KGroupedStream

2024-01-12 Thread Igor Maznitsa

Hello

Is there any way in the Kafka Streams API to have processors for KTable and 
KGroupedStream like KStream#transform? How can I provide a complex 
processor for KTable or KGroupedStream that could provide a way to not 
forward events downstream for some business logic?



--
Igor Maznitsa
email: rrg4...@gmail.com



[jira] [Resolved] (KAFKA-15816) Typos in tests leak network sockets

2024-01-12 Thread Greg Harris (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Harris resolved KAFKA-15816.
-
Fix Version/s: 3.8.0
   Resolution: Fixed

> Typos in tests leak network sockets
> ---
>
> Key: KAFKA-15816
> URL: https://issues.apache.org/jira/browse/KAFKA-15816
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Affects Versions: 3.6.0
>Reporter: Greg Harris
>Assignee: Greg Harris
>Priority: Minor
> Fix For: 3.8.0
>
>
> There are a few tests which leak network sockets due to small typos in the 
> tests themselves.
> Clients: [https://github.com/apache/kafka/pull/14750] (DONE)
>  * NioEchoServer
>  * KafkaConsumerTest
>  * KafkaProducerTest
>  * SelectorTest
>  * SslTransportLayerTest
>  * SslTransportTls12Tls13Test
>  * SslVersionsTransportLayerTest
>  * SaslAuthenticatorTest
> Core: [https://github.com/apache/kafka/pull/14754] (DONE)
>  * MiniKdc
>  * GssapiAuthenticationTest
>  * MirrorMakerIntegrationTest
>  * SocketServerTest
>  * EpochDrivenReplicationProtocolAcceptanceTest
>  * LeaderEpochIntegrationTest
> Trogdor: [https://github.com/apache/kafka/pull/14771] (DONE)
>  * AgentTest
> Mirror: [https://github.com/apache/kafka/pull/14761] (DONE)
>  * DedicatedMirrorIntegrationTest
>  * MirrorConnectorsIntegrationTest
>  * MirrorConnectorsWithCustomForwardingAdminIntegrationTest
> Runtime: [https://github.com/apache/kafka/pull/14764] (DONE)
>  * ConnectorTopicsIntegrationTest
>  * ExactlyOnceSourceIntegrationTest
>  * WorkerTest
>  * WorkerGroupMemberTest
> Streams: [https://github.com/apache/kafka/pull/14769] (DONE)
>  * IQv2IntegrationTest
>  * MetricsReporterIntegrationTest
>  * NamedTopologyIntegrationTest
>  * PurgeRepartitionTopicIntegrationTest
> These can be addressed by just fixing the tests.
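
As a purely illustrative example (not one of the actual tests listed above), the 
typical shape of such a typo and its fix looks like this:

{code:java}
import java.net.ServerSocket;

class SocketLeakSketch {
    void leaky() throws Exception {
        ServerSocket server = new ServerSocket(0);
        // ... assertions ...
        // missing server.close(): the socket leaks when the test finishes
    }

    void fixed() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            // ... assertions ...
        } // closed automatically, even if an assertion throws
    }
}
{code}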



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-12 Thread Jun Rao
Hi, Justine,

Thanks for the reply.

101.4 "If the marker is written by the new client, we can as I mentioned in
the last email guarantee that any EndTxn requests with the same epoch are
from the same producer and the same transaction. Then we don't have to
return a fenced error but can handle gracefully as described in the KIP."
When a delayed EndTxn request is processed, the txn state could be ongoing
for the next txn. I guess in this case we still return the fenced error for
the delayed request?

102. Sorry, my question was inaccurate. What you described is accurate.
"The downgrade compatibility I mention is that we keep the same producer ID
and epoch in the main (non-tagged) fields as we did before the code on the
server side." If we want to do this, it seems that we should use the
current producer Id and max epoch in the existing producerId and
producerEpoch fields for both the prepare and the complete marker, right?
The downgrade can happen after the complete marker is written. With what
you described, the downgraded coordinator will see the new producer Id
instead of the old one.

Jun

On Fri, Jan 12, 2024 at 10:44 AM Justine Olshan
 wrote:

> Hi Jun,
>
> I can update the description.
>
> I believe your second point is mentioned in the KIP. I can add more text on
> this if it is helpful.
> > The delayed message case can also violate EOS if the delayed message
> comes in after the next addPartitionsToTxn request comes in. Effectively we
> may see a message from a previous (aborted) transaction become part of the
> next transaction.
>
> If the marker is written by the new client, we can as I mentioned in the
> last email guarantee that any EndTxn requests with the same epoch are from
> the same producer and the same transaction. Then we don't have to return a
> fenced error but can handle gracefully as described in the KIP.
> I don't think a boolean is useful since it is directly encoded by the
> existence or lack of the tagged field being written.
> In the prepare marker we will have the same producer ID in the non-tagged
> field. In the Complete state we may not.
> I'm not sure why the ongoing state matters for this KIP. It does matter for
> KIP-939.
>
> I'm not sure what you are referring to about writing the previous producer
> ID in the prepare marker. This is not in the KIP.
> In the overflow case, we write the nextProducerId in the prepare state.
> This is so we know what we assigned when we reload the transaction log.
> Once we complete, we transition this ID to the main (non-tagged field) and
> have the previous producer ID field filled in. This is so we can identify
> in a retry case the operation completed successfully and we don't fence our
> producer. The downgrade compatibility I mention is that we keep the same
> producer ID and epoch in the main (non-tagged) fields as we did before the
> code on the server side. If the server downgrades, we are still compatible.
> This addresses both the prepare and complete state downgrades.
>
> Justine
>
> On Fri, Jan 12, 2024 at 10:21 AM Jun Rao  wrote:
>
> > Hi, Justine,
> >
> > Thanks for the reply. Sorry for the delay. I have a few more comments.
> >
> > 110. I think the motivation section could be improved. One of the
> > motivations listed by the KIP is "This can happen when a message gets
> stuck
> > or delayed due to networking issues or a network partition, the
> transaction
> > aborts, and then the delayed message finally comes in.". This seems not
> > very accurate. Without KIP-890, currently, if the coordinator times out
> and
> > aborts an ongoing transaction, it already bumps up the epoch in the
> marker,
> > which prevents the delayed produce message from being added to the user
> > partition. What can cause a hanging transaction is that the producer
> > completes (either aborts or commits) a transaction before receiving a
> > successful ack on messages published in the same txn. In this case, it's
> > possible for the delayed message to be appended to the partition after
> the
> > marker, causing a transaction to hang.
> >
> > A similar issue (not mentioned in the motivation) could happen on the
> > marker in the coordinator's log. For example, it's possible for an
> > EndTxnRequest to be delayed on the coordinator. By the time the delayed
> > EndTxnRequest is processed, it's possible that the previous txn has
> already
> > completed and a new txn has started. Currently, since the epoch is not
> > bumped on every txn, the delayed EndTxnRequest will add an unexpected
> > prepare marker (and eventually a complete marker) to the ongoing txn.
> This
> > won't cause the transaction to hang, but it will break the EoS semantic.
> > The proposal in this KIP will address this issue too.
> >
> > 101. "However, I was writing it so that we can distinguish between
> > old clients where we don't have the ability do this operation and new
> > clients that can. (Old clients don't bump the epoch on commit, so we
> can't
> > say for sure the write 

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2567

2024-01-12 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Artem Livshits
I think using feature flags (whether we support a framework and tooling for
feature flags or just an ad-hoc XyzEnabled flag) can be an alternative to
this KIP.  I think the value of this KIP is that it's trying to propose a
systemic approach for gating functionality that may take multiple releases
to develop.  A problem with ad-hoc feature flags is that they're useful during
feature development, so that people who are working on this feature (or are
interested in beta-testing the feature) can get early access (without any
guarantees on compatibility or even correctness); but then the feature
flags often stick around forever after the feature development is done (and as
time moves on, the new code is written in such a way that it's not
possible to turn the feature off any more cleanly).
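
For illustration, an ad-hoc "XyzEnabled" flag of the kind described above often
looks something like the following sketch; the property name and class are made
up, not an actual Kafka config.

final class FeatureFlagSketch {
    // e.g. started with -Dxyz.enabled=true while the feature is in development
    static final boolean XYZ_ENABLED = Boolean.getBoolean("xyz.enabled");

    void maybeDoXyz() {
        if (!XYZ_ENABLED) {
            return; // gated off: behave exactly as before the feature existed
        }
        // ... new, not-yet-stable code path ...
    }
}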

I'd also clarify how I think about "stable". Ismael made the comment
"something is stable in the 'this is battle-tested' sense". I don't think
it has to be "battle-tested", it just has to meet the bar of being in the
trunk.  Again, thinking of a small single-commit feature -- to commit to
trunk, the feature doesn't have to be "battle-tested", but it should be
complete (and not just a bunch of TODOs), with tests written and some level
of dev-testing done, so that once the release is cut, we can find and fix
bugs and make it release-quality (as opposed to reverting the whole
thing).  We can apply the same "stability" bar for features to be in the
stable MV -- fully complete, tests written and some level of dev testing
done.

-Artem

On Fri, Jan 12, 2024 at 10:12 AM Justine Olshan
 wrote:

> Hi Ismael,
>
> I debated including something about feature flags in my last comment, but
> maybe I should have.
> What you said makes sense.
>
> Justine
>
> On Fri, Jan 12, 2024 at 9:31 AM Ismael Juma  wrote:
>
> > Justine,
> >
> > For features that are not production-ready, they should have an
> additional
> > configuration (not the metadata version) that enables/disables it. The MV
> > specific features we ship are something we have to support and make sure
> we
> > don't break going forward.
> >
> > Ismael
> >
> > On Fri, Jan 12, 2024 at 9:26 AM Justine Olshan
> > 
> > wrote:
> >
> > > Hi Ismael,
> > >
> > > I think the concern I have about a MV for a feature that is not
> > production
> > > ready is that it blocks any development/features with higher MV
> versions
> > > that could be production ready.
> > >
> > > I do see your point though. Previously MV/IBP was about pure broker
> > > compatibility and not about the confidence in the feature it is
> gating. I
> > > do wonder though if it could be useful to have that sort of gating.
> > > I also wonder if an internal config could be useful so that the average
> > > user doesn't have to worry about the complexities of unstable metadata
> > > versions (and their risks).
> > >
> > > I am ok with options 2 and 2 as well by the way.
> > >
> > > Justine
> > >
> > > On Fri, Jan 12, 2024 at 7:36 AM Ismael Juma  wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks for the KIP.
> > > >
> > > > Reading the discussion, I think a lot of the confusion is due to the
> > > > "unstable" naming. People are then trying to figure out when we think
> > > > something is stable in the "this is battle-tested" sense. But this
> flag
> > > > should not be about that. We can have an MV for a feature that is not
> > yet
> > > > production-ready (and we did that when KRaft itself was not
> production
> > > > ready). I think this flag is about metadata versions that are
> basically
> > > > unsupported - if you use it, you get to keep the pieces. They exist
> > > solely
> > > > to make the lives of Apache Kafka developers easier. I would even
> > suggest
> > > > that the config we introduce be of the internal variety, ie it won't
> > show
> > > > in the generated documentation and there won't be any compatibility
> > > > guarantee.
> > > >
> > > > Thoughts?
> > > >
> > > > Ismael
> > > >
> > > > On Fri, Jan 5, 2024 at 7:33 AM Proven Provenzano
> > > >  wrote:
> > > >
> > > > > Hey folks,
> > > > >
> > > > > I am starting a discussion thread for managing unstable metadata
> > > versions
> > > > > in Apache Kafka.
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1014%3A+Managing+Unstable+Metadata+Versions+in+Apache+Kafka
> > > > >
> > > > > This KIP is actually already implemented in 3.7 with PR
> > > > > https://github.com/apache/kafka/pull/14860.
> > > > > I have created this KIP to explain the motivation and how managing
> > > > Metadata
> > > > > Versions is expected to work.
> > > > > Comments are greatly appreciated as this process can always be
> > > improved.
> > > > >
> > > > > --
> > > > > --Proven
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] 3.7.0 RC2

2024-01-12 Thread Chris Egerton
Hi Stanislav,


Thanks for running this release!

To verify, I:
- Built from source using Java 11 with both:
- - the 3.7.0-rc2 tag on GitHub
- - the kafka-3.7.0-src.tgz artifact from
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
- Checked signatures and checksums
- Ran the quickstart using both:
- - The kafka_2.13-3.7.0.tgz artifact from
https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/ with Java 11
and Scala 13 in KRaft mode
- - Our shiny new broker Docker image, apache/kafka:3.7.0-rc2
- Ran all unit tests
- Ran all integration tests for Connect and MM2


I found two minor areas for concern:

1. (Possibly a blocker)
When running the quickstart, I noticed this ERROR-level log message being
emitted frequently (not not every time) when I killed my console consumer
via ctrl-C:

> [2024-01-12 11:00:31,088] ERROR [Consumer clientId=console-consumer,
groupId=console-consumer-74388] Unable to find FetchSessionHandler for node
1. Ignoring fetch response
(org.apache.kafka.clients.consumer.internals.AbstractFetch)

I see that this error message is already reported in
https://issues.apache.org/jira/browse/KAFKA-16029. I think we should
prioritize fixing it for this release. I know it's probably benign but it's
really not a good look for us when basic operations log error messages, and
it may give new users some headaches.


2. (Probably not a blocker)
The following unit tests failed the first time around, and all of them
passed the second time I ran them:

- (clients) ClientUtilsTest.testParseAndValidateAddressesWithReverseLookup()
- (clients) SelectorTest.testConnectionsByClientMetric()
- (clients) Tls13SelectorTest.testConnectionsByClientMetric()
- (connect) TopicAdminTest.retryEndOffsetsShouldRetryWhenTopicNotFound (I
thought I fixed this one! 郎郎)
- (core) ProducerIdManagerTest.testUnrecoverableErrors(Errors)[2]


Thanks again for your work on this release, and congratulations to Kafka
Streams for having zero flaky unit tests during my highly-experimental
single laptop run!


Cheers,

Chris

On Thu, Jan 11, 2024 at 1:33 PM Stanislav Kozlovski
 wrote:

> Hello Kafka users, developers, and client-developers,
>
> This is the first candidate for release of Apache Kafka 3.7.0.
>
> Note it's named "RC2" because I had a few "failed" RCs that I had
> cut/uploaded but ultimately had to scrap prior to announcing due to new
> blockers arriving before I could even announce them.
>
> Further - I haven't yet been able to set up the system tests successfully.
> And the integration/unit tests do have a few failures that I have to spend
> time triaging. I would appreciate any help in case anyone notices any tests
> failing that they're subject matter experts in. Expect me to follow up in
> a day or two with more detailed analysis.
>
> Major changes include:
> - Early Access to KIP-848 - the next generation of the consumer rebalance
> protocol
> - KIP-858: Adding JBOD support to KRaft
> - KIP-714: Observability into Client metrics via a standardized interface
>
> Check more information in the WIP blog post:
> https://github.com/apache/kafka-site/pull/578
>
> Release notes for the 3.7.0 release:
>
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/RELEASE_NOTES.html
>
> *** Please download, test and vote by Thursday, January 18, 9am PT ***
>
> Usually these deadlines tend to be 2-3 days, but due to this being the
> first RC and the tests not having run yet, I am giving it a bit more time.
>
> Kafka's KEYS file containing PGP keys we use to sign the release:
> https://kafka.apache.org/KEYS
>
> * Release artifacts to be voted upon (source and binary):
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/
>
> * Docker release artifact to be voted upon:
> apache/kafka:3.7.0-rc2
>
> * Maven artifacts to be voted upon:
> https://repository.apache.org/content/groups/staging/org/apache/kafka/
>
> * Javadoc:
> https://home.apache.org/~stanislavkozlovski/kafka-3.7.0-rc2/javadoc/
>
> * Tag to be voted upon (off 3.7 branch) is the 3.7.0 tag:
> https://github.com/apache/kafka/releases/tag/3.7.0-rc2
>
> * Documentation:
> https://kafka.apache.org/37/documentation.html
>
> * Protocol:
> https://kafka.apache.org/37/protocol.html
>
> * Successful Jenkins builds for the 3.7 branch:
> Unit/integration tests:
> https://ci-builds.apache.org/job/Kafka/job/kafka/job/3.7/58/
> There are failing tests here. I have to follow up with triaging some of
> the failures and figuring out if they're actual problems or simply flakes.
>
> System tests: https://jenkins.confluent.io/job/system-test-kafka/job/3.7/
>
> No successful system test runs yet. I am working on getting the job to run.
>
> * Successful Docker Image Github Actions Pipeline for 3.7 branch:
> Attached are the scan_report and report_jvm output files from the Docker
> Build run:
> https://github.com/apache/kafka/actions/runs/7486094960/job/20375761673
>
> And the final docker image build job - Docker Build Test Pipeline:
> 

[jira] [Resolved] (KAFKA-16112) Review JMX metrics in Async Consumer and determine the missing ones

2024-01-12 Thread Philip Nee (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Nee resolved KAFKA-16112.

Resolution: Fixed

These are the results of this ticket:

- KAFKA-16113
- KAFKA-16116
- KAFKA-16115

> Review JMX metrics in Async Consumer and determine the missing ones
> ---
>
> Key: KAFKA-16112
> URL: https://issues.apache.org/jira/browse/KAFKA-16112
> Project: Kafka
>  Issue Type: Task
>  Components: clients, consumer
>Reporter: Kirk True
>Assignee: Philip Nee
>Priority: Major
>  Labels: consumer-threading-refactor, kip-848-client-support
> Fix For: 3.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-12 Thread Justine Olshan
Hi Jun,

I can update the description.

I believe your second point is mentioned in the KIP. I can add more text on
this if it is helpful.
> The delayed message case can also violate EOS if the delayed message
comes in after the next addPartitionsToTxn request comes in. Effectively we
may see a message from a previous (aborted) transaction become part of the
next transaction.

If the marker is written by the new client, we can, as I mentioned in the
last email, guarantee that any EndTxn requests with the same epoch are from
the same producer and the same transaction. Then we don't have to return a
fenced error but can handle it gracefully as described in the KIP.
I don't think a boolean is useful since it is directly encoded by the
existence or lack of the tagged field being written.
In the prepare marker we will have the same producer ID in the non-tagged
field. In the Complete state we may not.
I'm not sure why the ongoing state matters for this KIP. It does matter for
KIP-939.

I'm not sure what you are referring to about writing the previous producer
ID in the prepare marker. This is not in the KIP.
In the overflow case, we write the nextProducerId in the prepare state.
This is so we know what we assigned when we reload the transaction log.
Once we complete, we transition this ID to the main (non-tagged field) and
have the previous producer ID field filled in. This is so that, in a retry
case, we can identify that the operation completed successfully and we don't
fence our producer. The downgrade compatibility I mention is that we keep the same
producer ID and epoch in the main (non-tagged) fields as we did before the
code on the server side. If the server downgrades, we are still compatible.
This addresses both the prepare and complete state downgrades.

Justine

On Fri, Jan 12, 2024 at 10:21 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the reply. Sorry for the delay. I have a few more comments.
>
> 110. I think the motivation section could be improved. One of the
> motivations listed by the KIP is "This can happen when a message gets stuck
> or delayed due to networking issues or a network partition, the transaction
> aborts, and then the delayed message finally comes in.". This seems not
> very accurate. Without KIP-890, currently, if the coordinator times out and
> aborts an ongoing transaction, it already bumps up the epoch in the marker,
> which prevents the delayed produce message from being added to the user
> partition. What can cause a hanging transaction is that the producer
> completes (either aborts or commits) a transaction before receiving a
> successful ack on messages published in the same txn. In this case, it's
> possible for the delayed message to be appended to the partition after the
> marker, causing a transaction to hang.
>
> A similar issue (not mentioned in the motivation) could happen on the
> marker in the coordinator's log. For example, it's possible for an
> EndTxnRequest to be delayed on the coordinator. By the time the delayed
> EndTxnRequest is processed, it's possible that the previous txn has already
> completed and a new txn has started. Currently, since the epoch is not
> bumped on every txn, the delayed EndTxnRequest will add an unexpected
> prepare marker (and eventually a complete marker) to the ongoing txn. This
> won't cause the transaction to hang, but it will break the EoS semantic.
> The proposal in this KIP will address this issue too.
>
> 101. "However, I was writing it so that we can distinguish between
> old clients where we don't have the ability to do this operation and new
> clients that can. (Old clients don't bump the epoch on commit, so we can't
> say for sure the write belongs to the given transaction)."
> 101.1 I am wondering why we need to distinguish whether the marker is
> written by the old and the new client. Could you describe what we do
> differently if we know the marker is written by the new client?
> 101.2 If we do need a way to distinguish whether the marker is written by
> the old and the new client. Would it be simpler to just introduce a boolean
> field instead of indirectly through the previous produce ID field?
> 101.3 It's not clear to me why we only add the previous produce ID field in
> the complete marker, but not in the prepare marker. If we want to know
> whether a marker is written by the new client or not, it seems that we want
> to do this consistently for all markers.
> 101.4 What about the TransactionLogValue record representing the ongoing
> state? Should we also distinguish whether it's written by the old or the
> new client?
>
> 102. In the overflow case, it's still not clear to me why we write the
> previous produce Id in the prepare marker while writing the next produce Id
> in the complete marker. You mentioned that it's for downgrading. However,
> we could downgrade with either the prepare marker or the complete marker.
> In either case, the downgraded coordinator should see the same produce id
> (probably the previous produce 

Re: [DISCUSS] KIP-890 Server Side Defense

2024-01-12 Thread Jun Rao
Hi, Justine,

Thanks for the reply. Sorry for the delay. I have a few more comments.

110. I think the motivation section could be improved. One of the
motivations listed by the KIP is "This can happen when a message gets stuck
or delayed due to networking issues or a network partition, the transaction
aborts, and then the delayed message finally comes in.". This seems not
very accurate. Without KIP-890, currently, if the coordinator times out and
aborts an ongoing transaction, it already bumps up the epoch in the marker,
which prevents the delayed produce message from being added to the user
partition. What can cause a hanging transaction is that the producer
completes (either aborts or commits) a transaction before receiving a
successful ack on messages published in the same txn. In this case, it's
possible for the delayed message to be appended to the partition after the
marker, causing a transaction to hang.

A similar issue (not mentioned in the motivation) could happen on the
marker in the coordinator's log. For example, it's possible for an
EndTxnRequest to be delayed on the coordinator. By the time the delayed
EndTxnRequest is processed, it's possible that the previous txn has already
completed and a new txn has started. Currently, since the epoch is not
bumped on every txn, the delayed EndTxnRequest will add an unexpected
prepare marker (and eventually a complete marker) to the ongoing txn. This
won't cause the transaction to hang, but it will break the EoS semantic.
The proposal in this KIP will address this issue too.

101. "However, I was writing it so that we can distinguish between
old clients where we don't have the ability to do this operation and new
clients that can. (Old clients don't bump the epoch on commit, so we can't
say for sure the write belongs to the given transaction)."
101.1 I am wondering why we need to distinguish whether the marker is
written by the old and the new client. Could you describe what we do
differently if we know the marker is written by the new client?
101.2 If we do need a way to distinguish whether the marker is written by
the old and the new client. Would it be simpler to just introduce a boolean
field instead of indirectly through the previous produce ID field?
101.3 It's not clear to me why we only add the previous produce ID field in
the complete marker, but not in the prepare marker. If we want to know
whether a marker is written by the new client or not, it seems that we want
to do this consistently for all markers.
101.4 What about the TransactionLogValue record representing the ongoing
state? Should we also distinguish whether it's written by the old or the
new client?

102. In the overflow case, it's still not clear to me why we write the
previous produce Id in the prepare marker while writing the next produce Id
in the complete marker. You mentioned that it's for downgrading. However,
we could downgrade with either the prepare marker or the complete marker.
In either case, the downgraded coordinator should see the same produce id
(probably the previous produce Id), right?

Jun
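
(For context on the client-side flow that the EndTxn requests and markers above correspond
to, here is a minimal sketch of the standard Java transactional producer loop; broker-side
marker and epoch handling is not shown, and the bootstrap address, transactional id, and
topic are placeholders.)

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalProduceSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id");   // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();  // fences older instances of this transactional id
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
                // Flushes outstanding sends, then asks the coordinator to write commit markers.
                producer.commitTransaction();
            }
        }
    }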

On Wed, Dec 20, 2023 at 6:00 PM Justine Olshan 
wrote:

> Hey Jun,
>
> Thanks for taking a look at the KIP again.
>
> 100. For the epoch overflow case, only the marker will have max epoch. This
> keeps the behavior of the rest of the markers where the last marker is the
> epoch of the transaction records + 1.
>
> 101. You are correct that we don't need to write the producer ID since it
> is the same. However, I was writing it so that we can distinguish between
> old clients where we don't have the ability do this operation and new
> clients that can. (Old clients don't bump the epoch on commit, so we can't
> say for sure the write belongs to the given transaction). If we receive an
> EndTxn request from a new client, we will fill this field. We can guarantee
> that any EndTxn requests with the same epoch are from the same producer and
> the same transaction.
>
> 102. In prepare phase, we have the same producer ID and epoch we always
> had. It is the producer ID and epoch that are on the marker. In commit
> phase, we stay the same unless it is the overflow case. In that case, we
> set the producer ID to the new one we generated and epoch to 0 after
> complete. This is for downgrade compatibility. The tagged fields are just
> safety guards for retries and failovers.
>
> In prepare phase for epoch overflow case only we store the next producer
> ID. This is for the case where we reload the transaction coordinator in
> prepare state. Once the transaction is committed, we can use the producer
> ID the client already is using.
>
> In commit phase, we store the previous producer ID in case of retries.
>
> I think it is easier to think of it as just how we were storing producer ID
> and epoch before, with some extra bookkeeping and edge case handling in the
> tagged fields. We have to do it this way for compatibility with downgrades.
>
> 103. Next producer ID is for prepare status and 

Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Justine Olshan
Hi Ismael,

I debated including something about feature flags in my last comment, but
maybe I should have.
What you said makes sense.

Justine

On Fri, Jan 12, 2024 at 9:31 AM Ismael Juma  wrote:

> Justine,
>
> For features that are not production-ready, they should have an additional
> configuration (not the metadata version) that enables/disables it. The MV
> specific features we ship are something we have to support and make sure we
> don't break going forward.
>
> Ismael
>
> On Fri, Jan 12, 2024 at 9:26 AM Justine Olshan
> 
> wrote:
>
> > Hi Ismael,
> >
> > I think the concern I have about a MV for a feature that is not
> production
> > ready is that it blocks any development/features with higher MV versions
> > that could be production ready.
> >
> > I do see your point though. Previously MV/IBP was about pure broker
> > compatibility and not about the confidence in the feature it is gating. I
> > do wonder though if it could be useful to have that sort of gating.
> > I also wonder if an internal config could be useful so that the average
> > user doesn't have to worry about the complexities of unstable metadata
> > versions (and their risks).
> >
> > I am ok with options 2 and 2 as well by the way.
> >
> > Justine
> >
> > On Fri, Jan 12, 2024 at 7:36 AM Ismael Juma  wrote:
> >
> > > Hi,
> > >
> > > Thanks for the KIP.
> > >
> > > Reading the discussion, I think a lot of the confusion is due to the
> > > "unstable" naming. People are then trying to figure out when we think
> > > something is stable in the "this is battle-tested" sense. But this flag
> > > should not be about that. We can have an MV for a feature that is not
> yet
> > > production-ready (and we did that when KRaft itself was not production
> > > ready). I think this flag is about metadata versions that are basically
> > > unsupported - if you use it, you get to keep the pieces. They exist
> > solely
> > > to make the lives of Apache Kafka developers easier. I would even
> suggest
> > > that the config we introduce be of the internal variety, ie it won't
> show
> > > in the generated documentation and there won't be any compatibility
> > > guarantee.
> > >
> > > Thoughts?
> > >
> > > Ismael
> > >
> > > On Fri, Jan 5, 2024 at 7:33 AM Proven Provenzano
> > >  wrote:
> > >
> > > > Hey folks,
> > > >
> > > > I am starting a discussion thread for managing unstable metadata
> > versions
> > > > in Apache Kafka.
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1014%3A+Managing+Unstable+Metadata+Versions+in+Apache+Kafka
> > > >
> > > > This KIP is actually already implemented in 3.7 with PR
> > > > https://github.com/apache/kafka/pull/14860.
> > > > I have created this KIP to explain the motivation and how managing
> > > Metadata
> > > > Versions is expected to work.
> > > > Comments are greatly appreciated as this process can always be
> > improved.
> > > >
> > > > --
> > > > --Proven
> > > >
> > >
> >
>


[jira] [Created] (KAFKA-16121) Partition reassignments in ZK migration dual write mode stalled until leader epoch incremented

2024-01-12 Thread David Mao (Jira)
David Mao created KAFKA-16121:
-

 Summary: Partition reassignments in ZK migration dual write mode 
stalled until leader epoch incremented
 Key: KAFKA-16121
 URL: https://issues.apache.org/jira/browse/KAFKA-16121
 Project: Kafka
  Issue Type: Bug
Reporter: David Mao


I noticed this in an integration test in 
https://github.com/apache/kafka/pull/15184

In ZK mode, partition leaders rely on the LeaderAndIsr request to be notified 
of new replicas as part of a reassignment. In ZK mode, we ignore any 
LeaderAndIsr request where the partition leader epoch is less than or equal to 
the current partition leader epoch.

In KRaft mode, we do not bump the leader epoch when starting a new 
reassignment, see: `triggerLeaderEpochBumpIfNeeded`. This means that the leader 
will ignore the LISR request initiating the reassignment until a leader epoch 
bump is triggered through another means, for instance preferred leader election.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Ismael Juma
Justine,

For features that are not production-ready, they should have an additional
configuration (not the metadata version) that enables/disables it. The MV
specific features we ship are something we have to support and make sure we
don't break going forward.

Ismael

On Fri, Jan 12, 2024 at 9:26 AM Justine Olshan 
wrote:

> Hi Ismael,
>
> I think the concern I have about a MV for a feature that is not production
> ready is that it blocks any development/features with higher MV versions
> that could be production ready.
>
> I do see your point though. Previously MV/IBP was about pure broker
> compatibility and not about the confidence in the feature it is gating. I
> do wonder though if it could be useful to have that sort of gating.
> I also wonder if an internal config could be useful so that the average
> user doesn't have to worry about the complexities of unstable metadata
> versions (and their risks).
>
> I am ok with options 2 and 2 as well by the way.
>
> Justine
>
> On Fri, Jan 12, 2024 at 7:36 AM Ismael Juma  wrote:
>
> > Hi,
> >
> > Thanks for the KIP.
> >
> > Reading the discussion, I think a lot of the confusion is due to the
> > "unstable" naming. People are then trying to figure out when we think
> > something is stable in the "this is battle-tested" sense. But this flag
> > should not be about that. We can have an MV for a feature that is not yet
> > production-ready (and we did that when KRaft itself was not production
> > ready). I think this flag is about metadata versions that are basically
> > unsupported - if you use it, you get to keep the pieces. They exist
> solely
> > to make the lives of Apache Kafka developers easier. I would even suggest
> > that the config we introduce be of the internal variety, ie it won't show
> > in the generated documentation and there won't be any compatibility
> > guarantee.
> >
> > Thoughts?
> >
> > Ismael
> >
> > On Fri, Jan 5, 2024 at 7:33 AM Proven Provenzano
> >  wrote:
> >
> > > Hey folks,
> > >
> > > I am starting a discussion thread for managing unstable metadata
> versions
> > > in Apache Kafka.
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1014%3A+Managing+Unstable+Metadata+Versions+in+Apache+Kafka
> > >
> > > This KIP is actually already implemented in 3.7 with PR
> > > https://github.com/apache/kafka/pull/14860.
> > > I have created this KIP to explain the motivation and how managing
> > Metadata
> > > Versions is expected to work.
> > > Comments are greatly appreciated as this process can always be
> improved.
> > >
> > > --
> > > --Proven
> > >
> >
>


Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Justine Olshan
Hi Ismael,

I think the concern I have about a MV for a feature that is not production
ready is that it blocks any development/features with higher MV versions
that could be production ready.

I do see your point though. Previously MV/IBP was about pure broker
compatibility and not about the confidence in the feature it is gating. I
do wonder though if it could be useful to have that sort of gating.
I also wonder if an internal config could be useful so that the average
user doesn't have to worry about the complexities of unstable metadata
versions (and their risks).

I am ok with options 2 and 2 as well by the way.

Justine

On Fri, Jan 12, 2024 at 7:36 AM Ismael Juma  wrote:

> Hi,
>
> Thanks for the KIP.
>
> Reading the discussion, I think a lot of the confusion is due to the
> "unstable" naming. People are then trying to figure out when we think
> something is stable in the "this is battle-tested" sense. But this flag
> should not be about that. We can have an MV for a feature that is not yet
> production-ready (and we did that when KRaft itself was not production
> ready). I think this flag is about metadata versions that are basically
> unsupported - if you use it, you get to keep the pieces. They exist solely
> to make the lives of Apache Kafka developers easier. I would even suggest
> that the config we introduce be of the internal variety, ie it won't show
> in the generated documentation and there won't be any compatibility
> guarantee.
>
> Thoughts?
>
> Ismael
>
> On Fri, Jan 5, 2024 at 7:33 AM Proven Provenzano
>  wrote:
>
> > Hey folks,
> >
> > I am starting a discussion thread for managing unstable metadata versions
> > in Apache Kafka.
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1014%3A+Managing+Unstable+Metadata+Versions+in+Apache+Kafka
> >
> > This KIP is actually already implemented in 3.7 with PR
> > https://github.com/apache/kafka/pull/14860.
> > I have created this KIP to explain the motivation and how managing
> Metadata
> > Versions is expected to work.
> > Comments are greatly appreciated as this process can always be improved.
> >
> > --
> > --Proven
> >
>


[jira] [Created] (KAFKA-16120) Partition reassignments in ZK migration dual write leaves stray partitions

2024-01-12 Thread David Mao (Jira)
David Mao created KAFKA-16120:
-

 Summary: Partition reassignments in ZK migration dual write leaves 
stray partitions
 Key: KAFKA-16120
 URL: https://issues.apache.org/jira/browse/KAFKA-16120
 Project: Kafka
  Issue Type: Bug
Reporter: David Mao


When a reassignment is completed in ZK migration dual-write mode, the 
`StopReplica` sent by the kraft quorum migration propagator is sent with 
`delete = false` for deleted replicas when processing the topic delta. This 
results in stray replicas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Ismael Juma
Hi,

Thanks for the KIP.

Reading the discussion, I think a lot of the confusion is due to the
"unstable" naming. People are then trying to figure out when we think
something is stable in the "this is battle-tested" sense. But this flag
should not be about that. We can have an MV for a feature that is not yet
production-ready (and we did that when KRaft itself was not production
ready). I think this flag is about metadata versions that are basically
unsupported - if you use it, you get to keep the pieces. They exist solely
to make the lives of Apache Kafka developers easier. I would even suggest
that the config we introduce be of the internal variety, ie it won't show
in the generated documentation and there won't be any compatibility
guarantee.

Thoughts?

Ismael

On Fri, Jan 5, 2024 at 7:33 AM Proven Provenzano
 wrote:

> Hey folks,
>
> I am starting a discussion thread for managing unstable metadata versions
> in Apache Kafka.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1014%3A+Managing+Unstable+Metadata+Versions+in+Apache+Kafka
>
> This KIP is actually already implemented in 3.7 with PR
> https://github.com/apache/kafka/pull/14860.
> I have created this KIP to explain the motivation and how managing Metadata
> Versions is expected to work.
> Comments are greatly appreciated as this process can always be improved.
>
> --
> --Proven
>


[jira] [Resolved] (KAFKA-16119) kraft_upgrade_test system test is broken

2024-01-12 Thread Mickael Maison (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mickael Maison resolved KAFKA-16119.

Resolution: Invalid

After rebuilding my env from scratch I don't see this error anymore

> kraft_upgrade_test system test is broken
> 
>
> Key: KAFKA-16119
> URL: https://issues.apache.org/jira/browse/KAFKA-16119
> Project: Kafka
>  Issue Type: New Feature
>Affects Versions: 3.6.0, 3.7.0, 3.6.1
>Reporter: Mickael Maison
>Priority: Major
>
> When the test attempts to restart brokers after the upgrade, brokers fail 
> with:
> [2024-01-12 13:43:40,144] ERROR Exiting Kafka due to fatal exception 
> (kafka.Kafka$)
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/image/loader/MetadataLoaderMetrics
> at kafka.server.KafkaRaftServer.(KafkaRaftServer.scala:68)
> at kafka.Kafka$.buildServer(Kafka.scala:83)
> at kafka.Kafka$.main(Kafka.scala:91)
> at kafka.Kafka.main(Kafka.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.image.loader.MetadataLoaderMetrics
> at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> ... 4 more
> MetadataLoaderMetrics was moved from org.apache.kafka.image.loader to 
> org.apache.kafka.image.loader.metrics in 
> https://github.com/apache/kafka/commit/c7de30f38bfd6e2d62a0b5c09b5dc9707e58096b



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2566

2024-01-12 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 346482 lines...]
Gradle Test Run :core:test > Gradle Test Executor 94 > ZkMigrationClientTest > 
testCreateNewTopic() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZkMigrationClientTest > 
testCreateNewTopic() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZkMigrationClientTest > 
testUpdateExistingTopicWithNewAndChangedPartitions() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZkMigrationClientTest > 
testUpdateExistingTopicWithNewAndChangedPartitions() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testZNodeChangeHandlerForDataChange() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testZNodeChangeHandlerForDataChange() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testZooKeeperSessionStateMetric() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testZooKeeperSessionStateMetric() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testExceptionInBeforeInitializingSession() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testExceptionInBeforeInitializingSession() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetChildrenExistingZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetChildrenExistingZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testConnection() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testConnection() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testZNodeChangeHandlerForCreation() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testZNodeChangeHandlerForCreation() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetAclExistingZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetAclExistingZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testSessionExpiryDuringClose() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testSessionExpiryDuringClose() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testReinitializeAfterAuthFailure() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testReinitializeAfterAuthFailure() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testSetAclNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testSetAclNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testConnectionLossRequestTermination() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testConnectionLossRequestTermination() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testExistsNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testExistsNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetDataNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetDataNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testConnectionTimeout() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testConnectionTimeout() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testBlockOnRequestCompletionFromStateChangeHandler() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testBlockOnRequestCompletionFromStateChangeHandler() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testUnresolvableConnectString() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testUnresolvableConnectString() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetChildrenNonExistentZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testGetChildrenNonExistentZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testPipelinedGetData() STARTED

Gradle Test Run :core:test > Gradle Test Executor 94 > ZooKeeperClientTest > 
testPipelinedGetData() PASSED

Gradle Test Run :core:test > Gradle Test 

Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.7 #60

2024-01-12 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-16119) kraft_upgrade_test system test is broken

2024-01-12 Thread Mickael Maison (Jira)
Mickael Maison created KAFKA-16119:
--

 Summary: kraft_upgrade_test system test is broken
 Key: KAFKA-16119
 URL: https://issues.apache.org/jira/browse/KAFKA-16119
 Project: Kafka
  Issue Type: New Feature
Affects Versions: 3.6.1, 3.6.0, 3.7.0
Reporter: Mickael Maison


When the test attempts to restart brokers after the upgrade, brokers fail with:

[2024-01-12 13:43:40,144] ERROR Exiting Kafka due to fatal exception 
(kafka.Kafka$)
java.lang.NoClassDefFoundError: 
org/apache/kafka/image/loader/MetadataLoaderMetrics
at kafka.server.KafkaRaftServer.(KafkaRaftServer.scala:68)
at kafka.Kafka$.buildServer(Kafka.scala:83)
at kafka.Kafka$.main(Kafka.scala:91)
at kafka.Kafka.main(Kafka.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.kafka.image.loader.MetadataLoaderMetrics
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 4 more

MetadataLoaderMetrics was moved from org.apache.kafka.image.loader to 
org.apache.kafka.image.loader.metrics in 
https://github.com/apache/kafka/commit/c7de30f38bfd6e2d62a0b5c09b5dc9707e58096b



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.5 #104

2024-01-12 Thread Apache Jenkins Server
See 




Jenkins build is unstable: Kafka » Kafka Branch Builder » 3.4 #173

2024-01-12 Thread Apache Jenkins Server
See 




Jenkins build is still unstable: Kafka » Kafka Branch Builder » 3.6 #135

2024-01-12 Thread Apache Jenkins Server
See 




Re: KIP-991: Allow DropHeaders SMT to drop headers by wildcard/regexp

2024-01-12 Thread Roman Schmitz
Hi,

Thanks, 100% agree, I fixed the description.

Thanks,
Roman
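
(For illustration, a connector configuration using the proposed pattern-based dropping
might look roughly like the sketch below, written here as a Java map. The connector class
is a placeholder, and the "headers.patterns" property name follows Mickael's suggestion in
the quoted message; nothing here is final until the KIP is accepted.)

    import java.util.HashMap;
    import java.util.Map;

    public class DropHeadersByPatternExample {
        public static Map<String, String> connectorConfig() {
            Map<String, String> config = new HashMap<>();
            config.put("name", "example-sink");                                 // placeholder
            config.put("connector.class", "org.example.ExampleSinkConnector");  // placeholder
            config.put("transforms", "dropInternalHeaders");
            config.put("transforms.dropInternalHeaders.type",
                    "org.apache.kafka.connect.transforms.DropHeaders");
            // Existing behaviour: drop explicitly named headers.
            config.put("transforms.dropInternalHeaders.headers", "b3,traceparent");
            // Proposed by KIP-991 (property name assumed): drop headers matching these regexes.
            config.put("transforms.dropInternalHeaders.headers.patterns", "x-internal-.*,trace.*");
            return config;
        }
    }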

Am Do., 11. Jan. 2024 um 18:27 Uhr schrieb Mickael Maison <
mickael.mai...@gmail.com>:

> Hi Roman,
>
> Thanks for the updates, this looks much better.
>
> Just a couple of small comments:
> - The type of the field is listed as "boolean". I think it should be
> string (or list)
> - Should the field be named "headers.patterns" instead of
> "headers.pattern" since it accepts a list of patterns?
>
> Thanks,
> Mickael
>
> On Thu, Jan 11, 2024 at 12:56 PM Roman Schmitz 
> wrote:
> >
> > Hi Mickael,
> > Hi all,
> >
> > Thanks for the feedback!
> > I have adapted the KIP description - actually much shorter and just
> > reflecting the general functionality and interface/configuration changes.
> >
> > Kindly let me know if you have any comments, questions, or suggestions
> for
> > this KIP!
> >
> > Thanks,
> > Roman
> >
> > Am Fr., 5. Jan. 2024 um 17:36 Uhr schrieb Mickael Maison <
> > mickael.mai...@gmail.com>:
> >
> > > Hi Roman,
> > >
> > > Thanks for the KIP! This would be a useful improvement.
> > >
> > > Ideally you want to make a concrete proposal in the KIP instead of
> > > listing a series of options. Currently the KIP seems to list two
> > > alternatives.
> > >
> > > Also a KIP focuses on the API changes rather than on the pure
> > > implementation. It seems you're proposing adding a configuration to
> > > the DropHeaders SMT. It would be good to describe that new
> > > configuration. For example see KIP-911 which also added a
> > > configuration.
> > >
> > > Thanks,
> > > Mickael
> > >
> > > On Mon, Oct 16, 2023 at 9:50 AM Roman Schmitz  >
> > > wrote:
> > > >
> > > > Hi Andrew,
> > > >
> > > > Ok, thanks for the feedback! I added a few more details and code
> examples
> > > > to explain the proposed changes.
> > > >
> > > > Thanks,
> > > > Roman
> > > >
> > > > Am So., 15. Okt. 2023 um 22:12 Uhr schrieb Andrew Schofield <
> > > > andrew_schofield_j...@outlook.com>:
> > > >
> > > > > Hi Roman,
> > > > > Thanks for the KIP. I think it’s an interesting idea, but I think
> the
> > > KIP
> > > > > document needs some
> > > > > more details added before it’s ready for review. For example,
> here’s a
> > > KIP
> > > > > in the same
> > > > > area which was delivered in an earlier version of Kafka. I think
> this
> > > is a
> > > > > good KIP to copy
> > > > > for a suitable level of detail and description (
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-585%3A+Filter+and+Conditional+SMTs
> > > > > ).
> > > > >
> > > > > Hope this helps.
> > > > >
> > > > > Thanks,
> > > > > Andrew
> > > > >
> > > > > > On 15 Oct 2023, at 21:02, Roman Schmitz  >
> > > wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > While working with different customers I came across the case
> several
> > > > > times
> > > > > > that we'd like to not only explicitly remove headers by name but
> by
> > > > > pattern
> > > > > > / regexp. Here is a KIP for this feature!
> > > > > >
> > > > > > Please let me know if you have any comments, questions, or
> > > suggestions!
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/x/oYtEE
> > > > > >
> > > > > > Thanks,
> > > > > > Roman
> > > > >
> > > > >
> > >
>


Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2565

2024-01-12 Thread Apache Jenkins Server
See 




[jira] [Created] (KAFKA-16118) Coordinator unloading fails when replica is deleted

2024-01-12 Thread David Jacot (Jira)
David Jacot created KAFKA-16118:
---

 Summary: Coordinator unloading fails when replica is deleted
 Key: KAFKA-16118
 URL: https://issues.apache.org/jira/browse/KAFKA-16118
 Project: Kafka
  Issue Type: Sub-task
Reporter: David Jacot
Assignee: David Jacot


The new group coordinator always expects the leader epoch to be received when 
it must unload the metadata for a partition. However, in KRaft, the leader 
epoch is not passed when the replica is deleted (e.g. after reassignment).
{noformat}
java.lang.IllegalArgumentException: The leader epoch should always be provided 
in KRaft.
    at 
org.apache.kafka.coordinator.group.GroupCoordinatorService.onResignation(GroupCoordinatorService.java:931)
    at 
kafka.server.metadata.BrokerMetadataPublisher.$anonfun$onMetadataUpdate$9(BrokerMetadataPublisher.scala:200)
    at 
kafka.server.metadata.BrokerMetadataPublisher.$anonfun$onMetadataUpdate$9$adapted(BrokerMetadataPublisher.scala:200)
    at 
kafka.server.metadata.BrokerMetadataPublisher.$anonfun$updateCoordinator$4(BrokerMetadataPublisher.scala:397)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at 
kafka.server.metadata.BrokerMetadataPublisher.updateCoordinator(BrokerMetadataPublisher.scala:396)
    at 
kafka.server.metadata.BrokerMetadataPublisher.$anonfun$onMetadataUpdate$7(BrokerMetadataPublisher.scala:200)
    at 
kafka.server.metadata.BrokerMetadataPublisher.onMetadataUpdate(BrokerMetadataPublisher.scala:186)
    at 
org.apache.kafka.image.loader.MetadataLoader.maybePublishMetadata(MetadataLoader.java:382)
    at 
org.apache.kafka.image.loader.MetadataBatchLoader.applyDeltaAndUpdate(MetadataBatchLoader.java:286)
    at 
org.apache.kafka.image.loader.MetadataBatchLoader.maybeFlushBatches(MetadataBatchLoader.java:222)
    at 
org.apache.kafka.image.loader.MetadataLoader.lambda$handleCommit$1(MetadataLoader.java:406)
    at 
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
    at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
    at 
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
    at java.base/java.lang.Thread.run(Thread.java:1583)
    at 
org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66){noformat}
The side effect of this bug is that group coordinator loading/unloading fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] KIP-1005: Expose EarliestLocalOffset and TieredOffset

2024-01-12 Thread Federico Valeri
+1 non binding

Thanks

On Fri, Jan 12, 2024 at 1:31 AM Boudjelda Mohamed Said
 wrote:
>
> +1 (binding)
>
>
> On Fri, Jan 12, 2024 at 1:21 AM Satish Duggana 
> wrote:
>
> > +1 (binding)
> >
> > Thanks,
> > Satish.
> >
> > On Thu, 11 Jan 2024 at 17:52, Divij Vaidya 
> > wrote:
> > >
> > > +1 (binding)
> > >
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Tue, Dec 26, 2023 at 7:05 AM Kamal Chandraprakash <
> > > kamal.chandraprak...@gmail.com> wrote:
> > >
> > > > +1 (non-binding). Thanks for the KIP!
> > > >
> > > > --
> > > > Kamal
> > > >
> > > > On Thu, Dec 21, 2023 at 2:23 PM Christo Lolov 
> > > > wrote:
> > > >
> > > > > Heya all!
> > > > >
> > > > > KIP-1005 (
> > > > >
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1005%3A+Expose+EarliestLocalOffset+and+TieredOffset
> > > > > )
> > > > > has been open for around a month with no further comments - I would
> > like
> > > > to
> > > > > start a voting round on it!
> > > > >
> > > > > Best,
> > > > > Christo
> > > > >
> > > >
> >


Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Andrew Schofield
I also agree with 2 & 2 with reasoning along the same lines as Artem.

Thanks,
Andrew

> On 12 Jan 2024, at 09:15, Federico Valeri  wrote:
> 
> On Thu, Jan 11, 2024 at 10:43 PM Artem Livshits
>  wrote:
>> 
>> Hi Proven,
>> 
>> I'd say that we should do 2 & 2.  The idea is that for small features that
>> can be done and stabilized within a short period of time (with one or very
>> few commits) that's exactly what happens -- people interested in testing
>> in-progress feature could take unstable code from a patch (or private
>> branch / fork) with the expectation that that private code could create a
>> state that will not be compatible with anything (or may be completely
>> broken for that matter -- in the end of the day it's a functionality that
>> may not be fully tested or even fully implemented); and once the feature is
>> stable it goes to trunk it is fully committed there, if the bugs are found
>> they'd get fixed "forward".
> 
> I agree with this reasoning.
> 
>> The 2 & 2 option pretty much extends this to
>> large features -- if a feature is above stable MV, then going above it is
>> like getting some in-progress code for early testing with the expectation
>> that something may not fully work or may not leave the system in an upgradable state;
> 
> Usually I expect that an early access feature may not fully work, but
> not that it could affect upgrades. I think this is less obvious,
> that's why I asked to document clearly.
> 
>> promoting a feature into a state MV would come with the expectation that
>> the feature gets fully committed and any bugs will be fixed "forward".
>> 
>> -Artem
>> 
>> On Thu, Jan 11, 2024 at 10:16 AM Proven Provenzano
>>  wrote:
>> 
>>> We have two approaches here for how we update unstable metadata versions.
>>> 
>>>   1. The update will only increase MVs of unstable features to a value
>>>   greater than the new stable feature. The idea is that a specific
>>> unstable
>>>   MV may support some set of features and in the future that set is
>>> always a
>>>   strict subset of the current set. The issue is that moving a feature to
>>>   make way for a stable feature with a higher MV will leave holes.
>>>   2. We are free to reorder the MV for any unstable feature. This removes
>>>   the hole issue, but does make the unstable MVs more muddled. There isn't
>>>   the same binary state for a MV where a feature is available or there is
>>> a
>>>   hole.
>>> 
>>> 
>>> We also have two ends of the spectrum as to when we update the stable MV.
>>> 
>>>   1. We update at release points which reduces the amount of churn of the
>>>   unstable MVs and makes a stronger correlation between accepted features
>>> and
>>>   stable MVs for a release but means less testing on trunk as a stable MV.
>>>   2. We update when the developers of a feature think it is done. This
>>>   leads to features being available for more testing in trunk but forces
>>> the
>>>   next release to include it as stable.
>>> 
>>> 
>>> I'd like more feedback from others on these two dimensions.
>>> --Proven
>>> 
>>> 
>>> 
>>> On Wed, Jan 10, 2024 at 12:16 PM Justine Olshan
>>>  wrote:
>>> 
 Hmm it seems like Colin and Proven are disagreeing with whether we can
>>> swap
 unstable metadata versions.
 
> When we reorder, we are always allocating a new MV and we are never
 reusing an existing MV even if it was also unstable.
 
> Given that this is true, there's no reason to have special rules about
 what we can and can't do with unstable MVs. We can do anything
 
 I don't have a strong preference either way, but I think we should agree
>>> on
 one approach.
 The benefit of reordering and reusing is that we can release features
>>> that
 are ready earlier and we have more flexibility. With the approach where
>>> we
 always create a new MV, I am concerned with having many "empty" MVs. This
 would encourage waiting until the release before we decide an incomplete
 feature is not ready and moving its MV into the future. (The
 abandoning comment I made earlier -- that is consistent with Proven's
 approach)
 
 I think the only potential issue with reordering is that it could be a
>>> bit
 confusing and *potentially *prone to errors. Note I say potentially
>>> because
 I think it depends on folks' understanding with this new unstable
>>> metadata
 version concept. I echo Federico's comments about making sure the risks
>>> are
 highlighted.
 
 Thanks,
 
 Justine
 
 On Wed, Jan 10, 2024 at 1:16 AM Federico Valeri 
 wrote:
 
> Hi folks,
> 
>> If you use an unstable MV, you probably won't be able to upgrade your
> software. Because whenever something changes, you'll probably get
> serialization exceptions being thrown inside the controller. Fatal
>>> ones.
> 
> Thanks for this clarification. I think this concrete risk should be
> highlighted in the KIP and in the 

Re: [DISCUSS] KIP-1014: Managing Unstable Metadata Versions in Apache Kafka

2024-01-12 Thread Federico Valeri
On Thu, Jan 11, 2024 at 10:43 PM Artem Livshits
 wrote:
>
> Hi Proven,
>
> I'd say that we should do 2 & 2.  The idea is that for small features that
> can be done and stabilized within a short period of time (with one or very
> few commits) that's exactly what happens -- people interested in testing
> in-progress feature could take unstable code from a patch (or private
> branch / fork) with the expectation that that private code could create a
> state that will not be compatible with anything (or may be completely
> broken for that matter -- in the end of the day it's a functionality that
> may not be fully tested or even fully implemented); and once the feature is
> stable it goes to trunk it is fully committed there, if the bugs are found
> they'd get fixed "forward".

I agree with this reasoning.

> The 2 & 2 option pretty much extends this to
> large features -- if a feature is above stable MV, then going above it is
> like getting some in-progress code for early testing with the expectation
> that something may not fully work or may not leave the system in an upgradable state;

Usually I expect that an early access feature may not fully work, but
not that it could affect upgrades. I think this is less obvious,
that's why I asked to document clearly.

> promoting a feature into a state MV would come with the expectation that
> the feature gets fully committed and any bugs will be fixed "forward".
>
> -Artem
>
> On Thu, Jan 11, 2024 at 10:16 AM Proven Provenzano
>  wrote:
>
> > We have two approaches here for how we update unstable metadata versions.
> >
> >1. The update will only increase MVs of unstable features to a value
> >greater than the new stable feature. The idea is that a specific
> > unstable
> >MV may support some set of features and in the future that set is
> > always a
> >strict subset of the current set. The issue is that moving a feature to
> >make way for a stable feature with a higher MV will leave holes.
> >2. We are free to reorder the MV for any unstable feature. This removes
> >the hole issue, but does make the unstable MVs more muddled. There isn't
> >the same binary state for a MV where a feature is available or there is
> > a
> >hole.
> >
> >
> > We also have two ends of the spectrum as to when we update the stable MV.
> >
> >1. We update at release points which reduces the amount of churn of the
> >unstable MVs and makes a stronger correlation between accepted features
> > and
> >stable MVs for a release but means less testing on trunk as a stable MV.
> >2. We update when the developers of a feature think it is done. This
> >leads to features being available for more testing in trunk but forces
> > the
> >next release to include it as stable.
> >
> >
> > I'd like more feedback from others on these two dimensions.
> > --Proven
> >
> >
> >
> > On Wed, Jan 10, 2024 at 12:16 PM Justine Olshan
> >  wrote:
> >
> > > Hmm it seems like Colin and Proven are disagreeing with whether we can
> > swap
> > > unstable metadata versions.
> > >
> > > >  When we reorder, we are always allocating a new MV and we are never
> > > reusing an existing MV even if it was also unstable.
> > >
> > > > Given that this is true, there's no reason to have special rules about
> > > what we can and can't do with unstable MVs. We can do anything
> > >
> > > I don't have a strong preference either way, but I think we should agree
> > on
> > > one approach.
> > > The benefit of reordering and reusing is that we can release features
> > that
> > > are ready earlier and we have more flexibility. With the approach where
> > we
> > > always create a new MV, I am concerned with having many "empty" MVs. This
> > > would encourage waiting until the release before we decide an incomplete
> > > feature is not ready and moving its MV into the future. (The
> > > abandoning comment I made earlier -- that is consistent with Proven's
> > > approach)
> > >
> > > I think the only potential issue with reordering is that it could be a
> > bit
> > > confusing and *potentially *prone to errors. Note I say potentially
> > because
> > > I think it depends on folks' understanding with this new unstable
> > metadata
> > > version concept. I echo Federico's comments about making sure the risks
> > are
> > > highlighted.
> > >
> > > Thanks,
> > >
> > > Justine
> > >
> > > On Wed, Jan 10, 2024 at 1:16 AM Federico Valeri 
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > > If you use an unstable MV, you probably won't be able to upgrade your
> > > > software. Because whenever something changes, you'll probably get
> > > > serialization exceptions being thrown inside the controller. Fatal
> > ones.
> > > >
> > > > Thanks for this clarification. I think this concrete risk should be
> > > > highlighted in the KIP and in the "unstable.metadata.versions.enable"
> > > > documentation.
> > > >
> > > > In the test plan, should we also have one system test checking that
> > > > "features with a 

Re: DISCUSS KIP-1011: Use incrementalAlterConfigs when updating broker configs by kafka-configs.sh

2024-01-12 Thread Luke Chen
Hi Ziming,

Thanks for the KIP!
LGTM!
Using incremental by default and falling back automatically if it's not
supported is a good idea!

One minor comment:
1. so I'm inclined to move it to incrementalAlterConfigs  and "provide a
flag" to still use alterConfigs  for new client to interact with old
servers.
I don't think we will "provide any flag" after the discussion. We should
remove it.

Thanks.
Luke
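
(A rough sketch of that fallback behaviour using the public Admin API is below. This is not
the kafka-configs.sh implementation; the broker id, bootstrap address, and the config entry
being set are placeholders. For list-type configs such as
leader.replication.throttled.replicas, AlterConfigOp.OpType.APPEND and SUBTRACT are only
available on the incremental path.)

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;
    import org.apache.kafka.common.errors.UnsupportedVersionException;

    public class AlterBrokerConfigSketch {
        public static void main(String[] args) throws Exception {
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            ConfigEntry entry = new ConfigEntry("log.cleaner.threads", "2");
            Map<ConfigResource, Collection<AlterConfigOp>> ops =
                    Map.of(broker, List.of(new AlterConfigOp(entry, AlterConfigOp.OpType.SET)));

            Properties adminProps = new Properties();
            adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(adminProps)) {
                try {
                    // Preferred path: only the listed entries are changed (brokers >= 2.3).
                    admin.incrementalAlterConfigs(ops).all().get();
                } catch (ExecutionException e) {
                    if (e.getCause() instanceof UnsupportedVersionException) {
                        // Older broker: fall back to the legacy full-replace alterConfigs call.
                        admin.alterConfigs(Map.of(broker, new Config(List.of(entry)))).all().get();
                    } else {
                        throw e;
                    }
                }
            }
        }
    }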

On Fri, Jan 12, 2024 at 12:29 PM ziming deng 
wrote:

> Thank you for your clarification, Chris,
>
> I have spent some time reviewing KIP-894 and I think its automatic way is
> better and brings no side effects, and I will also adopt this way here.
> As you mentioned, the changes in semantics is minor, the most important
> reason for this change is fixing bug brought by sensitive configs.
>
>
> >  We
> > don't appear to support appending/subtracting from list properties via
> the
> > CLI for any other entity type right now,
> You are right about this. I tried and found that we can't subtract or
> append configs via the CLI, so I will change the KIP to say "making way
> for appending/subtracting list properties".
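
For context: appending to or subtracting from a list-type config is
already possible at the Java Admin API level, just not through the CLI.
Below is a minimal, illustrative sketch, assuming an existing Admin
instance named "admin" and using the topic-level
"leader.replication.throttled.replicas" config cited in the KIP (the topic
name is made up):

import java.util.Collections;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Append one partition:broker pair to the throttled-replicas list without
// reading or resending the rest of the list.
ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
AlterConfigOp appendOp = new AlterConfigOp(
        new ConfigEntry("leader.replication.throttled.replicas", "0:1"),
        AlterConfigOp.OpType.APPEND);
admin.incrementalAlterConfigs(
        Collections.singletonMap(topic, Collections.singletonList(appendOp)))
     .all().get();

Exposing APPEND/SUBTRACT through kafka-configs.sh itself would still need
new CLI syntax, which this KIP leaves as future work.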
>
> --
> Best,
> Ziming
>
> > On Jan 6, 2024, at 01:34, Chris Egerton  wrote:
> >
> > Hi all,
> >
> > Can we clarify any changes in the user-facing semantics for the CLI tool
> > that would come about as a result of this KIP? I think the debate over
> the
> > necessity of an opt-in flag, or waiting for 4.0.0, ultimately boils down
> to
> > this.
> >
> > My understanding is that the only changes in semantics are fairly minor
> > (semantic versioning pun intended):
> >
> > - Existing sensitive broker properties no longer have to be explicitly
> > specified on the command line if they're not being changed
> > - A small race condition is fixed where the broker config is updated by a
> > separate operation in between when the CLI reads the existing broker
> config
> > and writes the new broker config
> > - Usage of a new broker API that has been supported since version 2.3.0,
> > but which does not require any new ACLs and does not act any differently
> > apart from the two small changes noted above
> >
> > If this is correct, then I'm inclined to agree with Ismael's suggestion
> of
> > starting with incrementalAlterConfigs, and falling back on alterConfigs
> if
> > the former is not supported by the broker, and do not believe it's
> > necessary to wait for 4.0.0 or provide opt-in or opt-out flags to release
> > this change. This would also be similar to changes we made to
> MirrorMaker 2
> > in KIP-894 [1], where the default behavior for syncing topic configs is
> now
> > to start with incrementalAlterConfigs and fall back on alterConfigs if
> it's
> > not supported.
> >
> > If there are other, more significant changes to the user-facing semantics
> > for the CLI, then these should be called out here and in the KIP, and we
> > might consider a more cautious approach.
> >
> > [1] -
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-894%3A+Use+incrementalAlterConfigs+API+for+syncing+topic+configurations
> >
> >
> > Also, regarding this part of the KIP:
> >
> >> incrementalAlterConfigs is more convenient especially for updating
> > configs of list data type, such as
> "leader.replication.throttled.replicas"
> >
> > While this is true for the Java admin client and the corresponding broker
> > APIs, it doesn't appear to be relevant to the kafka-configs.sh CLI tool.
> We
> > don't appear to support appending/subtracting from list properties via
> the
> > CLI for any other entity type right now, and there's nothing in the KIP
> > that leads me to believe we'd be adding it for broker configs.
> >
> > Cheers,
> >
> > Chris
> >
> > On Thu, Jan 4, 2024 at 10:12 PM ziming deng
> > wrote:
> >
> >> Hi Ismael,
> >> I added the automatic approach to “Rejected alternatives” out of
> >> concern that we need to unify the semantics between alterConfigs and
> >> incrementalAlterConfigs, so I chose to give this choice to users.
> >>
> >> After reviewing the code and doing some tests, I found that they
> >> follow a similar approach, so I think the simplest way is to let the
> >> client choose the best method heuristically.
> >>
> >> Thank you for pointing this out; I will change the KIP later.
> >>
> >> Best,
> >> Ziming
> >>
> >>> On Jan 4, 2024, at 17:28, Ismael Juma  wrote:
> >>>
> >>> Hi Ziming,
> >>>
> >>> Why is the flag required at all? Can we use incremental and fall back
> >>> automatically if it's not supported by the broker? At this point, the
> >>> vast majority of clusters should support it.
> >>>
> >>> Ismael
> >>>
> >>> On Mon, Dec 18, 2023 at 7:58 PM ziming deng wrote:
> 
>  Hello, I want to start a discussion on KIP-1011, to unify the broker
>  config change path with that of user/topic/client-metrics and avoid
>  some bugs.
> 
>  Here is the link:
> 
>  KIP-1011: Use incrementalAlterConfigs when updating 

[jira] [Resolved] (KAFKA-15738) KRaft support in ConsumerWithLegacyMessageFormatIntegrationTest

2024-01-12 Thread Manikumar (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-15738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar resolved KAFKA-15738.
---
Fix Version/s: 3.8.0
   Resolution: Fixed

> KRaft support in ConsumerWithLegacyMessageFormatIntegrationTest
> ---
>
> Key: KAFKA-15738
> URL: https://issues.apache.org/jira/browse/KAFKA-15738
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Sameer Tejani
>Assignee: Abhinav Dixit
>Priority: Minor
>  Labels: kraft, kraft-test, newbie
> Fix For: 3.8.0
>
>
> The following tests in ConsumerWithLegacyMessageFormatIntegrationTest in 
> core/src/test/scala/integration/kafka/api/ConsumerWithLegacyMessageFormatIntegrationTest.scala
>  need to be updated to support KRaft
> 0 : def testOffsetsForTimes(): Unit = {
> 102 : def testEarliestOrLatestOffsets(): Unit = {
> Scanned 132 lines. Found 0 KRaft tests out of 2 tests


