[jira] [Created] (KAFKA-16541) Potential leader epoch checkpoint file corruption on OS crash

2024-04-11 Thread Haruki Okada (Jira)
Haruki Okada created KAFKA-16541:


 Summary: Potential leader epoch checkpoint file corruption on OS 
crash
 Key: KAFKA-16541
 URL: https://issues.apache.org/jira/browse/KAFKA-16541
 Project: Kafka
  Issue Type: Bug
  Components: core
Reporter: Haruki Okada
Assignee: Haruki Okada


Pointed out by [~junrao] on 
[GitHub|https://github.com/apache/kafka/pull/14242#discussion_r1556161125]

[A patch for KAFKA-15046|https://github.com/apache/kafka/pull/14242] removed the 
fsync of the leader-epoch checkpoint file in some paths for performance reasons.

However, since the checkpoint file is now flushed to the device asynchronously by 
the OS, its content could become corrupted if the OS crashes suddenly (e.g. due to a 
power failure or kernel panic) in the middle of the flush.

A corrupted checkpoint file could prevent the Kafka broker from starting up.
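
For illustration, a minimal sketch (hypothetical helper class, not Kafka's actual 
CheckpointFile implementation) of the usual write-temp-file / fsync / atomic-rename 
pattern that keeps a checkpoint readable even if the OS crashes mid-flush:

{code:java}
// Sketch only: hypothetical helper, not Kafka's CheckpointFile implementation.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class AtomicCheckpointWriter {
    public static void write(Path checkpoint, List<String> lines) throws IOException {
        Path tmp = checkpoint.resolveSibling(checkpoint.getFileName() + ".tmp");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            channel.force(true); // fsync the temp file before it becomes visible
        }
        // Atomic rename: readers see either the old checkpoint or the fully written new one.
        // (A fully durable version would also fsync the parent directory after the rename.)
        Files.move(tmp, checkpoint, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}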



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16540) Update partitions when the min isr config is updated.

2024-04-11 Thread Calvin Liu (Jira)
Calvin Liu created KAFKA-16540:
--

 Summary: Update partitions when the min isr config is updated.
 Key: KAFKA-16540
 URL: https://issues.apache.org/jira/browse/KAFKA-16540
 Project: Kafka
  Issue Type: Sub-task
Reporter: Calvin Liu
Assignee: Calvin Liu


If the min isr config is changed, we need to update the partitions accordingly. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15586) Clean shutdown detection, server side

2024-04-11 Thread Calvin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-15586.

Resolution: Fixed

> Clean shutdown detection, server side
> -
>
> Key: KAFKA-15586
> URL: https://issues.apache.org/jira/browse/KAFKA-15586
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> Upon broker registration, if the broker had an unclean shutdown, it 
> should be removed from all ELRs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16539) Can't update specific broker configs in pre-migration mode

2024-04-11 Thread David Arthur (Jira)
David Arthur created KAFKA-16539:


 Summary: Can't update specific broker configs in pre-migration mode
 Key: KAFKA-16539
 URL: https://issues.apache.org/jira/browse/KAFKA-16539
 Project: Kafka
  Issue Type: Bug
  Components: config, kraft
Affects Versions: 3.6.2, 3.6.1, 3.7.0, 3.6.0
Reporter: David Arthur
Assignee: David Arthur
 Fix For: 3.8.0, 3.7.1, 3.6.3


In migration mode, ZK brokers will have a forwarding manager configured. This 
is used to forward requests to the KRaft controller once we get to that part of 
the migration. However, prior to KRaft taking over as the controller (known as 
pre-migration mode), the ZK brokers are still attempting to forward 
IncrementalAlterConfigs to the controller.

This works fine for cluster level configs (e.g., "--entity-type broker 
--entity-default"), but this fails for specific broker configs (e.g., 
"--entity-type broker --entity-id 1").

This affects BROKER and BROKER_LOGGER config types.

To work around this bug, you can either disable migrations on the brokers 
(assuming no migration has taken place), or proceed with the migration and get 
to the point where KRaft is the controller.
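
For reference, a minimal sketch of the two AdminClient calls involved (the bootstrap 
address and config value are placeholders); the per-broker variant is the one that 
fails in pre-migration mode:

{code:java}
// Sketch: bootstrap address and config value are placeholders.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class BrokerConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry("log.cleaner.threads", "2"), AlterConfigOp.OpType.SET);

            // Cluster-level default ("--entity-type broker --entity-default" in the report):
            // works in pre-migration mode.
            ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");
            Map<ConfigResource, Collection<AlterConfigOp>> defaults = Map.of(clusterDefault, List.of(op));
            admin.incrementalAlterConfigs(defaults).all().get();

            // Specific broker ("--entity-type broker --entity-id 1" in the report):
            // fails in pre-migration mode.
            ConfigResource broker1 = new ConfigResource(ConfigResource.Type.BROKER, "1");
            Map<ConfigResource, Collection<AlterConfigOp>> perBroker = Map.of(broker1, List.of(op));
            admin.incrementalAlterConfigs(perBroker).all().get();
        }
    }
}
{code}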



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16538) UpdateFeatures for kraft.version

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16538:
--

 Summary: UpdateFeatures for kraft.version
 Key: KAFKA-16538
 URL: https://issues.apache.org/jira/browse/KAFKA-16538
 Project: Kafka
  Issue Type: Sub-task
Reporter: José Armando García Sancio


Should:
 # Route request to cluster metadata kraft client.
 # KRaft leader should check the supported version of all voters and observers
 ## voter information comes from VoterSet
 ## observer information is pushed down to kraft by the metadata controller
 # Persist both the kraft.version and voter set in one control batch

We need to allow the kraft.version update to succeed while the metadata controller 
changes may fail. This is needed because there will be two batches for these 
updates: one control record batch that includes the kraft.version and voter set, 
and one metadata batch that includes the feature records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16537) RemoveVoter RPC and request handling

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16537:
--

 Summary: RemoveVoter RPC and request handling
 Key: KAFKA-16537
 URL: https://issues.apache.org/jira/browse/KAFKA-16537
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


Implement removing a voter from the voter set. That includes RemoveVoter 
request handling and shrinking the voter set via the internal listener, as well as 
triggering a new election when the leader is removed (by calling the EndQuorumEpoch RPC).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16536) Use BeginQuorumEpoch as the leader's heartbeat

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16536:
--

 Summary: Use BeginQuorumEpoch as the leader's heartbeat
 Key: KAFKA-16536
 URL: https://issues.apache.org/jira/browse/KAFKA-16536
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio


Change the leader's implementation of BeginQuorumEpoch to behave more like 
a heartbeat mechanism. The first implementation can simply send the request at 
a fixed interval (half the fetch timeout).

Future implementations can conserve resources by tracking fetch timeouts per 
remote voter.
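
A minimal sketch of the proposed timing rule (class and method names are illustrative, 
not the actual KafkaRaftClient code): the leader resends BeginQuorumEpoch once half 
the fetch timeout has elapsed since the last send.

{code:java}
// Illustrative only: hypothetical names, not the actual KafkaRaftClient implementation.
final class LeaderHeartbeatTimer {
    private final long fetchTimeoutMs;
    private long lastSentMs;

    LeaderHeartbeatTimer(long fetchTimeoutMs, long nowMs) {
        this.fetchTimeoutMs = fetchTimeoutMs;
        this.lastSentMs = nowMs;
    }

    // Resend BeginQuorumEpoch once half the fetch timeout has elapsed, so a
    // quiet follower never reaches its fetch timeout while the leader is alive.
    boolean shouldSendBeginQuorumEpoch(long nowMs) {
        return nowMs - lastSentMs >= fetchTimeoutMs / 2;
    }

    void onBeginQuorumEpochSent(long nowMs) {
        lastSentMs = nowMs;
    }
}
{code}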



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16535) AddVoter RPC and request handling

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16535:
--

 Summary: AddVoter RPC and request handling
 Key: KAFKA-16535
 URL: https://issues.apache.org/jira/browse/KAFKA-16535
 Project: Kafka
  Issue Type: Sub-task
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


Implement adding a voter to the voter set. That includes AddVoter request 
handling and growing the voter set via the internal listener, as well as sending 
BeginQuorumEpoch to the new voter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16534) Sending UpdateVoter request and response handling

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16534:
--

 Summary: Sending UpdateVoter request and response handling
 Key: KAFKA-16534
 URL: https://issues.apache.org/jira/browse/KAFKA-16534
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16533) UpdateVoter RPC and request handling

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16533:
--

 Summary: UpdateVoter RPC and request handling
 Key: KAFKA-16533
 URL: https://issues.apache.org/jira/browse/KAFKA-16533
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio


The request handling for the UpdateVoter request needs to support both kraft.version 0 
and 1.
 # When the kraft.version is 0, the leader should just update its in-memory state.
 # When the kraft.version is 1, the leader should write to the log and update 
its in-memory state (see the sketch below).
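
A rough, hypothetical sketch of that branching (the class and field names are invented 
for illustration and do not correspond to actual Kafka classes):

{code:java}
// Hypothetical sketch of the version-dependent handling; not actual Kafka code.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UpdateVoterSketch {
    record Voter(int id, String endpoint) {}

    private final Map<Integer, Voter> inMemoryVoters = new HashMap<>();
    private final List<Map<Integer, Voter>> log = new ArrayList<>(); // stand-in for voters control records

    void handleUpdateVoter(short kraftVersion, Voter update) {
        if (kraftVersion == 0) {
            // kraft.version 0: the voter set is static, so only the in-memory
            // endpoint information is refreshed; nothing is written to the log.
            inMemoryVoters.put(update.id(), update);
        } else {
            // kraft.version 1: append the updated voter set to the log first,
            // then update the in-memory state, so the change survives restarts.
            Map<Integer, Voter> updated = new HashMap<>(inMemoryVoters);
            updated.put(update.id(), update);
            log.add(Map.copyOf(updated)); // stand-in for appending a voters control record
            inMemoryVoters.put(update.id(), update);
        }
    }
}
{code}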



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16532) Support for first leader bootstrapping the voter set

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16532:
--

 Summary: Support for first leader bootstrapping the voter set
 Key: KAFKA-16532
 URL: https://issues.apache.org/jira/browse/KAFKA-16532
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio


Need to include the following changes:
 # replicas can read the bootstrapping checkpoint at 0-0.checkpoint
 # the leader will write the set of voters and kraft.version to the log if it 
hasn't been done before. The leader should write these two control records in one 
batch if possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16531) Fix check-quorum calculation to not assume that the leader is in the voter set

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16531:
--

 Summary: Fix check-quorum calculation to not assume that the leader 
is in the voter set
 Key: KAFKA-16531
 URL: https://issues.apache.org/jira/browse/KAFKA-16531
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


In the check-quorum calculation, the leader should not assume that it is part 
of the voter set. This may happen when the leader is removing itself from the 
voter set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16530) Fix high-watermark calculation to not assume the leader is in the voter set

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16530:
--

 Summary: Fix high-watermark calculation to not assume the leader is 
in the voter set
 Key: KAFKA-16530
 URL: https://issues.apache.org/jira/browse/KAFKA-16530
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


When the leader is being removed from the voter set, the leader may not be in 
the voter set. This means that kraft should not assume that the leader is part 
of the high-watermark calculation.
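
A simplified sketch of the idea (not the actual LeaderState logic): the high watermark 
is the offset replicated by a majority of the current voters, and the leader's own log 
end offset only counts if the leader is still in the voter set.

{code:java}
// Simplified illustration; not Kafka's actual LeaderState high-watermark code.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

final class HighWatermarkSketch {
    /**
     * voterEndOffsets maps voterId -> replicated end offset for every current voter.
     * The leader appears in this map only if it is still in the voter set, so it is
     * never implicitly assumed to be part of the majority calculation.
     */
    static OptionalLong highWatermark(Map<Integer, Long> voterEndOffsets) {
        if (voterEndOffsets.isEmpty()) {
            return OptionalLong.empty();
        }
        List<Long> offsets = new ArrayList<>(voterEndOffsets.values());
        offsets.sort(Comparator.reverseOrder());
        int majority = voterEndOffsets.size() / 2 + 1;
        // The highest offset that at least a majority of voters have replicated.
        return OptionalLong.of(offsets.get(majority - 1));
    }
}
{code}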



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16529) Response handling and request sending for voters RPCs

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16529:
--

 Summary: Response handling and request sending for voters RPCs
 Key: KAFKA-16529
 URL: https://issues.apache.org/jira/browse/KAFKA-16529
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


Implement response handling and request sending for the following RPCs:
 # Vote
 # BeginQuorumEpoch
 # Fetch



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16528) Reset member heartbeat interval when request sent

2024-04-11 Thread Lianet Magrans (Jira)
Lianet Magrans created KAFKA-16528:
--

 Summary: Reset member heartbeat interval when request sent
 Key: KAFKA-16528
 URL: https://issues.apache.org/jira/browse/KAFKA-16528
 Project: Kafka
  Issue Type: Task
  Components: clients, consumer
Reporter: Lianet Magrans
Assignee: Lianet Magrans
 Fix For: 3.8.0


The member should reset the heartbeat timer when the request is sent, rather than 
when a response is received. This aims to ensure that any delay in the response is 
not added on top of the interval. With this, we better respect the contract of 
members sending a HB on the interval to remain in the group, and avoid potentially 
unwanted rebalances.

Note that there is already logic in place to avoid sending a request if a 
response hasn't been received. That ensures that, even with the interval being 
reset on send, the next HB will only be sent once the response is received 
(it will go out on the next poll of the HB manager, respecting the minimal 
backoff for consecutive requests). This is also consistent with how interval 
timing and in-flight requests are handled for auto-commits.
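
A small sketch of the timing change (field and method names are hypothetical, not the 
actual HeartbeatRequestManager): the interval restarts when the request is sent, and 
in-flight tracking prevents a new heartbeat until the previous response arrives.

{code:java}
// Illustrative sketch only; names are hypothetical, not the actual HeartbeatRequestManager.
final class HeartbeatIntervalSketch {
    private final long heartbeatIntervalMs;
    private long lastSentMs;
    private boolean inFlight = false;

    HeartbeatIntervalSketch(long heartbeatIntervalMs, long nowMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.lastSentMs = nowMs - heartbeatIntervalMs; // allow an immediate first heartbeat
    }

    boolean shouldSendHeartbeat(long nowMs) {
        // No new heartbeat while one is in flight, and only once the interval
        // (measured from the previous send, not the previous response) has elapsed.
        return !inFlight && nowMs - lastSentMs >= heartbeatIntervalMs;
    }

    void onHeartbeatSent(long nowMs) {
        lastSentMs = nowMs; // the interval restarts at send time
        inFlight = true;
    }

    void onHeartbeatResponse() {
        inFlight = false;   // the next heartbeat goes out on a later poll, respecting the interval
    }
}
{code}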



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16527) RPC changes and request handling

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16527:
--

 Summary: RPC changes and request handling
 Key: KAFKA-16527
 URL: https://issues.apache.org/jira/browse/KAFKA-16527
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


Make changes to the RPC for KIP-853 for the following RPCs:
 # Vote
 # BeginQuorumEpoch
 # Fetch

This must also include the handling of these requests. This change will allow 
followers (voters and observers) to discover the leader's endpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-11 Thread Justine Olshan
Updated. :)
Thanks for the reviews

Justine

On Thu, Apr 11, 2024 at 11:01 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP.
>
> Perhaps it's better to name the new config unstable.feature.versions.enable
> since there could be multiple unstable versions.
>
> Other than that, the KIP looks good to me.
>
> Jun
>
> On Thu, Apr 11, 2024 at 9:06 AM Justine Olshan
> 
> wrote:
>
> > The original config was never actually approved in any KIP. But we can
> say
> > it is deprecated.
> > I can change the config name.
> >
> > Justine
> >
> > On Thu, Apr 11, 2024 at 8:52 AM Jun Rao 
> wrote:
> >
> > > Hi, Justine,
> > >
> > > Thanks for the updated KIP.
> > >
> > > Would unstable.feature.version.enable be a clearer name? Also, should
> we
> > > remove/deprecate unstable.metadata.versions.enable in this KIP?
> > >
> > > Jun
> > >
> > > On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Makes sense to me. It seems like KIP-1014 has been inactive
> recently. I
> > > can
> > > > update my KIP and mention this change on that discussion thread.
> > > >
> > > > Justine
> > > >
> > > > On Tue, Apr 9, 2024 at 9:01 AM Jun Rao 
> > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > A single config makes sense to me too. We just need to reach
> > consensus
> > > > with
> > > > > KIP-1014.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > That's a good question. I think maybe for simplicity, we can
> have a
> > > > > single
> > > > > > config?
> > > > > > If that makes sense, I will update the KIP.
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao  >
> > > > wrote:
> > > > > >
> > > > > > > Hi, Justine,
> > > > > > >
> > > > > > > Thanks for the updated KIP.
> > > > > > >
> > > > > > > One more question related to KIP-1014. It introduced a new
> > > > > > > config unstable.metadata.versions.enable. Does each new feature
> > > need
> > > > to
> > > > > > > have a corresponding config to enable the testing of unstable
> > > > features
> > > > > or
> > > > > > > should we have a generic config enabling the testing of all
> > > unstable
> > > > > > > features?
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> > > > > >  > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I'm hoping this covers the majority of comments. I will go
> > ahead
> > > > and
> > > > > > open
> > > > > > > > the vote in the next day or so.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Justine
> > > > > > > >
> > > > > > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan <
> > > > jols...@confluent.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Find and replace has failed me :(
> > > > > > > > >
> > > > > > > > > Group version seems a little vague, but we can update it.
> > > > Hopefully
> > > > > > > find
> > > > > > > > > and replace won't fail me again, otherwise I will get
> another
> > > > email
> > > > > > on
> > > > > > > > this.
> > > > > > > > >
> > > > > > > > > Justine
> > > > > > > > >
> > > > > > > > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
> > > > > > >  > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Thanks, Justine.
> > > > > > > > >>
> > > > > > > > >> * Should we also use `group.version` (GV) as I suggested
> in
> > my
> > > > > > > previous
> > > > > > > > >> message in order to be consistent?
> > > > > > > > >> * Should we add both names to the `Public Interfaces`
> > section?
> > > > > > > > >> * There is still at least one usage of
> > > > > > `transaction.protocol.verison`
> > > > > > > in
> > > > > > > > >> the KIP too.
> > > > > > > > >>
> > > > > > > > >> Best,
> > > > > > > > >> David
> > > > > > > > >>
> > > > > > > > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > > > > > > > >> 
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >> > I had missed the David's message yesterday about the
> > naming
> > > > for
> > > > > > > > >> transaction
> > > > > > > > >> > version vs transaction protocol version.
> > > > > > > > >> >
> > > > > > > > >> > After some offline discussion with Jun, Artem, and
> David,
> > we
> > > > > > agreed
> > > > > > > > that
> > > > > > > > >> > transaction version is simpler and conveys more than
> just
> > > > > protocol
> > > > > > > > >> changes
> > > > > > > > >> > (flexible records for example)
> > > > > > > > >> >
> > > > > > > > >> > I will update the KIP as well as KIP-890
> > > > > > > > >> >
> > > > > > > > >> > Thanks,
> > > > > > > > >> > Justine
> > > > > > > > >> >
> > > > > > > > >> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan <
> > > > > > jols...@confluent.io
> > > > > > > >
> > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Updated!
> > > > > > > 

Re: [DISCUSS] KIP-1036: Extend RecordDeserializationException exception

2024-04-11 Thread Frédérik Rouleau
Hi everyone,

I have made some changes to take the comments into account. I have replaced
ConsumerRecord with Record. As that was not allowed by checkstyle, I have
modified its configuration. I hope that's OK.
I find this new version better. Thanks for your help.

Regards,
Fred


[jira] [Created] (KAFKA-16526) Change to quorum state for KIP-853

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16526:
--

 Summary: Change to quorum state for KIP-853
 Key: KAFKA-16526
 URL: https://issues.apache.org/jira/browse/KAFKA-16526
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


QuorumState needs to integrate with the internal partition listener. It needs to 
support reading and writing both versions. Some fields were removed, and the 
assumption that voters are static needs to be removed as well.

 

https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-Quorumstate



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-11 Thread Jun Rao
Hi, Justine,

Thanks for the updated KIP.

Perhaps it's better to name the new config unstable.feature.versions.enable
since there could be multiple unstable versions.

Other than that, the KIP looks good to me.

Jun

On Thu, Apr 11, 2024 at 9:06 AM Justine Olshan 
wrote:

> The original config was never actually approved in any KIP. But we can say
> it is deprecated.
> I can change the config name.
>
> Justine
>
> On Thu, Apr 11, 2024 at 8:52 AM Jun Rao  wrote:
>
> > Hi, Justine,
> >
> > Thanks for the updated KIP.
> >
> > Would unstable.feature.version.enable be a clearer name? Also, should we
> > remove/deprecate unstable.metadata.versions.enable in this KIP?
> >
> > Jun
> >
> > On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan
>  > >
> > wrote:
> >
> > > Hi Jun,
> > >
> > > Makes sense to me. It seems like KIP-1014 has been inactive recently. I
> > can
> > > update my KIP and mention this change on that discussion thread.
> > >
> > > Justine
> > >
> > > On Tue, Apr 9, 2024 at 9:01 AM Jun Rao 
> wrote:
> > >
> > > > Hi, Justine,
> > > >
> > > > A single config makes sense to me too. We just need to reach
> consensus
> > > with
> > > > KIP-1014.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
> > >  > > > >
> > > > wrote:
> > > >
> > > > > Hey Jun,
> > > > >
> > > > > That's a good question. I think maybe for simplicity, we can have a
> > > > single
> > > > > config?
> > > > > If that makes sense, I will update the KIP.
> > > > >
> > > > > Justine
> > > > >
> > > > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao 
> > > wrote:
> > > > >
> > > > > > Hi, Justine,
> > > > > >
> > > > > > Thanks for the updated KIP.
> > > > > >
> > > > > > One more question related to KIP-1014. It introduced a new
> > > > > > config unstable.metadata.versions.enable. Does each new feature
> > need
> > > to
> > > > > > have a corresponding config to enable the testing of unstable
> > > features
> > > > or
> > > > > > should we have a generic config enabling the testing of all
> > unstable
> > > > > > features?
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> > > > >  > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I'm hoping this covers the majority of comments. I will go
> ahead
> > > and
> > > > > open
> > > > > > > the vote in the next day or so.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Justine
> > > > > > >
> > > > > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan <
> > > jols...@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Find and replace has failed me :(
> > > > > > > >
> > > > > > > > Group version seems a little vague, but we can update it.
> > > Hopefully
> > > > > > find
> > > > > > > > and replace won't fail me again, otherwise I will get another
> > > email
> > > > > on
> > > > > > > this.
> > > > > > > >
> > > > > > > > Justine
> > > > > > > >
> > > > > > > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
> > > > > >  > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks, Justine.
> > > > > > > >>
> > > > > > > >> * Should we also use `group.version` (GV) as I suggested in
> my
> > > > > > previous
> > > > > > > >> message in order to be consistent?
> > > > > > > >> * Should we add both names to the `Public Interfaces`
> section?
> > > > > > > >> * There is still at least one usage of
> > > > > `transaction.protocol.verison`
> > > > > > in
> > > > > > > >> the KIP too.
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> David
> > > > > > > >>
> > > > > > > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > > > > > > >> 
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > I had missed the David's message yesterday about the
> naming
> > > for
> > > > > > > >> transaction
> > > > > > > >> > version vs transaction protocol version.
> > > > > > > >> >
> > > > > > > >> > After some offline discussion with Jun, Artem, and David,
> we
> > > > > agreed
> > > > > > > that
> > > > > > > >> > transaction version is simpler and conveys more than just
> > > > protocol
> > > > > > > >> changes
> > > > > > > >> > (flexible records for example)
> > > > > > > >> >
> > > > > > > >> > I will update the KIP as well as KIP-890
> > > > > > > >> >
> > > > > > > >> > Thanks,
> > > > > > > >> > Justine
> > > > > > > >> >
> > > > > > > >> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan <
> > > > > jols...@confluent.io
> > > > > > >
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > Updated!
> > > > > > > >> > >
> > > > > > > >> > > Justine
> > > > > > > >> > >
> > > > > > > >> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao
> > > >  > > > > >
> > > > > > > >> wrote:
> > > > > > > >> > >
> > > > > > > >> > >> Hi, Justine,
> > > > > > > >> > >>
> > > > > > > >> > >> Thanks for the reply.
> > > > > > > >> > >>
> > > > > > > >> > >> 21. Sounds good. It would be useful to document that.
> > > > > > > >> > >>
> > > > > > > >> > >> 22. Should we add the IV in 

[jira] [Created] (KAFKA-16525) Request manager and raft channel discover node endpoints

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16525:
--

 Summary: Request manager and raft channel discover node endpoints
 Key: KAFKA-16525
 URL: https://issues.apache.org/jira/browse/KAFKA-16525
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


The RequestManager and RaftChannel implementations assume that nodes are always 
voters and are static. This needs to change so that they can discover these nodes 
dynamically and correctly handle nodes being added or removed.

This also involves adding support for bootstrap servers for observers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16524) Metrics for KIP-853

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16524:
--

 Summary: Metrics for KIP-853
 Key: KAFKA-16524
 URL: https://issues.apache.org/jira/browse/KAFKA-16524
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio


https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-Monitoring



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1026: Handling producer snapshot when upgrading from < v2.8.0 for Tiered Storage

2024-04-11 Thread Kamal Chandraprakash
> Is making it optional a good solution? Or should we recover the snapshot
if not found before uploading it?

IMO, we should recover the snapshot if it is not found.

> But what if the topic is created before v2.8.0 and old log segments are
deleted, how could we recover all the producer snapshot for old logs?

We don't need the snapshot files of the deleted log segments, only for the
segments which lack one.
The ProducerStateManager contains the complete state. We have to regenerate
the state from there.

@christolo...@gmail.com 

You have filed https://issues.apache.org/jira/browse/KAFKA-15195 ticket to
generate the snapshots.
Could you please update the ticket with the details? Thanks!

On Thu, Apr 4, 2024 at 2:01 PM Luke Chen  wrote:

> Hi Kamal,
>
> Thanks for sharing! I didn't know about the change before v2.8.
> If that's the case, then we have to reconsider the solution of this PR.
> Is making it optional a good solution? Or should we recover the snapshot
> if not found before uploading it?
> But what if the topic is created before v2.8.0 and old log segments are
> deleted, how could we recover all the producer snapshot for old logs?
>
> Thanks.
> Luke
>
>
> On Wed, Apr 3, 2024 at 11:59 PM Arpit Goyal 
> wrote:
>
>> Thanks @Kamal Chandraprakash   Greg
>> Harris
>> I currently do not have detailed understanding on the behaviour when empty
>> producer snapshot  restored. I will try to test out the
>> behaviour.Meanwhile
>> I would request other community members if they can chime in and assist if
>> they are already aware of the behaviour mentioned above.
>> Thanks and Regards
>> Arpit Goyal
>> 8861094754
>>
>>
>> On Tue, Mar 26, 2024 at 4:04 PM Kamal Chandraprakash <
>> kamal.chandraprak...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > Sorry for the late reply. Greg has raised some good points:
>> >
>> > > Does an empty producer snapshot have the same behavior as a
>> > non-existent snapshot when restored?
>> >
>> > Producer snapshots maintain the status about the ongoing txn to either
>> > COMMIT/ABORT the transaction. In the older version (<2.8), we maintain
>> > the producer snapshots only for the recent segments. If such a topic
>> gets
>> > onboarded to tiered storage and the recently built replica becomes the
>> > leader,
>> > then it might break the producer.
>> >
>> > Assume there are 100 segments for a partition, and the producer
>> snapshots
>> > are available only for the recent 2 segments. Then, tiered storage is
>> > enabled
>> > for that topic, 90/100 segments are uploaded and local-log-segments are
>> > cleared upto segment 50. If a new follower builds the state from
>> remote, it
>> > will
>> > have the empty producer snapshots and start reading the data from the
>> > leader
>> > from segment-51. If a transaction is started at segment-40, then it will
>> > break the
>> > client.
>> >
>> > We also have to check the impact of expiring producer-ids before the
>> > default
>> > expiration time of 7 days: *transactional.id.expiration.ms*
>> >
>> > > 2. Why were empty producer snapshots not backfilled for older data
>> > when clusters were upgraded from 2.8?
>> >
>> > https://github.com/apache/kafka/pull/7929 -- It was not required at
>> that
>> > time.
>> > With tiered storage, we need the snapshot file for each segment to
>> reliably
>> > build the follower state from remote storage.
>> >
>> > > 3. Do producer snapshots need to be available contiguously, or can
>> > earlier snapshots be empty while later segments do not exist?
>> >
>> > I assume you refer to "while later segments do exist". Each snapshot
>> file
>> > will contain
>> > the cumulative/complete data of all the previous segments. So, a recent
>> > segment
>> > snapshot is enough to build the producer state. We need to figure out a
>> > solution to
>> > build the complete producer state for replicas that built the state
>> using
>> > the remote.
>> >
>> > Arpit,
>> > We have to deep dive into each of them to come up with the proper
>> solution.
>> >
>> > --
>> > Kamal
>> >
>> >
>> > On Tue, Mar 26, 2024 at 3:55 AM Greg Harris
>> 
>> > wrote:
>> >
>> > > Hi Arpit,
>> > >
>> > > I think creating empty producer snapshots would be
>> > > backwards-compatible for the tiered storage plugins, but I'm not aware
>> > > of what the other compatibility/design concerns might be. Maybe you or
>> > > another reviewer can answer these questions:
>> > > 1. Does an empty producer snapshot have the same behavior as a
>> > > non-existent snapshot when restored?
>> > > 2. Why were empty producer snapshots not backfilled for older data
>> > > when clusters were upgraded from 2.8?
>> > > 3. Do producer snapshots need to be available contiguously, or can
>> > > earlier snapshots be empty while later segments do not exist?
>> > >
>> > > Thanks,
>> > > Greg
>> > >
>> > > On Sat, Mar 23, 2024 at 3:24 AM Arpit Goyal > >
>> > > wrote:
>> > > >
>> > > > Yes Luke,
>> > > > I am also in favour of creating producer 

[jira] [Created] (KAFKA-16523) kafka-metadata-quorum add voter and remove voter changes

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16523:
--

 Summary: kafka-metadata-quorum add voter and remove voter changes
 Key: KAFKA-16523
 URL: https://issues.apache.org/jira/browse/KAFKA-16523
 Project: Kafka
  Issue Type: Sub-task
  Components: tools
Reporter: José Armando García Sancio


# 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-add-controller–config%3Cserver.properties%3E
 # 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-remove-controller--controller-id%3Ccontroller-id%3E--controller-uuid%3Ccontroller-uuid%3E



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16522) Implement Admin client changes for adding and removing voters

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16522:
--

 Summary: Implement Admin client changes for adding and removing 
voters
 Key: KAFKA-16522
 URL: https://issues.apache.org/jira/browse/KAFKA-16522
 Project: Kafka
  Issue Type: Sub-task
Reporter: José Armando García Sancio


https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-Admin



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16521) Implement kafka-metadata-quorum describe changes for KIP-853

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16521:
--

 Summary: Implement kafka-metadata-quorum describe changes for 
KIP-853
 Key: KAFKA-16521
 URL: https://issues.apache.org/jira/browse/KAFKA-16521
 Project: Kafka
  Issue Type: Sub-task
  Components: tools
Reporter: José Armando García Sancio


# 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-describe–status
 # 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-describe--replication



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16520) Implement changes to DescribeQuorum response

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16520:
--

 Summary: Implement changes to DescribeQuorum response
 Key: KAFKA-16520
 URL: https://issues.apache.org/jira/browse/KAFKA-16520
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16519) Expose the supported and finalized kraft.version in ApiVersions response

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16519:
--

 Summary: Expose the supported and finalized kraft.version in 
ApiVersions response
 Key: KAFKA-16519
 URL: https://issues.apache.org/jira/browse/KAFKA-16519
 Project: Kafka
  Issue Type: Sub-task
  Components: core
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16518) Implement storage tool changes for KIP-853

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16518:
--

 Summary: Implement storage tool changes for KIP-853
 Key: KAFKA-16518
 URL: https://issues.apache.org/jira/browse/KAFKA-16518
 Project: Kafka
  Issue Type: Sub-task
  Components: tools
Reporter: José Armando García Sancio
 Fix For: 3.8.0


https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes#KIP853:KRaftControllerMembershipChanges-kafka-storage



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16517) Do not decode metadata records in the internal kraft partition listener

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16517:
--

 Summary: Do not decode metadata records in the internal kraft 
partition listener
 Key: KAFKA-16517
 URL: https://issues.apache.org/jira/browse/KAFKA-16517
 Project: Kafka
  Issue Type: Sub-task
  Components: kraft
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio


The implementation of the internal partition listener for kraft reads and 
decodes the data records. This is not required; it is only done because it is 
easier to implement with the current code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16516) Fix the controller node provider for broker to controller channel

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16516:
--

 Summary: Fix the controller node provider for broker to controller 
channel
 Key: KAFKA-16516
 URL: https://issues.apache.org/jira/browse/KAFKA-16516
 Project: Kafka
  Issue Type: Sub-task
  Components: core
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


The broker to controller channel gets the set of voters directly from the 
static configuration. This needs to change so that the leader node comes from 
the kraft client/manager.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16515) Fix the ZK Metadata cache use of voter static configuration

2024-04-11 Thread Jira
José Armando García Sancio created KAFKA-16515:
--

 Summary: Fix the ZK Metadata cache use of voter static 
configuration
 Key: KAFKA-16515
 URL: https://issues.apache.org/jira/browse/KAFKA-16515
 Project: Kafka
  Issue Type: Sub-task
  Components: core
Reporter: José Armando García Sancio
Assignee: José Armando García Sancio
 Fix For: 3.8.0


It looks like, because of the ZK to KRaft migration, the ZK metadata cache was 
changed to read the static voter configuration. This needs to change to use the 
voter nodes reported by the raft manager or the kraft client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16514) Kafka Streams: stream.close(CloseOptions) does not respect options.leaveGroup flag.

2024-04-11 Thread Sal Sorrentino (Jira)
Sal Sorrentino created KAFKA-16514:
--

 Summary: Kafka Streams: stream.close(CloseOptions) does not 
respect options.leaveGroup flag.
 Key: KAFKA-16514
 URL: https://issues.apache.org/jira/browse/KAFKA-16514
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: Sal Sorrentino


Working with Kafka Streams 3.7.0, but may affect earlier versions as well.

 

When attempting to shut down a streams application and leave the associated 
consumer group, the supplied `leaveGroup` option seems to have no effect. 
Sample code:


{code:java}
CloseOptions options = new CloseOptions().leaveGroup(true);
stream.close(options);{code}

The expected behavior here is that the group member would shut down and leave 
the group, immediately triggering a consumer group rebalance. In practice, the 
rebalance happens only after the appropriate timeout configuration has expired.

I understand the default behavior: there is an assumption that any associated 
StateStores would be persisted to disk and that, in the case of a rolling 
restart/deployment, the rebalance delay may be preferable. However, in our 
application we are using in-memory state stores and standby replicas. There is 
no benefit in delaying the rebalance in this setup, and we need a way to force 
a member to leave the group when shutting down.



The workaround we found is to set an undocumented internal StreamConfig to 
enforce this behavior:


{code:java}
props.put("internal.leave.group.on.close", true);
{code}
To state the obvious, this is less than ideal.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] Add youtube to content security policy [kafka-site]

2024-04-11 Thread via GitHub


omkreddy merged PR #597:
URL: https://github.com/apache/kafka-site/pull/597


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Add youtube to content security policy [kafka-site]

2024-04-11 Thread via GitHub


VedarthConfluent opened a new pull request, #597:
URL: https://github.com/apache/kafka-site/pull/597

   This change should fix the broken video on quickstart page 
https://kafka.apache.org/quickstart.
   Currently observing the following error:
   `quickstart:297 Refused to frame 'https://www.youtube.com/' because it 
violates the following Content Security Policy directive: "frame-src 'self'"`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-11 Thread Justine Olshan
The original config was never actually approved in any KIP. But we can say
it is deprecated.
I can change the config name.

Justine

On Thu, Apr 11, 2024 at 8:52 AM Jun Rao  wrote:

> Hi, Justine,
>
> Thanks for the updated KIP.
>
> Would unstable.feature.version.enable be a clearer name? Also, should we
> remove/deprecate unstable.metadata.versions.enable in this KIP?
>
> Jun
>
> On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan  >
> wrote:
>
> > Hi Jun,
> >
> > Makes sense to me. It seems like KIP-1014 has been inactive recently. I
> can
> > update my KIP and mention this change on that discussion thread.
> >
> > Justine
> >
> > On Tue, Apr 9, 2024 at 9:01 AM Jun Rao  wrote:
> >
> > > Hi, Justine,
> > >
> > > A single config makes sense to me too. We just need to reach consensus
> > with
> > > KIP-1014.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
> >  > > >
> > > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > That's a good question. I think maybe for simplicity, we can have a
> > > single
> > > > config?
> > > > If that makes sense, I will update the KIP.
> > > >
> > > > Justine
> > > >
> > > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao 
> > wrote:
> > > >
> > > > > Hi, Justine,
> > > > >
> > > > > Thanks for the updated KIP.
> > > > >
> > > > > One more question related to KIP-1014. It introduced a new
> > > > > config unstable.metadata.versions.enable. Does each new feature
> need
> > to
> > > > > have a corresponding config to enable the testing of unstable
> > features
> > > or
> > > > > should we have a generic config enabling the testing of all
> unstable
> > > > > features?
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > I'm hoping this covers the majority of comments. I will go ahead
> > and
> > > > open
> > > > > > the vote in the next day or so.
> > > > > >
> > > > > > Thanks,
> > > > > > Justine
> > > > > >
> > > > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan <
> > jols...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Find and replace has failed me :(
> > > > > > >
> > > > > > > Group version seems a little vague, but we can update it.
> > Hopefully
> > > > > find
> > > > > > > and replace won't fail me again, otherwise I will get another
> > email
> > > > on
> > > > > > this.
> > > > > > >
> > > > > > > Justine
> > > > > > >
> > > > > > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
> > > > >  > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks, Justine.
> > > > > > >>
> > > > > > >> * Should we also use `group.version` (GV) as I suggested in my
> > > > > previous
> > > > > > >> message in order to be consistent?
> > > > > > >> * Should we add both names to the `Public Interfaces` section?
> > > > > > >> * There is still at least one usage of
> > > > `transaction.protocol.verison`
> > > > > in
> > > > > > >> the KIP too.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> David
> > > > > > >>
> > > > > > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > > > > > >> 
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > I had missed the David's message yesterday about the naming
> > for
> > > > > > >> transaction
> > > > > > >> > version vs transaction protocol version.
> > > > > > >> >
> > > > > > >> > After some offline discussion with Jun, Artem, and David, we
> > > > agreed
> > > > > > that
> > > > > > >> > transaction version is simpler and conveys more than just
> > > protocol
> > > > > > >> changes
> > > > > > >> > (flexible records for example)
> > > > > > >> >
> > > > > > >> > I will update the KIP as well as KIP-890
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> > Justine
> > > > > > >> >
> > > > > > >> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan <
> > > > jols...@confluent.io
> > > > > >
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > Updated!
> > > > > > >> > >
> > > > > > >> > > Justine
> > > > > > >> > >
> > > > > > >> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao
> > >  > > > >
> > > > > > >> wrote:
> > > > > > >> > >
> > > > > > >> > >> Hi, Justine,
> > > > > > >> > >>
> > > > > > >> > >> Thanks for the reply.
> > > > > > >> > >>
> > > > > > >> > >> 21. Sounds good. It would be useful to document that.
> > > > > > >> > >>
> > > > > > >> > >> 22. Should we add the IV in "metadata.version=17 has no
> > > > > > dependencies"
> > > > > > >> > too?
> > > > > > >> > >>
> > > > > > >> > >> Jun
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
> > > > > > >> > >> 
> > > > > > >> > >> wrote:
> > > > > > >> > >>
> > > > > > >> > >> > Jun,
> > > > > > >> > >> >
> > > > > > >> > >> > 21. Next producer ID field doesn't need to be populated
> > for
> > > > TV
> > > > > 1.
> > > > > > >> We
> > > > > > >> > >> don't
> > > > > > >> > >> > have the same need to retain this since it is written
> > > > directly
> > > > > to
> > > > > > >> the
> 

[DISCUSS] Transition Required from RemoteLogSegment to LocalLogSegment during Fetch Request

2024-04-11 Thread Arpit Goyal
Hi All,
When tackling the issue outlined in
https://issues.apache.org/jira/browse/KAFKA-16088, which pertains to
transitioning from RemoteLogSegment to LocalLogSegment during log
compaction in a fetch request, I'm seeking some suggestions or guidance
from the community on how to advance it.

*Current Behaviour *

At a very high level, this is how the fetch request behaves for topics with 
tiered storage enabled.

[image: Screenshot 2024-04-11 at 8.49.31 PM.png]

1. When a consumer client requests an offset that is not available in the
local log for the given partition, the broker throws an OffsetOutOfRange
error.
2. If the required offset falls within the range of the Remote Segment log,
we schedule a DelayedRemoteFetch request along with the successful results
of other partitions from the log, and the responseCallback.

*Current Issue*
In this scenario, the functionality is compromised when RemoteLogSegments
are log compacted. As we initiate offset reading from RemoteLogManager,
there's a possibility that we cannot find the required remote log segment
within the requested offset range because of log compaction. Ideally, the
system should then search for higher log segments for additional retrieval.
However, these higher segments are stored locally instead of remotely.
Presently, there's no mechanism in place to transition the fetch request
from remote log segments to local log segments, resulting in empty records
being returned (Check the diagram above). Consequently, the consumer client
remains unable to increment the offset, erroneously perceiving that there
is no more data available.

The scenario has been discussed here in detail.

*Possible Approaches and downside*

*Implement a mechanism to transition from RemoteLogSegment to LocalLogSegment*

1. Suppose we implement a mechanism to advance the fetch request from a remote
log segment to a local log segment. However, during this process, there's a
possibility that the local segments are moved to remote storage. This situation
hints at a potential death-spiral loop, where we continuously switch between
local and remote segments and vice versa.

2. Handling the advancement of the fetch request from a remote segment to a
local segment is complex code-wise, mainly because the flow is independent, and
there is no existing mechanism to manage this transition seamlessly.

*Use the end offset of the last RemoteLogSegmentMetadata*

Once we've determined that no RemoteLogSegment satisfies the fetch offset, we
can utilize the information regarding the next segment to inspect, which is
based on the end offset of the last RemoteLogSegment iterated. This information
can be passed to the client along with an error message resembling
"OffsetOutOfRangeBecauseOfLogCompaction". Similar to the offset reset strategy
options like "latest" or "earliest", we can advance the next fetch request for
the given partition to the required value passed to the client.

1. In our current implementation, the subscription position is only
incremented under two conditions: when there are records in the response or
when the resetStrategy is configured.
2. The proposed design requires sending the next fetch offset in the response
to the client if this scenario occurs (see the sketch below).


Please let me know if the community has any suggestions or directions to
offer.


Thanks and Regards
Arpit Goyal
8861094754


[DISCUSS] KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission

2024-04-11 Thread Nikhil Ramakrishnan
Hi everyone,

I would like to start a discussion for

KIP-1037: Allow WriteTxnMarkers API with Alter Cluster Permission
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1037%3A+Allow+WriteTxnMarkers+API+with+Alter+Cluster+Permission

The WriteTxnMarkers API was originally used for inter-broker
communication only. This required the ClusterAction permission on the
Cluster resource to invoke.

In KIP-664, we modified the WriteTxnMarkers API so that it could be
invoked externally from the Kafka AdminClient to safely abort a
hanging transaction. Such usage is more aligned with the Alter
permission on the Cluster resource, which includes other
administrative actions invoked from the Kafka AdminClient (i.e.
CreateAcls and DeleteAcls). This KIP proposes allowing the
WriteTxnMarkers API to be invoked with the Alter permission on the
Cluster.

I am looking forward to your thoughts and suggestions for improvement!

Thanks,
Nikhil


Re: [DISCUSS] KIP-1022 Formatting and Updating Features

2024-04-11 Thread Jun Rao
Hi, Justine,

Thanks for the updated KIP.

Would unstable.feature.version.enable be a clearer name? Also, should we
remove/deprecate unstable.metadata.versions.enable in this KIP?

Jun

On Tue, Apr 9, 2024 at 9:09 AM Justine Olshan 
wrote:

> Hi Jun,
>
> Makes sense to me. It seems like KIP-1014 has been inactive recently. I can
> update my KIP and mention this change on that discussion thread.
>
> Justine
>
> On Tue, Apr 9, 2024 at 9:01 AM Jun Rao  wrote:
>
> > Hi, Justine,
> >
> > A single config makes sense to me too. We just need to reach consensus
> with
> > KIP-1014.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Apr 8, 2024 at 5:06 PM Justine Olshan
>  > >
> > wrote:
> >
> > > Hey Jun,
> > >
> > > That's a good question. I think maybe for simplicity, we can have a
> > single
> > > config?
> > > If that makes sense, I will update the KIP.
> > >
> > > Justine
> > >
> > > On Mon, Apr 8, 2024 at 3:20 PM Jun Rao 
> wrote:
> > >
> > > > Hi, Justine,
> > > >
> > > > Thanks for the updated KIP.
> > > >
> > > > One more question related to KIP-1014. It introduced a new
> > > > config unstable.metadata.versions.enable. Does each new feature need
> to
> > > > have a corresponding config to enable the testing of unstable
> features
> > or
> > > > should we have a generic config enabling the testing of all unstable
> > > > features?
> > > >
> > > > Jun
> > > >
> > > > On Thu, Apr 4, 2024 at 8:24 PM Justine Olshan
> > >  > > > >
> > > > wrote:
> > > >
> > > > > I'm hoping this covers the majority of comments. I will go ahead
> and
> > > open
> > > > > the vote in the next day or so.
> > > > >
> > > > > Thanks,
> > > > > Justine
> > > > >
> > > > > On Wed, Apr 3, 2024 at 3:31 PM Justine Olshan <
> jols...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Find and replace has failed me :(
> > > > > >
> > > > > > Group version seems a little vague, but we can update it.
> Hopefully
> > > > find
> > > > > > and replace won't fail me again, otherwise I will get another
> email
> > > on
> > > > > this.
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Wed, Apr 3, 2024 at 12:15 PM David Jacot
> > > >  > > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Thanks, Justine.
> > > > > >>
> > > > > >> * Should we also use `group.version` (GV) as I suggested in my
> > > > previous
> > > > > >> message in order to be consistent?
> > > > > >> * Should we add both names to the `Public Interfaces` section?
> > > > > >> * There is still at least one usage of
> > > `transaction.protocol.verison`
> > > > in
> > > > > >> the KIP too.
> > > > > >>
> > > > > >> Best,
> > > > > >> David
> > > > > >>
> > > > > >> On Wed, Apr 3, 2024 at 6:29 PM Justine Olshan
> > > > > >> 
> > > > > >> wrote:
> > > > > >>
> > > > > >> > I had missed the David's message yesterday about the naming
> for
> > > > > >> transaction
> > > > > >> > version vs transaction protocol version.
> > > > > >> >
> > > > > >> > After some offline discussion with Jun, Artem, and David, we
> > > agreed
> > > > > that
> > > > > >> > transaction version is simpler and conveys more than just
> > protocol
> > > > > >> changes
> > > > > >> > (flexible records for example)
> > > > > >> >
> > > > > >> > I will update the KIP as well as KIP-890
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Justine
> > > > > >> >
> > > > > >> > On Tue, Apr 2, 2024 at 2:50 PM Justine Olshan <
> > > jols...@confluent.io
> > > > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Updated!
> > > > > >> > >
> > > > > >> > > Justine
> > > > > >> > >
> > > > > >> > > On Tue, Apr 2, 2024 at 2:40 PM Jun Rao
> >  > > >
> > > > > >> wrote:
> > > > > >> > >
> > > > > >> > >> Hi, Justine,
> > > > > >> > >>
> > > > > >> > >> Thanks for the reply.
> > > > > >> > >>
> > > > > >> > >> 21. Sounds good. It would be useful to document that.
> > > > > >> > >>
> > > > > >> > >> 22. Should we add the IV in "metadata.version=17 has no
> > > > > dependencies"
> > > > > >> > too?
> > > > > >> > >>
> > > > > >> > >> Jun
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> On Tue, Apr 2, 2024 at 11:31 AM Justine Olshan
> > > > > >> > >> 
> > > > > >> > >> wrote:
> > > > > >> > >>
> > > > > >> > >> > Jun,
> > > > > >> > >> >
> > > > > >> > >> > 21. Next producer ID field doesn't need to be populated
> for
> > > TV
> > > > 1.
> > > > > >> We
> > > > > >> > >> don't
> > > > > >> > >> > have the same need to retain this since it is written
> > > directly
> > > > to
> > > > > >> the
> > > > > >> > >> > transaction log in InitProducerId. It is only needed for
> > > > KIP-890
> > > > > >> part
> > > > > >> > 2
> > > > > >> > >> /
> > > > > >> > >> > TV 2.
> > > > > >> > >> >
> > > > > >> > >> > 22. We can do that.
> > > > > >> > >> >
> > > > > >> > >> > Justine
> > > > > >> > >> >
> > > > > >> > >> > On Tue, Apr 2, 2024 at 10:41 AM Jun Rao
> > > >  > > > > >
> > > > > >> > >> wrote:
> > > > > >> > >> >
> > > > > >> > >> > > Hi, Justine,
> > > > > >> > >> > >
> > > > > >> > >> > > Thanks for the reply.
> > > > > 

[jira] [Created] (KAFKA-16513) Allow WriteTxnMarkers API with Alter Cluster Permission

2024-04-11 Thread Nikhil Ramakrishnan (Jira)
Nikhil Ramakrishnan created KAFKA-16513:
---

 Summary: Allow WriteTxnMarkers API with Alter Cluster Permission
 Key: KAFKA-16513
 URL: https://issues.apache.org/jira/browse/KAFKA-16513
 Project: Kafka
  Issue Type: Improvement
  Components: admin
Reporter: Nikhil Ramakrishnan
 Fix For: 3.8.0


We should allow the WriteTxnMarkers API with the Alter Cluster permission because it 
can be invoked externally by a Kafka AdminClient. Such usage is more aligned with 
the Alter permission on the Cluster resource, which covers other 
administrative actions invoked from the Kafka AdminClient.
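
A rough sketch of the authorization change being proposed (illustrative only, not the 
actual KafkaApis code): the handler would accept either ClusterAction (the inter-broker 
path) or Alter (the AdminClient path) on the Cluster resource.

{code:java}
// Illustrative sketch only; not the actual KafkaApis authorization logic.
import java.util.Set;

final class WriteTxnMarkersAuthSketch {
    enum Operation { CLUSTER_ACTION, ALTER }

    // Stand-in for the authorizer: the cluster-level operations granted to the caller.
    static boolean authorizeWriteTxnMarkers(Set<Operation> grantedClusterOps) {
        // Before: only CLUSTER_ACTION (inter-broker) was accepted.
        // Proposed: also accept ALTER, matching other admin actions such as CreateAcls/DeleteAcls.
        return grantedClusterOps.contains(Operation.CLUSTER_ACTION)
            || grantedClusterOps.contains(Operation.ALTER);
    }
}
{code}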



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16512) Top level menu on Kafka homepage not working

2024-04-11 Thread Tobias Liefke (Jira)
Tobias Liefke created KAFKA-16512:
-

 Summary: Top level menu on Kafka homepage not working
 Key: KAFKA-16512
 URL: https://issues.apache.org/jira/browse/KAFKA-16512
 Project: Kafka
  Issue Type: Bug
  Components: website
Affects Versions: 3.7.0
Reporter: Tobias Liefke


When opening [https://kafka.apache.org|https://kafka.apache.org/], nothing 
happens when hovering over or clicking on the top-level menu items "GET STARTED", 
"COMMUNITY" or "APACHE".

Only the menu items "DOCS" and "POWERED BY" work, as these don't 
contain a sub menu.

The "mobile menu" (also called "burger menu", the one with the three bars, 
appears below 950px screen width) is working fine.

Tested on different browsers (Chromium based and Firefox) on different devices.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-971 Expose replication-offset-lag MirrorMaker2 metric

2024-04-11 Thread Elxan Eminov
Hi Mickael!
Any thoughts on this?
Thanks!

On Wed, 3 Apr 2024 at 13:21, Elxan Eminov  wrote:

> Hi Mickael,
> Thanks for your response and apologies for a huge delay in mine.
>
> My thinking is that any partition could go stale if there are no records
> being produced into it. If enough such partitions are present and are
> owned by a single MM task, an OOM could happen.
>
> Regarding the scenario where the TTL value is lower than the refresh
> interval - I believe that this is an edge case that we need to document and
> guard against, for example by either failing to start on such a combination
> or falling back to a default value that satisfies the constraint and
> logging an error.
>
> Thanks,
> Elkhan
>
> On Thu, 8 Feb 2024 at 14:17, Mickael Maison 
> wrote:
>
>> Hi,
>>
>> Thanks for the updates.
>> I'm wondering whether we really need the ttl eviction mechanism. The
>> motivation is to "avoid storing stale LRO entries which can cause an
>> eventual OOM error". How could it contain stake entries? I would
>> expect its cache to only contain entries for partitions assigned to
>> the task that owns it. Also what is the expected behavior if there's
>> no available LRO in the cache? If we keep this mechanism what happens
>> if its value is lower than
>> replication.record.lag.metric.refresh.interval?
>>
>> Thanks,
>> Mickael
>>
>> On Mon, Feb 5, 2024 at 5:23 PM Elxan Eminov 
>> wrote:
>> >
>> > Hi Mickael!
>> > Any further thoughts on this?
>> >
>> > Thanks,
>> > Elkhan
>> >
>> > On Thu, 18 Jan 2024 at 11:53, Mickael Maison 
>> > wrote:
>> >
>> > > Hi Elxan,
>> > >
>> > > Thanks for the updates.
>> > >
>> > > We used dots to separate words in configuration names, so I think
>> > > replication.offset.lag.metric.last-replicated-offset.ttl should be
>> > > named replication.offset.lag.metric.last.replicated.offset.ttl
>> > > instead.
>> > >
>> > > About the names of the metrics, fair enough if you prefer keeping the
>> > > replication prefix. Out of the alternatives you mentioned, I think I
>> > > prefer replication-record-lag. I think the metrics and configuration
>> > > names should match too. Let's see what the others think about it.
>> > >
>> > > Thanks,
>> > > Mickael
>> > >
>> > > On Mon, Jan 15, 2024 at 9:50 PM Elxan Eminov > >
>> > > wrote:
>> > > >
>> > > > Apologies, forgot to reply on your last comment about the metric
>> name.
>> > > > I believe both replication-lag and record-lag are a little too
>> abstract -
>> > > > what do you think about either leaving it as replication-offset-lag
>> or
>> > > > renaming to replication-record-lag?
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Wed, 10 Jan 2024 at 15:31, Mickael Maison <
>> mickael.mai...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi Elxan,
>> > > > >
>> > > > > Thanks for the KIP, it looks like a useful addition.
>> > > > >
>> > > > > Can you add to the KIP the default value you propose for
>> > > > > replication.lag.metric.refresh.interval? In MirrorMaker most
>> interval
>> > > > > configs can be set to -1 to disable them, will it be the case for
>> this
>> > > > > new feature or will this setting only accept positive values?
>> > > > > I also wonder if replication-lag, or record-lag would be clearer
>> names
>> > > > > instead of replication-offset-lag, WDYT?
>> > > > >
>> > > > > Thanks,
>> > > > > Mickael
>> > > > >
>> > > > > On Wed, Jan 3, 2024 at 6:15 PM Elxan Eminov <
>> elxanemino...@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > Hi all,
>> > > > > > Here is the vote thread:
>> > > > > >
>> https://lists.apache.org/thread/ftlnolcrh858dry89sjg06mdcdj9mrqv
>> > > > > >
>> > > > > > Cheers!
>> > > > > >
>> > > > > > On Wed, 27 Dec 2023 at 11:23, Elxan Eminov <
>> elxanemino...@gmail.com>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi all,
>> > > > > > > I've updated the KIP with the details we discussed in this
>> thread.
>> > > > > > > I'll call in a vote after the holidays if everything looks
>> good.
>> > > > > > > Thanks!
>> > > > > > >
>> > > > > > > On Sat, 26 Aug 2023 at 15:49, Elxan Eminov <
>> > > elxanemino...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > >> Relatively minor change with a new metric for MM2
>> > > > > > >>
>> > > > > > >>
>> > > > >
>> > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-971%3A+Expose+replication-offset-lag+MirrorMaker2+metric
>> > > > > > >>
>> > > > > > >
>> > > > >
>> > >
>>
>


[jira] [Created] (KAFKA-16511) Leaking tiered segments

2024-04-11 Thread Francois Visconte (Jira)
Francois Visconte created KAFKA-16511:
-

 Summary: Leaking tiered segments
 Key: KAFKA-16511
 URL: https://issues.apache.org/jira/browse/KAFKA-16511
 Project: Kafka
  Issue Type: Bug
  Components: Tiered-Storage
Affects Versions: 3.7.0
Reporter: Francois Visconte


I have some topics that have not been written to for a few days (with a 12h 
retention) where some data remains in tiered storage (in our case S3) and is 
never deleted.

 

Looking at the log history, it appears that we never even tried to delete these 
segments: 

When looking at one of the non-leaking segments, I get the following interesting 
messages: 

```

"2024-04-02T10:30:45.265Z","""kafka""","""10039""","[RemoteLogManager=10039 
partition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764] Deleted remote log segment 
RemoteLogSegmentId\{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764, 
id=fqGng3UURCG3-v4lETeLKQ} due to leader-epoch-cache truncation. Current 
earliest-epoch-entry: EpochEntry(epoch=8, startOffset=2980106), 
segment-end-offset: 2976819 and segment-epochs: [5]"

"2024-04-02T10:30:45.242Z","""kafka""","""10039""","Deleting log segment data 
for completed successfully 
RemoteLogSegmentMetadata\{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764,
 id=fqGng3UURCG3-v4lETeLKQ}, startOffset=2968418, endOffset=2976819, 
brokerId=10029, maxTimestampMs=1712009754536, eventTimestampMs=1712013411147, 
segmentLeaderEpochs=\{5=2968418}, segmentSizeInBytes=536351075, 
customMetadata=Optional.empty, state=COPY_SEGMENT_FINISHED}"

"2024-04-02T10:30:45.144Z","""kafka""","""10039""","Deleting log segment data 
for 
RemoteLogSegmentMetadata\{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764,
 id=fqGng3UURCG3-v4lETeLKQ}, startOffset=2968418, endOffset=2976819, 
brokerId=10029, maxTimestampMs=1712009754536, eventTimestampMs=1712013411147, 
segmentLeaderEpochs=\{5=2968418}, segmentSizeInBytes=536351075, 
customMetadata=Optional.empty, state=COPY_SEGMENT_FINISHED}"

"2024-04-01T23:16:51.157Z","""kafka""","""10029""","[RemoteLogManager=10029 
partition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764] Copied 
02968418.log to remote storage with segment-id: 
RemoteLogSegmentId\{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764, 
id=fqGng3UURCG3-v4lETeLKQ}"

"2024-04-01T23:16:51.147Z","""kafka""","""10029""","Copying log segment data 
completed successfully, metadata: 
RemoteLogSegmentMetadata\{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764,
 id=fqGng3UURCG3-v4lETeLKQ}, startOffset=2968418, endOffset=2976819, 
brokerId=10029, maxTimestampMs=1712009754536, eventTimestampMs=1712013397319, 
segmentLeaderEpochs=\{5=2968418}, segmentSizeInBytes=536351075, 
customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}"

"2024-04-01T23:16:37.328Z","""kafka""","""10029""","Copying log segment data, 
metadata: 
RemoteLogSegmentMetadata\{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-764,
 id=fqGng3UURCG3-v4lETeLKQ}, startOffset=2968418, endOffset=2976819, 
brokerId=10029, maxTimestampMs=1712009754536, eventTimestampMs=1712013397319, 
segmentLeaderEpochs=\{5=2968418}, segmentSizeInBytes=536351075, 
customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}"

```

This looks right, because we can see logs from both the plugin and the remote log 
manager indicating that the remote log segment was removed. 

Now if I look at one of the leaked segments, here is what I see:

 

```

"2024-04-02T00:43:33.834Z","""kafka""","""10001""","[RemoteLogManager=10001 
partition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-765] Copied 
02971163.log to remote storage with segment-id: 
RemoteLogSegmentId\{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-765, 
id=8dP13VDYSaiFlubl9SNBTQ}"

"2024-04-02T00:43:33.822Z","""kafka""","""10001""","Copying log segment data 
completed successfully, metadata: 
RemoteLogSegmentMetadata\{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-765,
 id=8dP13VDYSaiFlubl9SNBTQ}, startOffset=2971163, endOffset=2978396, 
brokerId=10001, maxTimestampMs=1712010648756, eventTimestampMs=1712018599981, 
segmentLeaderEpochs=\{7=2971163}, segmentSizeInBytes=459778940, 
customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}"

"2024-04-02T00:43:20.003Z","""kafka""","""10001""","Copying log segment data, 
metadata: 
RemoteLogSegmentMetadata\{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=5G8Ai8kBSwmQ3Ln4QRY5rA:topic1_3543-765,
 id=8dP13VDYSaiFlubl9SNBTQ}, startOffset=2971163, endOffset=2978396, 
brokerId=10001, maxTimestampMs=1712010648756, eventTimestampMs=1712018599981, 
segmentLeaderEpochs=\{7=2971163}, segmentSizeInBytes=459778940, 
customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}"

```

 

I 

Re: [DISCUSS] KIP-1033: Add Kafka Streams exception handler for exceptions occuring during processing

2024-04-11 Thread Damien Gasparina
Hi Matthias, Bruno,

1.a During my previous comment, by Processor Node ID, I meant
Processor name. This is important information to expose in the handler
as it allows users to identify the location of the exception in the
topology.
I assume this information could be useful in other places, which is why
I would lean toward adding it as an attribute in the
ProcessingContext.

1.b Looking at the ProcessingContext, I do think the following 3
methods should not be accessible in the exception handlers:
getStateStore(), schedule() and commit().
Having a separate interface would make a cleaner signature. It would
also be a great time to ensure that all exception handlers are
consistent, at the moment, the
DeserializationExceptionHandler.handle() relies on the old PAPI
ProcessorContext and the ProductionExceptionHandler.handle() has none.
It could make sense to build the new interface in this KIP and track
the effort to migrate the existing handlers in a separate KIP; what do
you think?
Maybe I am overthinking this part and the ProcessingContext would be fine.
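
To make the idea concrete, here is a minimal sketch of what such a restricted, read-only handler context could look like; the interface and method names are hypothetical illustrations, not the final KIP-1033 API.

```java
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.streams.processor.TaskId;

// Sketch only: a read-only context for exception handlers exposing record
// coordinates and the failing processor, while getStateStore(), schedule()
// and commit() are deliberately left out.
interface ErrorHandlerContext {
    String topic();            // input topic of the record being processed
    int partition();           // input partition
    long offset();             // input offset
    Headers headers();         // record headers
    String processorNodeId();  // name of the processor node where the error occurred
    TaskId taskId();           // task that was processing the record
}
```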

4. Good point regarding the dropped-record metric; as it is used by
the other handlers, I do think it makes sense to leverage it instead
of creating a new metric.
I will update the KIP to use the dropped-record metric.

8. Regarding the DSL, I am aligned with Bruno, I think we could close
the gaps in a future KIP.

Cheers,
Damien


On Thu, 11 Apr 2024 at 11:56, Bruno Cadonna  wrote:
>
> Hi Matthias,
>
>
> 1.a
> With processor node ID, I mean the ID that is exposed in the tags of
> processor node metrics. That ID cannot be internal since it is exposed
> in metrics. I think the processor name and the processor node ID is the
> same thing. I followed how the processor node ID is set in metrics and I
> ended up in addProcessor(name, ...).
>
>
> 1.b
> Regarding ProcessingContext, I also thought about a separate class to
> pass-in context information into the handler, but then I dismissed the
> idea because I thought I was overthinking it. Apparently, I was not
> overthinking it if you also had the same idea. So let's consider a
> separate class.
>
>
> 4.
> Regarding the metric, thanks for pointing to the dropped-record metric,
> Matthias. The dropped-record metric is used with the deserialization
> handler and the production handler. So, it would make sense to also use
> it for this handler. However, the dropped-record metric only records
> records that are skipped by the handler and not the number of calls to
> the handler. But that difference is probably irrelevant since in case of
> FAIL, the metric will be reset anyways since the stream thread will be
> restarted. In conclusion, I think the dropped-record metric in
> combination with a warn log message might be the better choice to
> introducing a new metric.
>
>
> 8.
> Regarding the DSL, I think we should close possible gaps in a separate KIP.
>
>
> Best,
> Bruno
>
> On 4/11/24 12:06 AM, Matthias J. Sax wrote:
> > Thanks for the KIP. Great discussion.
> >
> > I am not sure if I understand the proposal from Bruno to hand in the
> > processor node id? Isn't this internal (could not even find it quickly).
> > We do have a processor name, right? Or do I mix up something?
> >
> > Another question is about `ProcessingContext` -- it contains a lot of
> > (potentially irrelevant?) metadata. We should think carefully about what
> > we want to pass in and what not -- removing stuff is hard, but adding
> > stuff is easy. It's always an option to create a new interface that only
> > exposes stuff we find useful, and allows us to evolve this interface
> > independent of others. Re-using an existing interface always has the
> > danger of introducing an undesired coupling that could bite us in the
> > future. -- It makes total sense to pass in `RecordMetadata`, but
> > `ProcessingContext` (even if already limited compared to
> > `ProcessorContext`) still seems to be too broad? For example, there is
> > `getStateStore()` and `schedule()` methods which I think we should not
> > expose.
> >
> > The other interesting question is about "what record gets passed in".
> > For the PAPI, passing in the Processor's input record make a lot of
> > sense. However, for DSL operators, I am not 100% sure? The DSL often
> > uses internal types not exposed to the user, and thus I am not sure if
> > users could write useful code for this case? -- In general, I still
> > agree that the handler should be implemented with a try-catch around
> > `Processor.process()`, but it might not be too useful for DSL processors.
> > Hence, I am wondering if we need to do something more in the DSL? I
> > don't have a concrete proposal (a few high level ideas only) and if we
> > don't do anything special for the DSL I am ok with moving forward with
> > this KIP as-is, but we should be aware of potential limitations for DSL
> > users. We can always do a follow up KIP to close gaps when we understand
> > the impact better -- covering the DSL would also expand the 

Re: Gentle bump on KAFKA-16371 (Unstable committed offsets after triggering commits where metadata for some partitions are over the limit)

2024-04-11 Thread Michał Łowicki
BR,
Michał Łowicki


On Fri, 5 Apr 2024 at 15:59, David Jacot 
wrote:

> Thanks, Michal. Let me add it to my review queue.


Gentle ping ;-)



>
> BR,
> David
>
> On Fri, Apr 5, 2024 at 3:29 PM Michał Łowicki  wrote:
>
> > Hi there!
> >
> > Created https://issues.apache.org/jira/browse/KAFKA-16371 few weeks back
> > but there wasn't any attention. Any chance someone knowing that code
> could
> > take a look at the issue found and proposed fixed? Thanks in advance.
> >
> > --
> > BR,
> > Michał Łowicki
> >
>


[jira] [Resolved] (KAFKA-15610) Fix `CoreUtils.swallow()` test gaps

2024-04-11 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-15610.

Fix Version/s: 3.8.0
   Resolution: Fixed

> Fix `CoreUtils.swallow()` test gaps
> ---
>
> Key: KAFKA-15610
> URL: https://issues.apache.org/jira/browse/KAFKA-15610
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Atul Sharma
>Priority: Minor
>  Labels: newbie
> Fix For: 3.8.0
>
>
> For example, it should verify that the passed in `logging` is used in case of 
> an exception. We found that there is no test for this in 
> https://github.com/apache/kafka/pull/14529#discussion_r1355277747.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16482) Eliminate the IDE warnings of accepting ClusterConfig in BeforeEach

2024-04-11 Thread Chia-Ping Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-16482.

Fix Version/s: 3.8.0
   Resolution: Fixed

> Eliminate the IDE warnings of accepting ClusterConfig in BeforeEach
> ---
>
> Key: KAFKA-16482
> URL: https://issues.apache.org/jira/browse/KAFKA-16482
> Project: Kafka
>  Issue Type: Test
>Reporter: Chia-Ping Tsai
>Assignee: Cheng-Kai, Zhang
>Priority: Major
> Fix For: 3.8.0
>
>
> The IDE does not like the code style, and we can leverage `ClusterConfigProperty` 
> to eliminate the false error from the IDE.
>  
> [https://github.com/apache/kafka/blob/trunk/core/src/test/scala/integration/kafka/coordinator/transaction/ProducerIdsIntegrationTest.scala#L42]
> [https://github.com/apache/kafka/blob/trunk/tools/src/test/java/org/apache/kafka/tools/LeaderElectionCommandTest.java#L75]
>  
> https://github.com/apache/kafka/blob/trunk/tools/src/test/java/org/apache/kafka/tools/GetOffsetShellTest.java#L68



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] KIP-1033: Add Kafka Streams exception handler for exceptions occuring during processing

2024-04-11 Thread Bruno Cadonna

Hi Matthias,


1.a
With processor node ID, I mean the ID that is exposed in the tags of 
processor node metrics. That ID cannot be internal since it is exposed 
in metrics. I think the processor name and the processor node ID is the 
same thing. I followed how the processor node ID is set in metrics and I 
ended up in addProcessor(name, ...).



1.b
Regarding ProcessingContext, I also thought about a separate class to 
pass-in context information into the handler, but then I dismissed the 
idea because I thought I was overthinking it. Apparently, I was not 
overthinking it if you also had the same idea. So let's consider a 
separate class.



4.
Regarding the metric, thanks for pointing to the dropped-record metric, 
Matthias. The dropped-record metric is used with the deserialization 
handler and the production handler. So, it would make sense to also use 
it for this handler. However, the dropped-record metric only records 
records that are skipped by the handler and not the number of calls to 
the handler. But that difference is probably irrelevant since, in the case of 
FAIL, the metric will be reset anyway because the stream thread will be 
restarted. In conclusion, I think the dropped-record metric in 
combination with a warn log message might be a better choice than 
introducing a new metric.



8.
Regarding the DSL, I think we should close possible gaps in a separate KIP.


Best,
Bruno

On 4/11/24 12:06 AM, Matthias J. Sax wrote:

Thanks for the KIP. Great discussion.

I am not sure if I understand the proposal from Bruno to hand in the 
processor node id? Isn't this internal (could not even find it quickly). 
We do have a processor name, right? Or do I mix up something?


Another question is about `ProcessingContext` -- it contains a lot of 
(potentially irrelevant?) metadata. We should think carefully about what 
we want to pass in and what not -- removing stuff is hard, but adding 
stuff is easy. It's always an option to create a new interface that only 
exposes stuff we find useful, and allows us to evolve this interface 
independent of others. Re-using an existing interface always has the 
danger of introducing an undesired coupling that could bite us in the 
future. -- It makes total sense to pass in `RecordMetadata`, but 
`ProcessingContext` (even if already limited compared to 
`ProcessorContext`) still seems to be too broad? For example, there is 
`getStateStore()` and `schedule()` methods which I think we should not 
expose.


The other interesting question is about "what record gets passed in". 
For the PAPI, passing in the Processor's input record make a lot of 
sense. However, for DSL operators, I am not 100% sure? The DSL often 
uses internal types not exposed to the user, and thus I am not sure if 
users could write useful code for this case? -- In general, I still 
agree that the handler should be implemented with a try-catch around 
`Processor.process()`, but it might not be too useful for DSL processors. 
Hence, I am wondering if we need to do something more in the DSL? I 
don't have a concrete proposal (a few high level ideas only) and if we 
don't do anything special for the DSL I am ok with moving forward with 
this KIP as-is, but we should be aware of potential limitations for DSL 
users. We can always do a follow up KIP to close gaps when we understand 
the impact better -- covering the DSL would also expand the scope of 
this KIP significantly...
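
As an illustration of that try-catch placement, a small self-contained sketch follows; the handler interface, its response enum and the dropped-records counter are stand-ins for whatever KIP-1033 finalizes, not existing Streams internals.

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.Record;

// Sketch only: ProcessingExceptionHandler here is a placeholder for the
// handler interface proposed in KIP-1033, not an existing Kafka Streams type.
interface ProcessingExceptionHandler {
    enum Response { CONTINUE, FAIL }
    Response handle(Record<?, ?> record, Exception exception);
}

final class ProcessingSketch {
    private final ProcessingExceptionHandler handler;
    private long droppedRecords = 0; // stands in for the dropped-records sensor

    ProcessingSketch(ProcessingExceptionHandler handler) {
        this.handler = handler;
    }

    // The try-catch placement discussed above: wrap the user processor call
    // and consult the handler when it throws.
    <KIn, VIn> void process(Processor<KIn, VIn, ?, ?> processor, Record<KIn, VIn> record) {
        try {
            processor.process(record);
        } catch (RuntimeException e) {
            if (handler.handle(record, e) == ProcessingExceptionHandler.Response.FAIL) {
                throw e; // fail the stream thread, as today
            }
            droppedRecords++; // CONTINUE: skip the record and count it
        }
    }
}
```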


About the metric: just to double check, do we think it's worth adding a 
new metric? Or could we re-use the existing "dropped record metric"?




-Matthias


On 4/10/24 5:11 AM, Sebastien Viale wrote:

Hi,

You are right, it will simplify types.

We update the KIP

regards

Sébastien *VIALE***

*MICHELIN GROUP* - Information Technology
*Technical Expert Kafka*

  Carmes / Bâtiment A17 4e / 63100 Clermont-Ferrand


*De :* Bruno Cadonna 
*Envoyé :* mercredi 10 avril 2024 10:38
*À :* dev@kafka.apache.org 
*Objet :* [EXT] Re: [DISCUSS] KIP-1033: Add Kafka Streams exception 
handler for exceptions occuring during processing
Warning External sender Do not click on any links or open any 
attachments unless you trust the sender and know the content is safe.


Hi Loïc, Damien, and Sébastien,

Great that we are converging!


3.
Damien and Loïc, I think in your examples the handler will receive
Record because a Record is passed to
the processor in the following code line:
https://github.com/apache/kafka/blob/044d058e03bde8179ed57994d0a77ae9bff9fc10/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorNode.java#L152
 


I see that we do not need to pass into the handler a Record just because we do that for the DeserializationExceptionHandler
and the ProductionExceptionHandler. When those 

Re: [DISCUSS] KIP-1036: Extend RecordDeserializationException exception

2024-04-11 Thread Frédérik Rouleau
Hi Kirk,

I have run the test and I confirm that checkstyle complains
(Disallowed import - org.apache.kafka.common.record.Record.) if I use
the Record class in the RecordDeserializationException.
An alternative might be to add key(), value(), headers() etc. methods
directly to the exception.

Regards,
Fred
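
A rough sketch of that alternative, with the raw key/value exposed as ByteBuffers; the class, field and accessor names are illustrative only, not the final KIP-1036 API.

```java
import java.nio.ByteBuffer;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.header.Headers;

// Sketch only: expose the raw record data on the exception itself instead of
// leaking the internal Record class through the public API.
public class RecordDeserializationExceptionSketch extends SerializationException {
    private final TopicPartition topicPartition;
    private final long offset;
    private final ByteBuffer key;
    private final ByteBuffer value;
    private final Headers headers;

    public RecordDeserializationExceptionSketch(TopicPartition topicPartition, long offset,
                                                ByteBuffer key, ByteBuffer value, Headers headers,
                                                String message, Throwable cause) {
        super(message, cause);
        this.topicPartition = topicPartition;
        this.offset = offset;
        this.key = key;
        this.value = value;
        this.headers = headers;
    }

    public TopicPartition topicPartition() { return topicPartition; }
    public long offset() { return offset; }
    public ByteBuffer key() { return key; }       // raw, undeserialized key
    public ByteBuffer value() { return value; }   // raw, undeserialized value
    public Headers headers() { return headers; }
}
```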


Re: [DISCUSS] KIP-899: Allow clients to rebootstrap

2024-04-11 Thread Chris Egerton
Hi Ivan,

I agree with Andrew that we can save cluster ID checking for later. This
feature is opt-in and if necessary we can add a note to users about only
enabling it if they can be certain that the same cluster will always be
resolved by the bootstrap servers. This would apply regardless of whether
we did client ID checking anyways.

Thanks for exploring a variety of options and ironing out the details on
this KIP. I think this is acceptable as-is but have a couple of final
suggestions we might consider:

1. Although the definition of an unavailable broker is useful ("A broker is
unavailable when the client doesn't have an established connection with it
and cannot establish a connection (e.g. due to the reconnect backoff)"), I
think this is a little too restrictive. It's useful to note this as an
example of what we may consider an unavailable broker, but if we leave some
more wiggle room, it could save us the trouble of a follow-up KIP when
tweaking behavior in the future. For example, to reduce discovery time for
a migrated Kafka cluster, it could be nice to re-bootstrap after a
connection attempt has failed for every currently-known broker with no
successful attempts in between, instead of waiting for the reconnection
backoff interval to kick in. Again, I don't think this needs to happen with
the initial implementation of the KIP, I just want us to be able to explore
options like this in the future.

2. In a similar vein, I think we can leave more room in our definition of
re-bootstrapping. Instead of "During the rebootstrap process, the client
forgets the brokers it knows about and falls back on the bootstrap brokers
(i.e. provided by bootstrap.servers provided by the client configuration)
as if it had just been initialized.", we could say something like "During
the rebootstrap process, the client attempts to re-contact the bootstrap
servers (i.e. ...) that it contacted during initialization." This could be
useful if we want to add the bootstrap servers to the previously-known list
of brokers instead of completely discarding the previously-known set. This
too can be left out of the initial implementation and just give us a bit
more room for future options.

Cheers,

Chris
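
Since the feature is opt-in, enabling it on a consumer could look like the sketch below; the `metadata.recovery.strategy` name and the `rebootstrap` value follow the KIP proposal and should be treated as assumptions until they ship in a released client.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebootstrapConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Config name/value as proposed in KIP-899; an assumption until released.
        props.put("metadata.recovery.strategy", "rebootstrap");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            consumer.poll(Duration.ofMillis(100));
        }
    }
}
```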

On Tue, Apr 9, 2024 at 11:51 AM Andrew Schofield 
wrote:

> Hi Ivan,
> I think you have to go one way or the other with the cluster ID, so I
> think removing that from this KIP might
> be the best. I think there’s another KIP waiting to be written for
> ensuring consistency of clusters, but
> I think that wouldn’t conflict at all with this one.
>
> Thanks,
> Andrew
>
> > On 9 Apr 2024, at 19:11, Ivan Yurchenko  wrote:
> >
> > Hi Andrew and all,
> >
> > I looked deeper into the code [1] and it seems the Metadata class is OK
> with cluster ID changing. So I'm thinking that the rebootstrapping
> shouldn't introduce a new failure mode here. And I should remove the
> mention of this cluster ID checks from the KIP.
> >
> > Best,
> > Ivan
> >
> > [1]
> https://github.com/apache/kafka/blob/ff90f78c700c582f9800013faad827c36b45ceb7/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L355
> >
> > On Tue, Apr 9, 2024, at 09:28, Andrew Schofield wrote:
> >> Hi Ivan,
> >> Thanks for the KIP. I can see situations in which this would be
> helpful. I have one question.
> >>
> >> The KIP says the client checks the cluster ID when it re-bootstraps and
> that it will fail if the
> >> cluster ID doesn’t match the previously known one. How does it fail?
> Which exception does
> >> it throw and when?
> >>
> >> In a similar vein, now that we are checking cluster IDs, I wonder if it
> could be extended to
> >> cover all situations in which there are cluster ID mismatches, such as
> the bootstrap server
> >> list erroneously pointing at brokers from different clusters and the
> problem only being
> >> detectable later on.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 8 Apr 2024, at 18:24, Ivan Yurchenko  wrote:
> >>>
> >>> Hello!
> >>>
> >>> I changed the KIP a bit, specifying that the certain benefit goes to
> consumers not participating in a group, but that other clients can benefit
> as well in certain situations.
> >>>
> >>> You can see the changes in the history [1]
> >>>
> >>> Thank you!
> >>>
> >>> Ivan
> >>>
> >>> [1]
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=240881396=10=11
> >>>
> >>> On 2023/07/15 16:37:52 Ivan Yurchenko wrote:
>  Hello!
> 
>  I've made several changes to the KIP based on the comments:
> 
>  1. Reduced the scope to producer and consumer clients only.
>  2. Added more details to the description of the rebootstrap process.
>  3. Documented the role of low values of reconnect.backoff.max.ms in
>  preventing rebootstrapping.
>  4. Some wording changes.
> 
>  You can see the changes in the history [1]
> 
>  I'm planning to put the KIP to a vote in some days if there are no new
>  comments.
> 
>  Thank you!
> 
>  Ivan
> 

Re: KIP-919: Allow AdminClient to Talk Directly with the KRaft Controller Quorum

2024-04-11 Thread Nelson B.
Hi Colin,

Thanks for the KIP!

I'm sorry for reopening this old discussion thread but I don't know where
else I can ask my question.

I wanted to use the newly updated kafka-configs.sh tool to dynamically
update the SSL keystore on the controller node, but I still couldn't
figure it out.
Is it possible to use kafka-configs.sh to dynamically update
controller configuration?

In my setup, I have a controller node with node id 1 listening on port
9093. Below is the output of the kafka-configs.sh tool:

[image: image.png]

As you can see, for some reason the node is not recognized as a controller and
I'm not sure why this happens. Could you please help me?

Thanks! Regards
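
For reference, a programmatic sketch of the same operation via the Admin client, assuming the `bootstrap.controllers` config from KIP-919 is available in the client version in use; the listener name, node id, address and keystore path are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ControllerKeystoreUpdateSketch {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // "bootstrap.controllers" is the KIP-919 admin client config for talking
        // directly to the controller quorum; address and node id are illustrative.
        props.put("bootstrap.controllers", "localhost:9093");

        try (Admin admin = Admin.create(props)) {
            ConfigResource controllerNode = new ConfigResource(ConfigResource.Type.BROKER, "1");
            AlterConfigOp setKeystore = new AlterConfigOp(
                new ConfigEntry("listener.name.controller.ssl.keystore.location",
                                "/path/to/new-keystore.jks"),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(controllerNode, List.of(setKeystore)))
                 .all()
                 .get();
        }
    }
}
```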

On Wed, Jul 26, 2023 at 3:43 AM Colin McCabe  wrote:

> On Tue, Jul 25, 2023, at 05:30, Luke Chen wrote:
> > Hi Colin,
> >
> > Some more comments:
> > 1. In the KIP, we mentioned "controller heartbeats", but it is not
> > explained anywhere.
> > I think "controller heartbeats" = controller registration", is that
> > correct?
> > If no, please explain more about it in the KIP.
>
> Hi Luke,
>
> Sorry, this was an artifact of earlier revisions. I have changed it to
> "ControllerRegistration" in all the cases where I didn't update it before.
>
> >
> > 2. Following this question:
> > > Which endpoint will the inactive controllers use to send the
> > > ControllerRegistrationRequest?
> > > A: They will use the endpoint in controller.quorum.voters.
> > If the registration request will include controller.quorum.voters, why
> > bother sending this information to active controller again?
> > The active controller should already have all the
> controller.quorum.voters
> > when start up.
> > Any purpose of that design? For validation?
>
> The controllers don't know which endpoint controller.quorum.voters is
> referencing.
>
> >
> > 3. If a controller node is not part of `controller.quorum.voters`, when
> it
> > sends ControllerRegistrationRequest, what will we respond to it?
> >
>
> Good point. I added a new error, UNKNOWN_CONTROLLER_ID, for this case.
>
> > 4. Nice and clear compatibility matrix!
> >
>
> Thanks!
> Colin
>
> > Thank you.
> > Luke
> >
> > On Sat, Jul 22, 2023 at 3:38 AM Colin McCabe  wrote:
> >
> >> On Fri, Jul 21, 2023, at 09:43, José Armando García Sancio wrote:
> >> > Thanks for the KIP Colin. Apologies if some of these points have
> >> > already been made. I have not followed the discussion closely:
> >> >
> >> > 1. Re: Periodically, each controller will check that the controller
> >> > registration for its ID is as expected
> >> >
> >> > Does this need to be periodic? Can't the controller schedule this RPC,
> >> > retry etc, when it finds that the incarnation ID doesn't match its
> >> > own?
> >> >
> >>
> >> Hi José,
> >>
> >> Thanks for the reviews.
> >>
> >> David had the same question. I agree that it should be event-driven
> rather
> >> than periodic (except for retries, etc.)
> >>
> >> >
> >> > 2. Did you consider including the active controller's epoch in the
> >> > ControllerRegistrationRequest?
> >> >
> >> > This would allow the active controller to reject registration from
> >> > controllers that are not part of the active quorum and don't know the
> >> > latest controller epoch. This should mitigate some of the concerns you
> >> > raised in bullet point 1.
> >> >
> >>
> >> Good idea. I will add the active controller epoch to the registration
> >> request.
> >>
> >> >
> >> > 3. Which endpoint will the inactive controllers use to send the
> >> > ControllerRegistrationRequest?
> >> >
> >> > Will it use the first endpoint described in the cluster metadata
> >> > controller registration record? Or would it use the endpoint described
> >> > in the server configuration at controller.quorum.voters?
> >> >
> >>
> >> They will use the endpoint in controller.quorum.voters. In general, the
> >> endpoints from the registration are only used for responding to
> >> DESCRIBE_CLUSTER. Since, after all, we may not even have the
> registration
> >> endpoints when we start up.
> >>
> >> >
> >> > 4. Re: Raft integration in the rejected alternatives
> >> >
> >> > Yes, The KRaft layer needs to solve a similar problem like endpoint
> >> > discovery to support dynamic controller membership change. As you
> >> > point out the requirements are different and the set of information
> >> > that needs to be tracked is different. I think it is okay to use a
> >> > different solution for each of these problems.
> >>
> >> Yeah that was my feeling too. Thanks for taking a look.
> >>
> >> regards,
> >> Colin
> >>
> >> >
> >> > Thanks!
> >> > --
> >> > -José
> >>
>


Re: [DISCUSS] KIP-1036: Extend RecordDeserializationException exception

2024-04-11 Thread Frédérik Rouleau
Thanks for your feedback.

Kirk, that's a good point. I will check if there are other ways an
exception can be raised than by the deserialisation itself.
About Record, I agree that it would be a better choice, and my initial
version was using it. But then I realised that this class might not be
meant to be exposed; at least I had some errors from checkstyle. That solution
would also reduce GC pressure by avoiding the allocation of unneeded byte
arrays when you do not use the record.
Maybe someone can confirm that there is no issue with using the Record class.
Matthias, thanks for your comment. Unfortunately I will have to find
someone to do the changes for me as I was not able to create an account on
the wiki.

Regards,
Fred