Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2287

2023-10-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 315757 lines...]
Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testConditionalUpdatePath() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testConditionalUpdatePath() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetAllTopicsInClusterTriggersWatch() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetAllTopicsInClusterTriggersWatch() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeleteTopicZNode() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeleteTopicZNode() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeletePath() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeletePath() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetBrokerMethods() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetBrokerMethods() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testCreateTokenChangeNotification() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testCreateTokenChangeNotification() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetTopicsAndPartitions() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetTopicsAndPartitions() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testChroot(boolean) > [1] createChrootIfNecessary=true STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testChroot(boolean) > [1] createChrootIfNecessary=true PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testChroot(boolean) > [2] createChrootIfNecessary=false STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testChroot(boolean) > [2] createChrootIfNecessary=false PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testRegisterBrokerInfo() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testRegisterBrokerInfo() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testRetryRegisterBrokerInfo() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testRetryRegisterBrokerInfo() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testConsumerOffsetPath() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testConsumerOffsetPath() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeleteRecursiveWithControllerEpochVersionCheck() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeleteRecursiveWithControllerEpochVersionCheck() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testTopicAssignments() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testTopicAssignments() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testControllerManagementMethods() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testControllerManagementMethods() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testTopicAssignmentMethods() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testTopicAssignmentMethods() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testConnectionViaNettyClient() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testConnectionViaNettyClient() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testPropagateIsrChanges() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testPropagateIsrChanges() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testControllerEpochMethods() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testControllerEpochMethods() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeleteRecursive() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testDeleteRecursive() PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > KafkaZkClientTest > 
testGetTopicPartitionStates() STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > 

Re: [DISCUSS] KIP-988 Streams Standby Task Update Listener

2023-10-13 Thread Colt McNealy
Hello there!

Thanks everyone for the comments. There's a lot of back-and-forth going on,
so I'll do my best to summarize what everyone's said in TLDR format:

1. Rename `onStandbyUpdateStart()` -> `onUpdateStart()`,  and do similarly
for the other methods.
2. Keep `SuspendReason.PROMOTED` and `SuspendReason.MIGRATED`.
3. Remove the `earliestOffset` parameter for performance reasons.
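To make the summary concrete, here is a rough, self-contained sketch of what the listener could look like after those renames. The signatures and parameter lists are my own illustration (using plain topic/partition arguments instead of Kafka's `TopicPartition`), not the final KIP text:

```java
// Illustrative sketch of the renamed listener from this thread; parameter
// lists are guesses based on the discussion, not the final KIP-988 interface.
public class StandbyListenerSketch {

    // Point 2: both suspend reasons are kept.
    public enum SuspendReason { MIGRATED, PROMOTED }

    public interface StandbyUpdateListener {
        // Point 1: was onStandbyUpdateStart(); point 3: no earliestOffset parameter.
        void onUpdateStart(String topic, int partition, String storeName, long startingOffset);

        // Called when updating is suspended, e.g. the standby is promoted to active.
        void onUpdateSuspended(String topic, int partition, String storeName,
                               long storedOffset, SuspendReason reason);
    }

    // Tiny demo implementation that records the last callback it saw.
    public static String lastEvent = "";

    public static void main(String[] args) {
        StandbyUpdateListener listener = new StandbyUpdateListener() {
            public void onUpdateStart(String topic, int partition, String store, long offset) {
                lastEvent = "start:" + store + "@" + offset;
            }
            public void onUpdateSuspended(String topic, int partition, String store,
                                          long storedOffset, SuspendReason reason) {
                lastEvent = "suspended:" + store + ":" + reason;
            }
        };
        listener.onUpdateStart("orders-changelog", 0, "orders-store", 42L);
        listener.onUpdateSuspended("orders-changelog", 0, "orders-store", 100L, SuspendReason.PROMOTED);
        System.out.println(lastEvent);
    }
}
```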

If that's all fine with everyone, I'll update the KIP and we—well, mostly
Edu (:  —will open a PR.

Cheers,
Colt McNealy

*Founder, LittleHorse.dev*


On Fri, Oct 13, 2023 at 7:58 PM Eduwer Camacaro 
wrote:

> Hello everyone,
>
> Thanks for all your feedback for this KIP!
>
> I think that the key to choosing proper names for this API is understanding
> the terms used inside the StoreChangelogReader. Currently, this class has
> two possible states: ACTIVE_RESTORING and STANDBY_UPDATING. In my opinion,
> using StandbyUpdateListener for the interface fits these terms better.
> The same applies to onUpdateStart/Suspended.
>
> StoreChangelogReader uses "the same mechanism" for active task restoration
> and standby task updates, but this is an implementation detail. Under
> normal circumstances (no rebalances or task migrations), the changelog
> reader will be in STANDBY_UPDATING, which means it will be updating standby
> tasks as long as there are new records in the changelog topic. That's why I
> prefer onStandbyUpdated instead of onBatchUpdated, even if it doesn't 100%
> align with StateRestoreListener, but either one is fine.
>
> Edu
>
> On Fri, Oct 13, 2023 at 8:53 PM Guozhang Wang 
> wrote:
>
> > Hello Colt,
> >
> > Thanks for writing the KIP! I have read through the updated KIP and
> > overall it looks great. I only have minor naming comments (well,
> > isn't naming the least boring stuff to discuss, and the thing that takes
> > most of the time for KIPs :P):
> >
> > 1. I tend to agree with Sophie regarding whether or not to include
> > "Standby" in the functions of "onStandbyUpdateStart/Suspended", since
> > it is also more consistent with the functions of
> > "StateRestoreListener" where we do not name it as
> > "onStateRestoreState" etc.
> >
> > 2. I know in community discussions we sometimes say "a standby is
> > promoted to active", but in the official code / java docs we did not
> > have a term of "promotion", since what the code does is really recycle
> > the task (while keeping its state stores open), and create a new
> > active task that takes in the recycled state stores and just changes
> > the other fields like task type etc. After thinking about this for a
> > bit, I tend to feel that "promoted" is indeed a better name for user
> > facing purposes while "recycle" is more of a technical detail inside
> > the code and could be abstracted away from users. So I feel keeping
> > the name "PROMOTED" is fine.
> >
> > 3. Regarding "earliestOffset", it does feel like we cannot always
> > avoid another call to the Kafka API. And on the other hand, I also
> > tend to think that such bookkeeping may be better done at the app
> > level than from the Streams' public API level. I.e. the app could keep
> > a "first ever starting offset" per "topic-partition-store" key, and
> > when we have a rolling restart and hence some standby task keeps
> > "jumping" from one client to another via task assignment, the app
> > would update this value just once, when it finds the
> > "topic-partition-store" was never triggered before. What do you
> > think?
> >
> > 4. I do not have a strong opinion either, but what about
> "onBatchUpdated" ?
> >
> >
> > Guozhang
> >
> > On Wed, Oct 11, 2023 at 9:31 PM Colt McNealy 
> wrote:
> > >
> > > Sophie—
> > >
> > > Thank you very much for such a detailed review of the KIP. It might
> > > actually be longer than the original KIP in the first place!
> > >
> > > 1. Ack'ed and fixed.
> > >
> > > 2. Correct, this is a confusing passage and requires context:
> > >
> > > One thing on our list of TODO's regarding reliability is to determine
> how
> > > to configure `session.timeout.ms`. In our Kubernetes Environment, an
> > > instance of our Streams App can be terminated, restarted, and get back
> > into
> > > the "RUNNING" Streams state in about 20 seconds. We have two options
> > here:
> > > a) set session.timeout.ms to 30 seconds or so, and deal with 20
> seconds
> > of
> > > unavailability for affected partitions, but avoid shuffling Tasks; or
> b)
> > > set session.timeout.ms to a low value, such as 6 seconds (
> > > heartbeat.interval.ms of 2000), and reduce the unavailability window
> > during
> > > a rolling bounce but incur an "extra" rebalance. There are several
> > > different costs to a rebalance, including the shuffling of standby
> tasks.
> > > JMX metrics are not fine-grained enough to give us an accurate picture
> of
> > > what's going on with the whole Standby Task Shuffle Dance. I
> hypothesize
> > > that the Standby Update Listener might help us clarify just how the
> > > shuffling actually (not 

Re: [DISCUSS] KIP-988 Streams Standby Task Update Listener

2023-10-13 Thread Eduwer Camacaro
Hello everyone,

Thanks for all your feedback for this KIP!

I think that the key to choosing proper names for this API is understanding
the terms used inside the StoreChangelogReader. Currently, this class has
two possible states: ACTIVE_RESTORING and STANDBY_UPDATING. In my opinion,
using StandbyUpdateListener for the interface fits these terms better.
The same applies to onUpdateStart/Suspended.

StoreChangelogReader uses "the same mechanism" for active task restoration
and standby task updates, but this is an implementation detail. Under
normal circumstances (no rebalances or task migrations), the changelog
reader will be in STANDBY_UPDATING, which means it will be updating standby
tasks as long as there are new records in the changelog topic. That's why I
prefer onStandbyUpdated instead of onBatchUpdated, even if it doesn't 100%
align with StateRestoreListener, but either one is fine.

Edu

On Fri, Oct 13, 2023 at 8:53 PM Guozhang Wang 
wrote:

> Hello Colt,
>
> Thanks for writing the KIP! I have read through the updated KIP and
> overall it looks great. I only have minor naming comments (well,
> isn't naming the least boring stuff to discuss, and the thing that takes
> most of the time for KIPs :P):
>
> 1. I tend to agree with Sophie regarding whether or not to include
> "Standby" in the functions of "onStandbyUpdateStart/Suspended", since
> it is also more consistent with the functions of
> "StateRestoreListener" where we do not name it as
> "onStateRestoreState" etc.
>
> 2. I know in community discussions we sometimes say "a standby is
> promoted to active", but in the official code / java docs we did not
> have a term of "promotion", since what the code does is really recycle
> the task (while keeping its state stores open), and create a new
> active task that takes in the recycled state stores and just changes
> the other fields like task type etc. After thinking about this for a
> bit, I tend to feel that "promoted" is indeed a better name for user
> facing purposes while "recycle" is more of a technical detail inside
> the code and could be abstracted away from users. So I feel keeping
> the name "PROMOTED" is fine.
>
> 3. Regarding "earliestOffset", it does feel like we cannot always
> avoid another call to the Kafka API. And on the other hand, I also
> tend to think that such bookkeeping may be better done at the app
> level than from the Streams' public API level. I.e. the app could keep
> a "first ever starting offset" per "topic-partition-store" key, and
> when we have a rolling restart and hence some standby task keeps
> "jumping" from one client to another via task assignment, the app
> would update this value just once, when it finds the
> "topic-partition-store" was never triggered before. What do you
> think?
>
> 4. I do not have a strong opinion either, but what about "onBatchUpdated" ?
>
>
> Guozhang
>
> On Wed, Oct 11, 2023 at 9:31 PM Colt McNealy  wrote:
> >
> > Sophie—
> >
> > Thank you very much for such a detailed review of the KIP. It might
> > actually be longer than the original KIP in the first place!
> >
> > 1. Ack'ed and fixed.
> >
> > 2. Correct, this is a confusing passage and requires context:
> >
> > One thing on our list of TODO's regarding reliability is to determine how
> > to configure `session.timeout.ms`. In our Kubernetes Environment, an
> > instance of our Streams App can be terminated, restarted, and get back
> into
> > the "RUNNING" Streams state in about 20 seconds. We have two options
> here:
> > a) set session.timeout.ms to 30 seconds or so, and deal with 20 seconds
> of
> > unavailability for affected partitions, but avoid shuffling Tasks; or b)
> > set session.timeout.ms to a low value, such as 6 seconds (
> > heartbeat.interval.ms of 2000), and reduce the unavailability window
> during
> > a rolling bounce but incur an "extra" rebalance. There are several
> > different costs to a rebalance, including the shuffling of standby tasks.
> > JMX metrics are not fine-grained enough to give us an accurate picture of
> > what's going on with the whole Standby Task Shuffle Dance. I hypothesize
> > that the Standby Update Listener might help us clarify just how the
> > shuffling actually (not theoretically) works, which will help us make a
> > more informed decision about the session timeout config.
> >
> > If you think this is worth putting in the KIP, I'll polish it and do so;
> > else, I'll remove the current half-baked explanation.
> >
> > 3. Overall, I agree with this. In our app, each Task has only one Store
> to
> > reduce the number of changelog partitions, so I sometimes forget the
> > distinction between the two concepts, as reflected in the KIP (:
> >
> > 3a. I don't like the word "Restore" here, since Restoration refers to an
> > Active Task getting caught up in preparation to resume processing.
> > `StandbyUpdateListener` is fine by me; I have updated the KIP. I am a
> > native Python speaker so I do prefer shorter names anyways (:
> >
> > 3b1. +1 to 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Guozhang Wang
Hello Nick,

First of all, thanks a lot for the great effort you've put in driving
this KIP! I really like it coming through finally, as many people in
the community have raised this. At the same time I honestly feel a bit
ashamed for not putting enough of my time supporting it and pushing it
through the finish line (you raised this KIP almost a year ago).

I briefly passed through the DISCUSS thread so far, and I'm not sure I've
100 percent digested all the bullet points. But with the goal of helping
take it through the finish line, I'd like to throw out some
off-the-top-of-my-head thoughts, only on point #4 above, which I felt may
be the main hurdle for the current KIP to reach a consensus now.

The general question I asked myself is whether we want to couple "IQ
reading mode" with "processing mode". While technically I tend to
agree with you that it feels like a bug if a single user chose
"EOS" for processing mode while choosing "read uncommitted" for IQ
reading mode, at the same time I'm wondering if it's possible that
there could be two different people (or even two teams) that would be
using the Streams API to build the app and the IQ API to query the
running state of the app. I know this is less of a technical question
and more of a design one, but if that could ever be the case, I'm
wondering whether the person using the IQ API knows about the risks of
using read uncommitted but still chose it in favor of
performance, no matter whether the underlying stream processing mode
configured by another person is EOS or not. In that regard, I'm
leaning towards a "leave the door open, and close it later if we
find it's a bad idea" approach, with a configuration that we can
potentially deprecate, rather than "shut the door, clean for everyone".
More specifically: allow the processing mode / IQ read mode to be
decoupled, and if we find that there are no such cases as I speculated
above, or people start complaining a lot, we can still enforce
coupling them.
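If the two modes stay decoupled as suggested, the configuration could end up looking something like the following. This is only a sketch: `processing.guarantee` is the existing Streams config, while `default.state.isolation.level` is the config name floated in this thread and is not final, nor are the values:

```properties
# Stream processing mode (existing Kafka Streams config)
processing.guarantee=exactly_once_v2

# Proposed in this thread (name/values not final, potentially deprecatable later):
# the IQ user can still opt into uncommitted reads for performance,
# independently of the processing guarantee above.
default.state.isolation.level=READ_UNCOMMITTED
```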

Again, just my 2c here. Thanks again for the great patience and
diligence on this KIP.


Guozhang



On Fri, Oct 13, 2023 at 8:48 AM Nick Telford  wrote:
>
> Hi Bruno,
>
> 4.
> I'll hold off on making that change until we have a consensus as to what
> configuration to use to control all of this, as it'll be affected by the
> decision on EOS isolation levels.
>
> 5.
> Done. I've chosen "committedOffsets".
>
> Regards,
> Nick
>
> On Fri, 13 Oct 2023 at 16:23, Bruno Cadonna  wrote:
>
> > Hi Nick,
> >
> > 1.
> > Yeah, you are probably right that it does not make too much sense.
> > Thanks for the clarification!
> >
> >
> > 4.
> > Yes, sorry for the back and forth, but I think for the sake of the KIP
> > it is better to leave the ALOS behavior as it is for now due to the
> > possible issues you would run into. Maybe we can find a solution in the
> > future. Now the question returns to whether we really need
> > default.state.isolation.level. Maybe the config could be the feature
> > flag Sophie requested.
> >
> >
> > 5.
> > There is a guideline in Kafka not to use the get prefix for getters (at
> > least in the public API). Thus, could you please rename
> >
> > getCommittedOffset(TopicPartition partition) ->
> > committedOffsetFor(TopicPartition partition)
> >
> > You can also propose an alternative to committedOffsetFor().
> >
> >
> > Best,
> > Bruno
> >
> >
> > On 10/13/23 3:21 PM, Nick Telford wrote:
> > > Hi Bruno,
> > >
> > > Thanks for getting back to me.
> > >
> > > 1.
> > > I think this should be possible. Are you thinking of the situation where
> > a
> > > user may downgrade to a previous version of Kafka Streams? In that case,
> > > sadly, the RocksDBStore would get wiped by the older version of Kafka
> > > Streams anyway, because that version wouldn't understand the extra column
> > > family (that holds offsets), so the missing Position file would
> > > automatically get rebuilt when the store is rebuilt from the changelog.
> > > Are there other situations than downgrade where a transactional store
> > could
> > > be replaced by a non-transactional one? I can't think of any.
> > >
> > > 2.
> > > Ahh yes, the Test Plan - my Kryptonite! This section definitely needs to
> > be
> > > fleshed out. I'll work on that. How much detail do you need?
> > >
> > > 3.
> > > See my previous email discussing this.
> > >
> > > 4.
> > > Hmm, this is an interesting point. Are you suggesting that under ALOS
> > > READ_COMMITTED should not be supported?
> > >
> > > Regards,
> > > Nick
> > >
> > > On Fri, 13 Oct 2023 at 13:52, Bruno Cadonna  wrote:
> > >
> > >> Hi Nick,
> > >>
> > >> I think the KIP is converging!
> > >>
> > >>
> > >> 1.
> > >> I am wondering whether it makes sense to write the position file during
> > >> close as we do for the checkpoint file, so that in case the state store
> > >> is replaced with a non-transactional state store the non-transactional
> > >> state store finds the position file. I think this is not strictly
> > >> needed, but would be a 

Re: [DISCUSS] KIP-988 Streams Standby Task Update Listener

2023-10-13 Thread Guozhang Wang
Hello Colt,

Thanks for writing the KIP! I have read through the updated KIP and
overall it looks great. I only have minor naming comments (well,
isn't naming the least boring stuff to discuss, and the thing that takes
most of the time for KIPs :P):

1. I tend to agree with Sophie regarding whether or not to include
"Standby" in the functions of "onStandbyUpdateStart/Suspended", since
it is also more consistent with the functions of
"StateRestoreListener" where we do not name it as
"onStateRestoreState" etc.

2. I know in community discussions we sometimes say "a standby is
promoted to active", but in the official code / java docs we did not
have a term of "promotion", since what the code does is really recycle
the task (while keeping its state stores open), and create a new
active task that takes in the recycled state stores and just changes
the other fields like task type etc. After thinking about this for a
bit, I tend to feel that "promoted" is indeed a better name for user
facing purposes while "recycle" is more of a technical detail inside
the code and could be abstracted away from users. So I feel keeping
the name "PROMOTED" is fine.

3. Regarding "earliestOffset", it does feel like we cannot always
avoid another call to the Kafka API. And on the other hand, I also
tend to think that such bookkeeping may be better done at the app
level than from the Streams' public API level. I.e. the app could keep
a "first ever starting offset" per "topic-partition-store" key, and
when we have a rolling restart and hence some standby task keeps
"jumping" from one client to another via task assignment, the app
would update this value just once, when it finds the
"topic-partition-store" was never triggered before. What do you
think?

4. I do not have a strong opinion either, but what about "onBatchUpdated" ?


Guozhang

On Wed, Oct 11, 2023 at 9:31 PM Colt McNealy  wrote:
>
> Sophie—
>
> Thank you very much for such a detailed review of the KIP. It might
> actually be longer than the original KIP in the first place!
>
> 1. Ack'ed and fixed.
>
> 2. Correct, this is a confusing passage and requires context:
>
> One thing on our list of TODO's regarding reliability is to determine how
> to configure `session.timeout.ms`. In our Kubernetes Environment, an
> instance of our Streams App can be terminated, restarted, and get back into
> the "RUNNING" Streams state in about 20 seconds. We have two options here:
> a) set session.timeout.ms to 30 seconds or so, and deal with 20 seconds of
> unavailability for affected partitions, but avoid shuffling Tasks; or b)
> set session.timeout.ms to a low value, such as 6 seconds (
> heartbeat.interval.ms of 2000), and reduce the unavailability window during
> a rolling bounce but incur an "extra" rebalance. There are several
> different costs to a rebalance, including the shuffling of standby tasks.
> JMX metrics are not fine-grained enough to give us an accurate picture of
> what's going on with the whole Standby Task Shuffle Dance. I hypothesize
> that the Standby Update Listener might help us clarify just how the
> shuffling actually (not theoretically) works, which will help us make a
> more informed decision about the session timeout config.
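For reference, the two options sketched in the quoted passage correspond roughly to these consumer-level settings. The values are the ones quoted in the message, not recommendations:

```properties
# Option (a): tolerate ~20s of unavailability after a bounce, but avoid
# shuffling tasks via an extra rebalance.
session.timeout.ms=30000

# Option (b): shrink the unavailability window during a rolling bounce,
# at the cost of an "extra" rebalance (and its standby-task shuffling).
#session.timeout.ms=6000
#heartbeat.interval.ms=2000
```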
>
> If you think this is worth putting in the KIP, I'll polish it and do so;
> else, I'll remove the current half-baked explanation.
>
> 3. Overall, I agree with this. In our app, each Task has only one Store to
> reduce the number of changelog partitions, so I sometimes forget the
> distinction between the two concepts, as reflected in the KIP (:
>
> 3a. I don't like the word "Restore" here, since Restoration refers to an
> Active Task getting caught up in preparation to resume processing.
> `StandbyUpdateListener` is fine by me; I have updated the KIP. I am a
> native Python speaker so I do prefer shorter names anyways (:
>
> 3b1. +1 to removing the word 'Task'.
>
> 3b2. I like `onUpdateStart()`, but with your permission I'd prefer
> `onStandbyUpdateStart()` which matches the name of the Interface
> "StandbyUpdateListener". (the python part of me hates this, however)
>
> 3b3. Going back to question 2), `earliestOffset` was intended to let us
> more easily calculate the amount of state _already loaded_ in the store
> by computing (startingOffset - earliestOffset). This would help us see
> how much inefficiency is introduced by a rolling restart: if we go from
> having an up-to-date standby before the restart to the Task being
> shuffled onto an instance with no previous state after the restart, that
> is expensive. However, if the final shuffling puts the Task back on an
> instance with a lot of pre-built state, it's not expensive.
>
> If a call over the network is required to determine the earliestOffset,
> then this is a "hard no-go" for me, and we will remove it (I'll have to
> check with Eduwer as he is close to having a working implementation). I
> think we can probably determine 

Re: [DISCUSS] KIP-968: Support single-key_multi-timestamp interactive queries (IQv2) for versioned state stores

2023-10-13 Thread Guozhang Wang
Thanks Alieh for the KIP, as well as a nice summary of all the
discussions! Just my 2c regarding your open questions:

1. I just checked KIP-889
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-889%3A+Versioned+State+Stores)
and we used "VersionedRecord<V> get(K key, long asOfTimestamp);", so I
feel being consistent with this API is better than being
consistent with "WindowKeyQuery"?

3. I agree with Matthias that naming is always tricky, and I also tend
to prefer being consistent with what we already have (even if in
retrospect it may not be the best idea :P, and if it really became a
complaint, we could change it across the board in one shot later).

Guozhang

On Wed, Oct 11, 2023 at 9:12 PM Matthias J. Sax  wrote:
>
> Thanks for the update!
>
>
>
> Some thoughts about changes you made and open questions you raised:
>
>
> 10) About asOf vs until: I was just looking into `WindowKeyQuery`,
> `WindowRangeQuery` and also `ReadOnlyWindowStore` interfaces. For those,
> we use "timeFrom" and "timeTo", so it seems best to actually use
> `to(Instant toTime)` to keep the naming consistent across the board?
>
> If yes, we should also do `from(Instant fromTime)` and use getters
> `fromTime()` and `toTime()` -- given that it's range bounds it seems
> acceptable to me, to diverge a little bit from KIP-960 `asOfTimestamp()`
> -- but we could also rename it to `asOfTime()`? -- Given that we
> strongly type with `Instant` I am not worried about semantic ambiguity.
>
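To make the naming proposal above concrete, a hypothetical builder for the multi-version key query could expose time bounds like this. The class and method names are only the ones floated in this thread, not the real API, and `null` is treated as "no bound" (matching the KIP-941 semantics discussed below in point 20):

```java
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch of the from(Instant)/to(Instant) naming proposal with
// fromTime()/toTime() getters; not the actual Kafka Streams API.
public final class MultiVersionKeyQuerySketch<K> {
    private final K key;
    private final Optional<Instant> fromTime;
    private final Optional<Instant> toTime;

    private MultiVersionKeyQuerySketch(K key, Optional<Instant> from, Optional<Instant> to) {
        this.key = key;
        this.fromTime = from;
        this.toTime = to;
    }

    public static <K> MultiVersionKeyQuerySketch<K> withKey(K key) {
        return new MultiVersionKeyQuerySketch<>(key, Optional.empty(), Optional.empty());
    }

    // null is interpreted as "no bound" rather than throwing an NPE.
    public MultiVersionKeyQuerySketch<K> from(Instant fromTime) {
        return new MultiVersionKeyQuerySketch<>(key, Optional.ofNullable(fromTime), toTime);
    }

    public MultiVersionKeyQuerySketch<K> to(Instant toTime) {
        return new MultiVersionKeyQuerySketch<>(key, fromTime, Optional.ofNullable(toTime));
    }

    // Getters named consistently with WindowKeyQuery's timeFrom/timeTo terminology.
    public Optional<Instant> fromTime() { return fromTime; }
    public Optional<Instant> toTime() { return toTime; }

    public static void main(String[] args) {
        MultiVersionKeyQuerySketch<String> q = MultiVersionKeyQuerySketch
            .withKey("k").from(Instant.ofEpochMilli(100)).to(Instant.ofEpochMilli(200));
        System.out.println(q.fromTime().get().toEpochMilli() + "-" + q.toTime().get().toEpochMilli());
    }
}
```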
>
>
> 20) About throwing a NPE when time bounds are `null` -- why? (For the
> key it makes sense as is mandatory to have a key.) Could we not
> interpret `null` as "no bound". We did KIP-941 to add `null` for
> open-ended `RangeQueries`, so I am wondering if we should just stick to
> the same semantics?
>
> Cf
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-941%3A+Range+queries+to+accept+null+lower+and+upper+bounds
>
>
>
> 30) About the class naming. That's always tricky, and I am not married
> to my proposal. I agree with Bruno that the other suggested names are
> not really better. -- The underlying idea was, to get some "consistent"
> naming across the board.
>
> Existing `KeyQuery`
> New `VersionedKeyQuery` (KIP-960; we add a prefix)
> New `MultiVersionKeyQuery` (this KIP; extend the prefix with a pre-prefix)
>
> Existing `RangeQuery`
> New `MultiVersionRangeQuery` (KIP-969; add same prefix as above)
>
>
>
> 40) I am fine with not adding `range(from, to)` -- it was just an idea.
>
>
>
>
>
> Some more follow up question:
>
> 50) You propose to add a new constructor and getter to `VersionedRecord`
> -- I am wondering if this implies that `validTo` is optional because the
> existing constructor is not deprecated? -- Also, what happens if
> `validTo` is not set and `validTo()` is called? Or do we intend to make
> `validTo` mandatory?
>
> Maybe this question can only be answered when working on the code, but I
> am wondering if we should make `validTo` mandatory or not... And what
> the "blast radius" of changing `VersionedRecord` will be in general. Do
> you already have some POC PR that we could look at to get some signals
> about this?
>
>
>
> 60) The new query class is defined to return
> `ValueIterator<VersionedRecord<V>>` -- while I like the idea to add
> `ValueIterator<T>` in a generic way on the one hand, I am wondering if
> it might be better to change it, and enforce its usage (ie, return type)
> of `VersionedRecord` to improve type safety (type erasure is often a
> pain, and we could mitigate it this way).
>
> Btw: We actually do a similar thing for `KeyValueIterator`.
>
> Ie,
>
> public interface ValueIterator<V> extends Iterator<ValueAndTimestamp<V>>
>
> and
>
> ValueAndTimestamp<V> peek();
>
> This would imply that the return type of the new query is
> `ValueIterator<V>` on the interface, which seems simpler and more elegant?
>
> If we go with the change, I am also wondering if we need to find a
> better name for the new iterator class? Maybe `VersionIterator` or
> something like this?
>
> Of course it might limit the use of `ValueIterator` for other value
> types -- not sure if this is a limitation that is prohibitive? My gut
> feeling is that it should not be too limiting.
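As a self-contained illustration of the specialization idea in point 60, here is a compilable sketch using local stand-in types rather than the actual Kafka Streams classes, so the element type is fixed by the iterator itself:

```java
import java.util.Iterator;
import java.util.List;

// Self-contained sketch of a value iterator specialized to one element type,
// using local stand-in types rather than the real Kafka Streams classes.
public class ValueIteratorSketch {

    // Stand-in for a versioned record: a value plus its validity timestamp.
    public record Versioned<V>(V value, long timestamp) {}

    // The specialization: the element type is fixed to Versioned<V>, so callers
    // get type-safe peek()/next() with a single generic parameter V.
    public interface ValueIterator<V> extends Iterator<Versioned<V>> {
        Versioned<V> peek();
    }

    public static ValueIterator<String> over(List<Versioned<String>> records) {
        Iterator<Versioned<String>> it = records.iterator();
        return new ValueIterator<>() {
            private Versioned<String> buffered;
            public boolean hasNext() { return buffered != null || it.hasNext(); }
            public Versioned<String> next() {
                if (buffered != null) { Versioned<String> r = buffered; buffered = null; return r; }
                return it.next();
            }
            public Versioned<String> peek() {
                if (buffered == null) buffered = it.next();  // peek buffers, does not consume
                return buffered;
            }
        };
    }

    public static void main(String[] args) {
        ValueIterator<String> it = over(List.of(new Versioned<>("v1", 10L), new Versioned<>("v2", 20L)));
        System.out.println(it.peek().value());
        System.out.println(it.next().value());
        System.out.println(it.next().value());
    }
}
```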
>
>
>
>
> 70) Do we really need the change to `VersionedKeyValueStore` that adds a
> new method? In the end, the idea of IQv2 is to avoid exactly this... It
> was the main issue with IQv1 that the base interface of the store needed
> an update, and thus all classes implementing the base interface, making
> it very cumbersome to add new query types. -- Of course, we need this
> new method on the actual implementation (as a private method) that can
> be called from the `query()` method, but adding it to the interface
> seems to defeat the purpose of IQv2.
>
> Note, for existing IQv2 query types that go against other stores, the
> public methods already existed when IQv2 was introduced, and thus the
> implementation of these query types just pragmatically re-used existing
> methods -- but it does not imply that new 

Re: [VOTE] KIP-714: Client metrics and observability

2023-10-13 Thread Jun Rao
Hi, Andrew,

Thanks for the KIP. +1 from me too.

Jun

On Wed, Oct 11, 2023 at 4:00 PM Sophie Blee-Goldman 
wrote:

> This looks great! +1 (binding)
>
> Sophie
>
> On Wed, Oct 11, 2023 at 1:46 PM Matthias J. Sax  wrote:
>
> > +1 (binding)
> >
> > On 9/13/23 5:48 PM, Jason Gustafson wrote:
> > > Hey Andrew,
> > >
> > > +1 on the KIP. For many users of Kafka, it may not be fully understood
> > how
> > > much of a challenge client monitoring is. With tens of clients in a
> > > cluster, it is already difficult to coordinate metrics collection. When
> > > there are thousands of clients, and when the cluster operator has no
> > > control over them, it is essentially impossible. For the fat clients
> that
> > > we have, the lack of useful telemetry is a huge operational gap.
> > > Consistency between clients has also been a major challenge. I think
> the
> > > effort toward standardization in this KIP will have some positive
> impact
> > > even in deployments which have effective client-side monitoring.
> > Overall, I
> > > think this proposal will provide a lot of value across the board.
> > >
> > > Best,
> > > Jason
> > >
> > > On Wed, Sep 13, 2023 at 9:50 AM Philip Nee 
> wrote:
> > >
> > >> Hey Andrew -
> > >>
> > >> Thank you for taking the time to reply to my questions. I'm just
> adding
> > >> some notes to this discussion.
> > >>
> > >> 1. epoch: It can be helpful to know the delta between the client-side
> > >> and the actual leader epoch. It helps explain why sometimes a commit
> > >> fails or a client is not making progress.
> > >> 2. Client connection: If the client selects the "wrong" connection to
> > push
> > >> out the data, I assume the request would timeout; which should lead to
> > >> disconnecting from the node and reselecting another node as you
> > mentioned,
> > >> via the least loaded node.
> > >>
> > >> Cheers,
> > >> P
> > >>
> > >>
> > >> On Tue, Sep 12, 2023 at 10:40 AM Andrew Schofield <
> > >> andrew_schofield_j...@outlook.com> wrote:
> > >>
> > >>> Hi Philip,
> > >>> Thanks for your vote and interest in the KIP.
> > >>>
> > >>> KIP-714 does not introduce any new client metrics, and that’s
> > >> intentional.
> > >>> It does
> > >>> describe how all of the client metrics can have their names
> > transformed
> > >>> into
> > >>> equivalent "telemetry metric names”, and then potentially used in
> > metrics
> > >>> subscriptions.
> > >>>
> > >>> I am interested in the idea of client’s leader epoch in this context,
> > but
> > >>> I don’t have
> > >>> an immediate plan for how best to do this, and it would take another
> > KIP
> > >>> to enhance
> > >>> existing metrics or introduce some new ones. Those would then
> naturally
> > >> be
> > >>> applicable to the metrics push introduced in KIP-714.
> > >>>
> > >>> In a similar vein, there are no existing client metrics specifically
> > for
> > >>> auto-commit.
> > >>> We could add them to Kafka, but I really think this is just an
> example
> > of
> > >>> asynchronous
> > >>> commit in which the application has decided not to specify when the
> > >> commit
> > >>> should
> > >>> begin.
> > >>>
> > >>> It is possible to increase the cadence of pushing by modifying the
> > >>> interval.ms
> > >>> configuration property of the CLIENT_METRICS resource.
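[Editor's note] The `interval.ms` adjustment Andrew describes can be sketched via the Admin client. This is a hedged illustration, not part of the thread: it assumes a Kafka version that ships KIP-714 support (`ConfigResource.Type.CLIENT_METRICS`), `"sub-1"` is a hypothetical subscription name, and a running broker at `localhost:9092` is required.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ClientMetricsIntervalSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // "sub-1" is a hypothetical client-metrics subscription name.
            ConfigResource resource =
                new ConfigResource(ConfigResource.Type.CLIENT_METRICS, "sub-1");
            // Push metrics every 15 seconds instead of the default cadence.
            AlterConfigOp setInterval = new AlterConfigOp(
                new ConfigEntry("interval.ms", "15000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(resource, List.of(setInterval)))
                 .all().get();
        }
    }
}
```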
> > >>>
> > >>> There is an “assigned-partitions” metric for each consumer, but not
> one
> > >> for
> > >>> active partitions. We could add one, again as a follow-on KIP.
> > >>>
> > >>> I take your point about holding on to a connection in a channel which
> > >> might
> > >>> experience congestion. Do you have a suggestion for how to improve on
> > >> this?
> > >>> For example, the client does have the concept of a least-loaded node.
> > >> Maybe
> > >>> this is something we should investigate in the implementation and
> > decide
> > >>> on the
> > >>> best approach. In general, I think sticking with the same node for
> > >>> consecutive
> > >>> pushes is best, but if you choose the “wrong” node to start with,
> it’s
> > >> not
> > >>> ideal.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> >  On 8 Sep 2023, at 19:29, Philip Nee  wrote:
> > 
> >  Hey Andrew -
> > 
> >  +1 but I don't have a binding vote!
> > 
> >  It took me a while to go through the KIP. Here are some of my notes
> > >>> during
> >  the reading:
> > 
> >  *Metrics*
> >  - Should we care about the client's leader epoch? There is a case
> > where
> > >>> the
> >  user recreates the topic, but the consumer thinks it is still the
> same
> >  topic and therefore, attempts to start from an offset that doesn't
> > >> exist.
> >  KIP-848 addresses this issue, but I can still see some potential
> > >> benefits
> >  from knowing the client's epoch information.
> >  - I assume poll idle is similar to poll interval: I needed to read
> the
> >  description a few times.
> >  - I don't have a clear use case in mind for the commit latency, but
> I
> > >> do
> >  

[jira] [Created] (KAFKA-15606) Verify & refactor correctness of FetcherTest.testCompletedFetchRemoval()

2023-10-13 Thread Kirk True (Jira)
Kirk True created KAFKA-15606:
-

 Summary: Verify & refactor correctness of 
FetcherTest.testCompletedFetchRemoval()
 Key: KAFKA-15606
 URL: https://issues.apache.org/jira/browse/KAFKA-15606
 Project: Kafka
  Issue Type: Improvement
  Components: clients, consumer
Reporter: Kirk True
Assignee: Kirk True


As part of the review for #14406, [~junrao] made these comments on the 
FetcherTest.testCompletedFetchRemoval() test:
{quote}Hmm, why don't we return records from other partitions since maxRecords 
is maxInt?
{quote}
and:
{quote}Is [the check for the `fetchedRecord` size of 3] redundant given the 
test [two lines above]?
{quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2286

2023-10-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 210073 lines...]

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldDrainRestoredActiveTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemovePausedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemovePausedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemoveTasksFromAndClearInputQueueOnShutdown() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemoveTasksFromAndClearInputQueueOnShutdown() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfAddingActiveTasksWithSameId() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfAddingActiveTasksWithSameId() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > 
shouldRestoreActiveStatefulTasksAndUpdateStandbyTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > 
shouldRestoreActiveStatefulTasksAndUpdateStandbyTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldPauseStandbyTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldPauseStandbyTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfStatefulTaskNotInStateRestoring() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfStatefulTaskNotInStateRestoring() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetUncaughtStreamsException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetUncaughtStreamsException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskTimeoutOnProcessed() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskTimeoutOnProcessed() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskReleaseFutureOnShutdown() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskReleaseFutureOnShutdown() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldProcessTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldProcessTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateStreamTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateStreamTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldShutdownTaskExecutor() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldShutdownTaskExecutor() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldAwaitProcessableTasksIfNoneAssignable() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldAwaitProcessableTasksIfNoneAssignable() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > 
shouldRespectPunctuationDisabledByTaskExecutionMetadata() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > 
shouldRespectPunctuationDisabledByTaskExecutionMetadata() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateSystemTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateSystemTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenNotProgressing() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > 

Re: [DISCUSS] KIP-714: Client metrics and observability

2023-10-13 Thread Andrew Schofield
Hi Jun,
Thanks for the clarifications.

131. The client instance ids returned from 
KafkaStreams.clientInstanceIds(Duration) correspond to the
client_instance_id labels added by the broker to the metrics pushed from the 
clients. This should be
sufficient information to enable correlation between the metrics available in 
the client, and the metrics
pushed to the broker.

132. Yes, I see. I used JMX to look at the metrics on my broker and you’re 
entirely right. I will
remove the redundant metric from the KIP.

Thanks,
Andrew

> On 12 Oct 2023, at 20:12, Jun Rao  wrote:
>
> Hi, Andrew,
>
> Thanks for the reply.
>
> 131. Could we also document how one could correlate each client instance in
> KStreams with the labels for the metrics received by the brokers?
>
> 132. The documentation for RequestsPerSec is not complete. If you trace
> through how
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/RequestChannel.scala#L71
> 
> is
> implemented, it includes every API key tagged with the corresponding
> listener.
>
> Jun
>
> On Thu, Oct 12, 2023 at 11:42 AM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Jun,
>> Thanks for your comments.
>>
>> 130. As Matthias described, and I am adding to the KIP, the
>> `KafkaStreams#clientInstanceIds` method
>> is only permitted when the state is RUNNING or REBALANCING. Also, clients
>> can be added dynamically
>> so the maps might change over time. If it’s in a permitted state, the
>> method is prepared to wait up to the
>> supplied timeout to get the client instance ids. It does not return a
>> partial result - it returns a result or
>> fails.
>>
>> 131. I’ve refactored the `ClientsInstanceIds` object and the global
>> consumer is now part of the map
>> of consumers. There is no need for the Optional any longer. I’ve also
>> renamed it `ClientInstanceIds`.
>>
>> 132. My reading of
>> `(kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*)` is that
>> it does not support every request type - it supports Produce,
>> FetchConsumer and FetchFollower.
>> Consequently, I think the ClientMetricsSubscriptionRequestCount is not
>> instantly obsolete.
>>
>> If I’ve misunderstood, please let me know.
>>
>> Thanks,
>> Andrew
>>
>>
>>> On 12 Oct 2023, at 01:07, Jun Rao  wrote:
>>>
>>> Hi, Andrew,
>>>
>>> Thanks for the updated KIP. Just a few more minor comments.
>>>
>>> 130. KafkaStreams.clientsInstanceId(Duration timeout): Does it wait for
>> all
>>> consumer/producer/adminClient instances to be initialized? Are all those
>>> instances created during KafkaStreams initialization?
>>>
>>> 131. Why does globalConsumerInstanceId() return Optional while
>>> other consumer instances don't return Optional?
>>>
>>> 132. ClientMetricsSubscriptionRequestCount: Do we need this since we
>> have a
>>> set of generic metrics
>>> (kafka.network:type=RequestMetrics,name=RequestsPerSec,request=*) that
>>> report Request rate for every request type?
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Wed, Oct 11, 2023 at 1:47 PM Matthias J. Sax 
>> wrote:
>>>
 Thanks!

 On 10/10/23 11:31 PM, Andrew Schofield wrote:
> Matthias,
> Yes, I think that’s a sensible way forward and the interface you
>> propose
 looks good. I’ll update the KIP accordingly.
>
> Thanks,
> Andrew
>
>> On 10 Oct 2023, at 23:01, Matthias J. Sax  wrote:
>>
>> Andrew,
>>
>> yes I would like to get this change into KIP-714 right way. Seems to
>> be
 important, as we don't know if/when a follow-up KIP for Kafka Streams
>> would
 land.
>>
>> I was also thinking (and discussed with a few others) how to expose
>> it,
 and we would propose the following:
>>
>> We add a new method to `KafkaStreams` class:
>>
>>   public ClientsInstanceIds clientsInstanceIds(Duration timeout);
>>
>> The returned object is like below:
>>
>> public class ClientsInstanceIds {
>>   // we only have a single admin client per KS instance
>>   String adminInstanceId();
>>
>>   // we only have a single global consumer per KS instance (if any)
>>   // Optional<> because we might not have global-thread
>>   Optional<String> globalConsumerInstanceId();
>>
>>   // return a <threadKey, clientInstanceId> mapping
>>   // for the underlying (restore-)consumers/producers
>>   Map<String, String> mainConsumerInstanceIds();
>>   Map<String, String> restoreConsumerInstanceIds();
>>   Map<String, String> producerInstanceIds();
>> }
>>
>> For the `threadKey`, we would use some pattern like this:
>>
>> [Stream|StateUpdater]Thread-
>>
>>
>> Would this work from your POV?
>>
>>
>>
>> -Matthias
>>
>>
>> On 10/9/23 2:15 AM, Andrew Schofield wrote:
>>> Hi Matthias,
>>> Good point. Makes sense to me.
>>> Is this something that can also be included in the proposed Kafka
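
[Editor's note] With the generics restored, the accessor object proposed earlier in this thread compiles roughly as below. This is a sketch of the in-thread proposal only (the final KIP-714 / Kafka Streams API may differ), with a toy implementation to show the shape of the returned data; the thread key and instance ids are made-up values.

```java
import java.util.Map;
import java.util.Optional;

public class ClientsInstanceIdsDemo {

    // Sketch of the proposed accessor; names follow the email above.
    interface ClientsInstanceIds {
        String adminInstanceId();
        Optional<String> globalConsumerInstanceId();
        Map<String, String> mainConsumerInstanceIds();
        Map<String, String> restoreConsumerInstanceIds();
        Map<String, String> producerInstanceIds();
    }

    public static void main(String[] args) {
        // Toy implementation; "StreamThread-1" and the ids are illustrative.
        ClientsInstanceIds ids = new ClientsInstanceIds() {
            public String adminInstanceId() { return "admin-1"; }
            public Optional<String> globalConsumerInstanceId() {
                return Optional.empty(); // no global thread in this example
            }
            public Map<String, String> mainConsumerInstanceIds() {
                return Map.of("StreamThread-1", "consumer-abc");
            }
            public Map<String, String> restoreConsumerInstanceIds() { return Map.of(); }
            public Map<String, String> producerInstanceIds() { return Map.of(); }
        };
        System.out.println(ids.adminInstanceId());
        System.out.println(ids.mainConsumerInstanceIds().get("StreamThread-1"));
    }
}
```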
 

[jira] [Created] (KAFKA-15605) Topic marked for deletion are incorrectly migrated to KRaft

2023-10-13 Thread David Arthur (Jira)
David Arthur created KAFKA-15605:


 Summary: Topic marked for deletion are incorrectly migrated to 
KRaft
 Key: KAFKA-15605
 URL: https://issues.apache.org/jira/browse/KAFKA-15605
 Project: Kafka
  Issue Type: Bug
  Components: controller, kraft
Affects Versions: 3.6.0
Reporter: David Arthur
 Fix For: 3.6.1


When migrating topics from ZooKeeper, the KRaft controller reads all the topic 
and partition metadata from ZK directly. This includes topics which have been 
marked for deletion by the ZK controller. 

Since the client request to delete these topics has already been returned as 
successful, it would be confusing to the client that the topic still existed. 
An operator or application would need to issue another topic deletion to remove 
these topics once the controller had moved to KRaft. If they tried to create a 
new topic with the same name, they would receive a TOPIC_ALREADY_EXISTS error.

The migration logic should carry over pending topic deletions and resolve them 
either as part of the migration or shortly after.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR Add note about KAFKA-15552 to 3.6 upgrade section [kafka-site]

2023-10-13 Thread via GitHub


ijuma commented on code in PR #560:
URL: https://github.com/apache/kafka-site/pull/560#discussion_r1358724223


##
36/upgrade.html:
##
@@ -84,7 +84,9 @@ Upgrading KRaft-based cl
 
 Notable 
changes in 3.6.0
 
-Apache Kafka now supports having both an IPv4 and an IPv6 listener 
on the same port. This change only applies to
+ZooKeeper to KRaft migrations are now recommended for production 
usage. One significant issue was found in the
+ 3.6.0 release which affects transactional producers 
https://issues.apache.org/jira/browse/KAFKA-15552.

Review Comment:
   And can we be clearer about what we're recommending? Are we saying that they 
should wait for 3.6.1 if they use idempotent producers (the default since 3.0) 
or transactions?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2285

2023-10-13 Thread Apache Jenkins Server
See 




Re: [PR] MINOR Add note about KAFKA-15552 to 3.6 upgrade section [kafka-site]

2023-10-13 Thread via GitHub


jolshan commented on code in PR #560:
URL: https://github.com/apache/kafka-site/pull/560#discussion_r1358710418


##
36/upgrade.html:
##
@@ -84,7 +84,9 @@ Upgrading KRaft-based cl
 
 Notable 
changes in 3.6.0
 
-Apache Kafka now supports having both an IPv4 and an IPv6 listener 
on the same port. This change only applies to
+ZooKeeper to KRaft migrations are now recommended for production 
usage. One significant issue was found in the
+ 3.6.0 release which affects transactional producers 
https://issues.apache.org/jira/browse/KAFKA-15552.

Review Comment:
   we should clarify this is idempotent producers as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [DISCUSS] KIP-983: Full speed async processing during rebalance

2023-10-13 Thread David Jacot
Hi Erik,

Thanks for the KIP. I haven’t fully read the KIP yet but I agree with the
weaknesses that you point out in it. I will continue to read it.

For your information, we are working full speed on implementing KIP-848
while also changing the internal threading model of consumer. Those changes
are already extremely large so I would rather prefer to complete them
before adding more on top of them. Moreover, I think that this KIP should
build on top of KIP-848 now. Would you agree with this?


Best,
David

Le ven. 13 oct. 2023 à 20:44, Erik van Oosten 
a écrit :

> Thanks Philip,
>
> No worries, I am not in a hurry. Knowing this is not forgotten is enough
> for me. If there is anything I can do to help the process please let me
> know.
>
> Kind regards,
>  Erik.
>
>
> Op 13-10-2023 om 20:29 schreef Philip Nee:
> > Hi Erik,
> >
> > Sorry for the delay, I have not finished reviewing the KIP, but I also
> have
> > not forgotten about it!
> >
> > In general, KIP review process can be lengthy, so I think mailing list is
> > the best bet to get the committer's attention.
> >
> > P
> >
> > On Fri, Oct 13, 2023 at 10:55 AM Erik van Oosten
> >  wrote:
> >
> >> Hi client developers,
> >>
> >> The text is updated so that it is more clear that you can only use
> >> auto-commit when doing synchronous processing (approach 1). I am
> >> assuming that auto-commit commits whatever was consumed in the previous
> >> poll.
> >>
> >> I am wondering why this KIP doesn't get more attention. Is async
> >> processing not something that the kafka client wants to support?
> >>
> >> Kind regards,
> >>   Erik.
> >>
> >>
> >> Op 25-09-2023 om 18:17 schreef Erik van Oosten:
> >>> Hi Viktor,
> >>>
> >>> Good questions!
> >>>
> >>> 1. Auto-commits would only work with approach 1 in the KIP. Any async
> >>> solution is incompatible with auto-commits. Do you think the text will
> >>> improve when this is mentioned?
> >>>
> >>> 2. That is entirely correct. If you use async commits you can await
> >>> completion by doing a single sync commit with an empty offsets Map
> >>> (this will work as of Kafka 3.6.0).
> >>>
> >>> Is there anything I can do to make the text clearer?
> >>>
> >>> Kind regards,
> >>>  Erik.
> >>>
> >>>
> >>> Op 25-09-2023 om 17:04 schreef Viktor Somogyi-Vass:
>  Hi Erik,
> 
>  I'm still trying to wrap my head around the KIP, however I have a few
>  questions that weren't clear to me regarding offset commits:
>  1. Would auto-commits interfere with the behavior defined in your KIP
> or
>  would it work the same as manual commits?
>  2. As I see you don't separate offset commits by whether they're sync
> or
>  async. For sync commits timing isn't really a problem but how would
> you
>  change work in case of async offset commits? There can be a few
> caveats
>  there as you may not know whether a commit is finished or not until
> your
>  callback is called.
> 
>  Thanks,
>  Viktor
> 
>  On Sat, Sep 23, 2023 at 4:00 PM Erik van Oosten
>   wrote:
> 
> > Hi all,
> >
> > I would like to start the discussion on KIP-983: Full speed async
> > processing during rebalance [1].
> >
> > The idea is that we can prevent the drop in throughput during a
> > cooperative rebalance.
> >
> > I am curious to your ideas and comments.
> >
> > Kind regards,
> >Erik.
> >
> > [1]
> >
> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-983%3A+Full+speed+async+processing+during+rebalance
> >> --
> >> Erik van Oosten
> >> e.vanoos...@grons.nl
> >> https://day-to-day-stuff.blogspot.com
> >>
> >>
> --
> Erik van Oosten
> e.vanoos...@grons.nl
> https://day-to-day-stuff.blogspot.com
>
>


[PR] Add note about KAFKA-15552 to 3.6 upgrade section [kafka-site]

2023-10-13 Thread via GitHub


mumrah opened a new pull request, #560:
URL: https://github.com/apache/kafka-site/pull/560

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [DISCUSS] KIP-983: Full speed async processing during rebalance

2023-10-13 Thread Erik van Oosten

Thanks Philip,

No worries, I am not in a hurry. Knowing this is not forgotten is enough 
for me. If there is anything I can do to help the process please let me 
know.


Kind regards,
    Erik.


Op 13-10-2023 om 20:29 schreef Philip Nee:

Hi Erik,

Sorry for the delay, I have not finished reviewing the KIP, but I also have
not forgotten about it!

In general, KIP review process can be lengthy, so I think mailing list is
the best bet to get the committer's attention.

P

On Fri, Oct 13, 2023 at 10:55 AM Erik van Oosten
 wrote:


Hi client developers,

The text is updated so that it is more clear that you can only use
auto-commit when doing synchronous processing (approach 1). I am
assuming that auto-commit commits whatever was consumed in the previous
poll.

I am wondering why this KIP doesn't get more attention. Is async
processing not something that the kafka client wants to support?

Kind regards,
  Erik.


Op 25-09-2023 om 18:17 schreef Erik van Oosten:

Hi Viktor,

Good questions!

1. Auto-commits would only work with approach 1 in the KIP. Any async
solution is incompatible with auto-commits. Do you think the text will
improve when this is mentioned?

2. That is entirely correct. If you use async commits you can await
completion by doing a single sync commit with an empty offsets Map
(this will work as of Kafka 3.6.0).

Is there anything I can do to make the text clearer?

Kind regards,
 Erik.


Op 25-09-2023 om 17:04 schreef Viktor Somogyi-Vass:

Hi Erik,

I'm still trying to wrap my head around the KIP, however I have a few
questions that weren't clear to me regarding offset commits:
1. Would auto-commits interfere with the behavior defined in your KIP or
would it work the same as manual commits?
2. As I see you don't separate offset commits by whether they're sync or
async. For sync commits timing isn't really a problem but how would you
change work in case of async offset commits? There can be a few caveats
there as you may not know whether a commit is finished or not until your
callback is called.

Thanks,
Viktor

On Sat, Sep 23, 2023 at 4:00 PM Erik van Oosten
 wrote:


Hi all,

I would like to start the discussion on KIP-983: Full speed async
processing during rebalance [1].

The idea is that we can prevent the drop in throughput during a
cooperative rebalance.

I am curious to your ideas and comments.

Kind regards,
   Erik.

[1]



https://cwiki.apache.org/confluence/display/KAFKA/KIP-983%3A+Full+speed+async+processing+during+rebalance
--
Erik van Oosten
e.vanoos...@grons.nl
https://day-to-day-stuff.blogspot.com



--
Erik van Oosten
e.vanoos...@grons.nl
https://day-to-day-stuff.blogspot.com



Re: [DISCUSS] KIP-983: Full speed async processing during rebalance

2023-10-13 Thread Philip Nee
Hi Erik,

Sorry for the delay, I have not finished reviewing the KIP, but I also have
not forgotten about it!

In general, KIP review process can be lengthy, so I think mailing list is
the best bet to get the committer's attention.

P

On Fri, Oct 13, 2023 at 10:55 AM Erik van Oosten
 wrote:

> Hi client developers,
>
> The text is updated so that it is more clear that you can only use
> auto-commit when doing synchronous processing (approach 1). I am
> assuming that auto-commit commits whatever was consumed in the previous
> poll.
>
> I am wondering why this KIP doesn't get more attention. Is async
> processing not something that the kafka client wants to support?
>
> Kind regards,
>  Erik.
>
>
> Op 25-09-2023 om 18:17 schreef Erik van Oosten:
> > Hi Viktor,
> >
> > Good questions!
> >
> > 1. Auto-commits would only work with approach 1 in the KIP. Any async
> > solution is incompatible with auto-commits. Do you think the text will
> > improve when this is mentioned?
> >
> > 2. That is entirely correct. If you use async commits you can await
> > completion by doing a single sync commit with an empty offsets Map
> > (this will work as of Kafka 3.6.0).
> >
> > Is there anything I can do to make the text clearer?
> >
> > Kind regards,
> > Erik.
> >
> >
> > Op 25-09-2023 om 17:04 schreef Viktor Somogyi-Vass:
> >> Hi Erik,
> >>
> >> I'm still trying to wrap my head around the KIP, however I have a few
> >> questions that weren't clear to me regarding offset commits:
> >> 1. Would auto-commits interfere with the behavior defined in your KIP or
> >> would it work the same as manual commits?
> >> 2. As I see you don't separate offset commits by whether they're sync or
> >> async. For sync commits timing isn't really a problem but how would you
> >> change work in case of async offset commits? There can be a few caveats
> >> there as you may not know whether a commit is finished or not until your
> >> callback is called.
> >>
> >> Thanks,
> >> Viktor
> >>
> >> On Sat, Sep 23, 2023 at 4:00 PM Erik van Oosten
> >>  wrote:
> >>
> >>> Hi all,
> >>>
> >>> I would like to start the discussion on KIP-983: Full speed async
> >>> processing during rebalance [1].
> >>>
> >>> The idea is that we can prevent the drop in throughput during a
> >>> cooperative rebalance.
> >>>
> >>> I am curious to your ideas and comments.
> >>>
> >>> Kind regards,
> >>>   Erik.
> >>>
> >>> [1]
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-983%3A+Full+speed+async+processing+during+rebalance
> >>>
> >
> --
> Erik van Oosten
> e.vanoos...@grons.nl
> https://day-to-day-stuff.blogspot.com
>
>


Re: [DISCUSS] KIP-983: Full speed async processing during rebalance

2023-10-13 Thread Erik van Oosten

Hi client developers,

The text is updated so that it is more clear that you can only use 
auto-commit when doing synchronous processing (approach 1). I am 
assuming that auto-commit commits whatever was consumed in the previous 
poll.


I am wondering why this KIP doesn't get more attention. Is async 
processing not something that the kafka client wants to support?


Kind regards,
    Erik.


Op 25-09-2023 om 18:17 schreef Erik van Oosten:

Hi Viktor,

Good questions!

1. Auto-commits would only work with approach 1 in the KIP. Any async 
solution is incompatible with auto-commits. Do you think the text will 
improve when this is mentioned?


2. That is entirely correct. If you use async commits you can await 
completion by doing a single sync commit with an empty offsets Map 
(this will work as of Kafka 3.6.0).


Is there anything I can do to make the text clearer?

Kind regards,
    Erik.


Op 25-09-2023 om 17:04 schreef Viktor Somogyi-Vass:

Hi Erik,

I'm still trying to wrap my head around the KIP, however I have a few
questions that weren't clear to me regarding offset commits:
1. Would auto-commits interfere with the behavior defined in your KIP or
would it work the same as manual commits?
2. As I see you don't separate offset commits by whether they're sync or
async. For sync commits timing isn't really a problem but how would you
change work in case of async offset commits? There can be a few caveats
there as you may not know whether a commit is finished or not until your
callback is called.

Thanks,
Viktor

On Sat, Sep 23, 2023 at 4:00 PM Erik van Oosten
 wrote:


Hi all,

I would like to start the discussion on KIP-983: Full speed async
processing during rebalance [1].

The idea is that we can prevent the drop in throughput during a
cooperative rebalance.

I am curious to your ideas and comments.

Kind regards,
  Erik.

[1]

https://cwiki.apache.org/confluence/display/KAFKA/KIP-983%3A+Full+speed+async+processing+during+rebalance 




--
Erik van Oosten
e.vanoos...@grons.nl
https://day-to-day-stuff.blogspot.com
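
[Editor's note] The barrier Erik describes (fire async commits, then one synchronous commit with an empty offsets map to await them) can be sketched as below. This assumes the Java consumer from kafka-clients 3.6.0+, where, per the thread above, `commitSync` with an empty map waits for in-flight async commits; the method and variable names are illustrative, and a constructed, subscribed consumer is assumed.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitBarrierSketch {
    // Commit asynchronously while processing continues, then drain the
    // in-flight commits before e.g. closing the consumer or finishing a
    // rebalance callback.
    static void commitAndAwait(Consumer<byte[], byte[]> consumer,
                               Map<TopicPartition, OffsetAndMetadata> offsets) {
        // Fire-and-forget commit; the callback only logs failures.
        consumer.commitAsync(offsets, (committedOffsets, exception) -> {
            if (exception != null) {
                System.err.println("async commit failed: " + exception);
            }
        });
        // An empty offsets map commits nothing by itself but, as of
        // kafka-clients 3.6.0 (per the thread above), blocks until all
        // in-flight async commits have completed -- a commit barrier.
        consumer.commitSync(Collections.emptyMap());
    }
}
```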



Re: Requesting permission for contributions

2023-10-13 Thread Apoorv Mittal
Thank you Matthias.

Regards,
Apoorv Mittal


On Fri, Oct 13, 2023 at 5:55 PM Matthias J. Sax  wrote:

> Done. You should be all set.
>
> -Matthias
>
> On 10/13/23 8:21 AM, Apoorv Mittal wrote:
> > Hi,
> > Can I please get permission to contribute KIP and assign Jiras to myself.
> >
> > Wiki and Jira Id: apoorvmittal10
> > Email: apoorvmitta...@gmail.com
> >
> > Regards,
> > Apoorv Mittal
> > +44 7721681581
> >
>


Re: Requesting permission for contributions

2023-10-13 Thread Matthias J. Sax

Done. You should be all set.

-Matthias

On 10/13/23 8:21 AM, Apoorv Mittal wrote:

Hi,
Can I please get permission to contribute KIP and assign Jiras to myself.

Wiki and Jira Id: apoorvmittal10
Email: apoorvmitta...@gmail.com

Regards,
Apoorv Mittal
+44 7721681581



[jira] [Created] (KAFKA-15604) Add Telemetry RPCs Definitions

2023-10-13 Thread Apoorv Mittal (Jira)
Apoorv Mittal created KAFKA-15604:
-

 Summary: Add Telemetry RPCs Definitions
 Key: KAFKA-15604
 URL: https://issues.apache.org/jira/browse/KAFKA-15604
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apoorv Mittal


The goal of this task is to define the Telemetry RPCs as described 
[here|https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability#KIP714:Clientmetricsandobservability-KafkaProtocolChanges];
 

 

This requires the following steps:
 # The request/response schemas must be defined (json schema)
 # Request/response classes must be defined



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] 3.5.2 Release

2023-10-13 Thread Matthias J. Sax
Thanks -- there are a few fixes for Kafka Streams we are considering
cherry-picking into the 3.5.2 release -- can you give us a few more days
for this?


-Matthias

On 10/12/23 6:20 PM, Sophie Blee-Goldman wrote:

Thanks for volunteering Luke!

On Thu, Oct 12, 2023 at 2:55 AM Levani Kokhreidze 
wrote:


Hi Divij,

Thanks for the explanation, makes sense.

Hi Luke, thanks you! It would be awesome to see 3.5.2 out.

Best,
Levani


On 12. Oct 2023, at 12:39, Luke Chen  wrote:

Hi Levani and Divij,

I can work on the 3.5.2 release.
I'll start a new thread for volunteering it maybe next week.

Thanks.
Luke

On Thu, Oct 12, 2023 at 5:07 PM Divij Vaidya 
wrote:


Hello Levani

From a process perspective, there is no fixed schedule for bug fix
releases. If we have a volunteer for release manager (must be a
committer), they can start the bug fix release process (with the approval
of the PMC).

My personal opinion is that it's too early to start 3.6.1 and we should
wait at least 1 months to hear feedback on 3.6.0. We need to make a

careful

balance between getting the critical fixes in the hands of users as soon
as possible vs. spending community effort towards releases (the effort

that

could be used to make Kafka better, feature-wise & operational
stability-wise, otherwise).

For 3.5.2, I think there are sufficient pending (including some CVE

fixes)

to start a bug fix release. We just need a volunteer for the release
manager.

--
Divij Vaidya



On Thu, Oct 12, 2023 at 9:57 AM Levani Kokhreidze 
wrote:


Hello,

KAFKA-15571 [1] was merged and backported to the 3.5 and 3.6 branches.
The bug fixes a feature that was added in 3.5. Considering the feature
doesn't work as expected without the fix, I would like to know if it's
reasonable to start the 3.5.2 release. Of course, releasing such a
massive project like Kafka is not a trivial task, and I am looking for
the community's input on whether it's reasonable to start the 3.5.2
release process.

Best,
Levani

[1] - https://issues.apache.org/jira/browse/KAFKA-15571









Re: [PR] MINOR: fix Kafka Streams metric names [kafka-site]

2023-10-13 Thread via GitHub


mjsax merged PR #558:
URL: https://github.com/apache/kafka-site/pull/558


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [DISCUSS] KIP-985 Add reverseRange and reverseAll query over kv-store in IQv2

2023-10-13 Thread Hanyu (Peter) Zheng
Thank you, Matthias and Alieh,

I've initiated a vote.

Sincerely,
Hanyu

On Fri, Oct 13, 2023 at 9:14 AM Matthias J. Sax  wrote:

> Thanks for pointing this out Alieh! I totally missed this.
>
> So I guess everything is settled and Hanyu can start a VOTE?
>
> For the KIP PR, we should ensure to update the JavaDocs to avoid
> confusion in the future.
>
>
> -Matthias
>
> On 10/12/23 12:21 PM, Alieh Saeedi wrote:
> > Hi,
> > just pointing to javadocs for range() and reverseRange():
> >
> > range(): *@return The iterator for this range, from smallest to largest
> > bytes.*
> > reverseRange(): * @return The reverse iterator for this range, from
> largest
> > to smallest key bytes.
> >
> > Cheers,
> > Alieh
> >
> >
> > On Thu, Oct 12, 2023 at 7:32 AM Matthias J. Sax 
> wrote:
> >
> >> Quick addendum.
> >>
> >> Some DSL operators use `range` and seem to rely on ascending order,
> >> too... Of course, if we say `range()` has no ordering guarantee, we
> >> would add `forwardRange()` and let the DSL use `forwardRange`, too.
> >>
> >> The discussion of course also applies to `all()` and `reverseAll()`, and
> >> I assume also `prefixScan()` (even if there is no "reverse" version
> >> for it).
> >>
> >>
> >> On 10/11/23 10:22 PM, Matthias J. Sax wrote:
> >>> Thanks for raising this question Hanyu. Great find!
> >>>
> >>> My interpretation is as follows (it's actually a warning signal that
> the
> >>> API contract is not better defined, and we should fix this by extending
> >>> JavaDocs and docs on the web page about it).
> >>>
> >>> We have existing `range()` and `reverseRange()` methods on
> >>> `ReadOnlyKeyValueStore` -- the interface itself is not typed (ie, just
> >>> generics), and we state that we don't guarantee "logical order" because
> >>> underlying stores are based on `byte[]` type. So far so... well.
> >>>
> >>> However, to make matters worse, we are also not explicit if the
> >>> underlying store implementation *must* return keys in
> >>> byte[]-lexicographical order or not...
> >>>
> >>> For `range()`, I would be kinda willing to accept that there is no
> >>> ordering guarantee at all -- for example, if the underlying
> byte[]-store
> >>> is hash-based and implements a full scan to answer a `range()` it might
> >>> not be efficient, but also not incorrect if keys are returned in
> some
> >>> "random" (byte[]-)order. In isolation, I don't see an API contract
> >>> violation.
> >>>
> >>> However, `reverseRange` implicitly states with its name, that some
> >>> "descending order" (base on keys) is expected. Given the JavaDoc
> comment
> >>> about "logical" vs "byte[]" order, the contract (at least to me) is
> >>> clear: returns records in descending byte[]-lexicographical order. --
> >>> Any other interpretation seems to be off? Curious to hear if you agree
> >>> or disagree to this interpretation?
> >>>
> >>>
> >>>
> >>> If this is correct, it means we are actually lacking an API contract for
> >>> ascending byte[]-lexicographical range scan. Furthermore, a hash-based
> >>> byte[]-store would need to actually explicitly sort its results for
> >>> `reverseRange` to not violate the contract.
> >>>
> >>> To me, this raises the question if `range()` actually has a
> >>> (non-explicit) contract about returning data in byte[]-lexicographical
> >>> order? It seems a lot of people rely on this, and our default stores
> >>> actually implement it this way. So if we don't look at `range()` in
> >>> isolation, but look at the `ReadOnlyKeyValueStore` interface
> >>> holistically, I would also buy the argument that `range()` implies
> >>> "ascending "byte[]-lexicographical order". Thoughts?
> >>>
> >>> To be frank: to me, it's pretty clear that the original idea to add
> >>> `range()` was to return data in ascending order.
> >>>
> >>>
> >>> Question 1:
> >>>- Do we believe that the range() contract is ascending
> >>> byte[]-lexicographical order right now?
> >>>
> >>>  If yes, I would propose to make it explicit in the JavaDocs.
> >>>
> >>>  If no, I would also propose to make it explicit in the JavaDocs.
> In
> >>> addition, it raises the question if a method `forwardRange()` (for the
> >>> lack of a better idea about a name right now) is actually missing to
> >>> provide such a contract?
> >>>
> >>>
> >>> Of course, we always depend on the serialization format for order, and
> >>> if users need "logical order" they need to ensure to use a
> serialization
> >>> format that aligns byte[]-lexicographical order to logical order. But
> for
> >>> the scope of this work, I would not even try to open this can of
> worms...
> >>>
> >>>
> >>>
> >>>
> >>> Looking into `RangeQuery` the JavaDocs don't say anything about order.
> >>> Thus, `RangeQuery#range()` could actually also be implemented by
> calling
> >>> `reverseRange()` without violating the contract as it seems. A
> hash-based
> >>> store could also implement it, without the need to explicitly sort...
> >>>
> >>> What brings me back to my original thought 

Re: [VOTE] KIP-985 Add reverseRange and reverseAll query over kv-store in IQv2

2023-10-13 Thread Hanyu (Peter) Zheng
Hello everyone,

I would like to start a vote for KIP-985, which adds reverseRange and
reverseAll query over kv-store in IQv2.

Sincerely,
Hanyu

On Fri, Oct 13, 2023 at 9:15 AM Hanyu (Peter) Zheng 
wrote:

>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-985:+Add+reverseRange+and+reverseAll+query+over+kv-store+in+IQv2
>
> --
>
> Hanyu (Peter) Zheng he/him/his
> Software Engineer Intern
> +1 (213) 431-7193
>


-- 

Hanyu (Peter) Zheng he/him/his
Software Engineer Intern
+1 (213) 431-7193



[jira] [Created] (KAFKA-15603) kafka-dump-log --offsets-decoder should handle parsing errors

2023-10-13 Thread David Jacot (Jira)
David Jacot created KAFKA-15603:
---

 Summary: kafka-dump-log --offsets-decoder should handle parsing 
errors
 Key: KAFKA-15603
 URL: https://issues.apache.org/jira/browse/KAFKA-15603
 Project: Kafka
  Issue Type: Bug
Reporter: David Jacot


I needed to dump a __consumer_offsets partition so I used `kafka-dump-log 
--offsets-decoder --files ` to do it but I got the following error:

 
Exception in thread "main" 
org.apache.kafka.common.protocol.types.SchemaException: Buffer underflow while 
parsing consumer protocol's header
at 
org.apache.kafka.clients.consumer.internals.ConsumerProtocol.deserializeVersion(ConsumerProtocol.java:72)
at 
org.apache.kafka.clients.consumer.internals.ConsumerProtocol.deserializeAssignment(ConsumerProtocol.java:179)
at 
kafka.coordinator.group.GroupMetadataManager$.$anonfun$parseGroupMetadata$2(GroupMetadataManager.scala:1562)
at 
kafka.coordinator.group.GroupMetadataManager$.parseGroupMetadata(GroupMetadataManager.scala:1560)
at 
kafka.coordinator.group.GroupMetadataManager$.formatRecordKeyAndValue(GroupMetadataManager.scala:1526)
at 
kafka.tools.DumpLogSegments$OffsetsMessageParser.parse(DumpLogSegments.scala:416)
at 
kafka.tools.DumpLogSegments$.$anonfun$dumpLog$2(DumpLogSegments.scala:327)
at 
kafka.tools.DumpLogSegments$.$anonfun$dumpLog$2$adapted(DumpLogSegments.scala:285)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:575)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:573)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
at 
kafka.tools.DumpLogSegments$.$anonfun$dumpLog$1(DumpLogSegments.scala:285)
at 
kafka.tools.DumpLogSegments$.$anonfun$dumpLog$1$adapted(DumpLogSegments.scala:282)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:575)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:573)
at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
at kafka.tools.DumpLogSegments$.dumpLog(DumpLogSegments.scala:282)
at 
kafka.tools.DumpLogSegments$.$anonfun$main$1(DumpLogSegments.scala:70)
at kafka.tools.DumpLogSegments$.main(DumpLogSegments.scala:61)
at kafka.tools.DumpLogSegments.main(DumpLogSegments.scala)
Caused by: java.nio.BufferUnderflowException
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:707)
at java.base/java.nio.HeapByteBuffer.getShort(HeapByteBuffer.java:383)
at 
org.apache.kafka.clients.consumer.internals.ConsumerProtocol.deserializeVersion(ConsumerProtocol.java:70)
... 19 more
The issue is that it is never guaranteed that the metadata associated with each 
member in a consumer group really follows the format used by the official Apache 
Kafka Consumer. They may have another format and use the same protocol type.

I think that it would be better to handle those errors. Not being able to dump a 
segment is really annoying.
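
The defensive handling suggested here can be sketched as follows (illustrative Python, mirroring the failing `deserializeVersion` step; the helper names are made up, not the actual tool's API):

```python
import struct

def deserialize_version(payload: bytes) -> int:
    # Mirrors ConsumerProtocol.deserializeVersion: read a 16-bit version header.
    if len(payload) < 2:
        raise ValueError("Buffer underflow while parsing consumer protocol's header")
    return struct.unpack(">h", payload[:2])[0]

def format_member_metadata(payload: bytes) -> str:
    # Groups may use the "consumer" protocol type with a different metadata
    # format, so fall back to a hex dump instead of crashing the whole dump.
    try:
        return f"version={deserialize_version(payload)}"
    except ValueError:
        return f"<unparsable metadata: {payload.hex()}>"

print(format_member_metadata(b"\x00\x01\x00\x00"))  # version=1
print(format_member_metadata(b"\x00"))              # <unparsable metadata: 00>
```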



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is still unstable: Kafka » Kafka Branch Builder » trunk #2284

2023-10-13 Thread Apache Jenkins Server
See 




[VOTE] KIP-985 Add reverseRange and reverseAll query over kv-store in IQv2

2023-10-13 Thread Hanyu (Peter) Zheng
https://cwiki.apache.org/confluence/display/KAFKA/KIP-985:+Add+reverseRange+and+reverseAll+query+over+kv-store+in+IQv2

-- 

Hanyu (Peter) Zheng he/him/his
Software Engineer Intern
+1 (213) 431-7193



Re: [DISCUSS] KIP-985 Add reverseRange and reverseAll query over kv-store in IQv2

2023-10-13 Thread Matthias J. Sax

Thanks for pointing this out Alieh! I totally missed this.

So I guess everything is settled and Hanyu can start a VOTE?

For the KIP PR, we should ensure to update the JavaDocs to avoid 
confusion in the future.



-Matthias

On 10/12/23 12:21 PM, Alieh Saeedi wrote:

Hi,
just pointing to javadocs for range() and reverseRange():

range(): *@return The iterator for this range, from smallest to largest
bytes.*
reverseRange(): * @return The reverse iterator for this range, from largest
to smallest key bytes.

Cheers,
Alieh


On Thu, Oct 12, 2023 at 7:32 AM Matthias J. Sax  wrote:


Quick addendum.

Some DSL operators use `range` and seem to rely on ascending order,
too... Of course, if we say `range()` has no ordering guarantee, we
would add `forwardRange()` and let the DSL use `forwardRange`, too.

The discussion of course also applies to `all()` and `reverseAll()`, and
I assume also `prefixScan()` (even if there is no "reverse" version
for it).


On 10/11/23 10:22 PM, Matthias J. Sax wrote:

Thanks for raising this question Hanyu. Great find!

My interpretation is as follows (it's actually a warning signal that the
API contract is not better defined, and we should fix this by extending
JavaDocs and docs on the web page about it).

We have existing `range()` and `reverseRange()` methods on
`ReadOnlyKeyValueStore` -- the interface itself is not typed (ie, just
generics), and we state that we don't guarantee "logical order" because
underlying stores are based on `byte[]` type. So far so... well.

However, to make matters worse, we are also not explicit if the
underlying store implementation *must* return keys in
byte[]-lexicographical order or not...

For `range()`, I would be kinda willing to accept that there is no
ordering guarantee at all -- for example, if the underlying byte[]-store
is hash-based and implements a full scan to answer a `range()` it might
not be efficient, but also not incorrect if keys are returned in some
"random" (byte[]-)order. In isolation, I don't see an API contract
violation.

However, `reverseRange` implicitly states with its name, that some
"descending order" (base on keys) is expected. Given the JavaDoc comment
about "logical" vs "byte[]" order, the contract (at least to me) is
clear: returns records in descending byte[]-lexicographical order. --
Any other interpretation seems to be off? Curious to hear if you agree
or disagree to this interpretation?



If this is correct, it means we are actually lacking an API contract for
ascending byte[]-lexicographical range scan. Furthermore, a hash-based
byte[]-store would need to actually explicitly sort its results for
`reverseRange` to not violate the contract.

To me, this raises the question if `range()` actually has a
(non-explicit) contract about returning data in byte[]-lexicographical
order? It seems a lot of people rely on this, and our default stores
actually implement it this way. So if we don't look at `range()` in
isolation, but look at the `ReadOnlyKeyValueStore` interface
holistically, I would also buy the argument that `range()` implies
"ascending "byte[]-lexicographical order". Thoughts?

To be frank: to me, it's pretty clear that the original idea to add
`range()` was to return data in ascending order.


Question 1:
   - Do we believe that the range() contract is ascending
byte[]-lexicographical order right now?

 If yes, I would propose to make it explicit in the JavaDocs.

 If no, I would also propose to make it explicit in the JavaDocs. In
addition, it raises the question if a method `forwardRange()` (for the
lack of a better idea about a name right now) is actually missing to
provide such a contract?


Of course, we always depend on the serialization format for order, and
if users need "logical order" they need to ensure to use a serialization
format that aligns byte[]-lexicographical order to logical order. But for
the scope of this work, I would not even try to open this can of worms...
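
The serialization caveat above can be seen with a tiny sketch (plain Python, not Streams code): a fixed-width big-endian encoding keeps byte[]-lexicographical order aligned with logical order, while a decimal-string encoding does not.

```python
import struct

keys = [2, 10, 1]

# Decimal strings: byte order diverges from numeric order (b"10" < b"2").
as_strings = sorted(str(k).encode() for k in keys)
print(as_strings)  # [b'1', b'10', b'2']

# Fixed-width big-endian integers: byte order matches numeric order.
as_big_endian = sorted(struct.pack(">I", k) for k in keys)
print([struct.unpack(">I", b)[0] for b in as_big_endian])  # [1, 2, 10]
```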




Looking into `RangeQuery` the JavaDocs don't say anything about order.
Thus, `RangeQuery#range()` could actually also be implemented by calling
`reverseRange()` without violating the contract as it seems. A hash-based
store could also implement it, without the need to explicitly sort...

What brings me back to my original thought about having three types of
results for `Range`
   - no ordering guarantee
   - ascending (we would only give byte[]-lexicographical order)
   - descending (we would only give byte[]-lexicographical order)
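
To make the three options concrete, here is a minimal hypothetical sketch (Python, not the actual IQv2 API; `ResultOrder` and `range_scan` are made-up names) of a range scan whose result order is an explicit part of the query:

```python
from enum import Enum, auto

class ResultOrder(Enum):   # hypothetical name, not the KIP's final API
    ANY = auto()           # no ordering guarantee
    ASCENDING = auto()     # byte[]-lexicographical
    DESCENDING = auto()    # reverse byte[]-lexicographical

def range_scan(store: dict, lo: bytes, hi: bytes,
               order: ResultOrder = ResultOrder.ANY) -> list:
    # A hash-based store can serve ANY with a plain scan and only pays
    # the sorting cost when the query explicitly demands an order.
    keys = [k for k in store if lo <= k <= hi]
    if order is ResultOrder.ASCENDING:
        keys.sort()
    elif order is ResultOrder.DESCENDING:
        keys.sort(reverse=True)
    return [(k, store[k]) for k in keys]

store = {b"a": 1, b"c": 3, b"b": 2}
print(range_scan(store, b"a", b"c", ResultOrder.DESCENDING))
# [(b'c', 3), (b'b', 2), (b'a', 1)]
```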

Again, I actually believe that the original intent of RangeQuery was to
inherit the ascending order of `ReadOnlyKeyValueStore#range()`... Please
keep me honest about it.  On the other hand, both APIs seem to be
independent enough to not couple them... -- this could actually be a
step into the right direction and would follow the underlying idea of
IQv2 to begin with: decouple semantics for the store interfaces from the
query types and semantics...


OR: we actually say that 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Nick Telford
Hi Bruno,

4.
I'll hold off on making that change until we have a consensus as to what
configuration to use to control all of this, as it'll be affected by the
decision on EOS isolation levels.

5.
Done. I've chosen "committedOffsets".

Regards,
Nick

On Fri, 13 Oct 2023 at 16:23, Bruno Cadonna  wrote:

> Hi Nick,
>
> 1.
> Yeah, you are probably right that it does not make too much sense.
> Thanks for the clarification!
>
>
> 4.
> Yes, sorry for the back and forth, but I think for the sake of the KIP
> it is better to leave the ALOS behavior as it is for now due to the
> possible issues you would run into. Maybe we can find a solution in the
> future. Now the question returns to whether we really need
> default.state.isolation.level. Maybe the config could be the feature
> flag Sophie requested.
>
>
> 5.
> There is a guideline in Kafka not to use the get prefix for getters (at
> least in the public API). Thus, could you please rename
>
> getCommittedOffset(TopicPartition partition) ->
> committedOffsetFor(TopicPartition partition)
>
> You can also propose an alternative to committedOffsetFor().
>
>
> Best,
> Bruno
>
>
> On 10/13/23 3:21 PM, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for getting back to me.
> >
> > 1.
> > I think this should be possible. Are you thinking of the situation where
> a
> > user may downgrade to a previous version of Kafka Streams? In that case,
> > sadly, the RocksDBStore would get wiped by the older version of Kafka
> > Streams anyway, because that version wouldn't understand the extra column
> > family (that holds offsets), so the missing Position file would
> > automatically get rebuilt when the store is rebuilt from the changelog.
> > Are there other situations than downgrade where a transactional store
> could
> > be replaced by a non-transactional one? I can't think of any.
> >
> > 2.
> > Ahh yes, the Test Plan - my Kryptonite! This section definitely needs to
> be
> > fleshed out. I'll work on that. How much detail do you need?
> >
> > 3.
> > See my previous email discussing this.
> >
> > 4.
> > Hmm, this is an interesting point. Are you suggesting that under ALOS
> > READ_COMMITTED should not be supported?
> >
> > Regards,
> > Nick
> >
> > On Fri, 13 Oct 2023 at 13:52, Bruno Cadonna  wrote:
> >
> >> Hi Nick,
> >>
> >> I think the KIP is converging!
> >>
> >>
> >> 1.
> >> I am wondering whether it makes sense to write the position file during
> >> close as we do for the checkpoint file, so that in case the state store
> >> is replaced with a non-transactional state store the non-transactional
> >> state store finds the position file. I think, this is not strictly
> >> needed, but would be a nice behavior instead of just deleting the
> >> position file.
> >>
> >>
> >> 2.
> >> The test plan does not mention integration tests. Do you not need to
> >> extend existing ones and add new ones? Also for upgrading and
> >> downgrading you might need integration and/or system tests.
> >>
> >>
> >> 3.
> >> I think Sophie made a point. Although, IQ reading from uncommitted data
> >> under EOS might be considered a bug by some people. Thus, your KIP would
> >> fix a bug rather than changing the intended behavior. However, I also
> >> see that a feature flag would help users that rely on this buggy
> >> behavior (at least until AK 4.0).
> >>
> >>
> >> 4.
> >> This is related to the previous point. I assume that the difference
> >> between READ_COMMITTED and READ_UNCOMMITTED for ALOS is that in the
> >> former you enable transactions on the state store and in the latter you
> >> disable them. If my assumption is correct, I think that is an issue.
> >> Let's assume under ALOS Streams fails over a couple of times more or
> >> less at the same step in processing after value 3 is added to an
> >> aggregation but the offset of the corresponding input record was not
> >> committed. With transactions disabled, the aggregation value would
> >> increase by 3 for each failover. With transactions enabled, value 3
> >> would only be added to the aggregation once when the offset of the input
> >> record is committed and the transaction finally completes. So the
> >> content of the state store would change depending on the configuration
> >> for IQ. IMO, the content of the state store should be independent from
> >> IQ. Given this issue, I propose to not use transactions with ALOS at
> >> all. I was a big proponent of using transactions with ALOS, but I
> >> realized that transactions with ALOS is not as easy as enabling
> >> transactions on state stores. Another aspect that is problematic is that
> >> the changelog topic which actually replicates the state store is not
> >> transactional under ALOS. Thus, it might happen that the state store and
> >> the changelog differ in their content. All of this is maybe solvable
> >> somehow, but for the sake of this KIP, I would leave it for the future.
> >>
> >>
> >> Best,
> >> Bruno
> >>
> >>
> >>
> >> On 10/12/23 10:32 PM, Sophie Blee-Goldman wrote:

Requesting permission for contributions

2023-10-13 Thread Apoorv Mittal
Hi,
Can I please get permission to contribute KIPs and assign Jiras to myself?

Wiki and Jira Id: apoorvmittal10
Email: apoorvmitta...@gmail.com

Regards,
Apoorv Mittal
+44 7721681581


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Bruno Cadonna

Hi Nick,

1.
Yeah, you are probably right that it does not make too much sense. 
Thanks for the clarification!



4.
Yes, sorry for the back and forth, but I think for the sake of the KIP 
it is better to leave the ALOS behavior as it is for now due to the 
possible issues you would run into. Maybe we can find a solution in the 
future. Now the question returns to whether we really need 
default.state.isolation.level. Maybe the config could be the feature 
flag Sophie requested.



5.
There is a guideline in Kafka not to use the get prefix for getters (at 
least in the public API). Thus, could you please rename


getCommittedOffset(TopicPartition partition) -> 
committedOffsetFor(TopicPartition partition)


You can also propose an alternative to committedOffsetFor().


Best,
Bruno


On 10/13/23 3:21 PM, Nick Telford wrote:

Hi Bruno,

Thanks for getting back to me.

1.
I think this should be possible. Are you thinking of the situation where a
user may downgrade to a previous version of Kafka Streams? In that case,
sadly, the RocksDBStore would get wiped by the older version of Kafka
Streams anyway, because that version wouldn't understand the extra column
family (that holds offsets), so the missing Position file would
automatically get rebuilt when the store is rebuilt from the changelog.
Are there other situations than downgrade where a transactional store could
be replaced by a non-transactional one? I can't think of any.

2.
Ahh yes, the Test Plan - my Kryptonite! This section definitely needs to be
fleshed out. I'll work on that. How much detail do you need?

3.
See my previous email discussing this.

4.
Hmm, this is an interesting point. Are you suggesting that under ALOS
READ_COMMITTED should not be supported?

Regards,
Nick

On Fri, 13 Oct 2023 at 13:52, Bruno Cadonna  wrote:


Hi Nick,

I think the KIP is converging!


1.
I am wondering whether it makes sense to write the position file during
close as we do for the checkpoint file, so that in case the state store
is replaced with a non-transactional state store the non-transactional
state store finds the position file. I think, this is not strictly
needed, but would be a nice behavior instead of just deleting the
position file.


2.
The test plan does not mention integration tests. Do you not need to
extend existing ones and add new ones? Also for upgrading and
downgrading you might need integration and/or system tests.


3.
I think Sophie made a point. Although, IQ reading from uncommitted data
under EOS might be considered a bug by some people. Thus, your KIP would
fix a bug rather than changing the intended behavior. However, I also
see that a feature flag would help users that rely on this buggy
behavior (at least until AK 4.0).


4.
This is related to the previous point. I assume that the difference
between READ_COMMITTED and READ_UNCOMMITTED for ALOS is that in the
former you enable transactions on the state store and in the latter you
disable them. If my assumption is correct, I think that is an issue.
Let's assume under ALOS Streams fails over a couple of times more or
less at the same step in processing after value 3 is added to an
aggregation but the offset of the corresponding input record was not
committed. With transactions disabled, the aggregation value would
increase by 3 for each failover. With transactions enabled, value 3
would only be added to the aggregation once when the offset of the input
record is committed and the transaction finally completes. So the
content of the state store would change depending on the configuration
for IQ. IMO, the content of the state store should be independent from
IQ. Given this issue, I propose to not use transactions with ALOS at
all. I was a big proponent of using transactions with ALOS, but I
realized that transactions with ALOS is not as easy as enabling
transactions on state stores. Another aspect that is problematic is that
the changelog topic which actually replicates the state store is not
transactional under ALOS. Thus, it might happen that the state store and
the changelog differ in their content. All of this is maybe solvable
somehow, but for the sake of this KIP, I would leave it for the future.
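
The failover scenario above can be simulated in a few lines (a toy model, not Streams internals): with a write-through store, state mutated before the offset commit survives the failover and is applied again on replay; with a transactional store, state and offset commit atomically.

```python
def run(records, transactional):
    store, committed, crashed = {"agg": 0}, 0, False
    i = 0
    while i < len(records):
        if not transactional:
            store["agg"] += records[i]   # visible before the offset commit
        if not crashed and i == 0:
            crashed = True               # simulate one failover before the
            i = committed                # commit; replay from committed offset
            continue
        if transactional:
            store["agg"] += records[i]   # applied atomically with the commit
        committed = i + 1
        i += 1
    return store["agg"]

print(run([3], transactional=False))  # 6 -- value 3 applied once per attempt
print(run([3], transactional=True))   # 3 -- applied exactly once
```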


Best,
Bruno



On 10/12/23 10:32 PM, Sophie Blee-Goldman wrote:

Hey Nick! First of all thanks for taking up this awesome feature, I'm sure
every single Kafka Streams user and dev would agree that it is sorely needed.

I've just been catching up on the KIP and surrounding discussion, so please
forgive me for any misunderstandings or misinterpretations of the current
plan and don't hesitate to correct me.

Before I jump in, I just want to say that having seen this drag on for so
long, my singular goal in responding is to help this KIP past a perceived
impasse so we can finally move on to voting and implementing it. Long
discussions are to be expected for major features like this but it's
completely on us as the Streams devs to make sure there is an end in sight
for any ongoing 

Build failed in Jenkins: Kafka » Kafka Branch Builder » 3.6 #92

2023-10-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 306736 lines...]

Gradle Test Run :core:test > Gradle Test Executor 90 > TransactionsTest > 
testBumpTransactionalEpoch(String) > 
testBumpTransactionalEpoch(String).quorum=zk PASSED

Gradle Test Run :core:test > Gradle Test Executor 90 > TransactionsTest > 
testBumpTransactionalEpoch(String) > 
testBumpTransactionalEpoch(String).quorum=kraft STARTED

Gradle Test Run :core:test > Gradle Test Executor 90 > TransactionsTest > 
testBumpTransactionalEpoch(String) > 
testBumpTransactionalEpoch(String).quorum=kraft PASSED

5104 tests completed, 1 failed, 9 skipped
There were failing tests. See the report at: 
file:///home/jenkins/workspace/Kafka_kafka_3.6/core/build/reports/tests/test/index.html

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.2.1/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 2h 26m 46s
296 actionable tasks: 109 executed, 187 up-to-date

Publishing build scan...
https://ge.apache.org/s/v7kgndgiph3ug


See the profiling report at: 
file:///home/jenkins/workspace/Kafka_kafka_3.6/build/reports/profile/profile-2023-10-13-12-08-37.html
A fine-grained performance profile is available: use the --scan option.
[Pipeline] junit
Recording test results
[Checks API] No suitable checks publisher found.
[Pipeline] echo
Skipping Kafka Streams archetype test for Java 20
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
[Pipeline] // timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
> Task :core:classes
> Task :core:compileTestJava NO-SOURCE
> Task :core:compileTestScala
> Task :core:testClasses
> Task :streams:compileTestJava
> Task :streams:testClasses
> Task :streams:testJar
> Task :streams:testSrcJar
> Task :streams:publishMavenJavaPublicationToMavenLocal
> Task :streams:publishToMavenLocal

Deprecated Gradle features were used in this build, making it incompatible with 
Gradle 9.0.

You can use '--warning-mode all' to show the individual deprecation warnings 
and determine if they come from your own scripts or plugins.

For more on this, please refer to 
https://docs.gradle.org/8.2.1/userguide/command_line_interface.html#sec:command_line_warnings
 in the Gradle documentation.

BUILD SUCCESSFUL in 8m 1s
94 actionable tasks: 41 executed, 53 up-to-date

Publishing build scan...
https://ge.apache.org/s/v4ud2lcbhh4ae

[Pipeline] sh
+ grep ^version= gradle.properties
+ cut -d= -f 2
[Pipeline] dir
Running in /home/jenkins/workspace/Kafka_kafka_3.6/streams/quickstart
[Pipeline] {
[Pipeline] sh
+ mvn clean install -Dgpg.skip
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Kafka Streams :: Quickstart[pom]
[INFO] streams-quickstart-java[maven-archetype]
[INFO] 
[INFO] < org.apache.kafka:streams-quickstart >-
[INFO] Building Kafka Streams :: Quickstart 3.6.1-SNAPSHOT[1/2]
[INFO]   from pom.xml
[INFO] [ pom ]-
[INFO] 
[INFO] --- clean:3.0.0:clean (default-clean) @ streams-quickstart ---
[INFO] 
[INFO] --- remote-resources:1.5:process (process-resource-bundles) @ 
streams-quickstart ---
[INFO] 
[INFO] --- site:3.5.1:attach-descriptor (attach-descriptor) @ 
streams-quickstart ---
[INFO] 
[INFO] --- gpg:1.6:sign (sign-artifacts) @ streams-quickstart ---
[INFO] 
[INFO] --- install:2.5.2:install (default-install) @ streams-quickstart ---
[INFO] Installing 
/home/jenkins/workspace/Kafka_kafka_3.6/streams/quickstart/pom.xml to 
/home/jenkins/.m2/repository/org/apache/kafka/streams-quickstart/3.6.1-SNAPSHOT/streams-quickstart-3.6.1-SNAPSHOT.pom
[INFO] 
[INFO] --< org.apache.kafka:streams-quickstart-java >--
[INFO] Building streams-quickstart-java 3.6.1-SNAPSHOT[2/2]
[INFO]   from java/pom.xml
[INFO] --[ maven-archetype ]---
[INFO] 
[INFO] --- clean:3.0.0:clean (default-clean) @ streams-quickstart-java ---
[INFO] 
[INFO] --- remote-resources:1.5:process (process-resource-bundles) @ 
streams-quickstart-java ---
[INFO] 
[INFO] --- resources:2.7:resources (default-resources) @ 
streams-quickstart-java ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- 

Jenkins build is unstable: Kafka » Kafka Branch Builder » trunk #2283

2023-10-13 Thread Apache Jenkins Server
See 




Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Nick Telford
Hi Bruno,

Thanks for getting back to me.

1.
I think this should be possible. Are you thinking of the situation where a
user may downgrade to a previous version of Kafka Streams? In that case,
sadly, the RocksDBStore would get wiped by the older version of Kafka
Streams anyway, because that version wouldn't understand the extra column
family (that holds offsets), so the missing Position file would
automatically get rebuilt when the store is rebuilt from the changelog.
Are there other situations than downgrade where a transactional store could
be replaced by a non-transactional one? I can't think of any.

2.
Ahh yes, the Test Plan - my Kryptonite! This section definitely needs to be
fleshed out. I'll work on that. How much detail do you need?

3.
See my previous email discussing this.

4.
Hmm, this is an interesting point. Are you suggesting that under ALOS
READ_COMMITTED should not be supported?

Regards,
Nick

On Fri, 13 Oct 2023 at 13:52, Bruno Cadonna  wrote:

> Hi Nick,
>
> I think the KIP is converging!
>
>
> 1.
> I am wondering whether it makes sense to write the position file during
> close as we do for the checkpoint file, so that in case the state store
> is replaced with a non-transactional state store the non-transactional
> state store finds the position file. I think, this is not strictly
> needed, but would be a nice behavior instead of just deleting the
> position file.
>
>
> 2.
> The test plan does not mention integration tests. Do you not need to
> extend existing ones and add new ones. Also for upgrading and
> downgrading you might need integration and/or system tests.
>
>
> 3.
> I think Sophie made a point. Although, IQ reading from uncommitted data
> under EOS might be considered a bug by some people. Thus, your KIP would
> fix a bug rather than changing the intended behavior. However, I also
> see that a feature flag would help users that rely on this buggy
> behavior (at least until AK 4.0).
>
>
> 4.
> This is related to the previous point. I assume that the difference
> between READ_COMMITTED and READ_UNCOMMITTED for ALOS is that in the
> former you enable transactions on the state store and in the latter you
> disable them. If my assumption is correct, I think that is an issue.
> Let's assume under ALOS Streams fails over a couple of times more or
> less at the same step in processing after value 3 is added to an
> aggregation but the offset of the corresponding input record was not
> committed. With transactions disabled, the aggregation value would
> increase by 3 for each failover. With transactions enabled, value 3
> would only be added to the aggregation once when the offset of the input
> record is committed and the transaction finally completes. So the
> content of the state store would change depending on the configuration
> for IQ. IMO, the content of the state store should be independent from
> IQ. Given this issue, I propose to not use transactions with ALOS at
> all. I was a big proponent of using transactions with ALOS, but I
> realized that transactions with ALOS is not as easy as enabling
> transactions on state stores. Another aspect that is problematic is that
> the changelog topic which actually replicates the state store is not
> transactional under ALOS. Thus, it might happen that the state store and
> the changelog differ in their content. All of this is maybe solvable
> somehow, but for the sake of this KIP, I would leave it for the future.
>
>
> Best,
> Bruno
>
>
>
> On 10/12/23 10:32 PM, Sophie Blee-Goldman wrote:
> > Hey Nick! First of all thanks for taking up this awesome feature, I'm
> sure
> > every single
> > Kafka Streams user and dev would agree that it is sorely needed.
> >
> > I've just been catching up on the KIP and surrounding discussion, so
> please
> > forgive me
> > for any misunderstandings or misinterpretations of the current plan and
> > don't hesitate to
> > correct me.
> >
> > Before I jump in, I just want to say that having seen this drag on for so
> > long, my singular
> > goal in responding is to help this KIP past a perceived impasse so we can
> > finally move on
> > to voting and implementing it. Long discussions are to be expected for
> > major features like
> > this but it's completely on us as the Streams devs to make sure there is
> an
> > end in sight
> > for any ongoing discussion.
> >
> > With that said, it's my understanding that the KIP as currently proposed
> is
> > just not tenable
> > for Kafka Streams, and would prevent some EOS users from upgrading to the
> > version it
> > first appears in. Given that we can't predict or guarantee whether any of
> > the followup KIPs
> > would be completed in the same release cycle as this one, we need to make
> > sure that the
> > feature is either compatible with all current users or else
> feature-flagged
> > so that they may
> > opt in/out.
> >
> > Therefore, IIUC we need to have either (or both) of these as
> > fully-implemented config options:
> > 1. 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Bruno Cadonna

Hi Nick,

I think the KIP is converging!


1.
I am wondering whether it makes sense to write the position file during 
close as we do for the checkpoint file, so that in case the state store 
is replaced with a non-transactional state store the non-transactional 
state store finds the position file. I think, this is not strictly 
needed, but would be a nice behavior instead of just deleting the 
position file.



2.
The test plan does not mention integration tests. Do you not need to 
extend existing ones and add new ones? Also for upgrading and 
downgrading you might need integration and/or system tests.



3.
I think Sophie made a point. Although, IQ reading from uncommitted data 
under EOS might be considered a bug by some people. Thus, your KIP would 
fix a bug rather than changing the intended behavior. However, I also 
see that a feature flag would help users that rely on this buggy 
behavior (at least until AK 4.0).



4.
This is related to the previous point. I assume that the difference 
between READ_COMMITTED and READ_UNCOMMITTED for ALOS is that in the 
former you enable transactions on the state store and in the latter you 
disable them. If my assumption is correct, I think that is an issue. 
Let's assume under ALOS Streams fails over a couple of times more or 
less at the same step in processing after value 3 is added to an 
aggregation but the offset of the corresponding input record was not 
committed. With transactions disabled, the aggregation value would 
increase by 3 for each failover. With transactions enabled, value 3 
would only be added to the aggregation once when the offset of the input 
record is committed and the transaction finally completes. So the 
content of the state store would change depending on the configuration 
for IQ. IMO, the content of the state store should be independent from 
IQ. Given this issue, I propose to not use transactions with ALOS at 
all. I was a big proponent of using transactions with ALOS, but I 
realized that transactions with ALOS is not as easy as enabling 
transactions on state stores. Another aspect that is problematic is that 
the changelog topic which actually replicates the state store is not 
transactional under ALOS. Thus, it might happen that the state store and 
the changelog differ in their content. All of this is maybe solvable 
somehow, but for the sake of this KIP, I would leave it for the future.
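To make the failover scenario above concrete, here is a toy model in plain Java (not Kafka Streams code; the class and method names are invented for illustration). A "transactional" store buffers writes until the input offset is committed, while a plain store applies them immediately, so a record replayed across failovers is double-counted only in the plain store:

```java
import java.util.HashMap;
import java.util.Map;

public class AlosFailoverSketch {

    // Plain store: the write survives even though the offset commit failed,
    // so reprocessing the same record adds 3 again on every failover.
    static long plainStoreAfterFailovers(int failovers) {
        Map<String, Long> store = new HashMap<>();
        for (int i = 0; i <= failovers; i++) {
            store.merge("agg", 3L, Long::sum); // record replayed each attempt
        }
        return store.get("agg");
    }

    // Transactional store: uncommitted writes are discarded on failure,
    // so the aggregation reflects the record exactly once.
    static long txnStoreAfterFailovers(int failovers) {
        Map<String, Long> store = new HashMap<>();
        for (int i = 0; i < failovers; i++) {
            Map<String, Long> txn = new HashMap<>(store);
            txn.merge("agg", 3L, Long::sum);
            // crash before the offset commit: the txn buffer is thrown away
        }
        Map<String, Long> txn = new HashMap<>(store);
        txn.merge("agg", 3L, Long::sum);
        store.putAll(txn); // offset committed, transaction flushed
        return store.get("agg");
    }

    public static void main(String[] args) {
        System.out.println(plainStoreAfterFailovers(2)); // value 3 applied three times
        System.out.println(txnStoreAfterFailovers(2));   // value 3 applied once
    }
}
```

After two failovers the plain store holds 9 while the transactional store holds 3, which is the divergence between configurations that the paragraph argues against.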



Best,
Bruno



On 10/12/23 10:32 PM, Sophie Blee-Goldman wrote:

Hey Nick! First of all thanks for taking up this awesome feature, I'm sure
every single
Kafka Streams user and dev would agree that it is sorely needed.

I've just been catching up on the KIP and surrounding discussion, so please
forgive me
for any misunderstandings or misinterpretations of the current plan and
don't hesitate to
correct me.

Before I jump in, I just want to say that having seen this drag on for so
long, my singular
goal in responding is to help this KIP past a perceived impasse so we can
finally move on
to voting and implementing it. Long discussions are to be expected for
major features like
this but it's completely on us as the Streams devs to make sure there is an
end in sight
for any ongoing discussion.

With that said, it's my understanding that the KIP as currently proposed is
just not tenable
for Kafka Streams, and would prevent some EOS users from upgrading to the
version it
first appears in. Given that we can't predict or guarantee whether any of
the followup KIPs
would be completed in the same release cycle as this one, we need to make
sure that the
feature is either compatible with all current users or else feature-flagged
so that they may
opt in/out.

Therefore, IIUC we need to have either (or both) of these as
fully-implemented config options:
1. default.state.isolation.level
2. enable.transactional.state.stores

This way EOS users for whom read_committed semantics are not viable can
still upgrade,
and either use the isolation.level config to leverage the new txn state
stores without sacrificing
their application semantics, or else simply keep the transactional state
stores disabled until we
are able to fully implement the isolation level configuration at either an
application or query level.

Frankly you are the expert here and know much more about the tradeoffs in
both semantics and
effort level of implementing one of these configs vs the other. In my
opinion, either option would
be fine and I would leave the decision of which one to include in this KIP
completely up to you.
I just don't see a way for the KIP to proceed without some variation of the
above that would allow
EOS users to opt-out of read_committed.
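For illustration, the two opt-out routes described above might look like this in application config. Note that both keys ("default.state.isolation.level" and "enable.transactional.state.stores") are names proposed in the KIP-892 discussion, not released Kafka Streams configs, so this is a sketch of the proposal rather than working configuration:

```java
import java.util.Properties;

public class Kip892ConfigSketch {
    static Properties eosWithReadUncommitted() {
        Properties props = new Properties();
        props.setProperty("application.id", "my-streams-app");
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("processing.guarantee", "exactly-once-v2");
        // Option 1 (proposed): keep EOS but retain today's dirty-read IQ semantics.
        props.setProperty("default.state.isolation.level", "READ_UNCOMMITTED");
        // Option 2 (proposed alternative): feature-flag transactional stores off entirely.
        // props.setProperty("enable.transactional.state.stores", "false");
        return props;
    }
}
```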

(If it's all the same to you, I would recommend always including a feature
flag in large structural
changes like this. No matter how much I trust someone or myself to
implement a feature, you just
never know what kind of bugs might slip in, especially with the very first
iteration that gets released.
So personally, my choice 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Nick Telford
Hi Sophie,

Thanks for taking the time to review the KIP and catch up.

> my singular goal in responding is to help this KIP past a perceived
impasse so we can finally move on to voting and implementing it
Just so we're clear, is the impasse you're referring to this limitation in
the current version of the KIP?

> The READ_UNCOMMITTED isolation level will only be available under the
at-least-once processing.mode. If READ_UNCOMMITTED is selected with an EOS
processing.mode, it will be automatically upgraded to READ_COMMITTED and a
warning will be produced.

Firstly, I'd like to make an argument for this being reasonable for all EOS
users, and then I'm going to discuss the technical reasons why I didn't
include a feature flag, and why including one might be problematic.

Note: I'm using emphasis to highlight important points, not to convey
emotion.

*READ_COMMITTED by default for EOS:*
The main thing to bear in mind is that the isolation level only affects
Interactive Queries. Stream processing (processors that read from stores,
like join) will *always* read from the ongoing transaction, irrespective of
isolation level.
This means that to be affected by this change, you must:

   1. have processing.mode = exactly-once(-v2|-beta)
   2. be conducting Interactive Queries against stores

At READ_COMMITTED, the maximum latency (the time between a record being
processed and a record being visible to Interactive Queries) is dictated by
the commit.interval.ms. Under EOS this defaults to 100ms, which should be
sufficiently low that a difference will be (nearly) undetectable to users.

Regardless, there is a change being introduced here, so I want to think
about what that means, semantically.

Presently under EOS, (essentially READ_UNCOMMITTED) records are visible to
Interactive Queries as soon as they are written to the local StateStore,
but *before* the records are made available to the changelog topic by a
Task commit. This creates the curious situation where an Interactive Query
could see a write rolled back. The order of events would be this:

   1. Record A is processed and written to a StateStore
   2. An Interactive Query observes Record A
   3. The application crashes *before* the Task commits.
   4. On restart, the application rebuilds its state from the changelog.
   5. The Task transitions to RUNNING, making its StateStore queryable by
   Interactive Queries.
   6. *An Interactive Query attempts to observe Record A, but finds it
   missing.*
   7. Record A is processed and written to the StateStore.

For this reason, I would classify an isolation level of READ_UNCOMMITTED
under EOS as a bug, rather than a feature. The changelogs are intended to
be the gold standard for StateStores, so *Interactive Queries should only ever
be able to observe state that has been persisted to the changelog*.

Colt did raise a use-case above that would depend on READ_UNCOMMITTED under
EOS. This appears to require that two Interactive Queries are processed
that are related by a record, one that creates it, and one that reads it
back. I would like to better understand this use-case/usage pattern; in
particular, how the Interactive Query creates a StateStore record. *Colt,
could you either reply to this thread or reach out on Slack with more
detail here please?* Depending on your design, it's possible that this
might be mitigated/resolved by having your processor explicitly request an
early Task commit via ProcessorContext#commit().
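The "request an early commit" pattern mentioned here can be sketched as follows. This is plain Java with a stub `Context` interface standing in for the real Kafka Streams ProcessorContext (whose commit() method only requests, not forces, a commit); everything else is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyCommitSketch {

    // Stand-in for ProcessorContext; in real code you would call
    // org.apache.kafka.streams.processor.api.ProcessorContext#commit().
    interface Context { void commit(); }

    static class RecordingContext implements Context {
        final List<String> events = new ArrayList<>();
        public void commit() { events.add("commit-requested"); }
    }

    // A processor that writes a record needed by a follow-up Interactive
    // Query asks the runtime to commit early, so the write becomes durable
    // (and IQ-visible under READ_COMMITTED) sooner than commit.interval.ms.
    static void process(Context ctx, List<String> store, String value) {
        store.add(value); // state-store write
        ctx.commit();     // request an early Task commit
    }
}
```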

*Feature flag:*
I have been implementing this KIP in parallel with its design, because I
don't know enough about the internals of Kafka Streams to settle on a
design without first exploring the feasibility. While such a feature flag
might well be possible, I have not yet had the time to attempt to implement
it. I believe the main difficulty will be with offset management, as it
will need a completely different code-path when operating
non-transactionally, in order to detect when the store contains uncommitted
data.

FWIW, I don't think an explicit feature flag is necessary/desirable, but
instead (if possible) we should replace the current restriction with:

> Iff processing.mode = exactly-once(-v2|-beta) and
default.state.isolation.level = READ_UNCOMMITTED, local state will be wiped
and rebuilt from changelogs on-error, as is currently the case in 3.6.0.

This way, there is no change to behaviour at all, until a user explicitly
sets default.state.isolation.level to READ_COMMITTED, which would (under
EOS) no longer require wiping of the store.

The advantage of this over an explicit "enable transactions" feature flag
is that:

   1. It doesn't need to be deprecated and removed in the future once
   transactionality no longer has restrictions and becomes the default.
   2. It allows for the possibility of a future KIP that selectively alters
   the isolation level for individual Interactive Queries.

However, this all assumes that it is indeed possible to make
READ_UNCOMMITTED work under EOS.

In the 

Build failed in Jenkins: Kafka » Kafka Branch Builder » trunk #2282

2023-10-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 315602 lines...]

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldDrainRestoredActiveTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemovePausedTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemovePausedTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemoveTasksFromAndClearInputQueueOnShutdown() 
STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldRemoveTasksFromAndClearInputQueueOnShutdown() 
PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfAddingActiveTasksWithSameId() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfAddingActiveTasksWithSameId() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > 
shouldRestoreActiveStatefulTasksAndUpdateStandbyTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > 
shouldRestoreActiveStatefulTasksAndUpdateStandbyTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldPauseStandbyTask() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldPauseStandbyTask() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfStatefulTaskNotInStateRestoring() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultStateUpdaterTest > shouldThrowIfStatefulTaskNotInStateRestoring() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetUncaughtStreamsException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetUncaughtStreamsException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskTimeoutOnProcessed() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskTimeoutOnProcessed() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenRequired() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskReleaseFutureOnShutdown() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldClearTaskReleaseFutureOnShutdown() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldProcessTasks() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldProcessTasks() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateStreamTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateStreamTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldShutdownTaskExecutor() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldShutdownTaskExecutor() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldAwaitProcessableTasksIfNoneAssignable() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldAwaitProcessableTasksIfNoneAssignable() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > 
shouldRespectPunctuationDisabledByTaskExecutionMetadata() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > 
shouldRespectPunctuationDisabledByTaskExecutionMetadata() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldSetTaskTimeoutOnTimeoutException() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateSystemTime() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldPunctuateSystemTime() PASSED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > shouldUnassignTaskWhenNotProgressing() STARTED

Gradle Test Run :streams:test > Gradle Test Executor 84 > 
DefaultTaskExecutorTest > 

Re: [DISCUSS] KIP-963: Upload and delete lag metrics in Tiered Storage

2023-10-13 Thread Christo Lolov
Heya Gantigmaa,

Apologies for the (very) late reply!

Now that 3.6 has been released and reviewers have a bit more time I will be
picking up this KIP again. I am more than happy to add useful new metrics
to the KIP, I would just ask for a couple of days to review your pull
request and I will come back to you.

Best,
Christo

On Mon, 25 Sept 2023 at 10:49, Gantigmaa Selenge 
wrote:

> Hi Christo,
>
> Thank you for writing the KIP.
>
> I recently raised a PR to add metrics for tracking remote segment deletions
> (https://github.com/apache/kafka/pull/14375) but realised those metrics
> were not mentioned in the original KIP-405 or KIP-930. Do you think these
> would make sense to be added to this KIP and get included in the
> discussion?
>
> Regards,
> Gantigmaa
>
> On Wed, Aug 9, 2023 at 1:53 PM Christo Lolov 
> wrote:
>
> > Heya Kamal,
> >
> > Thank you for going through the KIP and for the question!
> >
> > I have been thinking about this and as an operator I might find it the
> most
> > useful to know all three of them actually.
> >
> > I would find knowing the size in bytes useful to determine how much disk
> I
> > might need to add temporarily to compensate for the slowdown.
> > I would find knowing the number of records useful, because using the
> > MessagesInPerSec metric I would be able to determine how old the records
> > which are facing problems are.
> > I would find knowing the number of segments useful because I would be
> able
> > to correlate this with whether I need to change
> > *remote.log.manager.task.interval.ms* to a lower or higher
> > value.
> >
> > What are your thoughts on the above? Would you find some of them more
> > useful than others?
> >
> > Best,
> > Christo
> >
> > On Tue, 8 Aug 2023 at 16:43, Kamal Chandraprakash <
> > kamal.chandraprak...@gmail.com> wrote:
> >
> > > Hi Christo,
> > >
> > > Thanks for the KIP!
> > >
> > > The proposed tiered storage metrics are useful. The unit mentioned in
> the
> > > KIP is the number of records.
> > > Each topic can have varying amounts of records in a segment depending
> on
> > > the record size.
> > >
> > > Do you think having the tier-lag by number of segments (or) size of
> > > segments in bytes will be useful
> > > to the operator?
> > >
> > > Thanks,
> > > Kamal
> > >
> > > On Tue, Aug 8, 2023 at 8:56 PM Christo Lolov 
> > > wrote:
> > >
> > > > Hello all!
> > > >
> > > > I would like to start a discussion for KIP-963: Upload and delete lag
> > > > metrics in Tiered Storage (https://cwiki.apache.org/confluence/x/sZGzDw).
> > > >
> > > > The purpose of this KIP is to introduce a couple of metrics to track
> > lag
> > > > with respect to remote storage from the point of view of Kafka.
> > > >
> > > > Thanks in advance for leaving a review!
> > > >
> > > > Best,
> > > > Christo
> > > >
> > >
> >
>


Re: [PR] Add upgrade documentation for 3.6.0 [kafka-site]

2023-10-13 Thread via GitHub


showuon merged PR #559:
URL: https://github.com/apache/kafka-site/pull/559


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Add upgrade documentation for 3.6.0 [kafka-site]

2023-10-13 Thread via GitHub


fvaleri opened a new pull request, #559:
URL: https://github.com/apache/kafka-site/pull/559

   This change adds the upgrade documentation for 3.6.0 and fixes the notable 
changes position in 3.5.0. In previous releases, notable changes have always come 
after the upgrade instructions.

