Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-05-17 Thread Nick Telford
there is state
> but no
>   checkpoint, and write the checkpoint if needed. If it can't grab
> the lock,
>   then we know one of the other StreamThreads must be handling the
> checkpoint
>   file for that task directory, and we can move on.
>
> I don't feel too strongly about which approach is best; doing it in
> KafkaStreams#start is certainly the simplest, while doing it in the
> StreamThread's startup is more efficient. If we're worried about adding too
> much weight to KafkaStreams#start then the 2nd option is probably best,
> though slightly more complicated.
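A minimal sketch of the try-lock idea described above (hypothetical names and structure; not the actual Streams internals):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class CheckpointOnStartup {
    // Hypothetical: one lock per task directory, shared across StreamThreads
    // in the same process.
    private static final Map<String, Lock> DIR_LOCKS = new ConcurrentHashMap<>();

    // Returns true if this thread wrote the checkpoint for the task directory.
    static boolean maybeWriteCheckpoint(String taskDir,
                                        boolean hasState,
                                        boolean hasCheckpoint) {
        Lock lock = DIR_LOCKS.computeIfAbsent(taskDir, d -> new ReentrantLock());
        if (!lock.tryLock()) {
            // Another StreamThread is handling this task directory; move on.
            return false;
        }
        try {
            if (hasState && !hasCheckpoint) {
                // Here the real code would write the .checkpoint file from the
                // store's committed offsets.
                return true;
            }
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        // State but no checkpoint: we grab the lock and write it.
        System.out.println(maybeWriteCheckpoint("/state/0_1", true, false)); // true
        // Checkpoint already present: nothing to do.
        System.out.println(maybeWriteCheckpoint("/state/0_2", true, true));  // false
    }
}
```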
>
> Thoughts?
>
> On Tue, May 14, 2024 at 10:02 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > Sorry for the delay in replying. I've finally now got some time to work
> on
> > this.
> >
> > Addressing Matthias's comments:
> >
> > 100.
> > Good point. As Bruno mentioned, there's already
> AbstractReadWriteDecorator
> > which we could leverage to provide that protection. I'll add details on
> > this to the KIP.
> >
> > 101,102.
> > It looks like these points have already been addressed by Bruno. Let me
> > know if anything here is still unclear or you feel needs to be detailed
> > more in the KIP.
> >
> > 103.
> > I'm in favour of anything that gets the old code removed sooner, but
> > wouldn't deprecating an API that we expect (some) users to implement
> cause
> > problems?
> > I'm thinking about implementers of custom StateStores, as they may be
> > confused by managesOffsets() being deprecated, especially since they
> would
> > have to mark their implementation as @Deprecated in order to avoid
> compile
> > warnings.
> > If deprecating an API *while it's still expected to be implemented* is
> > something that's generally done in the project, then I'm happy to do so
> > here.
> >
> > 104.
> > I think this is technically possible, but at the cost of considerable
> > additional code to maintain. Would we ever have a pathway to remove this
> > downgrade code in the future?
> >
> >
> > Regarding rebalance metadata:
> > Opening all stores on start-up to read and cache their offsets is an
> > interesting idea, especially if we can avoid re-opening the stores once
> the
> > Tasks have been assigned. Scalability shouldn't be too much of a problem,
> > because typically users have a fairly short state.cleanup.delay, so the
> > number of on-disk Task directories should rarely exceed the number of
> Tasks
> > previously assigned to that instance.
> > An advantage of this approach is that it would also simplify StateStore
> > implementations, as they would only need to guarantee that committed
> > offsets are available when the store is open.
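The startup scan being discussed could be roughly sketched as follows (hypothetical layout and names; the real committed offsets would come from briefly opening each store):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class StartupOffsetScan {
    // Hypothetical: walk the state directory, read each on-disk task's
    // committed changelog offset, and cache it for rebalance metadata.
    static Map<String, Long> cacheCommittedOffsets(Path stateDir) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        if (!Files.isDirectory(stateDir)) {
            return offsets;
        }
        try (DirectoryStream<Path> taskDirs = Files.newDirectoryStream(stateDir)) {
            for (Path taskDir : taskDirs) {
                // Stand-in for: store.open(); store.committedOffset(); store.close();
                offsets.put(taskDir.getFileName().toString(), readCommittedOffset(taskDir));
            }
        }
        return offsets;
    }

    static long readCommittedOffset(Path taskDir) {
        return 0L; // placeholder: the real value would come from the store itself
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("state");
        Files.createDirectory(dir.resolve("0_0"));
        System.out.println(cacheCommittedOffsets(dir).containsKey("0_0")); // true
    }
}
```

Because `state.cleanup.delay` keeps the number of on-disk task directories small, a scan like this should stay cheap on startup.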
> >
> > I'll investigate this approach this week for feasibility and report back.
> >
> > I think that covers all the outstanding feedback, unless I missed
> anything?
> >
> > Regards,
> > Nick
> >
> > On Mon, 6 May 2024 at 14:06, Bruno Cadonna  wrote:
> >
> > > Hi Matthias,
> > >
> > > I see what you mean.
> > >
> > > To sum up:
> > >
> > > With this KIP the .checkpoint file is written when the store closes.
> > > That is when:
> > > 1. a task moves away from Kafka Streams client
> > > 2. Kafka Streams client shuts down
> > >
> > > A Kafka Streams client needs the information in the .checkpoint file
> > > 1. on startup because it does not have any open stores yet.
> > > 2. during rebalances for non-empty state directories of tasks that are
> > > not assigned to the Kafka Streams client.
> > >
> > > With hard crashes, i.e., when the Streams client is not able to close
> > > its state stores and write the .checkpoint file, the .checkpoint file
> > > might be quite stale. That influences the next rebalance after failover
> > > negatively.
> > >
> > >
> > > My conclusion is that Kafka Streams either needs to open the state
> > > stores at start up or we write the checkpoint file more often.
> > >
> > > Writing the .checkpoint file during processing more often without
> > > controlling the flush to disk would work. However, Kafka Streams would
> > > checkpoint offsets that are not yet persisted on disk by the state
> > > store. That is with a hard crash the offsets in the .checkpoint file
> > > might be larger than the offsets checkpointed in the state store. That
> > > might not be a problem if Kafka Streams uses the .che

Re: [VOTE] KIP-989: RocksDB Iterator Metrics

2024-05-16 Thread Nick Telford
Oh shoot, you're right. I miscounted.

The vote remains open.

On Thu, 16 May 2024, 20:11 Josep Prat,  wrote:

> Hi Nick,
> I think you need one more day to reach the 72 hours. You opened the vote on
> the 14th, right?
>
> Best,
>
> 
>
> Josep Prat
> Open Source Engineering Director, Aiven | josep.p...@aiven.io |
> +491715557497 | aiven.io
> Aiven Deutschland GmbH
> Alexanderufer 3-7, 10117 Berlin
> Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> Amtsgericht Charlottenburg, HRB 209739 B
>
> On Thu, May 16, 2024, 19:40 Nick Telford  wrote:
>
> > Hi everyone,
> >
> > With 3 binding votes and no objections, the vote passes.
> >
> > KIP-989 is adopted.
> >
> > Cheers,
> > Nick
> >
> > On Wed, 15 May 2024 at 03:41, Sophie Blee-Goldman  >
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks!
> > >
> > > On Tue, May 14, 2024 at 6:58 PM Matthias J. Sax 
> > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On 5/14/24 9:19 AM, Lucas Brutschy wrote:
> > > > > Hi Nick!
> > > > >
> > > > > Thanks for the KIP.
> > > > >
> > > > > +1 (binding)
> > > > >
> > > > > On Tue, May 14, 2024 at 5:16 PM Nick Telford <
> nick.telf...@gmail.com
> > >
> > > > wrote:
> > > > >>
> > > > >> Hi everyone,
> > > > >>
> > > > >> I'd like to call a vote on the Kafka Streams KIP-989: RocksDB
> > Iterator
> > > > >> Metrics:
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics
> > > > >>
> > > > >> All of the points in the discussion thread have now been
> addressed.
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Nick
> > > >
> > >
> >
>


Re: [VOTE] KIP-989: RocksDB Iterator Metrics

2024-05-16 Thread Nick Telford
Hi everyone,

With 3 binding votes and no objections, the vote passes.

KIP-989 is adopted.

Cheers,
Nick

On Wed, 15 May 2024 at 03:41, Sophie Blee-Goldman 
wrote:

> +1 (binding)
>
> Thanks!
>
> On Tue, May 14, 2024 at 6:58 PM Matthias J. Sax  wrote:
>
> > +1 (binding)
> >
> > On 5/14/24 9:19 AM, Lucas Brutschy wrote:
> > > Hi Nick!
> > >
> > > Thanks for the KIP.
> > >
> > > +1 (binding)
> > >
> > > On Tue, May 14, 2024 at 5:16 PM Nick Telford 
> > wrote:
> > >>
> > >> Hi everyone,
> > >>
> > >> I'd like to call a vote on the Kafka Streams KIP-989: RocksDB Iterator
> > >> Metrics:
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics
> > >>
> > >> All of the points in the discussion thread have now been addressed.
> > >>
> > >> Regards,
> > >>
> > >> Nick
> >
>


Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2024-05-16 Thread Nick Telford
Actually, one other point: our existing state store operation metrics are
measured in nanoseconds[1].

Should iterator-duration-(avg|max) also be measured in nanoseconds, for
consistency, or should we keep them milliseconds, as the KIP currently
states?

1:
https://docs.confluent.io/platform/current/streams/monitoring.html#state-store-metrics
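For illustration, measuring iterator duration in nanoseconds (consistent with the existing state store operation metrics) is straightforward, and milliseconds would only be a unit conversion at reporting time. A small sketch, with hypothetical names:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class IteratorDuration {
    // Hypothetical wrapper: records how long an iterator was open, in
    // nanoseconds, matching the granularity of the existing metrics.
    static <T> long timeIteration(Iterator<T> it) {
        final long startNs = System.nanoTime();
        while (it.hasNext()) {
            it.next();
        }
        return System.nanoTime() - startNs; // iterator-duration, in nanoseconds
    }

    public static void main(String[] args) {
        long durationNs = timeIteration(List.of(1, 2, 3).iterator());
        // Reporting iterator-duration-(avg|max) in ms would just be:
        long durationMs = TimeUnit.NANOSECONDS.toMillis(durationNs);
        System.out.println(durationNs >= 0 && durationMs >= 0);
    }
}
```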

On Thu, 16 May 2024 at 12:15, Nick Telford  wrote:

> Good point! I've updated it to "Improved StateStore Iterator metrics for
> detecting leaks" - let me know if you have a better suggestion.
>
> This shouldn't affect the voting imo, as nothing of substance has changed.
>
> Regards,
> Nick
>
> On Thu, 16 May 2024 at 01:39, Sophie Blee-Goldman 
> wrote:
>
>> One quick thing -- can you update the title of this KIP to reflect the
>> decision to implement these metrics for all state store implementations
>> rather than just RocksDB?
>>
>>
>> On Tue, May 14, 2024 at 1:36 PM Nick Telford 
>> wrote:
>>
>> > Woops! Thanks for the catch Lucas. Given this was just a typo, I don't
>> > think this affects the voting.
>> >
>> > Cheers,
>> > Nick
>> >
>> > On Tue, 14 May 2024 at 18:06, Lucas Brutschy > > .invalid>
>> > wrote:
>> >
>> > > Hi Nick,
>> > >
>> > > you are still referring to oldest-open-iterator-age-ms in the
>> > > `Proposed Changes` section.
>> > >
>> > > Cheers,
>> > > Lucas
>> > >
>> > > On Thu, May 2, 2024 at 4:00 PM Lucas Brutschy > >
>> > > wrote:
>> > > >
>> > > > Hi Nick!
>> > > >
>> > > > I agree, the age variant is a bit nicer since the semantics are very
>> > > > clear from the name. If you'd rather go for the simple
>> implementation,
>> > > > how about calling it `oldest-iterator-open-since-ms`? I believe this
>> > > > could be understood without docs. Either way, I think we should be
>> > > > able to open the vote for this KIP because nobody raised any major /
>> > > > blocking concerns.
>> > > >
>> > > > Looking forward to getting this voted on soon!
>> > > >
>> > > > Cheers
>> > > > Lucas
>> > > >
>> > > > On Sun, Mar 31, 2024 at 5:23 PM Nick Telford <
>> nick.telf...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > Hi Matthias,
>> > > > >
>> > > > > > For the oldest iterator metric, I would propose something simple
>> > like
>> > > > > > `iterator-opened-ms` and it would just be the actual timestamp
>> when
>> > > the
>> > > > > > iterator was opened. I don't think we need to compute the actual
>> > age,
> >> > > > > but users can do this computation themselves?
>> > > > >
>> > > > > That works for me; it's easier to implement like that :-D I'm a
>> > little
>> > > > > concerned that the name "iterator-opened-ms" may not be obvious
>> > enough
>> > > > > without reading the docs.
>> > > > >
>> > > > > > If we think reporting the age instead of just the timestamp is
>> > > better, I
> >> > > > > would propose `iterator-max-age-ms`. It should be sufficient to
>> call
>> > > out
>> > > > > > (as it's kinda "obvious" anyway) that the metric applies to open
>> > > > > > iterator only.
>> > > > >
>> > > > > While I think it's preferable to record the timestamp, rather than
>> > the
>> > > age,
>> > > > > this does have the benefit of a more obvious metric name.
>> > > > >
>> > > > > > Nit: the KIP says it's a store-level metric, but I think it
>> would
>> > be
>> > > > > > good to say explicitly that it's recorded with DEBUG level only?
>> > > > >
>> > > > > Yes, I've already updated the KIP with this information in the
>> table.
>> > > > >
>> > > > > Regards,
>> > > > >
>> > > > > Nick
>> > > > >
>> > > > > On Sun, 31 Mar 2024 at 10:53, Matthias J. Sax 
>> > > wrote:
>> > > > >
>> > > > > > The time window thing was just an idea. Happ

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2024-05-16 Thread Nick Telford
Good point! I've updated it to "Improved StateStore Iterator metrics for
detecting leaks" - let me know if you have a better suggestion.

This shouldn't affect the voting imo, as nothing of substance has changed.

Regards,
Nick

On Thu, 16 May 2024 at 01:39, Sophie Blee-Goldman 
wrote:

> One quick thing -- can you update the title of this KIP to reflect the
> decision to implement these metrics for all state store implementations
> rather than just RocksDB?
>
>
> On Tue, May 14, 2024 at 1:36 PM Nick Telford 
> wrote:
>
> > Woops! Thanks for the catch Lucas. Given this was just a typo, I don't
> > think this affects the voting.
> >
> > Cheers,
> > Nick
> >
> > On Tue, 14 May 2024 at 18:06, Lucas Brutschy  > .invalid>
> > wrote:
> >
> > > Hi Nick,
> > >
> > > you are still referring to oldest-open-iterator-age-ms in the
> > > `Proposed Changes` section.
> > >
> > > Cheers,
> > > Lucas
> > >
> > > On Thu, May 2, 2024 at 4:00 PM Lucas Brutschy 
> > > wrote:
> > > >
> > > > Hi Nick!
> > > >
> > > > I agree, the age variant is a bit nicer since the semantics are very
> > > > clear from the name. If you'd rather go for the simple
> implementation,
> > > > how about calling it `oldest-iterator-open-since-ms`? I believe this
> > > > could be understood without docs. Either way, I think we should be
> > > > able to open the vote for this KIP because nobody raised any major /
> > > > blocking concerns.
> > > >
> > > > Looking forward to getting this voted on soon!
> > > >
> > > > Cheers
> > > > Lucas
> > > >
> > > > On Sun, Mar 31, 2024 at 5:23 PM Nick Telford  >
> > > wrote:
> > > > >
> > > > > Hi Matthias,
> > > > >
> > > > > > For the oldest iterator metric, I would propose something simple
> > like
> > > > > > `iterator-opened-ms` and it would just be the actual timestamp
> when
> > > the
> > > > > > iterator was opened. I don't think we need to compute the actual
> > age,
> > > > > > but users can do this computation themselves?
> > > > >
> > > > > That works for me; it's easier to implement like that :-D I'm a
> > little
> > > > > concerned that the name "iterator-opened-ms" may not be obvious
> > enough
> > > > > without reading the docs.
> > > > >
> > > > > > If we think reporting the age instead of just the timestamp is
> > > better, I
> > > > > > would propose `iterator-max-age-ms`. It should be sufficient to
> call
> > > out
> > > > > > (as it's kinda "obvious" anyway) that the metric applies to open
> > > > > > iterator only.
> > > > >
> > > > > While I think it's preferable to record the timestamp, rather than
> > the
> > > age,
> > > > > this does have the benefit of a more obvious metric name.
> > > > >
> > > > > > Nit: the KIP says it's a store-level metric, but I think it would
> > be
> > > > > > good to say explicitly that it's recorded with DEBUG level only?
> > > > >
> > > > > Yes, I've already updated the KIP with this information in the
> table.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Nick
> > > > >
> > > > > On Sun, 31 Mar 2024 at 10:53, Matthias J. Sax 
> > > wrote:
> > > > >
> > > > > > The time window thing was just an idea. Happy to drop it.
> > > > > >
> > > > > > For the oldest iterator metric, I would propose something simple
> > like
> > > > > > `iterator-opened-ms` and it would just be the actual timestamp
> when
> > > the
> > > > > > iterator was opened. I don't think we need to compute the actual
> > age,
> > > > > > but users can do this computation themselves?
> > > > > >
> > > > > > If we think reporting the age instead of just the timestamp is
> > > better, I
> > > > > > would propose `iterator-max-age-ms`. It should be sufficient to
> call
> > > out
> > > > > > (as it's kinda "obvious" anyway) that the metric applies to open
> > > > > > iterator o

Re: [DISCUSS] Apache Kafka 3.8.0 release

2024-05-15 Thread Nick Telford
Hi Josep,

Would it be possible to sneak KIP-989 into 3.8? Just as with 1028, it's
currently being voted on and has already received the requisite votes. The
only thing holding it back is the 72 hour voting window.

Vote thread here:
https://lists.apache.org/thread/nhr65h4784z49jbsyt5nx8ys81q90k6s

Regards,

Nick

On Wed, 15 May 2024 at 17:47, Josep Prat 
wrote:

> And my maths are wrong! I added 24 hours more to all the numbers in there.
> If after 72 hours no vetoes appear, I have no objections to adding this
> specific KIP, as it shouldn't have a big blast radius.
>
> Best,
>
> On Wed, May 15, 2024 at 6:44 PM Josep Prat  wrote:
>
> > Ah, I see Chris was faster writing this than me.
> >
> > On Wed, May 15, 2024 at 6:43 PM Josep Prat  wrote:
> >
> >> Hi all,
> >> You still have the full day of today (independently of the timezone) to
> >> get KIPs approved. Tomorrow morning (CEST timezone) I'll send another
> email
> >> asking developers to assign future approved KIPs to another version
> that is
> >> not 3.8.
> >>
> >> So, the only problem I see with KIP-1028 is that it hasn't been open for
> >> a vote for 72 hours (48 hours as of now). If there is no negative
> voting on
> >> the KIP I think we can let that one in, given it would only miss the
> >> deadline by less than 12 hours (if my timezone maths add up).
> >>
> >> Best,
> >>
> >> On Wed, May 15, 2024 at 6:35 PM Ismael Juma  wrote:
> >>
> >>> The KIP freeze is just about having the KIP accepted. Not sure why we
> >>> would
> >>> need an exception for that.
> >>>
> >>> Ismael
> >>>
> >>> On Wed, May 15, 2024 at 9:20 AM Chris Egerton  >
> >>> wrote:
> >>>
> >>> > FWIW I think that the low blast radius for KIP-1028 should allow it
> to
> >>> > proceed without adhering to the usual KIP and feature freeze dates.
> >>> Code
> >>> > freeze is probably still worth respecting, at least if changes are
> >>> > required to the docker/jvm/Dockerfile. But I defer to Josep's
> >>> judgement as
> >>> > the release manager.
> >>> >
> >>> > On Wed, May 15, 2024, 06:59 Vedarth Sharma  >
> >>> > wrote:
> >>> >
> >>> > > Hey Josep!
> >>> > >
> >>> > > The KIP 1028 has received the required votes. Voting thread:-
> >>> > > https://lists.apache.org/thread/cdq4wfv5v1gpqlxnf46ycwtcwk5wos4q
> >>> > > But we are keeping the vote open for 72 hours as per the process.
> >>> > >
> >>> > > I would like to request you to please consider it for the 3.8.0
> >>> release.
> >>> > >
> >>> > > Thanks and regards,
> >>> > > Vedarth
> >>> > >
> >>> > >
> >>> > > On Wed, May 15, 2024 at 1:14 PM Josep Prat
> >>> 
> >>> > > wrote:
> >>> > >
> >>> > > > Hi Kafka developers!
> >>> > > >
> >>> > > > Today is the KIP freeze deadline. All KIPs should be accepted by
> >>> EOD
> >>> > > today.
> >>> > > > Tomorrow morning (CEST timezone) I'll start summarizing all KIPs
> >>> that
> >>> > > have
> >>> > > > been approved. Please note: any KIP approved after tomorrow should be
> >>> adopted
> >>> > > in
> >>> > > > a future release version, not 3.8.
> >>> > > >
> >>> > > > Other relevant upcoming deadlines:
> >>> > > > - Feature freeze is on May 29th
> >>> > > > - Code freeze is June 12th
> >>> > > >
> >>> > > > Best,
> >>> > > >
> >>> > > > On Fri, May 3, 2024 at 3:59 PM Josep Prat 
> >>> wrote:
> >>> > > >
> >>> > > > > Hi Kafka developers!
> >>> > > > > I just wanted to remind you all of the upcoming relevant dates
> >>> for
> >>> > > Kafka
> >>> > > > > 3.8.0:
> >>> > > > > - KIP freeze is on May 15th (this is in a little less than 2
> >>> weeks)
> >>> > > > > - Feature freeze is on May 29th (this is in a little more than
> 25
> >>> > > days).
> >>> > > > >
> >>> > > > > If there is a KIP you really want to have in the 3.8 series,
> now
> >>> is
> >>> > the
> >>> > > > > time to make the last push. Once the deadline for KIP freeze is
> >>> over
> >>> > > I'll
> >>> > > > > update the release plan with the final list of KIPs accepted
> and
> >>> that
> >>> > > may
> >>> > > > > make it to the release.
> >>> > > > >
> >>> > > > > Best!
> >>> > > > >
> >>> > > > > On Wed, Mar 6, 2024 at 10:40 AM Josep Prat <
> josep.p...@aiven.io>
> >>> > > wrote:
> >>> > > > >
> >>> > > > >> Hi all,
> >>> > > > >>
> >>> > > > >> Thanks for your support. I updated the skeleton release plan
> >>> created
> >>> > > by
> >>> > > > >> Colin. You can find it here:
> >>> > > > >>
> >>> > https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.8.0
> >>> > > > >>
> >>> > > > >> Our last release ran into some problems while releasing
> >>> and was
> >>> > > > >> delayed by several weeks, so I won't try to shave some weeks
> >>> from
> >>> > our
> >>> > > > plan
> >>> > > > >> for 3.8.0 (we might end up having delays again). Please raise
> >>> your
> >>> > > > concerns
> >>> > > > >> if you don't agree with the proposed dates.
> >>> > > > >>
> >>> > > > >> The current proposal on dates are:
> >>> > > > >>
> >>> > > > >>- KIP Freeze: *15th May *(Wednesday)
> >>> > > > >>   - A KIP must be 

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2024-05-14 Thread Nick Telford
Woops! Thanks for the catch Lucas. Given this was just a typo, I don't
think this affects the voting.

Cheers,
Nick

On Tue, 14 May 2024 at 18:06, Lucas Brutschy 
wrote:

> Hi Nick,
>
> you are still referring to oldest-open-iterator-age-ms in the
> `Proposed Changes` section.
>
> Cheers,
> Lucas
>
> On Thu, May 2, 2024 at 4:00 PM Lucas Brutschy 
> wrote:
> >
> > Hi Nick!
> >
> > I agree, the age variant is a bit nicer since the semantics are very
> > clear from the name. If you'd rather go for the simple implementation,
> > how about calling it `oldest-iterator-open-since-ms`? I believe this
> > could be understood without docs. Either way, I think we should be
> > able to open the vote for this KIP because nobody raised any major /
> > blocking concerns.
> >
> > Looking forward to getting this voted on soon!
> >
> > Cheers
> > Lucas
> >
> > On Sun, Mar 31, 2024 at 5:23 PM Nick Telford 
> wrote:
> > >
> > > Hi Matthias,
> > >
> > > > For the oldest iterator metric, I would propose something simple like
> > > > `iterator-opened-ms` and it would just be the actual timestamp when
> the
> > > > iterator was opened. I don't think we need to compute the actual age,
> > > > but users can do this computation themselves?
> > >
> > > That works for me; it's easier to implement like that :-D I'm a little
> > > concerned that the name "iterator-opened-ms" may not be obvious enough
> > > without reading the docs.
> > >
> > > > If we think reporting the age instead of just the timestamp is
> better, I
> > > > would propose `iterator-max-age-ms`. It should be sufficient to call
> out
> > > > (as it's kinda "obvious" anyway) that the metric applies to open
> > > > iterator only.
> > >
> > > While I think it's preferable to record the timestamp, rather than the
> age,
> > > this does have the benefit of a more obvious metric name.
> > >
> > > > Nit: the KIP says it's a store-level metric, but I think it would be
> > > > good to say explicitly that it's recorded with DEBUG level only?
> > >
> > > Yes, I've already updated the KIP with this information in the table.
> > >
> > > Regards,
> > >
> > > Nick
> > >
> > > On Sun, 31 Mar 2024 at 10:53, Matthias J. Sax 
> wrote:
> > >
> > > > The time window thing was just an idea. Happy to drop it.
> > > >
> > > > For the oldest iterator metric, I would propose something simple like
> > > > `iterator-opened-ms` and it would just be the actual timestamp when
> the
> > > > iterator was opened. I don't think we need to compute the actual age,
> > > > but users can do this computation themselves?
> > > >
> > > > If we think reporting the age instead of just the timestamp is
> better, I
> > > > would propose `iterator-max-age-ms`. It should be sufficient to call
> out
> > > > (as it's kinda "obvious" anyway) that the metric applies to open
> > > > iterator only.
> > > >
> > > > And yes, I was hoping that the code inside MetereXxxStore might
> already
> > > > be setup in a way that custom stores would inherit the iterator
> metrics
> > > > automatically -- I am just not sure, and left it as an exercise for
> > > > somebody to confirm :)
> > > >
> > > >
> > > > Nit: the KIP says it's a store-level metric, but I think it would be
> > > > good to say explicitly that it's recorded with DEBUG level only?
> > > >
> > > >
> > > >
> > > > -Matthias
> > > >
> > > >
> > > > On 3/28/24 2:52 PM, Nick Telford wrote:
> > > > > Quick addendum:
> > > > >
> > > > > My suggested metric "oldest-open-iterator-age-seconds" should be
> > > > > "oldest-open-iterator-age-ms". Milliseconds is obviously a better
> > > > > granularity for such a metric.
> > > > >
> > > > > Still accepting suggestions for a better name.
> > > > >
> > > > > On Thu, 28 Mar 2024 at 13:41, Nick Telford  >
> > > > wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> Sorry for leaving this for so long. So much for "3 weeks until KIP
> > > > freeze"!
> > > 

Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-05-14 Thread Nick Telford
have such a wrapper. It is called
> >>>>>> AbstractReadWriteDecorator.
> >>>>>>
> >>>>>>
> >>>>>> 101
> >>>>>> Currently, the position is checkpointed when a offset checkpoint
> >>>>>> is written. If we let the state store manage the committed
> >>>>>> offsets, we need to also let the state store also manage the
> >>>>>> position otherwise they might diverge. State store managed offsets
> >>>>>> can get flushed (i.e. checkpointed) to the disk when the state
> >>>>>> store decides to flush its in-memory data structures, but the
> >>>>>> position is only checkpointed at commit time. Recovering after a
> >>>>>> failure might load inconsistent offsets and positions.
> >>>>>>
> >>>>>>
> >>>>>> 102
> >>>>>> The position is maintained inside the state store, but is
> >>>>>> persisted in the .position file when the state store closes. The
> >>>>>> only public interface that uses the position is IQv2 in a
> >>>>>> read-only mode. So the position is only updated within the state
> >>>>>> store and read from IQv2. No need to add anything to the public
> >>>>>> StateStore interface.
> >>>>>>
> >>>>>>
> >>>>>> 103
> >>>>>> Deprecating managesOffsets() right away might be a good idea.
> >>>>>>
> >>>>>>
> >>>>>> 104
> >>>>>> I agree that we should try to support downgrades without wipes. At
> >>>>>> least Nick should state in the KIP why we do not support it.
> >>>>>>
> >>>>>>
> >>>>>> Best,
> >>>>>> Bruno
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 4/23/24 8:13 AM, Matthias J. Sax wrote:
> >>>>>>> Thanks for splitting out this KIP. The discussion shows, that it
> >>>>>>> is a complex beast by itself, so worth to discuss by its own.
> >>>>>>>
> >>>>>>>
> >>>>>>> Couple of question / comment:
> >>>>>>>
> >>>>>>>
> >>>>>>> 100 `StateStore#commit()`: The JavaDoc says "must not be called
> >>>>>>> by users" -- I would propose to put a guard in place for this, by
> >>>>>>> either throwing an exception (preferable) or adding a no-op
> >>>>>>> implementation (at least for our own stores, by wrapping them --
> >>>>>>> we cannot enforce it for custom stores I assume), and document
> >>>>>>> this contract explicitly.
> >>>>>>>
> >>>>>>>
> >>>>>>> 101 adding `.position` to the store: Why do we actually need
> >>>>>>> this? The KIP says "To ensure consistency with the committed data
> >>>>>>> and changelog offsets" but I am not sure if I can follow? Can you
> >>>>>>> elaborate why leaving the `.position` file as-is won't work?
> >>>>>>>
> >>>>>>>> If it's possible at all, it will need to be done by
> >>>>>>>> creating temporary StateManagers and StateStores during
> >>>>>>>> rebalance. I think
> >>>>>>>> it is possible, and probably not too expensive, but the devil
> >>>>>>>> will be in
> >>>>>>>> the detail.
> >>>>>>>
> >>>>>>> This sounds like a significant overhead to me. We know that
> >>>>>>> opening a single RocksDB takes about 500ms, and thus opening
> >>>>>>> RocksDB to get this information might slow down rebalances
> >>>>>>> significantly.
> >>>>>>>
> >>>>>>>
> >>>>>>> 102: It's unclear to me, how `.position` information is added.
> >>>>>>> The KIP only says: "position offsets will be stored in RocksDB,
> >>>>>>> in the same column family as the changelog offsets". Do you
> >>>>>>> int

[VOTE] KIP-989: RocksDB Iterator Metrics

2024-05-14 Thread Nick Telford
Hi everyone,

I'd like to call a vote on the Kafka Streams KIP-989: RocksDB Iterator
Metrics:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics

All of the points in the discussion thread have now been addressed.

Regards,

Nick


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2024-04-17 Thread Nick Telford
Hi Walker,

Feel free to ask away, either on the mailing list or the Confluent
Community Slack, where I hang out :-)

The implementation is *mostly* complete, although it needs some polishing.
It's worth noting that KIP-1035 is a hard prerequisite for this.

Regards,
Nick


Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-04-16 Thread Nick Telford
That does make sense. The one thing I can't figure out is how per-Task
StateStore instances are constructed.

It looks like we construct one StateStore instance for the whole Topology
(in InternalTopologyBuilder), and pass that into ProcessorStateManager (via
StateManagerUtil) for each Task, which then initializes it.

This can't be the case though, otherwise multiple partitions of the same
sub-topology (aka Tasks) would share the same StateStore instance, which
they don't.

What am I missing?
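(For context, the sharing that avoids this problem happens at the supplier level: the topology holds one StoreSupplier/StoreBuilder, and a fresh StateStore instance is produced for each Task. A toy sketch of that pattern, with hypothetical names rather than the actual Streams code:)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class PerTaskStores {
    // Stand-in for StateStore: instance identity is what matters here.
    static class StateStore {
        final String name;
        StateStore(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        // One supplier registered with the topology...
        Supplier<StateStore> supplier = () -> new StateStore("my-store");

        // ...but get() is invoked once per task, so each task (partition of
        // the sub-topology) owns a distinct instance and distinct state.
        Map<String, StateStore> storeByTask = new HashMap<>();
        for (String taskId : new String[] {"0_0", "0_1", "0_2"}) {
            storeByTask.put(taskId, supplier.get());
        }
        System.out.println(storeByTask.get("0_0") != storeByTask.get("0_1")); // true
    }
}
```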

On Tue, 16 Apr 2024 at 16:22, Sophie Blee-Goldman 
wrote:

> I don't think we need to *require* a constructor accept the TaskId, but we
> would definitely make sure that the RocksDB state store changes its
> constructor to one that accepts the TaskID (which we can do without
> deprecation since its an internal API), and custom state stores can just
> decide for themselves whether they want to opt-in/use the TaskId param
> or not. I mean custom state stores would have to opt-in anyways by
> implementing the new StoreSupplier#get(TaskId) API and the only
> reason to do that would be to have created a constructor that accepts
> a TaskId
>
> Just to be super clear about the proposal, this is what I had in mind.
> It's actually fairly simple and wouldn't add much to the scope of the
> KIP (I think -- if it turns out to be more complicated than I'm assuming,
> we should definitely do whatever has the smallest LOE to get this done
>
> Anyways, the (only) public API changes would be to add this new
> method to the StoreSupplier API:
>
> default T get(final TaskId taskId) {
> return get();
> }
>
> We can decide whether or not to deprecate the old #get but it's not
> really necessary and might cause a lot of turmoil, so I'd personally
> say we just leave both APIs in place.
>
> And that's it for public API changes! Internally, we would just adapt
> each of the rocksdb StoreSupplier classes to implement this new
> API. So for example with the RocksDBKeyValueBytesStoreSupplier,
> we just add
>
> @Override
> public KeyValueStore<Bytes, byte[]> get(final TaskId taskId) {
> return returnTimestampedStore ?
> new RocksDBTimestampedStore(name, metricsScope(), taskId) :
> new RocksDBStore(name, metricsScope(), taskId);
> }
>
> And of course add the TaskId parameter to each of the actual
> state store constructors returned here.
>
> Does that make sense? It's entirely possible I'm missing something
> important here, but I think this would be a pretty small addition that
> would solve the problem you mentioned earlier while also being
> useful to anyone who uses custom state stores.
>
> On Mon, Apr 15, 2024 at 10:21 AM Nick Telford 
> wrote:
>
> > Hi Sophie,
> >
> > Interesting idea! Although what would that mean for the StateStore
> > interface? Obviously we can't require that the constructor take the
> TaskId.
> > Is it enough to add the parameter to the StoreSupplier?
> >
> > Would doing this be in-scope for this KIP, or are we over-complicating
> it?
> >
> > Nick
> >
> > On Fri, 12 Apr 2024 at 21:30, Sophie Blee-Goldman  >
> > wrote:
> >
> > > Somewhat minor point overall, but it actually drives me crazy that you
> > > can't get access to the taskId of a StateStore until #init is called.
> > This
> > > has caused me a huge headache personally (since the same is true for
> > > processors and I was trying to do something that's probably too hacky
> to
> > > actually complain about here lol)
> > >
> > > Can we just change the StateStoreSupplier to receive and pass along the
> > > taskId when creating a new store? Presumably by adding a new version of
> > the
> > > #get method that takes in a taskId parameter? We can have it default to
> > > invoking the old one for compatibility reasons and it should be
> > completely
> > > safe to tack on.
> > >
> > > Would also prefer the same for a ProcessorSupplier, but that's
> definitely
> > > outside the scope of this KIP
> > >
> > > On Fri, Apr 12, 2024 at 3:31 AM Nick Telford 
> > > wrote:
> > >
> > > > On further thought, it's clear that this can't work for one simple
> > > reason:
> > > > StateStores don't know their associated TaskId (and hence, their
> > > > StateDirectory) until the init() call. Therefore, committedOffset()
> > can't
> > > > be called before init(), unless we also added a StateStoreContext
> > > argument
> > > > to committedOffset(), which I think might be trying to shoehorn too
> > much
> > > > into committedOffset().
> > > >

Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-04-15 Thread Nick Telford
Hi Sophie,

Interesting idea! Although what would that mean for the StateStore
interface? Obviously we can't require that the constructor take the TaskId.
Is it enough to add the parameter to the StoreSupplier?

Would doing this be in-scope for this KIP, or are we over-complicating it?
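For illustration only, here is a sketch of the backwards-compatible shape Sophie describes, with a new overload on the supplier defaulting to the existing method. The `StateStore`, `TaskId`, and `StoreSupplier` types below are simplified stand-ins, not the real Kafka Streams interfaces:

```java
// Simplified stand-ins for the real Kafka Streams types, purely to
// illustrate the shape of the suggestion; the actual interfaces live
// in org.apache.kafka.streams and are considerably larger.
interface StateStore { }

class TaskId {
    final int subtopology;
    final int partition;

    TaskId(int subtopology, int partition) {
        this.subtopology = subtopology;
        this.partition = partition;
    }
}

interface StoreSupplier<T extends StateStore> {
    T get();

    // New overload: defaults to delegating to the old method, so
    // existing supplier implementations keep working unchanged.
    default T get(TaskId taskId) {
        return get();
    }
}
```

A supplier that needs the TaskId would override only the two-argument variant; all existing suppliers fall through to the old `get()`.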

Nick

On Fri, 12 Apr 2024 at 21:30, Sophie Blee-Goldman 
wrote:

> Somewhat minor point overall, but it actually drives me crazy that you
> can't get access to the taskId of a StateStore until #init is called. This
> has caused me a huge headache personally (since the same is true for
> processors and I was trying to do something that's probably too hacky to
> actually complain about here lol)
>
> Can we just change the StateStoreSupplier to receive and pass along the
> taskId when creating a new store? Presumably by adding a new version of the
> #get method that takes in a taskId parameter? We can have it default to
> invoking the old one for compatibility reasons and it should be completely
> safe to tack on.
>
> Would also prefer the same for a ProcessorSupplier, but that's definitely
> outside the scope of this KIP
>
> On Fri, Apr 12, 2024 at 3:31 AM Nick Telford 
> wrote:
>
> > On further thought, it's clear that this can't work for one simple
> reason:
> > StateStores don't know their associated TaskId (and hence, their
> > StateDirectory) until the init() call. Therefore, committedOffset() can't
> > be called before init(), unless we also added a StateStoreContext
> argument
> > to committedOffset(), which I think might be trying to shoehorn too much
> > into committedOffset().
> >
> > I still don't like the idea of the Streams engine maintaining the cache
> of
> > changelog offsets independently of stores, mostly because of the
> > maintenance burden of the code duplication, but it looks like we'll have
> to
> > live with it.
> >
> > Unless you have any better ideas?
> >
> > Regards,
> > Nick
> >
> > On Wed, 10 Apr 2024 at 14:12, Nick Telford 
> wrote:
> >
> > > Hi Bruno,
> > >
> > > Immediately after I sent my response, I looked at the codebase and came
> > to
> > > the same conclusion. If it's possible at all, it will need to be done
> by
> > > creating temporary StateManagers and StateStores during rebalance. I
> > think
> > > it is possible, and probably not too expensive, but the devil will be
> in
> > > the detail.
> > >
> > > I'll try to find some time to explore the idea to see if it's possible
> > and
> > > report back, because we'll need to determine this before we can vote on
> > the
> > > KIP.
> > >
> > > Regards,
> > > Nick
> > >
> > > On Wed, 10 Apr 2024 at 11:36, Bruno Cadonna 
> wrote:
> > >
> > >> Hi Nick,
> > >>
> > >> Thanks for reacting on my comments so quickly!
> > >>
> > >>
> > >> 2.
> > >> Some thoughts on your proposal.
> > >> State managers (and state stores) are parts of tasks. If the task is
> not
> > >> assigned locally, we do not create those tasks. To get the offsets
> with
> > >> your approach, we would need to either create kind of inactive tasks
> > >> besides active and standby tasks or store and manage state managers of
> > >> non-assigned tasks differently than the state managers of assigned
> > >> tasks. Additionally, the cleanup thread that removes unassigned task
> > >> directories needs to concurrently delete those inactive tasks or
> > >> task-less state managers of unassigned tasks. This seems all quite
> messy
> > >> to me.
> > >> Could we create those state managers (or state stores) for locally
> > >> existing but unassigned tasks on demand when
> > >> TaskManager#getTaskOffsetSums() is executed? Or have a different
> > >> encapsulation for the unused task directories?
> > >>
> > >>
> > >> Best,
> > >> Bruno
> > >>
> > >>
> > >>
> > >> On 4/10/24 11:31 AM, Nick Telford wrote:
> > >> > Hi Bruno,
> > >> >
> > >> > Thanks for the review!
> > >> >
> > >> > 1, 4, 5.
> > >> > Done
> > >> >
> > >> > 3.
> > >> > You're right. I've removed the offending paragraph. I had originally
> > >> > adapted this from the guarantees outlined in KIP-892. But it's
> > >> difficult to
> > >> > provide these guarantees without the KIP-892 transaction buffers.

Re: [DISCUSS] KIP-1034: Dead letter queue in Kafka Streams

2024-04-12 Thread Nick Telford
Hi Damien and Sebastien,

1.
I think you can just add a `String topic` argument to the existing
`withDeadLetterQueueRecord(ProducerRecord
deadLetterQueueRecord)` method, and then the implementation of the
exception handler could choose the topic to send records to using whatever
logic the user desires. You could perhaps provide a built-in implementation
that leverages your new config to send all records to an untyped DLQ topic?
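As a purely illustrative sketch of the kind of per-record routing logic a custom handler could apply when choosing its topic (all names here are hypothetical, not the KIP-1034 API):

```java
import java.util.Map;

// Hypothetical routing logic a custom exception handler could apply
// when building its DLQ record: pick the DLQ topic based on the failed
// record's source topic, falling back to one application-wide DLQ.
class DlqRouter {
    private final Map<String, String> dlqBySourceTopic;
    private final String defaultDlq;

    DlqRouter(Map<String, String> dlqBySourceTopic, String defaultDlq) {
        this.dlqBySourceTopic = dlqBySourceTopic;
        this.defaultDlq = defaultDlq;
    }

    // Topic the handler would pass to withDeadLetterQueueRecord(...).
    String dlqTopicFor(String sourceTopic) {
        return dlqBySourceTopic.getOrDefault(sourceTopic, defaultDlq);
    }
}
```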

1a.
BTW you have a typo: in your DeserializationExceptionHandler, the type of
your `deadLetterQueueRecord` argument is `ProducerRecord`, when it should
probably be `ConsumerRecord`.

2.
Agreed. I think it's a good idea to provide an implementation that sends to
a single DLQ by default, but it's important to enable users to customize
this with their own exception handlers.

2a.
I'm not convinced that "errors" (e.g. failed punctuate) should be sent to a
DLQ topic like it's a bad record. To me, a DLQ should only contain records
that failed to process. I'm not even sure how a user would
re-process/action one of these other errors; it seems like the purview of
error logging to me?

4.
My point here was that I think it would be useful for the KIP to contain an
explanation of the behavior both with KIP-1033 and without it. i.e. clarify
if/how records that throw an exception in a processor are handled. At the
moment, I'm assuming that without KIP-1033, processing exceptions would not
cause records to be sent to the DLQ, but with KIP-1033, they would. If this
assumption is correct, I think it should be made explicit in the KIP.

5.
Understood. You may want to make this explicit in the documentation for
users, so they understand the consequences of re-processing data sent to
their DLQ. The main reason I raised this point is it's something that's
tripped me up in numerous KIPs and that committers frequently remind me
of; so I wanted to get ahead of it for once! :D

And one new point:
6.
The DLQ record schema appears to discard all custom headers set on the
source record. Is there a way these can be included? In particular, I'm
concerned with "schema pointer" headers (like those set by Schema
Registry), that may need to be propagated, especially if the records are
fed back into the source topics for re-processing by the user.
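To make point 6 concrete, here is a sketch of the header propagation being asked for. Plain `byte[]`-valued maps stand in for Kafka's `Headers` type, and the `__dlq.exception.class` header name is invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of point 6: copy the source record's custom headers onto the
// DLQ record so "schema pointer" headers survive re-processing. Plain
// byte[]-valued maps stand in for Kafka's Headers type, and the
// "__dlq.exception.class" header name is invented for illustration.
class DlqHeaders {
    static Map<String, byte[]> forDlqRecord(Map<String, byte[]> sourceHeaders,
                                            String exceptionClass) {
        // Propagate every source header first...
        Map<String, byte[]> headers = new LinkedHashMap<>(sourceHeaders);
        // ...then append DLQ metadata without clobbering them.
        headers.put("__dlq.exception.class", exceptionClass.getBytes());
        return headers;
    }
}
```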

Regards,
Nick


On Fri, 12 Apr 2024 at 13:20, Damien Gasparina 
wrote:

> Hi Nick,
>
> Thanks a lot for your review and your useful comments!
>
> 1. It is a good point, as you mentioned, I think it would make sense
> in some use cases to have potentially multiple DLQ topics, so we
> should provide an API to let users do it.
> Thinking out loud here, maybe a better approach is to create a new
> Record class containing the topic name, e.g. DeadLetterQueueRecord, and
> change the signature to
> withDeadLetterQueueRecords(Iterable
> deadLetterQueueRecords) instead of
> withDeadLetterQueueRecord(ProducerRecord
> deadLetterQueueRecord). What do you think? DeadLetterQueueRecord would
> be something like "class DeadLetterQueueRecord extends
> org.apache.kafka.streams.processor.api.ProducerRecord { String
> topic; /* + getter/setter */ }"
>
> 2. I think the root question here is: should we have one DLQ topic or
> multiple DLQ topics by default. This question highly depends on the
> context, but implementing a default implementation to handle multiple
> DLQ topics would be opinionated, e.g. how to manage errors in a
> punctuate?
> I think it makes sense to have the default implementation writing all
> faulty records to a single DLQ, that's at least the approach I used in
> past applications: one DLQ per Kafka Streams application. Of course
> the message format could change in the DLQ e.g. due to the source
> topic, but those DLQ records will very likely be troubleshot, and
> maybe replayed, manually anyway.
> If a user needs to have multiple DLQ topics or want to enforce a
> specific schema, it's still possible, but they would need to implement
> custom Exception Handlers.
> Coming back to 1. I do agree that it would make sense to have the user
> set the DLQ topic name in the handlers for more flexibility.
>
> 3. Good point, sorry it was a typo, the ProcessingContext makes much
> more sense here indeed.
>
> 4. I do assume that we could implement KIP-1033 (Processing exception
> handler) independently from KIP-1034. I do hope that KIP-1033 would be
> adopted and implemented before KIP-1034, but if that's not the case,
> we could implement KIP-1034 independently and update KIP-1033 to include
> the DLQ record afterward (in the same KIP or in a new one if not
> possible).
>
> 5. I think we should be clear that this KIP only covers the DLQ record
> produced.
> Everything related to replay messages or recovery plan should be
> considered out-of-scope as it is use-case and error specific.
>
>

Re: [DISCUSS] KIP-1034: Dead letter queue in Kafka Streams

2024-04-12 Thread Nick Telford
Oh, and one more thing:

5.
Whenever you take a record out of the stream, and then potentially
re-introduce it at a later date, you introduce the potential for record
ordering issues. For example, that record could have been destined for a
Window that has been closed by the time it's re-processed. I'd like to see
a section that considers these consequences, and perhaps make those risks
clear to users. For the record, this is exactly what sunk KIP-990, which
was an alternative approach to error handling that introduced the same
issues.

Cheers,

Nick

On Fri, 12 Apr 2024 at 11:54, Nick Telford  wrote:

> Hi Damien,
>
> Thanks for the KIP! Dead-letter queues are something that I think a lot of
> users would like.
>
> I think there are a few points with this KIP that concern me:
>
> 1.
> It looks like you can only define a single, global DLQ for the entire
> Kafka Streams application? What about applications that would like to
> define different DLQs for different data flows? This is especially
> important when dealing with multiple source topics that have different
> record schemas.
>
> 2.
> Your DLQ payload value can either be the record value that failed, or an
> error string (such as "error during punctuate"). This is likely to cause
> problems when users try to process the records from the DLQ, as they can't
> guarantee the format of every record value will be the same. This is very
> loosely related to point 1. above.
>
> 3.
> You provide a ProcessorContext to both exception handlers, but state they
> cannot be used to forward records. In that case, I believe you should use
> ProcessingContext instead, which statically guarantees that it can't be
> used to forward records.
>
> 4.
> You mention the KIP-1033 ProcessingExceptionHandler, but what's the plan
> if KIP-1033 is not adopted, or if KIP-1034 lands before 1033?
>
> Regards,
>
> Nick
>
> On Fri, 12 Apr 2024 at 11:38, Damien Gasparina 
> wrote:
>
>> Generally speaking, if the user does not configure the right ACL, that
>> would be a security issue, but that's true for any topic.
>>
>> This KIP allows users to configure a Dead Letter Queue without writing
>> custom Java code in Kafka Streams, not at the topic level.
>> A lot of applications are already implementing this pattern, but the
>> required code to do it is quite painful and error prone, for example
>> most apps I have seen created a new KafkaProducer to send records to
>> their DLQ.
>>
>> As it would be disabled by default for backward compatibility, I doubt
>> it would generate any security concern.
>> If a user explicitly configures a Dead Letter Queue, it would be up to
>> them to configure the relevant ACLs to ensure that the right principal
>> can access it.
>> It is already the case for all internal, input and output Kafka
>> Streams topics (e.g. repartition, changelog topics) that also could
>> contain confidential data, so I do not think we should implement a
>> different behavior for this one.
>>
>> In this KIP, we configured the default DLQ record to have the initial
>> record key/value as we assume that it is the expected and wanted
>> behavior for most applications.
>> If a user does not want to have the key/value in the DLQ record for
>> any reason, they could still implement exception handlers to build
>> their own DLQ record.
>>
>> Regarding ACL, maybe something smarter could be done in Kafka Streams,
>> but this is out of scope for this KIP.
>>
>> On Fri, 12 Apr 2024 at 11:58, Claude Warren  wrote:
>> >
>> > My concern is that someone would create a dead letter queue on a
>> sensitive
>> > topic and not get the ACL correct from the start, thus causing a
>> > potential confidential data leak.  Is there anything in the proposal that would
>> > prevent that from happening?  If so I did not recognize it as such.
>> >
>> > On Fri, Apr 12, 2024 at 9:45 AM Damien Gasparina
>> > wrote:
>> >
>> > > Hi Claude,
>> > >
>> > > In this KIP, the Dead Letter Queue is materialized as a standard and
>> > > independent topic, thus normal ACLs apply to it like any other topic.
>> > > This should not introduce any security issues, obviously, the right
>> > > ACL would need to be provided to write to the DLQ if configured.
>> > >
>> > > Cheers,
>> > > Damien
>> > >
>> > > On Fri, 12 Apr 2024 at 08:59, Claude Warren, Jr
>> > >  wrote:
>> > > >
>> > > > I am new to the Kafka codebase so please excuse any ignorance on my part.

Re: [DISCUSS] KIP-1034: Dead letter queue in Kafka Streams

2024-04-12 Thread Nick Telford
Hi Damien,

Thanks for the KIP! Dead-letter queues are something that I think a lot of
users would like.

I think there are a few points with this KIP that concern me:

1.
It looks like you can only define a single, global DLQ for the entire Kafka
Streams application? What about applications that would like to define
different DLQs for different data flows? This is especially important when
dealing with multiple source topics that have different record schemas.

2.
Your DLQ payload value can either be the record value that failed, or an
error string (such as "error during punctuate"). This is likely to cause
problems when users try to process the records from the DLQ, as they can't
guarantee the format of every record value will be the same. This is very
loosely related to point 1. above.

3.
You provide a ProcessorContext to both exception handlers, but state they
cannot be used to forward records. In that case, I believe you should use
ProcessingContext instead, which statically guarantees that it can't be
used to forward records.

4.
You mention the KIP-1033 ProcessingExceptionHandler, but what's the plan if
KIP-1033 is not adopted, or if KIP-1034 lands before 1033?

Regards,

Nick

On Fri, 12 Apr 2024 at 11:38, Damien Gasparina 
wrote:

> Generally speaking, if the user does not configure the right ACL, that
> would be a security issue, but that's true for any topic.
>
> This KIP allows users to configure a Dead Letter Queue without writing
> custom Java code in Kafka Streams, not at the topic level.
> A lot of applications are already implementing this pattern, but the
> required code to do it is quite painful and error prone, for example
> most apps I have seen created a new KafkaProducer to send records to
> their DLQ.
>
> As it would be disabled by default for backward compatibility, I doubt
> it would generate any security concern.
> If a user explicitly configures a Dead Letter Queue, it would be up to
> them to configure the relevant ACLs to ensure that the right principal
> can access it.
> It is already the case for all internal, input and output Kafka
> Streams topics (e.g. repartition, changelog topics) that also could
> contain confidential data, so I do not think we should implement a
> different behavior for this one.
>
> In this KIP, we configured the default DLQ record to have the initial
> record key/value as we assume that it is the expected and wanted
> behavior for most applications.
> If a user does not want to have the key/value in the DLQ record for
> any reason, they could still implement exception handlers to build
> their own DLQ record.
>
> Regarding ACL, maybe something smarter could be done in Kafka Streams,
> but this is out of scope for this KIP.
>
> On Fri, 12 Apr 2024 at 11:58, Claude Warren  wrote:
> >
> > My concern is that someone would create a dead letter queue on a
> sensitive
> > topic and not get the ACL correct from the start, thus causing a potential
> > confidential data leak.  Is there anything in the proposal that would
> > prevent that from happening?  If so I did not recognize it as such.
> >
> > On Fri, Apr 12, 2024 at 9:45 AM Damien Gasparina 
> > wrote:
> >
> > > Hi Claude,
> > >
> > > In this KIP, the Dead Letter Queue is materialized as a standard and
> > > independent topic, thus normal ACLs apply to it like any other topic.
> > > This should not introduce any security issues, obviously, the right
> > > ACL would need to be provided to write to the DLQ if configured.
> > >
> > > Cheers,
> > > Damien
> > >
> > > On Fri, 12 Apr 2024 at 08:59, Claude Warren, Jr
> > >  wrote:
> > > >
> > > > I am new to the Kafka codebase so please excuse any ignorance on my
> part.
> > > >
> > > > When a dead letter queue is established is there a process to ensure
> that
> > > > it at least is defined with the same ACL as the original queue?
> Without
> > > > such a guarantee at the start it seems that managing dead letter
> queues
> > > > will be fraught with security issues.
> > > >
> > > >
> > > > On Wed, Apr 10, 2024 at 10:34 AM Damien Gasparina <
> d.gaspar...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > To continue on our effort to improve Kafka Streams error handling,
> we
> > > > > propose a new KIP to add out of the box support for Dead Letter
> Queue.
> > > > > The goal of this KIP is to provide a default implementation that
> > > > > should be suitable for most applications and allow users to
> override
> > > > > it if they have specific requirements.
> > > > >
> > > > > In order to build a suitable payload, some additional changes are
> > > > > included in this KIP:
> > > > >   1. extend the ProcessingContext to hold, when available, the
> source
> > > > > node raw key/value byte[]
> > > > >   2. expose the ProcessingContext to the
> ProductionExceptionHandler,
> > > > > it is currently not available in the handle parameters.
> > > > >
> > > > > Regarding point 2.,  to expose the ProcessingContext to the
> > > > > 

Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-04-12 Thread Nick Telford
On further thought, it's clear that this can't work for one simple reason:
StateStores don't know their associated TaskId (and hence, their
StateDirectory) until the init() call. Therefore, committedOffset() can't
be called before init(), unless we also added a StateStoreContext argument
to committedOffset(), which I think might be trying to shoehorn too much
into committedOffset().
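A minimal illustration of the lifecycle constraint above: a store only learns its task directory inside `init()`, so an offset lookup before `init()` has nothing to read from. All names here are illustrative, not the real StateStore API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the lifecycle constraint: a store only
// learns its task directory inside init(), so committedOffset() has
// nothing to read from until init() has run. Names are illustrative,
// not the real Kafka Streams StateStore API.
class CheckpointingStore {
    private Map<Integer, Long> committedOffsets;  // null until init()

    void init(String taskDirectory) {
        // A real store would load the offsets persisted under
        // taskDirectory; here we simulate one committed changelog offset.
        committedOffsets = new HashMap<>();
        committedOffsets.put(0, 42L);
    }

    Long committedOffset(int partition) {
        if (committedOffsets == null) {
            throw new IllegalStateException("committedOffset() called before init()");
        }
        return committedOffsets.get(partition);
    }
}
```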

I still don't like the idea of the Streams engine maintaining the cache of
changelog offsets independently of stores, mostly because of the
maintenance burden of the code duplication, but it looks like we'll have to
live with it.

Unless you have any better ideas?

Regards,
Nick

On Wed, 10 Apr 2024 at 14:12, Nick Telford  wrote:

> Hi Bruno,
>
> Immediately after I sent my response, I looked at the codebase and came to
> the same conclusion. If it's possible at all, it will need to be done by
> creating temporary StateManagers and StateStores during rebalance. I think
> it is possible, and probably not too expensive, but the devil will be in
> the detail.
>
> I'll try to find some time to explore the idea to see if it's possible and
> report back, because we'll need to determine this before we can vote on the
> KIP.
>
> Regards,
> Nick
>
> On Wed, 10 Apr 2024 at 11:36, Bruno Cadonna  wrote:
>
>> Hi Nick,
>>
>> Thanks for reacting to my comments so quickly!
>>
>>
>> 2.
>> Some thoughts on your proposal.
>> State managers (and state stores) are parts of tasks. If the task is not
>> assigned locally, we do not create those tasks. To get the offsets with
>> your approach, we would need to either create kind of inactive tasks
>> besides active and standby tasks or store and manage state managers of
>> non-assigned tasks differently than the state managers of assigned
>> tasks. Additionally, the cleanup thread that removes unassigned task
>> directories needs to concurrently delete those inactive tasks or
>> task-less state managers of unassigned tasks. This seems all quite messy
>> to me.
>> Could we create those state managers (or state stores) for locally
>> existing but unassigned tasks on demand when
>> TaskManager#getTaskOffsetSums() is executed? Or have a different
>> encapsulation for the unused task directories?
>>
>>
>> Best,
>> Bruno
>>
>>
>>
>> On 4/10/24 11:31 AM, Nick Telford wrote:
>> > Hi Bruno,
>> >
>> > Thanks for the review!
>> >
>> > 1, 4, 5.
>> > Done
>> >
>> > 3.
>> > You're right. I've removed the offending paragraph. I had originally
>> > adapted this from the guarantees outlined in KIP-892. But it's
>> difficult to
>> > provide these guarantees without the KIP-892 transaction buffers.
>> Instead,
>> > we'll add the guarantees back into the JavaDoc when KIP-892 lands.
>> >
>> > 2.
>> > Good point! This is the only part of the KIP that was (significantly)
>> > changed when I extracted it from KIP-892. My prototype currently
>> maintains
>> > this "cache" of changelog offsets in .checkpoint, but doing so becomes
>> very
>> > messy. My intent with this change was to try to better encapsulate this
>> > offset "caching", especially for StateStores that can cheaply provide
>> the
>> > offsets stored directly in them without needing to duplicate them in
>> this
>> > cache.
>> >
>> > It's clear some more work is needed here to better encapsulate this. My
>> > immediate thought is: what if we construct *but don't initialize* the
>> > StateManager and StateStores for every Task directory on-disk? That
>> should
>> > still be quite cheap to do, and would enable us to query the offsets for
>> > all on-disk stores, even if they're not open. If the StateManager (aka.
>> > ProcessorStateManager/GlobalStateManager) proves too expensive to hold
>> open
>> > for closed stores, we could always have a "StubStateManager" in its
>> place,
>> > that enables the querying of offsets, but nothing else?
>> >
>> > IDK, what do you think?
>> >
>> > Regards,
>> >
>> > Nick
>> >
>> > On Tue, 9 Apr 2024 at 15:00, Bruno Cadonna  wrote:
>> >
>> >> Hi Nick,
>> >>
>> >> Thanks for breaking out the KIP from KIP-892!
>> >>
>> >> Here a couple of comments/questions:
>> >>
>> >> 1.
>> >> In Kafka Streams, we have a design guideline which says to not use the
>> >> "get"-prefix for getters on the public API.

Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-04-10 Thread Nick Telford
Hi Bruno,

Immediately after I sent my response, I looked at the codebase and came to
the same conclusion. If it's possible at all, it will need to be done by
creating temporary StateManagers and StateStores during rebalance. I think
it is possible, and probably not too expensive, but the devil will be in
the detail.

I'll try to find some time to explore the idea to see if it's possible and
report back, because we'll need to determine this before we can vote on the
KIP.

Regards,
Nick

On Wed, 10 Apr 2024 at 11:36, Bruno Cadonna  wrote:

> Hi Nick,
>
> Thanks for reacting to my comments so quickly!
>
>
> 2.
> Some thoughts on your proposal.
> State managers (and state stores) are parts of tasks. If the task is not
> assigned locally, we do not create those tasks. To get the offsets with
> your approach, we would need to either create kind of inactive tasks
> besides active and standby tasks or store and manage state managers of
> non-assigned tasks differently than the state managers of assigned
> tasks. Additionally, the cleanup thread that removes unassigned task
> directories needs to concurrently delete those inactive tasks or
> task-less state managers of unassigned tasks. This seems all quite messy
> to me.
> Could we create those state managers (or state stores) for locally
> existing but unassigned tasks on demand when
> TaskManager#getTaskOffsetSums() is executed? Or have a different
> encapsulation for the unused task directories?
>
>
> Best,
> Bruno
>
>
>
> On 4/10/24 11:31 AM, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for the review!
> >
> > 1, 4, 5.
> > Done
> >
> > 3.
> > You're right. I've removed the offending paragraph. I had originally
> > adapted this from the guarantees outlined in KIP-892. But it's difficult
> to
> > provide these guarantees without the KIP-892 transaction buffers.
> Instead,
> > we'll add the guarantees back into the JavaDoc when KIP-892 lands.
> >
> > 2.
> > Good point! This is the only part of the KIP that was (significantly)
> > changed when I extracted it from KIP-892. My prototype currently
> maintains
> > this "cache" of changelog offsets in .checkpoint, but doing so becomes
> very
> > messy. My intent with this change was to try to better encapsulate this
> > offset "caching", especially for StateStores that can cheaply provide the
> > offsets stored directly in them without needing to duplicate them in this
> > cache.
> >
> > It's clear some more work is needed here to better encapsulate this. My
> > immediate thought is: what if we construct *but don't initialize* the
> > StateManager and StateStores for every Task directory on-disk? That
> should
> > still be quite cheap to do, and would enable us to query the offsets for
> > all on-disk stores, even if they're not open. If the StateManager (aka.
> > ProcessorStateManager/GlobalStateManager) proves too expensive to hold
> open
> > for closed stores, we could always have a "StubStateManager" in its
> place,
> > that enables the querying of offsets, but nothing else?
> >
> > IDK, what do you think?
> >
> > Regards,
> >
> > Nick
> >
> > On Tue, 9 Apr 2024 at 15:00, Bruno Cadonna  wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for breaking out the KIP from KIP-892!
> >>
> >> Here a couple of comments/questions:
> >>
> >> 1.
> >> In Kafka Streams, we have a design guideline which says to not use the
> >> "get"-prefix for getters on the public API. Could you please change
> >> getCommittedOffsets() to committedOffsets()?
> >>
> >>
> >> 2.
> >> It is not clear to me how TaskManager#getTaskOffsetSums() should read
> >> offsets of tasks the stream thread does not own but that have a state
> >> directory on the Streams client by calling
> >> StateStore#getCommittedOffsets(). If the thread does not own a task, it
> >> also does not create any state stores for the task, which means there is
> >> no state store on which to call getCommittedOffsets().
> >> I would have rather expected that a checkpoint file is written for all
> >> state stores on close -- not only for the RocksDBStore -- and that this
> >> checkpoint file is read in TaskManager#getTaskOffsetSums() for the tasks
> >> that have a state directory on the client but are not currently assigned
> >> to any stream thread of the Streams client.
> >>
> >>
> >> 3.
> >> In the javadocs for commit() you write
> >>
> >> &

Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-04-10 Thread Nick Telford
Hi Bruno,

Thanks for the review!

1, 4, 5.
Done

3.
You're right. I've removed the offending paragraph. I had originally
adapted this from the guarantees outlined in KIP-892. But it's difficult to
provide these guarantees without the KIP-892 transaction buffers. Instead,
we'll add the guarantees back into the JavaDoc when KIP-892 lands.

2.
Good point! This is the only part of the KIP that was (significantly)
changed when I extracted it from KIP-892. My prototype currently maintains
this "cache" of changelog offsets in .checkpoint, but doing so becomes very
messy. My intent with this change was to try to better encapsulate this
offset "caching", especially for StateStores that can cheaply provide the
offsets stored directly in them without needing to duplicate them in this
cache.

It's clear some more work is needed here to better encapsulate this. My
immediate thought is: what if we construct *but don't initialize* the
StateManager and StateStores for every Task directory on-disk? That should
still be quite cheap to do, and would enable us to query the offsets for
all on-disk stores, even if they're not open. If the StateManager (aka.
ProcessorStateManager/GlobalStateManager) proves too expensive to hold open
for closed stores, we could always have a "StubStateManager" in its place,
that enables the querying of offsets, but nothing else?
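A sketch of what that "StubStateManager" idea could look like: an object that can answer changelog-offset queries for an on-disk task directory without opening the underlying stores. Entirely illustrative; the real ProcessorStateManager interface is much larger, and how the offsets are read from disk is left out:

```java
import java.util.Map;

// Sketch of the "StubStateManager" idea: answer changelog-offset
// queries for an on-disk task directory without opening the stores.
// Entirely illustrative; not the real ProcessorStateManager.
interface OffsetReadable {
    Map<String, Long> changelogOffsets();
}

class StubStateManager implements OffsetReadable {
    private final Map<String, Long> offsets;

    // In practice these would be read (possibly lazily) from the task
    // directory on disk; they are injected here for simplicity.
    StubStateManager(Map<String, Long> offsetsFromDisk) {
        this.offsets = offsetsFromDisk;
    }

    @Override
    public Map<String, Long> changelogOffsets() {
        return offsets;
    }

    // The per-task offset sum the rebalance protocol needs.
    long offsetSum() {
        return offsets.values().stream().mapToLong(Long::longValue).sum();
    }
}
```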

IDK, what do you think?

Regards,

Nick

On Tue, 9 Apr 2024 at 15:00, Bruno Cadonna  wrote:

> Hi Nick,
>
> Thanks for breaking out the KIP from KIP-892!
>
> Here a couple of comments/questions:
>
> 1.
> In Kafka Streams, we have a design guideline which says to not use the
> "get"-prefix for getters on the public API. Could you please change
> getCommittedOffsets() to committedOffsets()?
>
>
> 2.
> It is not clear to me how TaskManager#getTaskOffsetSums() should read
> offsets of tasks the stream thread does not own but that have a state
> directory on the Streams client by calling
> StateStore#getCommittedOffsets(). If the thread does not own a task it
> does also not create any state stores for the task, which means there is
> no state store on which to call getCommittedOffsets().
> I would have rather expected that a checkpoint file is written for all
> state stores on close -- not only for the RocksDBStore -- and that this
> checkpoint file is read in TaskManager#getTaskOffsetSums() for the tasks
> that have a state directory on the client but are not currently assigned
> to any stream thread of the Streams client.
>
>
> 3.
> In the javadocs for commit() you write
>
> "... all writes since the last commit(Map), or since init(StateStore)
> *MUST* be available to readers, even after a restart."
>
> This is only true for a clean close before the restart, isn't it?
> If the task fails with a dirty close, Kafka Streams cannot guarantee
> that the in-memory structures of the state store (e.g. memtable in the
> case of RocksDB) are flushed so that the records and the committed
> offsets are persisted.
>
>
> 4.
> The wrapper that provides the legacy checkpointing behavior is actually
> an implementation detail. I would remove it from the KIP, but still
> state that the legacy checkpointing behavior will be supported when the
> state store does not manage the checkpoints.
>
>
> 5.
> Regarding the metrics, could you please add the tags, and the recording
> level (DEBUG or INFO) as done in KIP-607 or KIP-444.
>
>
> Best,
> Bruno
>
> On 4/7/24 5:35 PM, Nick Telford wrote:
> > Hi everyone,
> >
> > Based on some offline discussion, I've split out the "Atomic
> Checkpointing"
> > section from KIP-892: Transactional Semantics for StateStores, into its
> own
> > KIP
> >
> > KIP-1035: StateStore managed changelog offsets
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1035%3A+StateStore+managed+changelog+offsets
> >
> > While KIP-892 was adopted *with* the changes outlined in KIP-1035, these
> > changes were always the most contentious part, and continued to spur
> > discussion even after KIP-892 was adopted.
> >
> > All the changes introduced in KIP-1035 have been removed from KIP-892,
> and
> > a hard dependency on KIP-1035 has been added to KIP-892 in their place.
> >
> > I'm hopeful that with some more focus on this set of changes, we can
> > deliver something that we're all happy with.
> >
> > Regards,
> > Nick
> >
>


[DISCUSS] KIP-1035: StateStore managed changelog offsets

2024-04-07 Thread Nick Telford
Hi everyone,

Based on some offline discussion, I've split out the "Atomic Checkpointing"
section from KIP-892: Transactional Semantics for StateStores, into its own
KIP

KIP-1035: StateStore managed changelog offsets
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1035%3A+StateStore+managed+changelog+offsets

While KIP-892 was adopted *with* the changes outlined in KIP-1035, these
changes were always the most contentious part, and continued to spur
discussion even after KIP-892 was adopted.

All the changes introduced in KIP-1035 have been removed from KIP-892, and
a hard dependency on KIP-1035 has been added to KIP-892 in their place.

I'm hopeful that with some more focus on this set of changes, we can
deliver something that we're all happy with.

Regards,
Nick


Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2024-03-31 Thread Nick Telford
Hi Matthias,

> For the oldest iterator metric, I would propose something simple like
> `iterator-opened-ms` and it would just be the actual timestamp when the
> iterator was opened. I don't think we need to compute the actual age,
> but users can do this computation themselves?

That works for me; it's easier to implement like that :-D I'm a little
concerned that the name "iterator-opened-ms" may not be obvious enough
without reading the docs.

> If we think reporting the age instead of just the timestamp is better, I
> would propose `iterator-max-age-ms`. It should be sufficient to call out
> (as it's kinda "obvious" anyway) that the metric applies to open
> iterators only.

While I think it's preferable to record the timestamp, rather than the age,
this does have the benefit of a more obvious metric name.
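
To make the two options concrete, here's a minimal, JDK-only sketch of how a store could track its open iterators so that either variant can be reported. The class and method names are hypothetical illustrations, not part of the KIP:

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the KIP's API): tracks the wall-clock timestamp at
// which each iterator was opened, so a store-level gauge can report either the
// raw "iterator-opened-ms" value or a derived "iterator-max-age-ms".
class OpenIteratorTracker {
    private final Map<Long, Long> openedMs = new ConcurrentHashMap<>();

    void onOpen(long iteratorId, long nowMs) {
        openedMs.put(iteratorId, nowMs);
    }

    void onClose(long iteratorId) {
        openedMs.remove(iteratorId);
    }

    // Raw timestamp of the oldest still-open iterator ("iterator-opened-ms").
    OptionalLong oldestOpenedMs() {
        return openedMs.values().stream().mapToLong(Long::longValue).min();
    }

    // Derived age ("iterator-max-age-ms"); empty when no iterator is open.
    OptionalLong maxAgeMs(long nowMs) {
        OptionalLong oldest = oldestOpenedMs();
        return oldest.isPresent()
                ? OptionalLong.of(nowMs - oldest.getAsLong())
                : OptionalLong.empty();
    }
}
```

Reporting `iterator-opened-ms` would just surface the raw value from oldestOpenedMs(), while `iterator-max-age-ms` would additionally subtract it from the current time on each metric recording.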

> Nit: the KIP says it's a store-level metric, but I think it would be
> good to say explicitly that it's recorded with DEBUG level only?

Yes, I've already updated the KIP with this information in the table.

Regards,

Nick

On Sun, 31 Mar 2024 at 10:53, Matthias J. Sax  wrote:

> The time window thing was just an idea. Happy to drop it.
>
> For the oldest iterator metric, I would propose something simple like
> `iterator-opened-ms` and it would just be the actual timestamp when the
> iterator was opened. I don't think we need to compute the actual age,
> but users can do this computation themselves?
>
> If we think reporting the age instead of just the timestamp is better, I
> would propose `iterator-max-age-ms`. It should be sufficient to call out
> (as it's kinda "obvious" anyway) that the metric applies to open
> iterators only.
>
> And yes, I was hoping that the code inside MeteredXxxStore might already
> be setup in a way that custom stores would inherit the iterator metrics
> automatically -- I am just not sure, and left it as an exercise for
> somebody to confirm :)
>
>
> Nit: the KIP says it's a store-level metric, but I think it would be
> good to say explicitly that it's recorded with DEBUG level only?
>
>
>
> -Matthias
>
>
> On 3/28/24 2:52 PM, Nick Telford wrote:
> > Quick addendum:
> >
> > My suggested metric "oldest-open-iterator-age-seconds" should be
> > "oldest-open-iterator-age-ms". Milliseconds is obviously a better
> > granularity for such a metric.
> >
> > Still accepting suggestions for a better name.
> >
> > On Thu, 28 Mar 2024 at 13:41, Nick Telford 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Sorry for leaving this for so long. So much for "3 weeks until KIP
> freeze"!
> >>
> >> On Sophie's comments:
> >> 1. Would Matthias's suggestion of a separate metric tracking the age of
> >> the oldest open iterator (within the tag set) satisfy this? That way we
> can
> >> keep iterator-duration-(avg|max) for closed iterators, which can be
> useful
> >> for performance debugging for iterators that don't leak. I'm not sure
> what
> >> we'd call this metric, maybe: "oldest-open-iterator-age-seconds"? Seems
> >> like a mouthful.
> >>
> >> 2. You're right, it makes more sense to provide
> >> iterator-duration-(avg|max). Honestly, I can't remember why I had
> "total"
> >> before, or why I was computing a rate-of-change over it.
> >>
> >> 3, 4, 5, 6. Agreed, I'll make all those changes as suggested.
> >>
> >> 7. Combined with Matthias's point about RocksDB, I'm convinced that this
> >> is the wrong KIP for these. I'll introduce the additional Rocks metrics
> in
> >> another KIP.
> >>
> >> On Matthias's comments:
> >> A. Not sure about the time window. I'm pretty sure all existing avg/max
> >> metrics are since the application was started? Any other suggestions
> here
> >> would be appreciated.
> >>
> >> B. Agreed. See point 1 above.
> >>
> >> C. Good point. My focus was very much on Rocks memory leaks when I wrote
> >> the first draft. I can generalise it. My only concern is that it might
> make
> >> it more difficult to detect Rocks iterator leaks caused *within* our
> >> high-level iterator, e.g. RocksJNI, RocksDB, RocksDBStore, etc. But we
> >> could always provide a RocksDB-specific metric for this, as you
> suggested.
> >>
> >> D. Hmm, we do already have MeteredKeyValueIterator, which automatically
> >> wraps the iterator from inner-stores of MeteredKeyValueStore. If we
> >> implemented these metrics there, then custom stores would automatically
> >> gain the functionality

Re: [DISCUSS] KIP-816: Topology changes without local state reset

2024-03-28 Thread Nick Telford
Hi everyone,

I'm going to resurrect this KIP, because I would like the community to
benefit from our solution.

In the end, we internally solved this problem using Option B: automatically
moving state directories to the correct location whenever they're no longer
aligned with the Topology. We implemented this for ourselves externally to
Kafka Streams, by using Topology#describe() to analyse the Topology, and
then moving state directories before calling KafkaStreams#start().
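
To sketch the idea (not our exact implementation), the directory move can be done with the JDK alone. The target layout here ({app_id}/rocksdb/{store}-{partition}) and all names below are illustrative assumptions, not necessarily what the KIP will standardise:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: relocate per-task store directories like
//   {app_id}/1_3/rocksdb/mystore
// to a task-independent layout like
//   {app_id}/rocksdb/mystore-3
// deriving the partition from the task directory name. The target scheme is an
// assumption for illustration only.
class StateDirMigrator {
    private static final Pattern TASK_DIR = Pattern.compile("(\\d+)_(\\d+)");

    static void migrate(Path appDir) throws IOException {
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(appDir)) {
            for (Path taskDir : tasks) {
                Matcher m = TASK_DIR.matcher(taskDir.getFileName().toString());
                if (!m.matches()) continue;              // not a task directory
                String partition = m.group(2);
                Path rocks = taskDir.resolve("rocksdb");
                if (!Files.isDirectory(rocks)) continue; // no RocksDB stores
                try (DirectoryStream<Path> stores = Files.newDirectoryStream(rocks)) {
                    for (Path store : stores) {
                        Path target = appDir.resolve("rocksdb")
                                .resolve(store.getFileName() + "-" + partition);
                        Files.createDirectories(target.getParent());
                        Files.move(store, target);       // rename on same volume
                    }
                }
            }
        }
    }
}
```

A real implementation would also need to respect state directory lock files, clean up the emptied task directories, and decide what to do when the target already exists — all omitted here for brevity.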

I've updated/re-written the KIP to focus on this solution, albeit properly
integrated into Kafka Streams.

Let me know what you think,

Nick

On Tue, 15 Feb 2022 at 16:23, Nick Telford  wrote:

> In the KIP, for Option A I suggested a new path of:
>
> /state/dir/stores//
>
> I made the mistake of thinking that the rocksdb/ segment goes *after* the
> store name in the current scheme, e.g.
>
> /state/dir//[/rocksdb]
>
> This is a mistake. I'd always intended for a combination of the store name
> and partition number to be encoded in the new path (instead of the store
> name and task ID, that we have now). The exact encoding doesn't really
> bother me too much, so if you have any conventions you think we should
> follow here (hyphenated vs. underscored vs. directory separator, etc.)
> please let me know.
>
> I should be able to find some time hopefully next week to start working on
> this, which should shed some more light on issues that might arise.
>
> In the meantime I'll correct the KIP to include the rocksdb segment.
>
> Thanks everyone for your input so far!
>
> Nick
>
> On Mon, 14 Feb 2022 at 22:02, Guozhang Wang  wrote:
>
>> Thanks for the clarification John!
>>
>> Nick, sorry that I was not super clear in my latest email. I meant exactly
>> what John said.
>>
>> Just to clarify, I do think that this KIP is relatively orthogonal to the
>> named topology work; as long as we still keep the topo name encoded it
>> should be fine since two named topologies can indeed have the same store
>> name, but that would not need to be considered as part of this KIP.
>>
>>
>> Guozhang
>>
>> On Mon, Feb 14, 2022 at 9:02 AM John Roesler  wrote:
>>
>> > Hi Nick,
>> >
>> > When Guozhang and I were chatting, we realized that it's not completely
>> > sufficient just to move the state store directories, because their names
>> > are not unique. In particular, more than one partition of the store may
>> > be assigned to the same instance. Right now, this is handled because the
>> > task ID encodes the partition number.
>> >
>> > For example, if we have a store "mystore" in subtopology 1 and we have
>> two
>> > out of four partitions (0 and 3) assigned to the local node, the disk
>> will
>> > have these paths:
>> >
>> > {app_id}/1_0/rocksdb/mystore
>> > {app_id}/1_3/rocksdb/mystore
>> >
>> > Clearly, we can't just elevate both "mystore" directories to reside under
>> > {app_id}, because they have the same name. When I think of option (A),
>> > here's what I picture:
>> >
>> > {app_id}/rocksdb/mystore-0
>> > {app_id}/rocksdb/mystore-3
>> >
>> > In the future, one thing we're considering doing is actually storing all
>> > the positions in the same RocksDB database, which is a pretty convenient
>> > step away from option (A) (another reason to prefer it to option (B)).
>> >
>> > I just took a look at how named topologies are handled, and they're
>> > actually a separate path segment, not part of the task ID, like this:
>> >
>> > {app_id}/__{topo_name}__/1_0/rocksdb/mystore
>> > {app_id}/__{topo_name}__/1_3/rocksdb/mystore
>> >
>> > Which is pretty convenient because it means there are no
>> > implications for your proposal. If you implement the above
>> > code, then we'll just wind up with:
>> >
>> > {app_id}/__{topo_name}__/rocksdb/mystore-0
>> > {app_id}/__{topo_name}__/rocksdb/mystore-3
>> >
>> > Does that make sense?
>> >
>> > Thanks,
>> > -John
>> >
>> >
>> > On Mon, Feb 14, 2022, at 03:57, Nick Telford wrote:
>> > > Hi Guozhang,
>> > >
>> > > Sorry I haven't had the time to respond to your earlier email, but I
>> just
>> > > wanted to clarify something with respect to your most recent email.
>> > >
>> > > My original plan in option A is to remove the entire Task ID from the
>> > State

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2024-03-28 Thread Nick Telford
Quick addendum:

My suggested metric "oldest-open-iterator-age-seconds" should be
"oldest-open-iterator-age-ms". Milliseconds is obviously a better
granularity for such a metric.

Still accepting suggestions for a better name.

On Thu, 28 Mar 2024 at 13:41, Nick Telford  wrote:

> Hi everyone,
>
> Sorry for leaving this for so long. So much for "3 weeks until KIP freeze"!
>
> On Sophie's comments:
> 1. Would Matthias's suggestion of a separate metric tracking the age of
> the oldest open iterator (within the tag set) satisfy this? That way we can
> keep iterator-duration-(avg|max) for closed iterators, which can be useful
> for performance debugging for iterators that don't leak. I'm not sure what
> we'd call this metric, maybe: "oldest-open-iterator-age-seconds"? Seems
> like a mouthful.
>
> 2. You're right, it makes more sense to provide
> iterator-duration-(avg|max). Honestly, I can't remember why I had "total"
> before, or why I was computing a rate-of-change over it.
>
> 3, 4, 5, 6. Agreed, I'll make all those changes as suggested.
>
> 7. Combined with Matthias's point about RocksDB, I'm convinced that this
> is the wrong KIP for these. I'll introduce the additional Rocks metrics in
> another KIP.
>
> On Matthias's comments:
> A. Not sure about the time window. I'm pretty sure all existing avg/max
> metrics are since the application was started? Any other suggestions here
> would be appreciated.
>
> B. Agreed. See point 1 above.
>
> C. Good point. My focus was very much on Rocks memory leaks when I wrote
> the first draft. I can generalise it. My only concern is that it might make
> it more difficult to detect Rocks iterator leaks caused *within* our
> high-level iterator, e.g. RocksJNI, RocksDB, RocksDBStore, etc. But we
> could always provide a RocksDB-specific metric for this, as you suggested.
>
> D. Hmm, we do already have MeteredKeyValueIterator, which automatically
> wraps the iterator from inner-stores of MeteredKeyValueStore. If we
> implemented these metrics there, then custom stores would automatically
> gain the functionality, right? This seems like a pretty logical place to
> implement these metrics, since MeteredKeyValueStore is all about adding
> metrics to state stores.
>
> > I imagine the best way to implement this would be to do so at the
> > high-level iterator rather than implementing it separately for each
> > specific iterator implementation for every store type.
>
> Sophie, does MeteredKeyValueIterator fit with your recommendation?
>
> Thanks for your thoughts everyone, I'll update the KIP now.
>
> Nick
>
> On Thu, 14 Mar 2024 at 03:37, Sophie Blee-Goldman 
> wrote:
>
>> About your last two points: I completely agree that we should try to
>> make this independent of RocksDB, and should probably adopt a
>> general philosophy of being store-implementation agnostic unless
>> there is good reason to focus on a particular store type: eg if it was
>> only possible to implement for certain stores, or only made sense in
>> the context of a certain store type but not necessarily stores in general.
>>
>> While leaking memory due to unclosed iterators on RocksDB stores is
>> certainly the most common issue, I think Matthias sufficiently
>> demonstrated that the problem of leaking iterators is not actually
>> unique to RocksDB, and we should consider including in-memory
>> stores at the very least. I also think that at this point, we may as well
>> just implement the metrics for *all* store types, whether rocksdb or
>> in-memory or custom. Not just because it probably applies to all
>> store types (leaking iterators are rarely a good thing!) but because
>> I imagine the best way to implement this would be to do so at the
>> high-level iterator rather than implementing it separately for each
>> specific iterator implementation for every store type.
>>
>> That said, I haven't thought all that carefully about the implementation
>> yet -- it just strikes me as easiest to do at the top level of the store
>> hierarchy rather than at the bottom. My gut instinct may very well be
>> wrong, but that's what it's saying
>>
>> On Thu, Mar 7, 2024 at 10:43 AM Matthias J. Sax  wrote:
>>
>> > Seems I am late to this party. Can we pick this up again aiming for 3.8
>> > release? I think it would be a great addition. Few comments:
>> >
>> >
>> > - I think it does make sense to report `iterator-duration-avg` and
>> > `iterator-duration-max` for all *closed* iterators -- it just seems to
>> > be a useful metric (wondering if this would be _overall_ or bounded to

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2024-03-28 Thread Nick Telford
e could still expose them, similar to other RocksDB metrics which we
> > expose already). However, for this new metric, we should track it
> > ourselves and thus make it independent of RocksDB -- in the end, an
> > in-memory store could also leak memory (and kill a JVM with an
> > out-of-memory error) and we should be able to track it.
> >
> > - Not sure if we would like to add support for custom stores, to allow
> > them to register their iterators with this metric? Or would this not be
> > necessary, because custom stores could just register a custom metric
> > about it to begin with?
> >
> >
> >
> > -Matthias
> >
> > On 10/25/23 4:41 PM, Sophie Blee-Goldman wrote:
> > >>
> > >>   If we used "iterator-duration-max", for
> > >> example, would it not be confusing that it includes Iterators that are
> > >> still open, and therefore the duration is not yet known?
> > >
> > >
> > > 1. Ah, I think I understand your concern better now -- I totally agree
> > > that an "iterator-duration-max" metric would be confusing/misleading. I was
> > > thinking about it a bit differently, something more akin to the
> > > "last-rebalance-seconds-ago" consumer metric. As the name suggests,
> > > that basically just tracks how long the consumer has gone without
> > > rebalancing -- it doesn't purport to represent the actual duration between
> > > rebalances, just the current time since the last one. The hard part is
> > > really in choosing a name that reflects this -- maybe you have some better
> > > ideas, but off the top of my head, perhaps something like
> > > "iterator-lifetime-max"?
> > >
> > > 2. I'm not quite sure how to interpret the "iterator-duration-total" metric
> > > -- what exactly does it mean to add up all the iterator durations? For
> > > some context, while this is not a hard-and-fast rule, in general you'll
> > > find that Kafka/Streams metrics tend to come in pairs of avg/max or
> > > rate/total. Something that you might measure the avg for usually is
> > > also useful to measure the max, whereas a total metric is probably
> > > also useful as a rate but not so much as an avg. I actually think this
> > > is part of why it feels like it makes so much sense to include a "max"
> > > version of this metric, as Lucas suggested, even if the name of
> > > "iterator-duration-max" feels misleading. Ultimately the metric names
> > > are up to you, but for this reason, I would personally advocate for
> > > just going with an "iterator-duration-avg" and "iterator-duration-max"
> > >
> > > I did see your example in which you mention one could monitor the
> > > rate of change of the "-total" metric. While this does make sense to
> > > me, if the only way to interpret a metric is by computing another
> > > function over it, then why not just make that computation the metric
> > > and cut out the middle man? And in this case, to me at least, it feels
> > > much easier to understand a metric like "iterator-duration-max" vs
> > > something like "iterator-duration-total-rate"
> > >
> > > 3. By the way, can you add another column to the table with the new
> > metrics
> > > that lists the recording level? My suggestion would be to put the
> > > "number-open-iterators" at INFO and the other two at DEBUG. See
> > > the following for my reasoning behind this recommendation
> > >
> > > 4. I would change the "Type" entry for the "number-open-iterators" from
> > > "Value" to "Gauge". This helps justify the "INFO" level for this
> metric,
> > > since unlike the other metrics which are "Measurables", the current
> > > timestamp won't need to be retrieved on each recording
> > >
> > > 5. Can you list the tags that would be associated with each of these
> > > metrics (either in the table, or separately above/below if they will be
> > > the same for all)
> > >
> > > 6. Do you have a strong preference for the name "number-open-iterators"
> > > or would you be alright in shortening this to "num-open-iterators"? The
> > > latter is more in line with the naming scheme used elsewhere in Kafka
> > > for similar kinds of metrics, and a shorter name is always nice.

Re: [DISCUSS] KIP-990: Capability to SUSPEND Tasks on DeserializationException

2024-03-28 Thread Nick Telford
 the PAUSE option would simply stall
> > > the task, and upon #resume it would just be discarding that record and
> > > then continuing on with processing (or even committing the offset
> > > immediately after it, perhaps even asynchronously since it presumably
> > > doesn't matter if it doesn't succeed and the record is picked up again by
> > > accident -- as long as that doesn't happen repeatedly in an infinite
> > > loop, which I don't see why it would.)
> > >
> > > On the subject of committing...
> > >
> > > Other questions: if a task would be paused, would we commit the current
> > >> offset? What happens if we re-balance? Would we just lose the "pause"
> > >> state, and hit the same error again and just pause again?
> > >
> > >
> > > I was imagining that we would either just wait without committing, or
> > > perhaps even commit everything up to -- but not including -- the "bad"
> > > record when PAUSE is triggered. Again, if we rebalance and "lose the
> > > pause" then we'll just attempt to process it again, fail, and end up back
> > > in PAUSE. This is no different than how successful processing works, no?
> > > Who cares if a rebalance happens to strike and causes it to be PAUSED
> > > again?
> > >
> > > All in all, I feel like these concerns are all essentially "true", but to
> > > me they just seem like implementation or design decisions, and none of
> > > them strikes me as posing an unsolvable problem for this feature. But
> > > maybe I'm just lacking in imagination...
> > >
> > > Thoughts?
> > >
> > >
> > > On Fri, Mar 8, 2024 at 5:30 PM Matthias J. Sax 
> wrote:
> > >
> > >> Hey Nick,
> > >>
> > >> I am sorry that I have to say that I am not a fan of this KIP. I see way
> > >> too many foot-guns and complications that can be introduced.
> > >>
> > >> I am also not sure if I understand the motivation. You say, CONTINUE and
> > >> FAIL are not good enough, but don't describe in detail why? If we
> > >> understand the actual problem better, it might also get clear how
> > >> task-pausing would help to address the problem.
> > >>
> > >>
> > >> The main problem I see, as already mentioned by Sophie, is about time
> > >> synchronization. However, it's not limited to joins, but affects all
> > >> time-based operations, i.e., also all windowed aggregations. If one task
> > >> pauses but others keep running, we keep advancing stream-time downstream,
> > >> and thus when the task resumes later, there is a very high
> > >> probability that records are dropped as the window has already closed.
> > >>
> > >> For the runtime itself, we also cannot really do a cascading downstream
> > >> pause, because the runtime does not know anything about the semantics of
> > >> operators. We don't know if we execute a DSL operator or a PAPI
> > >> operator. (We could maybe track all downstream tasks independent of
> > >> semantics, but in the end it might just imply we could also just pause
> > >> all tasks...)
> > >>
> > >> For the "skip record case", it's also not possible to skip over an
> > >> offset from outside while the application is running. The offset in
> > >> question is cached inside the consumer and the consumer would not go
> > >> back to Kafka to re-read the offset (only when a partition is
> > >> re-assigned to a new consumer, the consumer would fetch the offset once
> > >> to init itself). -- But even if the consumer would go back to read the
> > >> offset, as long as the partition is assigned to a member of the group,
> > >> it's not even possible to commit a new offset using some external tool.
> > >> Only members of the group are allowed to commit offsets, and all tools
> > >> that allow manipulating offsets require that the corresponding
> > >> application is stopped, and that the consumer group is empty (and the
> > >> tool will join the consumer group as the only member and commit offsets).
> > >>
> > >> Of course, we could pause all tasks, but that's kind of similar to
> > >> shutting down? I agree though, that `FAIL` is rather harsh, and it could be a

Re: Kafka trunk test & build stability

2024-01-02 Thread Nick Telford
Addendum: I've opened a PR with what I believe are the changes necessary to
enable Remote Build Caching, if you choose to go that route:
https://github.com/apache/kafka/pull/15109

On Tue, 2 Jan 2024 at 14:31, Nick Telford  wrote:

> Hi everyone,
>
> Regarding building a "dependency graph"... Gradle already has this
> information, albeit fairly coarse-grained. You might be able to get some
> considerable improvement by configuring the Gradle Remote Build Cache. It
> looks like it's currently disabled explicitly:
> https://github.com/apache/kafka/blob/trunk/settings.gradle#L46
>
> The trick is to have trunk builds write to the cache, and PR builds only
> read from it. This way, any PR based on trunk should be able to cache not
> only the compilation, but also the tests from dependent modules that
> haven't changed (e.g. for a PR that only touches the connect/streams
> modules).
>
> This would probably be preferable to having to hand-maintain some
> rules/dependency graph in the CI configuration, and it's quite
> straight-forward to configure.
>
> Bonus points if the Remote Build Cache is readable publicly, enabling
> contributors to benefit from it locally.
>
> Regards,
> Nick
>
> On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy 
> wrote:
>
>> Thanks for all the work that has already been done on this in the past
>> days!
>>
>> Have we considered running our test suite with
>> -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as
>> Jenkins build artifacts? This could speed up debugging. Even if we
>> store them only for a day and do it only for trunk, I think it could
>> be worth it. The heap dumps shouldn't contain any secrets, and I
>> checked with the ASF infra team, and they are not concerned about the
>> additional disk usage.
>>
>> Cheers,
>> Lucas
>>
>> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya 
>> wrote:
>> >
>> > I have started to perform an analysis of the OOM at
>> > https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to
>> > contribute to the investigation.
>> >
>> > --
>> > Divij Vaidya
>> >
>> >
>> >
>> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan
>> 
>> > wrote:
>> >
>> > > I am still seeing quite a few OOM errors in the builds and I was
>> curious if
>> > > folks had any ideas on how to identify the cause and fix the issue. I
>> was
>> > > looking in gradle enterprise and found some info about memory usage,
>> but
>> > > nothing detailed enough to help figure the issue out.
>> > >
>> > > OOMs sometimes fail the build immediately and in other cases I see it
>> get
>> > > stuck for 8 hours. (See
>> > >
>> > >
>> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12
>> > > )
>> > >
>> > > I appreciate all the work folks are doing here and I will continue to
>> try
>> > > to help as best as I can.
>> > >
>> > > Justine
>> > >
>> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur
>> > >  wrote:
>> > >
>> > > > S2. We’ve looked into this before, and it wasn’t possible at the
>> time
>> > > with
>> > > > JUnit. We commonly set a timeout on each test class (especially
>> > > integration
>> > > > tests). It is probably worth looking at this again and seeing if
>> > > something
>> > > > has changed with JUnit (or our usage of it) that would allow a
>> global
>> > > > timeout.
>> > > >
>> > > >
>> > > > S3. Dedicated infra sounds nice, if we can get it. It would at least
>> > > remove
>> > > > some variability between the builds, and hopefully eliminate the
>> > > > infra/setup class of failures.
>> > > >
>> > > >
>> > > > S4. Running tests for what has changed sounds nice, but I think it
>> is
>> > > risky
>> > > > to implement broadly. As Sophie mentioned, there are probably some
>> lines
>> > > we
>> > > > could draw where we feel confident that only running a subset of
>> tests is
>> > > > safe. As a start, we could probably work towards skipping CI for
>> non-code
>> > > > PRs.
>> > > >
>> > > >
>> > > > ---
>> > > >
>> > > >

Re: Kafka trunk test & build stability

2024-01-02 Thread Nick Telford
Hi everyone,

Regarding building a "dependency graph"... Gradle already has this
information, albeit fairly coarse-grained. You might be able to get some
considerable improvement by configuring the Gradle Remote Build Cache. It
looks like it's currently disabled explicitly:
https://github.com/apache/kafka/blob/trunk/settings.gradle#L46

The trick is to have trunk builds write to the cache, and PR builds only
read from it. This way, any PR based on trunk should be able to cache not
only the compilation, but also the tests from dependent modules that
haven't changed (e.g. for a PR that only touches the connect/streams
modules).

This would probably be preferable to having to hand-maintain some
rules/dependency graph in the CI configuration, and it's quite
straight-forward to configure.

Bonus points if the Remote Build Cache is readable publicly, enabling
contributors to benefit from it locally.
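
For example, the settings.gradle change could look roughly like the sketch below. The cache URL is a placeholder, and detecting trunk builds via the BRANCH_NAME environment variable is an assumption about the CI environment:

```groovy
// settings.gradle -- sketch of enabling the Gradle remote build cache.
// The URL and the trunk-detection mechanism are placeholders.
boolean isTrunkBuild = System.getenv("BRANCH_NAME") == "trunk"

buildCache {
    local {
        enabled = true
    }
    remote(HttpBuildCache) {
        url = uri("https://example-build-cache.apache.org/cache/")
        // Only trunk CI builds write to the cache; PR and local builds read.
        push = isTrunkBuild
        credentials {
            username = System.getenv("GRADLE_CACHE_USER")
            password = System.getenv("GRADLE_CACHE_PASSWORD")
        }
    }
}
```

Gradle keys cache entries by task inputs, so a PR branched from trunk would transparently reuse compile (and cacheable test) outputs for unchanged modules.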

Regards,
Nick

On Tue, 2 Jan 2024 at 13:00, Lucas Brutschy 
wrote:

> Thanks for all the work that has already been done on this in the past
> days!
>
> Have we considered running our test suite with
> -XX:+HeapDumpOnOutOfMemoryError and uploading the heap dumps as
> Jenkins build artifacts? This could speed up debugging. Even if we
> store them only for a day and do it only for trunk, I think it could
> be worth it. The heap dumps shouldn't contain any secrets, and I
> checked with the ASF infra team, and they are not concerned about the
> additional disk usage.
>
> Cheers,
> Lucas
>
> On Wed, Dec 27, 2023 at 2:25 PM Divij Vaidya 
> wrote:
> >
> > I have started to perform an analysis of the OOM at
> > https://issues.apache.org/jira/browse/KAFKA-16052. Please feel free to
> > contribute to the investigation.
> >
> > --
> > Divij Vaidya
> >
> >
> >
> > On Wed, Dec 27, 2023 at 1:23 AM Justine Olshan
> 
> > wrote:
> >
> > > I am still seeing quite a few OOM errors in the builds and I was
> curious if
> > > folks had any ideas on how to identify the cause and fix the issue. I
> was
> > > looking in gradle enterprise and found some info about memory usage,
> but
> > > nothing detailed enough to help figure the issue out.
> > >
> > > OOMs sometimes fail the build immediately and in other cases I see it
> get
> > > stuck for 8 hours. (See
> > >
> > >
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2508/pipeline/12
> > > )
> > >
> > > I appreciate all the work folks are doing here and I will continue to
> try
> > > to help as best as I can.
> > >
> > > Justine
> > >
> > > On Tue, Dec 26, 2023 at 1:04 PM David Arthur
> > >  wrote:
> > >
> > > > S2. We’ve looked into this before, and it wasn’t possible at the time
> > > with
> > > > JUnit. We commonly set a timeout on each test class (especially
> > > integration
> > > > tests). It is probably worth looking at this again and seeing if
> > > something
> > > > has changed with JUnit (or our usage of it) that would allow a global
> > > > timeout.
> > > >
> > > >
> > > > S3. Dedicated infra sounds nice, if we can get it. It would at least
> > > remove
> > > > some variability between the builds, and hopefully eliminate the
> > > > infra/setup class of failures.
> > > >
> > > >
> > > > S4. Running tests for what has changed sounds nice, but I think it is
> > > risky
> > > > to implement broadly. As Sophie mentioned, there are probably some
> lines
> > > we
> > > > could draw where we feel confident that only running a subset of
> tests is
> > > > safe. As a start, we could probably work towards skipping CI for
> non-code
> > > > PRs.
> > > >
> > > >
> > > > ---
> > > >
> > > >
> > > > As an aside, I experimented with build caching and running affected
> > > tests a
> > > > few months ago. I used the opportunity to play with Github Actions,
> and I
> > > > quite liked it. Here’s the workflow I used:
> > > >
> https://github.com/mumrah/kafka/blob/trunk/.github/workflows/push.yml. I
> > > > was trying to see if we could use a build cache to reduce the
> compilation
> > > > time on PRs. A nightly/periodic job would build trunk and populate a
> > > Gradle
> > > > build cache. PR builds would read from that cache which would enable
> them
> > > > to only compile changed code. The same idea could be extended to
> tests,
> > > but
> > > > I didn’t get that far.
> > > >
> > > >
> > > > As for Github Actions, the idea there is that ASF would provide
> generic
> > > > Action “runners” that would pick up jobs from the Github Action build
> > > queue
> > > > and run them. It is also possible to self-host runners to expand the
> > > build
> > > > capacity of the project (i.e., other organizations could donate
> > > > build capacity). The advantage of this is that we would have more
> control
> > > > over our build/reports and not be “stuck” with whatever ASF Jenkins
> > > offers.
> > > > The Actions workflows are very customizable and it would let us
> create
> > > our
> > > > own custom plugins. There is also a substantial marketplace of
> plugins. I
> > 

Re: Apache Kafka 3.7.0 Release

2023-11-23 Thread Nick Telford
Hi Stan,

I'd like to propose including KIP-892 in the 3.7 release. The KIP has been
accepted and I'm just working on rebasing the implementation against trunk
before I open a PR.

Regards,
Nick

On Tue, 21 Nov 2023 at 11:27, Mayank Shekhar Narula <
mayanks.nar...@gmail.com> wrote:

> Hi Stan
>
> Can you include KIP-951 to the 3.7 release plan? All PRs are merged in the
> trunk.
>
> On Wed, Nov 15, 2023 at 4:05 PM Stanislav Kozlovski
>  wrote:
>
> > Friendly reminder to everybody that the KIP Freeze is *exactly 7 days
> away*
> > - November 22.
> >
> > A KIP must be accepted by this date in order to be considered for this
> > release. Note, any KIP that may not be implemented in time, or otherwise
> > risks heavily destabilizing the release, should be deferred.
> >
> > Best,
> > Stan
> >
> > On Fri, Nov 3, 2023 at 6:03 AM Sophie Blee-Goldman <
> sop...@responsive.dev>
> > wrote:
> >
> > > Looks great, thank you! +1
> > >
> > > On Thu, Nov 2, 2023 at 10:21 AM David Jacot
>  > >
> > > wrote:
> > >
> > > > +1 from me as well. Thanks, Stan!
> > > >
> > > > David
> > > >
> > > > On Thu, Nov 2, 2023 at 6:04 PM Ismael Juma 
> wrote:
> > > >
> > > > > Thanks Stanislav, +1
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Thu, Nov 2, 2023 at 7:01 AM Stanislav Kozlovski
> > > > >  wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Given the discussion here and the lack of any pushback, I have
> > > changed
> > > > > the
> > > > > > dates of the release:
> > > > > > - KIP Freeze - *November 22 *(moved 4 days later)
> > > > > > - Feature Freeze - *December 6 *(moved 2 days earlier)
> > > > > > - Code Freeze - *December 20*
> > > > > >
> > > > > > If anyone has any thoughts against this proposal - please let me
> > > know!
> > > > It
> > > > > > would be good to settle on this early. These will be the dates
> > we're
> > > > > going
> > > > > > with
> > > > > >
> > > > > > Best,
> > > > > > Stanislav
> > > > > >
> > > > > > On Thu, Oct 26, 2023 at 12:15 AM Sophie Blee-Goldman <
> > > > > > sop...@responsive.dev>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the response and explanations -- I think the main
> > > question
> > > > > for
> > > > > > > me
> > > > > > > was whether we intended to permanently increase the KF -- FF
> gap
> > > from
> > > > > the
> > > > > > > historical 1 week to 3 weeks? Maybe this was a conscious
> decision
> > > > and I
> > > > > > > just
> > > > > > >  missed the memo, hopefully someone else can chime in here. I'm
> > all
> > > > for
> > > > > > > additional though. And looking around at some of the recent
> > > releases,
> > > > > it
> > > > > > > seems like we haven't been consistently following the "usual"
> > > > schedule
> > > > > > > since
> > > > > > > the 2.x releases.
> > > > > > >
> > > > > > > Anyways, my main concern was making sure to leave a full 2
> weeks
> > > > > between
> > > > > > > feature freeze and code freeze, so I'm generally happy with the
> > new
> > > > > > > proposal.
> > > > > > > Although I would still prefer to have the KIP freeze fall on a
> > > > > Wednesday
> > > > > > --
> > > > > > > Ismael actually brought up the same thing during the 3.5.0
> > release
> > > > > > > planning,
> > > > > > > so I'll just refer to his explanation for this:
> > > > > > >
> > > > > > > We typically choose a Wednesday for the various freeze dates -
> > > there
> > > > > are
> > > > > > > > often 1-2 day slips and it's better if that doesn't require
> > > people
> > > > > > > > working through the weekend.
> > > > > > > >
> > > > > > >
> > > > > > > (From this mailing list thread
> > > > > > > <
> > https://lists.apache.org/thread/dv1rym2jkf0141sfsbkws8ckkzw7st5h
> > > >)
> > > > > > >
> > > > > > > Thanks for driving the release!
> > > > > > > Sophie
> > > > > > >
> > > > > > > On Wed, Oct 25, 2023 at 8:13 AM Stanislav Kozlovski
> > > > > > >  wrote:
> > > > > > >
> > > > > > > > Thanks for the thorough response, Sophie.
> > > > > > > >
> > > > > > > > - Added to the "Future Release Plan"
> > > > > > > >
> > > > > > > > > 1. Why is the KIP freeze deadline on a Saturday?
> > > > > > > >
> > > > > > > > It was simply added as a starting point - around 30 days from
> > the
> > > > > > > > announcement. We can move it earlier to the 15th of November,
> > but
> > > > my
> > > > > > > > thinking is later is better with these things - it's already
> > > > > aggressive
> > > > > > > > enough. e.g given the choice of Nov 15 vs Nov 18, I don't
> > > > necessarily
> > > > > > > see a
> > > > > > > > strong reason to choose 15.
> > > > > > > >
> > > > > > > > If people feel strongly about this, to make up for this, we
> can
> > > eat
> > > > > > into
> > > > > > > > the KF-FF time as I'll touch upon later, and move FF a few
> days
> > > > > earlier
> > > > > > > to
> > > > > > > > land on a Wednesday.
> > > > > > > >
> > > > > > > > This reduces the time one has to get their feature complete
> > after
> > > > KF,
> > > > > > but
> > > > > > > > allows for longer 

Re: [VOTE] KIP-892: Transactional StateStores

2023-11-17 Thread Nick Telford
Hi everyone,

With +3 binding votes (and +1 non-binding), the vote passes.

KIP-892 Transactional StateStores is Adopted!

Regards,
Nick

On Tue, 14 Nov 2023 at 09:56, Bruno Cadonna  wrote:

> Hi Nick!
>
> Thanks a lot for the KIP!
>
> Looking forward to the implementation!
>
> +1 (binding)
>
> Best,
> Bruno
>
> On 11/14/23 2:23 AM, Sophie Blee-Goldman wrote:
> > +1 (binding)
> >
> > Thanks a lot for this KIP!
> >
> > On Mon, Nov 13, 2023 at 8:39 AM Lucas Brutschy
> >  wrote:
> >
> >> Hi Nick,
> >>
> >> really happy with the final KIP. Thanks a lot for the hard work!
> >>
> >> +1 (binding)
> >>
> >> Lucas
> >>
> >> On Mon, Nov 13, 2023 at 4:20 PM Colt McNealy 
> wrote:
> >>>
> >>> +1 (non-binding).
> >>>
> >>> Thank you, Nick, for making all of the changes (especially around the
> >>> `default.state.isolation.level` config).
> >>>
> >>> Colt McNealy
> >>>
> >>> *Founder, LittleHorse.dev*
> >>>
> >>>
> >>> On Mon, Nov 13, 2023 at 7:15 AM Nick Telford 
> >> wrote:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> I'd like to call a vote for KIP-892: Transactional StateStores[1],
> >> which
> >>>> makes Kafka Streams StateStores transactional under EOS.
> >>>>
> >>>> Regards,
> >>>>
> >>>> Nick
> >>>>
> >>>> 1:
> >>>>
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
> >>>>
> >>
> >
>


[VOTE] KIP-892: Transactional StateStores

2023-11-13 Thread Nick Telford
Hi everyone,

I'd like to call a vote for KIP-892: Transactional StateStores[1], which
makes Kafka Streams StateStores transactional under EOS.

Regards,

Nick

1:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores


Re: [DISCUSS] KIP-990: Capability to SUSPEND Tasks on DeserializationException

2023-10-26 Thread Nick Telford
re if we would want to rename the #resume method in that case to
> make this more clear, or if javadocs would be sufficient...maybe
> something like #skipRecordAndContinue?
>
> On Tue, Oct 24, 2023 at 6:54 AM Nick Telford 
> wrote:
>
> > Hi Sophie,
> >
> > Thanks for the review!
> >
> > 1-3.
> > I had a feeling this was the case. I'm thinking of adding a PAUSED state
> > with the following valid transitions:
> >
> >- RUNNING -> PAUSED
> >- PAUSED -> RUNNING
> >- PAUSED -> SUSPENDED
> >
> > The advantage of a dedicated State is it should make testing easier and
> > also reduce the potential for introducing bugs into the existing Task
> > states.
> >
> > While I appreciate that the engine is being revised, I think I'll still
> > pursue this actively instead of waiting, as it addresses some problems my
> > team is having right now. If the KIP is accepted, then I suspect that
> this
> > feature would still be desirable with the new streams engine, so any new
> > Task state would likely want to be mirrored in the new engine, and the
> high
> > level design is unlikely to change.
> >
> > 4a.
> > This is an excellent point I hadn't considered. Correct me if I'm wrong,
> > but the only joins that this would impact are Stream-Stream and
> > Stream-Table joins? Table-Table joins should be safe, because the join is
> > commutative, so a delayed record on one side should just cause its output
> > record to be delayed, but not lost.
> >
> > 4b.
> > If we can enumerate only the node types that are impacted by this (i.e.
> > Stream-Stream and Stream-Table joins), then perhaps we could restrict it
> > such that it only pauses dependent Tasks if there's a Stream-Stream/Table
> > join involved? The drawback here would be that custom stateful Processors
> > might also be impacted, but there'd be no way to know if they're safe to
> > not pause.
> >
> > 4c.
> > Regardless, I like this idea, but I have very little knowledge about
> making
> > changes to the rebalance/network protocol. It looks like this could be
> > added via StreamsPartitionAssignor#subscriptionUserData? I might need
> some
> > help designing this aspect of this KIP.
> >
> > Regards,
> > Nick
> >
> > On Tue, 24 Oct 2023 at 07:30, Sophie Blee-Goldman  >
> > wrote:
> >
> > > Hey Nick,
> > >
> > > A few high-level thoughts:
> > >
> > > 1. We definitely don't want to piggyback on the SUSPENDED task state,
> as
> > > this is currently more like an intermediate state that a task passes
> > > through as it's being closed/migrated elsewhere, it doesn't really mean
> > > that a task is "suspended" and there's no logic to suspend processing
> on
> > > it. What you want is probably closer in spirit to the concept of a
> paused
> > > "named topology", where we basically freeze processing on a specific
> task
> > > (or set of tasks).
> > > 2. More importantly however, the SUSPENDED state was only ever needed
> to
> > > support efficient eager rebalancing, and we plan to remove the eager
> > > rebalancing protocol from Streams entirely in the near future. And
> > > unfortunately, the named topologies feature was never fully implemented
> > and
> > > will probably be ripped out sometime soon as well.
> > > 3. In short, to go this route, you'd probably need to implement a
> PAUSED
> > > state from scratch. This wouldn't be impossible, but we are planning to
> > > basically revamp the entire thread model and decouple the consumer
> > > (potentially including the deserialization step) from the processing
> > > threads. Much as I love the idea of this feature, it might not make a
> lot
> > > of sense to spend time implementing right now when much of that work
> > could
> > > be scrapped in the mid-term future. We don't have a timeline for this,
> > > however, so I don't think this should discourage you if the feature
> seems
> > > worth it, just want to give you a sense of the upcoming roadmap.
> > > 4. As for the feature itself, my only concern is that this feels like a
> > > very advanced feature but it would be easy for new users to
> accidentally
> > > abuse it and get their application in trouble. Specifically I'm worried
> > > about how this could be harmful to applications for which some degree
> of
> > > synchronization is required, such as a join. Correct join semantics
> rely
> > &

Re: [DISCUSS] KIP-990: Capability to SUSPEND Tasks on DeserializationException

2023-10-24 Thread Nick Telford
Hi Sophie,

Thanks for the review!

1-3.
I had a feeling this was the case. I'm thinking of adding a PAUSED state
with the following valid transitions:

   - RUNNING -> PAUSED
   - PAUSED -> RUNNING
   - PAUSED -> SUSPENDED

The advantage of a dedicated State is it should make testing easier and
also reduce the potential for introducing bugs into the existing Task
states.
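The transition rules above could be captured in a small state enum with validated transitions. A minimal sketch, loosely modeled on the state-machine pattern Kafka Streams already uses for tasks and threads; the names and transition sets here are illustrative, not the actual API:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of a task state enum including the proposed PAUSED
// state. Only the transitions discussed in this thread are modeled.
enum TaskState {
    RUNNING, PAUSED, SUSPENDED;

    // Valid next states, per the proposal: RUNNING -> PAUSED,
    // PAUSED -> RUNNING, PAUSED -> SUSPENDED.
    private static Set<TaskState> validNext(final TaskState s) {
        switch (s) {
            case RUNNING: return EnumSet.of(PAUSED);
            case PAUSED:  return EnumSet.of(RUNNING, SUSPENDED);
            default:      return EnumSet.noneOf(TaskState.class);
        }
    }

    boolean isValidTransition(final TaskState next) {
        return validNext(this).contains(next);
    }
}

class TaskStateDemo {
    public static void main(String[] args) {
        System.out.println(TaskState.RUNNING.isValidTransition(TaskState.PAUSED));    // true
        System.out.println(TaskState.RUNNING.isValidTransition(TaskState.SUSPENDED)); // false
    }
}
```

Centralising the transition check like this is what makes a dedicated state easier to test than overloading an existing one.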

While I appreciate that the engine is being revised, I think I'll still
pursue this actively instead of waiting, as it addresses some problems my
team is having right now. If the KIP is accepted, then I suspect that this
feature would still be desirable with the new streams engine, so any new
Task state would likely want to be mirrored in the new engine, and the high
level design is unlikely to change.

4a.
This is an excellent point I hadn't considered. Correct me if I'm wrong,
but the only joins that this would impact are Stream-Stream and
Stream-Table joins? Table-Table joins should be safe, because the join is
commutative, so a delayed record on one side should just cause its output
record to be delayed, but not lost.

4b.
If we can enumerate only the node types that are impacted by this (i.e.
Stream-Stream and Stream-Table joins), then perhaps we could restrict it
such that it only pauses dependent Tasks if there's a Stream-Stream/Table
join involved? The drawback here would be that custom stateful Processors
might also be impacted, but there'd be no way to know if they're safe to
not pause.

4c.
Regardless, I like this idea, but I have very little knowledge about making
changes to the rebalance/network protocol. It looks like this could be
added via StreamsPartitionAssignor#subscriptionUserData? I might need some
help designing this aspect of this KIP.

Regards,
Nick

On Tue, 24 Oct 2023 at 07:30, Sophie Blee-Goldman 
wrote:

> Hey Nick,
>
> A few high-level thoughts:
>
> 1. We definitely don't want to piggyback on the SUSPENDED task state, as
> this is currently more like an intermediate state that a task passes
> through as it's being closed/migrated elsewhere, it doesn't really mean
> that a task is "suspended" and there's no logic to suspend processing on
> it. What you want is probably closer in spirit to the concept of a paused
> "named topology", where we basically freeze processing on a specific task
> (or set of tasks).
> 2. More importantly however, the SUSPENDED state was only ever needed to
> support efficient eager rebalancing, and we plan to remove the eager
> rebalancing protocol from Streams entirely in the near future. And
> unfortunately, the named topologies feature was never fully implemented and
> will probably be ripped out sometime soon as well.
> 3. In short, to go this route, you'd probably need to implement a PAUSED
> state from scratch. This wouldn't be impossible, but we are planning to
> basically revamp the entire thread model and decouple the consumer
> (potentially including the deserialization step) from the processing
> threads. Much as I love the idea of this feature, it might not make a lot
> of sense to spend time implementing right now when much of that work could
> be scrapped in the mid-term future. We don't have a timeline for this,
> however, so I don't think this should discourage you if the feature seems
> worth it, just want to give you a sense of the upcoming roadmap.
> 4. As for the feature itself, my only concern is that this feels like a
> very advanced feature but it would be easy for new users to accidentally
> abuse it and get their application in trouble. Specifically I'm worried
> about how this could be harmful to applications for which some degree of
> synchronization is required, such as a join. Correct join semantics rely
> heavily on receiving records from both sides of the join and carefully
> selecting the next one to process based on timestamp. Imagine if a
> DeserializationException occurs upstream of a repartition feeding into one
> side of a join (but not the other) and the user opts to PAUSE this task. If
> the join continues  as usual it could lead to missed or incorrect results
> when processing is enforced with no records present on one side of the join
> but usual traffic flowing through the other. Maybe we could somehow signal
> to also PAUSE all downstream/dependent tasks? Should be able to add this
> information to the subscription metadata and send to all clients via a
> rebalance. There might be better options I'm not seeing. Or we could just
> decide to trust the users not to shoot themselves in the foot -- as long as
> we write a clear warning in the javadocs this might be fine
>
> Thanks for all the great KIPs!
>
> On Thu, Oct 12, 2023 at 9:51 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > This is a Streams KIP to add a new DeserializationHandlerResponse,


Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2023-10-24 Thread Nick Telford
I don't really have a problem with adding such a metric, I'm just not
entirely sure how it would work. If we used "iterator-duration-max", for
example, would it not be confusing that it includes Iterators that are
still open, and therefore the duration is not yet known? When graphing that
over time, I suspect it would be difficult to understand.

3.
FWIW, this would still be picked up by "open-iterators", since that metric
is only decremented when Iterator#close is called (via the
ManagedKeyValueIterator#onClose hook).

I'm actually considering expanding the scope of this KIP slightly to
include improved Block Cache metrics, as my own memory leak investigations
have trended in that direction. Do you think the following metrics should
be included in this KIP, or should I create a new KIP?

   - block-cache-index-usage (number of bytes occupied by index blocks)
   - block-cache-filter-usage (number of bytes occupied by filter blocks)

Regards,
Nick

On Tue, 24 Oct 2023 at 07:09, Sophie Blee-Goldman 
wrote:

> I actually think we could implement Lucas' suggestion pretty easily and
> without too much additional effort. We have full control over the iterator
> that is returned by the various range queries, so it would be easy to
> register a gauge metric for how long it has been since the iterator was
> created. Then we just deregister the metric when the iterator is closed.
>
> With respect to how useful this metric would be, both Nick and Lucas have
> made good points: I would agree that in general, leaking iterators would
> mean an ever-increasing iterator count that should be possible to spot
> without this. However, a few things to consider:
>
> 1. it's really easy to set up an alert based on some maximum threshold of
> how long an iterator should remain open for. It's relatively more tricky to
> set up alerts based on the current count of open iterators and how it
> changes over time.
> 2. As Lucas mentioned, it only takes a few iterators to wreak havoc in
> extreme cases. Sometimes more advanced applications end up with just a few
> leaking iterators despite closing the majority of them. I've seen this
> happen just once personally, but it was driving everyone crazy until we
> figured it out.
> 3. A metric for how long the iterator has been open would help to identify
> hanging iterators due to some logic where the iterator is properly closed
> but for whatever reason just isn't being advanced to the end, and thus not
> reached the iterator#close line of the user code. This case seems difficult
> to spot without the specific metric for iterator lifetime
> 4. Lastly, I think you could argue that all of the above are fairly
> advanced use cases, but this seems like a fairly advanced feature already,
> so it doesn't seem unreasonable to try and cover all the bases.
>
> All that said, my philosophy is that the KIP author gets the final word on
> what to pull into scope as long as the proposal isn't harming anyone
> without the extra feature/changes. So it's up to you Nick --  just wanted
> to add some context on how it could work, and why it would be helpful
>
> Thanks for the KIP!
>
> On Wed, Oct 18, 2023 at 7:04 AM Lucas Brutschy
>  wrote:
>
> > Hi Nick,
> >
> > I did not think in detail about how to implement it, just about what
> > metrics would be nice to have. You are right, we'd have to
> > register/deregister the iterators during open/close. This would be
> > more complicated to implement than the other metrics, but I do not see
> > a fundamental problem with it.
> >
> > As far as I understand, even a low number of leaked iterators can hurt
> > RocksDB compaction significantly. So we may even want to detect if the
> > iterators are opened by one-time / rare queries against the state
> > store.
> >
> > But, as I said, that would be an addition and not a change of the
> > current contents of the KIP, so I'd support the KIP moving forward
> > even without this extension.
> >
> > Cheers, Lucas
> >
> >
> >
> > On Tue, Oct 17, 2023 at 3:45 PM Nick Telford 
> > wrote:
> > >
> > > Hi Lucas,
> > >
> > > Hmm, I'm not sure how we could reliably identify such leaked Iterators.
> > If
> > > we tried to include open iterators when calculating iterator-duration,
> > we'd
> > > need some kind of registry of all the open iterator creation
> timestamps,
> > > wouldn't we?
> > >
> > > In general, if you have a leaky Iterator, it should manifest as a
> > > persistently climbing "open-iterators" metric, even on a busy node,
> > because
> > > each time that Iterator is used, it will leak another one. So even in
> the

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-18 Thread Nick Telford
Hi Lucas,

TaskCorruptedException is how Streams signals that the Task state needs to
be wiped, so we can't retain that exception without also wiping state on
timeouts.

Regards,
Nick

On Wed, 18 Oct 2023 at 14:48, Lucas Brutschy 
wrote:

> Hi Nick,
>
> I think indeed the better behavior would be to retry commitTransaction
> until we risk running out of time to meet `max.poll.interval.ms`.
>
> However, if it's handled as a `TaskCorruptedException` at the moment,
> I would do the same in this KIP, and leave exception handling
> improvements to future work. This KIP is already improving the
> situation a lot by not wiping the state store.
>
> Cheers,
> Lucas
>
> On Tue, Oct 17, 2023 at 3:51 PM Nick Telford 
> wrote:
> >
> > Hi Lucas,
> >
> > Yeah, this is pretty much the direction I'm thinking of going in now. You
> > make an interesting point about committing on-error under
> > ALOS/READ_COMMITTED, although I haven't had a chance to think through the
> > implications yet.
> >
> > Something that I ran into earlier this week is an issue with the new
> > handling of TimeoutException. Without TX stores, TimeoutException under
> EOS
> > throws a TaskCorruptedException, which wipes the stores. However, with TX
> > stores, TimeoutException is now just bubbled up and dealt with as it is
> > under ALOS. The problem arises when the Producer#commitTransaction call
> > times out: Streams attempts to ignore the error and continue producing,
> > which causes the next call to Producer#send to throw
> > "IllegalStateException: Cannot attempt operation `send` because the
> > previous call to `commitTransaction` timed out and must be retried".
> >
> > I'm not sure what we should do here: retrying the commitTransaction seems
> > logical, but what if it times out again? Where do we draw the line and
> > shutdown the instance?
> >
> > Regards,
> > Nick
> >
> > On Mon, 16 Oct 2023 at 13:19, Lucas Brutschy  .invalid>
> > wrote:
> >
> > > Hi all,
> > >
> > > I think I liked your suggestion of allowing EOS with READ_UNCOMMITTED,
> > > but keep wiping the state on error, and I'd vote for this solution
> > > when introducing `default.state.isolation.level`. This way, we'd have
> > > the most low-risk roll-out of this feature (no behavior change without
> > > reconfiguration), with the possibility of switching to the most sane /
> > > battle-tested default settings in 4.0. Essentially, we'd have a
> > > feature flag but call it `default.state.isolation.level` and don't
> > > have to deprecate it later.
> > >
> > > So the possible configurations would then be this:
> > >
> > > 1. ALOS/READ_UNCOMMITTED (default) = processing uses direct-to-DB, IQ
> > > reads from DB.
> > > 2. ALOS/READ_COMMITTED = processing uses WriteBatch, IQ reads from
> > > WriteBatch/DB. Flush on error (see note below).
> > > 3. EOS/READ_UNCOMMITTED (default) = processing uses direct-to-DB, IQ
> > > reads from DB. Wipe state on error.
> > > 4. EOS/READ_COMMITTED = processing uses WriteBatch, IQ reads from
> > > WriteBatch/DB.
> > >
> > > I believe the feature is important enough that we will see good
> > > adoption even without changing the default. In 4.0, when we have seen
> > > this being adopted and is battle-tested, we make READ_COMMITTED the
> > > default for EOS, or even READ_COMMITTED always the default, depending
> > > on our experiences. And we could add a clever implementation of
> > > READ_UNCOMMITTED with WriteBatches later.
> > >
> > > The only smell here is that `default.state.isolation.level` wouldn't
> > > be purely an IQ setting, but it would also (slightly) change the
> > > behavior of the processing, but that seems unavoidable as long as we
> > > haven't solved READ_UNCOMMITTED IQ with WriteBatches.
> > >
> > > Minor: As for Bruno's point 4, I think if we are concerned about this
> > > behavior (we don't necessarily have to be, because it doesn't violate
> > > ALOS guarantees as far as I can see), we could make
> > > ALOS/READ_COMMITTED more similar to ALOS/READ_UNCOMMITTED by flushing
> > > the WriteBatch on error (obviously, only if we have a chance to do
> > > that).
> > >
> > > Cheers,
> > > Lucas
> > >
> > > On Mon, Oct 16, 2023 at 12:19 PM Nick Telford 
> > > wrote:
> > > >
> > > > Hi Guozhang,
> > > >
> > > > The KIP as it stands introduces a new configura

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-17 Thread Nick Telford
Hi Lucas,

Yeah, this is pretty much the direction I'm thinking of going in now. You
make an interesting point about committing on-error under
ALOS/READ_COMMITTED, although I haven't had a chance to think through the
implications yet.

Something that I ran into earlier this week is an issue with the new
handling of TimeoutException. Without TX stores, TimeoutException under EOS
throws a TaskCorruptedException, which wipes the stores. However, with TX
stores, TimeoutException is now just bubbled up and dealt with as it is
under ALOS. The problem arises when the Producer#commitTransaction call
times out: Streams attempts to ignore the error and continue producing,
which causes the next call to Producer#send to throw
"IllegalStateException: Cannot attempt operation `send` because the
previous call to `commitTransaction` timed out and must be retried".

I'm not sure what we should do here: retrying the commitTransaction seems
logical, but what if it times out again? Where do we draw the line and
shutdown the instance?
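One way to "draw the line" is to retry the commit against an explicit time budget (e.g. some fraction of `max.poll.interval.ms`), and only shut down once that budget is spent. A minimal sketch of that idea; the commit action is a stand-in supplier, not the real `Producer#commitTransaction`:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Illustrative sketch: retry a commit until it succeeds or a deadline
// budget is exhausted, instead of retrying forever or failing immediately.
final class CommitRetry {

    // tryCommit returns true on success, false on a (retriable) timeout.
    // Returns whether the commit eventually succeeded within the budget.
    static boolean commitWithDeadline(final BooleanSupplier tryCommit,
                                      final Duration budget) {
        final long deadlineNs = System.nanoTime() + budget.toNanos();
        while (System.nanoTime() < deadlineNs) {
            if (tryCommit.getAsBoolean()) {
                return true;
            }
        }
        return false; // caller decides: e.g. shut down the instance
    }

    public static void main(String[] args) {
        // Simulated commit that times out twice before succeeding.
        final int[] attempts = {0};
        final boolean ok =
            commitWithDeadline(() -> ++attempts[0] >= 3, Duration.ofSeconds(5));
        System.out.println(ok + " after " + attempts[0] + " attempts");
    }
}
```

The open question from the thread remains what the budget should be and what failure action follows, but bounding retries this way avoids both an infinite retry loop and the `IllegalStateException` from abandoning a timed-out `commitTransaction`.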

Regards,
Nick

On Mon, 16 Oct 2023 at 13:19, Lucas Brutschy 
wrote:

> Hi all,
>
> I think I liked your suggestion of allowing EOS with READ_UNCOMMITTED,
> but keep wiping the state on error, and I'd vote for this solution
> when introducing `default.state.isolation.level`. This way, we'd have
> the most low-risk roll-out of this feature (no behavior change without
> reconfiguration), with the possibility of switching to the most sane /
> battle-tested default settings in 4.0. Essentially, we'd have a
> feature flag but call it `default.state.isolation.level` and don't
> have to deprecate it later.
>
> So the possible configurations would then be this:
>
> 1. ALOS/READ_UNCOMMITTED (default) = processing uses direct-to-DB, IQ
> reads from DB.
> 2. ALOS/READ_COMMITTED = processing uses WriteBatch, IQ reads from
> WriteBatch/DB. Flush on error (see note below).
> 3. EOS/READ_UNCOMMITTED (default) = processing uses direct-to-DB, IQ
> reads from DB. Wipe state on error.
> 4. EOS/READ_COMMITTED = processing uses WriteBatch, IQ reads from
> WriteBatch/DB.
>
> I believe the feature is important enough that we will see good
> adoption even without changing the default. In 4.0, when we have seen
> this being adopted and is battle-tested, we make READ_COMMITTED the
> default for EOS, or even READ_COMMITTED always the default, depending
> on our experiences. And we could add a clever implementation of
> READ_UNCOMMITTED with WriteBatches later.
>
> The only smell here is that `default.state.isolation.level` wouldn't
> be purely an IQ setting, but it would also (slightly) change the
> behavior of the processing, but that seems unavoidable as long as we
> haven't solved READ_UNCOMMITTED IQ with WriteBatches.
>
> Minor: As for Bruno's point 4, I think if we are concerned about this
> behavior (we don't necessarily have to be, because it doesn't violate
> ALOS guarantees as far as I can see), we could make
> ALOS/READ_COMMITTED more similar to ALOS/READ_UNCOMMITTED by flushing
> the WriteBatch on error (obviously, only if we have a chance to do
> that).
>
> Cheers,
> Lucas
>
> On Mon, Oct 16, 2023 at 12:19 PM Nick Telford 
> wrote:
> >
> > Hi Guozhang,
> >
> > The KIP as it stands introduces a new configuration,
> > default.state.isolation.level, which is independent of processing.mode.
> > It's intended that this new configuration be used to configure a global
> IQ
> > isolation level in the short term, with a future KIP introducing the
> > capability to change the isolation level on a per-query basis, falling
> back
> > to the "default" defined by this config. That's why I called it
> "default",
> > for future-proofing.
> >
> > However, it currently includes the caveat that READ_UNCOMMITTED is not
> > available under EOS. I think this is the coupling you are alluding to?
> >
> > This isn't intended to be a restriction of the API, but is currently a
> > technical limitation. However, after discussing with some users about
> > use-cases that would require READ_UNCOMMITTED under EOS, I'm inclined to
> > remove that clause and put in the necessary work to make that combination
> > possible now.
> >
> > I currently see two possible approaches:
> >
> >1. Disable TX StateStores internally when the IsolationLevel is
> >READ_UNCOMMITTED and the processing.mode is EOS. This is more
> difficult
> >than it sounds, as there are many assumptions being made throughout
> the
> >internals about the guarantees StateStores provide. It would
> definitely add
> >a lot of extra "if (read_uncommitted && eos)" branches, complicating
> >maintenance

Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2023-10-17 Thread Nick Telford
Hi Lucas,

Hmm, I'm not sure how we could reliably identify such leaked Iterators. If
we tried to include open iterators when calculating iterator-duration, we'd
need some kind of registry of all the open iterator creation timestamps,
wouldn't we?

In general, if you have a leaky Iterator, it should manifest as a
persistently climbing "open-iterators" metric, even on a busy node, because
each time that Iterator is used, it will leak another one. So even in the
presence of many non-leaky Iterators on a busy instance, the metric should
still consistently climb.
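For reference, the "registry of open iterator creation timestamps" mentioned above need not be heavyweight. A minimal sketch of such a registry, from which both an `open-iterators` count and an oldest-open-iterator age gauge could be derived; class and method names are hypothetical, not Kafka Streams API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track creation timestamps of open iterators so a
// gauge can report both the open count and the age of the oldest one.
final class OpenIteratorRegistry {
    private final Map<Object, Long> openedAtNs = new ConcurrentHashMap<>();

    void onOpen(final Object iterator, final long nowNs) {
        openedAtNs.put(iterator, nowNs);
    }

    void onClose(final Object iterator) {
        openedAtNs.remove(iterator);
    }

    // Age in nanoseconds of the longest-open iterator, or 0 if none open.
    long oldestOpenAgeNs(final long nowNs) {
        return openedAtNs.values().stream()
                .mapToLong(t -> nowNs - t)
                .max()
                .orElse(0L);
    }

    int openCount() {
        return openedAtNs.size();
    }
}
```

A persistently growing `oldestOpenAgeNs` would flag the 1%-leaked case Lucas describes even when the overall open count looks stable.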

Regards,

Nick

On Mon, 16 Oct 2023 at 14:24, Lucas Brutschy 
wrote:

> Hi Nick!
>
> thanks for the KIP! I think this could be quite useful, given the
> problems that we had with leaks due to RocksDB resources not being
> closed.
>
> I don't have any pressing issues why we can't accept it like it is,
> just one minor point for discussion: would it also make sense to make
> it possible to identify a few very long-running / leaked iterators? I
> can imagine on a busy node, it would be hard to spot that 1% of the
> iterators never close when looking only at closed iterator or the
> number of iterators. But it could still be good to identify those
> leaks early. One option would be to add `iterator-duration-max` and
> take open iterators into account when computing the metric.
>
> Cheers,
> Lucas
>
>
> On Thu, Oct 5, 2023 at 3:50 PM Nick Telford 
> wrote:
> >
> > Hi Colt,
> >
> > I kept the details out of the KIP, because KIPs generally document
> > high-level design, but the way I'm doing it is like this:
> >
> > final ManagedKeyValueIterator
> > rocksDbPrefixSeekIterator = cf.prefixScan(accessor, prefixBytes);
> > --> final long startedAt = System.nanoTime();
> > openIterators.add(rocksDbPrefixSeekIterator);
> > rocksDbPrefixSeekIterator.onClose(() -> {
> > --> metricsRecorder.recordIteratorDuration(System.nanoTime() -
> > startedAt);
> > openIterators.remove(rocksDbPrefixSeekIterator);
> > });
> >
> > The lines with the arrow are the new code. This pattern is repeated
> > throughout RocksDBStore, wherever a new RocksDbIterator is created.
> >
> > Regards,
> > Nick
> >
> > On Thu, 5 Oct 2023 at 12:32, Colt McNealy  wrote:
> >
> > > Thank you for the KIP, Nick!
> > >
> > > This would be highly useful for many reasons. Much more sane than
> checking
> > > for leaked iterators by profiling memory usage while running 100's of
> > > thousands of range scans via interactive queries (:
> > >
> > > One small question:
> > >
> > > >The iterator-duration metrics will be updated whenever an Iterator's
> > > close() method is called
> > >
> > > Does the Iterator have its own "createdAt()" or equivalent field, or
> do we
> > > need to keep track of the Iterator's start time upon creation?
> > >
> > > Cheers,
> > > Colt McNealy
> > >
> > > *Founder, LittleHorse.dev*
> > >
> > >
> > > On Wed, Oct 4, 2023 at 9:07 AM Nick Telford 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > KIP-989 is a small Kafka Streams KIP to add a few new metrics around
> the
> > > > creation and use of RocksDB Iterators, to aid users in identifying
> > > > "Iterator leaks" that could cause applications to leak native memory.
> > > >
> > > > Let me know what you think!
> > > >
> > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics
> > > >
> > > > P.S. I'm not too sure about the formatting of the "New Metrics"
> table,
> > > any
> > > > advice there would be appreciated.
> > > >
> > > > Regards,
> > > > Nick
> > > >
> > >
>


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-16 Thread Nick Telford
Hi Guozhang,

The KIP as it stands introduces a new configuration,
default.state.isolation.level, which is independent of processing.mode.
It's intended that this new configuration be used to configure a global IQ
isolation level in the short term, with a future KIP introducing the
capability to change the isolation level on a per-query basis, falling back
to the "default" defined by this config. That's why I called it "default",
for future-proofing.

However, it currently includes the caveat that READ_UNCOMMITTED is not
available under EOS. I think this is the coupling you are alluding to?

This isn't intended to be a restriction of the API, but is currently a
technical limitation. However, after discussing with some users about
use-cases that would require READ_UNCOMMITTED under EOS, I'm inclined to
remove that clause and put in the necessary work to make that combination
possible now.

I currently see two possible approaches:

   1. Disable TX StateStores internally when the IsolationLevel is
   READ_UNCOMMITTED and the processing.mode is EOS. This is more difficult
   than it sounds, as there are many assumptions being made throughout the
   internals about the guarantees StateStores provide. It would definitely add
   a lot of extra "if (read_uncommitted && eos)" branches, complicating
   maintenance and testing.
   2. Invest the time *now* to make READ_UNCOMMITTED of EOS StateStores
   possible. I have some ideas on how this could be achieved, but they would
   need testing and could introduce some additional issues. The benefit of
   this approach is that it would make query-time IsolationLevels much simpler
   to implement in the future.

Unfortunately, both will require considerable work that will further delay
this KIP, which was the reason I placed the restriction in the KIP in the
first place.

Regards,
Nick

On Sat, 14 Oct 2023 at 03:30, Guozhang Wang 
wrote:

> Hello Nick,
>
> First of all, thanks a lot for the great effort you've put in driving
> this KIP! I really like it coming through finally, as many people in
> the community have raised this. At the same time I honestly feel a bit
> ashamed for not putting enough of my time supporting it and pushing it
> through the finish line (you raised this KIP almost a year ago).
>
> I briefly passed through the DISCUSS thread so far, not sure I've 100
> percent digested all the bullet points. But with the goal of trying to
> help take it through the finish line in mind, I'd want to throw
> thoughts on top of my head only on the point #4 above which I felt may
> be the main hurdle for the current KIP to drive to a consensus now.
>
> The general question I asked myself is, whether we want to couple "IQ
> reading mode" with "processing mode". While technically I tend to
> agree with you that it feels like a bug if a single user chose
> "EOS" for processing mode while choosing "read uncommitted" for IQ
> reading mode, at the same time, I'm thinking if it's possible that
> there could be two different persons (or even two teams) that would be
> using the stream API to build the app, and the IQ API to query the
> running state of the app. I know this is less of a technical thing but
> rather a design question, but if it could ever be the case, I'm
> wondering whether the person using the IQ API knows about the risks of
> using read uncommitted but still chose it in favour of
> performance, no matter whether the underlying stream processing mode
> configured by another person is EOS or not. In that regard, I'm
> leaning towards a "leave the door open, and close it later if we
> find it's a bad idea" approach, with a configuration that we can
> potentially deprecate, rather than "shut the door, clean for everyone". More
> specifically, allowing the processing mode / IQ read mode to be
> decoupled, and if we find that there are no such cases as I speculated
> above, or people start complaining a lot, we can still enforce
> coupling them.
>
> Again, just my 2c here. Thanks again for the great patience and
> diligence on this KIP.
>
>
> Guozhang
>
>
>
> On Fri, Oct 13, 2023 at 8:48 AM Nick Telford 
> wrote:
> >
> > Hi Bruno,
> >
> > 4.
> > I'll hold off on making that change until we have a consensus as to what
> > configuration to use to control all of this, as it'll be affected by the
> > decision on EOS isolation levels.
> >
> > 5.
> > Done. I've chosen "committedOffsets".
> >
> > Regards,
> > Nick
> >
> > On Fri, 13 Oct 2023 at 16:23, Bruno Cadonna  wrote:
> >
> > > Hi Nick,
> > >
> > > 1.
> > > Yeah, you are probably right that it does not make too much sense.
> > > T

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Nick Telford
Hi Bruno,

4.
I'll hold off on making that change until we have a consensus as to what
configuration to use to control all of this, as it'll be affected by the
decision on EOS isolation levels.

5.
Done. I've chosen "committedOffsets".

Regards,
Nick

On Fri, 13 Oct 2023 at 16:23, Bruno Cadonna  wrote:

> Hi Nick,
>
> 1.
> Yeah, you are probably right that it does not make too much sense.
> Thanks for the clarification!
>
>
> 4.
> Yes, sorry for the back and forth, but I think for the sake of the KIP
> it is better to let the ALOS behavior as it is for now due to the
> possible issues you would run into. Maybe we can find a solution in the
> future. Now the question returns to whether we really need
> default.state.isolation.level. Maybe the config could be the feature
> flag Sophie requested.
>
>
> 5.
> There is a guideline in Kafka not to use the get prefix for getters (at
> least in the public API). Thus, could you please rename
>
> getCommittedOffset(TopicPartition partition) ->
> committedOffsetFor(TopicPartition partition)
>
> You can also propose an alternative to committedOffsetFor().
>
>
> Best,
> Bruno
>
>
> On 10/13/23 3:21 PM, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for getting back to me.
> >
> > 1.
> > I think this should be possible. Are you thinking of the situation where
> a
> > user may downgrade to a previous version of Kafka Streams? In that case,
> > sadly, the RocksDBStore would get wiped by the older version of Kafka
> > Streams anyway, because that version wouldn't understand the extra column
> > family (that holds offsets), so the missing Position file would
> > automatically get rebuilt when the store is rebuilt from the changelog.
> > Are there other situations than downgrade where a transactional store
> could
> > be replaced by a non-transactional one? I can't think of any.
> >
> > 2.
> > Ahh yes, the Test Plan - my Kryptonite! This section definitely needs to
> be
> > fleshed out. I'll work on that. How much detail do you need?
> >
> > 3.
> > See my previous email discussing this.
> >
> > 4.
> > Hmm, this is an interesting point. Are you suggesting that under ALOS
> > READ_COMMITTED should not be supported?
> >
> > Regards,
> > Nick
> >
> > On Fri, 13 Oct 2023 at 13:52, Bruno Cadonna  wrote:
> >
> >> Hi Nick,
> >>
> >> I think the KIP is converging!
> >>
> >>
> >> 1.
> >> I am wondering whether it makes sense to write the position file during
> >> close as we do for the checkpoint file, so that in case the state store
> >> is replaced with a non-transactional state store the non-transactional
> >> state store finds the position file. I think, this is not strictly
> >> needed, but would be a nice behavior instead of just deleting the
> >> position file.
> >>
> >>
> >> 2.
> >> The test plan does not mention integration tests. Do you not need to
> >> extend existing ones and add new ones? Also for upgrading and
> >> downgrading you might need integration and/or system tests.
> >>
> >>
> >> 3.
> >> I think Sophie made a point. Although, IQ reading from uncommitted data
> >> under EOS might be considered a bug by some people. Thus, your KIP would
> >> fix a bug rather than changing the intended behavior. However, I also
> >> see that a feature flag would help users that rely on this buggy
> >> behavior (at least until AK 4.0).
> >>
> >>
> >> 4.
> >> This is related to the previous point. I assume that the difference
> >> between READ_COMMITTED and READ_UNCOMMITTED for ALOS is that in the
> >> former you enable transactions on the state store and in the latter you
> >> disable them. If my assumption is correct, I think that is an issue.
> >> Let's assume under ALOS Streams fails over a couple of times more or
> >> less at the same step in processing after value 3 is added to an
> >> aggregation but the offset of the corresponding input record was not
> >> committed. With transactions disabled, the aggregation value would
> >> increase by 3 for each failover. With transactions enabled, value 3
> >> would only be added to the aggregation once when the offset of the input
> >> record is committed and the transaction finally completes. So the
> >> content of the state store would change depending on the configuration
> >> for IQ. IMO, the content of the state store should be independent from
> >> IQ. Giv
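Bruno's failover scenario in point 4 can be illustrated with a toy simulation (plain maps stand in for RocksDB and its WriteBatch; all names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation: under ALOS, an input record that adds 3 to an aggregation
// is re-processed after every crash, because its offset was never committed.
public class FailoverSimulation {

    // Non-transactional store: writes hit "disk" immediately and survive crashes.
    static long withoutTransactions(int failovers) {
        Map<String, Long> disk = new HashMap<>();
        for (int i = 0; i <= failovers; i++) {
            disk.merge("agg", 3L, Long::sum); // re-applied on every replay
            // crash before offset commit -> disk state is kept anyway
        }
        return disk.get("agg");
    }

    // Transactional store: writes buffer in memory and are discarded on crash.
    static long withTransactions(int failovers) {
        Map<String, Long> disk = new HashMap<>();
        for (int i = 0; i <= failovers; i++) {
            Map<String, Long> buffer = new HashMap<>(disk);
            buffer.merge("agg", 3L, Long::sum);
            boolean lastAttemptSucceeds = (i == failovers);
            if (lastAttemptSucceeds) {
                disk.putAll(buffer); // flushed together with the offset commit
            }
            // otherwise: crash -> buffer discarded, disk unchanged
        }
        return disk.get("agg");
    }

    public static void main(String[] args) {
        System.out.println(withoutTransactions(4)); // prints 15: 3 applied 5 times
        System.out.println(withTransactions(4));    // prints 3: one committed write
    }
}
```

This is the divergence Bruno describes: the same configuration knob meant for IQ would change the store's contents under ALOS failover.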

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Nick Telford
 followup KIPs
> > would be completed in the same release cycle as this one, we need to make
> > sure that the
> > feature is either compatible with all current users or else
> feature-flagged
> > so that they may
> > opt in/out.
> >
> > Therefore, IIUC we need to have either (or both) of these as
> > fully-implemented config options:
> > 1. default.state.isolation.level
> > 2. enable.transactional.state.stores
> >
> > This way EOS users for whom read_committed semantics are not viable can
> > still upgrade,
> > and either use the isolation.level config to leverage the new txn state
> > stores without sacrificing
> > their application semantics, or else simply keep the transactional state
> > stores disabled until we
> > are able to fully implement the isolation level configuration at either
> an
> > application or query level.
> >
> > Frankly you are the expert here and know much more about the tradeoffs in
> > both semantics and
> > effort level of implementing one of these configs vs the other. In my
> > opinion, either option would
> > be fine and I would leave the decision of which one to include in this
> KIP
> > completely up to you.
> > I just don't see a way for the KIP to proceed without some variation of
> the
> > above that would allow
> > EOS users to opt-out of read_committed.
> >
> > (If it's all the same to you, I would recommend always including a
> feature
> > flag in large structural
> > changes like this. No matter how much I trust someone or myself to
> > implement a feature, you just
> > never know what kind of bugs might slip in, especially with the very
> first
> > iteration that gets released.
> > So personally, my choice would be to add the feature flag and leave it
> off
> > by default. If all goes well
> > you can do a quick KIP to enable it by default as soon as the
> > isolation.level config has been
> > completed. But feel free to just pick whichever option is easiest or
> > quickest for you to implement)
> >
> > Hope this helps move the discussion forward,
> > Sophie
> >
> > On Tue, Sep 19, 2023 at 1:57 AM Nick Telford 
> wrote:
> >
> >> Hi Bruno,
> >>
> >> Agreed, I can live with that for now.
> >>
> >> In an effort to keep the scope of this KIP from expanding, I'm leaning
> >> towards just providing a configurable default.state.isolation.level and
> >> removing IsolationLevel from the StateStoreContext. This would be
> >> compatible with adding support for query-time IsolationLevels in the
> >> future, whilst providing a way for users to select an isolation level
> now.
> >>
> >> The big problem with this, however, is that if a user selects
> >> processing.mode
> >> = "exactly-once(-v2|-beta)", and default.state.isolation.level =
> >> "READ_UNCOMMITTED", we need to guarantee that the data isn't written to
> >> disk until commit() is called, but we also need to permit IQ threads to
> >> read from the ongoing transaction.
> >>
> >> A simple solution would be to (temporarily) forbid this combination of
> >> configuration, and have default.state.isolation.level automatically
> switch
> >> to READ_COMMITTED when processing.mode is anything other than
> >> at-least-once. Do you think this would be acceptable?
> >>
> >> In a later KIP, we can add support for query-time isolation levels and
> >> solve this particular problem there, which would relax this restriction.
> >>
> >> Regards,
> >> Nick
> >>
> >> On Tue, 19 Sept 2023 at 09:30, Bruno Cadonna 
> wrote:
> >>
> >>> Why do we need to add READ_COMMITTED to InMemoryKeyValueStore? I think
> >>> it is perfectly valid to say InMemoryKeyValueStore does not support
> >>> READ_COMMITTED for now, since READ_UNCOMMITTED is the de-facto default
> >>> at the moment.
> >>>
> >>> Best,
> >>> Bruno
> >>>
> >>> On 9/18/23 7:12 PM, Nick Telford wrote:
> >>>> Oh! One other concern I haven't mentioned: if we make IsolationLevel a
> >>>> query-time constraint, then we need to add support for READ_COMMITTED
> >> to
> >>>> InMemoryKeyValueStore too, which will require some changes to the
> >>>> implementation.
> >>>>
> >>>> On Mon, 18 Sept 2023 at 17:24, Nick Telford 
> >>> wrote:
> >>>>
> >>>>> Hi everyone,
> >>>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-10-13 Thread Nick Telford
TTED work under EOS.

In the interests of trying to get this KIP over the line ASAP, I settled on
adding the restriction that READ_UNCOMMITTED would be unavailable under
EOS, with the goal of relaxing this in a future KIP.

If it turns out that this restriction is a blocker, then I'll try to find
the time to explore the possibility of adding a flag.

Regards,
Nick

On Thu, 12 Oct 2023 at 21:32, Sophie Blee-Goldman 
wrote:

> Hey Nick! First of all thanks for taking up this awesome feature, I'm sure
> every single
> Kafka Streams user and dev would agree that it is sorely needed.
>
> I've just been catching up on the KIP and surrounding discussion, so please
> forgive me
> for any misunderstandings or misinterpretations of the current plan and
> don't hesitate to
> correct me.
>
> Before I jump in, I just want to say that having seen this drag on for so
> long, my singular
> goal in responding is to help this KIP past a perceived impasse so we can
> finally move on
> to voting and implementing it. Long discussions are to be expected for
> major features like
> this but it's completely on us as the Streams devs to make sure there is an
> end in sight
> for any ongoing discussion.
>
> With that said, it's my understanding that the KIP as currently proposed is
> just not tenable
> for Kafka Streams, and would prevent some EOS users from upgrading to the
> version it
> first appears in. Given that we can't predict or guarantee whether any of
> the followup KIPs
> would be completed in the same release cycle as this one, we need to make
> sure that the
> feature is either compatible with all current users or else feature-flagged
> so that they may
> opt in/out.
>
> Therefore, IIUC we need to have either (or both) of these as
> fully-implemented config options:
> 1. default.state.isolation.level
> 2. enable.transactional.state.stores
>
> This way EOS users for whom read_committed semantics are not viable can
> still upgrade,
> and either use the isolation.level config to leverage the new txn state
> stores without sacrificing
> their application semantics, or else simply keep the transactional state
> stores disabled until we
> are able to fully implement the isolation level configuration at either an
> application or query level.
>
> Frankly you are the expert here and know much more about the tradeoffs in
> both semantics and
> effort level of implementing one of these configs vs the other. In my
> opinion, either option would
> be fine and I would leave the decision of which one to include in this KIP
> completely up to you.
> I just don't see a way for the KIP to proceed without some variation of the
> above that would allow
> EOS users to opt-out of read_committed.
>
> (If it's all the same to you, I would recommend always including a feature
> flag in large structural
> changes like this. No matter how much I trust someone or myself to
> implement a feature, you just
> never know what kind of bugs might slip in, especially with the very first
> iteration that gets released.
> So personally, my choice would be to add the feature flag and leave it off
> by default. If all goes well
> you can do a quick KIP to enable it by default as soon as the
> isolation.level config has been
> completed. But feel free to just pick whichever option is easiest or
> quickest for you to implement)
>
> Hope this helps move the discussion forward,
> Sophie
>
> On Tue, Sep 19, 2023 at 1:57 AM Nick Telford 
> wrote:
>
> > Hi Bruno,
> >
> > Agreed, I can live with that for now.
> >
> > In an effort to keep the scope of this KIP from expanding, I'm leaning
> > towards just providing a configurable default.state.isolation.level and
> > removing IsolationLevel from the StateStoreContext. This would be
> > compatible with adding support for query-time IsolationLevels in the
> > future, whilst providing a way for users to select an isolation level
> now.
> >
> > The big problem with this, however, is that if a user selects
> > processing.mode
> > = "exactly-once(-v2|-beta)", and default.state.isolation.level =
> > "READ_UNCOMMITTED", we need to guarantee that the data isn't written to
> > disk until commit() is called, but we also need to permit IQ threads to
> > read from the ongoing transaction.
> >
> > A simple solution would be to (temporarily) forbid this combination of
> > configuration, and have default.state.isolation.level automatically
> switch
> > to READ_COMMITTED when processing.mode is anything other than
> > at-least-once. Do you think this would be acceptable?
> >
> > In a later KIP, we can add support for query-time isolation levels and
> 

[DISCUSS] KIP-990: Capability to SUSPEND Tasks on DeserializationException

2023-10-12 Thread Nick Telford
Hi everyone,

This is a Streams KIP to add a new DeserializationHandlerResponse,
"SUSPEND", that suspends the failing Task but continues to process other
Tasks normally.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-990%3A+Capability+to+SUSPEND+Tasks+on+DeserializationException

I'm not yet completely convinced that this is practical, as I suspect it
might be abusing the SUSPENDED Task.State for something it was not designed
for. The intent is to pause an active Task *without* re-assigning it to
another instance, which causes cascading failures when the FAIL
DeserializationHandlerResponse is used.

Let me know what you think!

Regards,
Nick


Re: [DISCUSS] KIP-989: RocksDB Iterator Metrics

2023-10-05 Thread Nick Telford
Hi Colt,

I kept the details out of the KIP, because KIPs generally document
high-level design, but the way I'm doing it is like this:

final ManagedKeyValueIterator<Bytes, byte[]>
rocksDbPrefixSeekIterator = cf.prefixScan(accessor, prefixBytes);
--> final long startedAt = System.nanoTime();
openIterators.add(rocksDbPrefixSeekIterator);
rocksDbPrefixSeekIterator.onClose(() -> {
--> metricsRecorder.recordIteratorDuration(System.nanoTime() -
startedAt);
openIterators.remove(rocksDbPrefixSeekIterator);
});

The lines with the arrow are the new code. This pattern is repeated
throughout RocksDBStore, wherever a new RocksDbIterator is created.
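A self-contained sketch of the pattern above (the class names here are illustrative, not the actual Streams classes): the start time is captured when the wrapper is created, and the duration is recorded via a callback when close() is called — which also answers Colt's question about whether a separate "createdAt" field is needed.

```java
import java.util.Iterator;
import java.util.function.LongConsumer;

// Wrap an iterator, note the creation time, and run a callback recording the
// elapsed duration when close() is called.
public class TimedIterator<T> implements Iterator<T>, AutoCloseable {
    private final Iterator<T> inner;
    private final long startedAt;
    private final LongConsumer onCloseDuration;

    TimedIterator(Iterator<T> inner, LongConsumer onCloseDuration) {
        this.inner = inner;
        this.startedAt = System.nanoTime(); // start time captured at creation
        this.onCloseDuration = onCloseDuration;
    }

    @Override public boolean hasNext() { return inner.hasNext(); }
    @Override public T next() { return inner.next(); }

    @Override public void close() {
        // analogous to metricsRecorder.recordIteratorDuration(...)
        onCloseDuration.accept(System.nanoTime() - startedAt);
    }

    public static void main(String[] args) {
        long[] recorded = {-1L};
        try (TimedIterator<Integer> it =
                 new TimedIterator<>(java.util.List.of(1, 2, 3).iterator(),
                                     d -> recorded[0] = d)) {
            while (it.hasNext()) it.next();
        }
        System.out.println(recorded[0] >= 0); // prints true
    }
}
```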

Regards,
Nick

On Thu, 5 Oct 2023 at 12:32, Colt McNealy  wrote:

> Thank you for the KIP, Nick!
>
> This would be highly useful for many reasons. Much more sane than checking
> for leaked iterators by profiling memory usage while running 100's of
> thousands of range scans via interactive queries (:
>
> One small question:
>
> >The iterator-duration metrics will be updated whenever an Iterator's
> close() method is called
>
> Does the Iterator have its own "createdAt()" or equivalent field, or do we
> need to keep track of the Iterator's start time upon creation?
>
> Cheers,
> Colt McNealy
>
> *Founder, LittleHorse.dev*
>
>
> On Wed, Oct 4, 2023 at 9:07 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > KIP-989 is a small Kafka Streams KIP to add a few new metrics around the
> > creation and use of RocksDB Iterators, to aid users in identifying
> > "Iterator leaks" that could cause applications to leak native memory.
> >
> > Let me know what you think!
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics
> >
> > P.S. I'm not too sure about the formatting of the "New Metrics" table,
> any
> > advice there would be appreciated.
> >
> > Regards,
> > Nick
> >
>


[DISCUSS] KIP-989: RocksDB Iterator Metrics

2023-10-04 Thread Nick Telford
Hi everyone,

KIP-989 is a small Kafka Streams KIP to add a few new metrics around the
creation and use of RocksDB Iterators, to aid users in identifying
"Iterator leaks" that could cause applications to leak native memory.

Let me know what you think!

https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+RocksDB+Iterator+Metrics

P.S. I'm not too sure about the formatting of the "New Metrics" table, any
advice there would be appreciated.

Regards,
Nick


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-19 Thread Nick Telford
Hi Bruno,

Agreed, I can live with that for now.

In an effort to keep the scope of this KIP from expanding, I'm leaning
towards just providing a configurable default.state.isolation.level and
removing IsolationLevel from the StateStoreContext. This would be
compatible with adding support for query-time IsolationLevels in the
future, whilst providing a way for users to select an isolation level now.

The big problem with this, however, is that if a user selects processing.mode
= "exactly-once(-v2|-beta)", and default.state.isolation.level =
"READ_UNCOMMITTED", we need to guarantee that the data isn't written to
disk until commit() is called, but we also need to permit IQ threads to
read from the ongoing transaction.

A simple solution would be to (temporarily) forbid this combination of
configuration, and have default.state.isolation.level automatically switch
to READ_COMMITTED when processing.mode is anything other than
at-least-once. Do you think this would be acceptable?
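The proposed automatic fallback could look something like this (a sketch only; the config names come from the KIP, but the resolution logic and class name are hypothetical):

```java
// default.state.isolation.level is forced to READ_COMMITTED whenever
// processing.mode is anything other than at-least-once (i.e. any EOS variant).
public class IsolationLevelResolver {
    enum IsolationLevel { READ_COMMITTED, READ_UNCOMMITTED }

    static IsolationLevel effectiveIsolationLevel(String processingMode,
                                                  IsolationLevel configured) {
        if (!"at-least-once".equals(processingMode)) {
            // EOS (v1/v2/beta): READ_UNCOMMITTED not (yet) supported,
            // so silently upgrade to READ_COMMITTED.
            return IsolationLevel.READ_COMMITTED;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(effectiveIsolationLevel(
                "exactly-once-v2", IsolationLevel.READ_UNCOMMITTED)); // prints READ_COMMITTED
        System.out.println(effectiveIsolationLevel(
                "at-least-once", IsolationLevel.READ_UNCOMMITTED));   // prints READ_UNCOMMITTED
    }
}
```

An alternative design would be to throw a ConfigException on the forbidden combination instead of silently switching; silent switching avoids breaking existing EOS deployments on upgrade.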

In a later KIP, we can add support for query-time isolation levels and
solve this particular problem there, which would relax this restriction.

Regards,
Nick

On Tue, 19 Sept 2023 at 09:30, Bruno Cadonna  wrote:

> Why do we need to add READ_COMMITTED to InMemoryKeyValueStore? I think
> it is perfectly valid to say InMemoryKeyValueStore does not support
> READ_COMMITTED for now, since READ_UNCOMMITTED is the de-facto default
> at the moment.
>
> Best,
> Bruno
>
> On 9/18/23 7:12 PM, Nick Telford wrote:
> > Oh! One other concern I haven't mentioned: if we make IsolationLevel a
> > query-time constraint, then we need to add support for READ_COMMITTED to
> > InMemoryKeyValueStore too, which will require some changes to the
> > implementation.
> >
> > On Mon, 18 Sept 2023 at 17:24, Nick Telford 
> wrote:
> >
> >> Hi everyone,
> >>
> >> I agree that having IsolationLevel be determined at query-time is the
> >> ideal design, but there are a few sticking points:
> >>
> >> 1.
> >> There needs to be some way to communicate the IsolationLevel down to the
> >> RocksDBStore itself, so that the query can respect it. Since stores are
> >> "layered" in functionality (i.e. ChangeLoggingStore, MeteredStore,
> etc.),
> >> we need some way to deliver that information to the bottom layer. For
> IQv2,
> >> we can use the existing State#query() method, but IQv1 has no way to do
> >> this.
> >>
> >> A simple approach, which would potentially open up other options, would
> be
> >> to add something like: ReadOnlyKeyValueStore<K, V>
> >> readOnlyView(IsolationLevel isolationLevel) to ReadOnlyKeyValueStore
> (and
> >> similar to ReadOnlyWindowStore, ReadOnlySessionStore, etc.).
> >>
> >> 2.
> >> As mentioned above, RocksDB WriteBatches are not thread-safe, which
> causes
> >> a problem if we want to provide READ_UNCOMMITTED Iterators. I also had a
> >> look at RocksDB Transactions[1], but they solve a very different
> problem,
> >> and have the same thread-safety issue.
> >>
> >> One possible approach that I mentioned is chaining WriteBatches: every
> >> time a new Interactive Query is received (i.e. readOnlyView, see above,
> >> is called) we "freeze" the existing WriteBatch, and start a new one for
> new
> >> writes. The Interactive Query queries the "chain" of previous
> WriteBatches
> >> + the underlying database; while the StreamThread starts writing to the
> >> *new* WriteBatch. On-commit, the StreamThread would write *all*
> >> WriteBatches in the chain to the database (that have not yet been
> written).
> >>
> >> WriteBatches would be closed/freed only when they have been both
> >> committed, and all open Interactive Queries on them have been closed.
> This
> >> would require some reference counting.
> >>
> >> Obviously a drawback of this approach is the potential for increased
> >> memory usage: if an Interactive Query is long-lived, for example by
> doing a
> >> full scan over a large database, or even just pausing in the middle of
> an
> >> iteration, then the existing chain of WriteBatches could be kept around
> for
> >> a long time, potentially forever.
> >>
> >> --
> >>
> >> A.
> >> Going off on a tangent, it looks like in addition to supporting
> >> READ_COMMITTED queries, we could go further and support REPEATABLE_READ
> >> queries (i.e. where subsequent reads to the same key in the same
> >> Interactive Query are guaranteed to yield the same value) by ma

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-18 Thread Nick Telford
Oh! One other concern I haven't mentioned: if we make IsolationLevel a
query-time constraint, then we need to add support for READ_COMMITTED to
InMemoryKeyValueStore too, which will require some changes to the
implementation.

On Mon, 18 Sept 2023 at 17:24, Nick Telford  wrote:

> Hi everyone,
>
> I agree that having IsolationLevel be determined at query-time is the
> ideal design, but there are a few sticking points:
>
> 1.
> There needs to be some way to communicate the IsolationLevel down to the
> RocksDBStore itself, so that the query can respect it. Since stores are
> "layered" in functionality (i.e. ChangeLoggingStore, MeteredStore, etc.),
> we need some way to deliver that information to the bottom layer. For IQv2,
> we can use the existing State#query() method, but IQv1 has no way to do
> this.
>
> A simple approach, which would potentially open up other options, would be
> to add something like: ReadOnlyKeyValueStore<K, V>
> readOnlyView(IsolationLevel isolationLevel) to ReadOnlyKeyValueStore (and
> similar to ReadOnlyWindowStore, ReadOnlySessionStore, etc.).
>
> 2.
> As mentioned above, RocksDB WriteBatches are not thread-safe, which causes
> a problem if we want to provide READ_UNCOMMITTED Iterators. I also had a
> look at RocksDB Transactions[1], but they solve a very different problem,
> and have the same thread-safety issue.
>
> One possible approach that I mentioned is chaining WriteBatches: every
> time a new Interactive Query is received (i.e. readOnlyView, see above,
> is called) we "freeze" the existing WriteBatch, and start a new one for new
> writes. The Interactive Query queries the "chain" of previous WriteBatches
> + the underlying database; while the StreamThread starts writing to the
> *new* WriteBatch. On-commit, the StreamThread would write *all*
> WriteBatches in the chain to the database (that have not yet been written).
>
> WriteBatches would be closed/freed only when they have been both
> committed, and all open Interactive Queries on them have been closed. This
> would require some reference counting.
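The chained-WriteBatch idea above might be sketched like this (a toy model only: plain maps stand in for RocksDB WriteBatches and the database, and the reference counting is omitted for brevity):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Oldest-first chain of frozen write batches, with the active batch at the tail.
public class WriteBatchChain {
    private final Map<String, String> database = new HashMap<>();
    private final Deque<Map<String, String>> chain = new ArrayDeque<>();

    WriteBatchChain() { chain.addLast(new HashMap<>()); }

    // The StreamThread writes to the active (newest) batch.
    void put(String key, String value) { chain.peekLast().put(key, value); }

    // An Interactive Query freezes the active batch and starts a new one, then
    // reads newest-batch-first, falling back to the database.
    String readUncommitted(String key) {
        chain.addLast(new HashMap<>()); // freeze current batch, open a new one
        var it = chain.descendingIterator();
        while (it.hasNext()) {
            String v = it.next().get(key);
            if (v != null) return v;
        }
        return database.get(key);
    }

    // On commit, the StreamThread writes every batch in the chain to the db.
    void commit() {
        for (Map<String, String> batch : chain) database.putAll(batch);
        chain.clear();
        chain.addLast(new HashMap<>());
    }

    public static void main(String[] args) {
        WriteBatchChain store = new WriteBatchChain();
        store.put("k", "v1");
        System.out.println(store.readUncommitted("k")); // prints v1 (frozen batch)
        store.put("k", "v2");                           // goes into the new batch
        System.out.println(store.readUncommitted("k")); // prints v2 (newest wins)
        store.commit();
        System.out.println(store.readUncommitted("k")); // prints v2 (from database)
    }
}
```

In the real design, each frozen batch would carry a reference count (one for the commit, one per open query) and be freed only when that count drops to zero — which is exactly where the long-lived-query memory concern comes from.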
>
> Obviously a drawback of this approach is the potential for increased
> memory usage: if an Interactive Query is long-lived, for example by doing a
> full scan over a large database, or even just pausing in the middle of an
> iteration, then the existing chain of WriteBatches could be kept around for
> a long time, potentially forever.
>
> --
>
> A.
> Going off on a tangent, it looks like in addition to supporting
> READ_COMMITTED queries, we could go further and support REPEATABLE_READ
> queries (i.e. where subsequent reads to the same key in the same
> Interactive Query are guaranteed to yield the same value) by making use of
> RocksDB Snapshots[2]. These are fairly lightweight, so the performance
> impact is likely to be negligible, but they do require that the Interactive
> Query session can be explicitly closed.
>
> This could be achieved if we made the above readOnlyView interface look
> more like:
>
> interface ReadOnlyKeyValueView<K, V> implements ReadOnlyKeyValueStore<K,
> V>, AutoCloseable {}
>
> interface ReadOnlyKeyValueStore<K, V> {
> ...
> ReadOnlyKeyValueView<K, V> readOnlyView(IsolationLevel isolationLevel);
> }
>
> But this would be a breaking change, as existing IQv1 queries are
> guaranteed to never call store.close(), and therefore these would leak
> memory under REPEATABLE_READ.
>
> B.
> One thing that's notable: MyRocks states that they support READ_COMMITTED
> and REPEATABLE_READ, but they make no mention of READ_UNCOMMITTED[3][4].
> This could be because doing so is technically difficult/impossible using
> the primitives available in RocksDB.
>
> --
>
> Lucas, to address your points:
>
> U1.
> It's only "SHOULD" to permit alternative (i.e. non-RocksDB)
> implementations of StateStore that do not support atomic writes. Obviously
> in those cases, the guarantees Kafka Streams provides/expects would be
> relaxed. Do you think we should require all implementations to support
> atomic writes?
>
> U2.
> Stores can support multiple IsolationLevels. As we've discussed above, the
> ideal scenario would be to specify the IsolationLevel at query-time.
> Failing that, I think the second-best approach is to define the
> IsolationLevel for *all* queries based on the processing.mode, which is
> what the default StateStoreContext#isolationLevel() achieves. Would you
> prefer an alternative?
>
> While the existing implementation is equivalent to READ_UNCOMMITTED, this
> can yield unexpected results/errors under EOS, if a transaction is rolled
> back. While this would be a change in behaviour for users, it would look
> more like a bug fix than a breaking chang

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-18 Thread Nick Telford
3: https://github.com/facebook/mysql-5.6/wiki/Transaction-Isolation
4: https://mariadb.com/kb/en/myrocks-transactional-isolation/

On Mon, 18 Sept 2023 at 16:19, Lucas Brutschy
 wrote:

> Hi Nick,
>
> since I last read it in April, the KIP has become much cleaner and
> easier to read. Great work!
>
> It feels to me the last big open point is whether we can implement
> isolation level as a query parameter. I understand that there are
> implementation concerns, but as Colt says, it would be a great
> addition, and would also simplify the migration path for this change.
> Is the implementation problem you mentioned caused by the WriteBatch
> not having a notion of a snapshot, as the underlying DB iterator does?
> In that case, I am not sure a chain of WriteBatches as you propose
> would fully solve the problem, but maybe I didn't dig enough into the
> details to fully understand it.
>
> If it's not possible to implement it now, would it be an option to
> make sure in this KIP that we do not fully close the door on per-query
> isolation levels in the interface, as it may be possible to implement
> the missing primitives in RocksDB or Speedb in the future?
>
> Understanding:
>
> * U1) Why is it only "SHOULD" for changelogOffsets to be persisted
> atomically with the records?
> * U2) Don't understand the default implementation of `isolationLevel`.
> The isolation level should be a property of the underlying store, and
> not be defined by the default config? Existing stores probably don't
> guarantee READ_COMMITTED, so the default should be to return
> READ_UNCOMMITTED.
>
> Nits:
> * N1) Could `getCommittedOffset` use an `OptionalLong` return type, to
> avoid the `null`?
> * N2) Could `approximateNumUncommittedBytes` use an `OptionalLong`
> return type, to avoid the `-1`?
> * N3) I don't understand why `managesOffsets` uses the 'manage' verb,
> whereas all other methods use the "commits" verb. I'd suggest
> `commitsOffsets`.
>
> Either way, it feels this KIP is very close to the finish line, I'm
> looking forward to seeing this in production!
>
> Cheers,
> Lucas
>
> On Mon, Sep 18, 2023 at 6:57 AM Colt McNealy  wrote:
> >
> > > Making IsolationLevel a query-time constraint, rather than linking it
> to
> > the processing.guarantee.
> >
> > As I understand it, would this allow even a user of EOS to control
> > whether they read committed or uncommitted records? If so, I am highly
> > in favor of this.
> >
> > I know that I was one of the early people to point out the current
> > shortcoming that IQ reads uncommitted records, but just this morning I
> > realized a pattern we use which means that (for certain queries) our
> system
> > needs to be able to read uncommitted records, which is the current
> behavior
> > of Kafka Streams in EOS.***
> >
> > If IsolationLevel being a query-time decision allows for this, then that
> > would be amazing. I would also vote that the default behavior should be
> for
> > reading uncommitted records, because it is totally possible for a valid
> > application to depend on that behavior, and breaking it in a minor
> release
> > might be a bit strong.
> >
> > *** (Note, for the curious reader) Our use-case/query pattern is a
> bit
> > complex, but reading "uncommitted" records is actually safe in our case
> > because processing is deterministic. Additionally, IQ being able to read
> > uncommitted records is crucial to enable "read your own writes" on our
> API:
> > Due to the deterministic processing, we send an "ack" to the client who
> > makes the request as soon as the processor processes the result. If they
> > can't read uncommitted records, they may receive a "201 - Created"
> > response, immediately followed by a "404 - Not Found" when doing a lookup
> > for the object they just created.
> >
> > Thanks,
> > Colt McNealy
> >
> > *Founder, LittleHorse.dev*
> >
> >
> > On Wed, Sep 13, 2023 at 9:19 AM Nick Telford 
> wrote:
> >
> > > Addendum:
> > >
> > > I think we would also face the same problem with the approach John
> outlined
> > > earlier (using the record cache as a transaction buffer and flushing it
> > > straight to SST files). This is because the record cache (the
> ThreadCache
> > > class) is not thread-safe, so every commit would invalidate open IQ
> > > Iterators in the same way that RocksDB WriteBatches do.
> > > --
> > > Nick
> > >
> > > On Wed, 13 Sept 2023 at 16:58, Nick Telford 
> > 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-13 Thread Nick Telford
Addendum:

I think we would also face the same problem with the approach John outlined
earlier (using the record cache as a transaction buffer and flushing it
straight to SST files). This is because the record cache (the ThreadCache
class) is not thread-safe, so every commit would invalidate open IQ
Iterators in the same way that RocksDB WriteBatches do.
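A tiny stand-in for this invalidation behaviour — here a plain `ArrayList` plays the role of the WriteBatch/ThreadCache, and its fail-fast iterator plays the open IQ iterator:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<String> buffer = new ArrayList<>();
        buffer.add("uncommitted-record");
        Iterator<String> openIqIterator = buffer.iterator(); // an "open IQ iterator"
        buffer.clear(); // the commit path clears the buffer...
        boolean invalidated = false;
        try {
            openIqIterator.next(); // ...and the open iterator is now unusable
        } catch (ConcurrentModificationException e) {
            invalidated = true;
        }
        if (!invalidated) throw new AssertionError("expected iterator invalidation");
    }
}
```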
--
Nick

On Wed, 13 Sept 2023 at 16:58, Nick Telford  wrote:

> Hi Bruno,
>
> I've updated the KIP based on our conversation. The only things I've not
> yet done are:
>
> 1. Using transactions under ALOS and EOS.
> 2. Making IsolationLevel a query-time constraint, rather than linking it
> to the processing.guarantee.
>
> There's a wrinkle that makes this a challenge: Interactive Queries that
> open an Iterator, when using transactions and READ_UNCOMMITTED.
> The problem is that under READ_UNCOMMITTED, queries need to be able to
> read records from the currently uncommitted transaction buffer
> (WriteBatch). This includes for Iterators, which should iterate both the
> transaction buffer and underlying database (using
> WriteBatch#iteratorWithBase()).
>
> The issue is that when the StreamThread commits, it writes the current
> WriteBatch to RocksDB *and then clears the WriteBatch*. Clearing the
> WriteBatch while an Interactive Query holds an open Iterator on it will
> invalidate the Iterator. Worse, it turns out that Iterators over a
> WriteBatch become invalidated not just when the WriteBatch is cleared, but
> also when the Iterators' current key receives a new write.
>
> Now that I'm writing this, I remember that this is the major reason that I
> switched the original design from having a query-time IsolationLevel to
> having the IsolationLevel linked to the transactionality of the stores
> themselves.
>
> It *might* be possible to resolve this, by having a "chain" of
> WriteBatches, with the StreamThread switching to a new WriteBatch whenever
> a new Interactive Query attempts to read from the database, but that could
> cause some performance problems/memory pressure when subjected to a high
> Interactive Query load. It would also reduce the efficiency of WriteBatches
> on-commit, as we'd have to write N WriteBatches, where N is the number of
> Interactive Queries since the last commit.
>
> I realise this is getting into the weeds of the implementation, and you'd
> rather we focus on the API for now, but I think it's important to consider
> how to implement the desired API, in case we come up with an API that
> cannot be implemented efficiently, or even at all!
>
> Thoughts?
> --
> Nick
>
> On Wed, 13 Sept 2023 at 13:03, Bruno Cadonna  wrote:
>
>> Hi Nick,
>>
>> 6.
>> Of course, you are right! My bad!
>> Wiping out the state in the downgrading case is fine.
>>
>>
>> 3a.
>> Focus on the public facing changes for the KIP. We will manage to get
>> the internals right. Regarding state stores that do not support
>> READ_COMMITTED, they should throw an error stating that they do not
>> support READ_COMMITTED. No need to adapt all state stores immediately.
>>
>> 3b.
>> I am in favor of using transactions also for ALOS.
>>
>>
>> Best,
>> Bruno
>>
>> On 9/13/23 11:57 AM, Nick Telford wrote:
>> > Hi Bruno,
>> >
>> > Thanks for getting back to me!
>> >
>> > 2.
>> > The fact that implementations can always track estimated memory usage in
>> > the wrapper is a good point. I can remove -1 as an option, and I'll
>> clarify
>> > the JavaDoc that 0 is not just for non-transactional stores, which is
>> > currently misleading.
>> >
>> > 6.
>> > The problem with catching the exception in the downgrade process is that
>> > would require new code in the Kafka version being downgraded to. Since
>> > users could conceivably downgrade to almost *any* older version of Kafka
>> > Streams, I'm not sure how we could add that code?
>> > The only way I can think of doing it would be to provide a dedicated
>> > downgrade tool, that goes through every local store and removes the
>> > offsets column families. But that seems like an unnecessary amount of
>> extra
>> > code to maintain just to handle a somewhat niche situation, when the
>> > alternative (automatically wipe and restore stores) should be
>> acceptable.
>> >
>> > 1, 4, 5: Agreed. I'll make the changes you've requested.
>> >
>> > 3a.
>> > I agree that IsolationLevel makes more sense at query-time, and I
>> actually
>> > initially attempted to place the IsolationLevel at query-time, but I ran

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-13 Thread Nick Telford
Hi Bruno,

I've updated the KIP based on our conversation. The only things I've not
yet done are:

1. Using transactions under ALOS and EOS.
2. Making IsolationLevel a query-time constraint, rather than linking it to
the processing.guarantee.

There's a wrinkle that makes this a challenge: Interactive Queries that
open an Iterator, when using transactions and READ_UNCOMMITTED.
The problem is that under READ_UNCOMMITTED, queries need to be able to read
records from the currently uncommitted transaction buffer (WriteBatch).
This includes for Iterators, which should iterate both the transaction
buffer and underlying database (using WriteBatch#iteratorWithBase()).

The issue is that when the StreamThread commits, it writes the current
WriteBatch to RocksDB *and then clears the WriteBatch*. Clearing the
WriteBatch while an Interactive Query holds an open Iterator on it will
invalidate the Iterator. Worse, it turns out that Iterators over a
WriteBatch become invalidated not just when the WriteBatch is cleared, but
also when the Iterators' current key receives a new write.
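The buffer-over-base read path described above (reads consult the uncommitted buffer first, falling back to the committed store — conceptually what RocksDB's WriteBatchWithIndex base-iterator support provides) can be modelled with plain maps. This is a toy sketch of the semantics, not the actual store implementation:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class Main {
    static class TransactionalStore {
        final NavigableMap<String, String> base = new TreeMap<>();   // committed data
        final NavigableMap<String, String> buffer = new TreeMap<>(); // uncommitted "WriteBatch"

        void put(String k, String v) { buffer.put(k, v); }

        String get(String k, boolean readCommitted) {
            // READ_UNCOMMITTED overlays the buffer on the base store.
            if (!readCommitted && buffer.containsKey(k)) return buffer.get(k);
            return base.get(k);
        }

        // Analogue of db.write(batch) followed by batch.clear() -- the clear is
        // what invalidates any iterator still open on the buffer.
        void commit() { base.putAll(buffer); buffer.clear(); }
    }

    public static void main(String[] args) {
        TransactionalStore store = new TransactionalStore();
        store.put("k", "v1");
        // READ_UNCOMMITTED sees the buffered write; READ_COMMITTED does not.
        if (!"v1".equals(store.get("k", false))) throw new AssertionError();
        if (store.get("k", true) != null) throw new AssertionError();
        store.commit();
        if (!"v1".equals(store.get("k", true))) throw new AssertionError();
    }
}
```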

Now that I'm writing this, I remember that this is the major reason that I
switched the original design from having a query-time IsolationLevel to
having the IsolationLevel linked to the transactionality of the stores
themselves.

It *might* be possible to resolve this, by having a "chain" of
WriteBatches, with the StreamThread switching to a new WriteBatch whenever
a new Interactive Query attempts to read from the database, but that could
cause some performance problems/memory pressure when subjected to a high
Interactive Query load. It would also reduce the efficiency of WriteBatches
on-commit, as we'd have to write N WriteBatches, where N is the number of
Interactive Queries since the last commit.

I realise this is getting into the weeds of the implementation, and you'd
rather we focus on the API for now, but I think it's important to consider
how to implement the desired API, in case we come up with an API that
cannot be implemented efficiently, or even at all!

Thoughts?
--
Nick

On Wed, 13 Sept 2023 at 13:03, Bruno Cadonna  wrote:

> Hi Nick,
>
> 6.
> Of course, you are right! My bad!
> Wiping out the state in the downgrading case is fine.
>
>
> 3a.
> Focus on the public facing changes for the KIP. We will manage to get
> the internals right. Regarding state stores that do not support
> READ_COMMITTED, they should throw an error stating that they do not
> support READ_COMMITTED. No need to adapt all state stores immediately.
>
> 3b.
> I am in favor of using transactions also for ALOS.
>
>
> Best,
> Bruno
>
> On 9/13/23 11:57 AM, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for getting back to me!
> >
> > 2.
> > The fact that implementations can always track estimated memory usage in
> > the wrapper is a good point. I can remove -1 as an option, and I'll
> clarify
> > the JavaDoc that 0 is not just for non-transactional stores, which is
> > currently misleading.
> >
> > 6.
> > The problem with catching the exception in the downgrade process is that
> > would require new code in the Kafka version being downgraded to. Since
> > users could conceivably downgrade to almost *any* older version of Kafka
> > Streams, I'm not sure how we could add that code?
> > The only way I can think of doing it would be to provide a dedicated
> > downgrade tool, that goes through every local store and removes the
> > offsets column families. But that seems like an unnecessary amount of
> extra
> > code to maintain just to handle a somewhat niche situation, when the
> > alternative (automatically wipe and restore stores) should be acceptable.
> >
> > 1, 4, 5: Agreed. I'll make the changes you've requested.
> >
> > 3a.
> > I agree that IsolationLevel makes more sense at query-time, and I
> actually
> > initially attempted to place the IsolationLevel at query-time, but I ran
> > into some problems:
> > - The key issue is that, under ALOS we're not staging writes in
> > transactions, so can't perform writes at the READ_COMMITTED isolation
> > level. However, this may be addressed if we decide to *always* use
> > transactions as discussed under 3b.
> > - IQv1 and IQv2 have quite different implementations. I remember having
> > some difficulty understanding the IQv1 internals, which made it difficult
> > to determine what needed to be changed. However, I *think* this can be
> > addressed for both implementations by wrapping the RocksDBStore in an
> > IsolationLevel-dependent wrapper, that overrides read methods (get, etc.)
> > to either read directly from the database or from the ongoing
> transaction.
> > But IQv1 might still be difficult

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-13 Thread Nick Telford
Bruno,

Thinking about 3a. in addition to adding the IsolationLevel to
QueryStoreParameters and Query, what about also adding a method like
"ReadOnlyKeyValueStore<K, V> view(IsolationLevel isolationLevel)" to ReadOnlyKeyValueStore?

This would enable us to easily select/switch between IsolationLevels, even
if the StateStore has many layers of wrappers (as is the case at the point
IQv1 deals with the store). Would this be acceptable, or do you have
another approach in mind?
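A toy sketch of how such a hook could thread through the wrapper layers: each wrapper simply forwards view() to its delegate, so the outermost store IQv1 sees can still select an isolation level. All names and the default-method approach are purely illustrative, not the proposed API:

```java
public class Main {
    enum IsolationLevel { READ_UNCOMMITTED, READ_COMMITTED }

    interface ReadOnlyKeyValueStore<K, V> {
        V get(K key);
        // Default: a store that has no notion of isolation returns itself.
        default ReadOnlyKeyValueStore<K, V> view(IsolationLevel level) { return this; }
    }

    // A wrapper layer (e.g. metering or caching) forwards view() downwards.
    static class Wrapper<K, V> implements ReadOnlyKeyValueStore<K, V> {
        final ReadOnlyKeyValueStore<K, V> inner;
        Wrapper(ReadOnlyKeyValueStore<K, V> inner) { this.inner = inner; }
        public V get(K key) { return inner.get(key); }
        public ReadOnlyKeyValueStore<K, V> view(IsolationLevel level) {
            return inner.view(level); // delegate through the chain
        }
    }

    public static void main(String[] args) {
        ReadOnlyKeyValueStore<String, String> base = key -> "committed:" + key;
        ReadOnlyKeyValueStore<String, String> wrapped = new Wrapper<>(new Wrapper<>(base));
        // view() unwraps through both layers down to the innermost store.
        if (!"committed:a".equals(wrapped.view(IsolationLevel.READ_COMMITTED).get("a")))
            throw new AssertionError();
    }
}
```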

Regards,
Nick

On Wed, 13 Sept 2023 at 10:57, Nick Telford  wrote:

> Hi Bruno,
>
> Thanks for getting back to me!
>
> 2.
> The fact that implementations can always track estimated memory usage in
> the wrapper is a good point. I can remove -1 as an option, and I'll clarify
> the JavaDoc that 0 is not just for non-transactional stores, which is
> currently misleading.
>
> 6.
> The problem with catching the exception in the downgrade process is that
> would require new code in the Kafka version being downgraded to. Since
> users could conceivably downgrade to almost *any* older version of Kafka
> Streams, I'm not sure how we could add that code?
> The only way I can think of doing it would be to provide a dedicated
> downgrade tool, that goes through every local store and removes the
> offsets column families. But that seems like an unnecessary amount of extra
> code to maintain just to handle a somewhat niche situation, when the
> alternative (automatically wipe and restore stores) should be acceptable.
>
> 1, 4, 5: Agreed. I'll make the changes you've requested.
>
> 3a.
> I agree that IsolationLevel makes more sense at query-time, and I actually
> initially attempted to place the IsolationLevel at query-time, but I ran
> into some problems:
> - The key issue is that, under ALOS we're not staging writes in
> transactions, so can't perform writes at the READ_COMMITTED isolation
> level. However, this may be addressed if we decide to *always* use
> transactions as discussed under 3b.
> - IQv1 and IQv2 have quite different implementations. I remember having
> some difficulty understanding the IQv1 internals, which made it difficult
> to determine what needed to be changed. However, I *think* this can be
> addressed for both implementations by wrapping the RocksDBStore in an
> IsolationLevel-dependent wrapper, that overrides read methods (get, etc.)
> to either read directly from the database or from the ongoing transaction.
> But IQv1 might still be difficult.
> - If IsolationLevel becomes a query constraint, then all other StateStores
> will need to respect it, including the in-memory stores. This would require
> us to adapt in-memory stores to stage their writes so they can be isolated
> from READ_COMMITTED queries. It would also become an important
> consideration for third-party stores on upgrade, as without changes, they
> would not support READ_COMMITTED queries correctly.
>
> Ultimately, I may need some help making the necessary change to IQv1 to
> support this, but I don't think it's fundamentally impossible, if we want
> to pursue this route.
>
> 3b.
> The main reason I chose to keep ALOS un-transactional was to minimize
> behavioural change for most users (I believe most Streams users use the
> default configuration, which is ALOS). That said, it's clear that if ALOS
> also used transactional stores, the only change in behaviour would be that
> it would become *more correct*, which could be considered a "bug fix" by
> users, rather than a change they need to handle.
>
> I believe that performance using transactions (aka. RocksDB WriteBatches)
> should actually be *better* than the un-batched write-path that is
> currently used[1]. The only "performance" consideration will be the
> increased memory usage that transactions require. Given the mitigations for
> this memory that we have in place, I would expect that this is not a
> problem for most users.
>
> If we're happy to do so, we can make ALOS also use transactions.
>
> Regards,
> Nick
>
> Link 1:
> https://github.com/adamretter/rocksjava-write-methods-benchmark#results
>
> On Wed, 13 Sept 2023 at 09:41, Bruno Cadonna  wrote:
>
>> Hi Nick,
>>
>> Thanks for the updates and sorry for the delay on my side!
>>
>>
>> 1.
>> Making the default implementation for flush() a no-op sounds good to me.
>>
>>
>> 2.
>> I think what was bugging me here is that a third-party state store needs
>> to implement the state store interface. That means they need to
>> implement a wrapper around the actual state store as we do for RocksDB
>> with RocksDBStore. So, a third-party state store can always estimate the
>> uncommitted bytes, if it wants, because the wrap

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-13 Thread Nick Telford
 think if we move the isolation level to IQ (v1 and v2)?
> In the end this is the only component that really needs to specify the
> isolation level. It is similar to the Kafka consumer that can choose
> with what isolation level to read the input topic.
> For IQv1 the isolation level should go into StoreQueryParameters. For
> IQv2, I would add it to the Query interface.
>
> b) Point a) raises the question what should happen during at-least-once
> processing when the state store does not use transactions? John in the
> past proposed to also use transactions on state stores for
> at-least-once. I like that idea, because it avoids aggregating the same
> records over and over again in the case of a failure. We had a case in
> the past where a Streams applications in at-least-once mode was failing
> continuously for some reasons I do not remember before committing the
> offsets. After each failover, the app aggregated again and again the
> same records. Of course the aggregate increased to very wrong values
> just because of the failover. With transactions on the state stores we
> could have avoided this. The app would have output the same aggregate
> multiple times (i.e., after each failover) but at least the value of the
> aggregate would not depend on the number of failovers. Outputting the
> same aggregate multiple times would be incorrect under exactly-once but
> it is OK for at-least-once.
> If it makes sense to add a config to turn on and off transactions on
> state stores under at-least-once or just use transactions in any case is
> a question we should also discuss in this KIP. It depends a bit on the
> performance trade-off. Maybe to be safe, I would add a config.
>
>
> 4.
> Your points are all valid. I tend to say to keep the metrics around
> flush() until we remove flush() completely from the interface. Calls to
> flush() might still exist since existing processors might still call
> flush() explicitly as you mentioned in 1). For sure, we need to document
> how the metrics change due to the transactions in the upgrade notes.
>
>
> 5.
> I see. Then you should describe how the .position files are handled  in
> a dedicated section of the KIP or incorporate the description in the
> "Atomic Checkpointing" section instead of only mentioning it in the
> "Compatibility, Deprecation, and Migration Plan".
>
>
> 6.
> Describing upgrading and downgrading in the KIP is a good idea.
> Regarding downgrading, I think you could also catch the exception and do
> what is needed to downgrade, e.g., drop the column family. See here for
> an example:
>
>
> https://github.com/apache/kafka/blob/63fee01366e6ce98b9dfafd279a45d40b80e282d/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBTimestampedStore.java#L75
>
> It is a bit brittle, but it works.
>
>
> Best,
> Bruno
>
>
> On 8/24/23 12:18 PM, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for taking the time to review the KIP. I'm back from leave now and
> > intend to move this forwards as quickly as I can.
> >
> > Addressing your points:
> >
> > 1.
> > Because flush() is part of the StateStore API, it's exposed to custom
> > Processors, which might be making calls to flush(). This was actually the
> > case in a few integration tests.
> > To maintain as much compatibility as possible, I'd prefer not to make
> this
> > an UnsupportedOperationException, as it will cause previously working
> > Processors to start throwing exceptions at runtime.
> > I agree that it doesn't make sense for it to proxy commit(), though, as
> > that would cause it to violate the "StateStores commit only when the Task
> > commits" rule.
> > Instead, I think we should make this a no-op. That way, existing user
> > Processors will continue to work as-before, without violation of store
> > consistency that would be caused by premature flush/commit of StateStore
> > data to disk.
> > What do you think?
> >
> > 2.
> > As stated in the JavaDoc, when a StateStore implementation is
> > transactional, but is unable to estimate the uncommitted memory usage,
> the
> > method will return -1.
> > The intention here is to permit third-party implementations that may not
> be
> > able to estimate memory usage.
> >
> > Yes, it will be 0 when nothing has been written to the store yet. I
> thought
> > that was implied by "This method will return an approximation of the
> memory
> > would be freed by the next call to {@link #commit(Map)}" and "@return The
> > approximate size of all records awaiting {@link #commit(Map)}", however,
> I
> > can

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-11 Thread Nick Telford
ose who want to replicate the tests, you can find the branch
> of our streams app here:
>
> https://github.com/littlehorse-enterprises/littlehorse/tree/minor/testing-streams-forks
> . The example I ran was `examples/hundred-tasks`, and I ran the server with
> `./local-dev/do-server.sh one-partition`. The `STREAMS_TESTS.md` file has a
> detailed breakdown of the testing.
>
> Anyways, I'm super excited about this KIP and if a bit more future testing
> goes well, we plan to ship our product with a build of KIP-892, Speedb OSS,
> and potentially a few other minor tweaks that we are thinking about.
>
> Thanks Nick!
>
> Ride well,
> Colt McNealy
>
> *Founder, LittleHorse.dev*
>
>
> On Thu, Aug 24, 2023 at 3:19 AM Nick Telford 
> wrote:
>
> > Hi Bruno,
> >
> > Thanks for taking the time to review the KIP. I'm back from leave now and
> > intend to move this forwards as quickly as I can.
> >
> > Addressing your points:
> >
> > 1.
> > Because flush() is part of the StateStore API, it's exposed to custom
> > Processors, which might be making calls to flush(). This was actually the
> > case in a few integration tests.
> > To maintain as much compatibility as possible, I'd prefer not to make
> this
> > an UnsupportedOperationException, as it will cause previously working
> > Processors to start throwing exceptions at runtime.
> > I agree that it doesn't make sense for it to proxy commit(), though, as
> > that would cause it to violate the "StateStores commit only when the Task
> > commits" rule.
> > Instead, I think we should make this a no-op. That way, existing user
> > Processors will continue to work as-before, without violation of store
> > consistency that would be caused by premature flush/commit of StateStore
> > data to disk.
> > What do you think?
> >
> > 2.
> > As stated in the JavaDoc, when a StateStore implementation is
> > transactional, but is unable to estimate the uncommitted memory usage,
> the
> > method will return -1.
> > The intention here is to permit third-party implementations that may not
> be
> > able to estimate memory usage.
> >
> > Yes, it will be 0 when nothing has been written to the store yet. I
> thought
> > that was implied by "This method will return an approximation of the
> memory
> > would be freed by the next call to {@link #commit(Map)}" and "@return The
> > approximate size of all records awaiting {@link #commit(Map)}", however,
> I
> > can add it explicitly to the JavaDoc if you think this is unclear?
> >
> > 3.
> > I realise this is probably the most contentious point in my design, and
> I'm
> > open to changing it if I'm unable to convince you of the benefits.
> > Nevertheless, here's my argument:
> > The Interactive Query (IQ) API(s) are directly provided StateStores to
> > query, and it may be important for users to programmatically know which
> > mode the StateStore is operating under. If we simply provide an
> > "eosEnabled" boolean (as used throughout the internal streams engine), or
> > similar, then users will need to understand the operation and
> consequences
> > of each available processing mode and how it pertains to their
> StateStore.
> >
> > Interactive Query users aren't the only people that care about the
> > processing.mode/IsolationLevel of a StateStore: implementers of custom
> > StateStores also need to understand the behaviour expected of their
> > implementation. KIP-892 introduces some assumptions into the Streams
> Engine
> > about how StateStores operate under each processing mode, and it's
> > important that custom implementations adhere to those assumptions in
> order
> > to maintain the consistency guarantees.
> >
> > IsolationLevels provide a high-level contract on the behaviour of the
> > StateStore: a user knows that under READ_COMMITTED, they will see writes
> > only after the Task has committed, and under READ_UNCOMMITTED they will
> see
> > writes immediately. No understanding of the details of each
> processing.mode
> > is required, either for IQ users or StateStore implementers.
> >
> > An argument can be made that these contractual guarantees can simply be
> > documented for the processing.mode (i.e. that exactly-once and
> > exactly-once-v2 behave like READ_COMMITTED and at-least-once behaves like
> > READ_UNCOMMITTED), but there are several small issues with this I'd
> prefer
> > to avoid:
> >
> >- Where would we document these contracts, in a way that is difficult
> >for 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-08-24 Thread Nick Telford
etty bad to me:

   1. Have them record calls to commit(), which would be misleading, as
   data is no longer explicitly "flushed" to disk by this call.
   2. Have them record nothing at all, which is equivalent to removing the
   metrics, except that users will see the metric still exists and so assume
   that the metric is correct, and that there's a problem with their system
   when there isn't.

I agree that removing them is also a bad solution, and I'd like some
guidance on the best path forward here.

5.
Position files are updated on every write to a StateStore. Since our writes
are now buffered until commit(), we can't update the Position file until
commit() has been called, otherwise it would be inconsistent with the data
in the event of a rollback. Consequently, we need to manage these offsets
the same way we manage the checkpoint offsets, and ensure they're only
written on commit().
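A minimal model of staging position/changelog offsets together with the buffered writes, so both commit or roll back as a unit. Names are illustrative, not the actual Kafka Streams internals; in the real design both would land in a single atomic RocksDB write:

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    static class Store {
        final Map<String, String> committedData = new HashMap<>();
        final Map<String, String> stagedData = new HashMap<>();
        long committedOffset = -1, stagedOffset = -1;

        void put(String k, String v, long changelogOffset) {
            stagedData.put(k, v);
            stagedOffset = changelogOffset; // position advances with the staged write
        }

        void commit() { // data and offset are published together (one "WriteBatch")
            committedData.putAll(stagedData);
            committedOffset = stagedOffset;
            stagedData.clear();
        }

        void rollback() { stagedData.clear(); stagedOffset = committedOffset; }
    }

    public static void main(String[] args) {
        Store s = new Store();
        s.put("a", "1", 100L);
        s.rollback(); // failure before commit: offset stays consistent with data
        if (s.committedOffset != -1 || !s.committedData.isEmpty()) throw new AssertionError();
        s.put("a", "1", 100L);
        s.commit();
        if (s.committedOffset != 100L) throw new AssertionError();
    }
}
```

If the position were instead updated on every write, the rollback above would leave it pointing past data that no longer exists — the inconsistency the paragraph describes.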

6.
Agreed, although I'm not exactly sure yet what tests to write. How explicit
do we need to be here in the KIP?

As for upgrade/downgrade: upgrade is designed to be seamless, and we should
definitely add some tests around that. Downgrade, it transpires, isn't
currently possible, as the extra column family for offset storage is
incompatible with the pre-KIP-892 implementation: when you open a RocksDB
database, you must open all available column families or receive an error.
What currently happens on downgrade is that it attempts to open the store,
throws an error about the offsets column family not being opened, which
triggers a wipe and rebuild of the Task. Given that downgrades should be
uncommon, I think this is acceptable behaviour, as the end-state is
consistent, even if it results in an undesirable state restore.

Should I document the upgrade/downgrade behaviour explicitly in the KIP?

--

Regards,
Nick


On Mon, 14 Aug 2023 at 22:31, Bruno Cadonna  wrote:

> Hi Nick!
>
> Thanks for the updates!
>
> 1.
> Why does StateStore#flush() default to
> StateStore#commit(Collections.emptyMap())?
> Since calls to flush() will not exist anymore after this KIP is
> released, I would rather throw an unsupported operation exception by
> default.
>
>
> 2.
> When would a state store return -1 from
> StateStore#approximateNumUncommittedBytes() while being transactional?
>
> Wouldn't StateStore#approximateNumUncommittedBytes() also return 0 if
> the state store is transactional but nothing has been written to the
> state store yet?
>
>
> 3.
> Sorry for bringing this up again. Does this KIP really need to introduce
> StateStoreContext#isolationLevel()? StateStoreContext has already
> appConfigs() which basically exposes the same information, i.e., if EOS
> is enabled or not.
> In one of your previous e-mails you wrote:
>
> "My idea was to try to keep the StateStore interface as loosely coupled
> from the Streams engine as possible, to give implementers more freedom,
> and reduce the amount of internal knowledge required."
>
> While I understand the intent, I doubt that it decreases the coupling of
> a StateStore interface and the Streams engine. READ_COMMITTED only
> applies to IQ but not to reads by processors. Thus, implementers need to
> understand how Streams accesses the state stores.
>
> I would like to hear what others think about this.
>
>
> 4.
> Great exposing new metrics for transactional state stores! However, I
> would prefer to add new metrics and deprecate (in the docs) the old
> ones. You can find examples of deprecated metrics here:
> https://kafka.apache.org/documentation/#selector_monitoring
>
>
> 5.
> Why does the KIP mention position files? I do not think they are related
> to transactions or flushes.
>
>
> 6.
> I think we will also need to adapt/add integration tests besides unit
> tests. Additionally, we probably need integration or system tests to
> verify that upgrades and downgrades between transactional and
> non-transactional state stores work as expected.
>
>
> Best,
> Bruno
>
>
>
>
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-21 Thread Nick Telford
One more thing: I noted John's suggestion in the KIP, under "Rejected
Alternatives". I still think it's an idea worth pursuing, but I believe
that it's out of the scope of this KIP, because it solves a different set
of problems to this KIP, and the scope of this one has already grown quite
large!


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-21 Thread Nick Telford
Hi everyone,

I've updated the KIP (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores)
with the latest changes; mostly bringing back "Atomic Checkpointing" (for
what feels like the 10th time!). I think the one thing missing is some
changes to metrics (notably the store "flush" metrics will need to be
renamed to "commit").

The reason I brought back Atomic Checkpointing was to decouple store flush
from store commit. This is important, because with Transactional
StateStores, we now need to call "flush" on *every* Task commit, and not
just when the StateStore is closing, otherwise our transaction buffer will
never be written and persisted, instead growing unbounded! I experimented
with some simple solutions, like forcing a store flush whenever the
transaction buffer was likely to exceed its configured size, but this was
brittle: it prevented the transaction buffer from being configured to be
unbounded, and it still would have required explicit flushes of RocksDB,
yielding sub-optimal performance and memory utilization.

I deemed Atomic Checkpointing to be the "right" way to resolve this
problem. By ensuring that the changelog offsets that correspond to the most
recently written records are always atomically written to the StateStore
(by writing them to the same transaction buffer), we can avoid forcibly
flushing the RocksDB memtables to disk, letting RocksDB flush them only
when necessary, without losing any of our consistency guarantees. See the
updated KIP for more info.

I have fully implemented these changes, although I'm still not entirely
happy with the implementation for segmented StateStores, so I plan to
refactor that. Despite that, all tests pass. If you'd like to try out or
review this highly experimental and incomplete branch, it's available here:
https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0. Note: it's built
against Kafka 3.5.0 so that I had a stable base to build and test it on,
and to enable easy apples-to-apples comparisons in a live environment. I
plan to rebase it against trunk once it's nearer completion and has been
proven on our main application.

I would really appreciate help in reviewing and testing:
- Segmented (Versioned, Session and Window) stores
- Global stores

I do not currently use either of these, so my primary test environment
doesn't cover those areas.

I'm going on Parental Leave starting next week for a few weeks, so will not
have time to move this forward until late August. That said, your feedback
is welcome and appreciated, I just won't be able to respond as quickly as
usual.

Regards,
Nick


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-03 Thread Nick Telford
Hi Bruno

Yes, that's correct, although the impact on IQ is not something I had
considered.

What about atomically updating the state store from the transaction
> buffer every commit interval and writing the checkpoint (thus, flushing
> the memtable) every configured amount of data and/or number of commit
> intervals?
>

I'm not quite sure I follow. Are you suggesting that we add an additional
config for the max number of commit intervals between checkpoints? That
way, we would checkpoint *either* when the transaction buffers are nearly
full, *OR* whenever a certain number of commit intervals have elapsed,
whichever comes first?

That certainly seems reasonable, although this re-ignites an earlier debate
about whether a config should be measured in "number of commit intervals",
instead of just an absolute time.

FWIW, I realised that this issue is the reason I was pursuing the Atomic
Checkpoints, as it de-couples memtable flush from checkpointing, which
enables us to just checkpoint on every commit without any performance
impact. Atomic Checkpointing is definitely the "best" solution, but I'm not
sure if this is enough to bring it back into this KIP.

I'm currently working on moving all the transactional logic directly into
RocksDBStore itself, which does away with the StateStore#newTransaction
method, and reduces the number of new classes introduced, significantly
reducing the complexity. If it works, and the complexity is drastically
reduced, I may try bringing back Atomic Checkpoints into this KIP.

Regards,
Nick

On Mon, 3 Jul 2023 at 15:27, Bruno Cadonna  wrote:

> Hi Nick,
>
> Thanks for the insights! Very interesting!
>
> As far as I understand, you want to atomically update the state store
> from the transaction buffer, flush the memtable of a state store and
> write the checkpoint not after the commit time elapsed but after the
> transaction buffer reached a size that would lead to exceeding
> statestore.transaction.buffer.max.bytes before the next commit interval
> ends.
> That means, the Kafka transaction would commit every commit interval but
> the state store will only be atomically updated roughly every
> statestore.transaction.buffer.max.bytes of data. Also IQ would then only
> see new data roughly every statestore.transaction.buffer.max.bytes.
> After a failure the state store needs to restore up to
> statestore.transaction.buffer.max.bytes.
>
> Is this correct?
>
> What about atomically updating the state store from the transaction
> buffer every commit interval and writing the checkpoint (thus, flushing
> the memtable) every configured amount of data and/or number of commit
> intervals? In such a way, we would have the same delay for records
> appearing in output topics and IQ because both would appear when the
> Kafka transaction is committed. However, after a failure the state store
> still needs to restore up to statestore.transaction.buffer.max.bytes and
> it might restore data that is already in the state store because the
> checkpoint lags behind the last stable offset (i.e. the last committed
> offset) of the changelog topics. Restoring data that is already in the
> state store is idempotent, so EOS should not be violated.
> This solution needs at least one new config to specify when a checkpoint
> should be written.
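[Editor's note: a minimal toy sketch of the policy Bruno proposes above — commit the Kafka transaction every commit interval, but only write the checkpoint (flushing the memtable) after a configured amount of data and/or number of commit intervals. The class and config names are illustrative, not actual Kafka Streams APIs.]

```java
// Hypothetical checkpoint policy: commit every interval, checkpoint only when
// either the buffered-bytes bound or the commit-interval count is reached.
class CheckpointPolicy {
    private final long maxBufferedBytes;   // hypothetical "checkpoint bytes" config
    private final int maxCommitIntervals;  // hypothetical "checkpoint intervals" config
    private long bufferedBytes = 0;
    private int intervalsSinceCheckpoint = 0;

    CheckpointPolicy(long maxBufferedBytes, int maxCommitIntervals) {
        this.maxBufferedBytes = maxBufferedBytes;
        this.maxCommitIntervals = maxCommitIntervals;
    }

    /** Called on every commit interval; returns true if a checkpoint is due. */
    boolean onCommit(long bytesWrittenSinceLastCommit) {
        bufferedBytes += bytesWrittenSinceLastCommit;
        intervalsSinceCheckpoint++;
        if (bufferedBytes >= maxBufferedBytes
                || intervalsSinceCheckpoint >= maxCommitIntervals) {
            bufferedBytes = 0;           // checkpoint written, buffer flushed
            intervalsSinceCheckpoint = 0;
            return true;
        }
        return false;
    }
}
```

Under this sketch, restoration after a failure is bounded by the byte threshold, while output topics and IQ still advance every commit interval.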
>
>
>
> A small correction to your previous e-mail that does not change anything
> you said: Under alos the default commit interval is 30 seconds, not five
> seconds.
>
>
> Best,
> Bruno
>
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-01 Thread Nick Telford
Hi everyone,

I've begun performance testing my branch on our staging environment,
putting it through its paces in our non-trivial application. I'm already
observing the same increased flush rate that we saw the last time we
attempted to use a version of this KIP, but this time, I think I know why.

Pre-KIP-892, StreamTask#postCommit, which is called at the end of the Task
commit process, has the following behaviour:

   - Under ALOS: checkpoint the state stores. This includes
   flushing memtables in RocksDB. This is acceptable because the default
   commit.interval.ms is 5 seconds, so forcibly flushing memtables every 5
   seconds is acceptable for most applications.
   - Under EOS: checkpointing is not done, *unless* it's being forced, due
   to e.g. the Task closing or being revoked. This means that under normal
   processing conditions, the state stores will not be checkpointed, and will
   not have memtables flushed at all, unless RocksDB decides to flush them on
   its own. Checkpointing stores and force-flushing their memtables is only
   done when a Task is being closed.
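[Editor's note: the pre-KIP-892 rule described above can be condensed into a single predicate. This is an illustrative simplification, not the actual StreamTask code.]

```java
// Pre-KIP-892 StreamTask#postCommit checkpointing rule, condensed:
// under ALOS we always checkpoint (flushing memtables); under EOS we only
// checkpoint when it is being forced, e.g. when the Task closes or is revoked.
final class CheckpointRule {
    private CheckpointRule() {}

    static boolean shouldCheckpoint(boolean eosEnabled, boolean enforced) {
        return !eosEnabled || enforced;
    }
}
```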

Under EOS, KIP-892 needs to checkpoint stores on at least *some* normal
Task commits, in order to write the RocksDB transaction buffers to the
state stores, and to ensure the offsets are synced to disk to prevent
restores from getting out of hand. Consequently, my current implementation
calls maybeCheckpoint on *every* Task commit, which is far too frequent.
This causes checkpoints every 10,000 records, which is a change in flush
behaviour, potentially causing performance problems for some applications.

I'm looking into possible solutions, and I'm currently leaning towards
using the statestore.transaction.buffer.max.bytes configuration to
checkpoint Tasks once we are likely to exceed it. This would complement the
existing "early Task commit" functionality that this configuration
provides, in the following way:

   - Currently, we use statestore.transaction.buffer.max.bytes to force an
   early Task commit if processing more records would cause our state store
   transactions to exceed the memory assigned to them.
   - New functionality: when a Task *does* commit, we will not checkpoint
   the stores (and hence flush the transaction buffers) unless we expect to
   cross the statestore.transaction.buffer.max.bytes threshold before the next
   commit
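[Editor's note: a hedged sketch of the heuristic outlined above — only checkpoint on a Task commit when the uncommitted bytes look likely to cross statestore.transaction.buffer.max.bytes before the next commit. The fill-rate estimate (bytes written since the last commit) is my own illustrative choice.]

```java
// Toy gauge for the "checkpoint before we would exceed the buffer" heuristic.
class TransactionBufferGauge {
    private final long maxBytes;     // statestore.transaction.buffer.max.bytes
    private long previousBytes = 0;  // buffer size seen at the previous commit

    TransactionBufferGauge(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** On a Task commit: true if we expect to exceed maxBytes before the next one. */
    boolean shouldCheckpoint(long currentBytes) {
        long bytesPerInterval = Math.max(0, currentBytes - previousBytes);
        boolean checkpoint = currentBytes + bytesPerInterval >= maxBytes;
        // a checkpoint flushes the transaction buffers, resetting the baseline
        previousBytes = checkpoint ? 0 : currentBytes;
        return checkpoint;
    }
}
```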

I'm also open to suggestions.

Regards,
Nick

On Thu, 22 Jun 2023 at 14:06, Nick Telford  wrote:

> Hi Bruno!
>
> 3.
> By "less predictable for users", I meant in terms of understanding the
> performance profile under various circumstances. The more complex the
> solution, the more difficult it would be for users to understand the
> performance they see. For example, spilling records to disk when the
> transaction buffer reaches a threshold would, I expect, reduce write
> throughput. This reduction in write throughput could be unexpected, and
> potentially difficult to diagnose/understand for users.
> At the moment, I think the "early commit" concept is relatively
> straightforward; it's easy to document, and conceptually fairly obvious to
> users. We could probably add a metric to make it easier to understand when
> it happens though.
>
> 3. (the second one)
> The IsolationLevel is *essentially* an indirect way of telling StateStores
> whether they should be transactional. READ_COMMITTED essentially requires
> transactions, because it dictates that two threads calling
> `newTransaction()` should not see writes from the other transaction until
> they have been committed. With READ_UNCOMMITTED, all bets are off, and
> stores can allow threads to observe written records at any time, which is
> essentially "no transactions". That said, StateStores are free to implement
> these guarantees however they can, which is a bit more relaxed than
> dictating "you must use transactions". For example, with RocksDB we would
> implement these as READ_COMMITTED == WBWI-based "transactions",
> READ_UNCOMMITTED == direct writes to the database. But with other storage
> engines, it might be preferable to *always* use transactions, even when
> unnecessary; or there may be storage engines that don't provide
> transactions, but the isolation guarantees can be met using a different
> technique.
> My idea was to try to keep the StateStore interface as loosely coupled
> from the Streams engine as possible, to give implementers more freedom, and
> reduce the amount of internal knowledge required.
> That said, I understand that "IsolationLevel" might not be the right
> abstraction, and we can always make it much more explicit if required, e.g.
> boolean transactional()
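[Editor's note: a self-contained toy model of the isolation semantics described above — nothing here is Kafka Streams or RocksDB code. Under READ_COMMITTED, writes are staged and invisible to readers until commit; under READ_UNCOMMITTED, they are immediately visible.]

```java
import java.util.HashMap;
import java.util.Map;

enum IsolationLevel { READ_COMMITTED, READ_UNCOMMITTED }

// Toy store: READ_COMMITTED buffers writes (a stand-in for a WBWI-based
// transaction); READ_UNCOMMITTED writes directly to the underlying map.
class ToyStore {
    private final IsolationLevel level;
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> buffer = new HashMap<>();

    ToyStore(IsolationLevel level) { this.level = level; }

    void put(String key, String value) {
        if (level == IsolationLevel.READ_COMMITTED) {
            buffer.put(key, value);      // staged: invisible until commit()
        } else {
            committed.put(key, value);   // direct write: immediately visible
        }
    }

    String get(String key) { return committed.get(key); }

    void commit() {
        committed.putAll(buffer);        // atomically publish staged writes
        buffer.clear();
    }
}
```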
>
> 7-8.
> I can make these changes either later today or tomorrow.
>
> Small update:
> I've rebased my branch on trunk and fixed a bunch of issues

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-22 Thread Nick Telford
 of memory bytes to be used to
> buffer uncommitted state-store records." My thinking was that even if a
> state store spills uncommitted bytes to disk, limiting the overall bytes
> might make sense. Thinking about it again and considering the recent
> discussions, it does not make too much sense anymore.
> I like the name statestore.transaction.buffer.max.bytes that you proposed.
>
> 8.
> A high-level description (without implementation details) of how Kafka
> Streams will manage the commit of changelog transactions, state store
> transactions and checkpointing would be great. Would be great if you
> could also add some sentences about the behavior in case of a failure.
> For instance how does a transactional state store recover after a
> failure or what happens with the transaction buffer, etc. (that is what
> I meant by "fail-over" in point 9.)
>
> Best,
> Bruno
>
>


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread Nick Telford
Sorry John, I didn't mean to mis-characterize it like that. I was mostly
referring to disabling memtables. AFAIK the SstFileWriter API is primarily
designed for bulk ingest, e.g. for bootstrapping a database from a backup,
rather than during normal operation of an online database. That said, I was
overly alarmist in my phrasing.

My concern is only that, while the concept seems quite reasonable, there
are no doubt hidden issues lurking.

On Wed, 21 Jun 2023 at 18:25, John Roesler  wrote:

> Thanks Nick,
>
> That sounds good to me.
>
> I can't let (2) slide, though.. Writing and ingesting SST files is not a
> RocksDB internal, but rather a supported usage pattern on public APIs.
> Regardless, I think your overall preference is fine with me, especially
> if we can internalize this change within the store implementation itself.
>
> Thanks,
> -John
>
>


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread Nick Telford
Hi Bruno,

1.
Isn't this exactly the same issue that WriteBatchWithIndex transactions
have, whereby exceeding (or likely to exceed) configured memory needs to
trigger an early commit?
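[Editor's note: an abstract sketch of the early-commit guard being discussed. It models a size-bounded transaction buffer in the spirit of RocksDB's WriteBatchWithIndex, but with no RocksDB dependency; all names are illustrative.]

```java
// Size-bounded write batch: signals an early commit when the next write
// would exceed the configured memory bound.
class BoundedWriteBatch {
    private final long maxBytes;
    private long sizeBytes = 0;

    BoundedWriteBatch(long maxBytes) { this.maxBytes = maxBytes; }

    /** True if the caller must commit early before applying this write. */
    boolean wouldOverflow(byte[] key, byte[] value) {
        return sizeBytes + key.length + value.length > maxBytes;
    }

    void put(byte[] key, byte[] value) {
        sizeBytes += key.length + value.length;
    }

    void commit() { sizeBytes = 0; } // batch written to the store and reset
}
```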

2.
This is one of my big concerns. Ultimately, any approach based on cracking
open RocksDB internals and using it in ways it's not really designed for is
likely to have some unforeseen performance or consistency issues.

3.
What's your motivation for removing these early commits? While not ideal, I
think they're a decent compromise to ensure consistency whilst maintaining
good and predictable performance.
All 3 of your suggested ideas seem *very* complicated, and might actually
make behaviour less predictable for users as a consequence.

I'm a bit concerned that the scope of this KIP is growing a bit out of
control. While it's good to discuss ideas for future improvements, I think
it's important to narrow the scope down to a design that achieves the most
pressing objectives (constant sized restorations during dirty
close/unexpected errors). Any design that this KIP produces can ultimately
be changed in the future, especially if the bulk of it is internal
behaviour.

I'm going to spend some time next week trying to re-work the original
WriteBatchWithIndex design to remove the newTransaction() method, such that
it's just an implementation detail of RocksDBStore. That way, if we want to
replace WBWI with something in the future, like the SST file management
outlined by John, then we can do so with little/no API changes.

Regards,

Nick


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-20 Thread Nick Telford
Here's what I'm thinking: based on Bruno's earlier feedback, I'm going to
try to simplify my original design down such that it needs no/minimal
changes to the public interface.

If that succeeds, then it should also be possible to transparently
implement the "no memtables" solution as a performance optimization when
the record cache is enabled. I consider this approach only an optimisation,
because of the need to still support stores with the cache disabled.

For that reason, I think the "no memtables" approach would probably best be
suited as a follow-up KIP, but that we keep it in mind during the design of
this one.

What do you think?

Regards,
Nick


On Tue, 20 Jun 2023, 22:26 John Roesler,  wrote:

> Oh, that's a good point.
>
> On the topic of a behavioral switch for disabled caches, the typical use
> case for disabling the cache is to cause each individual update to
> propagate down the topology, so another thought might be to just go
> ahead and add the memory we would have used for the memtables to the
> cache size, but if people did disable the cache entirely, then we could
> still go ahead and forward the records on each write?
>
> I know that Guozhang was also proposing for a while to actually decouple
> caching and forwarding, which might provide a way to side-step this
> dilemma (i.e., we just always forward and only apply the cache to state
> and changelog writes).
>
> By the way, I'm basing my statement on why you'd disable caches on
> memory, but also on the guidance here:
>
> https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html
> . That doc also contains a section on how to bound the total memory
> usage across RocksDB memtables, which points to another benefit of
> disabling memtables and managing the write buffer ourselves (simplified
> memory configuration).
>
> Thanks,
> -John
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-20 Thread Nick Telford
Potentially we could just go the memtable-with-RocksDB-WriteBatches route if
the cache is disabled?

On Tue, 20 Jun 2023, 22:00 John Roesler,  wrote:

> Touché!
>
> Ok, I agree that figuring out the case of a disabled cache would be
> non-trivial. Ingesting single-record SST files will probably not be
> performant, but benchmarking may prove different. Or maybe we can have
> some reserved cache space on top of the user-configured cache, which we
> would have reclaimed from the memtable space. Or some other, more
> creative solution.
>
> Thanks,
> -John
>
> On 6/20/23 15:30, Nick Telford wrote:
> >> Note that users can disable the cache, which would still be
> > ok, I think. We wouldn't ingest the SST files on every record, but just
> > append to them and only ingest them on commit, when we're already
> > waiting for acks and a RocksDB commit.
> >
> > In this case, how would uncommitted records be read by joins?
> >
> > On Tue, 20 Jun 2023, 20:51 John Roesler,  wrote:
> >
> >> Ah, sorry Nick,
> >>
> >> I just meant the regular heap based cache that we maintain in Streams. I
> >> see that it's not called "RecordCache" (my mistake).
> >>
> >> The actual cache is ThreadCache:
> >>
> >>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java
> >>
> >> Here's the example of how we use the cache in KeyValueStore:
> >>
> >>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
> >>
> >> It's basically just an on-heap Map of records that have not yet been
> >> written to the changelog or flushed into the underlying store. It gets
> >> flushed when the total cache size exceeds `cache.max.bytes.buffering` or
> >> the `commit.interval.ms` elapses.
> >>
> >> Speaking of those configs, another benefit to this idea is that we would
> >> no longer need to trigger extra commits based on the size of the ongoing
> >> transaction. Instead, we'd just preserve the existing cache-flush
> >> behavior. Note that users can disable the cache, which would still be
> >> ok, I think. We wouldn't ingest the SST files on every record, but just
> >> append to them and only ingest them on commit, when we're already
> >> waiting for acks and a RocksDB commit.
> >>
> >> Thanks,
> >> -John
> >>
> >> On 6/20/23 14:09, Nick Telford wrote:
> >>> Hi John,
> >>>
> >>> By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find
> any
> >>> class called "RecordCache"...
> >>>
> >>> Cheers,
> >>>
> >>> Nick
> >>>
> >>> On Tue, 20 Jun 2023 at 19:42, John Roesler 
> wrote:
> >>>
> >>>> Hi Nick,
> >>>>
> >>>> Thanks for picking this up again!
> >>>>
> >>>> I did have one new thought over the intervening months, which I'd like
> >>>> your take on.
> >>>>
> >>>> What if, instead of using the RocksDB atomic write primitive at all,
> we
> >>>> instead just:
> >>>> 1. disable memtables entirely
> >>>> 2. directly write the RecordCache into SST files when we flush
> >>>> 3. atomically ingest the SST file(s) into RocksDB when we get the ACK
> >>>> from the changelog (see
> >>>>
> >>>>
> >>
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
> >>>> and
> >>>>
> >>>>
> >>
> https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
> >>>> and
> >>>>
> >>>>
> >>
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
> >>>> )
> >>>> 4. track the changelog offsets either in another CF or the same CF
> with
> >>>> a reserved key, either of which will make the changelog offset update
> >>>> atomic with the file ingestions
> >>>>
> >>>> I suspect this'll have a number of benefits:
> >>>> * writes to RocksDB will always be atomic
> >>>> * we don't fragment memory between the RecordCache and the memtables
> >>>> * RecordCache gives far higher performance than memtable for rea

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-20 Thread Nick Telford
> Note that users can disable the cache, which would still be
ok, I think. We wouldn't ingest the SST files on every record, but just
append to them and only ingest them on commit, when we're already
waiting for acks and a RocksDB commit.

In this case, how would uncommitted records be read by joins?

On Tue, 20 Jun 2023, 20:51 John Roesler,  wrote:

> Ah, sorry Nick,
>
> I just meant the regular heap based cache that we maintain in Streams. I
> see that it's not called "RecordCache" (my mistake).
>
> The actual cache is ThreadCache:
>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java
>
> Here's the example of how we use the cache in KeyValueStore:
>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
>
> It's basically just an on-heap Map of records that have not yet been
> written to the changelog or flushed into the underlying store. It gets
> flushed when the total cache size exceeds `cache.max.bytes.buffering` or
> the `commit.interval.ms` elapses.
>
> Speaking of those configs, another benefit to this idea is that we would
> no longer need to trigger extra commits based on the size of the ongoing
> transaction. Instead, we'd just preserve the existing cache-flush
> behavior. Note that users can disable the cache, which would still be
> ok, I think. We wouldn't ingest the SST files on every record, but just
> append to them and only ingest them on commit, when we're already
> waiting for acks and a RocksDB commit.
>
> Thanks,
> -John
>
> On 6/20/23 14:09, Nick Telford wrote:
> > Hi John,
> >
> > By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
> > class called "RecordCache"...
> >
> > Cheers,
> >
> > Nick
> >
> > On Tue, 20 Jun 2023 at 19:42, John Roesler  wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for picking this up again!
> >>
> >> I did have one new thought over the intervening months, which I'd like
> >> your take on.
> >>
> >> What if, instead of using the RocksDB atomic write primitive at all, we
> >> instead just:
> >> 1. disable memtables entirely
> >> 2. directly write the RecordCache into SST files when we flush
> >> 3. atomically ingest the SST file(s) into RocksDB when we get the ACK
> >> from the changelog (see
> >>
> >>
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
> >> and
> >>
> >>
> https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
> >> and
> >>
> >>
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
> >> )
> >> 4. track the changelog offsets either in another CF or the same CF with
> >> a reserved key, either of which will make the changelog offset update
> >> atomic with the file ingestions
> >>
> >> I suspect this'll have a number of benefits:
> >> * writes to RocksDB will always be atomic
> >> * we don't fragment memory between the RecordCache and the memtables
> >> * RecordCache gives far higher performance than memtable for reads and
> >> writes
> >> * we don't need any new "transaction" concepts or memory bound configs
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> -John
> >>
> >> On 6/20/23 10:51, Nick Telford wrote:
> >>> Hi Bruno,
> >>>
> >>> Thanks for reviewing the KIP. It's been a long road, I started working
> on
> >>> this more than a year ago, and most of the time in the last 6 months
> has
> >>> been spent on the "Atomic Checkpointing" stuff that's been benched, so
> >> some
> >>> of the reasoning behind some of my decisions has been lost, but I'll
> do
> >> my
> >>> best to reconstruct them.
> >>>
> >>> 1.
> >>> IIRC, this was the initial approach I tried. I don't remember the exact
> >>> reasons I changed it to use a separate "view" of the StateStore that
> >>> encapsulates the transaction, but I believe it had something to do with
> >>> concurrent access to the StateStore from Interactive Query threads.
> Reads
> >>> from interactive queries need to be isolated from the currently ongoing
> >>> transaction, both for consistency (so interactive queries don't

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-20 Thread Nick Telford
Hi John,

I think you're referring to the "record cache" that's provided by the
ThreadCache class?

1-3.
I was hoping to (eventually) remove the "flush-on-commit" behaviour from
RocksDbStore, so that RocksDB can choose when to flush memtables, enabling
users to tailor RocksDB performance to their workload. Explicitly flushing
the Record Cache to files instead would entail either flushing on every
commit, or the current behaviour of flushing on every commit provided at
least 10K records have been processed. Compared with RocksDB-managed
memtable flushing, this is very inflexible. If we pursue this design, I
highly recommend replacing the hard-coded 10K limit with something
configurable so that users can tune flush behaviour for their workloads.
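To make that suggestion concrete, here's a minimal sketch of such a commit-time flush decision, with the hard-coded 10K limit replaced by a configurable threshold. This is plain illustrative Java, not Streams code, and the config name in the comment is invented:

```java
// Hypothetical sketch of a commit-time flush decision with the hard-coded
// 10K record limit replaced by a configurable threshold. Illustrative
// only; the config name below does not exist in Kafka Streams.
public class FlushPolicy {
    private final long flushIntervalRecords;  // e.g. "statestore.flush.interval.records" (invented)
    private long recordsSinceFlush = 0;

    FlushPolicy(long flushIntervalRecords) {
        this.flushIntervalRecords = flushIntervalRecords;
    }

    void onRecordProcessed() {
        recordsSinceFlush++;
    }

    // Called on every Streams commit: flush only if enough records have
    // been processed since the last flush, then reset the counter.
    boolean shouldFlushOnCommit() {
        if (recordsSinceFlush >= flushIntervalRecords) {
            recordsSinceFlush = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FlushPolicy policy = new FlushPolicy(10_000);
        for (int i = 0; i < 9_999; i++) policy.onRecordProcessed();
        assert !policy.shouldFlushOnCommit();  // still under the threshold
        policy.onRecordProcessed();
        assert policy.shouldFlushOnCommit();   // 10K reached: flush on this commit
    }
}
```

A RocksDB-managed policy would instead flush whenever a memtable fills, which adapts to the workload rather than to the commit cadence.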

4.
Tracking the changelog offsets in another CF and atomically updating it
with the main CFs is orthogonal, I think, as it can be done when using
memtables provided the "Atomic Flush" feature of RocksDB is enabled. This
is something I'd originally planned for this KIP, but we're trying to pull
out into a later KIP to make things more manageable.

> * we don't fragment memory between the RecordCache and the memtables
I think by memory fragmentation, you mean duplication, because we're
caching the records both in the (on-heap) Record Cache and the RocksDB
memtables? This is a good point that I hadn't considered before. Wouldn't a
simpler solution be to just disable the record cache for RocksDB stores (by
default), and let the memtables do the caching? Although I guess that would
reduce read performance, which could be especially important for joins.

> * RecordCache gives far higher performance than memtable for reads and
writes
I'll concede this point. The JNI boundary plus RocksDB record encoding will
likely make it impossible to ever match the Record Cache on throughput.

> * we don't need any new "transaction" concepts or memory bound configs
Maybe. Unless I'm mistaken, the Record Cache only retains the most recently
written value for a key, which would mean that Interactive Queries would
always observe new record values *before* they're committed to the
changelog. While this is the current behaviour, it's also a violation of
consistency, because successive IQ could observe a regression of a value,
due to an error writing to the changelog (e.g. a changelog transaction
rollback or a timeout). This is something that KIP-892 aims to improve on,
as the current design would ensure that records are only observed by IQ
*after* they have been committed to the Kafka changelog.
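To illustrate the distinction, here's a minimal, hypothetical sketch (plain Java, not the actual Streams implementation) of the read-committed behaviour KIP-892 targets: writes buffer until the changelog commit succeeds, so interactive queries never observe a value that could later regress:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a store whose reads see only committed data, so a
// changelog failure can never cause a value regression for readers.
public class ReadCommittedStore {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> uncommitted = new HashMap<>();

    public void put(String key, String value) {
        uncommitted.put(key, value);          // buffered, not yet visible to IQ
    }

    public String get(String key) {
        return committed.get(key);            // IQ only sees committed records
    }

    public void commit() {                    // called once the changelog acks
        committed.putAll(uncommitted);
        uncommitted.clear();
    }

    public void rollback() {                  // e.g. a changelog write timeout
        uncommitted.clear();                  // the committed view never regresses
    }

    public static void main(String[] args) {
        ReadCommittedStore store = new ReadCommittedStore();
        store.put("k", "v1");
        assert store.get("k") == null;        // not visible before commit
        store.commit();
        assert "v1".equals(store.get("k"));
        store.put("k", "v2");
        store.rollback();
        assert "v1".equals(store.get("k"));   // no regression after rollback
    }
}
```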

That said, it definitely sounds *feasible*.

Regards,

Nick


Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-20 Thread Nick Telford
Hi John,

By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
class called "RecordCache"...

Cheers,

Nick

On Tue, 20 Jun 2023 at 19:42, John Roesler  wrote:

> Hi Nick,
>
> Thanks for picking this up again!
>
> I did have one new thought over the intervening months, which I'd like
> your take on.
>
> What if, instead of using the RocksDB atomic write primitive at all, we
> instead just:
> 1. disable memtables entirely
> 2. directly write the RecordCache into SST files when we flush
> 3. atomically ingest the SST file(s) into RocksDB when we get the ACK
> from the changelog (see
>
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
> and
>
> https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
> and
>
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
> )
> 4. track the changelog offsets either in another CF or the same CF with
> a reserved key, either of which will make the changelog offset update
> atomic with the file ingestions
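For a sense of how those four steps fit together, here is a minimal plain-Java model of the flow (a conceptual stand-in only; a real implementation would use RocksDB's SstFileWriter and ingestExternalFile, linked above):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed flow: buffered writes become an
// immutable "file", which is then ingested together with the changelog
// offset in one atomic step once the changelog ack arrives.
public class SstIngestModel {
    public static final String OFFSET_KEY = "__changelog_offset"; // reserved key (step 4)

    private final Map<String, String> db = new HashMap<>();    // stands in for RocksDB
    private final Map<String, String> cache = new HashMap<>(); // the record cache

    public void put(String key, String value) {
        cache.put(key, value);                // steps 1-2: no memtable, cache only
    }

    // Step 2: "flush" the cache into an immutable batch (an SST file in RocksDB).
    public Map<String, String> flushToFile() {
        Map<String, String> file = new HashMap<>(cache);
        cache.clear();
        return file;
    }

    // Steps 3-4: on changelog ack, ingest the file and the offset atomically.
    public synchronized void ingest(Map<String, String> file, long changelogOffset) {
        db.putAll(file);
        db.put(OFFSET_KEY, Long.toString(changelogOffset));
    }

    public String get(String key) {
        return db.get(key);
    }

    public static void main(String[] args) {
        SstIngestModel store = new SstIngestModel();
        store.put("a", "1");
        Map<String, String> file = store.flushToFile();
        store.ingest(file, 42L);              // after the changelog ack
        assert "1".equals(store.get("a"));
        assert "42".equals(store.get(OFFSET_KEY));
    }
}
```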
>
> I suspect this'll have a number of benefits:
> * writes to RocksDB will always be atomic
> * we don't fragment memory between the RecordCache and the memtables
> * RecordCache gives far higher performance than memtable for reads and
> writes
> * we don't need any new "transaction" concepts or memory bound configs
>
> What do you think?
>
> Thanks,
> -John
>
> On 6/20/23 10:51, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for reviewing the KIP. It's been a long road, I started working on
> > this more than a year ago, and most of the time in the last 6 months has
> > been spent on the "Atomic Checkpointing" stuff that's been benched, so
> some
> > of the reasoning behind some of my decisions has been lost, but I'll do
> my
> > best to reconstruct them.
> >
> > 1.
> > IIRC, this was the initial approach I tried. I don't remember the exact
> > reasons I changed it to use a separate "view" of the StateStore that
> > encapsulates the transaction, but I believe it had something to do with
> > concurrent access to the StateStore from Interactive Query threads. Reads
> > from interactive queries need to be isolated from the currently ongoing
> > transaction, both for consistency (so interactive queries don't observe
> > changes that are subsequently rolled-back), but also to prevent Iterators
> > opened by an interactive query from being closed and invalidated by the
> > StreamThread when it commits the transaction, which causes your
> interactive
> > queries to crash.
> >
> > Another reason I believe I implemented it this way was a separation of
> > concerns. Recall that newTransaction() originally created an object of
> type
> > Transaction, not StateStore. My intent was to improve the type-safety of
> > the API, in an effort to ensure Transactions weren't used incorrectly.
> > Unfortunately, this didn't pan out, but newTransaction() remained.
> >
> > Finally, this had the added benefit that implementations could easily add
> > support for transactions *without* re-writing their existing,
> > non-transactional implementation. I think this can be a benefit both for
> > implementers of custom StateStores, but also for anyone extending
> > RocksDbStore, as they can rely on the existing access methods working how
> > they expect them to.
> >
> > I'm not too happy with the way the current design has panned out, so I'm
> > open to ideas on how to improve it. Key to this is finding some way to
> > ensure that reads from Interactive Query threads are properly isolated
> from
> > the transaction, *without* the performance overhead of checking which
> > thread the method is being called from on every access.
> >
> > As for replacing flush() with commit() - I saw no reason to add this
> > complexity to the KIP, unless there was a need to add arguments to the
> > flush/commit method. This need arises with Atomic Checkpointing, but that
> > will be implemented separately, in a future KIP. Do you see a need for
> some
> > arguments to the flush/commit method that I've missed? Or were you simply
> > suggesting a rename?
> >
> > 2.
> > This is simply due to the practical reason that isolationLevel() is
> really
> > a proxy for checking if the app is under EOS. The application
> configuration
> > is not provided to the constructor of StateStores, but it *is* provided
> to
> > init(), via StateStoreContext. For this reason, it seemed somew

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-20 Thread Nick Telford
 no-doubt
includes some record overheads, and WriteBatchWithIndex has to maintain an
index.

Ideally, we could trivially add a method upstream to WriteBatchInterface
that provides the exact size of the batch, but that would require an
upgrade of RocksDB, which won't happen soon. So for the time being, we're
stuck with an approximation, so I felt that the new method should reflect
that.

Would you prefer that the new method name ignore this constraint, and that
we simply make the RocksDB measurement more accurate in the future?

6.
Done

7.
Very good point. The KIP already specifically calls out memory in the
documentation of the config: "Maximum number of memory bytes to be used to
buffer uncommitted state-store records." - did you have something else in
mind?

Should we also make this clearer by renaming the config property itself?
Perhaps to something like statestore.transaction.buffer.max.bytes?

8.
OK, I can remove this. The intent here was to describe how Streams itself
will manage transaction roll-over etc. Presumably that means we also don't
need a description of how Streams will manage the commit of changelog
transactions, state store transactions and checkpointing?

9.
What do you mean by fail-over? Do you mean failing over an Active Task to
an instance already hosting a Standby Task?

Thanks again and sorry for the essay of a response!

Regards,
Nick

On Tue, 20 Jun 2023 at 10:49, Bruno Cadonna  wrote:

> Hi Nick,
>
> Thanks for the updates!
>
> I really appreciate that you simplified the KIP by removing some
> aspects. As I have already told you, I think the removed aspects are
> also good ideas and we can discuss them on follow-up KIPs.
>
> Regarding the current KIP, I have the following feedback.
>
> 1.
> Is there a good reason to add method newTransaction() to the StateStore
> interface? As far as I understand, the idea is that users of a state
> store (transactional or not) call this method at start-up and after each
> commit. Since the call to newTransaction() is done in any case, I
> think it would simplify the caller code if we just started a new
> transaction after a commit in the implementation.
> As far as I understand, you plan to commit the transaction in the
> flush() method. I find the idea to replace flush() with commit()
> presented in KIP-844 an elegant solution.
>
> 2.
> Why is the method to query the isolation level added to the state store
> context?
>
> 3.
> Do we need all the isolation level definitions? I think it is good to
> know the guarantees of the transactionality of the state store.
> However, currently, Streams guarantees that there will only be one
> transaction that writes to the state store. Only the stream thread that
> executes the active task that owns the state store will write to the
> state store. I think it should be enough to know if the state store is
> transactional or not. So my proposal would be to just add a method on
> the state store interface that returns if a state store is transactional
> or not by returning a boolean or an enum.
>
> 4.
> I am wondering why AbstractTransaction and AbstractTransactionalStore
> are part of the KIP. They look like implementation details that should
> not be exposed in the public API.
>
> 5.
> Why does StateStore#approximateNumUncommittedBytes() return an
> approximate number of bytes?
>
> 6.
> RocksDB is just one implementation of the state stores in Streams.
> However, the issues regarding OOM errors might also apply to other
> custom implementations. So in the KIP I would extract that part from
> section "RocksDB Transaction". I would also move section "RocksDB
> Transaction" to the end of section "Proposed Changes" and handle it as
> an example implementation for a state store.
>
> 7.
> Should statestore.uncommitted.max.bytes only limit the uncommitted bytes
> or the uncommitted bytes that reside in memory? In future, other
> transactional state store implementations might implement a buffer for
> uncommitted records that are able to spill records on disk. I think
> statestore.uncommitted.max.bytes needs to limit the uncommitted bytes
> irrespective if they reside in memory or disk. Since Streams will use
> this config to decide if it needs to trigger a commit, state store
> implementations that can spill to disk will never be able to spill to
> disk. You would only need to change the doc of the config, if you agree
> with me.
>
> 8.
> Section "Transaction Management" about the wrappers is rather a
> implementation detail that should not be in the KIP.
>
> 9.
> Could you add a section that describes how failover will work with the
> transactional state stores? I think section "Error handling" is already
> a good start.
>
>
> Best,
> Bruno
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-05-15 Thread Nick Telford
Hi everyone,

Quick update: I've added a new section to the KIP: "Offsets for Consumer
Rebalances", that outlines my solution to the problem that
StreamsPartitionAssignor needs to read StateStore offsets even if they're
not currently open.

Regards,
Nick

On Wed, 3 May 2023 at 11:34, Nick Telford  wrote:

> Hi Bruno,
>
> Thanks for reviewing my proposal.
>
> 1.
> The main reason I added it was because it was easy to do. If we see no
> value in it, I can remove it.
>
> 2.
> Global StateStores can have multiple partitions in their input topics
> (which function as their changelogs), so they would have more than one
> partition.
>
> 3.
> That's a good point. At present, the only method it adds is
> isolationLevel(), which is likely not necessary outside of StateStores.
> It *does* provide slightly different guarantees in the documentation to
> several of the methods (hence the overrides). I'm not sure if this is
> enough to warrant a new interface though.
> I think the question that remains is whether this interface makes it
> easier to implement custom transactional StateStores than if we were to
> remove it? Probably not.
>
> 4.
> The main motivation for the Atomic Checkpointing is actually performance.
> My team has been testing out an implementation of this KIP without it, and
> we had problems with RocksDB doing *much* more compaction, due to the
> significantly increased flush rate. It was enough of a problem that (for
> the time being), we had to revert back to Kafka Streams proper.
> I think the best way to solve this, as you say, is to keep the .checkpoint
> files *in addition* to the offsets being stored within the store itself.
> Essentially, when closing StateStores, we force a memtable flush, then
> call getCommittedOffsets and write those out to the .checkpoint file.
> That would ensure the metadata is available to the
> StreamsPartitionAssignor for all closed stores.
> If there's a crash (no clean close), then we won't be able to guarantee
> which offsets were flushed to disk by RocksDB, so we'd need to open (
> init()), read offsets, and then close() those stores. But since this is
> the exception, and will only occur once (provided it doesn't crash every
> time!), I think the performance impact here would be acceptable.
>
> Thanks for the feedback, please let me know if you have any more comments
> or questions!
>
> I'm currently working on rebasing against trunk. This involves adding
> support for transactionality to VersionedStateStores. I will probably need
> to revise my implementation for transactional "segmented" stores, both to
> accommodate VersionedStateStore, and to clean up some other stuff.
>
> Regards,
> Nick
>
>
> On Tue, 2 May 2023 at 13:45, Bruno Cadonna  wrote:
>
>> Hi Nick,
>>
>> Thanks for the updates!
>>
>> I have a couple of questions/comments.
>>
>> 1.
>> Why do you propose a configuration that involves max. bytes and max.
>> records? I think we are mainly concerned about memory consumption because
>> we want to limit the off-heap memory used. I cannot think of a case
>> where one would want to set the max. number of records.
>>
>>
>> 2.
>> Why does
>>
>>   default void commit(final Map<TopicPartition, Long> changelogOffsets) {
>>   flush();
>>   }
>>
>> take a map of partitions to changelog offsets?
>> The mapping between state stores and partitions is a 1:1 relationship.
>> Passing in a single changelog offset should suffice.
>>
>>
>> 3.
>> Why do we need the Transaction interface? It should be possible to hide
>> beginning and committing transactions within the state store
>> implementation, so that from outside the state store, it does not matter
>> whether the state store is transactional or not. What would be the
>> advantage of using the Transaction interface?
>>
>>
>> 4.
>> Regarding checkpointing offsets, I think we should keep the checkpoint
>> file in any case for the reason you mentioned about rebalancing. Even if
>> that would not be an issue, I would propose to move the change to offset
>> management to a new KIP and to not add more complexity than needed to
>> this one. I would not be too concerned about the consistency violation
>> you mention. As far as I understand, with transactional state stores
>> Streams would write the checkpoint file during every commit even under
>> EOS. In the failure case you describe, Streams would restore the state
>> stores from the offsets found in the checkpoint file written during the
>> penultimate commit instead of during the last commit. Basically, Streams
>> would overwrite the records

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-05-03 Thread Nick Telford
Hi Bruno,

Thanks for reviewing my proposal.

1.
The main reason I added it was because it was easy to do. If we see no
value in it, I can remove it.

2.
Global StateStores can have multiple partitions in their input topics
(which function as their changelogs), so they would have more than one
partition.

3.
That's a good point. At present, the only method it adds is
isolationLevel(), which is likely not necessary outside of StateStores.
It *does* provide slightly different guarantees in the documentation to
several of the methods (hence the overrides). I'm not sure if this is
enough to warrant a new interface though.
I think the question that remains is whether this interface makes it easier
to implement custom transactional StateStores than if we were to remove it?
Probably not.

4.
The main motivation for the Atomic Checkpointing is actually performance.
My team has been testing out an implementation of this KIP without it, and
we had problems with RocksDB doing *much* more compaction, due to the
significantly increased flush rate. It was enough of a problem that (for
the time being), we had to revert back to Kafka Streams proper.
I think the best way to solve this, as you say, is to keep the .checkpoint
files *in addition* to the offsets being stored within the store itself.
Essentially, when closing StateStores, we force a memtable flush, then call
getCommittedOffsets and write those out to the .checkpoint file. That would
ensure the metadata is available to the StreamsPartitionAssignor for all
closed stores.
If there's a crash (no clean close), then we won't be able to guarantee
which offsets were flushed to disk by RocksDB, so we'd need to open (init()),
read offsets, and then close() those stores. But since this is the
exception, and will only occur once (provided it doesn't crash every
time!), I think the performance impact here would be acceptable.
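A minimal sketch of that close/startup protocol (illustrative plain Java; the method names are invented, and reading the offset from the store itself is elided to a parameter):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the protocol described above: a clean close
// writes the committed offset to a .checkpoint file; if the file is
// missing on startup (a crash), the store itself must be opened to
// recover the offset.
public class CheckpointOnClose {
    public static void closeCleanly(Path checkpoint, long committedOffset) throws IOException {
        // in the real store: force a memtable flush, then getCommittedOffsets()
        Files.writeString(checkpoint, Long.toString(committedOffset));
    }

    public static long offsetForAssignor(Path checkpoint, long offsetInStore) throws IOException {
        if (Files.exists(checkpoint)) {
            return Long.parseLong(Files.readString(checkpoint)); // fast path
        }
        // no clean close: open the store, read its offset, close it again
        return offsetInStore;
    }

    public static void main(String[] args) {
        try {
            Path dir = Files.createTempDirectory("ckpt-demo");
            Path checkpoint = dir.resolve(".checkpoint");

            closeCleanly(checkpoint, 100L);
            assert offsetForAssignor(checkpoint, -1L) == 100L; // checkpoint wins

            Files.delete(checkpoint);                  // simulate a crash
            assert offsetForAssignor(checkpoint, 73L) == 73L;  // fall back to the store
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```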

Thanks for the feedback, please let me know if you have any more comments
or questions!

I'm currently working on rebasing against trunk. This involves adding
support for transactionality to VersionedStateStores. I will probably need
to revise my implementation for transactional "segmented" stores, both to
accommodate VersionedStateStore, and to clean up some other stuff.

Regards,
Nick


On Tue, 2 May 2023 at 13:45, Bruno Cadonna  wrote:

> Hi Nick,
>
> Thanks for the updates!
>
> I have a couple of questions/comments.
>
> 1.
> Why do you propose a configuration that involves max. bytes and max.
> records? I think we are mainly concerned about memory consumption because
> we want to limit the off-heap memory used. I cannot think of a case
> where one would want to set the max. number of records.
>
>
> 2.
> Why does
>
>   default void commit(final Map<TopicPartition, Long> changelogOffsets) {
>   flush();
>   }
>
> take a map of partitions to changelog offsets?
> The mapping between state stores and partitions is a 1:1 relationship.
> Passing in a single changelog offset should suffice.
>
>
> 3.
> Why do we need the Transaction interface? It should be possible to hide
> beginning and committing transactions within the state store
> implementation, so that from outside the state store, it does not matter
> whether the state store is transactional or not. What would be the
> advantage of using the Transaction interface?
>
>
> 4.
> Regarding checkpointing offsets, I think we should keep the checkpoint
> file in any case for the reason you mentioned about rebalancing. Even if
> that would not be an issue, I would propose to move the change to offset
> management to a new KIP and to not add more complexity than needed to
> this one. I would not be too concerned about the consistency violation
> you mention. As far as I understand, with transactional state stores
> Streams would write the checkpoint file during every commit even under
> EOS. In the failure case you describe, Streams would restore the state
> stores from the offsets found in the checkpoint file written during the
> penultimate commit instead of during the last commit. Basically, Streams
> would overwrite the records written to the state store between the last
> two commits with the same records read from the changelogs. While I
> understand that this is wasteful, it is -- at the same time --
> acceptable and most importantly it does not break EOS.
>
> Best,
> Bruno
>
>
> On 27.04.23 12:34, Nick Telford wrote:
> > Hi everyone,
> >
> > I find myself (again) considering removing the offset management from
> > StateStores, and keeping the old checkpoint file system. The reason is
> that
> > the StreamPartitionAssignor directly reads checkpoint files in order to
> > determine which instance has the most up-to-date copy of the local state.
> > If we move offsets into the StateStore itself, then we

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-04-27 Thread Nick Telford
Hi everyone,

I find myself (again) considering removing the offset management from
StateStores, and keeping the old checkpoint file system. The reason is that
the StreamPartitionAssignor directly reads checkpoint files in order to
determine which instance has the most up-to-date copy of the local state.
If we move offsets into the StateStore itself, then we will need to open,
initialize, read offsets and then close each StateStore (that is not
already assigned and open) for which we have *any* local state, on every
rebalance.

Generally, I don't think there are many "orphan" stores like this sitting
around on most instances, but even a few would introduce additional latency
to an already somewhat lengthy rebalance procedure.

I'm leaning towards Colt's (Slack) suggestion of just keeping things in the
checkpoint file(s) for now, and not worrying about the race. The downside
is that we wouldn't be able to remove the explicit RocksDB flush on-commit,
which likely hurts performance.

If anyone has any thoughts or ideas on this subject, I would appreciate it!

Regards,
Nick

On Wed, 19 Apr 2023 at 15:05, Nick Telford  wrote:

> Hi Colt,
>
> The issue is that if there's a crash between 2 and 3, then you still end
> up with inconsistent data in RocksDB. The only way to guarantee that your
> checkpoint offsets and locally stored data are consistent with each other
> are to atomically commit them, which can be achieved by having the offsets
> stored in RocksDB.
>
> The offsets column family is likely to be extremely small (one
> per-changelog partition + one per Topology input partition for regular
> stores, one per input partition for global stores). So the overhead will be
> minimal.
>
> A major benefit of doing this is that we can remove the explicit calls to
> db.flush(), which forcibly flushes memtables to disk on-commit. It turns
> out, RocksDB memtable flushes are largely dictated by Kafka Streams
> commits, *not* RocksDB configuration, which could be a major source of
> confusion. Atomic checkpointing makes it safe to remove these explicit
> flushes, because it no longer matters exactly when RocksDB flushes data to
> disk; since the data and corresponding checkpoint offsets will always be
> flushed together, the local store is always in a consistent state, and
> on-restart, it can always safely resume restoration from the on-disk
> offsets, restoring the small amount of data that hadn't been flushed when
> the app exited/crashed.
>
> Regards,
> Nick
>
> On Wed, 19 Apr 2023 at 14:35, Colt McNealy  wrote:
>
>> Nick,
>>
>> Thanks for your reply. Ack to A) and B).
>>
>> For item C), I see what you're referring to. Your proposed solution will
>> work, so no need to change it. What I was suggesting was that it might be
>> possible to achieve this with only one column family. So long as:
>>
>>- No uncommitted records (i.e. not committed to the changelog) are
>>*committed* to the state store, AND
>>- The Checkpoint offset (which refers to the changelog topic) is less
>>than or equal to the last written changelog offset in rocksdb
>>
>> I don't see the need to do the full restoration from scratch. My
>> understanding was that prior to 844/892, full restorations were required
>> because there could be uncommitted records written to RocksDB; however,
>> given your use of RocksDB transactions, that can be avoided with the
>> pattern of 1) commit Kafka transaction, 2) commit RocksDB transaction, 3)
>> update offset in checkpoint file.
>>
>> Anyways, your proposed solution works equivalently and I don't believe
>> there is much overhead to an additional column family in RocksDB. Perhaps
>> it may even perform better than making separate writes to the checkpoint
>> file.
>>
>> Colt McNealy
>> *Founder, LittleHorse.io*
>>
>>
>> On Wed, Apr 19, 2023 at 5:53 AM Nick Telford 
>> wrote:
>>
>> > Hi Colt,
>> >
>> > A. I've done my best to de-couple the StateStore stuff from the rest of
>> the
>> > Streams engine. The fact that there will be only one ongoing (write)
>> > transaction at a time is not guaranteed by any API, and is just a
>> > consequence of the way Streams operates. To that end, I tried to ensure
>> the
>> > documentation and guarantees provided by the new APIs are independent of
>> > this incidental behaviour. In practice, you're right, this essentially
>> > refers to "interactive queries", which are technically "read
>> transactions",
>> > even if they don't actually use the transaction API to isolate
>> themselves.
>> >
>> > B. Yes, although not ideal. Th

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-04-19 Thread Nick Telford
Hi Colt,

The issue is that if there's a crash between 2 and 3, then you still end up
with inconsistent data in RocksDB. The only way to guarantee that your
checkpoint offsets and locally stored data are consistent with each other
is to atomically commit them, which can be achieved by having the offsets
stored in RocksDB.

The offsets column family is likely to be extremely small (one
per-changelog partition + one per Topology input partition for regular
stores, one per input partition for global stores). So the overhead will be
minimal.

A major benefit of doing this is that we can remove the explicit calls to
db.flush(), which forcibly flushes memtables to disk on-commit. It turns
out, RocksDB memtable flushes are largely dictated by Kafka Streams
commits, *not* RocksDB configuration, which could be a major source of
confusion. Atomic checkpointing makes it safe to remove these explicit
flushes, because it no longer matters exactly when RocksDB flushes data to
disk; since the data and corresponding checkpoint offsets will always be
flushed together, the local store is always in a consistent state, and
on-restart, it can always safely resume restoration from the on-disk
offsets, restoring the small amount of data that hadn't been flushed when
the app exited/crashed.
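To make the consistency argument concrete, here is a stdlib-only toy (all class and method names invented for illustration) contrasting a separate checkpoint write with an atomic data-plus-offset commit, which a single RocksDB WriteBatch spanning the data and offsets column families provides:

```java
import java.util.HashMap;
import java.util.Map;

// "Disk" state is a store map plus a checkpoint offset. A crash between
// the two separate writes leaves them inconsistent; an atomic batch
// either applies both or neither, so they always agree.
class AtomicCheckpointSketch {
    final Map<String, String> store = new HashMap<>();
    long checkpointOffset = -1L;

    // Write data, then (maybe crash), then write the offset. Returns
    // whether the resulting on-disk state is self-consistent.
    boolean commitSeparately(String key, String value, long offset, boolean crashBetween) {
        store.put(key, value);
        if (crashBetween) {
            // Data landed but the checkpoint was never updated.
            return checkpointOffset == offset;
        }
        checkpointOffset = offset;
        return true;
    }

    // Stage both writes and apply them together, as one atomic batch.
    boolean commitAtomically(String key, String value, long offset, boolean crashBetween) {
        if (crashBetween) {
            return true; // batch never applied: store and offset still agree
        }
        store.put(key, value);
        checkpointOffset = offset;
        return true;
    }
}
```

The same reasoning is why restoration can always safely resume from the on-disk offsets: they never describe data that isn't actually there.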

Regards,
Nick

On Wed, 19 Apr 2023 at 14:35, Colt McNealy  wrote:

> Nick,
>
> Thanks for your reply. Ack to A) and B).
>
> For item C), I see what you're referring to. Your proposed solution will
> work, so no need to change it. What I was suggesting was that it might be
> possible to achieve this with only one column family. So long as:
>
>- No uncommitted records (i.e. not committed to the changelog) are
>*committed* to the state store, AND
>- The Checkpoint offset (which refers to the changelog topic) is less
> than or equal to the last written changelog offset in RocksDB
>
> I don't see the need to do the full restoration from scratch. My
> understanding was that prior to 844/892, full restorations were required
> because there could be uncommitted records written to RocksDB; however,
> given your use of RocksDB transactions, that can be avoided with the
> pattern of 1) commit Kafka transaction, 2) commit RocksDB transaction, 3)
> update offset in checkpoint file.
>
> Anyways, your proposed solution works equivalently and I don't believe
> there is much overhead to an additional column family in RocksDB. Perhaps
> it may even perform better than making separate writes to the checkpoint
> file.
>
> Colt McNealy
> *Founder, LittleHorse.io*
>
>
> On Wed, Apr 19, 2023 at 5:53 AM Nick Telford 
> wrote:
>
> > Hi Colt,
> >
> > A. I've done my best to de-couple the StateStore stuff from the rest of
> the
> > Streams engine. The fact that there will be only one ongoing (write)
> > transaction at a time is not guaranteed by any API, and is just a
> > consequence of the way Streams operates. To that end, I tried to ensure
> the
> > documentation and guarantees provided by the new APIs are independent of
> > this incidental behaviour. In practice, you're right, this essentially
> > refers to "interactive queries", which are technically "read
> transactions",
> > even if they don't actually use the transaction API to isolate
> themselves.
> >
> > B. Yes, although not ideal. This is for backwards compatibility, because:
> > 1) Existing custom StateStore implementations will implement flush(),
> > and not commit(), but the Streams engine now calls commit(), so those
> calls
> > need to be forwarded to flush() for these legacy stores.
> > 2) Existing StateStore *users*, i.e. outside of the Streams engine
> > itself, may depend on explicitly calling flush(), so for these cases,
> > flush() needs to be redirected to call commit().
> > If anyone has a better way to guarantee compatibility without introducing
> > this potential recursion loop, I'm open to changes!
> >
> > C. This is described in the "Atomic Checkpointing" section. Offsets are
> > stored in a separate RocksDB column family, which is guaranteed to be
> > atomically flushed to disk with all other column families. The issue of
> > checkpoints being written to disk after commit causing inconsistency if
> it
> > crashes in between is the reason why, under EOS, checkpoint files are
> only
> > written on clean shutdown. This is one of the major causes of "full
> > restorations", so moving the offsets into a place where they can be
> > guaranteed to be atomically written with the data they checkpoint allows
> us
> > to write the checkpoint offsets *on every commit*, not just on clean
> > shutdown.
> >
> > Regards,
> &

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-04-19 Thread Nick Telford
Hi Colt,

A. I've done my best to de-couple the StateStore stuff from the rest of the
Streams engine. The fact that there will be only one ongoing (write)
transaction at a time is not guaranteed by any API, and is just a
consequence of the way Streams operates. To that end, I tried to ensure the
documentation and guarantees provided by the new APIs are independent of
this incidental behaviour. In practice, you're right, this essentially
refers to "interactive queries", which are technically "read transactions",
even if they don't actually use the transaction API to isolate themselves.

B. Yes, although not ideal. This is for backwards compatibility, because:
1) Existing custom StateStore implementations will implement flush(),
and not commit(), but the Streams engine now calls commit(), so those calls
need to be forwarded to flush() for these legacy stores.
2) Existing StateStore *users*, i.e. outside of the Streams engine
itself, may depend on explicitly calling flush(), so for these cases,
flush() needs to be redirected to call commit().
If anyone has a better way to guarantee compatibility without introducing
this potential recursion loop, I'm open to changes!
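The forwarding scheme can be sketched as follows (illustrative names only; the real StateStore interface has many more members). Each default forwards to the other, so any implementation that overrides at least one of the two methods terminates; overriding neither would recurse until StackOverflowError, which is the hazard mentioned above.

```java
interface CompatStateStore {
    // New API: legacy stores that only implement flush() still work,
    // because the default commit() forwards to their flush().
    default void commit() { flush(); }

    // Old API: callers that still invoke flush() on a new store are
    // redirected to commit().
    default void flush() { commit(); }
}

// A "legacy" custom store that predates commit(): it overrides only flush().
class LegacyStore implements CompatStateStore {
    int flushes = 0;
    @Override
    public void flush() { flushes++; }
}
```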

C. This is described in the "Atomic Checkpointing" section. Offsets are
stored in a separate RocksDB column family, which is guaranteed to be
atomically flushed to disk with all other column families. The issue of
checkpoints being written to disk after commit causing inconsistency if it
crashes in between is the reason why, under EOS, checkpoint files are only
written on clean shutdown. This is one of the major causes of "full
restorations", so moving the offsets into a place where they can be
guaranteed to be atomically written with the data they checkpoint allows us
to write the checkpoint offsets *on every commit*, not just on clean
shutdown.

Regards,
Nick

On Tue, 18 Apr 2023 at 15:39, Colt McNealy  wrote:

> Nick,
>
> Thank you for continuing this work. I have a few minor clarifying
> questions.
>
> A) "Records written to any transaction are visible to all other
> transactions immediately." I am confused here—I thought there could only be
> one transaction going on at a time for a given state store given the
> threading model for processing records on a Task. Do you mean Interactive
> Queries by "other transactions"? (If so, then everything makes sense—I
> thought that since IQ were read-only then they didn't count as
> transactions).
>
> B) Is it intentional that the default implementations of the flush() and
> commit() methods in the StateStore class refer to each other in some sort
> of unbounded recursion?
>
> C) How will the getCommittedOffset() method work? At first I thought the
> way to do it would be using a special key in the RocksDB store to store the
> offset, and committing that with the transaction. But upon second thought,
> since restoration from the changelog is an idempotent procedure, I think it
> would be fine to 1) commit the RocksDB transaction and then 2) write the
> offset to disk in a checkpoint file. If there is a crash between 1) and 2),
> I think the only downside is now we replay a few more records (at a cost of
> <100ms). Am I missing something there?
>
> Other than that, everything makes sense to me.
>
> Cheers,
> Colt McNealy
> *Founder, LittleHorse.io*
>
>
> On Tue, Apr 18, 2023 at 3:59 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > I've updated the KIP to reflect the latest version of the design:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
> >
> > There are several changes in there that reflect feedback from this
> thread,
> > and there's a new section and a bunch of interface changes relating to
> > Atomic Checkpointing, which is the final piece of the puzzle to making
> > everything robust.
> >
> > Let me know what you think!
> >
> > Regards,
> > Nick
> >
> > On Tue, 3 Jan 2023 at 11:33, Nick Telford 
> wrote:
> >
> > > Hi Lucas,
> > >
> > > Thanks for looking over my KIP.
> > >
> > > A) The bound is per-instance, not per-Task. This was a typo in the KIP
> > > that I've now corrected. It was originally per-Task, but I changed it
> to
> > > per-instance for exactly the reason you highlighted.
> > > B) It's worth noting that transactionality is only enabled under EOS,
> and
> > > in the default mode of operation (ALOS), there should be no change in
> > > behavior at all. I think, under EOS, we can mitigate the impact on
> users
> > by
> > > sufficiently low default values for the memory bound configuration. I
&

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-04-18 Thread Nick Telford
Hi everyone,

I've updated the KIP to reflect the latest version of the design:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores

There are several changes in there that reflect feedback from this thread,
and there's a new section and a bunch of interface changes relating to
Atomic Checkpointing, which is the final piece of the puzzle to making
everything robust.

Let me know what you think!

Regards,
Nick

On Tue, 3 Jan 2023 at 11:33, Nick Telford  wrote:

> Hi Lucas,
>
> Thanks for looking over my KIP.
>
> A) The bound is per-instance, not per-Task. This was a typo in the KIP
> that I've now corrected. It was originally per-Task, but I changed it to
> per-instance for exactly the reason you highlighted.
> B) It's worth noting that transactionality is only enabled under EOS, and
> in the default mode of operation (ALOS), there should be no change in
> behavior at all. I think, under EOS, we can mitigate the impact on users by
> sufficiently low default values for the memory bound configuration. I
> understand your hesitation to include a significant change of behaviour,
> especially in a minor release, but I suspect that most users will prefer
> the memory impact (under EOS) to the existing behaviour of frequent state
> restorations! If this is a problem, the changes can wait until the next
> major release. I'll be running a patched version of streams in production
> with these changes as soon as they're ready, so it won't disrupt me :-D
> C) The main purpose of this sentence was just to note that some changes
> will need to be made to the way Segments are handled in order to ensure
> they also benefit from transactions. At the time I wrote it, I hadn't
> figured out the specific changes necessary, so it was deliberately vague.
> This is the one outstanding problem I'm currently working on, and I'll
> update this section with more detail once I have figured out the exact
> changes required.
> D) newTransaction() provides the necessary isolation guarantees. While
> the RocksDB implementation of transactions doesn't technically *need*
> read-only users to call newTransaction(), other implementations (e.g. a
> hypothetical PostgresStore) may require it. Calling newTransaction() when
> no transaction is necessary is essentially free, as it will just return
> this.
>
> I didn't do any profiling of the KIP-844 PoC, but I think it should be
> fairly obvious where the performance problems stem from: writes under
> KIP-844 require three extra memory copies: one to encode the record with
> the tombstone/record flag, one to decode it from the tombstone/record flag,
> and one to copy the record from the "temporary" store to the "main" store,
> when the transaction commits. The different approach taken by KIP-892 should perform
> much better, as it avoids all these copies, and may actually perform
> slightly better than trunk, due to batched writes in RocksDB performing
> better than non-batched writes.[1]
>
> Regards,
> Nick
>
> 1: https://github.com/adamretter/rocksjava-write-methods-benchmark#results
>
> On Mon, 2 Jan 2023 at 16:18, Lucas Brutschy 
> wrote:
>
>> Hi Nick,
>>
>> I'm just starting to read up on the whole discussion about KIP-892 and
>> KIP-844. Thanks a lot for your work on this, I do think
>> `WriteBatchWithIndex` may be the way to go here. I do have some
>> questions about the latest draft.
>>
>>  A) If I understand correctly, you propose to put a bound on the
>> (native) memory consumed by each task. However, I wonder if this is
>> sufficient if we have temporary imbalances in the cluster. For
>> example, depending on the timing of rebalances during a cluster
>> restart, it could happen that a single streams node is assigned a lot
>> more tasks than expected. With your proposed change, this would mean
>> that the memory required by this one node could be a multiple of what
>> is required during normal operation. I wonder if it wouldn't be safer
>> to put a global bound on the memory use, across all tasks.
>>  B) Generally, the memory concerns still give me the feeling that this
>> should not be enabled by default for all users in a minor release.
>>  C) In section "Transaction Management": the sentence "A similar
>> analogue will be created to automatically manage `Segment`
>> transactions.". Maybe this is just me lacking some background, but I
>> do not understand this, it would be great if you could clarify what
>> you mean here.
>>  D) Could you please clarify why IQ has to call newTransaction(), when
>> it's read-only.
>>
>> And one last thing not strictly related to your KIP: if there is an
>> eas

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-01-03 Thread Nick Telford
Hi Lucas,

Thanks for looking over my KIP.

A) The bound is per-instance, not per-Task. This was a typo in the KIP that
I've now corrected. It was originally per-Task, but I changed it to
per-instance for exactly the reason you highlighted.
B) It's worth noting that transactionality is only enabled under EOS, and
in the default mode of operation (ALOS), there should be no change in
behavior at all. I think, under EOS, we can mitigate the impact on users by
sufficiently low default values for the memory bound configuration. I
understand your hesitation to include a significant change of behaviour,
especially in a minor release, but I suspect that most users will prefer
the memory impact (under EOS) to the existing behaviour of frequent state
restorations! If this is a problem, the changes can wait until the next
major release. I'll be running a patched version of streams in production
with these changes as soon as they're ready, so it won't disrupt me :-D
C) The main purpose of this sentence was just to note that some changes
will need to be made to the way Segments are handled in order to ensure
they also benefit from transactions. At the time I wrote it, I hadn't
figured out the specific changes necessary, so it was deliberately vague.
This is the one outstanding problem I'm currently working on, and I'll
update this section with more detail once I have figured out the exact
changes required.
D) newTransaction() provides the necessary isolation guarantees. While the
RocksDB implementation of transactions doesn't technically *need* read-only
users to call newTransaction(), other implementations (e.g. a hypothetical
PostgresStore) may require it. Calling newTransaction() when no transaction
is necessary is essentially free, as it will just return this.
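This contract can be sketched as below (interface and method names are simplified stand-ins, not the final KIP API): read-only callers such as IQ invoke newTransaction() unconditionally, and a store that needs no isolation satisfies it by returning itself, making the call essentially free.

```java
interface TransactionalStore {
    String get(String key);

    // Default for stores that need no read isolation: the "transaction"
    // is just the store itself, so this call costs nothing.
    default TransactionalStore newTransaction() {
        return this;
    }
}

// A store with no isolation requirements; it inherits the free default.
class SimpleStore implements TransactionalStore {
    public String get(String key) { return "value-for-" + key; }
}
```

An implementation backed by something like the hypothetical PostgresStore would instead override newTransaction() to open a real transaction.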

I didn't do any profiling of the KIP-844 PoC, but I think it should be
fairly obvious where the performance problems stem from: writes under
KIP-844 require three extra memory copies: one to encode the record with the
tombstone/record flag, one to decode it from the tombstone/record flag, and
one to copy the record from the "temporary" store to the "main" store, when
the transaction commits. The different approach taken by KIP-892 should perform
much better, as it avoids all these copies, and may actually perform
slightly better than trunk, due to batched writes in RocksDB performing
better than non-batched writes.[1]
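For illustration, a toy codec (not the actual KIP-844 code) for the tombstone/record flag described above shows where the encode and decode copies come from: every stored value carries a one-byte flag, costing one array copy to add and another to strip.

```java
import java.util.Arrays;

class MarkerCodec {
    static final byte VALUE = 0;
    static final byte TOMBSTONE = 1;

    static byte[] encode(byte[] value) {
        if (value == null) {
            return new byte[] { TOMBSTONE }; // a delete becomes a flagged record
        }
        byte[] out = new byte[value.length + 1]; // extra copy #1
        out[0] = VALUE;
        System.arraycopy(value, 0, out, 1, value.length);
        return out;
    }

    static byte[] decode(byte[] stored) {
        if (stored[0] == TOMBSTONE) {
            return null;
        }
        return Arrays.copyOfRange(stored, 1, stored.length); // extra copy #2
    }
}
```

The third copy in the list above comes from moving each decoded record from the temporary store into the main store at commit time.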

Regards,
Nick

1: https://github.com/adamretter/rocksjava-write-methods-benchmark#results

On Mon, 2 Jan 2023 at 16:18, Lucas Brutschy 
wrote:

> Hi Nick,
>
> I'm just starting to read up on the whole discussion about KIP-892 and
> KIP-844. Thanks a lot for your work on this, I do think
> `WriteBatchWithIndex` may be the way to go here. I do have some
> questions about the latest draft.
>
>  A) If I understand correctly, you propose to put a bound on the
> (native) memory consumed by each task. However, I wonder if this is
> sufficient if we have temporary imbalances in the cluster. For
> example, depending on the timing of rebalances during a cluster
> restart, it could happen that a single streams node is assigned a lot
> more tasks than expected. With your proposed change, this would mean
> that the memory required by this one node could be a multiple of what
> is required during normal operation. I wonder if it wouldn't be safer
> to put a global bound on the memory use, across all tasks.
>  B) Generally, the memory concerns still give me the feeling that this
> should not be enabled by default for all users in a minor release.
>  C) In section "Transaction Management": the sentence "A similar
> analogue will be created to automatically manage `Segment`
> transactions.". Maybe this is just me lacking some background, but I
> do not understand this, it would be great if you could clarify what
> you mean here.
>  D) Could you please clarify why IQ has to call newTransaction(), when
> it's read-only.
>
> And one last thing not strictly related to your KIP: if there is an
> easy way for you to find out why the KIP-844 PoC is 20x slower (e.g.
> by providing a flame graph), that would be quite interesting.
>
> Cheers,
> Lucas
>
> On Thu, Dec 22, 2022 at 8:30 PM Nick Telford 
> wrote:
> >
> > Hi everyone,
> >
> > I've updated the KIP with a more detailed design, which reflects the
> > implementation I've been working on:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
> >
> > This new design should address the outstanding points already made in the
> > thread.
> >
> > Please let me know if there are areas that are unclear or need more
> > clarification.
> >
> > I have a (nearly) working implementation. I'm confident that the
> remaining
> > work (making Segments

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-12-22 Thread Nick Telford
Hi everyone,

I've updated the KIP with a more detailed design, which reflects the
implementation I've been working on:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores

This new design should address the outstanding points already made in the
thread.

Please let me know if there are areas that are unclear or need more
clarification.

I have a (nearly) working implementation. I'm confident that the remaining
work (making Segments behave) will not impact the documented design.

Regards,

Nick

On Tue, 6 Dec 2022 at 19:24, Colt McNealy  wrote:

> Nick,
>
> Thank you for the reply; that makes sense. I was hoping that, since reading
> uncommitted records from IQ in EOS isn't part of the documented API, maybe
> you *wouldn't* have to wait for the next major release to make that change;
> but given that it would be considered a major change, I like your approach
> the best.
>
> Wishing you a speedy recovery and happy coding!
>
> Thanks,
> Colt McNealy
> *Founder, LittleHorse.io*
>
>
> On Tue, Dec 6, 2022 at 10:30 AM Nick Telford 
> wrote:
>
> > Hi Colt,
> >
> > 10: Yes, I agree it's not ideal. I originally intended to try to keep the
> > behaviour unchanged as much as possible, otherwise we'd have to wait for
> a
> > major version release to land these changes.
> > 20: Good point, ALOS doesn't need the same level of guarantee, and the
> > typically longer commit intervals would be problematic when reading only
> > "committed" records.
> >
> > I've been away for 5 days recovering from minor surgery, but I spent a
> > considerable amount of that time working through ideas for possible
> > solutions in my head. I think your suggestion of keeping ALOS as-is, but
> > buffering writes for EOS is the right path forwards, although I have a
> > solution that both expands on this, and provides for some more formal
> > guarantees.
> >
> > Essentially, adding support to KeyValueStores for "Transactions", with
> > clearly defined IsolationLevels. Using "Read Committed" when under EOS,
> and
> > "Read Uncommitted" under ALOS.
> >
> > The nice thing about this approach is that it gives us much more clearly
> > defined isolation behaviour that can be properly documented to ensure
> users
> > know what to expect.
> >
> > I'm still working out the kinks in the design, and will update the KIP
> when
> > I have something. The main struggle is trying to implement this without
> > making any major changes to the existing interfaces or breaking existing
> > implementations, because currently everything expects to operate directly
> > on a StateStore, and not a Transaction of that store. I think I'm getting
> > close, although sadly I won't be able to progress much until next week
> due
> > to some work commitments.
> >
> > Regards,
> > Nick
> >
> > On Thu, 1 Dec 2022 at 00:01, Colt McNealy  wrote:
> >
> > > Nick,
> > >
> > > Thank you for the explanation, and also for the updated KIP. I am quite
> > > eager for this improvement to be released as it would greatly reduce
> the
> > > operational difficulties of EOS streams apps.
> > >
> > > Two questions:
> > >
> > > 10)
> > > >When reading records, we will use the
> > > WriteBatchWithIndex#getFromBatchAndDB
> > >  and WriteBatchWithIndex#newIteratorWithBase utilities in order to
> ensure
> > > that uncommitted writes are available to query.
> > > Why do extra work to enable the reading of uncommitted writes during
> IQ?
> > > Code complexity aside, reading uncommitted writes is, in my opinion, a
> > > minor flaw in EOS IQ; it would be very nice to have the guarantee that,
> > > with EOS, IQ only reads committed records. In order to avoid dirty
> reads,
> > > one currently must query a standby replica (but this still doesn't
> fully
> > > guarantee monotonic reads).
> > >
> > > 20) Is it also necessary to enable this optimization on ALOS stores?
> The
> > > motivation of KIP-844 was mainly to reduce the need to restore state
> from
> > > scratch on unclean EOS shutdowns; with ALOS it was acceptable to accept
> > > that there may have been uncommitted writes on disk. On a side note, if
> > you
> > > enable this type of store on ALOS processors, the community would
> > > definitely want to enable queries on dirty reads; otherwise users would
> > > have to wait 30 seconds (default) to see an update.
> > >
> > > Thank you f

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-12-06 Thread Nick Telford
Hi Colt,

10: Yes, I agree it's not ideal. I originally intended to try to keep the
behaviour unchanged as much as possible, otherwise we'd have to wait for a
major version release to land these changes.
20: Good point, ALOS doesn't need the same level of guarantee, and the
typically longer commit intervals would be problematic when reading only
"committed" records.

I've been away for 5 days recovering from minor surgery, but I spent a
considerable amount of that time working through ideas for possible
solutions in my head. I think your suggestion of keeping ALOS as-is, but
buffering writes for EOS is the right path forwards, although I have a
solution that both expands on this, and provides for some more formal
guarantees.

Essentially, adding support to KeyValueStores for "Transactions", with
clearly defined IsolationLevels. Using "Read Committed" when under EOS, and
"Read Uncommitted" under ALOS.

The nice thing about this approach is that it gives us much more clearly
defined isolation behaviour that can be properly documented to ensure users
know what to expect.
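A minimal sketch of that mode-to-isolation mapping (enum and class names are hypothetical, not the final API): EOS reads only committed data, while ALOS keeps today's read-uncommitted behaviour.

```java
enum ProcessingMode { AT_LEAST_ONCE, EXACTLY_ONCE }
enum IsolationLevel { READ_UNCOMMITTED, READ_COMMITTED }

class StoreIsolation {
    // Under EOS, uncommitted writes must not be visible (e.g. to
    // interactive queries); under ALOS, reads pass straight through.
    static IsolationLevel forMode(ProcessingMode mode) {
        return mode == ProcessingMode.EXACTLY_ONCE
                ? IsolationLevel.READ_COMMITTED
                : IsolationLevel.READ_UNCOMMITTED;
    }
}
```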

I'm still working out the kinks in the design, and will update the KIP when
I have something. The main struggle is trying to implement this without
making any major changes to the existing interfaces or breaking existing
implementations, because currently everything expects to operate directly
on a StateStore, and not a Transaction of that store. I think I'm getting
close, although sadly I won't be able to progress much until next week due
to some work commitments.

Regards,
Nick

On Thu, 1 Dec 2022 at 00:01, Colt McNealy  wrote:

> Nick,
>
> Thank you for the explanation, and also for the updated KIP. I am quite
> eager for this improvement to be released as it would greatly reduce the
> operational difficulties of EOS streams apps.
>
> Two questions:
>
> 10)
> >When reading records, we will use the
> WriteBatchWithIndex#getFromBatchAndDB
>  and WriteBatchWithIndex#newIteratorWithBase utilities in order to ensure
> that uncommitted writes are available to query.
> Why do extra work to enable the reading of uncommitted writes during IQ?
> Code complexity aside, reading uncommitted writes is, in my opinion, a
> minor flaw in EOS IQ; it would be very nice to have the guarantee that,
> with EOS, IQ only reads committed records. In order to avoid dirty reads,
> one currently must query a standby replica (but this still doesn't fully
> guarantee monotonic reads).
>
> 20) Is it also necessary to enable this optimization on ALOS stores? The
> motivation of KIP-844 was mainly to reduce the need to restore state from
> scratch on unclean EOS shutdowns; with ALOS it was acceptable to accept
> that there may have been uncommitted writes on disk. On a side note, if you
> enable this type of store on ALOS processors, the community would
> definitely want to enable queries on dirty reads; otherwise users would
> have to wait 30 seconds (default) to see an update.
>
> Thank you for doing this fantastic work!
> Colt McNealy
> *Founder, LittleHorse.io*
>
>
> On Wed, Nov 30, 2022 at 10:44 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > I've drastically reduced the scope of this KIP to no longer include the
> > StateStore management of checkpointing. This can be added as a KIP later
> on
> > to further optimize the consistency and performance of state stores.
> >
> > I've also added a section discussing some of the concerns around
> > concurrency, especially in the presence of Iterators. I'm thinking of
> > wrapping WriteBatchWithIndex with a reference-counting copy-on-write
> > implementation (that only makes a copy if there's an active iterator),
> but
> > I'm open to suggestions.
> >
> > Regards,
> > Nick
> >
> > On Mon, 28 Nov 2022 at 16:36, Nick Telford 
> wrote:
> >
> > > Hi Colt,
> > >
> > > I didn't do any profiling, but the 844 implementation:
> > >
> > >- Writes uncommitted records to a temporary RocksDB instance
> > >   - Since tombstones need to be flagged, all record values are
> > >   prefixed with a value/tombstone marker. This necessitates a
> memory
> > copy.
> > >- On-commit, iterates all records in this temporary instance and
> > >writes them to the main RocksDB store.
> > >- While iterating, the value/tombstone marker needs to be parsed and
> > >the real value extracted. This necessitates another memory copy.
> > >
> > > My guess is that the cost of iterating the temporary RocksDB store is
> the
> > > major factor, with the 2 extra memory copies per-Record contributing a
> > > significant amount too.
> 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-30 Thread Nick Telford
Hi everyone,

I've drastically reduced the scope of this KIP to no longer include the
StateStore management of checkpointing. This can be added as a KIP later on
to further optimize the consistency and performance of state stores.

I've also added a section discussing some of the concerns around
concurrency, especially in the presence of Iterators. I'm thinking of
wrapping WriteBatchWithIndex with a reference-counting copy-on-write
implementation (that only makes a copy if there's an active iterator), but
I'm open to suggestions.
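As background, the read-your-writes behaviour that WriteBatchWithIndex provides via getFromBatchAndDB can be modelled with plain maps (a simplified sketch, ignoring tombstones and iterators): reads consult the uncommitted batch first and fall back to the underlying DB, so a task sees its own writes before they are committed.

```java
import java.util.HashMap;
import java.util.Map;

class BatchOverlayStore {
    final Map<String, String> db = new HashMap<>();     // committed state
    final Map<String, String> batch = new HashMap<>();  // uncommitted writes

    void put(String key, String value) { batch.put(key, value); }

    // Batch wins over DB, mirroring getFromBatchAndDB.
    String getFromBatchAndDb(String key) {
        return batch.containsKey(key) ? batch.get(key) : db.get(key);
    }

    void commit() {
        db.putAll(batch); // in RocksDB this is one atomic write of the batch
        batch.clear();
    }
}
```

The concurrency concern above is exactly that an open iterator over this overlay observes the batch while a commit mutates it, hence the copy-on-write idea.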

Regards,
Nick

On Mon, 28 Nov 2022 at 16:36, Nick Telford  wrote:

> Hi Colt,
>
> I didn't do any profiling, but the 844 implementation:
>
>- Writes uncommitted records to a temporary RocksDB instance
>   - Since tombstones need to be flagged, all record values are
>   prefixed with a value/tombstone marker. This necessitates a memory copy.
>- On-commit, iterates all records in this temporary instance and
>writes them to the main RocksDB store.
>- While iterating, the value/tombstone marker needs to be parsed and
>the real value extracted. This necessitates another memory copy.
>
> My guess is that the cost of iterating the temporary RocksDB store is the
> major factor, with the 2 extra memory copies per-Record contributing a
> significant amount too.
>
> Regards,
> Nick
>
> On Mon, 28 Nov 2022 at 16:12, Colt McNealy  wrote:
>
>> Hi all,
>>
>> Out of curiosity, why does the performance of the store degrade so
>> significantly with the 844 implementation? I wouldn't be too surprised by
>> a
>> 50-60% drop (caused by each record being written twice), but 96% is
>> extreme.
>>
>> The only thing I can think of which could create such a bottleneck would
>> be
>> that perhaps the 844 implementation deserializes and then re-serializes
>> the
>> store values when copying from the uncommitted to committed store, but I
>> wasn't able to figure that out when I scanned the PR.
>>
>> Colt McNealy
>> *Founder, LittleHorse.io*
>>
>>
>> On Mon, Nov 28, 2022 at 7:56 AM Nick Telford 
>> wrote:
>>
>> > Hi everyone,
>> >
>> > I've updated the KIP to resolve all the points that have been raised so
>> > far, with one exception: the ALOS default commit interval of 5 minutes
>> is
>> > likely to cause WriteBatchWithIndex memory to grow too large.
>> >
>> > There's a couple of different things I can think of to solve this:
>> >
>> >- We already have a memory/record limit in the KIP to prevent OOM
>> >errors. Should we choose a default value for these? My concern here
>> is
>> > that
>> >anything we choose might seem rather arbitrary. We could change
>> >its behaviour such that under ALOS, it only triggers the commit of
>> the
>> >StateStore, but under EOS, it triggers a commit of the Kafka
>> > transaction.
>> >- We could introduce a separate `checkpoint.interval.ms` to allow
>> ALOS
>> >    to commit the StateStores more frequently than the general
>> >commit.interval.ms? My concern here is that the semantics of this
>> > config
>> >would depend on the processing.mode; under ALOS it would allow more
>> >frequently committing stores, whereas under EOS it couldn't.
>> >
>> > Any better ideas?
>> >
>> > On Wed, 23 Nov 2022 at 16:25, Nick Telford 
>> wrote:
>> >
>> > > Hi Alex,
>> > >
>> > > Thanks for the feedback.
>> > >
>> > > I've updated the discussion of OOM issues by describing how we'll
>> handle
>> > > it. Here's the new text:
>> > >
>> > > To mitigate this, we will automatically force a Task commit if the
>> total
>> > >> uncommitted records returned by
>> > >> StateStore#approximateNumUncommittedEntries()  exceeds a threshold,
>> > >> configured by max.uncommitted.state.entries.per.task; or the total
>> > >> memory used for buffering uncommitted records returned by
>> > >> StateStore#approximateNumUncommittedBytes() exceeds the threshold
>> > >> configured by max.uncommitted.state.bytes.per.task. This will roughly
>> > >> bound the memory required per-Task for buffering uncommitted records,
>> > >> irrespective of the commit.interval.ms, and will effectively bound
>> the
>> > >> number of records that will need to be restored in the event of a
>> > failure.
>> > >>
>> > >
>> > >

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-28 Thread Nick Telford
Hi Colt,

I didn't do any profiling, but the 844 implementation:

   - Writes uncommitted records to a temporary RocksDB instance
      - Since tombstones need to be flagged, all record values are prefixed
        with a value/tombstone marker. This necessitates a memory copy.
   - On-commit, iterates all records in this temporary instance and writes
     them to the main RocksDB store.
   - While iterating, the value/tombstone marker needs to be parsed and the
     real value extracted. This necessitates another memory copy.

My guess is that the cost of iterating the temporary RocksDB store is the
major factor, with the 2 extra memory copies per-Record contributing a
significant amount too.
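A minimal sketch of that marker encoding may make the two copies concrete. The marker bytes and method names here are illustrative, not the actual KIP-844 wire format:

```java
// Toy illustration of KIP-844-style value encoding: since the temporary
// store cannot represent "null" to mean a tombstone, every value is
// prefixed with a marker byte, costing one array copy on write and
// another on read. Marker values are illustrative only.
import java.util.Arrays;

public class TombstoneMarker {
    static final byte VALUE = 0, TOMBSTONE = 1;

    static byte[] encode(byte[] value) {
        if (value == null) return new byte[] { TOMBSTONE };
        byte[] out = new byte[value.length + 1];          // extra allocation
        out[0] = VALUE;
        System.arraycopy(value, 0, out, 1, value.length); // memory copy #1
        return out;
    }

    static byte[] decode(byte[] stored) {
        if (stored[0] == TOMBSTONE) return null;
        return Arrays.copyOfRange(stored, 1, stored.length); // memory copy #2
    }

    public static void main(String[] args) {
        byte[] v = "hello".getBytes();
        System.out.println(Arrays.equals(decode(encode(v)), v)); // round-trips
        System.out.println(decode(encode(null)) == null);        // tombstone round-trips
    }
}
```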

Regards,
Nick

On Mon, 28 Nov 2022 at 16:12, Colt McNealy  wrote:

> Hi all,
>
> Out of curiosity, why does the performance of the store degrade so
> significantly with the 844 implementation? I wouldn't be too surprised by a
> 50-60% drop (caused by each record being written twice), but 96% is
> extreme.
>
> The only thing I can think of which could create such a bottleneck would be
> that perhaps the 844 implementation deserializes and then re-serializes the
> store values when copying from the uncommitted to committed store, but I
> wasn't able to figure that out when I scanned the PR.
>
> Colt McNealy
> *Founder, LittleHorse.io*
>
>
> On Mon, Nov 28, 2022 at 7:56 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > I've updated the KIP to resolve all the points that have been raised so
> > far, with one exception: the ALOS default commit interval of 5 minutes is
> > likely to cause WriteBatchWithIndex memory to grow too large.
> >
> > There's a couple of different things I can think of to solve this:
> >
> >- We already have a memory/record limit in the KIP to prevent OOM
> >errors. Should we choose a default value for these? My concern here is
> > that
> >anything we choose might seem rather arbitrary. We could change
> >its behaviour such that under ALOS, it only triggers the commit of the
> >StateStore, but under EOS, it triggers a commit of the Kafka
> > transaction.
> >- We could introduce a separate `checkpoint.interval.ms` to allow
> ALOS
> >to commit the StateStores more frequently than the general
> >commit.interval.ms? My concern here is that the semantics of this
> > config
> >would depend on the processing.mode; under ALOS it would allow more
> >frequently committing stores, whereas under EOS it couldn't.
> >
> > Any better ideas?
> >
> > On Wed, 23 Nov 2022 at 16:25, Nick Telford 
> wrote:
> >
> > > Hi Alex,
> > >
> > > Thanks for the feedback.
> > >
> > > I've updated the discussion of OOM issues by describing how we'll
> handle
> > > it. Here's the new text:
> > >
> > > To mitigate this, we will automatically force a Task commit if the
> total
> > >> uncommitted records returned by
> > >> StateStore#approximateNumUncommittedEntries()  exceeds a threshold,
> > >> configured by max.uncommitted.state.entries.per.task; or the total
> > >> memory used for buffering uncommitted records returned by
> > >> StateStore#approximateNumUncommittedBytes() exceeds the threshold
> > >> configured by max.uncommitted.state.bytes.per.task. This will roughly
> > >> bound the memory required per-Task for buffering uncommitted records,
> > >> irrespective of the commit.interval.ms, and will effectively bound
> the
> > >> number of records that will need to be restored in the event of a
> > failure.
> > >>
> > >
> > >
> > > These limits will be checked in StreamTask#process and a premature
> commit
> > >> will be requested via Task#requestCommit().
> > >>
> > >
> > >
> > > Note that these new methods provide default implementations that ensure
> > >> existing custom stores and non-transactional stores (e.g.
> > >> InMemoryKeyValueStore) do not force any early commits.
> > >
> > >
> > > I've chosen to have the StateStore expose approximations of its buffer
> > > size/count instead of opaquely requesting a commit in order to delegate
> > the
> > > decision making to the Task itself. This enables Tasks to look at *all*
> > of
> > > their StateStores, and determine whether an early commit is necessary.
> > > Notably, it enables per-Task thresholds, instead of per-Store, which
> > > prevents Tasks with many StateStores from using much more memory than
> > Tasks
> > > 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-28 Thread Nick Telford
For now, I've settled on choosing an arbitrary default memory limit of
64MiB per-Task for buffering uncommitted records. I noticed that Kafka
Streams already provides some arbitrary default configuration of RocksDB
memory settings (e.g. memtable size), and that many users will already
be explicitly configuring this for their purposes.

I think a further optimization for ALOS, committing the StateStores only when
this limit is exceeded, is reasonable, as it preserves the user's desired
commit.interval.ms as much as possible.
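Using the config names proposed earlier in this thread, that default could be expressed as follows (values illustrative; 64 MiB = 67108864 bytes):

```properties
# Per-Task limits on buffered uncommitted state, per this KIP draft.
# The entries limit is left at its (effectively unbounded) default here.
max.uncommitted.state.bytes.per.task=67108864
```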

On Mon, 28 Nov 2022 at 15:55, Nick Telford  wrote:

> Hi everyone,
>
> I've updated the KIP to resolve all the points that have been raised so
> far, with one exception: the ALOS default commit interval of 5 minutes is
> likely to cause WriteBatchWithIndex memory to grow too large.
>
> There's a couple of different things I can think of to solve this:
>
>- We already have a memory/record limit in the KIP to prevent OOM
>errors. Should we choose a default value for these? My concern here is that
>anything we choose might seem rather arbitrary. We could change
>its behaviour such that under ALOS, it only triggers the commit of the
>StateStore, but under EOS, it triggers a commit of the Kafka transaction.
>- We could introduce a separate `checkpoint.interval.ms` to allow ALOS
>to commit the StateStores more frequently than the general
>commit.interval.ms? My concern here is that the semantics of this
>config would depend on the processing.mode; under ALOS it would allow more
>frequently committing stores, whereas under EOS it couldn't.
>
> Any better ideas?
>
> On Wed, 23 Nov 2022 at 16:25, Nick Telford  wrote:
>
>> Hi Alex,
>>
>> Thanks for the feedback.
>>
>> I've updated the discussion of OOM issues by describing how we'll handle
>> it. Here's the new text:
>>
>> To mitigate this, we will automatically force a Task commit if the total
>>> uncommitted records returned by
>>> StateStore#approximateNumUncommittedEntries()  exceeds a threshold,
>>> configured by max.uncommitted.state.entries.per.task; or the total
>>> memory used for buffering uncommitted records returned by
>>> StateStore#approximateNumUncommittedBytes() exceeds the threshold
>>> configured by max.uncommitted.state.bytes.per.task. This will roughly
>>> bound the memory required per-Task for buffering uncommitted records,
>>> irrespective of the commit.interval.ms, and will effectively bound the
>>> number of records that will need to be restored in the event of a failure.
>>>
>>
>>
>> These limits will be checked in StreamTask#process and a premature
>>> commit will be requested via Task#requestCommit().
>>>
>>
>>
>> Note that these new methods provide default implementations that ensure
>>> existing custom stores and non-transactional stores (e.g.
>>> InMemoryKeyValueStore) do not force any early commits.
>>
>>
>> I've chosen to have the StateStore expose approximations of its buffer
>> size/count instead of opaquely requesting a commit in order to delegate the
>> decision making to the Task itself. This enables Tasks to look at *all* of
>> their StateStores, and determine whether an early commit is necessary.
>> Notably, it enables per-Task thresholds, instead of per-Store, which
>> prevents Tasks with many StateStores from using much more memory than Tasks
>> with one StateStore. This makes sense, since commits are done by-Task, not
>> by-Store.
>>
>> Prizes* for anyone who can come up with a better name for the new config
>> properties!
>>
>> Thanks for pointing out the potential performance issues of WBWI. From
>> the benchmarks that user posted[1], it looks like WBWI still performs
>> considerably better than individual puts, which is the existing design, so
>> I'd actually expect a performance boost from WBWI, just not as great as
>> we'd get from a plain WriteBatch. This does suggest that a good
>> optimization would be to use a regular WriteBatch for restoration (in
>> RocksDBStore#restoreBatch), since we know that those records will never be
>> queried before they're committed.
>>
>> 1:
>> https://github.com/adamretter/rocksjava-write-methods-benchmark#results
>>
>> * Just kidding, no prizes, sadly.
>>
>> On Wed, 23 Nov 2022 at 12:28, Alexander Sorokoumov
>>  wrote:
>>
>>> Hey Nick,
>>>
>>> Thank you for the KIP! With such a significant performance degradation in
>>> the secondary store approach, we should definitely consider
>>> WriteBatchWithIndex. I also like encapsula

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-28 Thread Nick Telford
Hi everyone,

I've updated the KIP to resolve all the points that have been raised so
far, with one exception: the ALOS default commit interval of 5 minutes is
likely to cause WriteBatchWithIndex memory to grow too large.

There are a couple of different things I can think of to solve this:

   - We already have a memory/record limit in the KIP to prevent OOM
   errors. Should we choose a default value for these? My concern here is that
   anything we choose might seem rather arbitrary. We could change
   the limit's behaviour such that under ALOS, it only triggers the commit of the
   StateStore, but under EOS, it triggers a commit of the Kafka transaction.
   - We could introduce a separate `checkpoint.interval.ms` to allow ALOS
   to commit the StateStores more frequently than the general
   commit.interval.ms? My concern here is that the semantics of this config
   would depend on the processing.mode; under ALOS it would allow more
   frequently committing stores, whereas under EOS it couldn't.

Any better ideas?

On Wed, 23 Nov 2022 at 16:25, Nick Telford  wrote:

> Hi Alex,
>
> Thanks for the feedback.
>
> I've updated the discussion of OOM issues by describing how we'll handle
> it. Here's the new text:
>
> To mitigate this, we will automatically force a Task commit if the total
>> uncommitted records returned by
>> StateStore#approximateNumUncommittedEntries()  exceeds a threshold,
>> configured by max.uncommitted.state.entries.per.task; or the total
>> memory used for buffering uncommitted records returned by
>> StateStore#approximateNumUncommittedBytes() exceeds the threshold
>> configured by max.uncommitted.state.bytes.per.task. This will roughly
>> bound the memory required per-Task for buffering uncommitted records,
>> irrespective of the commit.interval.ms, and will effectively bound the
>> number of records that will need to be restored in the event of a failure.
>>
>
>
> These limits will be checked in StreamTask#process and a premature commit
>> will be requested via Task#requestCommit().
>>
>
>
> Note that these new methods provide default implementations that ensure
>> existing custom stores and non-transactional stores (e.g.
>> InMemoryKeyValueStore) do not force any early commits.
>
>
> I've chosen to have the StateStore expose approximations of its buffer
> size/count instead of opaquely requesting a commit in order to delegate the
> decision making to the Task itself. This enables Tasks to look at *all* of
> their StateStores, and determine whether an early commit is necessary.
> Notably, it enables per-Task thresholds, instead of per-Store, which
> prevents Tasks with many StateStores from using much more memory than Tasks
> with one StateStore. This makes sense, since commits are done by-Task, not
> by-Store.
>
> Prizes* for anyone who can come up with a better name for the new config
> properties!
>
> Thanks for pointing out the potential performance issues of WBWI. From the
> benchmarks that user posted[1], it looks like WBWI still performs
> considerably better than individual puts, which is the existing design, so
> I'd actually expect a performance boost from WBWI, just not as great as
> we'd get from a plain WriteBatch. This does suggest that a good
> optimization would be to use a regular WriteBatch for restoration (in
> RocksDBStore#restoreBatch), since we know that those records will never be
> queried before they're committed.
>
> 1: https://github.com/adamretter/rocksjava-write-methods-benchmark#results
>
> * Just kidding, no prizes, sadly.
>
> On Wed, 23 Nov 2022 at 12:28, Alexander Sorokoumov
>  wrote:
>
>> Hey Nick,
>>
>> Thank you for the KIP! With such a significant performance degradation in
>> the secondary store approach, we should definitely consider
>> WriteBatchWithIndex. I also like encapsulating checkpointing inside the
>> default state store implementation to improve performance.
>>
>> +1 to John's comment to keep the current checkpointing as a fallback
>> mechanism. We want to keep existing users' workflows intact if we can. A
>> non-intrusive way would be to add a separate StateStore method, say,
>> StateStore#managesCheckpointing(), that controls whether the state store
>> implementation owns checkpointing.
>>
>> I think that a solution to the transactional writes should address the
>> OOMEs. One possible way to address that is to wire StateStore's commit
>> request by adding, say, StateStore#commitNeeded that is checked in
>> StreamTask#commitNeeded via the corresponding ProcessorStateManager. With
>> that change, RocksDBStore will have to track the current transaction size
>> and request a commit when the size goes over a (configurable

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-23 Thread Nick Telford
Hi Alex,

Thanks for the feedback.

I've updated the discussion of OOM issues by describing how we'll handle
it. Here's the new text:

To mitigate this, we will automatically force a Task commit if the total
> uncommitted records returned by
> StateStore#approximateNumUncommittedEntries()  exceeds a threshold,
> configured by max.uncommitted.state.entries.per.task; or the total memory
> used for buffering uncommitted records returned by
> StateStore#approximateNumUncommittedBytes() exceeds the threshold
> configured by max.uncommitted.state.bytes.per.task. This will roughly
> bound the memory required per-Task for buffering uncommitted records,
> irrespective of the commit.interval.ms, and will effectively bound the
> number of records that will need to be restored in the event of a failure.
>


These limits will be checked in StreamTask#process and a premature commit
> will be requested via Task#requestCommit().
>


Note that these new methods provide default implementations that ensure
> existing custom stores and non-transactional stores (e.g.
> InMemoryKeyValueStore) do not force any early commits.


I've chosen to have the StateStore expose approximations of its buffer
size/count instead of opaquely requesting a commit in order to delegate the
decision making to the Task itself. This enables Tasks to look at *all* of
their StateStores, and determine whether an early commit is necessary.
Notably, it enables per-Task thresholds, instead of per-Store, which
prevents Tasks with many StateStores from using much more memory than Tasks
with one StateStore. This makes sense, since commits are done by-Task, not
by-Store.
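To make the shape of that concrete, here is a rough sketch, not the actual Streams internals: the surrounding class, threshold parameters and commitNeeded helper are stand-ins, while the two approximation methods are the ones proposed in the KIP:

```java
// Sketch: a Task aggregates each store's approximations and decides
// whether an early commit is needed. Thresholds are per-Task, matching
// the by-Task commit granularity.
import java.util.List;

public class CommitCheckSketch {
    interface StateStore {
        // Defaults ensure non-transactional stores never force early commits.
        default long approximateNumUncommittedEntries() { return 0L; }
        default long approximateNumUncommittedBytes() { return 0L; }
    }

    static boolean commitNeeded(List<StateStore> stores, long maxEntries, long maxBytes) {
        long entries = 0, bytes = 0;
        for (StateStore s : stores) {
            entries += s.approximateNumUncommittedEntries();
            bytes += s.approximateNumUncommittedBytes();
        }
        return entries > maxEntries || bytes > maxBytes;
    }

    public static void main(String[] args) {
        StateStore buffered = new StateStore() {
            public long approximateNumUncommittedEntries() { return 10; }
            public long approximateNumUncommittedBytes() { return 1024; }
        };
        StateStore empty = new StateStore() {};
        System.out.println(commitNeeded(List.of(buffered, empty), 100, 64L << 20)); // false
        System.out.println(commitNeeded(List.of(buffered, empty), 5, 64L << 20));   // true
    }
}
```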

Prizes* for anyone who can come up with a better name for the new config
properties!

Thanks for pointing out the potential performance issues of WBWI. From the
benchmarks that user posted[1], it looks like WBWI still performs
considerably better than individual puts, which is the existing design, so
I'd actually expect a performance boost from WBWI, just not as great as
we'd get from a plain WriteBatch. This does suggest that a good
optimization would be to use a regular WriteBatch for restoration (in
RocksDBStore#restoreBatch), since we know that those records will never be
queried before they're committed.

1: https://github.com/adamretter/rocksjava-write-methods-benchmark#results

* Just kidding, no prizes, sadly.

On Wed, 23 Nov 2022 at 12:28, Alexander Sorokoumov
 wrote:

> Hey Nick,
>
> Thank you for the KIP! With such a significant performance degradation in
> the secondary store approach, we should definitely consider
> WriteBatchWithIndex. I also like encapsulating checkpointing inside the
> default state store implementation to improve performance.
>
> +1 to John's comment to keep the current checkpointing as a fallback
> mechanism. We want to keep existing users' workflows intact if we can. A
> non-intrusive way would be to add a separate StateStore method, say,
> StateStore#managesCheckpointing(), that controls whether the state store
> implementation owns checkpointing.
>
> I think that a solution to the transactional writes should address the
> OOMEs. One possible way to address that is to wire StateStore's commit
> request by adding, say, StateStore#commitNeeded that is checked in
> StreamTask#commitNeeded via the corresponding ProcessorStateManager. With
> that change, RocksDBStore will have to track the current transaction size
> and request a commit when the size goes over a (configurable) threshold.
>
> AFAIU WriteBatchWithIndex might perform significantly slower than non-txn
> puts as the batch size grows [1]. We should have a configuration to fall
> back to the current behavior (and/or disable txn stores for ALOS) unless
> the benchmarks show negligible overhead for longer commits / large-enough
> batch sizes.
>
> If you prefer to keep the KIP smaller, I would rather cut out
> state-store-managed checkpointing rather than proper OOMe handling and
> being able to switch to non-txn behavior. The checkpointing is not
> necessary to solve the recovery-under-EOS problem. On the other hand, once
> WriteBatchWithIndex is in, it will be much easier to add
> state-store-managed checkpointing.
>
> If you share the current implementation, I am happy to help you address the
> OOMe and configuration parts as well as review and test the patch.
>
> Best,
> Alex
>
>
> 1. https://github.com/facebook/rocksdb/issues/608
>
> On Tue, Nov 22, 2022 at 6:31 PM Nick Telford 
> wrote:
>
> > Hi John,
> >
> > Thanks for the review and feedback!
> >
> > 1. Custom Stores: I've been mulling over this problem myself. As it
> stands,
> > custom stores would essentially lose checkpointing with no indication
> that
> > they're expected to make changes, besides a line in the 

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-22 Thread Nick Telford
he processing mode are orthogonal. A transactional store
> would serve ALOS just as well as a non-transactional one (if not better).
> Under ALOS, though, the default commit interval is five minutes, so the
> memory issue is far more pressing.
>
> As I see it, we have several options to resolve this point. We could
> demonstrate that transactional stores work just fine for ALOS and we can
> therefore just swap over unconditionally. We could also disable the
> transactional mechanism under ALOS so that stores operate just the same as
> they do today when run in ALOS mode. Finally, we could do the same as in
> KIP-844 and make transactional stores opt-in (it'd be better to avoid the
> extra opt-in mechanism, but it's a good get-out-of-jail-free card).
>
> 4. (minor point) Deprecation of methods
>
> You mentioned that the new `commit` method replaces flush,
> updateChangelogOffsets, and checkpoint. It seems to me that the point about
> atomicity and Position also suggests that it replaces the Position
> callbacks. However, the proposal only deprecates `flush`. Should we be
> deprecating other methods as well?
>
> Thanks again for the KIP! It's really nice that you and Alex will get the
> chance to collaborate on both directions so that we can get the best
> outcome for Streams and its users.
>
> -John
>
>
> On 2022/11/21 15:02:15 Nick Telford wrote:
> > Hi everyone,
> >
> > As I mentioned in the discussion thread for KIP-844, I've been working on
> > an alternative approach to achieving better transactional semantics for
> > Kafka Streams StateStores.
> >
> > I've published this separately as KIP-892: Transactional Semantics for
> > StateStores
> > <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
> >,
> > so that it can be discussed/reviewed separately from KIP-844.
> >
> > Alex: I'm especially interested in what you think!
> >
> > I have a nearly complete implementation of the changes outlined in this
> > KIP, please let me know if you'd like me to push them for review in
> advance
> > of a vote.
> >
> > Regards,
> >
> > Nick
> >
>


[DISCUSS] KIP-892: Transactional Semantics for StateStores

2022-11-21 Thread Nick Telford
Hi everyone,

As I mentioned in the discussion thread for KIP-844, I've been working on
an alternative approach to achieving better transactional semantics for
Kafka Streams StateStores.

I've published this separately as KIP-892: Transactional Semantics for
StateStores
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores>,
so that it can be discussed/reviewed separately from KIP-844.

Alex: I'm especially interested in what you think!

I have a nearly complete implementation of the changes outlined in this
KIP, please let me know if you'd like me to push them for review in advance
of a vote.

Regards,

Nick


Re: [DISCUSS] KIP-844: Transactional State Stores

2022-11-21 Thread Nick Telford
Hi Alex,

Thanks for getting back to me. I actually have most of a working
implementation already. I'm going to write it up as a new KIP, so that it
can be reviewed independently of KIP-844.

Hopefully, working together we can have it ready sooner.

I'll keep you posted on my progress.

Regards,
Nick

On Mon, 21 Nov 2022 at 11:25, Alexander Sorokoumov
 wrote:

> Hey Nick,
>
> Thank you for the prototype testing and benchmarking, and sorry for the
> late reply!
>
> I agree that it is worth revisiting the WriteBatchWithIndex approach. I
> will implement a fork of the current prototype that uses that mechanism to
> ensure transactionality and let you know when it is ready for
> review/testing in this ML thread.
>
> As for time estimates, I might not have enough time to finish the prototype
> in December, so it will probably be ready for review in January.
>
> Best,
> Alex
>
> On Fri, Nov 11, 2022 at 4:24 PM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > Sorry to dredge this up again. I've had a chance to start doing some
> > testing with the WIP Pull Request, and it appears as though the secondary
> > store solution performs rather poorly.
> >
> > In our testing, we had a non-transactional state store that would restore
> > (from scratch), at a rate of nearly 1,000,000 records/second. When we
> > switched it to a transactional store, it restored at a rate of less than
> > 40,000 records/second.
> >
> > I suspect the key issues here are having to copy the data out of the
> > temporary store and into the main store on-commit, and to a lesser
> extent,
> > the extra memory copies during writes.
> >
> > I think it's worth re-visiting the WriteBatchWithIndex solution, as it's
> > clear from the RocksDB post[1] on the subject that it's the recommended
> way
> > to achieve transactionality.
> >
> > The only issue you identified with this solution was that uncommitted
> > writes are required to entirely fit in-memory, and RocksDB recommends
> they
> > don't exceed 3-4MiB. If we do some back-of-the-envelope calculations, I
> > think we'll find that this will be a non-issue for all but the most
> extreme
> > cases, and for those, I think I have a fairly simple solution.
> >
> > Firstly, when EOS is enabled, the default commit.interval.ms is set to
> > 100ms, which provides fairly short intervals that uncommitted writes need
> > to be buffered in-memory. If we assume a worst case of 1024 byte records
> > (and for most cases, they should be much smaller), then 4MiB would hold
> > ~4096 records, which with 100ms commit intervals is a throughput of
> > approximately 40,960 records/second. This seems quite reasonable.
> >
> > For use cases that wouldn't reasonably fit in-memory, my suggestion is
> that
> > we have a mechanism that tracks the number/size of uncommitted records in
> > stores, and prematurely commits the Task when this size exceeds a
> > configured threshold.
> >
> > Thanks for your time, and let me know what you think!
> > --
> > Nick
> >
> > 1: https://rocksdb.org/blog/2015/02/27/write-batch-with-index.html
> >
> > On Thu, 6 Oct 2022 at 19:31, Alexander Sorokoumov
> >  wrote:
> >
> > > Hey Nick,
> > >
> > > It is going to be option c. Existing state is considered to be
> committed
> > > and there will be an additional RocksDB for uncommitted writes.
> > >
> > > I am out of office until October 24. I will update KIP and make sure
> that
> > > we have an upgrade test for that after coming back from vacation.
> > >
> > > Best,
> > > Alex
> > >
> > > On Thu, Oct 6, 2022 at 5:06 PM Nick Telford 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I realise this has already been voted on and accepted, but it
> occurred
> > to
> > > > me today that the KIP doesn't define the migration/upgrade path for
> > > > existing non-transactional StateStores that *become* transactional,
> > i.e.
> > > by
> > > > adding the transactional boolean to the StateStore constructor.
> > > >
> > > > What would be the result, when such a change is made to a Topology,
> > > without
> > > > explicitly wiping the application state?
> > > > a) An error.
> > > > b) Local state is wiped.
> > > > c) Existing RocksDB database is used as committed writes and new
> > RocksDB
> > > > database is created for uncommitted writes.
> > > > d) So

Streams PR review request

2022-11-18 Thread Nick Telford
Hi everyone,

I found a small performance improvement in Kafka Streams state stores,
would someone be able to review/merge it please?
https://github.com/apache/kafka/pull/12842

(I'm not sure if this is the correct forum for requesting a review/merge.
If it isn't, please let me know).

Regards,

Nick


Re: Streams: clarification needed, checkpoint vs. position files

2022-11-14 Thread Nick Telford
Thank you both, the key point I was missing was that position files track
the offsets of the *source* topics, whereas the checkpoint file tracks the
offset(s) of the changelog topics.

Do you know if I need to include interface/API changes to internal
classes/interfaces (those under the
org.apache.kafka.streams.processor.internals package) in a KIP, or are they
considered implementation details?

Cheers,
Nick

On Sat, 12 Nov 2022 at 03:59, John Roesler  wrote:

> Hi all,
>
> Just to clarify: there actually is a position file. It was a small detail
> of the IQv2 implementation to add it, otherwise a persistent store's
> position would be lost after a restart.
>
> Otherwise, Sophie is right on the money. The checkpoint refers to an
> offset in the changelog, while the position refers to offsets in the task's
> input topics. So they are similar in function and structure, but
> they refer to two different things.
>
> I agree that, given this, it doesn't seem like consolidating them (for
> example, into one file) would be worth it. It would make the code more
> complicated without deduping any information.
>
> I hope this helps, and look forward to what you're cooking up, Nick!
> -John
>
> On 2022/11/12 00:50:27 Sophie Blee-Goldman wrote:
> > Hey Nick,
> >
> > I haven't been following the new IQv2 work very closely so take this
> with a
> > grain of salt,
> > but as far as I'm aware there's no such thing as "position files" -- the
> > Position is just an
> > in-memory object and is related to a user's query against the state
> store,
> > whereas a
> > checkpoint file reflects the current state of the store ie how much of
> the
> > changelog it
> > contains.
> >
> > In other words while these might look like they do similar things, the
> > actual usage and
> > implementation of Positions vs checkpoint files is pretty much unrelated.
> > So I don't think
> > it would sense for Streams to try and consolidate these or replace one
> with
> > another.
> >
> > Hope this answers your question, and I'll ping John to make sure I'm not
> > misleading
> > you regarding the usage/intention of Positions
> >
> > Sophie
> >
> > On Fri, Nov 11, 2022 at 6:48 AM Nick Telford 
> wrote:
> >
> > > Hi everyone,
> > >
> > > I'm trying to understand how StateStores work internally for some
> changes
> > > that I plan to propose, and I'd like some clarification around
> checkpoint
> > > files and position files.
> > >
> > > It appears as though position files are relatively new, and were
> created as
> > > part of the IQv2 initiative, as a means to track the position of the
> local
> > > state store so that reads could be bound by particular positions?
> > >
> > > Checkpoint files look much older, and are managed by the Task itself
> > > (actually, ProcessorStateManager). It looks like this is used
> exclusively
> > > for determining a) whether to restore a store, and b) which offsets to
> > > restore from?
> > >
> > > If I've understood the above correctly, is there any scope to
> potentially
> > > replace checkpoint files with StateStore#position()?
> > >
> > > Regards,
> > >
> > > Nick
> > >
> >
>


Re: [DISCUSS] KIP-844: Transactional State Stores

2022-11-11 Thread Nick Telford
Hi everyone,

Sorry to dredge this up again. I've had a chance to start doing some
testing with the WIP Pull Request, and it appears as though the secondary
store solution performs rather poorly.

In our testing, we had a non-transactional state store that would restore
(from scratch) at a rate of nearly 1,000,000 records/second. When we
switched it to a transactional store, it restored at a rate of less than
40,000 records/second.

I suspect the key issues here are having to copy the data out of the
temporary store and into the main store on-commit, and to a lesser extent,
the extra memory copies during writes.

I think it's worth re-visiting the WriteBatchWithIndex solution, as it's
clear from the RocksDB post[1] on the subject that it's the recommended way
to achieve transactionality.

The only issue you identified with this solution was that uncommitted
writes are required to entirely fit in-memory, and RocksDB recommends they
don't exceed 3-4MiB. If we do some back-of-the-envelope calculations, I
think we'll find that this will be a non-issue for all but the most extreme
cases, and for those, I think I have a fairly simple solution.

Firstly, when EOS is enabled, the default commit.interval.ms is set to
100ms, which provides fairly short intervals that uncommitted writes need
to be buffered in-memory. If we assume a worst case of 1024 byte records
(and for most cases, they should be much smaller), then 4MiB would hold
~4096 records, which with 100ms commit intervals is a throughput of
approximately 40,960 records/second. This seems quite reasonable.
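
The arithmetic above can be sanity-checked in a few lines of Java (figures taken from this email; the record size is the assumed worst case):

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        // Figures from the email: 4 MiB uncommitted buffer, worst-case
        // 1024-byte records, 100 ms commit interval under EOS.
        long bufferBytes = 4L * 1024 * 1024;
        long recordBytes = 1024;
        long commitIntervalMs = 100;

        long recordsPerCommit = bufferBytes / recordBytes;       // 4096
        long commitsPerSecond = 1000 / commitIntervalMs;         // 10
        System.out.println(recordsPerCommit * commitsPerSecond); // 40960
    }
}
```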

For use cases that wouldn't reasonably fit in-memory, my suggestion is that
we have a mechanism that tracks the number/size of uncommitted records in
stores, and prematurely commits the Task when this size exceeds a
configured threshold.
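
As a minimal sketch of what that tracking could look like — all names here are illustrative, and none of them exist in Kafka Streams:

```java
// Illustrative sketch only: accumulate the byte size of uncommitted writes
// and let the task commit prematurely once a configured threshold is crossed.
public class UncommittedWriteTracker {
    private final long maxUncommittedBytes;
    private long uncommittedBytes = 0;

    public UncommittedWriteTracker(long maxUncommittedBytes) {
        this.maxUncommittedBytes = maxUncommittedBytes;
    }

    // Called for every write to a transactional store.
    public void recordWrite(byte[] key, byte[] value) {
        uncommittedBytes += key.length + (value == null ? 0 : value.length);
    }

    // Checked between records; a true result would trigger an early commit.
    public boolean shouldForceCommit() {
        return uncommittedBytes > maxUncommittedBytes;
    }

    // Reset once the task has committed.
    public void onCommit() {
        uncommittedBytes = 0;
    }
}
```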

Thanks for your time, and let me know what you think!
--
Nick

1: https://rocksdb.org/blog/2015/02/27/write-batch-with-index.html

On Thu, 6 Oct 2022 at 19:31, Alexander Sorokoumov
 wrote:

> Hey Nick,
>
> It is going to be option c. Existing state is considered to be committed
> and there will be an additional RocksDB for uncommitted writes.
>
> I am out of office until October 24. I will update KIP and make sure that
> we have an upgrade test for that after coming back from vacation.
>
> Best,
> Alex
>
> On Thu, Oct 6, 2022 at 5:06 PM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > I realise this has already been voted on and accepted, but it occurred to
> > me today that the KIP doesn't define the migration/upgrade path for
> > existing non-transactional StateStores that *become* transactional, i.e.
> by
> > adding the transactional boolean to the StateStore constructor.
> >
> > What would be the result, when such a change is made to a Topology,
> without
> > explicitly wiping the application state?
> > a) An error.
> > b) Local state is wiped.
> > c) Existing RocksDB database is used as committed writes and new RocksDB
> > database is created for uncommitted writes.
> > d) Something else?
> >
> > Regards,
> >
> > Nick
> >
> > On Thu, 1 Sept 2022 at 12:16, Alexander Sorokoumov
> >  wrote:
> >
> > > Hey Guozhang,
> > >
> > > Sounds good. I annotated all added StateStore methods (commit, recover,
> > > transactional) with @Evolving.
> > >
> > > Best,
> > > Alex
> > >
> > >
> > >
> > > On Wed, Aug 31, 2022 at 7:32 PM Guozhang Wang 
> > wrote:
> > >
> > > > Hello Alex,
> > > >
> > > > Thanks for the detailed replies, I think that makes sense, and in the
> > > long
> > > > run we would need some public indicators from StateStore to determine
> > if
> > > > checkpoints can really be used to indicate clean snapshots.
> > > >
> > > > As for the @Evolving label, I think we can still keep it but for a
> > > > different reason, since as we add more state management
> functionalities
> > > in
> > > > the near future we may need to revisit the public APIs again and
> hence
> > > > keeping it as @Evolving would allow us to modify if necessary, in an
> > > easier
> > > > path than deprecate -> delete after several minor releases.
> > > >
> > > > Besides that, I have no further comments about the KIP.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Fri, Aug 26, 2022 at 1:51 AM Alexander Sorokoumov
> > > >  wrote:
> > > >
> > > > > Hey Guozhang,
> > > > >
> > > > >
> > > > > I think that we will have 

Streams: clarification needed, checkpoint vs. position files

2022-11-11 Thread Nick Telford
Hi everyone,

I'm trying to understand how StateStores work internally for some changes
that I plan to propose, and I'd like some clarification around checkpoint
files and position files.

It appears as though position files are relatively new, and were created as
part of the IQv2 initiative, as a means to track the position of the local
state store so that reads could be bound by particular positions?

Checkpoint files look much older, and are managed by the Task itself
(actually, ProcessorStateManager). It looks like this is used exclusively
for determining a) whether to restore a store, and b) which offsets to
restore from?

If I've understood the above correctly, is there any scope to potentially
replace checkpoint files with StateStore#position()?

Regards,

Nick


Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

2022-10-12 Thread Nick Telford
Can't wait!
+1 (non-binding)

On Wed, 12 Oct 2022, 18:02 Guozhang Wang, 
wrote:

> Hello all,
>
> I'd like to start a vote for the following KIP, aiming to improve Kafka
> Stream's restoration visibility via new metrics and callback methods:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>
>
> Thanks!
> -- Guozhang
>


Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

2022-10-11 Thread Nick Telford
Hi Guozhang,

What you propose sounds good to me. Having the more detailed Task-level
metrics at DEBUG makes sense.

Regards,

Nick


Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

2022-10-10 Thread Nick Telford
Hi Guozhang,

The metrics you've described are currently "thread-level" metrics. I know
you're planning on splitting out the restoration from the StreamThread, so
I question the utility of some of these metrics being thread-level, as they
won't correlate with other thread-level metrics, which correspond to a
StreamThread.

Perhaps the record counting metrics would be more useful as
task/store-level metrics? That way, they can be aggregated by users to
determine things like total records remaining to restore by store, across
the entire app etc.

Regards,

Nick

On Thu, 22 Sept 2022 at 13:21, Bruno Cadonna  wrote:

> Hi Guozhang!
>
> Thanks for the updates!
>
> 1.
> Why do you distinguish between active and standby for the total amount
> of restored/updated records but not for the rate of restored/updated
> records?
>
> 2.
> Regarding "standby-records-remaining" I am not sure how useful this
> metric is and I am not sure how hard it will be to record. I see the
> usefulness of monitoring the lag of a single standby state store, but I
> am not sure how useful it is to monitor the "lag" of a state updater
> thread that might potentially contain multiple state stores.
> Furthermore, do we need to issue a remote call to record this metric or
> do we get this information from each poll?
>
> 3.
> Could you please give an example where "active-records-restored-total"
> and "standby-records-updated-total" are useful?
>
> Best,
> Bruno
>
> On 20.09.22 22:45, Guozhang Wang wrote:
> > Hello Bruno/Nick,
> >
> > I've updated the KIP wiki to reflect the incorporated comments from you,
> > please feel free to take another look and let me know what you think.
> >
> >
> > Guozhang
> >
> > On Tue, Sep 20, 2022 at 9:37 AM Guozhang Wang 
> wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for the reviews, and I think these are good suggestions. Note
> that
> >> currently the `restore-records-total/rate` would include both restoring
> >> active tasks as well as updating standby tasks, I think for your
> purposes
> >> you'd be more interested in active restoring tasks since they could
> >> complete, while updating standby tasks would not complete even if they
> have
> >> caught up with the active. At the same time, the changelog reader would
> >> only be restoring either active or standby at a given time, and active
> >> tasks has a higher priority such that as long as there is at least one
> >> active task still restoring, we would not restore any standby tasks.
> From
> >> your suggestion, I'm thinking that maybe I should break up the rate /
> total
> >> metric for active and standby tasks separately.
> >>
> >> For deriving estimated time remaining though, the `total` metric may not
> >> be helpful since they will not be "reset" after rebalances, i.e. they
> will
> >> be an ever-increasing number and record the total number of records for
> the
> >> lifetime of the app. But still, just the remaining records alone, with
> the
> >> time elapsed monitored by the apps, should be sufficient to get the
> >> estimated time remaining.
> >>
> >>
> >> Guozhang
> >>
> >>
> >> On Tue, Sep 20, 2022 at 3:10 AM Nick Telford 
> >> wrote:
> >>
> >>> Hi Guozhang,
> >>>
> >>> KIP looks great, I have one suggestion: in addition to
> >>> "restore-records-total", it would also be useful to track the number of
> >>> records *remaining*, that is, the records that have not yet been
> restored.
> >>> This is actually the metric I was attempting to implement in the
> >>> StateRestoreListener that bumped me up against KAFKA-10575 :-)
> >>>
> >>> With both a "restore-records-total" and a "restore-remaining-total" (or
> >>> similar) metric, it's possible to derive useful information like the
> >>> estimated time remaining for restoration (by dividing the remaining
> total
> >>> by the restoration rate).
> >>>
> >>> Regards,
> >>>
> >>> Nick
> >>>
> >>> On Mon, 19 Sept 2022 at 19:57, Guozhang Wang 
> wrote:
> >>>
> >>>> Hello Bruno,
> >>>>
> >>>> Thanks for your comments!
> >>>>
> >>>> 1. Regarding the metrics group name: originally I put
> >>>> "stream-state-metrics" as it's related to state store restorations,
> but


Re: [DISCUSS] KIP-844: Transactional State Stores

2022-10-06 Thread Nick Telford
t; > > > state
> > > > > > > > >> store. Maybe we need an additional method on the state
> store
> > > > > > interface
> > > > > > > > >> that returns the offset at which the state store is.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > In my opinion, we should include in the interface only the
> > > > > guarantees
> > > > > > > > that
> > > > > > > > > are necessary to preserve EOS without wiping the local
> state.
> > > > This
> > > > > > way,
> > > > > > > > we
> > > > > > > > > allow more room for possible implementations. Thanks to the
> > > > > > idempotency
> > > > > > > > of
> > > > > > > > > the changelog replay, it is "good enough" if
> > StateStore#recover
> > > > > > returns
> > > > > > > > the
> > > > > > > > > offset that is less than what it actually is. The only
> > > limitation
> > > > > > here
> > > > > > > is
> > > > > > > > > that the state store should never commit writes that are
> not
> > > yet
> > > > > > > > committed
> > > > > > > > > in Kafka changelog.
> > > > > > > > >
> > > > > > > > > Please let me know what you think about this. First of
> all, I
> > > am
> > > > > > > > relatively
> > > > > > > > > new to the codebase, so I might be wrong in my
> understanding
> > of
> > > > > > > > > how it works. Second, while writing this, it occurred to me
> > that
> > > > the
> > > > > > > > > StateStore#recover interface method is not straightforward
> as
> > > it
> > > > > can
> > > > > > > be.
> > > > > > > > > Maybe we can change it like that:
> > > > > > > > >
> > > > > > > > > /**
> > > > > > > > >  * Recover a transactional state store
> > > > > > > > >  * 
> > > > > > > > >  * If a transactional state store shut down with a
> crash
> > > > > failure,
> > > > > > > > this
> > > > > > > > > method ensures that the
> > > > > > > > >  * state store is in a consistent state that
> corresponds
> > to
> > > > > > {@code
> > > > > > > > > changelogOffset} or later.
> > > > > > > > >  *
> > > > > > > > >  * @param changelogOffset the checkpointed changelog
> > > offset.
> > > > > > > > >  * @return {@code true} if recovery succeeded, {@code
> > > false}
> > > > > > > > otherwise.
> > > > > > > > >  */
> > > > > > > > > boolean recover(final Long changelogOffset) {
> > > > > > > > >
> > > > > > > > > Note: all links below except for [10] lead to the
> prototype's
> > > > code.
> > > > > > > > > 1.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L468
> > > > > > > > > 2.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L580
> > > > > > > > > 3.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.c

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

2022-09-20 Thread Nick Telford
Hi Guozhang,

KIP looks great, I have one suggestion: in addition to
"restore-records-total", it would also be useful to track the number of
records *remaining*, that is, the records that have not yet been restored.
This is actually the metric I was attempting to implement in the
StateRestoreListener that bumped me up against KAFKA-10575 :-)

With both a "restore-records-total" and a "restore-remaining-total" (or
similar) metric, it's possible to derive useful information like the
estimated time remaining for restoration (by dividing the remaining total
by the restoration rate).
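
As a sketch of that derivation, assuming the two metric values have already been fetched (the metric names are the ones proposed in this thread, not a finalized API):

```java
public class RestoreEta {
    // remainingRecords from "restore-remaining-total" (or similar),
    // restoreRatePerSec from the restoration rate metric.
    static double estimateSecondsRemaining(long remainingRecords,
                                           double restoreRatePerSec) {
        if (restoreRatePerSec <= 0) return Double.POSITIVE_INFINITY;
        return remainingRecords / restoreRatePerSec;
    }

    public static void main(String[] args) {
        // e.g. 2,000,000 records left, restoring at 40,000 records/second
        System.out.println(estimateSecondsRemaining(2_000_000, 40_000.0)); // 50.0
    }
}
```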

Regards,

Nick

On Mon, 19 Sept 2022 at 19:57, Guozhang Wang  wrote:

> Hello Bruno,
>
> Thanks for your comments!
>
> 1. Regarding the metrics group name: originally I put
> "stream-state-metrics" as it's related to state store restorations, but
> after a second thought I think I agree with you that this is quite
> confusing and not right. About the metrics groups, right now I have two
> ideas:
>
> a) Still use the metric group name "stream-thread-metrics", but
> differentiate with the processing threads on the thread id. The pros is
> that we do not introduce a new group, the cons is that users who want to
> separate processing from restoration/updating in the future needs to do
> that on the thread id labels.
> b) Introduce a new group name, for example "stream-state-updater-metrics"
> and still have a thread-id label. We would be introducing a new group which
> still have a thread-id, which may sound a bit weird (maybe if we do that we
> could change the existing "stream-thread-metrics" into
> "stream-processor-metrics").
>
> Right now I'm leaning towards b) and maybe in the future rename
> "thread-metrics" to "processor-metrics", LMK what do you think.
>
> 2. Regarding the metric names: today we may also pause a standby tasks, and
> hence I'm trying to differentiate updating standbys from paused standbys.
> Right now I'm thinking we can change "restoring-standby-tasks" to
> "updating-standby-tasks", and change all other metrics' "restore" (except
> the "restoring-active-tasks") to "state-update", a.k.a
> "state-update-ratio", "state-update-records-total",
> "updating-standby-tasks" etc.
>
> 3. Regarding the function name: yeah I think that's a valid concern, I can
> change it to "onRestoreSuspended", since "Aborted" may lead people to think
> that previous "onBatchRestored" calls are undone as part of the abort.
>
>
> Guozhang
>
>
>
> On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna  wrote:
>
> > Hi Guozhang,
> >
> > Thanks for the KIP! I think this KIP is a really nice addition to better
> > understand what is going on in a Kafka Streams application.
> >
> > 1.
> > The metric names "paused-active-tasks" and "paused-standby-tasks" might
> > be a bit confusing since at least active tasks can be paused also
> > outside of restoration.
> >
> > 2.
> > Why is the type of the metrics "stream-state-metrics"? I would have
> > expected "stream-thread-metrics" as the type.
> >
> > 3.
> > Isn't the value of the metric "restoring-standby-tasks" simply the
> > number of standby tasks since standby tasks are basically always
> > updating (aka restoring)?
> >
> > 4.
> > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> > tailored to the upcoming state updater. They do not make much sense with
> > a stream thread. Would it be better to introduce a new metrics level
> > specifically for the state updater?
> >
> > 5.
> > Personally, I do not like to use the word "restoration" together with
> > standbys since restoration somehow implies that there is an offset for
> > which the active task is considered restored and active processing can
> > start. In other words, restoration is finite. Standby tasks rather
> > update continuously their states. They can be up-to-date or lagging. I
> > see that you could say "restored" instead of "up-to-date" and "not
> > restored" instead of "lagging", but IMO it does not describe well the
> > situation. That is a rather minor point. I just wanted to mention it.
> >
> > 6.
> > The name "onRestorePaused()" might be confusing since in Kafka Streams
> > users can also pause tasks. What about "onRestoreAborted()" or
> > "onRestoreSuspended"?
> >
> > Best,
> > Bruno
> >
> >
> > On 16.09.22 19:33, Guozhang Wang wrote:
> > > Hello everyone,
> > >
> > > I'd like to start a discussion for the following KIP, aiming to improve
> > > Kafka Stream's restoration visibility via new metrics and callback
> > methods:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >
> > >
> > > Thanks!
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>


Re: [DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-09-06 Thread Nick Telford
The more I think about this, the more I think that automatic repartitioning
is not required in the "recursively" method itself. I've removed references
to this from the KIP, which further simplifies everything.

I don't see any need to restrict users from repartitioning, either before,
after or inside the "recursively" method. I can't see a scenario where the
recursion would cause problems with it.

Nick

On Tue, 6 Sept 2022 at 18:08, Nick Telford  wrote:

> Hi Guozhang,
>
> I mentioned this in the "Rejected Alternatives" section. Repartitioning
> gives us several significant advantages over using an explicit topic and
> "to":
>
>- Repartition topics are automatically created and managed by the
>Streams runtime; explicit topics have to be created and managed by the 
> user.
>- Repartitioning topics have no retention criteria and instead purge
>records once consumed, this prevents data loss. Explicit topics need
>retention criteria, which have to be set large enough to avoid data loss,
>often wasting considerable resources.
>- The "recursively" method requires significantly less code than
>recursion via an explicit topic, and is significantly easier to understand.
>
> Ultimately, I don't think repartitioning inside the unary operator adds
> much complexity to the implementation. Certainly no more than other DSL
> operations.
>
> Regards,
> Nick
>
> On Tue, 6 Sept 2022 at 17:28, Guozhang Wang  wrote:
>
>> Hello Nick,
>>
>> Thanks for the re-written KIP! I read through it again, and so far have
>> just one quick question on top of my head regarding repartitioning: it
>> seems to me that when there's an intermediate topic inside the recursion
>> step, then using this new API would basically give us the same behavior as
>> using the existing `to` APIs. Of course, with the new API the user can
>> make
>> it more explicit that it is supposed to be recursive, but efficiency wise
>> it provides no further optimizations. Is my understanding correct? If yes,
>> I'm wondering if it's worthy the complexity to allow repartitioning inside
>> the unary operator, or should we just restrict the recursion inside a
>> single sub-topology.
>>
>>
>> Guozhang
>>
>> On Tue, Sep 6, 2022 at 9:05 AM Nick Telford 
>> wrote:
>>
>> > Hi everyone,
>> >
>> > I've re-written the KIP, with a new design that I think resolves the
>> issues
>> > you highlighted, and also simplifies usage.
>> >
>> >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-857%3A+Streaming+recursion+in+Kafka+Streams
>> >
>> > Note: I'm still working out the "automatic repartitioning" in my head,
>> as I
>> > don't think it's quite right. It may turn out that the additional
>> overload
>> > (with the Produced argument) is not necessary.
>> >
>> > Thanks for all your feedback so far. Let me know what you think!
>> >
>> > Regards,
>> >
>> > Nick
>> >
>> > On Thu, 25 Aug 2022 at 17:46, Nick Telford 
>> wrote:
>> >
>> > > Hi Sophie,
>> > >
>> > > The reason I chose to add a new overload of "to", instead of creating
>> a
>> > > new method, is simply because I felt that "to" was about sending
>> records
>> > > "to" somewhere, and that "somewhere" just happens to currently be
>> > > exclusively topics. By re-using "to", we can send records *to other
>> > > KStreams*, including a KStream from an earlier point in the current
>> > > KStreams' pipeline, which would facilitate recursion. Sending records
>> to
>> > a
>> > > completely different KStream would be essentially a merge.
>> > >
>> > > However, I'm happy to reduce the scope of this method to focus
>> > exclusively
>> > > on recursion: we'd simply need to add a check in to the method that
>> > ensures
>> > > the target is an ancestor node of the current KStream node.
>> > >
>> > > Which brings me to your first query...
>> > >
>> > > My argument is simply that a 0-ary method isn't enough to facilitate
>> > > recursive streaming, because you need to be able to communicate which
>> > point
>> > > in the process graph you want to feed your records back in to.
>> > >
>> > > Consider my example from the KIP, but re-written with a 0-ary
>> > > "recursiv

Re: [DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-09-06 Thread Nick Telford
Hi Guozhang,

I mentioned this in the "Rejected Alternatives" section. Repartitioning
gives us several significant advantages over using an explicit topic and
"to":

   - Repartition topics are automatically created and managed by the
   Streams runtime; explicit topics have to be created and managed by the user.
   - Repartitioning topics have no retention criteria and instead purge
   records once consumed, this prevents data loss. Explicit topics need
   retention criteria, which have to be set large enough to avoid data loss,
   often wasting considerable resources.
   - The "recursively" method requires significantly less code than
   recursion via an explicit topic, and is significantly easier to understand.

Ultimately, I don't think repartitioning inside the unary operator adds
much complexity to the implementation. Certainly no more than other DSL
operations.

Regards,
Nick

On Tue, 6 Sept 2022 at 17:28, Guozhang Wang  wrote:

> Hello Nick,
>
> Thanks for the re-written KIP! I read through it again, and so far have
> just one quick question on top of my head regarding repartitioning: it
> seems to me that when there's an intermediate topic inside the recursion
> step, then using this new API would basically give us the same behavior as
> using the existing `to` APIs. Of course, with the new API the user can make
> it more explicit that it is supposed to be recursive, but efficiency wise
> it provides no further optimizations. Is my understanding correct? If yes,
> I'm wondering if it's worthy the complexity to allow repartitioning inside
> the unary operator, or should we just restrict the recursion inside a
> single sub-topology.
>
>
> Guozhang
>
> On Tue, Sep 6, 2022 at 9:05 AM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > I've re-written the KIP, with a new design that I think resolves the
> issues
> > you highlighted, and also simplifies usage.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-857%3A+Streaming+recursion+in+Kafka+Streams
> >
> > Note: I'm still working out the "automatic repartitioning" in my head,
> as I
> > don't think it's quite right. It may turn out that the additional
> overload
> > (with the Produced argument) is not necessary.
> >
> > Thanks for all your feedback so far. Let me know what you think!
> >
> > Regards,
> >
> > Nick
> >
> > On Thu, 25 Aug 2022 at 17:46, Nick Telford 
> wrote:
> >
> > > Hi Sophie,
> > >
> > > The reason I chose to add a new overload of "to", instead of creating a
> > > new method, is simply because I felt that "to" was about sending
> records
> > > "to" somewhere, and that "somewhere" just happens to currently be
> > > exclusively topics. By re-using "to", we can send records *to other
> > > KStreams*, including a KStream from an earlier point in the current
> > > KStreams' pipeline, which would facilitate recursion. Sending records
> to
> > a
> > > completely different KStream would be essentially a merge.
> > >
> > > However, I'm happy to reduce the scope of this method to focus
> > exclusively
> > > on recursion: we'd simply need to add a check in to the method that
> > ensures
> > > the target is an ancestor node of the current KStream node.
> > >
> > > Which brings me to your first query...
> > >
> > > My argument is simply that a 0-ary method isn't enough to facilitate
> > > recursive streaming, because you need to be able to communicate which
> > point
> > > in the process graph you want to feed your records back in to.
> > >
> > > Consider my example from the KIP, but re-written with a 0-ary
> > > "recursively" method:
> > >
> > > updates
> > > .join(parents, (count, parent) -> { KeyValue(parent, count) })
> > > .recursively()
> > >
> > > Where does the join output get fed to?
> > >
> > >1. The "updates" (source) node?
> > >2. The "join" node itself?
> > >
> > > It would probably be most intuitive if it simply caused the last step
> to
> > > be recursive, but that won't always be what you want. Consider if we
> add
> > > some more steps in to the above:
> > >
> > > updates
> > > .map((parent, count) -> KeyValue(parent, count + 1)) // doesn't
> make
> > > sense in this algorithm, but let's pretend it does
> > > .join(parents, (count, parent) -> { KeyValue(parent, count) })

Re: [DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-09-06 Thread Nick Telford
Hi everyone,

I've re-written the KIP, with a new design that I think resolves the issues
you highlighted, and also simplifies usage.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-857%3A+Streaming+recursion+in+Kafka+Streams

Note: I'm still working out the "automatic repartitioning" in my head, as I
don't think it's quite right. It may turn out that the additional overload
(with the Produced argument) is not necessary.
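
For readers new to the thread, the core idea of the KIP — feeding records from the end of a pipeline back into an earlier point until a terminal condition drops them — can be sketched as a plain-Java analogy (this is not the proposed Streams API, just an illustration of the feedback loop):

```java
import java.util.ArrayDeque;

// Analogy of the feedback loop: records produced at the end of the
// pipeline are re-queued at an earlier point ("recursively(foo)"), and a
// terminal condition (value <= 0) stops them from being fed back.
public class RecursionAnalogy {
    public static void main(String[] args) {
        ArrayDeque<Integer> foo = new ArrayDeque<>();
        foo.add(3); // initial record entering the "foo" stream

        int emitted = 0;
        while (!foo.isEmpty()) {
            int value = foo.poll();
            if (value <= 0) continue; // filter: terminal condition drops it
            foo.add(value - 1);       // mapValues + feed back into "foo"
            emitted++;
        }
        System.out.println(emitted); // 3 (values 3, 2, 1 pass the filter)
    }
}
```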

Thanks for all your feedback so far. Let me know what you think!

Regards,

Nick

On Thu, 25 Aug 2022 at 17:46, Nick Telford  wrote:

> Hi Sophie,
>
> The reason I chose to add a new overload of "to", instead of creating a
> new method, is simply because I felt that "to" was about sending records
> "to" somewhere, and that "somewhere" just happens to currently be
> exclusively topics. By re-using "to", we can send records *to other
> KStreams*, including a KStream from an earlier point in the current
> KStreams' pipeline, which would facilitate recursion. Sending records to a
> completely different KStream would be essentially a merge.
>
> However, I'm happy to reduce the scope of this method to focus exclusively
> on recursion: we'd simply need to add a check in to the method that ensures
> the target is an ancestor node of the current KStream node.
>
> Which brings me to your first query...
>
> My argument is simply that a 0-ary method isn't enough to facilitate
> recursive streaming, because you need to be able to communicate which point
> in the process graph you want to feed your records back in to.
>
> Consider my example from the KIP, but re-written with a 0-ary
> "recursively" method:
>
> updates
> .join(parents, (count, parent) -> { KeyValue(parent, count) })
> .recursively()
>
> Where does the join output get fed to?
>
>1. The "updates" (source) node?
>2. The "join" node itself?
>
> It would probably be most intuitive if it simply caused the last step to
> be recursive, but that won't always be what you want. Consider if we add
> some more steps in to the above:
>
> updates
> .map((parent, count) -> KeyValue(parent, count + 1)) // doesn't make
> sense in this algorithm, but let's pretend it does
> .join(parents, (count, parent) -> { KeyValue(parent, count) })
> .recursively()
>
> If "recursively" just feeds records back into the "join", it misses out on
> potentially important steps in our recursive algorithm. It also gets even
> worse if the step you're making recursive doesn't contain your terminal
> condition:
>
> foo
> .filter((key, value) -> value <= 0) // <-- terminal condition
> .mapValues((value) -> value - 1)
> .recursively()
>
> If "recursively" feeds records back to the "mapValues" stage in our
> pipeline, and not in to "filter" or "foo", then the terminal condition in
> "filter" won't be evaluated for any values with a starting value greater
> than 0, *causing an infinite loop*.
>
> There's an argument to be had to always feed the values back to the first
> ancestor "source node", in the process-graph, but that might not be
> particularly intuitive, and is likely going to limit some of the recursive
> algorithms that some may want to implement. For example, in the previous
> example, there's no guarantee that "foo" is a source node; it could be the
> result of a "mapValues", for example.
>
> Ultimately, the solution here is to make this method take a parameter,
> explicitly specifying the KStream that records are fed back in to, making
> the above two examples:
>
> updates
> .map((parent, count) -> KeyValue(parent, count + 1))
> .join(parents, (count, parent) -> { KeyValue(parent, count) })
> .recursively(updates)
>
> and:
>
> foo
> .filter((key, value) -> value <= 0)
> .mapValues((value) -> value - 1)
> .recursively(foo)
>
> We could *also* support a 0-ary version of the method that defaults to
> recursively executing the previous node, but I'm worried that users may not
> fully understand the consequences of this, inadvertently creating infinite
> loops that are difficult to debug.
>
> Finally, I'm not convinced that "recursively" is the best name for the
> method. Perhaps "recursivelyVia" or "recursivelyTo"? Ideas welcome!
>
> If we want to prevent this method being "abused" to merge different
> streams together, it should be trivial to ensure that the provided argument
> is an ancestor of the current node, by recursively traversing up the
> process graph.

Re: [DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-08-25 Thread Nick Telford
ginning of the stream you want to feed it back into.
>
>
> As I see it, the internal implementation should be, and is, essentially
> independent from the
> design of the API itself -- in other words, why does calling this
> operator/method `recursion`
> not work, or have anything at all to do with what Streams "knows" or how it
> does the actual
> recursion? And why would calling it recursion be any different from calling
> it/reusing the existing
> `to` operator method?
>
> On that note, the proposal to reuse the `to` operator for this purpose is
> the other thing I've found
> to be very confusing. Can you expand on why you think `to` would be
> appropriate here vs a
> dedicated recursion operator? I actually think it would be fairly
> misleading to have the `to` operator
> do something pretty wildly different depending on what you passed in, I
> mean stream recursion seems
> quite far removed from its current semantics -- I just don't really see the
> connection.
>
> so tl;dr why not give this operation its own dedicated operator/method
> name, vs reusing an existing operator that does something else?
>
> Overall though this sounds great, thanks for the KIP!
>
> Cheers,
> Sophie
>
> On Thu, Aug 18, 2022 at 4:48 PM Guozhang Wang  wrote:
>
> > Hello Nick,
> >
> > Thanks for the replies! They are very thoughtful. I think I agree with
> you
> > that requiring the output stream to a source stream is not sufficient
> for a
> > valid recursion, and even without the proposed API users today can still
> > create a broken recursive topology.
> >
> > Just want to clarify another question:
> >
> > In our current examples, the linked output stream and input stream are on
> > the same sub-topology, in which case this API allows us to avoid creating
> > unnecessary intermediate topics; when the linked output/input streams are
> > not on the same sub-topology, then using this API would not buy us
> > anything, right? E.g.
> >
> > ```
> > stream1 = builder.stream("topic1");
> > stream2 = stream1.repartition("topic2");
> > stream2.to(stream1)
> > ```
> >
> > Then this API would not buy us anything compared with
> >
> > ```
> > stream1 = builder.stream("topic1");
> > stream2 = stream1.repartition("topic2");
> > stream2.to("topic1")
> > ```
> >
> > Is that right?
> >
> > Guozhang
> >
> >
> >
> >
> > On Wed, Aug 10, 2022 at 11:10 AM Nick Telford 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > On Guozhang's point 1): I actually considered a "recursion" API,
> > something
> > > like you suggested, however it won't work, because to do the recursion
> > you
> > > need to know both the end of the KStream that you want to recurse, AND
> > the
> > > beginning of the stream you want to feed it back into. Your proposed
> > > "recursive(stream1.join(table))" (which is equivalent to
> > > "stream1.join(table).recursively()" etc.) won't work, because the
> > > "recursive" function only receives the tail of the stream to feed back,
> > but
> > > not the point that it needs to feed back in to. This is the reason for
> > > using the "to" API overload, as it allows you to instruct Kafka Streams
> > to
> > > take the end of a KStream and feed it back into *a specific point* in
> the
> > > process graph. It just so happens that the API has no restriction as to
> > > whether you feed the stream back into one of its own ancestor nodes,
> or a
> > > completely separate processor node, which is why I kept the "recursive"
> > > terminology out of the method name.
> > >
> > > I don't think it ultimately matters whether you feed it into a sourced
> > > stream or not. In your example, the expression "stream2.mapValues(...)"
> > > would loop recursively. Obviously since "mapValues" can't omit records
> > from
> > > its output, this would produce an infinite loop, but other, similar
> > > programs would be perfectly valid:
> > >
> > > stream2 = stream1.mapValues(...)
> > > stream3 = stream2.flatMapValues(...)
> > > stream3.to(stream2)
> > >
> > > Provided that the function passed to "flatMapValues" had a terminal
> > > condition.
> > >
> > > While you may worry about users creating infinite recursion loops, it's
> > > worth noting that the same can b

Re: [DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-08-10 Thread Nick Telford
Hi everyone,

On Guozhang's point 1): I actually considered a "recursion" API, something
like you suggested, however it won't work, because to do the recursion you
need to know both the end of the KStream that you want to recurse, AND the
beginning of the stream you want to feed it back into. Your proposed
"recursive(stream1.join(table))" (which is equivalent to
"stream1.join(table).recursively()" etc.) won't work, because the
"recursive" function only receives the tail of the stream to feed back, but
not the point that it needs to feed back in to. This is the reason for
using the "to" API overload, as it allows you to instruct Kafka Streams to
take the end of a KStream and feed it back into *a specific point* in the
process graph. It just so happens that the API has no restriction as to
whether you feed the stream back into one of its own ancestor nodes, or a
completely separate processor node, which is why I kept the "recursive"
terminology out of the method name.

I don't think it ultimately matters whether you feed it into a sourced
stream or not. In your example, the expression "stream2.mapValues(...)"
would loop recursively. Obviously since "mapValues" can't omit records from
its output, this would produce an infinite loop, but other, similar
programs would be perfectly valid:

stream2 = stream1.mapValues(...)
stream3 = stream2.flatMapValues(...)
stream3.to(stream2)

Provided that the function passed to "flatMapValues" had a terminal
condition.
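To make that concrete, here's a plain-Java simulation of the feedback edge (a work queue standing in for the proposed stream3.to(stream2) loop; no Kafka involved, just an illustration), showing why a terminal condition in the flatMap step guarantees the recursion eventually drains:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class RecursionSketch {
    public static void main(String[] args) {
        // The queue models the recursive edge: every record emitted by the
        // "flatMapValues" step is fed back into its own input.
        Deque<Integer> feedback = new ArrayDeque<>(List.of(40, 7));
        long processed = 0;
        while (!feedback.isEmpty()) {
            int value = feedback.poll();
            processed++;
            // Terminal condition: stop emitting once the value reaches zero.
            // A plain mapValues (which must emit 1 record per input) could
            // never terminate here, but flatMapValues can emit zero records.
            if (value > 0) {
                feedback.add(value - 1);
            }
        }
        System.out.println(processed); // prints 49: (40 + 1) + (7 + 1) records
    }
}
```

The same reasoning applies to the real topology: termination depends entirely on whether the recursive step can eventually emit nothing, not on whether the feedback target happens to be a source node.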

While you may worry about users creating infinite recursion loops, it's
worth noting that the same can be said of (most) programming languages,
including Java, but we don't generally consider it a big problem. If you
have any ideas on how we can protect against infinite recursion loops, that
could definitely help. I don't think requiring a "sourced node" at the
point of recursion would help, as it's ultimately the presence of a
terminal condition in the recursive process graph that determines whether
or not it loops infinitely.

I'm happy to rewrite the KIP to orient it more around the new methods
itself, and I'm happy to change the methods being added if you can come up
with a better solution :-)

Re: 2) thanks for spotting my error, I've already corrected it in the KIP.

Thank you both for your feedback so far. Keep it coming!

Regards,

Nick

On Wed, 10 Aug 2022 at 00:50, Guozhang Wang  wrote:

> Hello Nick,
>
> Thanks for bringing this KIP! Just a few thoughts:
>
> 1) I agree with Sagar that, we'd probably think about two routes to
> rephrase / restructure the proposal:
>
> * we can propose a couple of new APIs, and just list "more
> convenient recursion" as one of its benefits. Then we'd need to be careful
> and consider all possible use scenarios, e.g. what if "other" is not a
> sourced stream, e.g.:
>
> stream2 = stream1.mapValues(...)
> stream3 = stream2.mapValues(...)
> stream3.to(stream2)
>
> Would that be allowed? If yes, what are the implementation semantics of this
> code.
>
> * OR, we propose sth just for more convenient recursion, but then we would
> need to consider having a more restrictive expressiveness in the new DSL,
> e.g. we'd need to enforce that "other" is a source stream, and that "other"
> is one of the ancestors of "this", programmatically. Or we can think about a
> totally different set of new DSL e.g. (I'm just making it up on top of my
> head for illustration, not really advocating it :P):
>
> stream1 = stream2.mapValues(...);
> stream1 = recursive(stream1.join(table));
>
> 2) Just a nit comment, it seems in your example, the topic name should be:
>
> ```
> nodes
> .map((node, parent) -> { KeyValue(parent, 1L) })
> .to("node-updates")
>
> updates
> .join(parents, (count, parent) -> { KeyValue(parent, count) }) // the
> root node has no parent, so recursion halts at the root
> .to("node-updates")
> ```
>
> Right?
>
>
> On Sun, Aug 7, 2022 at 7:52 PM Sagar  wrote:
>
> > Hey Nick,
> >
> > Since we are adding a new method to the public interface, we should
> > probably decide the necessity of doing so, more so when you say that it's
> > an alternative to something already existing. My suggestion would be to
> > still modify the KIP around the new API, highlight how it's an
> alternative
> > to something already existing and why we should add the new API. You have
> > already explained streaming recursion, so that's one added benefit we get
> > as part of the new API. So, try to expand a little bit around those
> points.
> > Graph traversal should be fine as an example. You could make it slightly
> > more clear.
> >
>

Re: [DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-08-05 Thread Nick Telford
Hi Sagar,

Thanks for reading through my proposal.

While the 2 new methods were originally intended for the recursive
use-case, they could also be used as an alternative means of wiring two
different KStreams together. The main reason I didn't document this in the
KIP is that using the API for this doesn't bring anything new to the table:
it's just an alternative form of something that already exists. If you
believe it would be helpful, I can document this in more detail. I can
re-orient the KIP around the new methods themselves, but I felt there was
more value in the KIP emphasizing the new functionality and algorithms that
they enable.

What additional context would you like to see in the KIP? Some more
examples of recursive algorithms that would benefit? A more concrete
example than generic graph traversal? Something else?

Regards,

Nick Telford

On Fri, 5 Aug 2022 at 11:02, Sagar  wrote:

> Hey Nick,
>
> Thanks for the KIP. This seems like a great addition. However, just
> wondering if the 2 new methods that you plan to add are meant only for
> streaming recursion? I would imagine they could be repurposed for other use
> cases as well? If yes, then probably the KIP should revolve around the
> addition of adding these methods which would btw also support streaming
> recursion. IMHO adding 2 new methods just for streaming recursion seems
> slightly odd to me.
>
> Also, pardon my ignorance here, but I don't have much insight into
> streaming recursion. Could you add some more context to it?
>
> Thanks!
> Sagar.
>
>
>
>
> On Tue, Jul 26, 2022 at 8:46 PM Nick Telford 
> wrote:
>
> > Hi everyone,
> >
> > URL:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-857%3A+Streaming+recursion+in+Kafka+Streams
> >
> > Here's a KIP for extending the Streams DSL API to support "streaming
> > recursion". See the Motivation section for details on what I mean by
> this,
> > along with an example of recursively counting nodes in a graph.
> >
> > I haven't included changes for the PAPI, mostly because I don't use it,
> so
> > I'm not as familiar with the idioms there. If you can think of a good
> > analogue for a new PAPI method, I'm happy to include it in the KIP.
> >
> > Regards,
> >
> > Nick Telford
> >
>


Re: [DISCUSS] KIP-844: Transactional State Stores

2022-07-26 Thread Nick Telford
Hi Alex,

Excellent proposal, I'm very keen to see this land!

Would it be useful to permit configuring the type of store used for
uncommitted offsets on a store-by-store basis? This way, users could choose
whether to use, e.g. an in-memory store or RocksDB, potentially reducing
the overheads associated with RocksDB for smaller stores, but without the
memory pressure issues?

I suspect that in most cases, the number of uncommitted records will be
very small, because the default commit interval is 100ms.

Regards,

Nick

On Tue, 26 Jul 2022 at 01:36, Guozhang Wang  wrote:

> Hello Alex,
>
> Thanks for the updated KIP, I looked over it and browsed the WIP and just
> have a couple meta thoughts:
>
> 1) About the param passed into the `recover()` function: it seems to me
> that the semantics of "recover(offset)" is: recover this state to a
> transaction boundary which is at least the passed-in offset. And the only
> possibility that the returned offset is different than the passed-in offset
> is that if the previous failure happens after we've done all the commit
> procedures except writing the new checkpoint, in which case the returned
> offset would be larger than the passed-in offset. Otherwise it should
> always be equal to the passed-in offset, is that right?
>
> 2) It seems the only use for the "transactional()" function is to determine
> if we can update the checkpoint file while in EOS. But the purpose of the
> checkpoint file's offsets is just to tell "the local state's current
> snapshot's progress is at least the indicated offsets" anyways, and with
> this KIP maybe we would just do:
>
> a) when in ALOS, upon failover: we set the starting offset as
> checkpointed-offset, then restore() from changelog till the end-offset.
> This way we may restore some records twice.
> b) when in EOS, upon failover: we first call recover(checkpointed-offset),
> then set the starting offset as the returned offset (which may be larger
> than checkpointed-offset), then restore until the end-offset.
>
> So why not also:
> c) we let the `commit()` function to also return an offset, which indicates
> "checkpointable offsets".
> d) for existing non-transactional stores, we just have a default
> implementation of "commit()" which is simply a flush, and returns a
> sentinel value like -1. Then later, if we get a checkpointable offset of -1, we
> do not write the checkpoint. Upon clean shutting down we can just
> checkpoint regardless of the returned value from "commit".
> e) for existing non-transactional stores, we just have a default
> implementation of "recover()" which is to wipe out the local store and
> return offset 0 if the passed in offset is -1, otherwise if not -1 then it
> indicates a clean shutdown in the last run, and this function is just a
> no-op.
>
> In that case, we would not need the "transactional()" function anymore,
> since for non-transactional stores their behaviors are still wrapped in the
> `commit / recover` function pairs.
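For what it's worth, here's a self-contained sketch of the default commit()/recover() behaviour for non-transactional stores described in d) and e) above. The names, the -1 sentinel, and the Store interface shape are just this thread's proposal rendered as illustration, not any released Kafka API:

```java
import java.util.HashMap;
import java.util.Map;

public class CommitRecoverSketch {
    interface Store {
        /** Flush, and return the changelog offset safe to checkpoint, or -1 if none. */
        long commit(long changelogOffset);
        /** Bring local state to a transaction boundary at least the checkpointed offset. */
        long recover(long checkpointedOffset);
    }

    // Non-transactional store: commit is a plain flush; recover wipes state
    // when there is no checkpoint, since the local state may be dirty.
    static class NonTxStore implements Store {
        final Map<String, String> data = new HashMap<>();

        public long commit(long changelogOffset) {
            return -1; // sentinel: caller must not write a checkpoint for this store
        }

        public long recover(long checkpointedOffset) {
            if (checkpointedOffset == -1) { // no checkpoint => possibly dirty state
                data.clear();
                return 0; // restore the changelog from the beginning
            }
            return checkpointedOffset; // clean shutdown last run: no-op
        }
    }

    public static void main(String[] args) {
        NonTxStore store = new NonTxStore();
        store.data.put("k", "v");
        System.out.println(store.commit(42L));    // prints -1: skip the checkpoint file
        System.out.println(store.recover(-1L));   // prints 0: state wiped, restore from 0
        System.out.println(store.data.isEmpty()); // prints true
    }
}
```

Under this scheme the EOS/ALOS distinction falls out of the return values alone, which is why the separate transactional() probe becomes unnecessary.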
>
> I have not completed the thorough pass on your WIP PR, so maybe I could
> come up with some more feedback later, but just let me know if my
> understanding above is correct or not?
>
>
> Guozhang
>
>
>
>
> On Thu, Jul 14, 2022 at 7:01 AM Alexander Sorokoumov
>  wrote:
>
> > Hi,
> >
> > I updated the KIP with the following changes:
> > * Replaced in-memory batches with the secondary-store approach as the
> > default implementation to address the feedback about memory pressure as
> > suggested by Sagar and Bruno.
> > * Introduced StateStore#commit and StateStore#recover methods as an
> > extension of the rollback idea. @Guozhang, please see the comment below
> on
> > why I took a slightly different approach than you suggested.
> > * Removed mentions of changes to IQv1 and IQv2. Transactional state
> stores
> > enable reading committed in IQ, but it is really an independent feature
> > that deserves its own KIP. Conflating them unnecessarily increases the
> > scope for discussion, implementation, and testing in a single unit of
> work.
> >
> > I also published a prototype -
> https://github.com/apache/kafka/pull/12393
> > that implements changes described in the proposal.
> >
> > Regarding explicit rollback, I think it is a powerful idea that allows
> > other StateStore implementations to take a different path to the
> > transactional behavior rather than keep 2 state stores. Instead of
> > introducing a new commit token, I suggest using a changelog offset that
> > already 1:1 corresponds to the materialized state. This works nicely
> > because Kafka Stream first commits an AK transaction and only then
> > checkpoints the state store, so we can use the changelog offset to commit
> > the state store transaction.
> >
> > I called the method StateStore#recover rather than StateStore#rollback
> > because a state store might either roll back or forward depending on the
> > specific point of the crash failure. Consider the write algorithm in Kafka
> > Streams:
> > 1. write stuff to the 

[DISCUSS] KIP-857: Streaming recursion in Kafka Streams

2022-07-26 Thread Nick Telford
Hi everyone,

URL:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-857%3A+Streaming+recursion+in+Kafka+Streams

Here's a KIP for extending the Streams DSL API to support "streaming
recursion". See the Motivation section for details on what I mean by this,
along with an example of recursively counting nodes in a graph.

I haven't included changes for the PAPI, mostly because I don't use it, so
I'm not as familiar with the idioms there. If you can think of a good
analogue for a new PAPI method, I'm happy to include it in the KIP.

Regards,

Nick Telford


Re: [DISCUSS] KIP-819: Merge multiple KStreams in one operation

2022-03-29 Thread Nick Telford
Yeah, the Named parameter makes it a little trickier. My suggestion would
be to add an additional overload that looks like:

KStream<K, V> merge(KStream<K, V> first, Named named, KStream<K, V>... rest);

It's not ideal having the Named parameter split the other parameters; we
could alternatively move the Named parameter to be first, but then that
wouldn't align with the rest of the API.
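In case it helps, here's a self-contained sketch (with stand-in Stream and Named classes, since this is only the proposed shape, not the real Kafka Streams API) demonstrating that the Named overload and the plain varargs overload resolve unambiguously, even with Named sandwiched between the first stream and the varargs tail:

```java
public class NamedOverloadSketch {
    // Minimal stand-ins for KStream and Named, just enough to exercise
    // Java's overload resolution.
    static class Named {
        final String name;
        Named(String name) { this.name = name; }
    }

    static class Stream {
        final String id;
        Stream(String id) { this.id = id; }

        // Plain varargs overload delegates to the Named variant.
        Stream merge(Stream first, Stream... rest) {
            return merge(first, new Named("anon"), rest);
        }

        // Proposed shape: Named sits between the first stream and the varargs tail.
        Stream merge(Stream first, Named named, Stream... rest) {
            StringBuilder sb = new StringBuilder(named.name + ":" + id + "+" + first.id);
            for (Stream s : rest) sb.append("+").append(s.id);
            return new Stream(sb.toString());
        }
    }

    public static void main(String[] args) {
        Stream a = new Stream("a"), b = new Stream("b"), c = new Stream("c");
        // Both calls resolve unambiguously because Named is not a Stream.
        System.out.println(a.merge(b, c).id);                 // prints anon:a+b+c
        System.out.println(a.merge(b, new Named("m"), c).id); // prints m:a+b+c
    }
}
```

So the only cost of this shape is aesthetic: the Named parameter splitting the stream arguments, as noted above.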

Nick

On Tue, 29 Mar 2022 at 05:20, Chris Egerton  wrote:

> Hi all,
>
> Java permits the overload. Simple test class to demonstrate:
>
> ```
> public class Test {
> private final String field;
>
> public Test(String field) {
> this.field = field;
> }
>
> public Test merge(Test that) {
> return new Test("Single-arg merge: " + this.field + ", " +
> that.field);
> }
>
> public Test merge(Test that, Test... those) {
> String newField = "Varargs merge: " + this.field + ", " +
> that.field;
> for (Test test : those) newField += ", " + test.field;
> return new Test(newField);
> }
>
> public static void main(String[] args) {
> Test t1 = new Test("t1"), t2 = new Test("t2"), t3 = new Test("t3");
> Test merge1 = t1.merge(t2), merge2 = t1.merge(t2, t3);
> System.out.println(merge1.field); // Single-arg merge: t1, t2
> System.out.println(merge2.field); // Varargs merge: t1, t2, t3
> }
> }
> ```
>
> There's a great StackOverflow writeup on the subject [1], which explains
> that during method resolution, priority is given to methods whose
> signatures match the argument list without taking boxing/unboxing or
> varargs into consideration:
>
> > The first phase performs overload resolution without permitting boxing or
> unboxing conversion, or the use of variable arity method invocation. If no
> applicable method is found during this phase then processing continues to
> the second phase.
> > The second phase performs overload resolution while allowing boxing and
> unboxing, but still precludes the use of variable arity method invocation.
> If no applicable method is found during this phase then processing
> continues to the third phase.
> > The third phase allows overloading to be combined with variable arity
> methods, boxing, and unboxing.
>
> I'm curious if it's worth keeping a variant that accepts a Named parameter?
> Might be tricky to accommodate since variadic arguments have to be last.
>
> [1] - https://stackoverflow.com/a/48850722
>
> Cheers,
>
> Chris
>
> On Mon, Mar 28, 2022 at 11:46 PM Matthias J. Sax  wrote:
>
> > I think Java does not allow having both overloads, because it would
> > result in ambiguity?
> >
> > If you call `s1.merge(s2)` it's unclear which method you want to call.
> >
> >
> > -Matthias
> >
> >
> > On 3/28/22 7:20 AM, Nick Telford wrote:
> > > Hi Matthias,
> > >
> > > How about instead of changing the signature of the existing method to
> > > variadic, we simply add a new overload which takes variadic args:
> > >
> > > KStream<K, V> merge(KStream<K, V> first, KStream<K, V>... rest);
> > >
> > > That way, we maintain both source *and* binary compatibility for the
> > > existing method, and we can enforce that there is always at least one
> > > stream (argument) being merged.
> > >
> > > I'm fine dropping the static methods. As you said, this is mostly all
> > just
> > > syntax sugar anyway, but I do think allowing multiple streams to be
> > merged
> > > together is a benefit. My motivation was that we generate diagrams for
> > our
> > > Topologies, and having several binary merges becomes quite messy when a
> > > single n-ary merge is what you're really modelling.
> > >
> > > Regards,
> > >
> > > Nick
> > >
> > > On Thu, 24 Mar 2022 at 21:24, Matthias J. Sax 
> wrote:
> > >
> > >> Thanks for proposing this KIP.
> > >>
> > >> I feel a little bit torn by the idea. In general, we try to keep the
> > >> surface area small, and only add APIs that delivery (significant)
> value.
> > >>
> > >> It seems the current proposal is more or less about syntactic sugar,
> > >> which can still be valuable, but I am not really sure about it.
> > >>
> > >> I am also wondering, if we could use a variadic argument instead of a
> > >> `Collection`:
> > >>
> > >>   KStream<K, V> merge(KStream<K, V>... streams);
> > >>
> > >> This wa

Re: [DISCUSS] KIP-819: Merge multiple KStreams in one operation

2022-03-28 Thread Nick Telford
Hi Matthias,

How about instead of changing the signature of the existing method to
variadic, we simply add a new overload which takes variadic args:

KStream<K, V> merge(KStream<K, V> first, KStream<K, V>... rest);

That way, we maintain both source *and* binary compatibility for the
existing method, and we can enforce that there is always at least one
stream (argument) being merged.

I'm fine dropping the static methods. As you said, this is mostly all just
syntax sugar anyway, but I do think allowing multiple streams to be merged
together is a benefit. My motivation was that we generate diagrams for our
Topologies, and having several binary merges becomes quite messy when a
single n-ary merge is what you're really modelling.

Regards,

Nick

On Thu, 24 Mar 2022 at 21:24, Matthias J. Sax  wrote:

> Thanks for proposing this KIP.
>
> I feel a little bit torn by the idea. In general, we try to keep the
> surface area small, and only add APIs that deliver (significant) value.
>
> It seems the current proposal is more or less about syntactic sugar,
> which can still be valuable, but I am not really sure about it.
>
> I am also wondering, if we could use a variadic argument instead of a
> `Collection`:
>
>  KStream<K, V> merge(KStream<K, V>... streams);
>
> This way, we could just replace the existing method in a backward
> compatible way (well, source code compatible only) and thus not increase
> the surface area of the API while still achieving your goal?
>
> A `merge()` with zero arguments would just be a no-op (same as for using
> `Collection` I assume?).
>
>
> For adding the static methods: It seems not to be a common pattern to
> me? It might be better not to add them and leave it to users to write a
> small helper method themselves if they have such a pattern?
>
>
> -Matthias
>
>
>
> On 1/31/22 7:35 AM, Nick Telford wrote:
> > Hi everyone,
> >
> > I'd like to discuss KIP 819:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-819%3A+Merge+multiple+KStreams+in+one+operation
> >
> > This is a simple KIP that adds/modifies the KStream#merge API to enable
> > many streams to be merged in a single graph node.
> >
> > Regards,
> >
> > Nick Telford
> >
>

