Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Addendum:

I think we would also face the same problem with the approach John outlined
earlier (using the record cache as a transaction buffer and flushing it
straight to SST files). This is because the record cache (the ThreadCache
class) is not thread-safe, so every commit would invalidate open IQ
Iterators in the same way that RocksDB WriteBatches do.
--
Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Hi Bruno,

I've updated the KIP based on our conversation. The only things I've not
yet done are:

1. Using transactions under ALOS and EOS.
2. Making IsolationLevel a query-time constraint, rather than linking it to
the processing.guarantee.

There's a wrinkle that makes this a challenge: Interactive Queries that
open an Iterator, when using transactions and READ_UNCOMMITTED.
The problem is that under READ_UNCOMMITTED, queries need to be able to read
records from the currently uncommitted transaction buffer (WriteBatch).
This includes Iterators, which should iterate both the transaction
buffer and the underlying database (using
WriteBatchWithIndex#newIteratorWithBase()).

The issue is that when the StreamThread commits, it writes the current
WriteBatch to RocksDB *and then clears the WriteBatch*. Clearing the
WriteBatch while an Interactive Query holds an open Iterator on it will
invalidate the Iterator. Worse, it turns out that Iterators over a
WriteBatch become invalidated not just when the WriteBatch is cleared, but
also when the Iterators' current key receives a new write.
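
To illustrate the failure mode without RocksDB, here is a minimal sketch in which a java.util.TreeMap stands in for the transaction buffer. It is only an analogy: the class and method names are hypothetical, everything runs on one thread, and TreeMap's fail-fast iterators merely mimic what happens when a WriteBatch is cleared under an open Iterator.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a store with a transaction buffer; not Kafka Streams code.
final class BufferInvalidationSketch {
    private final TreeMap<String, String> committed = new TreeMap<>(); // plays the role of RocksDB
    private final TreeMap<String, String> buffer = new TreeMap<>();    // plays the role of the WriteBatch

    void put(String key, String value) {
        buffer.put(key, value);
    }

    // A READ_UNCOMMITTED query iterating the uncommitted buffer.
    Iterator<Map.Entry<String, String>> uncommittedIterator() {
        return buffer.entrySet().iterator();
    }

    // What the StreamThread does on commit: write the buffer, then clear it.
    void commit() {
        committed.putAll(buffer);
        buffer.clear();
    }

    // Returns true if an iterator opened before a commit fails after it.
    static boolean iteratorBrokenByCommit() {
        BufferInvalidationSketch store = new BufferInvalidationSketch();
        store.put("a", "1");
        store.put("b", "2");
        Iterator<Map.Entry<String, String>> it = store.uncommittedIterator();
        it.next();       // the query has read "a"
        store.commit();  // the commit happens while the iterator is still open
        try {
            it.next();   // the query tries to continue
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("iterator broken by commit: " + iteratorBrokenByCommit());
    }
}
```

In the real store the query and the commit run on different threads, so the failure would surface at an unpredictable point in the iteration rather than on a fixed call.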

Now that I'm writing this, I remember that this is the major reason that I
switched the original design from having a query-time IsolationLevel to
having the IsolationLevel linked to the transactionality of the stores
themselves.

It *might* be possible to resolve this, by having a "chain" of
WriteBatches, with the StreamThread switching to a new WriteBatch whenever
a new Interactive Query attempts to read from the database, but that could
cause some performance problems/memory pressure when subjected to a high
Interactive Query load. It would also reduce the efficiency of WriteBatches
on-commit, as we'd have to write N WriteBatches, where N is the number of
Interactive Queries since the last commit.
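
A rough sketch of the "chain" idea, again with TreeMaps standing in for WriteBatches and with hypothetical names throughout. It only shows the bookkeeping: each new query takes over the current buffer, the writer moves on to a fresh one, and commit has to write every buffer in the chain. Reads that merge older buffers with the database are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a "chain" of write buffers; not an actual Kafka Streams design.
final class WriteBatchChainSketch {
    private final TreeMap<String, String> committed = new TreeMap<>();
    private final Deque<TreeMap<String, String>> chain = new ArrayDeque<>();

    WriteBatchChainSketch() {
        chain.addLast(new TreeMap<>());
    }

    // Writes always go to the newest buffer in the chain.
    void put(String key, String value) {
        chain.peekLast().put(key, value);
    }

    // A new query takes the current buffer and the writer moves on to a fresh one,
    // so later writes never touch the buffer the query is iterating.
    Map<String, String> openQuery() {
        TreeMap<String, String> frozen = chain.peekLast();
        chain.addLast(new TreeMap<>());
        return frozen;
    }

    // Commit writes every buffer, oldest first, and returns how many were written.
    // Buffers are dropped from the chain rather than cleared, so open queries keep working.
    int commit() {
        int written = 0;
        for (TreeMap<String, String> batch : chain) {
            committed.putAll(batch);
            written++;
        }
        chain.clear();
        chain.addLast(new TreeMap<>());
        return written;
    }

    String committedValue(String key) {
        return committed.get(key);
    }
}
```

With two queries opened between commits, this sketch writes three buffers on commit (one per query plus the current one), which is the per-query overhead described above.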

I realise this is getting into the weeds of the implementation, and you'd
rather we focus on the API for now, but I think it's important to consider
how to implement the desired API, in case we come up with an API that
cannot be implemented efficiently, or even at all!

Thoughts?
--
Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-13 Thread Bruno Cadonna


Hi Nick,

6.
Of course, you are right! My bad!
Wiping out the state in the downgrading case is fine.


3a.
Focus on the public facing changes for the KIP. We will manage to get 
the internals right. Regarding state stores that do not support 
READ_COMMITTED, they should throw an error stating that they do not 
support READ_COMMITTED. No need to adapt all state stores immediately.
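
As a minimal sketch of that behaviour, assuming a store that applies writes immediately: the types below are hypothetical stand-ins (including the local IsolationLevel enum), not the API proposed in the KIP.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical types for illustration only; not the API proposed in the KIP.
final class UnsupportedIsolationSketch {
    enum IsolationLevel { READ_UNCOMMITTED, READ_COMMITTED }

    // A store that applies writes immediately and therefore cannot hide uncommitted data.
    static final class InMemoryStore {
        private final Map<String, String> data = new HashMap<>();

        void put(String key, String value) {
            data.put(key, value);
        }

        String get(String key, IsolationLevel level) {
            if (level == IsolationLevel.READ_COMMITTED) {
                throw new UnsupportedOperationException(
                    "This store does not support READ_COMMITTED queries");
            }
            return data.get(key);
        }
    }
}
```

Which exception type Streams would actually use is an open question here; UnsupportedOperationException is only a placeholder.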


3b.
I am in favor of using transactions also for ALOS.


Best,
Bruno

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Bruno,

Thinking about 3a. in addition to adding the IsolationLevel to
QueryStoreParameters and Query, what about also adding a method like
"ReadOnlyKeyValueStore view(IsolationLevel isolationLevel)" to ReadOnlyKeyValueStore?

This would enable us to easily select/switch between IsolationLevels, even
if the StateStore has many layers of wrappers (as is the case at the point
IQv1 deals with the store). Would this be acceptable, or do you have
another approach in mind?
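
A minimal sketch of how such a view method could pass through several wrapper layers. All types here are simplified, hypothetical stand-ins (a String-keyed ReadOnlyStore instead of ReadOnlyKeyValueStore, a local IsolationLevel enum), so this shows the delegation pattern only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified types; not the real Kafka Streams interfaces.
final class IsolationViewSketch {
    enum IsolationLevel { READ_UNCOMMITTED, READ_COMMITTED }

    interface ReadOnlyStore {
        String get(String key);
        ReadOnlyStore view(IsolationLevel level);
    }

    // Innermost store: committed data plus a buffer of uncommitted writes.
    static final class BaseStore implements ReadOnlyStore {
        private final Map<String, String> committed;
        private final Map<String, String> uncommitted;
        private final IsolationLevel level;

        BaseStore() {
            this(new HashMap<>(), new HashMap<>(), IsolationLevel.READ_UNCOMMITTED);
        }

        private BaseStore(Map<String, String> committed, Map<String, String> uncommitted,
                          IsolationLevel level) {
            this.committed = committed;
            this.uncommitted = uncommitted;
            this.level = level;
        }

        void put(String key, String value) { uncommitted.put(key, value); }

        void commit() {
            committed.putAll(uncommitted);
            uncommitted.clear();
        }

        @Override
        public String get(String key) {
            if (level == IsolationLevel.READ_UNCOMMITTED && uncommitted.containsKey(key)) {
                return uncommitted.get(key);
            }
            return committed.get(key);
        }

        // The view shares the same underlying data and only changes how reads are resolved.
        @Override
        public ReadOnlyStore view(IsolationLevel newLevel) {
            return new BaseStore(committed, uncommitted, newLevel);
        }
    }

    // Any wrapper layer simply forwards view() to the layer below and re-wraps the result.
    static final class WrapperStore implements ReadOnlyStore {
        private final ReadOnlyStore inner;

        WrapperStore(ReadOnlyStore inner) { this.inner = inner; }

        @Override
        public String get(String key) { return inner.get(key); }

        @Override
        public ReadOnlyStore view(IsolationLevel level) {
            return new WrapperStore(inner.view(level));
        }
    }
}
```

The caller never needs to know how deep the wrapping goes: calling view(...) on the outermost layer yields a store whose reads follow the requested level.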

Regards,
Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Hi Bruno,

Thanks for getting back to me!

2.
The fact that implementations can always track estimated memory usage in
the wrapper is a good point. I can remove -1 as an option, and I'll clarify
the JavaDoc that 0 is not just for non-transactional stores, which is
currently misleading.

6.
The problem with catching the exception in the downgrade process is that it
would require new code in the Kafka version being downgraded to. Since
users could conceivably downgrade to almost *any* older version of Kafka
Streams, I'm not sure how we could add that code?
The only way I can think of doing it would be to provide a dedicated
downgrade tool, that goes through every local store and removes the
offsets column families. But that seems like an unnecessary amount of extra
code to maintain just to handle a somewhat niche situation, when the
alternative (automatically wipe and restore stores) should be acceptable.

1, 4, 5: Agreed. I'll make the changes you've requested.

3a.
I agree that IsolationLevel makes more sense at query-time, and I actually
initially attempted to place the IsolationLevel at query-time, but I ran
into some problems:
- The key issue is that, under ALOS we're not staging writes in
transactions, so can't perform writes at the READ_COMMITTED isolation
level. However, this may be addressed if we decide to *always* use
transactions as discussed under 3b.
- IQv1 and IQv2 have quite different implementations. I remember having
some difficulty understanding the IQv1 internals, which made it difficult
to determine what needed to be changed. However, I *think* this can be
addressed for both implementations by wrapping the RocksDBStore in an
IsolationLevel-dependent wrapper, that overrides read methods (get, etc.)
to either read directly from the database or from the ongoing transaction.
But IQv1 might still be difficult.
- If IsolationLevel becomes a query constraint, then all other StateStores
will need to respect it, including the in-memory stores. This would require
us to adapt in-memory stores to stage their writes so they can be isolated
from READ_COMMITTED queries. It would also become an important
consideration for third-party stores on upgrade, as without changes, they
would not support READ_COMMITTED queries correctly.

Ultimately, I may need some help making the necessary change to IQv1 to
support this, but I don't think it's fundamentally impossible, if we want
to pursue this route.

3b.
The main reason I chose to keep ALOS un-transactional was to minimize
behavioural change for most users (I believe most Streams users use the
default configuration, which is ALOS). That said, it's clear that if ALOS
also used transactional stores, the only change in behaviour would be that
it would become *more correct*, which could be considered a "bug fix" by
users, rather than a change they need to handle.

I believe that performance using transactions (aka. RocksDB WriteBatches)
should actually be *better* than the un-batched write-path that is
currently used[1]. The only "performance" consideration will be the
increased memory usage that transactions require. Given the mitigations for
this memory that we have in place, I would expect that this is not a
problem for most users.

If we're happy to do so, we can make ALOS also use transactions.

Regards,
Nick

Link 1:
https://github.com/adamretter/rocksjava-write-methods-benchmark#results

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-13 Thread Bruno Cadonna

Hi Nick,

Thanks for the updates and sorry for the delay on my side!

1.
Making the default implementation for flush() a no-op sounds good to me.

2.
I think what was bugging me here is that a third-party state store needs
to implement the state store interface. That means they need to
implement a wrapper around the actual state store as we do for RocksDB
with RocksDBStore. So, a third-party state store can always estimate the
uncommitted bytes, if it wants, because the wrapper can record the added
bytes.
One case I can think of where returning -1 makes sense is when Streams
does not need to estimate the size of the write batch and trigger
extraordinary commits, because the third-party state store takes care of
memory. But in that case the method could also just return 0. Even that
case would be better solved with a method that returns whether the state
store itself manages the memory used for uncommitted bytes or not.
That said, I am fine with keeping the -1 return value; I was just
wondering when and if it will be used.

Regarding returning 0 for transactional state stores when the batch is
empty, I was just wondering because you explicitly stated

"or {@code 0} if this StateStore does not support transactions."

So it seemed to me returning 0 could only happen for non-transactional
state stores.
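
A minimal sketch of a wrapper recording the added bytes, assuming a HashMap as the wrapped third-party store. The class is hypothetical, and the method name approximateNumUncommittedBytes is an assumption about the KIP's naming rather than something stated in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around a third-party store; names are illustrative only.
final class UncommittedBytesSketch {
    private final Map<String, String> inner = new HashMap<>(); // stands in for the wrapped store
    private long uncommittedBytes = 0L;

    void put(String key, String value) {
        inner.put(key, value);
        // Record the size of what was added since the last commit.
        uncommittedBytes += key.getBytes(StandardCharsets.UTF_8).length
            + value.getBytes(StandardCharsets.UTF_8).length;
    }

    void commit() {
        uncommittedBytes = 0L;
    }

    // Returns 0 both before any write and right after a commit.
    long approximateNumUncommittedBytes() {
        return uncommittedBytes;
    }
}
```

This also shows why 0 cannot mean "not transactional": a transactional store with an empty batch reports 0 as well.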

3.
a) What do you think if we move the isolation level to IQ (v1 and v2)?
In the end this is the only component that really needs to specify the
isolation level. It is similar to the Kafka consumer that can choose
with what isolation level to read the input topic.
For IQv1 the isolation level should go into StoreQueryParameters. For
IQv2, I would add it to the Query interface.

b) Point a) raises the question of what should happen during at-least-once
processing when the state store does not use transactions. John in the
past proposed to also use transactions on state stores for
at-least-once. I like that idea, because it avoids aggregating the same
records over and over again in the case of a failure. We had a case in
the past where a Streams application in at-least-once mode was failing
continuously, for some reason I do not remember, before committing the
offsets. After each failover, the app aggregated again and again the
same records. Of course the aggregate increased to very wrong values
just because of the failover. With transactions on the state stores we
could have avoided this. The app would have output the same aggregate
multiple times (i.e., after each failover) but at least the value of the
aggregate would not depend on the number of failovers. Outputting the
same aggregate multiple times would be incorrect under exactly-once but
it is OK for at-least-once.
Whether it makes sense to add a config to turn transactions on state
stores on and off under at-least-once, or to just use transactions in any
case, is a question we should also discuss in this KIP. It depends a bit on
the performance trade-off. Maybe to be safe, I would add a config.
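
The failover scenario can be worked through with a small simulation. This is a sketch under simplifying assumptions: a single summed aggregate, a crash that always happens after the whole input is processed but before the commit, and hypothetical names throughout.

```java
// Hypothetical simulation of repeated failover under at-least-once; not Kafka Streams code.
final class AlosAggregationSketch {
    // Sums the input, crashing `failovers` times after processing every record
    // but before committing the offsets.
    static long aggregate(int[] input, int failovers, boolean transactional) {
        long persisted = 0L;              // state that survives a crash
        for (int attempt = 0; attempt <= failovers; attempt++) {
            long working = persisted;     // state as seen by the running task
            for (int value : input) {
                working += value;
                if (!transactional) {
                    persisted = working;  // the write goes straight to the store
                }
            }
            boolean crashedBeforeCommit = attempt < failovers;
            if (!crashedBeforeCommit) {
                persisted = working;      // commit: offsets and state move forward together
            }
            // After a crash the offsets were not committed, so the next attempt
            // re-reads the same input.
        }
        return persisted;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        System.out.println("without transactions: " + aggregate(input, 2, false));
        System.out.println("with transactions:    " + aggregate(input, 2, true));
    }
}
```

With the input {1, 2, 3} and two failovers, the non-transactional run ends at 18 while the transactional run ends at 6, so only the former depends on the number of failovers.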

4.
Your points are all valid. I tend to say to keep the metrics around
flush() until we remove flush() completely from the interface. Calls to
flush() might still exist since existing processors might still call
flush() explicitly as you mentioned in 1). For sure, we need to document
how the metrics change due to the transactions in the upgrade notes.

5.
I see. Then you should describe how the .position files are handled in
a dedicated section of the KIP or incorporate the description in the
"Atomic Checkpointing" section instead of only mentioning it in the
"Compatibility, Deprecation, and Migration Plan".

6.
Describing upgrading and downgrading in the KIP is a good idea.
Regarding downgrading, I think you could also catch the exception and do
what is needed to downgrade, e.g., drop the column family. See here for
an example:

https://github.com/apache/kafka/blob/63fee01366e6ce98b9dfafd279a45d40b80e282d/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBTimestampedStore.java#L75

It is a bit brittle, but it works.

Best,
Bruno

On 8/24/23 12:18 PM, Nick Telford wrote:

Hi Bruno,

Thanks for taking the time to review the KIP. I'm back from leave now and
intend to move this forwards as quickly as I can.

Addressing your points:

1.
Because flush() is part of the StateStore API, it's exposed to custom
Processors, which might be making calls to flush(). This was actually the
case in a few integration tests.
To maintain as much compatibility as possible, I'd prefer not to make this
an UnsupportedOperationException, as it will cause previously working
Processors to start throwing exceptions at runtime.
I agree that it doesn't make sense for it to proxy commit(), though, as
that would cause it to violate the "StateStores commit only when the Task
commits" rule.
Instead, I think we should make this a no-op. That way, existing user
Processors will continue to work as-before, without violation of store

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-11 Thread Colt McNealy

Nick,

Thanks for the response.

>Can you clarify how much state was restored in those 11 seconds?
That was a full restoration of ~650MB of state after I wiped the state
directory. The restoration after a crash with your branch is nearly
instantaneous, whereas with plain Kafka 3.5.0 a crash triggers a full
restoration (8 seconds).

Additionally, I pulled, rebuilt, and re-tested your changes and now the
restoration time with your branch is the same as with vanilla Streams!
Fantastic work!

I plan to do some more testing with larger state stores over the next
couple weeks, both with RocksDB and Speedb OSS. And perhaps I might even
try enabling some of the experimental Speedb OSS features, such as the
[Improved Write Flow](https://docs.speedb.io/speedb-features/write-flow).
As far as I understand, this isn't possible to do through the standard
RocksDBConfigSetter since some of the config options are Speedb-specific.

Cheers,
Colt McNealy

*Founder, LittleHorse.dev*


On Mon, Sep 11, 2023 at 4:29 AM Nick Telford  wrote:

> Hi Colt,
>
> Thanks for taking the time to run your benchmarks on this, that's
> incredibly helpful.
>
> > With KIP 892, I verified that unclean shutdown does not cause a fresh
> > restore (). I got the following benchmark results:
> > - Benchmark took 216 seconds
> > - 1,401 tasks per second on one partition
> > - 11 seconds to restore the state
>
> Can you clarify how much state was restored in those 11 seconds? Was this
> the time to do the full restore regardless, or was it the time to only
> restore a small fraction of the state (e.g. the last aborted transaction)?
>
> > -- QUESTION: Because we observed a significant (30% or so) and reproducible
> > slowdown during restoration, it seems like KIP-892 uses the checkpointing
> > behavior during restoration as well? If so, I would argue that this might
> > not be necessary, because everything we write is already committed, so we
> > don't need to change the behavior during restoration or standby tasks.
> > Perhaps we could write the offsets to RocksDB on every batch (or even every
> > 5 seconds or so).
>
> Restore has always used a completely separate code-path to regular writes,
> and continues to do so. I had a quick pass over the code and I suspect I
> know what's causing the performance degradation: for every restored record,
> I was adding the changelog offset of that record to the batch along with
> the record. This is different to the regular write-path, which only adds
> the current offsets once, on-commit. This writeOffset method is fairly
> expensive, since it has to serialize the TopicPartition and offset that's
> being written to the database.
>
> Assuming this is the cause, I've already pushed a fix to my branch that
> will only call writeOffset once per-batch, and also adds some caching to
> the serialization in writeOffset, that should also enhance performance of
> state commit in the normal write-path.
>
> Please let me know if this addresses the issue!
>
> Regards,
> Nick
>
>
> On Mon, 11 Sept 2023 at 05:38, Colt McNealy  wrote:
>
> > Howdy folks,
> >
> > First I wanted to say fantastic work and thank you to Nick. I built your
> > branch (https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0) and did
> > some testing on our Streams app with Kafka 3.5.0, your `kip-892-3.5.0`
> > branch, and your `kip-892-3.5.0` branch built with Speedb OSS 2.3.0.1. And
> > it worked! Including the global store (we don't have any segmented stores,
> > unfortunately).
> >
> > The test I ran involved running 3,000 workflows with 100 tasks each, and
> > roughly 650MB state total.
> >
> > With Streams 3.5.0, I indeed verified that unclean shutdown caused a fresh
> > restore from scratch. I also benchmarked my application at:
> > - Running the benchmark took 211 seconds
> > - 1,421 tasks per second on one partition
> > - 8 seconds to restore the state (650MB or so)
> >
> > With KIP 892, I verified that unclean shutdown does not cause a fresh
> > restore (). I got the following benchmark results:
> > - Benchmark took 216 seconds
> > - 1,401 tasks per second on one partition
> > - 11 seconds to restore the state
> >
> > I ran the restorations many times to ensure that there was no rounding
> > error or noise; the results were remarkably consistent. Additionally, I ran
> > the restorations with KIP-892 built with Speedb OSS. The restoration time
> > consistently came out as 10 seconds, which was an improvement from the 11
> > seconds observed with RocksDB + KIP-892.
> >
> > My application is bottlenecked mostly by serialization and deserialization,
> > so improving the performance of the state store doesn't really impact our
> > throughput that much. And the processing performance (benchmark time,
> > tasks/second) are pretty close in KIP-892 vs Streams 3.5.0. However, at
> > larger state store sizes, RocksDB performance begins to degrade, so that
> > might not be true once we pass 20GB per partition.
> >
> > -- QUESTION:

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-11 Thread Nick Telford

Hi Colt,

Thanks for taking the time to run your benchmarks on this, that's
incredibly helpful.

> With KIP 892, I verified that unclean shutdown does not cause a fresh
> restore (). I got the following benchmark results:
> - Benchmark took 216 seconds
> - 1,401 tasks per second on one partition
> - 11 seconds to restore the state

Can you clarify how much state was restored in those 11 seconds? Was this
the time to do the full restore regardless, or was it the time to only
restore a small fraction of the state (e.g. the last aborted transaction)?

> -- QUESTION: Because we observed a significant (30% or so) and reproducible
> slowdown during restoration, it seems like KIP-892 uses the checkpointing
> behavior during restoration as well? If so, I would argue that this might
> not be necessary, because everything we write is already committed, so we
> don't need to change the behavior during restoration or standby tasks.
> Perhaps we could write the offsets to RocksDB on every batch (or even every
> 5 seconds or so).

Restore has always used a completely separate code-path to regular writes,
and continues to do so. I had a quick pass over the code and I suspect I
know what's causing the performance degradation: for every restored record,
I was adding the changelog offset of that record to the batch along with
the record. This is different to the regular write-path, which only adds
the current offsets once, on-commit. This writeOffset method is fairly
expensive, since it has to serialize the TopicPartition and offset that's
being written to the database.

Assuming this is the cause, I've already pushed a fix to my branch that
will only call writeOffset once per-batch, and also adds some caching to
the serialization in writeOffset, that should also enhance performance of
state commit in the normal write-path.
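
To make the difference concrete, here is a minimal sketch of the two restore strategies described above. It is not the code from the branch: `RestoreBatchSketch`, its methods, and the `"changelog"` topic name are hypothetical stand-ins, and the batch is a plain in-memory list rather than a RocksDB WriteBatch. It only illustrates why serializing the offset once per batch is cheaper than once per record.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a restore batch; it counts offset writes.
class RestoreBatchSketch {
    final List<byte[]> writes = new ArrayList<>();
    int offsetWrites = 0;

    void put(String key, String value) {
        writes.add((key + "=" + value).getBytes(StandardCharsets.UTF_8));
    }

    // Stand-in for writeOffset: serializes topic, partition and offset.
    void writeOffset(String topic, int partition, long offset) {
        byte[] topicBytes = topic.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(topicBytes.length + Integer.BYTES + Long.BYTES);
        buf.put(topicBytes).putInt(partition).putLong(offset);
        writes.add(buf.array());
        offsetWrites++;
    }

    // Before the fix: the offset is serialized and added for every record.
    static RestoreBatchSketch restorePerRecord(String[][] records, long firstOffset) {
        RestoreBatchSketch batch = new RestoreBatchSketch();
        long offset = firstOffset;
        for (String[] kv : records) {
            batch.put(kv[0], kv[1]);
            batch.writeOffset("changelog", 0, offset++);
        }
        return batch;
    }

    // After the fix: only the offset of the last record in the batch is added.
    static RestoreBatchSketch restorePerBatch(String[][] records, long firstOffset) {
        RestoreBatchSketch batch = new RestoreBatchSketch();
        for (String[] kv : records) {
            batch.put(kv[0], kv[1]);
        }
        if (records.length > 0) {
            batch.writeOffset("changelog", 0, firstOffset + records.length - 1);
        }
        return batch;
    }
}
```

For a batch of N records, the first variant performs N offset serializations and the second performs one, which is the saving the fix is aiming for.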

Please let me know if this addresses the issue!

Regards,
Nick


On Mon, 11 Sept 2023 at 05:38, Colt McNealy  wrote:

> Howdy folks,
>
> First I wanted to say fantastic work and thank you to Nick. I built your
> branch (https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0) and did
> some testing on our Streams app with Kafka 3.5.0, your `kip-892-3.5.0`
> branch, and your `kip-892-3.5.0` branch built with Speedb OSS 2.3.0.1. And
> it worked! Including the global store (we don't have any segmented stores,
> unfortunately).
>
> The test I ran involved running 3,000 workflows with 100 tasks each, and
> roughly 650MB state total.
>
> With Streams 3.5.0, I indeed verified that unclean shutdown caused a fresh
> restore from scratch. I also benchmarked my application at:
> - Running the benchmark took 211 seconds
> - 1,421 tasks per second on one partition
> - 8 seconds to restore the state (650MB or so)
>
> With KIP 892, I verified that unclean shutdown does not cause a fresh
> restore (). I got the following benchmark results:
> - Benchmark took 216 seconds
> - 1,401 tasks per second on one partition
> - 11 seconds to restore the state
>
> I ran the restorations many times to ensure that there was no rounding
> error or noise; the results were remarkably consistent. Additionally, I ran
> the restorations with KIP-892 built with Speedb OSS. The restoration time
> consistently came out as 10 seconds, which was an improvement from the 11
> seconds observed with RocksDB + KIP-892.
>
> My application is bottlenecked mostly by serialization and deserialization,
> so improving the performance of the state store doesn't really impact our
> throughput that much. And the processing performance (benchmark time,
> tasks/second) are pretty close in KIP-892 vs Streams 3.5.0. However, at
> larger state store sizes, RocksDB performance begins to degrade, so that
> might not be true once we pass 20GB per partition.
>
> -- QUESTION: Because we observed a significant (30% or so) and reproducible
> slowdown during restoration, it seems like KIP-892 uses the checkpointing
> behavior during restoration as well? If so, I would argue that this might
> not be necessary, because everything we write is already committed, so we
> don't need to change the behavior during restoration or standby tasks.
> Perhaps we could write the offsets to RocksDB on every batch (or even every
> 5 seconds or so).
>
> -- Note: This was a very small-scale test, with <1GB of state (as I didn't
> have time to spend hours building up state). In the past I have noted that
> RocksDB performance degrades significantly after 25GB of state in one
> store. Future work involves determining the performance impact of KIP-892
> relative to trunk at larger scale, since it's possible that the relative
> behaviors are far different (i.e. relative to trunk, 892's processing and
> restoration throughput might be much better or much worse).
>
> -- Note: For those who want to replicate the tests, you can find the branch
> of our streams app here:
>
> https://github.com/littlehorse-enterprises/littlehorse/tree/minor/testing-streams-forks
> . The example I ran was

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-09-10 Thread Colt McNealy

Howdy folks,

First I wanted to say fantastic work and thank you to Nick. I built your
branch (https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0) and did
some testing on our Streams app with Kafka 3.5.0, your `kip-892-3.5.0`
branch, and your `kip-892-3.5.0` branch built with Speedb OSS 2.3.0.1. And
it worked! Including the global store (we don't have any segmented stores,
unfortunately).

The test I ran involved running 3,000 workflows with 100 tasks each, and
roughly 650MB state total.

With Streams 3.5.0, I indeed verified that unclean shutdown caused a fresh
restore from scratch. I also benchmarked my application at:
- Running the benchmark took 211 seconds
- 1,421 tasks per second on one partition
- 8 seconds to restore the state (650MB or so)

With KIP 892, I verified that unclean shutdown does not cause a fresh
restore (). I got the following benchmark results:
- Benchmark took 216 seconds
- 1,401 tasks per second on one partition
- 11 seconds to restore the state

I ran the restorations many times to ensure that there was no rounding
error or noise; the results were remarkably consistent. Additionally, I ran
the restorations with KIP-892 built with Speedb OSS. The restoration time
consistently came out as 10 seconds, which was an improvement from the 11
seconds observed with RocksDB + KIP-892.

My application is bottlenecked mostly by serialization and deserialization,
so improving the performance of the state store doesn't really impact our
throughput that much. And the processing performance (benchmark time,
tasks/second) is pretty close in KIP-892 vs Streams 3.5.0. However, at
larger state store sizes, RocksDB performance begins to degrade, so that
might not be true once we pass 20GB per partition.

-- QUESTION: Because we observed a significant (30% or so) and reproducible
slowdown during restoration, it seems like KIP-892 uses the checkpointing
behavior during restoration as well? If so, I would argue that this might
not be necessary, because everything we write is already committed, so we
don't need to change the behavior during restoration or standby tasks.
Perhaps we could write the offsets to RocksDB on every batch (or even every
5 seconds or so).
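
As a back-of-the-envelope check (not a new measurement), taking the rounded figures reported above at face value and using plain Streams 3.5.0 as the baseline, the relative changes work out roughly as follows:

```latex
\begin{align*}
\text{restoration, KIP-892 + RocksDB} &: \frac{11 - 8}{8} = 37.5\% \text{ slower} \\
\text{restoration, KIP-892 + Speedb OSS} &: \frac{10 - 8}{8} = 25\% \text{ slower} \\
\text{benchmark time} &: \frac{216 - 211}{211} \approx 2.4\% \text{ longer} \\
\text{throughput} &: \frac{1421 - 1401}{1421} \approx 1.4\% \text{ lower}
\end{align*}
```

The restoration times are whole seconds, so the first two percentages are coarse; they are consistent with the "30% or so" quoted above.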

-- Note: This was a very small-scale test, with <1GB of state (as I didn't
have time to spend hours building up state). In the past I have noted that
RocksDB performance degrades significantly after 25GB of state in one
store. Future work involves determining the performance impact of KIP-892
relative to trunk at larger scale, since it's possible that the relative
behaviors are far different (i.e. relative to trunk, 892's processing and
restoration throughput might be much better or much worse).

-- Note: For those who want to replicate the tests, you can find the branch
of our streams app here:
https://github.com/littlehorse-enterprises/littlehorse/tree/minor/testing-streams-forks
. The example I ran was `examples/hundred-tasks`, and I ran the server with
`./local-dev/do-server.sh one-partition`. The `STREAMS_TESTS.md` file has a
detailed breakdown of the testing.

Anyways, I'm super excited about this KIP and if a bit more future testing
goes well, we plan to ship our product with a build of KIP-892, Speedb OSS,
and potentially a few other minor tweaks that we are thinking about.

Thanks Nick!

Ride well,
Colt McNealy

*Founder, LittleHorse.dev*

On Thu, Aug 24, 2023 at 3:19 AM Nick Telford  wrote:

> Hi Bruno,
>
> Thanks for taking the time to review the KIP. I'm back from leave now and
> intend to move this forwards as quickly as I can.
>
> Addressing your points:
>
> 1.
> Because flush() is part of the StateStore API, it's exposed to custom
> Processors, which might be making calls to flush(). This was actually the
> case in a few integration tests.
> To maintain as much compatibility as possible, I'd prefer not to make this
> an UnsupportedOperationException, as it will cause previously working
> Processors to start throwing exceptions at runtime.
> I agree that it doesn't make sense for it to proxy commit(), though, as
> that would cause it to violate the "StateStores commit only when the Task
> commits" rule.
> Instead, I think we should make this a no-op. That way, existing user
> Processors will continue to work as-before, without violation of store
> consistency that would be caused by premature flush/commit of StateStore
> data to disk.
> What do you think?
>
> 2.
> As stated in the JavaDoc, when a StateStore implementation is
> transactional, but is unable to estimate the uncommitted memory usage, the
> method will return -1.
> The intention here is to permit third-party implementations that may not be
> able to estimate memory usage.
>
> Yes, it will be 0 when nothing has been written to the store yet. I thought
> that was implied by "This method will return an approximation of the memory
> would be freed by the next call to {@link #commit(Map)}" and "@return The
> approximate size of all records awaiting {@link

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-08-24 Thread Nick Telford

Hi Bruno,

Thanks for taking the time to review the KIP. I'm back from leave now and
intend to move this forwards as quickly as I can.

Addressing your points:

1.
Because flush() is part of the StateStore API, it's exposed to custom
Processors, which might be making calls to flush(). This was actually the
case in a few integration tests.
To maintain as much compatibility as possible, I'd prefer not to make this
an UnsupportedOperationException, as it will cause previously working
Processors to start throwing exceptions at runtime.
I agree that it doesn't make sense for it to proxy commit(), though, as
that would cause it to violate the "StateStores commit only when the Task
commits" rule.
Instead, I think we should make this a no-op. That way, existing user
Processors will continue to work as-before, without violation of store
consistency that would be caused by premature flush/commit of StateStore
data to disk.
What do you think?
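
A minimal sketch of the no-op option, for illustration only. `SketchStore` and `BufferedSketchStore` are hypothetical, simplified types, not the real Kafka Streams StateStore interface, and the `Map<String, Long>` offsets parameter is an assumption made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified store interface; not the real Kafka Streams StateStore.
interface SketchStore {
    void put(String key, String value);

    // Called by the (hypothetical) Task when it commits.
    void commit(Map<String, Long> changelogOffsets);

    // Kept for source compatibility: still callable, but does nothing.
    default void flush() {
        // intentionally empty
    }
}

// Hypothetical store that buffers writes until commit() is called.
class BufferedSketchStore implements SketchStore {
    final Map<String, String> uncommitted = new HashMap<>();
    final Map<String, String> committed = new HashMap<>();

    @Override
    public void put(String key, String value) {
        uncommitted.put(key, value);
    }

    @Override
    public void commit(Map<String, Long> changelogOffsets) {
        // The offsets are ignored in this sketch; only the data is moved.
        committed.putAll(uncommitted);
        uncommitted.clear();
    }
}
```

A Processor that still calls flush() keeps working, and the buffered writes only become visible as committed once commit() runs.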

2.
As stated in the JavaDoc, when a StateStore implementation is
transactional, but is unable to estimate the uncommitted memory usage, the
method will return -1.
The intention here is to permit third-party implementations that may not be
able to estimate memory usage.

Yes, it will be 0 when nothing has been written to the store yet. I thought
that was implied by "This method will return an approximation of the memory
would be freed by the next call to {@link #commit(Map)}" and "@return The
approximate size of all records awaiting {@link #commit(Map)}", however, I
can add it explicitly to the JavaDoc if you think this is unclear?
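
To illustrate the return-value convention discussed in this point, here is a small sketch. `SizeTrackingSketchStore` and its `canEstimate` flag are hypothetical; only the method name and the meaning of -1 and 0 come from the discussion above.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical store; only the return-value convention mirrors the thread.
class SizeTrackingSketchStore {
    private final Map<String, String> uncommitted = new HashMap<>();
    private final boolean canEstimate;
    private long uncommittedBytes = 0L;

    SizeTrackingSketchStore(boolean canEstimate) {
        this.canEstimate = canEstimate;
    }

    void put(String key, String value) {
        uncommitted.put(key, value);
        // Approximation: overwrites of the same key are counted again.
        uncommittedBytes += key.getBytes(StandardCharsets.UTF_8).length
                + value.getBytes(StandardCharsets.UTF_8).length;
    }

    void commit() {
        uncommitted.clear();
        uncommittedBytes = 0L;
    }

    // -1 when no estimate is possible, 0 when nothing awaits commit.
    long approximateNumUncommittedBytes() {
        return canEstimate ? uncommittedBytes : -1L;
    }
}
```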

3.
I realise this is probably the most contentious point in my design, and I'm
open to changing it if I'm unable to convince you of the benefits.
Nevertheless, here's my argument:
The Interactive Query (IQ) API(s) are directly provided StateStores to
query, and it may be important for users to programmatically know which
mode the StateStore is operating under. If we simply provide an
"eosEnabled" boolean (as used throughout the internal streams engine), or
similar, then users will need to understand the operation and consequences
of each available processing mode and how it pertains to their StateStore.

Interactive Query users aren't the only people that care about the
processing.guarantee/IsolationLevel of a StateStore: implementers of custom
StateStores also need to understand the behaviour expected of their
implementation. KIP-892 introduces some assumptions into the Streams Engine
about how StateStores operate under each processing mode, and it's
important that custom implementations adhere to those assumptions in order
to maintain the consistency guarantees.

IsolationLevels provide a high-level contract on the behaviour of the
StateStore: a user knows that under READ_COMMITTED, they will see writes
only after the Task has committed, and under READ_UNCOMMITTED they will see
writes immediately. No understanding of the details of each
processing.guarantee is required, either for IQ users or StateStore
implementers.
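
The contract can be sketched as follows. This is an illustration under stated assumptions, not the KIP's implementation: `SketchIsolationLevel` is a local stand-in for Kafka's IsolationLevel enum, and `IsolationSketchStore` with its `query` method is hypothetical. The level is fixed per store here, matching the store-level design discussed in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Kafka's IsolationLevel enum, redefined to stay self-contained.
enum SketchIsolationLevel { READ_UNCOMMITTED, READ_COMMITTED }

// Hypothetical store whose query-side reads follow a store-wide isolation level.
class IsolationSketchStore {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> buffer = new HashMap<>();
    private final SketchIsolationLevel level;

    IsolationSketchStore(SketchIsolationLevel level) {
        this.level = level;
    }

    SketchIsolationLevel isolationLevel() {
        return level;
    }

    void put(String key, String value) {
        buffer.put(key, value);
    }

    // Called when the (hypothetical) Task commits.
    void commit() {
        committed.putAll(buffer);
        buffer.clear();
    }

    // Read as an Interactive Query would see it.
    String query(String key) {
        if (level == SketchIsolationLevel.READ_UNCOMMITTED && buffer.containsKey(key)) {
            return buffer.get(key);
        }
        return committed.get(key);
    }
}
```

An IQ user only needs to check isolationLevel() to know whether a write can be observed before the Task commits.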

An argument can be made that these contractual guarantees can simply be
documented for the processing.guarantee (i.e. that exactly-once and
exactly-once-v2 behave like READ_COMMITTED and at-least-once behaves like
READ_UNCOMMITTED), but there are several small issues with this I'd prefer
to avoid:

   - Where would we document these contracts, in a way that is difficult
   for users/implementers to miss/ignore?
   - It's not clear to users that the processing mode is communicating
   an expectation of read isolation, unless they read the documentation. Users
   rarely consult documentation unless they feel they need to, so it's likely
   this detail would get missed by many users.
   - It tightly couples processing modes to read isolation. Adding new
   processing modes, or changing the read isolation of existing processing
   modes would be difficult/impossible.

Ultimately, the cost of introducing IsolationLevels is just a single
method, since we re-use the existing IsolationLevel enum from Kafka. This
gives us a clear place to document the contractual guarantees expected
of/provided by StateStores, that is accessible both by the StateStore
itself, and by IQ users.

(Writing this I've just realised that the StateStore and IQ APIs don't
actually give IQ users direct access to a StateStoreContext... Perhaps
StateStore should expose isolationLevel() itself too?)

4.
Yeah, I'm not comfortable renaming the metrics in-place either, as it's a
backwards incompatible change. My concern is that, if we leave the existing
"flush" metrics in place, they will be confusing to users. Right now,
"flush" metrics record explicit flushes to disk, but under KIP-892, even a
commit() will not explicitly flush data to disk - RocksDB will decide on
when to flush memtables to disk itself.

If we keep the existing "flush" metrics, we'd have two options, which both
seem pretty bad to me:

   1. Have them record calls to commit(), which would be

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-08-14 Thread Bruno Cadonna

Hi Nick!

Thanks for the updates!

1.
Why does StateStore#flush() default to
StateStore#commit(Collections.emptyMap())?
Since calls to flush() will not exist anymore after this KIP is
released, I would rather throw an unsupported operation exception by
default.

2.
When would a state store return -1 from
StateStore#approximateNumUncommittedBytes() while being transactional?

Wouldn't StateStore#approximateNumUncommittedBytes() also return 0 if
the state store is transactional but nothing has been written to the
state store yet?

3.
Sorry for bringing this up again. Does this KIP really need to introduce
StateStoreContext#isolationLevel()? StateStoreContext already has
appConfigs(), which basically exposes the same information, i.e., whether
EOS is enabled or not.

In one of your previous e-mails you wrote:

"My idea was to try to keep the StateStore interface as loosely coupled
from the Streams engine as possible, to give implementers more freedom,
and reduce the amount of internal knowledge required."

While I understand the intent, I doubt that it decreases the coupling of
a StateStore interface and the Streams engine. READ_COMMITTED only
applies to IQ but not to reads by processors. Thus, implementers need to
understand how Streams accesses the state stores.

I would like to hear what others think about this.

4.
Great to see new metrics exposed for transactional state stores! However, I
would prefer to add new metrics and deprecate (in the docs) the old
ones. You can find examples of deprecated metrics here:
https://kafka.apache.org/documentation/#selector_monitoring

5.
Why does the KIP mention position files? I do not think they are related
to transactions or flushes.

6.
I think we will also need to adapt/add integration tests besides unit
tests. Additionally, we probably need integration or system tests to
verify that upgrades and downgrades between transactional and
non-transactional state stores work as expected.

Best,
Bruno

On 7/21/23 10:34 PM, Nick Telford wrote:

One more thing: I noted John's suggestion in the KIP, under "Rejected
Alternatives". I still think it's an idea worth pursuing, but I believe
that it's out of the scope of this KIP, because it solves a different set
of problems to this KIP, and the scope of this one has already grown quite
large!

On Fri, 21 Jul 2023 at 21:33, Nick Telford wrote:

Hi everyone,

I've updated the KIP (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores)
with the latest changes; mostly bringing back "Atomic Checkpointing" (for
what feels like the 10th time!). I think the one thing missing is some
changes to metrics (notably the store "flush" metrics will need to be
renamed to "commit").

The reason I brought back Atomic Checkpointing was to decouple store flush
from store commit. This is important, because with Transactional
StateStores, we now need to call "flush" on *every* Task commit, and not
just when the StateStore is closing, otherwise our transaction buffer will
never be written and persisted, instead growing unbounded! I experimented
with some simple solutions, like forcing a store flush whenever the
transaction buffer was likely to exceed its configured size, but this was
brittle: it prevented the transaction buffer from being configured to be
unbounded, and it still would have required explicit flushes of RocksDB,
yielding sub-optimal performance and memory utilization.

I deemed Atomic Checkpointing to be the "right" way to resolve this
problem. By ensuring that the changelog offsets that correspond to the most
recently written records are always atomically written to the StateStore
(by writing them to the same transaction buffer), we can avoid forcibly
flushing the RocksDB memtables to disk, letting RocksDB flush them only
when necessary, without losing any of our consistency guarantees. See the
updated KIP for more info.

I have fully implemented these changes, although I'm still not entirely
happy with the implementation for segmented StateStores, so I plan to
refactor that. Despite that, all tests pass. If you'd like to try out or
review this highly experimental and incomplete branch, it's available here:
https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0. Note: it's built
against Kafka 3.5.0 so that I had a stable base to build and test it on,
and to enable easy apples-to-apples comparisons in a live environment. I
plan to rebase it against trunk once it's nearer completion and has been
proven on our main application.

I would really appreciate help in reviewing and testing:
- Segmented (Versioned, Session and Window) stores
- Global stores

As I do not currently use either of these, so my primary test environment
doesn't test these areas.

I'm going on Parental Leave starting next week for a few weeks, so will
not have time to move this forward until late August. That said, your
feedback is welcome and appreciated, I just won't be able to respond as

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-21 Thread Nick Telford

One more thing: I noted John's suggestion in the KIP, under "Rejected
Alternatives". I still think it's an idea worth pursuing, but I believe
that it's out of the scope of this KIP, because it solves a different set
of problems to this KIP, and the scope of this one has already grown quite
large!

On Fri, 21 Jul 2023 at 21:33, Nick Telford  wrote:

> Hi everyone,
>
> I've updated the KIP (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores)
> with the latest changes; mostly bringing back "Atomic Checkpointing" (for
> what feels like the 10th time!). I think the one thing missing is some
> changes to metrics (notably the store "flush" metrics will need to be
> renamed to "commit").
>
> The reason I brought back Atomic Checkpointing was to decouple store flush
> from store commit. This is important, because with Transactional
> StateStores, we now need to call "flush" on *every* Task commit, and not
> just when the StateStore is closing, otherwise our transaction buffer will
> never be written and persisted, instead growing unbounded! I experimented
> with some simple solutions, like forcing a store flush whenever the
> transaction buffer was likely to exceed its configured size, but this was
> brittle: it prevented the transaction buffer from being configured to be
> unbounded, and it still would have required explicit flushes of RocksDB,
> yielding sub-optimal performance and memory utilization.
>
> I deemed Atomic Checkpointing to be the "right" way to resolve this
> problem. By ensuring that the changelog offsets that correspond to the most
> recently written records are always atomically written to the StateStore
> (by writing them to the same transaction buffer), we can avoid forcibly
> flushing the RocksDB memtables to disk, letting RocksDB flush them only
> when necessary, without losing any of our consistency guarantees. See the
> updated KIP for more info.
>
> I have fully implemented these changes, although I'm still not entirely
> happy with the implementation for segmented StateStores, so I plan to
> refactor that. Despite that, all tests pass. If you'd like to try out or
> review this highly experimental and incomplete branch, it's available here:
> https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0. Note: it's built
> against Kafka 3.5.0 so that I had a stable base to build and test it on,
> and to enable easy apples-to-apples comparisons in a live environment. I
> plan to rebase it against trunk once it's nearer completion and has been
> proven on our main application.
>
> I would really appreciate help in reviewing and testing:
> - Segmented (Versioned, Session and Window) stores
> - Global stores
>
> As I do not currently use either of these, so my primary test environment
> doesn't test these areas.
>
> I'm going on Parental Leave starting next week for a few weeks, so will
> not have time to move this forward until late August. That said, your
> feedback is welcome and appreciated, I just won't be able to respond as
> quickly as usual.
>
> Regards,
> Nick
>
> On Mon, 3 Jul 2023 at 16:23, Nick Telford  wrote:
>
>> Hi Bruno
>>
>> Yes, that's correct, although the impact on IQ is not something I had
>> considered.
>>
>> What about atomically updating the state store from the transaction
>>> buffer every commit interval and writing the checkpoint (thus, flushing
>>> the memtable) every configured amount of data and/or number of commit
>>> intervals?
>>>
>>
>> I'm not quite sure I follow. Are you suggesting that we add an additional
>> config for the max number of commit intervals between checkpoints? That
>> way, we would checkpoint *either* when the transaction buffers are nearly
>> full, *OR* whenever a certain number of commit intervals have elapsed,
>> whichever comes first?
>>
>> That certainly seems reasonable, although this re-ignites an earlier
>> debate about whether a config should be measured in "number of commit
>> intervals", instead of just an absolute time.
>>
>> FWIW, I realised that this issue is the reason I was pursuing the Atomic
>> Checkpoints, as it de-couples memtable flush from checkpointing, which
>> enables us to just checkpoint on every commit without any performance
>> impact. Atomic Checkpointing is definitely the "best" solution, but I'm not
>> sure if this is enough to bring it back into this KIP.
>>
>> I'm currently working on moving all the transactional logic directly into
>> RocksDBStore itself, which does away with the StateStore#newTransaction
>> method, and reduces the number of new classes introduced, significantly
>> reducing the complexity. If it works, and the complexity is drastically
>> reduced, I may try bringing back Atomic Checkpoints into this KIP.
>>
>> Regards,
>> Nick
>>
>> On Mon, 3 Jul 2023 at 15:27, Bruno Cadonna  wrote:
>>
>>> Hi Nick,
>>>
>>> Thanks for the insights! Very interesting!
>>>
>>> As far as I understand, you want to atomically update the state store
>>> from the

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-21 Thread Nick Telford

Hi everyone,

I've updated the KIP (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores)
with the latest changes; mostly bringing back "Atomic Checkpointing" (for
what feels like the 10th time!). I think the one thing missing is some
changes to metrics (notably the store "flush" metrics will need to be
renamed to "commit").

The reason I brought back Atomic Checkpointing was to decouple store flush
from store commit. This is important, because with Transactional
StateStores, we now need to call "flush" on *every* Task commit, and not
just when the StateStore is closing, otherwise our transaction buffer will
never be written and persisted, instead growing unbounded! I experimented
with some simple solutions, like forcing a store flush whenever the
transaction buffer was likely to exceed its configured size, but this was
brittle: it prevented the transaction buffer from being configured to be
unbounded, and it still would have required explicit flushes of RocksDB,
yielding sub-optimal performance and memory utilization.

I deemed Atomic Checkpointing to be the "right" way to resolve this
problem. By ensuring that the changelog offsets that correspond to the most
recently written records are always atomically written to the StateStore
(by writing them to the same transaction buffer), we can avoid forcibly
flushing the RocksDB memtables to disk, letting RocksDB flush them only
when necessary, without losing any of our consistency guarantees. See the
updated KIP for more info.
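
As a rough illustration of the idea (not the implementation on the branch), the sketch below writes the changelog offsets into the same transaction buffer as the data, so both land in the store in a single step. `AtomicCheckpointSketch`, the `OFFSET_PREFIX` key prefix, and the string-keyed maps are all hypothetical; a plain in-memory map stands in for RocksDB, and `putAll` stands in for an atomic batch write.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: offsets travel in the same buffer as the records.
class AtomicCheckpointSketch {
    // Hypothetical reserved key prefix for offset entries.
    static final String OFFSET_PREFIX = "__offset__/";

    private final Map<String, String> db = new HashMap<>();              // stands in for RocksDB
    private final Map<String, String> txnBuffer = new LinkedHashMap<>(); // stands in for a WriteBatch

    void put(String key, String value) {
        txnBuffer.put(key, value);
    }

    void commit(Map<String, Long> changelogOffsets) {
        // Add the offsets to the same buffer as the records they cover.
        for (Map.Entry<String, Long> e : changelogOffsets.entrySet()) {
            txnBuffer.put(OFFSET_PREFIX + e.getKey(), Long.toString(e.getValue()));
        }
        // One write of the whole buffer; in this sketch it stands in for an
        // atomic batch write, so data and offsets can never diverge.
        db.putAll(txnBuffer);
        txnBuffer.clear();
    }

    String get(String key) {
        return db.get(key);
    }

    Long committedOffset(String topicPartition) {
        String raw = db.get(OFFSET_PREFIX + topicPartition);
        return raw == null ? null : Long.valueOf(raw);
    }
}
```

Because the offsets are stored alongside the data, nothing in the sketch needs to force a flush to disk to keep a separate checkpoint in sync.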

I have fully implemented these changes, although I'm still not entirely
happy with the implementation for segmented StateStores, so I plan to
refactor that. Despite that, all tests pass. If you'd like to try out or
review this highly experimental and incomplete branch, it's available here:
https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0. Note: it's built
against Kafka 3.5.0 so that I had a stable base to build and test it on,
and to enable easy apples-to-apples comparisons in a live environment. I
plan to rebase it against trunk once it's nearer completion and has been
proven on our main application.

I would really appreciate help in reviewing and testing:
- Segmented (Versioned, Session and Window) stores
- Global stores

I do not currently use either of these, so my primary test environment
doesn't test these areas.

I'm going on Parental Leave starting next week for a few weeks, so will not
have time to move this forward until late August. That said, your feedback
is welcome and appreciated, I just won't be able to respond as quickly as
usual.

Regards,
Nick

On Mon, 3 Jul 2023 at 16:23, Nick Telford  wrote:

> Hi Bruno
>
> Yes, that's correct, although the impact on IQ is not something I had
> considered.
>
> What about atomically updating the state store from the transaction
>> buffer every commit interval and writing the checkpoint (thus, flushing
>> the memtable) every configured amount of data and/or number of commit
>> intervals?
>>
>
> I'm not quite sure I follow. Are you suggesting that we add an additional
> config for the max number of commit intervals between checkpoints? That
> way, we would checkpoint *either* when the transaction buffers are nearly
> full, *OR* whenever a certain number of commit intervals have elapsed,
> whichever comes first?
>
> That certainly seems reasonable, although this re-ignites an earlier
> debate about whether a config should be measured in "number of commit
> intervals", instead of just an absolute time.
>
> FWIW, I realised that this issue is the reason I was pursuing the Atomic
> Checkpoints, as it de-couples memtable flush from checkpointing, which
> enables us to just checkpoint on every commit without any performance
> impact. Atomic Checkpointing is definitely the "best" solution, but I'm not
> sure if this is enough to bring it back into this KIP.
>
> I'm currently working on moving all the transactional logic directly into
> RocksDBStore itself, which does away with the StateStore#newTransaction
> method, and reduces the number of new classes introduced, significantly
> reducing the complexity. If it works, and the complexity is drastically
> reduced, I may try bringing back Atomic Checkpoints into this KIP.
>
> Regards,
> Nick
>
> On Mon, 3 Jul 2023 at 15:27, Bruno Cadonna  wrote:
>
>> Hi Nick,
>>
>> Thanks for the insights! Very interesting!
>>
>> As far as I understand, you want to atomically update the state store
>> from the transaction buffer, flush the memtable of a state store and
>> write the checkpoint not after the commit time elapsed but after the
>> transaction buffer reached a size that would lead to exceeding
>> statestore.transaction.buffer.max.bytes before the next commit interval
>> ends.
>> That means, the Kafka transaction would commit every commit interval but
>> the state store will only be atomically updated roughly every
>> statestore.transaction.buffer.max.bytes of data. Also IQ would then only
>>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-03 Thread Nick Telford

Hi Bruno

Yes, that's correct, although the impact on IQ is not something I had
considered.

> What about atomically updating the state store from the transaction
> buffer every commit interval and writing the checkpoint (thus, flushing
> the memtable) every configured amount of data and/or number of commit
> intervals?
>

I'm not quite sure I follow. Are you suggesting that we add an additional
config for the max number of commit intervals between checkpoints? That
way, we would checkpoint *either* when the transaction buffers are nearly
full, *OR* whenever a certain number of commit intervals have elapsed,
whichever comes first?
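
To check my understanding, the "whichever comes first" rule could be sketched as below. This is only an illustration: `CheckpointPolicySketch` and its two parameters are hypothetical, and no such config or class exists in Kafka Streams today.

```java
// Hypothetical policy object; neither the class nor its parameters exist in Kafka Streams.
final class CheckpointPolicySketch {
    private final long maxBufferBytes;
    private final int maxCommitsBetweenCheckpoints;
    private int commitsSinceCheckpoint = 0;

    CheckpointPolicySketch(long maxBufferBytes, int maxCommitsBetweenCheckpoints) {
        this.maxBufferBytes = maxBufferBytes;
        this.maxCommitsBetweenCheckpoints = maxCommitsBetweenCheckpoints;
    }

    // Called once per Task commit; returns true if this commit should also checkpoint.
    boolean shouldCheckpoint(long bufferedBytes, long expectedBytesBeforeNextCommit) {
        commitsSinceCheckpoint++;
        // Condition 1: the buffers are likely to overflow before the next commit.
        boolean nearlyFull = bufferedBytes + expectedBytesBeforeNextCommit > maxBufferBytes;
        // Condition 2: too many commit intervals have passed without a checkpoint.
        boolean tooManyCommits = commitsSinceCheckpoint >= maxCommitsBetweenCheckpoints;
        if (nearlyFull || tooManyCommits) {
            commitsSinceCheckpoint = 0;
            return true;
        }
        return false;
    }
}
```

Swapping the commit counter for an elapsed-time check would give the "absolute time" variant of the same rule.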

That certainly seems reasonable, although this re-ignites an earlier debate
about whether a config should be measured in "number of commit intervals",
instead of just an absolute time.

FWIW, I realised that this issue is the reason I was pursuing the Atomic
Checkpoints, as it de-couples memtable flush from checkpointing, which
enables us to just checkpoint on every commit without any performance
impact. Atomic Checkpointing is definitely the "best" solution, but I'm not
sure if this is enough to bring it back into this KIP.

I'm currently working on moving all the transactional logic directly into
RocksDBStore itself, which does away with the StateStore#newTransaction
method, and reduces the number of new classes introduced, significantly
reducing the complexity. If it works, and the complexity is drastically
reduced, I may try bringing back Atomic Checkpoints into this KIP.

Regards,
Nick

On Mon, 3 Jul 2023 at 15:27, Bruno Cadonna  wrote:

> Hi Nick,
>
> Thanks for the insights! Very interesting!
>
> As far as I understand, you want to atomically update the state store
> from the transaction buffer, flush the memtable of a state store and
> write the checkpoint not after the commit time elapsed but after the
> transaction buffer reached a size that would lead to exceeding
> statestore.transaction.buffer.max.bytes before the next commit interval
> ends.
> That means, the Kafka transaction would commit every commit interval but
> the state store will only be atomically updated roughly every
> statestore.transaction.buffer.max.bytes of data. Also IQ would then only
> see new data roughly every statestore.transaction.buffer.max.bytes.
> After a failure the state store needs to restore up to
> statestore.transaction.buffer.max.bytes.
>
> Is this correct?
>
> What about atomically updating the state store from the transaction
> buffer every commit interval and writing the checkpoint (thus, flushing
> the memtable) every configured amount of data and/or number of commit
> intervals? In such a way, we would have the same delay for records
> appearing in output topics and IQ because both would appear when the
> Kafka transaction is committed. However, after a failure the state store
> still needs to restore up to statestore.transaction.buffer.max.bytes and
> it might restore data that is already in the state store because the
> checkpoint lags behind the last stable offset (i.e. the last committed
> offset) of the changelog topics. Restoring data that is already in the
> state store is idempotent, so EOS should not be violated.
> This solution needs at least one new config to specify when a checkpoint
> should be written.
>
>
>
> A small correction to your previous e-mail that does not change anything
> you said: Under ALOS the default commit interval is 30 seconds, not five
> seconds.
>
>
> Best,
> Bruno
>
>
> On 01.07.23 12:37, Nick Telford wrote:
> > Hi everyone,
> >
> > I've begun performance testing my branch on our staging environment,
> > putting it through its paces in our non-trivial application. I'm already
> > observing the same increased flush rate that we saw the last time we
> > attempted to use a version of this KIP, but this time, I think I know
> > why.
> >
> > Pre-KIP-892, StreamTask#postCommit, which is called at the end of the
> > Task commit process, has the following behaviour:
> >
> > - Under ALOS: checkpoint the state stores. This includes
> > flushing memtables in RocksDB. This is acceptable because the default
> > commit.interval.ms is 5 seconds, so forcibly flushing memtables
> > every 5 seconds is acceptable for most applications.
> > - Under EOS: checkpointing is not done, *unless* it's being forced,
> > due to e.g. the Task closing or being revoked. This means that under
> > normal processing conditions, the state stores will not be
> > checkpointed, and will not have memtables flushed at all, unless
> > RocksDB decides to flush them on its own. Checkpointing stores and
> > force-flushing their memtables is only done when a Task is being
> > closed.
> >
> > Under EOS, KIP-892 needs to checkpoint stores on at least *some* normal
> > Task commits, in order to write the RocksDB transaction buffers to the
> > state stores, and to ensure the offsets are synced to disk to prevent
> > restores from getting out of

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-03 Thread Bruno Cadonna


Hi Nick,

Thanks for the insights! Very interesting!

As far as I understand, you want to atomically update the state store 
from the transaction buffer, flush the memtable of a state store and 
write the checkpoint not after the commit time elapsed but after the 
transaction buffer reached a size that would lead to exceeding 
statestore.transaction.buffer.max.bytes before the next commit interval 
ends.
That means, the Kafka transaction would commit every commit interval but 
the state store will only be atomically updated roughly every 
statestore.transaction.buffer.max.bytes of data. Also IQ would then only 
see new data roughly every statestore.transaction.buffer.max.bytes.
After a failure the state store needs to restore up to 
statestore.transaction.buffer.max.bytes.


Is this correct?

What about atomically updating the state store from the transaction 
buffer every commit interval and writing the checkpoint (thus, flushing 
the memtable) every configured amount of data and/or number of commit 
intervals? In such a way, we would have the same delay for records 
appearing in output topics and IQ because both would appear when the 
Kafka transaction is committed. However, after a failure the state store 
still needs to restore up to statestore.transaction.buffer.max.bytes and 
it might restore data that is already in the state store because the 
checkpoint lags behind the last stable offset (i.e. the last committed 
offset) of the changelog topics. Restoring data that is already in the 
state store is idempotent, so EOS should not be violated.
This solution needs at least one new config to specify when a checkpoint 
should be written.
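
The idempotence argument can be illustrated with a small sketch. It assumes a changelog of plain key/value overwrites and uses a map as the state store; `IdempotentRestoreSketch` and `ChangelogRecord` are hypothetical names used only for this example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model: the state store is a map and the changelog is a list of overwrites.
final class IdempotentRestoreSketch {
    static final class ChangelogRecord {
        final long offset;
        final String key;
        final String value;

        ChangelogRecord(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    // Replay every record at or after fromOffset into the store.
    static void restore(Map<String, String> store, List<ChangelogRecord> changelog, long fromOffset) {
        for (ChangelogRecord r : changelog) {
            if (r.offset >= fromOffset) {
                store.put(r.key, r.value);
            }
        }
    }
}
```

If the store already contains everything up to the end of the changelog and the checkpoint lags behind, replaying from the older checkpoint offset ends in the same state once the replay reaches the end of the changelog.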




A small correction to your previous e-mail that does not change anything 
you said: Under ALOS the default commit interval is 30 seconds, not five
seconds.



Best,
Bruno


On 01.07.23 12:37, Nick Telford wrote:

Hi everyone,

I've begun performance testing my branch on our staging environment,
putting it through its paces in our non-trivial application. I'm already
observing the same increased flush rate that we saw the last time we
attempted to use a version of this KIP, but this time, I think I know why.

Pre-KIP-892, StreamTask#postCommit, which is called at the end of the Task
commit process, has the following behaviour:

- Under ALOS: checkpoint the state stores. This includes
flushing memtables in RocksDB. This is acceptable because the default
commit.interval.ms is 5 seconds, so forcibly flushing memtables every 5
seconds is acceptable for most applications.
- Under EOS: checkpointing is not done, *unless* it's being forced, due
to e.g. the Task closing or being revoked. This means that under normal
processing conditions, the state stores will not be checkpointed, and will
not have memtables flushed at all, unless RocksDB decides to flush them on
its own. Checkpointing stores and force-flushing their memtables is only
done when a Task is being closed.

Under EOS, KIP-892 needs to checkpoint stores on at least *some* normal
Task commits, in order to write the RocksDB transaction buffers to the
state stores, and to ensure the offsets are synced to disk to prevent
restores from getting out of hand. Consequently, my current implementation
calls maybeCheckpoint on *every* Task commit, which is far too frequent.
This causes checkpoints every 10,000 records, which is a change in flush
behaviour, potentially causing performance problems for some applications.

I'm looking into possible solutions, and I'm currently leaning towards
using the statestore.transaction.buffer.max.bytes configuration to
checkpoint Tasks once we are likely to exceed it. This would complement the
existing "early Task commit" functionality that this configuration
provides, in the following way:

- Currently, we use statestore.transaction.buffer.max.bytes to force an
early Task commit if processing more records would cause our state store
transactions to exceed the memory assigned to them.
- New functionality: when a Task *does* commit, we will not checkpoint
the stores (and hence flush the transaction buffers) unless we expect to
cross the statestore.transaction.buffer.max.bytes threshold before the next
commit

I'm also open to suggestions.

Regards,
Nick

On Thu, 22 Jun 2023 at 14:06, Nick Telford  wrote:


Hi Bruno!

3.
By "less predictable for users", I meant in terms of understanding the
performance profile under various circumstances. The more complex the
solution, the more difficult it would be for users to understand the
performance they see. For example, spilling records to disk when the
transaction buffer reaches a threshold would, I expect, reduce write
throughput. This reduction in write throughput could be unexpected, and
potentially difficult to diagnose/understand for users.
At the moment, I think the "early commit" concept is relatively
straightforward; it's easy to document, and conceptually fairly obvious to

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-07-01 Thread Nick Telford

Hi everyone,

I've begun performance testing my branch on our staging environment,
putting it through its paces in our non-trivial application. I'm already
observing the same increased flush rate that we saw the last time we
attempted to use a version of this KIP, but this time, I think I know why.

Pre-KIP-892, StreamTask#postCommit, which is called at the end of the Task
commit process, has the following behaviour:

   - Under ALOS: checkpoint the state stores. This includes
   flushing memtables in RocksDB. This is acceptable because the default
   commit.interval.ms is 5 seconds, so forcibly flushing memtables every 5
   seconds is acceptable for most applications.
   - Under EOS: checkpointing is not done, *unless* it's being forced, due
   to e.g. the Task closing or being revoked. This means that under normal
   processing conditions, the state stores will not be checkpointed, and will
   not have memtables flushed at all, unless RocksDB decides to flush them on
   its own. Checkpointing stores and force-flushing their memtables is only
   done when a Task is being closed.
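
The two cases above reduce to the following decision. This is a simplified model of the behaviour as I've described it, not the actual StreamTask code; `PostCommitSketch`, `Guarantee` and `checkpointsOnCommit` are hypothetical names.

```java
// Hypothetical model of the decision described above; not the actual StreamTask code.
final class PostCommitSketch {
    enum Guarantee { ALOS, EOS }

    // Returns true if the stores are checkpointed (and memtables flushed) on this commit.
    static boolean checkpointsOnCommit(Guarantee guarantee, boolean enforceCheckpoint) {
        if (guarantee == Guarantee.ALOS) {
            // ALOS: every commit checkpoints the state stores.
            return true;
        }
        // EOS: only when forced, e.g. the Task is closing or being revoked.
        return enforceCheckpoint;
    }
}
```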

Under EOS, KIP-892 needs to checkpoint stores on at least *some* normal
Task commits, in order to write the RocksDB transaction buffers to the
state stores, and to ensure the offsets are synced to disk to prevent
restores from getting out of hand. Consequently, my current implementation
calls maybeCheckpoint on *every* Task commit, which is far too frequent.
This causes checkpoints every 10,000 records, which is a change in flush
behaviour, potentially causing performance problems for some applications.

I'm looking into possible solutions, and I'm currently leaning towards
using the statestore.transaction.buffer.max.bytes configuration to
checkpoint Tasks once we are likely to exceed it. This would complement the
existing "early Task commit" functionality that this configuration
provides, in the following way:

   - Currently, we use statestore.transaction.buffer.max.bytes to force an
   early Task commit if processing more records would cause our state store
   transactions to exceed the memory assigned to them.
   - New functionality: when a Task *does* commit, we will not checkpoint
   the stores (and hence flush the transaction buffers) unless we expect to
   cross the statestore.transaction.buffer.max.bytes threshold before the next
   commit
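
A rough sketch of how the two uses of the threshold would fit together is below. Everything here is hypothetical (`BufferThresholdSketch` and its methods are not from my branch), and in particular the way it predicts the next interval, by assuming it will write as many bytes as the last one did, is just an assumption for illustration.

```java
// Hypothetical sketch; the estimate of "bytes before the next commit" is an assumption.
final class BufferThresholdSketch {
    private final long maxBytes;            // stands in for statestore.transaction.buffer.max.bytes
    private long bufferedBytes = 0;         // uncommitted bytes held in the transaction buffers
    private long bytesSinceLastCommit = 0;  // bytes written during the current commit interval

    BufferThresholdSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Existing behaviour: true if an early Task commit is needed before writing this record.
    boolean recordWouldOverflow(long recordBytes) {
        return bufferedBytes + recordBytes > maxBytes;
    }

    void write(long recordBytes) {
        bufferedBytes += recordBytes;
        bytesSinceLastCommit += recordBytes;
    }

    // New behaviour: on commit, checkpoint only if the next interval is likely to overflow.
    // Returns true if the stores were checkpointed and the buffers flushed.
    boolean commit() {
        boolean checkpoint = bufferedBytes + bytesSinceLastCommit > maxBytes;
        bytesSinceLastCommit = 0;
        if (checkpoint) {
            bufferedBytes = 0;
        }
        return checkpoint;
    }
}
```

With a 100-byte limit and 30 bytes written per interval, the first two commits skip the checkpoint and the third one performs it.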

I'm also open to suggestions.

Regards,
Nick

On Thu, 22 Jun 2023 at 14:06, Nick Telford  wrote:

> Hi Bruno!
>
> 3.
> By "less predictable for users", I meant in terms of understanding the
> performance profile under various circumstances. The more complex the
> solution, the more difficult it would be for users to understand the
> performance they see. For example, spilling records to disk when the
> transaction buffer reaches a threshold would, I expect, reduce write
> throughput. This reduction in write throughput could be unexpected, and
> potentially difficult to diagnose/understand for users.
> At the moment, I think the "early commit" concept is relatively
> straightforward; it's easy to document, and conceptually fairly obvious to
> users. We could probably add a metric to make it easier to understand when
> it happens though.
>
> 3. (the second one)
> The IsolationLevel is *essentially* an indirect way of telling StateStores
> whether they should be transactional. READ_COMMITTED essentially requires
> transactions, because it dictates that two threads calling
> `newTransaction()` should not see writes from the other transaction until
> they have been committed. With READ_UNCOMMITTED, all bets are off, and
> stores can allow threads to observe written records at any time, which is
> essentially "no transactions". That said, StateStores are free to implement
> these guarantees however they can, which is a bit more relaxed than
> dictating "you must use transactions". For example, with RocksDB we would
> implement these as READ_COMMITTED == WBWI-based "transactions",
> READ_UNCOMMITTED == direct writes to the database. But with other storage
> engines, it might be preferable to *always* use transactions, even when
> unnecessary; or there may be storage engines that don't provide
> transactions, but the isolation guarantees can be met using a different
> technique.
> My idea was to try to keep the StateStore interface as loosely coupled
> from the Streams engine as possible, to give implementers more freedom, and
> reduce the amount of internal knowledge required.
> That said, I understand that "IsolationLevel" might not be the right
> abstraction, and we can always make it much more explicit if required, e.g.
> boolean transactional()
>
> 7-8.
> I can make these changes either later today or tomorrow.
>
> Small update:
> I've rebased my branch on trunk and fixed a bunch of issues that needed
> addressing. Currently, all the tests pass, which is promising, but it will
> need to undergo some performance testing. I haven't (yet) worked on
> removing the `newTransaction()` stuff,

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-22 Thread Nick Telford

Hi Bruno!

3.
By "less predictable for users", I meant in terms of understanding the
performance profile under various circumstances. The more complex the
solution, the more difficult it would be for users to understand the
performance they see. For example, spilling records to disk when the
transaction buffer reaches a threshold would, I expect, reduce write
throughput. This reduction in write throughput could be unexpected, and
potentially difficult to diagnose/understand for users.
At the moment, I think the "early commit" concept is relatively
straightforward; it's easy to document, and conceptually fairly obvious to
users. We could probably add a metric to make it easier to understand when
it happens though.

3. (the second one)
The IsolationLevel is *essentially* an indirect way of telling StateStores
whether they should be transactional. READ_COMMITTED essentially requires
transactions, because it dictates that two threads calling
`newTransaction()` should not see writes from the other transaction until
they have been committed. With READ_UNCOMMITTED, all bets are off, and
stores can allow threads to observe written records at any time, which is
essentially "no transactions". That said, StateStores are free to implement
these guarantees however they can, which is a bit more relaxed than
dictating "you must use transactions". For example, with RocksDB we would
implement these as READ_COMMITTED == WBWI-based "transactions",
READ_UNCOMMITTED == direct writes to the database. But with other storage
engines, it might be preferable to *always* use transactions, even when
unnecessary; or there may be storage engines that don't provide
transactions, but the isolation guarantees can be met using a different
technique.
My idea was to try to keep the StateStore interface as loosely coupled from
the Streams engine as possible, to give implementers more freedom, and
reduce the amount of internal knowledge required.
That said, I understand that "IsolationLevel" might not be the right
abstraction, and we can always make it much more explicit if required, e.g.
boolean transactional()
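
As a rough illustration of the mapping, here is a sketch that uses maps in place of RocksDB and a WriteBatchWithIndex. `IsolationSketch` is a hypothetical class, not the proposed StateStore API, and its local `IsolationLevel` enum only mirrors the two levels discussed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store; a RocksDB-backed store would buffer in a WriteBatchWithIndex.
final class IsolationSketch {
    enum IsolationLevel { READ_COMMITTED, READ_UNCOMMITTED }

    private final IsolationLevel level;
    private final Map<String, String> database = new HashMap<>();
    private final Map<String, String> uncommitted = new HashMap<>();

    IsolationSketch(IsolationLevel level) {
        this.level = level;
    }

    // The more explicit alternative mentioned above.
    boolean transactional() {
        return level == IsolationLevel.READ_COMMITTED;
    }

    void put(String key, String value) {
        if (transactional()) {
            uncommitted.put(key, value);  // buffered until commit
        } else {
            database.put(key, value);     // direct write to the database
        }
    }

    // What another thread, e.g. an Interactive Query, would observe.
    String read(String key) {
        return database.get(key);
    }

    void commit() {
        database.putAll(uncommitted);
        uncommitted.clear();
    }
}
```

A store built on a different engine could keep the same `read` guarantees while implementing `put` and `commit` in whatever way suits it.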

7-8.
I can make these changes either later today or tomorrow.

Small update:
I've rebased my branch on trunk and fixed a bunch of issues that needed
addressing. Currently, all the tests pass, which is promising, but it will
need to undergo some performance testing. I haven't (yet) worked on
removing the `newTransaction()` stuff, but I would expect that,
behaviourally, it should make no difference. The branch is available at
https://github.com/nicktelford/kafka/tree/KIP-892-c if anyone is interested
in taking an early look.

Regards,
Nick

On Thu, 22 Jun 2023 at 11:59, Bruno Cadonna  wrote:

> Hi Nick,
>
> 1.
> Yeah, I agree with you. That was actually also my point. I understood
> that John was proposing the ingestion path as a way to avoid the early
> commits. Probably, I misinterpreted the intent.
>
> 2.
> I agree with John here, that actually it is public API. My question is
> how this usage pattern affects normal processing.
>
> 3.
> My concern is that checking for the size of the transaction buffer and
> maybe triggering an early commit affects the whole processing of Kafka
> Streams. The transactionality of a state store is not confined to the
> state store itself, but spills over and changes the behavior of other
> parts of the system. I agree with you that it is a decent compromise. I
> just wanted to analyse the downsides and list the options to overcome
> them. I also agree with you that all options seem quite heavy compared
> with your KIP. I do not understand what you mean with "less predictable
> for users", though.
>
>
> I found the discussions about the alternatives really interesting. But I
> also think that your plan sounds good and we should continue with it!
>
>
> Some comments on your reply to my e-mail on June 20th:
>
> 3.
> Ah, now, I understand the reasoning behind putting isolation level in
> the state store context. Thanks! Should that also be a way to give the
> state store the opportunity to decide whether to turn on
> transactions or not?
> With my comment, I was more concerned about how do you know if a
> checkpoint file needs to be written under EOS, if you do not have a way
> to know if the state store is transactional or not. If a state store is
> transactional, the checkpoint file can be written during normal
> processing under EOS. If the state store is not transactional, the
> checkpoint file must not be written under EOS.
>
> 7.
> My point was about not only considering the bytes in memory in config
> statestore.uncommitted.max.bytes, but also bytes that might be spilled
> on disk. Basically, I was wondering whether you should remove the
> "memory" in "Maximum number of memory bytes to be used to
> buffer uncommitted state-store records." My thinking was that even if a
> state store spills uncommitted bytes to disk, limiting the overall bytes
> might make sense. Thinking about it again and

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-22 Thread Bruno Cadonna


Hi Nick,

1.
Yeah, I agree with you. That was actually also my point. I understood 
that John was proposing the ingestion path as a way to avoid the early 
commits. Probably, I misinterpreted the intent.


2.
I agree with John here, that actually it is public API. My question is 
how this usage pattern affects normal processing.


3.
My concern is that checking for the size of the transaction buffer and 
maybe triggering an early commit affects the whole processing of Kafka 
Streams. The transactionality of a state store is not confined to the 
state store itself, but spills over and changes the behavior of other 
parts of the system. I agree with you that it is a decent compromise. I 
just wanted to analyse the downsides and list the options to overcome 
them. I also agree with you that all options seem quite heavy compared 
with your KIP. I do not understand what you mean with "less predictable 
for users", though.



I found the discussions about the alternatives really interesting. But I 
also think that your plan sounds good and we should continue with it!



Some comments on your reply to my e-mail on June 20th:

3.
Ah, now, I understand the reasoning behind putting isolation level in 
the state store context. Thanks! Should that also be a way to give the 
state store the opportunity to decide whether to turn on
transactions or not?
With my comment, I was more concerned about how do you know if a 
checkpoint file needs to be written under EOS, if you do not have a way 
to know if the state store is transactional or not. If a state store is 
transactional, the checkpoint file can be written during normal 
processing under EOS. If the state store is not transactional, the 
checkpoint file must not be written under EOS.


7.
My point was about not only considering the bytes in memory in config 
statestore.uncommitted.max.bytes, but also bytes that might be spilled 
on disk. Basically, I was wondering whether you should remove the 
"memory" in "Maximum number of memory bytes to be used to
buffer uncommitted state-store records." My thinking was that even if a 
state store spills uncommitted bytes to disk, limiting the overall bytes 
might make sense. Thinking about it again and considering the recent 
discussions, it does not make too much sense anymore.

I like the name statestore.transaction.buffer.max.bytes that you proposed.

8.
A high-level description (without implementation details) of how Kafka 
Streams will manage the commit of changelog transactions, state store 
transactions and checkpointing would be great. Would be great if you 
could also add some sentences about the behavior in case of a failure. 
For instance how does a transactional state store recover after a 
failure or what happens with the transaction buffer, etc. (that is what 
I meant by "fail-over" in point 9.)


Best,
Bruno

On 21.06.23 18:50, Nick Telford wrote:

Hi Bruno,

1.
Isn't this exactly the same issue that WriteBatchWithIndex transactions
have, whereby exceeding (or likely to exceed) configured memory needs to
trigger an early commit?

2.
This is one of my big concerns. Ultimately, any approach based on cracking
open RocksDB internals and using it in ways it's not really designed for is
likely to have some unforeseen performance or consistency issues.

3.
What's your motivation for removing these early commits? While not ideal, I
think they're a decent compromise to ensure consistency whilst maintaining
good and predictable performance.
All 3 of your suggested ideas seem *very* complicated, and might actually
make behaviour less predictable for users as a consequence.

I'm a bit concerned that the scope of this KIP is growing a bit out of
control. While it's good to discuss ideas for future improvements, I think
it's important to narrow the scope down to a design that achieves the most
pressing objectives (constant sized restorations during dirty
close/unexpected errors). Any design that this KIP produces can ultimately
be changed in the future, especially if the bulk of it is internal
behaviour.

I'm going to spend some time next week trying to re-work the original
WriteBatchWithIndex design to remove the newTransaction() method, such that
it's just an implementation detail of RocksDBStore. That way, if we want to
replace WBWI with something in the future, like the SST file management
outlined by John, then we can do so with little/no API changes.

Regards,

Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread John Roesler

No worries, I should have included a ";)" to let you know it was mostly 
tongue-in-cheek.


Thanks,
-John

On 6/21/23 12:34, Nick Telford wrote:

Sorry John, I didn't mean to mis-characterize it like that. I was mostly
referring to disabling memtables. AFAIK the SstFileWriter API is primarily
designed for bulk ingest, e.g. for bootstrapping a database from a backup,
rather than during normal operation of an online database. That said, I was
overly alarmist in my phrasing.

My concern is only that, while the concept seems quite reasonable, there
are no doubt hidden issues lurking.

On Wed, 21 Jun 2023 at 18:25, John Roesler  wrote:


Thanks Nick,

That sounds good to me.

I can't let (2) slide, though.. Writing and ingesting SST files is not a
RocksDB internal, but rather a supported usage pattern on public APIs.
Regardless, I think your overall preference is fine with me, especially
if we can internalize this change within the store implementation itself.

Thanks,
-John

On 6/21/23 11:50, Nick Telford wrote:

Hi Bruno,

1.
Isn't this exactly the same issue that WriteBatchWithIndex transactions
have, whereby exceeding (or likely to exceed) configured memory needs to
trigger an early commit?

2.
This is one of my big concerns. Ultimately, any approach based on cracking
open RocksDB internals and using it in ways it's not really designed for is
likely to have some unforeseen performance or consistency issues.

3.
What's your motivation for removing these early commits? While not ideal, I
think they're a decent compromise to ensure consistency whilst maintaining
good and predictable performance.
All 3 of your suggested ideas seem *very* complicated, and might actually
make behaviour less predictable for users as a consequence.

I'm a bit concerned that the scope of this KIP is growing a bit out of
control. While it's good to discuss ideas for future improvements, I think
it's important to narrow the scope down to a design that achieves the most
pressing objectives (constant sized restorations during dirty
close/unexpected errors). Any design that this KIP produces can ultimately
be changed in the future, especially if the bulk of it is internal
behaviour.

I'm going to spend some time next week trying to re-work the original
WriteBatchWithIndex design to remove the newTransaction() method, such that
it's just an implementation detail of RocksDBStore. That way, if we want to
replace WBWI with something in the future, like the SST file management
outlined by John, then we can do so with little/no API changes.

Regards,

Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread Nick Telford

Sorry John, I didn't mean to mis-characterize it like that. I was mostly
referring to disabling memtables. AFAIK the SstFileWriter API is primarily
designed for bulk ingest, e.g. for bootstrapping a database from a backup,
rather than during normal operation of an online database. That said, I was
overly alarmist in my phrasing.

My concern is only that, while the concept seems quite reasonable, there
are no doubt hidden issues lurking.

On Wed, 21 Jun 2023 at 18:25, John Roesler  wrote:

> Thanks Nick,
>
> That sounds good to me.
>
> I can't let (2) slide, though.. Writing and ingesting SST files is not a
> RocksDB internal, but rather a supported usage pattern on public APIs.
> Regardless, I think your overall preference is fine with me, especially
> if we can internalize this change within the store implementation itself.
>
> Thanks,
> -John
>
> On 6/21/23 11:50, Nick Telford wrote:
> > Hi Bruno,
> >
> > 1.
> > Isn't this exactly the same issue that WriteBatchWithIndex transactions
> > have, whereby exceeding (or likely to exceed) configured memory needs to
> > trigger an early commit?
> >
> > 2.
> > This is one of my big concerns. Ultimately, any approach based on
> > cracking open RocksDB internals and using it in ways it's not really
> > designed for is likely to have some unforeseen performance or
> > consistency issues.
> >
> > 3.
> > What's your motivation for removing these early commits? While not
> > ideal, I think they're a decent compromise to ensure consistency whilst
> > maintaining good and predictable performance.
> > All 3 of your suggested ideas seem *very* complicated, and might actually
> > make behaviour less predictable for users as a consequence.
> >
> > I'm a bit concerned that the scope of this KIP is growing a bit out of
> > control. While it's good to discuss ideas for future improvements, I
> > think it's important to narrow the scope down to a design that achieves
> > the most pressing objectives (constant sized restorations during dirty
> > close/unexpected errors). Any design that this KIP produces can
> > ultimately be changed in the future, especially if the bulk of it is
> > internal behaviour.
> >
> > I'm going to spend some time next week trying to re-work the original
> > WriteBatchWithIndex design to remove the newTransaction() method, such
> > that it's just an implementation detail of RocksDBStore. That way, if we
> > want to replace WBWI with something in the future, like the SST file
> > management outlined by John, then we can do so with little/no API
> > changes.
> >
> > Regards,
> >
> > Nick
> >
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread John Roesler


Thanks Nick,

That sounds good to me.

I can't let (2) slide, though. Writing and ingesting SST files is not a
RocksDB internal, but rather a supported usage pattern on public APIs. 
Regardless, I think your overall preference is fine with me, especially 
if we can internalize this change within the store implementation itself.


Thanks,
-John

On 6/21/23 11:50, Nick Telford wrote:

Hi Bruno,

1.
Isn't this exactly the same issue that WriteBatchWithIndex transactions
have, whereby exceeding (or likely to exceed) configured memory needs to
trigger an early commit?

2.
This is one of my big concerns. Ultimately, any approach based on cracking
open RocksDB internals and using it in ways it's not really designed for is
likely to have some unforeseen performance or consistency issues.

3.
What's your motivation for removing these early commits? While not ideal, I
think they're a decent compromise to ensure consistency whilst maintaining
good and predictable performance.
All 3 of your suggested ideas seem *very* complicated, and might actually
make behaviour less predictable for users as a consequence.

I'm a bit concerned that the scope of this KIP is growing a bit out of
control. While it's good to discuss ideas for future improvements, I think
it's important to narrow the scope down to a design that achieves the most
pressing objectives (constant sized restorations during dirty
close/unexpected errors). Any design that this KIP produces can ultimately
be changed in the future, especially if the bulk of it is internal
behaviour.

I'm going to spend some time next week trying to re-work the original
WriteBatchWithIndex design to remove the newTransaction() method, such that
it's just an implementation detail of RocksDBStore. That way, if we want to
replace WBWI with something in the future, like the SST file management
outlined by John, then we can do so with little/no API changes.

Regards,

Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread Nick Telford

Hi Bruno,

1.
Isn't this exactly the same issue that WriteBatchWithIndex transactions
have, whereby exceeding (or likely to exceed) configured memory needs to
trigger an early commit?

2.
This is one of my big concerns. Ultimately, any approach based on cracking
open RocksDB internals and using it in ways it's not really designed for is
likely to have some unforeseen performance or consistency issues.

3.
What's your motivation for removing these early commits? While not ideal, I
think they're a decent compromise to ensure consistency whilst maintaining
good and predictable performance.
All 3 of your suggested ideas seem *very* complicated, and might actually
make behaviour less predictable for users as a consequence.
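
To illustrate the early-commit behaviour from point 1, here is a minimal sketch of a size-bounded transaction buffer that signals when an early commit is needed. It is only a model: a TreeMap stands in for WriteBatchWithIndex, and the names BoundedTxnBuffer and maxUncommittedBytes are hypothetical, not taken from the KIP or from RocksDB.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

// Hypothetical stand-in for a WriteBatchWithIndex-style transaction buffer.
final class BoundedTxnBuffer {
    private final long maxUncommittedBytes;                           // assumed memory bound
    private final TreeMap<String, byte[]> uncommitted = new TreeMap<>();
    private final TreeMap<String, byte[]> committed = new TreeMap<>(); // models the database
    private long uncommittedBytes = 0;

    BoundedTxnBuffer(long maxUncommittedBytes) {
        this.maxUncommittedBytes = maxUncommittedBytes;
    }

    /** Buffers a write and returns true if the caller should commit early. */
    boolean put(String key, byte[] value) {
        byte[] previous = uncommitted.put(key, value);
        if (previous != null) {
            // Overwrites replace the old entry, so do not count its bytes twice.
            uncommittedBytes -= sizeOf(key, previous);
        }
        uncommittedBytes += sizeOf(key, value);
        return uncommittedBytes >= maxUncommittedBytes;
    }

    /** Moves all buffered writes into the committed state and resets the buffer. */
    void commit() {
        committed.putAll(uncommitted);
        uncommitted.clear();
        uncommittedBytes = 0;
    }

    long uncommittedBytes() {
        return uncommittedBytes;
    }

    int committedCount() {
        return committed.size();
    }

    private static long sizeOf(String key, byte[] value) {
        return key.getBytes(StandardCharsets.UTF_8).length + value.length;
    }
}
```

The caller checks the return value of put() after each write and commits when it is true, which is the "early commit" being discussed.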

I'm a bit concerned that the scope of this KIP is growing a bit out of
control. While it's good to discuss ideas for future improvements, I think
it's important to narrow the scope down to a design that achieves the most
pressing objectives (constant sized restorations during dirty
close/unexpected errors). Any design that this KIP produces can ultimately
be changed in the future, especially if the bulk of it is internal
behaviour.

I'm going to spend some time next week trying to re-work the original
WriteBatchWithIndex design to remove the newTransaction() method, such that
it's just an implementation detail of RocksDBStore. That way, if we want to
replace WBWI with something in the future, like the SST file management
outlined by John, then we can do so with little/no API changes.

Regards,

Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

2023-06-21 Thread Bruno Cadonna


Hi John,
Hi Nick,

Thanks for the interesting ideas!

Here are my comments.

1.
It is not clear to me what happens if the cache exceeds its configured
size between two commits. Currently, the cache evicts its entries and
writes dirty entries to the state store. Should the cache write those 
dirty entries to the SST files directly? But then, how would the task 
that owns the state store or an interactive query query those SST files? 
We cannot ingest the SST files into the state store yet, because we 
first need to commit the Kafka transaction containing those entries.

Nick, I think you had a similar question a couple of messages ago.

2.
I am wondering how ingesting external SST files affects compaction and 
whether Streams will experience more write stalls.


3.
I would really like to get rid of the extra commits triggered by 
exceeding the configured size of the transaction as proposed in Nick's 
KIP. Avoiding these extra commits would allow us to put all the
transactional logic behind the state store interface without the need to 
check for memory sizes before starting to process a record.
However, at the moment, I do not see how to achieve this without the 
possibility to spill records to disk. John's (SST ingest) and Alex's 
(KIP-844) ideas go in that direction. Alex's idea has a performance 
issue and with John's idea it is not clear to me how to query records in 
SST files.


I see three options to improve on those two ideas:

a. After the Kafka transaction is committed, ingest the SST files produced
by the temporary state store proposed by Alex.
That might improve performance since in this way we update the state 
store in batches. However, I am not sure how performant clearing the 
temporary state store is.


b. Implementing querying of SST files in Java.
This seems like quite some work. However, the implementation does not need to
be super performant because the data to query might not be too much and 
querying the SST files might also not happen too often.


c. Make WriteBatchWithIndex spill to disk when configured size is exceeded.
That would require a change in RocksDB. For that, we would need to either
fork RocksDB or get this change into one of the next releases. The latter
would also imply upgrading the RocksDB version in Kafka Streams, which
might cause backwards-compatibility issues.



Best,
Bruno


On 20.06.23 23:43, Nick Telford wrote:

Here's what I'm thinking: based on Bruno's earlier feedback, I'm going to
try to simplify my original design down such that it needs no/minimal
changes to the public interface.

If that succeeds, then it should also be possible to transparently
implement the "no memtables" solution as a performance optimization when
the record cache is enabled. I consider this approach only an optimisation,
because of the need to still support stores with the cache disabled.

For that reason, I think the "no memtables" approach would probably best be
suited as a follow-up KIP, but that we keep it in mind during the design of
this one.

What do you think?

Regards,
Nick


On Tue, 20 Jun 2023, 22:26 John Roesler,  wrote:


Oh, that's a good point.

On the topic of a behavioral switch for disabled caches, the typical use
case for disabling the cache is to cause each individual update to
propagate down the topology, so another thought might be to just go
ahead and add the memory we would have used for the memtables to the
cache size, but if people did disable the cache entirely, then we could
still go ahead and forward the records on each write?

I know that Guozhang was also proposing for a while to actually decouple
caching and forwarding, which might provide a way to side-step this
dilemma (i.e., we just always forward and only apply the cache to state
and changelog writes).

By the way, I'm basing my statement on why you'd disable caches on
memory, but also on the guidance here:

https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html
. That doc also contains a section on how to bound the total memory
usage across RocksDB memtables, which points to another benefit of
disabling memtables and managing the write buffer ourselves (simplified
memory configuration).

Thanks,
-John

On 6/20/23 16:05, Nick Telford wrote:

Potentially we could just go the memtable with Rocks WriteBatches route if
the cache is disabled?

On Tue, 20 Jun 2023, 22:00 John Roesler,  wrote:


Touché!

Ok, I agree that figuring out the case of a disabled cache would be
non-trivial. Ingesting single-record SST files will probably not be
performant, but benchmarking may prove different. Or maybe we can have
some reserved cache space on top of the user-configured cache, which we
would have reclaimed from the memtable space. Or some other, more
creative solution.

Thanks,
-John

On 6/20/23 15:30, Nick Telford wrote:

Note that users can disable the cache, which would still be

ok, I think. We wouldn't ingest the SST files on every record, but just
append to them and only ingest them

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Here's what I'm thinking: based on Bruno's earlier feedback, I'm going to
try to simplify my original design down such that it needs no/minimal
changes to the public interface.

If that succeeds, then it should also be possible to transparently
implement the "no memtables" solution as a performance optimization when
the record cache is enabled. I consider this approach only an optimisation,
because of the need to still support stores with the cache disabled.

For that reason, I think the "no memtables" approach would probably best be
suited as a follow-up KIP, but that we keep it in mind during the design of
this one.

What do you think?

Regards,
Nick


On Tue, 20 Jun 2023, 22:26 John Roesler,  wrote:

> Oh, that's a good point.
>
> On the topic of a behavioral switch for disabled caches, the typical use
> case for disabling the cache is to cause each individual update to
> propagate down the topology, so another thought might be to just go
> ahead and add the memory we would have used for the memtables to the
> cache size, but if people did disable the cache entirely, then we could
> still go ahead and forward the records on each write?
>
> I know that Guozhang was also proposing for a while to actually decouple
> caching and forwarding, which might provide a way to side-step this
> dilemma (i.e., we just always forward and only apply the cache to state
> and changelog writes).
>
> By the way, I'm basing my statement on why you'd disable caches on
> memory, but also on the guidance here:
>
> https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html
> . That doc also contains a section on how to bound the total memory
> usage across RocksDB memtables, which points to another benefit of
> disabling memtables and managing the write buffer ourselves (simplified
> memory configuration).
>
> Thanks,
> -John
>
> On 6/20/23 16:05, Nick Telford wrote:
> > Potentially we could just go the memorable with Rocks WriteBatches route
> if
> > the cache is disabled?
> >
> > On Tue, 20 Jun 2023, 22:00 John Roesler,  wrote:
> >
> >> Touché!
> >>
> >> Ok, I agree that figuring out the case of a disabled cache would be
> >> non-trivial. Ingesting single-record SST files will probably not be
> >> performant, but benchmarking may prove different. Or maybe we can have
> >> some reserved cache space on top of the user-configured cache, which we
> >> would have reclaimed from the memtable space. Or some other, more
> >> creative solution.
> >>
> >> Thanks,
> >> -John
> >>
> >> On 6/20/23 15:30, Nick Telford wrote:
>  Note that users can disable the cache, which would still be
> >>> ok, I think. We wouldn't ingest the SST files on every record, but just
> >>> append to them and only ingest them on commit, when we're already
> >>> waiting for acks and a RocksDB commit.
> >>>
> >>> In this case, how would uncommitted records be read by joins?
> >>>
> >>> On Tue, 20 Jun 2023, 20:51 John Roesler,  wrote:
> >>>
>  Ah, sorry Nick,
> 
>  I just meant the regular heap based cache that we maintain in
> Streams. I
>  see that it's not called "RecordCache" (my mistake).
> 
>  The actual cache is ThreadCache:
> 
> 
> >>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java
> 
>  Here's the example of how we use the cache in KeyValueStore:
> 
> 
> >>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
> 
>  It's basically just an on-heap Map of records that have not yet been
>  written to the changelog or flushed into the underlying store. It gets
>  flushed when the total cache size exceeds `cache.max.bytes.buffering`
> or
>  the `commit.interval.ms` elapses.
> 
>  Speaking of those configs, another benefit to this idea is that we
> would
>  no longer need to trigger extra commits based on the size of the
> ongoing
>  transaction. Instead, we'd just preserve the existing cache-flush
>  behavior. Note that users can disable the cache, which would still be
>  ok, I think. We wouldn't ingest the SST files on every record, but
> just
>  append to them and only ingest them on commit, when we're already
>  waiting for acks and a RocksDB commit.
> 
>  Thanks,
>  -John
> 
>  On 6/20/23 14:09, Nick Telford wrote:
> > Hi John,
> >
> > By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find
> >> any
> > class called "RecordCache"...
> >
> > Cheers,
> >
> > Nick
> >
> > On Tue, 20 Jun 2023 at 19:42, John Roesler 
> >> wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for picking this up again!
> >>
> >> I did have one new thought over the intervening months, which I'd
> like
> >> your take on.
> >>
> >> What if, instead of using the RocksDB atomic write primitive at all,
> >>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Oh, that's a good point.

On the topic of a behavioral switch for disabled caches, the typical use
case for disabling the cache is to cause each individual update to
propagate down the topology, so another thought might be to just go
ahead and add the memory we would have used for the memtables to the
cache size, but if people did disable the cache entirely, then we could
still go ahead and forward the records on each write?

I know that Guozhang was also proposing for a while to actually decouple
caching and forwarding, which might provide a way to side-step this
dilemma (i.e., we just always forward and only apply the cache to state
and changelog writes).

By the way, I'm basing my statement about why you'd disable caches on
memory, but also on the guidance here:
https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html
. That doc also contains a section on how to bound the total memory
usage across RocksDB memtables, which points to another benefit of
disabling memtables and managing the write buffer ourselves (simplified
memory configuration).

Thanks,
-John

On 6/20/23 16:05, Nick Telford wrote:

Potentially we could just go the memtable with Rocks WriteBatches route if
the cache is disabled?

On Tue, 20 Jun 2023, 22:00 John Roesler, wrote:

Touché!

Ok, I agree that figuring out the case of a disabled cache would be
non-trivial. Ingesting single-record SST files will probably not be
performant, but benchmarking may prove different. Or maybe we can have
some reserved cache space on top of the user-configured cache, which we
would have reclaimed from the memtable space. Or some other, more
creative solution.

Thanks,
-John

On 6/20/23 15:30, Nick Telford wrote:

Note that users can disable the cache, which would still be
ok, I think. We wouldn't ingest the SST files on every record, but just
append to them and only ingest them on commit, when we're already
waiting for acks and a RocksDB commit.

In this case, how would uncommitted records be read by joins?

On Tue, 20 Jun 2023, 20:51 John Roesler, wrote:

Ah, sorry Nick,

I just meant the regular heap based cache that we maintain in Streams. I
see that it's not called "RecordCache" (my mistake).

The actual cache is ThreadCache:

https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java

Here's the example of how we use the cache in KeyValueStore:

https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java

It's basically just an on-heap Map of records that have not yet been
written to the changelog or flushed into the underlying store. It gets
flushed when the total cache size exceeds `cache.max.bytes.buffering` or
the `commit.interval.ms` elapses.
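
As a rough illustration of the flush behaviour described above, here is a minimal sketch of an on-heap write-back cache that flushes its dirty entries when a size bound is exceeded or a commit interval has elapsed. This is not how ThreadCache is actually implemented. The class name DirtyEntryCache and its methods are hypothetical, sizes are counted in characters for simplicity, and the sketch flushes everything at once rather than evicting individual entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical model of a write-back record cache; not the real ThreadCache.
final class DirtyEntryCache {
    private final long maxBytes;          // plays the role of cache.max.bytes.buffering
    private final long commitIntervalMs;  // plays the role of commit.interval.ms
    private final BiConsumer<String, String> downstream; // underlying store / changelog writer
    private final Map<String, String> dirty = new LinkedHashMap<>();
    private long dirtyBytes = 0;
    private long lastFlushMs;

    DirtyEntryCache(long maxBytes, long commitIntervalMs, long nowMs,
                    BiConsumer<String, String> downstream) {
        this.maxBytes = maxBytes;
        this.commitIntervalMs = commitIntervalMs;
        this.lastFlushMs = nowMs;
        this.downstream = downstream;
    }

    /** Buffers a record and flushes if either trigger has fired. */
    void put(String key, String value, long nowMs) {
        String old = dirty.put(key, value);
        if (old != null) {
            dirtyBytes -= key.length() + old.length();
        }
        dirtyBytes += key.length() + value.length();
        maybeFlush(nowMs);
    }

    /** Flushes when the cache is over its size bound or the interval has elapsed. */
    void maybeFlush(long nowMs) {
        if (dirtyBytes > maxBytes || nowMs - lastFlushMs >= commitIntervalMs) {
            flush(nowMs);
        }
    }

    /** Writes every dirty entry downstream in insertion order, then resets the cache. */
    void flush(long nowMs) {
        dirty.forEach(downstream);
        dirty.clear();
        dirtyBytes = 0;
        lastFlushMs = nowMs;
    }

    int dirtyCount() {
        return dirty.size();
    }
}
```

Repeated writes to the same key are collapsed before anything reaches the downstream writer, which is why only the latest value per key is flushed.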

Speaking of those configs, another benefit to this idea is that we would
no longer need to trigger extra commits based on the size of the ongoing
transaction. Instead, we'd just preserve the existing cache-flush
behavior. Note that users can disable the cache, which would still be
ok, I think. We wouldn't ingest the SST files on every record, but just
append to them and only ingest them on commit, when we're already
waiting for acks and a RocksDB commit.

Thanks,
-John

On 6/20/23 14:09, Nick Telford wrote:

Hi John,

By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
class called "RecordCache"...

Cheers,

Nick

On Tue, 20 Jun 2023 at 19:42, John Roesler wrote:

Hi Nick,

Thanks for picking this up again!

I did have one new thought over the intervening months, which I'd like
your take on.

What if, instead of using the RocksDB atomic write primitive at all, we
instead just:
1. disable memtables entirely
2. directly write the RecordCache into SST files when we flush
3. atomically ingest the SST file(s) into RocksDB when we get the ACK
from the changelog (see
https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
and
https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
and
https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
)
4. track the changelog offsets either in another CF or the same CF with
a reserved key, either of which will make the changelog offset update
atomic with the file ingestions

I suspect this'll have a number of benefits:
* writes to RocksDB will always be atomic
* we don't fragment memory between the RecordCache and the memtables
* RecordCache gives far higher performance than memtable for reads and
writes
* we don't need any new "transaction" concepts or memory bound configs
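
The four steps above can be modelled roughly as follows. This is a pure-Java sketch of the idea only and does not call RocksDB: sorted maps stand in for SST files and for the database, and the class name IngestOnAckStore, its methods, and the reserved key "__changelog_offset" are all hypothetical.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of "write the cache to a file, ingest it on changelog ACK".
final class IngestOnAckStore {
    // Assumed reserved key that keeps the changelog offset next to the data (step 4).
    static final String OFFSET_KEY = "__changelog_offset";

    // Committed state; readers only ever see a fully ingested version.
    private volatile NavigableMap<String, String> committed =
        Collections.unmodifiableNavigableMap(new TreeMap<String, String>());
    // On-heap cache of records not yet flushed.
    private final TreeMap<String, String> cache = new TreeMap<>();
    // Models SST file contents that are written but not yet ingested.
    private final TreeMap<String, String> staged = new TreeMap<>();

    void put(String key, String value) {
        cache.put(key, value);
    }

    /** Step 2: write the cache contents, plus the offset, into the staged "file". */
    void flushToStagedFile(long changelogOffset) {
        staged.putAll(cache);
        staged.put(OFFSET_KEY, Long.toString(changelogOffset));
        cache.clear();
    }

    /** Steps 3 and 4: on ACK, make data and offset visible in one reference swap. */
    void onChangelogAck() {
        if (staged.isEmpty()) {
            return;
        }
        TreeMap<String, String> next = new TreeMap<>(committed);
        next.putAll(staged);
        committed = Collections.unmodifiableNavigableMap(next);
        staged.clear();
    }

    String getCommitted(String key) {
        return committed.get(key);
    }

    /** Returns the committed changelog offset, or -1 if nothing has been ingested. */
    long committedOffset() {
        String value = committed.get(OFFSET_KEY);
        return value == null ? -1L : Long.parseLong(value);
    }
}
```

Because the offset travels in the same staged "file" as the records, a reader can never observe the data without the matching offset or vice versa. That is the atomicity property step 4 is after.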

What do you think?

Thanks,
-John

On 6/20/23 10:51, Nick Telford wrote:

Hi Bruno,

Thanks for reviewing the KIP. It's been a long road, I started working on
this more than a year ago, and most of the time in the last 6 months has
been spent on the

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Potentially we could just go the memtable with Rocks WriteBatches route if
the cache is disabled?

On Tue, 20 Jun 2023, 22:00 John Roesler,  wrote:

> Touché!
>
> Ok, I agree that figuring out the case of a disabled cache would be
> non-trivial. Ingesting single-record SST files will probably not be
> performant, but benchmarking may prove different. Or maybe we can have
> some reserved cache space on top of the user-configured cache, which we
> would have reclaimed from the memtable space. Or some other, more
> creative solution.
>
> Thanks,
> -John
>
> On 6/20/23 15:30, Nick Telford wrote:
> >> Note that users can disable the cache, which would still be
> > ok, I think. We wouldn't ingest the SST files on every record, but just
> > append to them and only ingest them on commit, when we're already
> > waiting for acks and a RocksDB commit.
> >
> > In this case, how would uncommitted records be read by joins?
> >
> > On Tue, 20 Jun 2023, 20:51 John Roesler,  wrote:
> >
> >> Ah, sorry Nick,
> >>
> >> I just meant the regular heap based cache that we maintain in Streams. I
> >> see that it's not called "RecordCache" (my mistake).
> >>
> >> The actual cache is ThreadCache:
> >>
> >>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java
> >>
> >> Here's the example of how we use the cache in KeyValueStore:
> >>
> >>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
> >>
> >> It's basically just an on-heap Map of records that have not yet been
> >> written to the changelog or flushed into the underlying store. It gets
> >> flushed when the total cache size exceeds `cache.max.bytes.buffering` or
> >> the `commit.interval.ms` elapses.
> >>
> >> Speaking of those configs, another benefit to this idea is that we would
> >> no longer need to trigger extra commits based on the size of the ongoing
> >> transaction. Instead, we'd just preserve the existing cache-flush
> >> behavior. Note that users can disable the cache, which would still be
> >> ok, I think. We wouldn't ingest the SST files on every record, but just
> >> append to them and only ingest them on commit, when we're already
> >> waiting for acks and a RocksDB commit.
> >>
> >> Thanks,
> >> -John
> >>
> >> On 6/20/23 14:09, Nick Telford wrote:
> >>> Hi John,
> >>>
> >>> By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find
> any
> >>> class called "RecordCache"...
> >>>
> >>> Cheers,
> >>>
> >>> Nick
> >>>
> >>> On Tue, 20 Jun 2023 at 19:42, John Roesler 
> wrote:
> >>>
>  Hi Nick,
> 
>  Thanks for picking this up again!
> 
>  I did have one new thought over the intervening months, which I'd like
>  your take on.
> 
>  What if, instead of using the RocksDB atomic write primitive at all,
> we
>  instead just:
>  1. disable memtables entirely
>  2. directly write the RecordCache into SST files when we flush
>  3. atomically ingest the SST file(s) into RocksDB when we get the ACK
>  from the changelog (see
> 
> 
> >>
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
>  and
> 
> 
> >>
> https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
>  and
> 
> 
> >>
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
>  )
>  4. track the changelog offsets either in another CF or the same CF
> with
>  a reserved key, either of which will make the changelog offset update
>  atomic with the file ingestions
> 
>  I suspect this'll have a number of benefits:
>  * writes to RocksDB will always be atomic
>  * we don't fragment memory between the RecordCache and the memtables
>  * RecordCache gives far higher performance than memtable for reads and
>  writes
>  * we don't need any new "transaction" concepts or memory bound configs
> 
>  What do you think?
> 
>  Thanks,
>  -John
> 
>  On 6/20/23 10:51, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for reviewing the KIP. It's been a long road, I started
> working
> >> on
> > this more than a year ago, and most of the time in the last 6 months
> >> has
> > been spent on the "Atomic Checkpointing" stuff that's been benched,
> so
>  some
> > of the reasoning behind some of my decisions have been lost, but I'll
> >> do
>  my
> > best to reconstruct them.
> >
> > 1.
> > IIRC, this was the initial approach I tried. I don't remember the
> exact
> > reasons I changed it to use a separate "view" of the StateStore that
> > encapsulates the transaction, but I believe it had something to do
> with
> > concurrent access to the StateStore from Interactive Query threads.
> >> Reads
> > from interactive queries need to be isolated

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Touché!

Thanks,
-John

On 6/20/23 15:30, Nick Telford wrote:

Note that users can disable the cache, which would still be
ok, I think. We wouldn't ingest the SST files on every record, but just
append to them and only ingest them on commit, when we're already
waiting for acks and a RocksDB commit.

In this case, how would uncommitted records be read by joins?

On Tue, 20 Jun 2023, 20:51 John Roesler, wrote:

Ah, sorry Nick,

I just meant the regular heap based cache that we maintain in Streams. I
see that it's not called "RecordCache" (my mistake).

The actual cache is ThreadCache:

https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java

Here's the example of how we use the cache in KeyValueStore:

https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java

Thanks,
-John

On 6/20/23 14:09, Nick Telford wrote:

Hi John,

By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
class called "RecordCache"...

Cheers,

Nick

On Tue, 20 Jun 2023 at 19:42, John Roesler wrote:

Hi Nick,

Thanks for picking this up again!

I did have one new thought over the intervening months, which I'd like
your take on.

What if, instead of using the RocksDB atomic write primitive at all, we
instead just:
1. disable memtables entirely
2. directly write the RecordCache into SST files when we flush
3. atomically ingest the SST file(s) into RocksDB when we get the ACK
from the changelog (see

https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md

and

https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java

and

https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429

)
4. track the changelog offsets either in another CF or the same CF with
a reserved key, either of which will make the changelog offset update
atomic with the file ingestions

What do you think?

Thanks,
-John

On 6/20/23 10:51, Nick Telford wrote:

Hi Bruno,

Thanks for reviewing the KIP. It's been a long road, I started working on
this more than a year ago, and most of the time in the last 6 months has
been spent on the "Atomic Checkpointing" stuff that's been benched, so some
of the reasoning behind some of my decisions has been lost, but I'll do my
best to reconstruct them.

1.
IIRC, this was the initial approach I tried. I don't remember the exact
reasons I changed it to use a separate "view" of the StateStore that
encapsulates the transaction, but I believe it had something to do with
concurrent access to the StateStore from Interactive Query threads. Reads
from interactive queries need to be isolated from the currently ongoing
transaction, both for consistency (so interactive queries don't observe
changes that are subsequently rolled-back), but also to prevent Iterators
opened by an interactive query from being closed and invalidated by the
StreamThread when it commits the transaction, which causes your interactive
queries to crash.
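
The isolation property described here can be sketched as below. This is a simplified model and not the KIP's design: query threads iterate over an immutable snapshot of the committed data, so a commit or rollback by the writer cannot invalidate an iterator that is already open. The class name SnapshotIsolatedStore and its methods are hypothetical, and copying the whole map on every commit is only acceptable in a toy example.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model: readers see committed snapshots, the writer owns the transaction.
final class SnapshotIsolatedStore {
    // Immutable committed view; replaced wholesale on commit.
    private volatile SortedMap<String, String> committed =
        Collections.unmodifiableSortedMap(new TreeMap<String, String>());
    // Uncommitted writes; touched only by the writer (the StreamThread in this model).
    private final TreeMap<String, String> txn = new TreeMap<>();

    void put(String key, String value) {
        txn.put(key, value);
    }

    /** Publishes a new snapshot; iterators over older snapshots stay valid. */
    void commit() {
        TreeMap<String, String> next = new TreeMap<>(committed);
        next.putAll(txn);
        committed = Collections.unmodifiableSortedMap(next);
        txn.clear();
    }

    /** Discards uncommitted writes; readers never saw them. */
    void rollback() {
        txn.clear();
    }

    /** Iterator for interactive queries, bound to the snapshot current at call time. */
    Iterator<Map.Entry<String, String>> committedIterator() {
        return committed.entrySet().iterator();
    }

    int committedSize() {
        return committed.size();
    }
}
```

An iterator opened before a commit keeps returning the data from its own snapshot, and rolled-back writes are never observable to readers.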

Another reason I believe I implemented it this way was a separation of
concerns. Recall that newTransaction() originally created an object of type
Transaction, not StateStore. My intent was to improve the type-safety of
the API, in an effort to ensure Transactions weren't used incorrectly.
Unfortunately, this didn't pan out, but newTransaction() remained.

Finally, this had the added benefit that implementations could easily add
support for transactions *without*

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

> Note that users can disable the cache, which would still be
ok, I think. We wouldn't ingest the SST files on every record, but just
append to them and only ingest them on commit, when we're already
waiting for acks and a RocksDB commit.

In this case, how would uncommitted records be read by joins?

On Tue, 20 Jun 2023, 20:51 John Roesler,  wrote:

> Ah, sorry Nick,
>
> I just meant the regular heap based cache that we maintain in Streams. I
> see that it's not called "RecordCache" (my mistake).
>
> The actual cache is ThreadCache:
>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java
>
> Here's the example of how we use the cache in KeyValueStore:
>
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
>
> It's basically just an on-heap Map of records that have not yet been
> written to the changelog or flushed into the underlying store. It gets
> flushed when the total cache size exceeds `cache.max.bytes.buffering` or
> the `commit.interval.ms` elapses.
>
> Speaking of those configs, another benefit to this idea is that we would
> no longer need to trigger extra commits based on the size of the ongoing
> transaction. Instead, we'd just preserve the existing cache-flush
> behavior. Note that users can disable the cache, which would still be
> ok, I think. We wouldn't ingest the SST files on every record, but just
> append to them and only ingest them on commit, when we're already
> waiting for acks and a RocksDB commit.
>
> Thanks,
> -John
>
> On 6/20/23 14:09, Nick Telford wrote:
> > Hi John,
> >
> > By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
> > class called "RecordCache"...
> >
> > Cheers,
> >
> > Nick
> >
> > On Tue, 20 Jun 2023 at 19:42, John Roesler  wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for picking this up again!
> >>
> >> I did have one new thought over the intervening months, which I'd like
> >> your take on.
> >>
> >> What if, instead of using the RocksDB atomic write primitive at all, we
> >> instead just:
> >> 1. disable memtables entirely
> >> 2. directly write the RecordCache into SST files when we flush
> >> 3. atomically ingest the SST file(s) into RocksDB when we get the ACK
> >> from the changelog (see
> >>
> >>
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
> >> and
> >>
> >>
> https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
> >> and
> >>
> >>
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
> >> )
> >> 4. track the changelog offsets either in another CF or the same CF with
> >> a reserved key, either of which will make the changelog offset update
> >> atomic with the file ingestions
> >>
> >> I suspect this'll have a number of benefits:
> >> * writes to RocksDB will always be atomic
> >> * we don't fragment memory between the RecordCache and the memtables
> >> * RecordCache gives far higher performance than memtable for reads and
> >> writes
> >> * we don't need any new "transaction" concepts or memory bound configs
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> -John
> >>
> >> On 6/20/23 10:51, Nick Telford wrote:
> >>> Hi Bruno,
> >>>
> >>> Thanks for reviewing the KIP. It's been a long road, I started working
> on
> >>> this more than a year ago, and most of the time in the last 6 months
> has
> >>> been spent on the "Atomic Checkpointing" stuff that's been benched, so
> >> some
> >>> of the reasoning behind some of my decisions have been lost, but I'll
> do
> >> my
> >>> best to reconstruct them.
> >>>
> >>> 1.
> >>> IIRC, this was the initial approach I tried. I don't remember the exact
> >>> reasons I changed it to use a separate "view" of the StateStore that
> >>> encapsulates the transaction, but I believe it had something to do with
> >>> concurrent access to the StateStore from Interactive Query threads.
> Reads
> >>> from interactive queries need to be isolated from the currently ongoing
> >>> transaction, both for consistency (so interactive queries don't observe
> >>> changes that are subsequently rolled-back), but also to prevent
> Iterators
> >>> opened by an interactive query from being closed and invalidated by the
> >>> StreamThread when it commits the transaction, which causes your
> >> interactive
> >>> queries to crash.
> >>>
> >>> Another reason I believe I implemented it this way was a separation of
> >>> concerns. Recall that newTransaction() originally created an object of
> >> type
> >>> Transaction, not StateStore. My intent was to improve the type-safety
> of
> >>> the API, in an effort to ensure Transactions weren't used incorrectly.
> >>> Unfortunately, this didn't pan out, but newTransaction() remained.
> >>>
> >>> Finally, this had the added benefit that implementations could easily
> add
> >>> support for transactions

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Ah, sorry Nick,

I just meant the regular heap based cache that we maintain in Streams. I
see that it's not called "RecordCache" (my mistake).

The actual cache is ThreadCache:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/ThreadCache.java

Here's the example of how we use the cache in KeyValueStore:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java

Thanks,
-John

On 6/20/23 14:09, Nick Telford wrote:

Hi John,

By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
class called "RecordCache"...

Cheers,

Nick

On Tue, 20 Jun 2023 at 19:42, John Roesler wrote:

Hi Nick,

Thanks for picking this up again!

I did have one new thought over the intervening months, which I'd like
your take on.

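What if, instead of using the RocksDB atomic write primitive at all, we
instead just:
1. disable memtables entirely
2. directly write the RecordCache into SST files when we flush
3. atomically ingest the SST file(s) into RocksDB when we get the ACK
from the changelog (see
https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
and
https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
and
https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429)
4. track the changelog offsets either in another CF or the same CF with
a reserved key, either of which will make the changelog offset update
atomic with the file ingestions

I suspect this'll have a number of benefits:
* writes to RocksDB will always be atomic
* we don't fragment memory between the RecordCache and the memtables
* RecordCache gives far higher performance than memtable for reads and
writes
* we don't need any new "transaction" concepts or memory bound configs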

What do you think?

Thanks,
-John

On 6/20/23 10:51, Nick Telford wrote:

Hi Bruno,

Thanks for reviewing the KIP. It's been a long road, I started working on
this more than a year ago, and most of the time in the last 6 months has
been spent on the "Atomic Checkpointing" stuff that's been benched, so some
of the reasoning behind some of my decisions have been lost, but I'll do my
best to reconstruct them.

1.
IIRC, this was the initial approach I tried. I don't remember the exact
reasons I changed it to use a separate "view" of the StateStore that
encapsulates the transaction, but I believe it had something to do with
concurrent access to the StateStore from Interactive Query threads. Reads
from interactive queries need to be isolated from the currently ongoing
transaction, both for consistency (so interactive queries don't observe
changes that are subsequently rolled-back), but also to prevent Iterators
opened by an interactive query from being closed and invalidated by the
StreamThread when it commits the transaction, which causes your interactive
queries to crash.

Another reason I believe I implemented it this way was a separation of
concerns. Recall that newTransaction() originally created an object of type
Transaction, not StateStore. My intent was to improve the type-safety of
the API, in an effort to ensure Transactions weren't used incorrectly.
Unfortunately, this didn't pan out, but newTransaction() remained.

Finally, this had the added benefit that implementations could easily add
support for transactions *without* re-writing their existing,
non-transactional implementation. I think this can be a benefit both for
implementers of custom StateStores, but also for anyone extending
RocksDbStore, as they can rely on the existing access methods working how
they expect them to.

I'm not too happy with the way the current design has panned out, so I'm
open to ideas on how to improve it. Key to this is finding some way to
ensure that reads from Interactive Query threads are properly isolated from
the transaction, *without* the performance overhead of checking which
thread the method is being called from on every access.

As for replacing flush() with commit() - I saw no reason to add this
complexity to the KIP, unless there was a need to add arguments to the
flush/commit method. This need arises with Atomic

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Hi John,

I think you're referring to the "record cache" that's provided by the
ThreadCache class?

1-3.
I was hoping to (eventually) remove the "flush-on-commit" behaviour from
RocksDbStore, so that RocksDB can choose when to flush memtables, enabling
users to tailor RocksDB performance to their workload. Explicitly flushing
the Record Cache to files instead would entail either flushing on every
commit, or the current behaviour, of flushing on every commit provided at
least 10K records have been processed. Compared with RocksDB-managed
memtable flushing, this is very inflexible. If we pursue this design, I
highly recommend replacing the hard-coded 10K limit with something
configurable so that users can tune flush behaviour for their workloads.
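
To make the suggestion above concrete, here is a rough sketch of what a
configurable flush threshold could look like. The class and method names
(FlushPolicy, recordProcessed, shouldFlushOnCommit) are hypothetical and do
not exist in Kafka Streams; the sketch only models the "flush on commit once
at least N records have been processed" rule, with N supplied by the user
instead of being hard-coded to 10K.

```java
/**
 * Hypothetical sketch, not Kafka Streams code: decides on each commit
 * whether the store should also be flushed, based on a configurable
 * minimum number of records processed since the last flush.
 */
final class FlushPolicy {
    private final long minRecordsBetweenFlushes;
    private long recordsSinceLastFlush = 0;

    FlushPolicy(long minRecordsBetweenFlushes) {
        if (minRecordsBetweenFlushes < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.minRecordsBetweenFlushes = minRecordsBetweenFlushes;
    }

    /** Called once for every record written to the store. */
    void recordProcessed() {
        recordsSinceLastFlush++;
    }

    /**
     * Called on every commit. Returns true (and resets the counter) when
     * enough records have accumulated; a threshold of 0 flushes on every commit.
     */
    boolean shouldFlushOnCommit() {
        if (recordsSinceLastFlush >= minRecordsBetweenFlushes) {
            recordsSinceLastFlush = 0;
            return true;
        }
        return false;
    }
}
```

A threshold of 10000 would reproduce the behaviour described above, while 0
would give "flush on every commit".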

4.
Tracking the changelog offsets in another CF and atomically updating it
with the main CFs is orthogonal, I think, as it can be done when using
memtables provided the "Atomic Flush" feature of RocksDB is enabled. This
is something I'd originally planned for this KIP, but we're trying to pull
out into a later KIP to make things more manageable.

> * we don't fragment memory between the RecordCache and the memtables
I think by memory fragmentation, you mean duplication, because we're
caching the records both in the (on-heap) Record Cache and the RocksDB
memtables? This is a good point that I hadn't considered before. Wouldn't a
simpler solution be to just disable the record cache for RocksDB stores (by
default), and let the memtables do the caching? Although I guess that would
reduce read performance, which could be especially important for joins.

> * RecordCache gives far higher performance than memtable for reads and
writes
I'll concede this point. The JNI boundary plus RocksDB record encoding will
likely make it impossible to ever match the Record Cache on throughput.

> * we don't need any new "transaction" concepts or memory bound configs
Maybe. Unless I'm mistaken, the Record Cache only retains the most recently
written value for a key, which would mean that Interactive Queries would
always observe new record values *before* they're committed to the
changelog. While this is the current behaviour, it's also a violation of
consistency, because successive IQ could observe a regression of a value,
due to an error writing to the changelog (e.g. a changelog transaction
rollback or a timeout). This is something that KIP-892 aims to improve on,
as the current design would ensure that records are only observed by IQ
*after* they have been committed to the Kafka changelog.
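
As an illustration of the consistency point above, here is a minimal,
single-threaded sketch of a store that buffers writes until commit. It is
not the KIP-892 implementation and ignores thread-safety entirely; the names
(BufferedStore, readCommitted, readOwnWrites, rollback) are made up for the
example. It only shows why a reader that sees just the committed map cannot
observe a value that is later rolled back.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: writes are staged in a buffer and only become visible to queries on commit. */
final class BufferedStore {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> uncommitted = new HashMap<>();

    /** Writes from the processing thread go to the transaction buffer only. */
    void put(String key, String value) {
        uncommitted.put(key, value);
    }

    /** What the processing thread sees: its own pending writes layered over committed data. */
    String readOwnWrites(String key) {
        return uncommitted.containsKey(key) ? uncommitted.get(key) : committed.get(key);
    }

    /** What an interactive query would see: committed data only. */
    String readCommitted(String key) {
        return committed.get(key);
    }

    /** Called once the changelog write has been acknowledged. */
    void commit() {
        committed.putAll(uncommitted);
        uncommitted.clear();
    }

    /** Called when the changelog write fails (e.g. transaction abort or timeout). */
    void rollback() {
        uncommitted.clear();
    }
}
```

With a last-value-only cache that serves queries directly, the equivalent of
readCommitted() would return the pending value, and a later rollback would
appear to the caller as the value regressing.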

That said, it definitely sounds *feasible*.

Regards,

Nick

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Hi John,

By "RecordCache", do you mean the RocksDB "WriteBatch"? I can't find any
class called "RecordCache"...

Cheers,

Nick

On Tue, 20 Jun 2023 at 19:42, John Roesler wrote:

> Hi Nick,
>
> Thanks for picking this up again!
>
> I did have one new thought over the intervening months, which I'd like
> your take on.
>
> What if, instead of using the RocksDB atomic write primitive at all, we
> instead just:
> 1. disable memtables entirely
> 2. directly write the RecordCache into SST files when we flush
> 3. atomically ingest the SST file(s) into RocksDB when we get the ACK
> from the changelog (see
>
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
> and
>
> https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
> and
>
> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429
> )
> 4. track the changelog offsets either in another CF or the same CF with
> a reserved key, either of which will make the changelog offset update
> atomic with the file ingestions
>
> I suspect this'll have a number of benefits:
> * writes to RocksDB will always be atomic
> * we don't fragment memory between the RecordCache and the memtables
> * RecordCache gives far higher performance than memtable for reads and
> writes
> * we don't need any new "transaction" concepts or memory bound configs
>
> What do you think?
>
> Thanks,
> -John
>
> On 6/20/23 10:51, Nick Telford wrote:
> > Hi Bruno,
> >
> > Thanks for reviewing the KIP. It's been a long road, I started working on
> > this more than a year ago, and most of the time in the last 6 months has
> > been spent on the "Atomic Checkpointing" stuff that's been benched, so some
> > of the reasoning behind some of my decisions have been lost, but I'll do my
> > best to reconstruct them.
> >
> > 1.
> > IIRC, this was the initial approach I tried. I don't remember the exact
> > reasons I changed it to use a separate "view" of the StateStore that
> > encapsulates the transaction, but I believe it had something to do with
> > concurrent access to the StateStore from Interactive Query threads. Reads
> > from interactive queries need to be isolated from the currently ongoing
> > transaction, both for consistency (so interactive queries don't observe
> > changes that are subsequently rolled-back), but also to prevent Iterators
> > opened by an interactive query from being closed and invalidated by the
> > StreamThread when it commits the transaction, which causes your interactive
> > queries to crash.
> >
> > Another reason I believe I implemented it this way was a separation of
> > concerns. Recall that newTransaction() originally created an object of type
> > Transaction, not StateStore. My intent was to improve the type-safety of
> > the API, in an effort to ensure Transactions weren't used incorrectly.
> > Unfortunately, this didn't pan out, but newTransaction() remained.
> >
> > Finally, this had the added benefit that implementations could easily add
> > support for transactions *without* re-writing their existing,
> > non-transactional implementation. I think this can be a benefit both for
> > implementers of custom StateStores, but also for anyone extending
> > RocksDbStore, as they can rely on the existing access methods working how
> > they expect them to.
> >
> > I'm not too happy with the way the current design has panned out, so I'm
> > open to ideas on how to improve it. Key to this is finding some way to
> > ensure that reads from Interactive Query threads are properly isolated from
> > the transaction, *without* the performance overhead of checking which
> > thread the method is being called from on every access.
> >
> > As for replacing flush() with commit() - I saw no reason to add this
> > complexity to the KIP, unless there was a need to add arguments to the
> > flush/commit method. This need arises with Atomic Checkpointing, but that
> > will be implemented separately, in a future KIP. Do you see a need for some
> > arguments to the flush/commit method that I've missed? Or were you simply
> > suggesting a rename?
> >
> > 2.
> > This is simply due to the practical reason that isolationLevel() is really
> > a proxy for checking if the app is under EOS. The application configuration
> > is not provided to the constructor of StateStores, but it *is* provided to
> > init(), via StateStoreContext. For this reason, it seemed somewhat natural
> > to add it to StateStoreContext. I think this makes sense, since the
> > IsolationLevel of all StateStores in an application *must* be the same, and
> > since those stores are all initialized with the same StateStoreContext, it
> > seems natural for that context to carry the desired IsolationLevel to use.
> >
> > 3.
> > Using IsolationLevel instead of just passing `boolean eosEnabled`, like
> > much of the internals was an attempt to logically de-couple the StateStore
> > API from the

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Hi Nick,

Thanks for picking this up again!

I did have one new thought over the intervening months, which I'd like
your take on.

What if, instead of using the RocksDB atomic write primitive at all, we
instead just:

1. disable memtables entirely
2. directly write the RecordCache into SST files when we flush
3. atomically ingest the SST file(s) into RocksDB when we get the ACK
from the changelog (see
https://github.com/EighteenZi/rocksdb_wiki/blob/master/Creating-and-Ingesting-SST-files.md
and
https://github.com/facebook/rocksdb/blob/master/java/src/main/java/org/rocksdb/IngestExternalFileOptions.java
and
https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L1413-L1429)
4. track the changelog offsets either in another CF or the same CF with
a reserved key, either of which will make the changelog offset update
atomic with the file ingestions

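I suspect this'll have a number of benefits:
* writes to RocksDB will always be atomic
* we don't fragment memory between the RecordCache and the memtables
* RecordCache gives far higher performance than memtable for reads and
writes
* we don't need any new "transaction" concepts or memory bound configs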

What do you think?

Thanks,
-John

On 6/20/23 10:51, Nick Telford wrote:

Hi Bruno,

Thanks for reviewing the KIP. It's been a long road, I started working on
this more than a year ago, and most of the time in the last 6 months has
been spent on the "Atomic Checkpointing" stuff that's been benched, so some
of the reasoning behind some of my decisions have been lost, but I'll do my
best to reconstruct them.

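1.
IIRC, this was the initial approach I tried. I don't remember the exact
reasons I changed it to use a separate "view" of the StateStore that
encapsulates the transaction, but I believe it had something to do with
concurrent access to the StateStore from Interactive Query threads. Reads
from interactive queries need to be isolated from the currently ongoing
transaction, both for consistency (so interactive queries don't observe
changes that are subsequently rolled-back), but also to prevent Iterators
opened by an interactive query from being closed and invalidated by the
StreamThread when it commits the transaction, which causes your interactive
queries to crash.

Another reason I believe I implemented it this way was a separation of
concerns. Recall that newTransaction() originally created an object of type
Transaction, not StateStore. My intent was to improve the type-safety of
the API, in an effort to ensure Transactions weren't used incorrectly.
Unfortunately, this didn't pan out, but newTransaction() remained.

Finally, this had the added benefit that implementations could easily add
support for transactions *without* re-writing their existing,
non-transactional implementation. I think this can be a benefit both for
implementers of custom StateStores, but also for anyone extending
RocksDbStore, as they can rely on the existing access methods working how
they expect them to.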

I'm not too happy with the way the current design has panned out, so I'm
open to ideas on how to improve it. Key to this is finding some way to
ensure that reads from Interactive Query threads are properly isolated from
the transaction, *without* the performance overhead of checking which
thread the method is being called from on every access.
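
For anyone unfamiliar with the iterator problem behind this isolation
requirement, the following is a loose analogy in plain Java, not RocksDB
code. It uses a java.util.TreeMap as a stand-in for the transaction buffer:
an iterator opened by a "query" fails once the "commit" clears the buffer
underneath it. The class and method names are invented for the example, and
the failure mode of a real WriteBatch iterator may well differ.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/** Analogy only: a TreeMap plays the role of the uncommitted transaction buffer. */
final class IteratorInvalidationDemo {

    /** Returns true if an iterator opened before the commit fails after the buffer is cleared. */
    static boolean iteratorBrokenByCommit() {
        TreeMap<String, String> buffer = new TreeMap<>();
        buffer.put("a", "1");
        buffer.put("b", "2");

        // An "interactive query" opens an iterator and reads the first entry.
        Iterator<Map.Entry<String, String>> queryIterator = buffer.entrySet().iterator();
        queryIterator.next();

        // The "StreamThread" commits: the buffer is written out and then cleared.
        buffer.clear();

        try {
            // The query tries to continue iterating over the cleared buffer.
            queryIterator.next();
            return false;
        } catch (ConcurrentModificationException e) {
            // TreeMap iterators are fail-fast, so the structural change is detected here.
            return true;
        }
    }
}
```

Giving queries their own view of the store, separate from the buffer the
StreamThread mutates, avoids this class of failure without a per-access
thread check.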

As for replacing flush() with commit() - I saw no reason to add this
complexity to the KIP, unless there was a need to add arguments to the
flush/commit method. This need arises with Atomic Checkpointing, but that
will be implemented separately, in a future KIP. Do you see a need for some
arguments to the flush/commit method that I've missed? Or were you simply
suggesting a rename?

2.
This is simply due to the practical reason that isolationLevel() is really
a proxy for checking if the app is under EOS. The application configuration
is not provided to the constructor of StateStores, but it *is* provided to
init(), via StateStoreContext. For this reason, it seemed somewhat natural
to add it to StateStoreContext. I think this makes sense, since the
IsolationLevel of all StateStores in an application *must* be the same, and
since those stores are all initialized with the same StateStoreContext, it
seems natural for that context to carry the desired IsolationLevel to use.

3.
Using IsolationLevel instead of just passing `boolean eosEnabled`, like
much of the internals, was an attempt to logically de-couple the StateStore
API from the internals of Kafka Streams. Technically, StateStores don't
need to know/care what processing mode the KS app is using, all they need
to know is the isolation level expected of them.
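
A small sketch of the decoupling described in points 2 and 3, under the
assumption that EOS maps to READ_COMMITTED and everything else to
READ_UNCOMMITTED. The enum below is a local stand-in for illustration, and
IsolationLevels.fromProcessingGuarantee is a hypothetical helper, not part of
the Kafka Streams API. The string values are those accepted by the
processing.guarantee config.

```java
/** Local stand-in for illustration; not the Kafka class of the same name. */
enum IsolationLevel {
    READ_UNCOMMITTED,
    READ_COMMITTED
}

/** Hypothetical helper: translates the processing mode into what a store needs to know. */
final class IsolationLevels {
    private IsolationLevels() {}

    static IsolationLevel fromProcessingGuarantee(String processingGuarantee) {
        switch (processingGuarantee) {
            case "exactly_once":
            case "exactly_once_beta":
            case "exactly_once_v2":
                // Assumption: under EOS, stores must not expose uncommitted writes.
                return IsolationLevel.READ_COMMITTED;
            case "at_least_once":
                // Assumption: under ALOS, stores may expose writes immediately.
                return IsolationLevel.READ_UNCOMMITTED;
            default:
                throw new IllegalArgumentException(
                    "Unknown processing.guarantee: " + processingGuarantee);
        }
    }
}
```

The store implementation then only branches on the IsolationLevel it is
handed and never needs to inspect the application's processing mode itself.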

Having formal definitions for the expectations of the two required
IsolationLevels allows implementers to implement transactional stores
without having to dig through the internals of Kafka Streams and understand
exactly how they are used. The tight coupling between state stores and
internal behaviour has

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores