[
https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-17571:
-----------------------------------
Description:
h3. Basics
The current implementation can only work with infinite storage space, because the storage works in append-only mode (except for transaction rollbacks).
There's a value in the system called the "low watermark" (LW). It is guaranteed that no new RO transaction will be started at a timestamp earlier than the LW. The existence of such a value allows us to safely delete all versions before it. We will not discuss the mechanics of acquiring this value, assuming that it is simply given.
Deletion is expected to be performed by background workers, without holding any transaction locks. The process is independent on each RAFT group member. If possible, GC shouldn't affect transactions in any way.
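As a minimal illustration of that rule (the method name and the exact formulation are my assumptions, assuming HybridTimestamp is comparable): a version presumably becomes garbage only once a newer committed version already exists at or below the low watermark, since the newest committed version at or below the LW still has to serve RO reads started at later timestamps.
{code:java}
// Hedged sketch only: the collectability rule a low watermark (LW) is assumed to imply.
// Names are hypothetical.
static boolean isGarbageVersion(@Nullable HybridTimestamp newerVersionCommitTs,
        HybridTimestamp lowWatermark) {
    // A version is garbage only if it was superseded (newerVersionCommitTs != null)
    // and the superseding version was committed at or before the LW, i.e. it is
    // already visible to every RO transaction that can still be started.
    return newerVersionCommitTs != null
            && newerVersionCommitTs.compareTo(lowWatermark) <= 0;
}
{code}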
h3. API
The originally intended design looks like the following:
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive,
@Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. The main problem with it is the existence of bounds:
* First of all, why not just have an inclusive lower bound and an exclusive upper bound, like almost every other method with bounds?
* Second, I believe that this API was proposed under the assumption that row ids have a uniform distribution, so that every "range" cleanup results in a roughly equal amount of work. This is simply not true in the current implementation.
RowId is a timestamp-based value that exists in a very narrow time slice, making most ranges empty and meaningless.
Also, the way row ids are stored is up to the storage: there are no restrictions on byte order when physically storing RowId objects.
h3. Problems
There's one thing that worries me: indexes.
Because storages are unaware of table schemas, it's impossible for them to clean up indexes. This gets me thinking: the API should be flexible enough that index cleanup can be performed as an external operation over the storage. This would also reduce the amount of work we need to do for the implementation.
To be specific, it feels like the method would look like this:
{code:java}
RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows,
BiConsumer<RowId, BinaryRow> indexCleaner);{code}
Some explanation is required:
* the threshold timestamp represents the same thing as before - the low watermark.
* startId - the row that the current batch should start iterating from.
* numRows - the number of rows that should be cleaned in the current batch. By this I mean individual versions, not chains.
* the cleaner closure must be used to clean indexes after every individual version removal. Right now this doesn't look optimal to me, but I struggle to find a good solution for efficient index cleanup.
* the next rowId is returned and should be used as startId in the next call. "null" is returned if cleanup is complete; in that case it can be restarted from the beginning or simply postponed until a new low watermark value is available.
How to operate it (a usage sketch follows this list):
* numRows has two strategic purposes:
** to control the size of batches in "runConsistently" closures;
** to be able to run many cleanups in parallel without starving the pools, since every task is split into small chunks.
* cleanup should start from the smallest possible row id.
* the low watermark value can change in-between calls. This is fine and even preferable.
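To make the intended usage concrete, here is a minimal sketch of a driver loop for this API (the GcStorage type, the smallestRowId argument and BATCH_SIZE are illustrative assumptions, not existing API):
{code:java}
// Hypothetical GC pass driving the batched cleanup() proposed above.
void runGcPass(GcStorage storage, RowId smallestRowId,
        Supplier<HybridTimestamp> lowWatermark,
        BiConsumer<RowId, BinaryRow> indexCleaner) {
    final int BATCH_SIZE = 100;

    // Cleanup starts from the smallest possible row id.
    RowId startId = smallestRowId;

    while (startId != null) {
        // The low watermark may change between calls; that's fine and even preferable.
        startId = storage.cleanup(lowWatermark.get(), startId, BATCH_SIZE, indexCleaner);
    }

    // A null return means this pass is complete; it can be restarted later with a newer low watermark.
}
{code}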
BUT, such a design is a "backwards" design: it gives the storage too much responsibility.
h3. Alternative
Some might find the given API too complicated. There's another way that involves fewer actions from the "client":
{code:java}
void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId,
BinaryRow> indexCleaner);{code}
In a way, both of these methods are equivalent; the difference is minuscule, and one can be transformed into the other with relative ease. If anything, this one is even preferable as a first implementation, because it's smaller and there's less room for errors.
Another alternative is to have an internal queue, like in the case of TTL in Ignite 2. This would be the simplest API to implement:
{code:java}
@Nullable Pair<RowId, BinaryRow> enqueue(HybridTimestamp threshold);{code}
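Assuming this method removes the next garbage version below the threshold and returns the removed pair, or null when there is nothing left to collect (this reading, as well as the names used below, is an assumption), the caller's side might look like:
{code:java}
// Hypothetical drain loop over the queue-style API; semantics and names are assumed.
void drainGarbage(GcStorage storage, HybridTimestamp lowWatermark,
        BiConsumer<RowId, BinaryRow> indexCleaner) {
    Pair<RowId, BinaryRow> removed;

    while ((removed = storage.enqueue(lowWatermark)) != null) {
        // Indexes are cleaned only after the version has already been removed from the row store.
        indexCleaner.accept(removed.getFirst(), removed.getSecond());
    }
}
{code}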
h3. Different engines
This is the part where I express my thoughts on what can (and will) go wrong in different storages.
Generally speaking, right now there are three implementations that must be compatible in behavior:
# the test storage, which uses a ConcurrentSkipListMap and actual version chains, represented as singly-linked lists (a sketch of such a chain node follows this list)
# the B+Tree-based storages, two of them. They also use chains represented as singly-linked lists, but off-heap this time
# the RocksDB-based storage. It uses a flat structure, so it's just a big mess of (RowId, Timestamp?) pairs in the key set
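For reference, a minimal sketch of what a version-chain node in the on-heap test storage might look like (class and field names are hypothetical, not the actual implementation):
{code:java}
// Hypothetical on-heap version-chain node; illustrative only.
class VersionChainNode {
    final @Nullable HybridTimestamp commitTimestamp; // null for an uncommitted write intent
    final @Nullable BinaryRow row;                   // null for a tombstone

    // The link to the older version. Per item 1 below, it is currently final and
    // must become volatile so that readers never observe a half-unlinked chain.
    volatile @Nullable VersionChainNode next;

    VersionChainNode(@Nullable HybridTimestamp commitTimestamp, @Nullable BinaryRow row,
            @Nullable VersionChainNode next) {
        this.commitTimestamp = commitTimestamp;
        this.row = row;
        this.next = next;
    }
}
{code}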
Let's discuss them one by one:
# This is the easiest one. Currently, the links to the next chain elements are final; they must become volatile. There's no manual "freeing" of the memory used by chains, so it'll just work.
# This one is hard (in my opinion). Imagine having a "RowVersion" instance and trying to read the "next" row versions, whether looking for a specific version or performing index build/cleanup - it doesn't matter. How do you guarantee that the link is valid and has not been deleted while you were doing your work?
Out of the box, there's no way to tell that an item in a data page has been deleted (to be completely honest, there _might_ be a loophole that exploits the fact that there's only a single _write_ thread that inserts anything and maybe a bunch of GC threads that delete data, but I wouldn't go that route).
This is because _links_ have no rotation id; it's simply not there. And we will not add it for this reason alone - it would complicate the design too much and could even compromise things like data density.
For this reason, some external synchronization is required. Links must never be de-referenced without 100% certainty that they actually hold the data that developers expect them to hold. There are no exceptions to this rule.
The exact way to synchronize things is up to whoever implements GC in the engine.
(There's also a problem with completely deleting rows from the tree: it may clash with iterators that cache data from the pages they have read, which can also result in obsolete links.)
There are two directions in which to search for the answer, and the truth lies somewhere in-between (a sketch of the latch-based option follows below):
- update closures guarantee isolation, but they affect throughput;
- explicit latches for row ids also guarantee isolation, but they also affect throughput.
Ignite 2 goes the second route, but the reasoning there is different: in that particular case we already have latches because of the atomic protocol, and they're simply reused.
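To illustrate the latch-based direction, here is a minimal sketch of a per-row-id latch that both updates and GC could take before dereferencing links of a chain (this class and its use are hypothetical, not an existing mechanism):
{code:java}
// Hypothetical per-row-id latching; illustrative only. A real implementation would
// also need to evict unused latches and watch out for lock ordering.
class RowIdLatches {
    private final ConcurrentMap<RowId, ReentrantLock> latches = new ConcurrentHashMap<>();

    <T> T withLatch(RowId rowId, Supplier<T> action) {
        ReentrantLock latch = latches.computeIfAbsent(rowId, id -> new ReentrantLock());

        latch.lock();
        try {
            // While the latch is held, GC cannot unlink or free versions of this row id,
            // so dereferencing the links of its chain is safe.
            return action.get();
        } finally {
            latch.unlock();
        }
    }
}
{code}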
# RocksDB has an interesting property - it effectively creates a snapshot for each cursor that's created. I don't think we have ever really explored the realm of bugs that can occur because of that. Things that may break include:
- transactional scans
- index cleanup
- GC
I believe that an occasional call to {{RocksIteratorInterface#refresh}} will help here (see the sketch below). But before doing that, we should clearly identify the problematic places and write tests that actually demonstrate a problem. Doing such things blindly is dangerous.
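A minimal sketch of the refresh idea for a long-lived iterator, e.g. a GC or index-cleanup scan (the refresh period and the surrounding loop are assumptions for illustration; only the RocksDB calls themselves are real API):
{code:java}
// Hypothetical long-running scan that periodically re-seats the iterator on a
// fresh snapshot, so that it stops pinning old SST files and already-deleted data.
void scanWithRefresh(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
    final int REFRESH_PERIOD = 10_000; // assumed number of entries between refreshes

    try (RocksIterator it = db.newIterator(cf)) {
        int processed = 0;

        for (it.seekToFirst(); it.isValid(); ) {
            byte[] key = it.key();
            // ... process the entry ...

            if (++processed % REFRESH_PERIOD == 0) {
                // Re-pin the iterator to the latest snapshot. Entries deleted in the
                // meantime disappear from the scan after this point.
                it.refresh();
                it.seek(key);

                // seek() may land on the just-processed key again; skip it in that case.
                if (it.isValid() && Arrays.equals(it.key(), key)) {
                    it.next();
                }
            } else {
                it.next();
            }
        }
    }
}
{code}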
> Implement GC for MV storages
> ----------------------------
>
> Key: IGNITE-17571
> URL: https://issues.apache.org/jira/browse/IGNITE-17571
> Project: Ignite
> Issue Type: Epic
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)