[ https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Bessonov updated IGNITE-17571:
-----------------------------------
    Description: 
h3. Basics

The current implementation can only work with infinite storage space, because 
the storage works in append-only mode (except for transaction rollbacks).

There's a value in the system called the "low watermark" (LW). It's guaranteed 
that no new RO transactions will be started at a time earlier than the LW. The 
existence of such a value allows us to safely delete all versions before it. We 
will not discuss the mechanics of acquiring this value, assuming that it is 
simply given.
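
For illustration, here is a minimal, hypothetical sketch of the eligibility rule, assuming the usual MVCC convention that the newest committed version at or below the low watermark still has to survive (an RO transaction reading exactly at the watermark needs it); the names RowVersion, chainFromNewestToOldest and garbage are illustrative:
{code:java}
// Hypothetical eligibility check over a single version chain, iterated newest to oldest.
boolean keptWatermarkVersion = false;

for (RowVersion version : chainFromNewestToOldest) {
    if (version.commitTimestamp().compareTo(lowWatermark) > 0) {
        continue; // still visible to RO transactions reading above the watermark
    }

    if (!keptWatermarkVersion) {
        keptWatermarkVersion = true; // the version an RO read at exactly the watermark would return
        continue;
    }

    garbage.add(version); // older than anything any future RO transaction can observe
}{code}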

It's expected that deletion will be performed by background workers, without 
holding any transaction locks. The process is independent on each RAFT group 
member. GC shouldn't affect transactions in any way, if possible.
h3. API

The originally intended design looks like the following:
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. The main problem 
with it is the existence of bounds:
 * First of all, why not just have an inclusive lower bound and an exclusive upper bound, like almost all other methods with bounds?
 * Second, I believe this API was proposed under the assumption that row ids have a uniform distribution and that every "range" cleanup results in a roughly equal amount of work. This is simply not true in the current implementation.
RowId is a timestamp-based value that exists in a very narrow time slice, making most ranges empty and meaningless.
Also, the way row ids are stored is up to the storage: there are no restrictions on byte order when physically storing RowId objects.

h3. Problems

There's one thing that worries me: indexes.

Because storages are unaware of table schemas, it's impossible for them to 
clean up indexes. This gets me thinking: the API should be flexible enough that 
index cleanup can be performed as an external operation over the storage. This 
will also reduce the amount of work we need to do for the implementation.

To be specific, it feels like the method would look like this:
{code:java}
RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows, 
BiConsumer<RowId, BinaryRow> indexCleaner);{code}
Some explanation is required.
 * threshold (the timestamp) represents the same thing as before - the low watermark.
 * startId - the row that should be iterated first in the current batch.
 * numRows - the number of rows that should be cleaned in the current batch. By this I mean individual versions, not chains.
 * the cleaner closure must be used to clean indexes after every individual version removal. Right now this doesn't look optimal to me, but I struggle to find a good solution for efficient index cleanup.
 * the next rowId is returned, to be used as startId in the next call. "null" is returned if cleanup is complete; in that case it can be restarted from the beginning or simply postponed until a new low watermark value is available.

How to operate it (a sketch of a driving loop is given after this list):
 * numRows has two strategic purposes:
 ** to control the size of batches in "runConsistently" closures;
 ** to be able to run many cleanups in parallel while avoiding pool starvation - every task is split into small chunks.
 * cleanup should start from the smallest possible row id.
 * the low watermark value can change between calls. This is fine and even preferable.
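
To make the intended usage concrete, here is a minimal sketch of a background worker driving the batched method. It is only an illustration: BATCH_SIZE, lowWatermarkSupplier, the indexes collection and RowId.lowestRowId are assumed names, not part of the proposed API:
{code:java}
// Hypothetical driver loop for the batched cleanup.
RowId next = RowId.lowestRowId(partitionId); // assumption: start from the smallest possible row id

while (next != null) {
    HybridTimestamp lowWatermark = lowWatermarkSupplier.get(); // may advance between batches, which is fine

    RowId startId = next; // effectively final copy for the closure

    next = storage.runConsistently(() ->
            storage.cleanup(lowWatermark, startId, BATCH_SIZE, (rowId, removedRow) ->
                    indexes.forEach(index -> index.remove(removedRow, rowId))));
}
// next == null: everything below the current low watermark is gone; postpone until the LW moves again.{code}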

BUT, such a design is a "backwards" design, and it gives the storage too much 
responsibility.
h3. Alternative

Some might find the given API too complicated. There's another way that 
requires fewer actions from the "client".
{code:java}
void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, 
BinaryRow> indexCleaner);{code}
In a way, both of these methods are equivalent; the difference is minuscule. 
One can be transformed into the other with relative ease (see the sketch 
below). If anything, this one is even preferable as a first implementation, 
because it's smaller and there's less room for error.
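
To illustrate the claimed equivalence, here is a sketch (assuming the batched method from the previous section) of how the non-batched variant could be expressed through it; the internal batch size is an arbitrary, hidden detail:
{code:java}
// Hypothetical adapter: the non-batched variant implemented on top of the batched one.
void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, BinaryRow> indexCleaner) {
    int internalBatchSize = 100; // arbitrary value, invisible to the caller

    for (RowId next = startId; next != null; ) {
        next = cleanup(threshold, next, internalBatchSize, indexCleaner);
    }
}{code}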

Another alternative is to have an internal queue, like in the case of TTL in 
Ignite 2. This would be the simplest API to implement:
{code:java}
@Nullable Pair<RowId, BinaryRow> enqueue(HybridTimestamp threshold);{code}
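A sketch of how an external worker might drive the queue-based variant; the Pair accessors and the index-cleaning call are assumptions about how the returned pair would be used, not a defined contract:
{code:java}
// Hypothetical polling loop: the storage hands out the next (RowId, BinaryRow) whose
// version fell below the threshold, or null once there is nothing left to collect.
Pair<RowId, BinaryRow> garbage;

while ((garbage = storage.enqueue(lowWatermark)) != null) {
    RowId rowId = garbage.getFirst();           // accessor names are assumptions
    BinaryRow removedRow = garbage.getSecond();

    // Clean the indexes for the version that the storage has just removed.
    indexes.forEach(index -> index.remove(removedRow, rowId));
}{code}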
h3. Different engines

This is the part where I express my thoughts on what can (and will) go wrong in 
different storages.

Generally speaking, right now there are three implementations that must be 
compatible in behavior:
 # the test storage, which uses a ConcurrentSkipListMap and actual version chains represented as singly-linked lists
 # the B+Tree-based storages, two of them. They also use chains represented as singly-linked lists, but off-heap this time
 # the RocksDB-based storage. It uses a flat structure, so it's just a big mess of (RowId, Timestamp?) pairs in the key set

Let's discuss them one by one:
 # This is the easiest one. Currently the links to the next chain elements are final; they must become volatile. There's no manual "freeing" of the memory used by the chains, so it'll just work (a sketch is given after this list).
 # This one is hard (in my opinion). Imagine having a "RowVersion" instance and trying to read the "next" row versions, while either looking for a specific version or performing index build/cleanup - it doesn't matter which. How do you guarantee that the link is valid and has not been deleted while you were doing your work?
Out of the box, there's no way to tell that an item in a data page has been deleted (to be completely honest, there _might_ be a loophole that exploits the fact that there's only a single _write_ thread that inserts anything and maybe a bunch of GC threads that delete data, but I wouldn't go that route).
This is because _links_ have no rotation id; it's just not there. And we will not add one for this reason alone - it would complicate the design too much and possibly even compromise things like data density.
For this reason, some external synchronization is required. Links must never be dereferenced without 100% certainty that they actually hold the data that developers expect them to have. There are no exceptions to this rule.
The exact way to synchronize things is up to whoever implements GC in the engine.
(There's also a problem with completely deleting rows from the tree: it may clash with iterators that cache data from the pages they have already read. This may also result in obsolete links.)
There are two directions in which to look for the answer; the truth lies somewhere in-between:
 - update closures guarantee isolation, but they affect throughput
 - explicit latches for row ids also guarantee isolation, but they also affect throughput
Ignite 2 goes the second route, but for a different reason: in that particular case latches already exist because of the atomic protocol, and they're simply reused.
 # RocksDB has an interesting property - it effectively creates a snapshot for each cursor that's created. I don't think we've ever really explored the realm of bugs that can occur because of that. Things that may break include:
 - transactional scans
 - index cleanup
 - GC
I believe that an occasional call to {{RocksIteratorInterface#refresh}} will help here. But before doing that, we should clearly identify the problematic places and write tests that actually demonstrate a problem. Doing such things blindly is dangerous (a sketch of the refresh idea is also given after this list).
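
For the test storage (item 1), a minimal sketch of what the final-to-volatile change amounts to; the class and field names are illustrative, not the actual implementation:
{code:java}
// Hypothetical in-memory version chain node. GC will unlink older versions concurrently
// with readers, so the "next" reference can no longer be final.
class VersionChainNode {
    final @Nullable HybridTimestamp commitTimestamp; // null for an uncommitted write intent
    final @Nullable BinaryRow row;                   // null for a tombstone

    volatile @Nullable VersionChainNode next;        // was final; volatile makes GC unlinking visible to readers

    VersionChainNode(@Nullable HybridTimestamp commitTimestamp, @Nullable BinaryRow row,
            @Nullable VersionChainNode next) {
        this.commitTimestamp = commitTimestamp;
        this.row = row;
        this.next = next;
    }
}{code}
For the RocksDB storage (item 3), a sketch of the {{RocksIteratorInterface#refresh}} idea, purely as an illustration of the suggestion rather than a verified fix; the refresh interval, column family handle and surrounding error handling are assumptions:
{code:java}
// Hypothetical long-running GC scan: periodically re-attach the iterator to the current
// state of the column family so it doesn't keep reading from the snapshot taken at creation.
// The enclosing method is assumed to declare "throws RocksDBException" (refresh() may throw).
try (RocksIterator it = db.newIterator(partitionColumnFamily)) {
    it.seekToFirst();
    long processed = 0;

    while (it.isValid()) {
        // ... GC decision for the current (RowId, Timestamp) key goes here ...

        it.next();

        if (++processed % 1_000 == 0 && it.isValid()) {
            byte[] resumeKey = it.key(); // next key to process after re-attaching
            it.refresh();                // drop the implicit snapshot, read the current data
            it.seek(resumeKey);          // continue from the same position
        }
    }
}{code}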


> Implement GC for MV storages
> ----------------------------
>
>                 Key: IGNITE-17571
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17571
>             Project: Ignite
>          Issue Type: Epic
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
