[ 
https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17571:
-----------------------------------
    Issue Type: Epic  (was: Improvement)

> Implement GC for MV storages
> ----------------------------
>
>                 Key: IGNITE-17571
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17571
>             Project: Ignite
>          Issue Type: Epic
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Basics
> The current implementation can only work with infinite storage space. This is 
> because the storage works in append-only mode (except for transaction 
> rollbacks).
> There's a value in the system called the "low watermark" (LW). It's guaranteed that 
> no new RO transaction will be started at a timestamp earlier than the LW. The existence 
> of such a value allows us to safely delete all versions before it. We will not 
> discuss the mechanics of acquiring this value, assuming that it is simply 
> given.
> It's expected that deletion will be performed by background workers, without 
> holding any transaction locks. The process is independent on each RAFT group 
> member. If possible, GC shouldn't affect transactions in any way.
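> As a toy illustration (all names here are hypothetical, not actual Ignite API; timestamps are plain longs), the eligibility check against the low watermark boils down to the following: everything at or above the LW survives, plus the newest version below it (a reader starting exactly at the LW still needs that one), and the rest is garbage:
> {code:java}
> import java.util.List;
> 
> class LowWatermarkSketch {
>     /** Versions of a single row, newest first; returns the removable tail. */
>     static List<Long> garbageVersions(List<Long> newestFirst, long lowWatermark) {
>         for (int i = 0; i < newestFirst.size(); i++) {
>             // The newest version at or below the LW must survive: an RO
>             // transaction starting exactly at the LW still reads it.
>             if (newestFirst.get(i) <= lowWatermark) {
>                 return newestFirst.subList(i + 1, newestFirst.size());
>             }
>         }
>         return List.of(); // every version is newer than the LW, nothing to drop
>     }
> 
>     public static void main(String[] args) {
>         // LW = 40: version 30 is the newest one readable at 40, so only 10 goes.
>         System.out.println(garbageVersions(List.of(50L, 30L, 10L), 40L)); // [10]
>     }
> }
> {code}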
> h3. API
> The originally intended design looks like the following:
> {code:java}
> cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, 
> @Nullable byte[] upper,  boolean toInclusive, Timestamp timestamp){code}
> Now, I don't think that this is necessarily a good design. The main problem with 
> it is the existence of bounds:
>  * First of all, why not just have an inclusive lower bound and an exclusive upper 
> bound, like almost all other methods with bounds in existence?
>  * Second, I believe that this API was proposed on the assumption that row 
> ids have a uniform distribution, so that every "range" cleanup would result in 
> a roughly equal amount of work. This is simply not true in the current 
> implementation.
> RowId is a timestamp-based value that exists in a very narrow time slice, 
> making most ranges empty and meaningless.
> Also, the way RowIds are stored is actually up to the storage. There are no 
> restrictions on byte order when physically storing RowId objects.
> h3. Problems
> There's one thing that worries me: indexes.
> Because storages are unaware of table schemas, it's impossible to clean up 
> indexes from within the storage. This gets me thinking: the API should be flexible 
> enough that index cleanup can be performed as an external operation over the storage. 
> This will also reduce the amount of work needed for the implementation.
> To be specific, it feels like the method would look like this:
> {code:java}
> RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows, 
> BiConsumer<RowId, BinaryRow> indexCleaner);{code}
> Some explanation is required.
>  * threshold represents the same thing as before: the low watermark.
>  * startId is the first row to iterate over in the current batch.
>  * numRows is the number of rows that should be cleaned in the current batch. By this 
> I mean individual versions, not chains.
>  * the cleaner closure must be used to clean indexes after every individual 
> version removal. Right now it doesn't look optimal to me, but I struggle to 
> find a good solution for efficient index cleanup.
>  * the next RowId is returned, which should be used as startId in the next call: "null" 
> if cleanup is complete. In that case it can be restarted from the beginning or 
> simply postponed until a new low watermark value is available.
> How to operate it:
>  * numRows has two strategic purposes:
>  ** to control the size of batches in "runConsistently" closures;
>  ** to be able to run many cleanups in parallel while avoiding pool starvation: 
> every task is split into small chunks.
>  * cleanup should start from the smallest possible row id.
>  * the low watermark value can change between calls. This is fine and even 
> preferable.
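> To make the intended call pattern concrete, here is a toy driver over a simplified store (everything here is hypothetical: row ids and timestamps are plain longs, versions live in a TreeMap), showing how numRows bounds each batch and how the returned row id resumes the next call:
> {code:java}
> import java.util.*;
> import java.util.function.BiConsumer;
> 
> class BatchedGcSketch {
>     // rowId -> version timestamps, newest first.
>     final TreeMap<Long, List<Long>> rows = new TreeMap<>();
> 
>     /** Removes up to numRows garbage versions starting at startId; returns the
>      *  row id to resume from, or null when the pass is complete. */
>     Long cleanup(long threshold, long startId, int numRows, BiConsumer<Long, Long> indexCleaner) {
>         int removed = 0;
>         for (Map.Entry<Long, List<Long>> e : rows.tailMap(startId).entrySet()) {
>             List<Long> versions = e.getValue();
>             // Keep everything at or above the threshold, plus the newest
>             // version below it (still visible to a reader at the threshold).
>             int keep = 0;
>             while (keep < versions.size() && versions.get(keep) > threshold) {
>                 keep++;
>             }
>             if (keep < versions.size()) {
>                 keep++;
>             }
>             while (versions.size() > keep) {
>                 if (removed == numRows) {
>                     return e.getKey(); // batch is full, resume at this row
>                 }
>                 long ts = versions.remove(versions.size() - 1); // oldest first
>                 indexCleaner.accept(e.getKey(), ts); // clean indexes after removal
>                 removed++;
>             }
>         }
>         return null; // pass complete; postpone until a new low watermark arrives
>     }
> 
>     public static void main(String[] args) {
>         BatchedGcSketch store = new BatchedGcSketch();
>         store.rows.put(1L, new ArrayList<>(List.of(50L, 10L, 5L)));
>         store.rows.put(2L, new ArrayList<>(List.of(20L, 15L)));
> 
>         // A background worker drives the cleanup in batches of one version.
>         int batches = 1;
>         Long next = 0L;
>         while ((next = store.cleanup(40L, next, 1, (rowId, ts) -> {})) != null) {
>             batches++;
>         }
>         System.out.println(store.rows + ", batches: " + batches); // {1=[50, 10], 2=[20]}, batches: 2
>     }
> }
> {code}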
> h3. Alternative
> Some might find the given API too complicated. There's another way that involves 
> fewer actions from the "client":
> {code:java}
> void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, 
> BinaryRow> indexCleaner);{code}
> In a way, both of these methods are equivalent; the difference is minuscule. 
> One can be transformed into the other with relative ease. If anything, this 
> one is even preferable as a first implementation, because it's smaller and 
> there's less room for error.
> h3. Different engines
> This is the part where I express my thoughts on what can (and will) go wrong 
> in different storages.
> Generally speaking, right now there are three implementations that must be 
> compatible in behavior:
>  # the test storage, which uses a ConcurrentSkipListMap and actual chains, 
> represented as singly-linked lists
>  # the two B+Tree-based storages. They also use trees with chains, again 
> represented as singly-linked lists, but off-heap this time
>  # the RocksDB-based storage. It uses a flat structure, so it's just a big mess 
> of (RowId, Timestamp?) pairs in the keyset
> Let's discuss them one by one:
>  # This is the easiest one. Currently, the links to the next chain elements are final. 
> They must become volatile. There's no manual "freeing" of the memory used by 
> chains, so it'll just work.
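> For the test storage, the change is roughly the following (names are hypothetical, timestamps are plain longs): the next link becomes volatile, and GC detaches the garbage tail with a single volatile write, leaving the unlinked nodes to the JVM's own garbage collector:
> {code:java}
> class VersionChainSketch {
>     static class RowVersion {
>         final long timestamp;
>         // Was final; must become volatile so that concurrent readers walking
>         // the chain observe the unlink without extra synchronization.
>         volatile RowVersion next;
> 
>         RowVersion(long timestamp, RowVersion next) {
>             this.timestamp = timestamp;
>             this.next = next;
>         }
>     }
> 
>     /** Unlinks every version strictly older than the newest one at or below
>      *  the low watermark; returns the number of versions dropped. */
>     static int gc(RowVersion head, long lowWatermark) {
>         RowVersion cur = head;
>         while (cur != null && cur.timestamp > lowWatermark) {
>             cur = cur.next;
>         }
>         if (cur == null) {
>             return 0; // nothing old enough to cut off
>         }
>         int dropped = 0;
>         for (RowVersion v = cur.next; v != null; v = v.next) {
>             dropped++;
>         }
>         cur.next = null; // single volatile write detaches the garbage tail
>         return dropped;
>     }
> 
>     public static void main(String[] args) {
>         RowVersion chain = new RowVersion(50, new RowVersion(30, new RowVersion(10, null)));
>         System.out.println(gc(chain, 40)); // 1: only the version at 10 is unlinked
>     }
> }
> {code}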
>  # This one is hard (in my opinion). Imagine having a "RowVersion" instance 
> and trying to read the "next" row version, while either looking for a 
> specific version or performing an index build/cleanup, it doesn't matter. How do 
> you guarantee that the link is valid and has not been deleted while you were 
> doing your thing?
> Out of the box, there's no way to tell that an item in a data page has been 
> deleted (to be completely honest, there _might_ be a loophole that exploits 
> the fact that there's only a single _write_ thread that inserts anything and 
> maybe a bunch of GC threads that delete data, but I wouldn't go that route).
> This is because _links_ have no rotation id; it's just not there. And we will 
> not add one for this reason alone, as it would complicate the design too much 
> and possibly even compromise things like data density.
> For this reason, some external synchronization is required. Links must never 
> be dereferenced without 100% certainty that they actually hold the data that 
> developers expect them to hold. There are no exceptions to this rule.
> The exact way to synchronize things is up to whoever implements GC in the 
> engine.
> (There's also a problem with completely deleting rows from the tree. It may 
> clash with iterators that cache stuff from the pages they have read. This may 
> also result in obsolete links.)
>  # RocksDB has an interesting quirk: it effectively creates a 
> snapshot for each cursor that's created. I don't think we have ever really 
> explored the realm of bugs that can occur because of that. Things that may 
> break include:
>  - transactional scans
>  - index cleanup
>  - GC
> I believe that an occasional call to {{RocksIteratorInterface#refresh}} will 
> help here. But before doing that, we should clearly identify the problematic 
> places and write tests that actually demonstrate a problem. Doing such things 
> blindly is dangerous.
> h3. Random notes
> The passed closure should only be executed once the specific version has already 
> been deleted. Otherwise, the data in the row store would still match the indexes, 
> and the indexes would not be cleaned, which is wrong. There must be a test for 
> this, of course.
> Unsurprisingly, integration of this method into the raft state machine is out of 
> scope. I don't think that LW propagation is implemented at this point; I may 
> be wrong. Anyway, we'd better discuss it with the people implementing transactions 
> right now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
