[
https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov resolved IGNITE-17571.
------------------------------------
Fix Version/s: 3.0
Resolution: Fixed
> Implement GC for MV storages
> ----------------------------
>
> Key: IGNITE-17571
> URL: https://issues.apache.org/jira/browse/IGNITE-17571
> Project: Ignite
> Issue Type: Epic
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0
>
>
> h3. Basics
> The current implementation can only work with infinite storage space, because
> the storage works in append-only mode (except for transaction rollbacks).
> There's a value in the system called the "low watermark". It is guaranteed that
> no new RO transactions will be started at a time earlier than the LW. The
> existence of such a value allows us to safely delete all versions before it. We
> will not discuss the mechanics of acquiring this value, assuming that it is
> simply given.
> It's expected that deletion will be performed by background workers, without
> holding any transaction locks. The process is independent on each RAFT group
> member. GC shouldn't affect transactions in any way, if possible.
> h3. API
> The originally intended design looks like the following:
> {code:java}
> cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive,
> @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}
> Now, I don't think that this is necessarily a good design. The main problem
> with it is the existence of bounds:
> * First of all, why not just have an inclusive lower bound and an exclusive
> upper bound, like almost every other method with bounds.
> * Second, I believe that this API was proposed under the assumption that row
> ids have a uniform distribution and that every "range" cleanup results in a
> roughly equal amount of work. This is simply not true in the current
> implementation.
> RowId is a timestamp-based value that exists in a very narrow time slice,
> making most ranges empty and meaningless.
> Besides, the way they're stored is actually up to the storage: there are no
> restrictions on byte order when physically storing RowId objects.
> h3. Problems
> There's one thing that worries me: indexes.
> Because storages are unaware of table schemas, it's impossible to clean up
> indexes from inside the storage. This gets me thinking: the API should be
> flexible enough so that index cleanup can be performed as an external operation
> over the storage. This will also reduce the amount of work needed for the
> implementation.
> To be specific, it feels like the method would look like this:
> {code:java}
> RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows,
> BiConsumer<RowId, BinaryRow> indexCleaner);{code}
> Some explanation is required.
> * threshold represents the same thing as before - the low watermark.
> * startId - the row id that should be iterated first in the current batch.
> * numRows - the number of rows that should be cleaned in the current batch. By
> this I mean individual versions, not chains.
> * the cleaner closure must be used to clean indexes after every individual
> version removal. Right now it doesn't look optimal to me, but I struggle to
> find a good solution for efficient index cleanup.
> * the next row id is returned; it should be used as startId in the next call.
> "null" means that cleanup is complete; in this case it can be started from the
> beginning or simply postponed until a new low watermark value is available.
> How to operate it:
> * numRows has two strategic purposes:
> ** to control the size of batches in "runConsistently" closures.
> ** to be able to run many cleanups in parallel without starving the pool:
> every task is split into small chunks (see the sketch after this list).
> * cleanup should start from the smallest possible row id.
> * the low watermark value can change between calls. This is fine and even
> preferable.
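> For illustration, here's a rough sketch of how an external GC worker could
> drive such batches so that many partitions share one small pool without
> starving it. Only the proposed cleanup(threshold, startId, numRows,
> indexCleaner) contract is taken from above; the executor, the storage type, the
> batch size and the way the smallest row id is obtained are assumptions, not
> actual API.
> {code:java}
> // Hypothetical driver: one small batch per task, then re-submit, so that GC tasks
> // for different partitions interleave on a shared pool instead of starving it.
> void scheduleGc(ExecutorService gcPool, MvPartitionStorage storage,
>         HybridTimestamp lowWatermark, RowId startId, BiConsumer<RowId, BinaryRow> indexCleaner) {
>     gcPool.execute(() -> {
>         // In the real storage this batch would be executed inside a runConsistently closure.
>         RowId next = storage.cleanup(lowWatermark, startId, 100, indexCleaner);
>
>         if (next != null) {
>             // Not finished yet: re-schedule the continuation instead of looping in place.
>             scheduleGc(gcPool, storage, lowWatermark, next, indexCleaner);
>         }
>         // null: this partition is done until a new low watermark value arrives.
>     });
> }{code}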
> BUT, such a design is a "backwards" design, and it gives the storage too much
> responsibility.
> h3. Alternative
> Some might find the given API too complicated. There's another way that
> requires fewer actions from the "client".
> {code:java}
> void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId,
> BinaryRow> indexCleaner);{code}
> In a way, both of these methods are equivalent; the difference is minuscule,
> and one can be transformed into the other with relative ease. If anything, this
> one is even preferable as a first implementation, because it's smaller and
> there's less room for error.
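> For illustration, here is one purely hypothetical way to express the unbounded
> variant through the batched one, assuming both signatures from above; the batch
> size is arbitrary:
> {code:java}
> // Hypothetical adapter: the simpler, unbounded cleanup expressed via the batched variant.
> void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, BinaryRow> indexCleaner) {
>     RowId next = startId;
>
>     while (next != null) {
>         // 100 is an arbitrary batch size; it only affects the granularity of the inner calls.
>         next = cleanup(threshold, next, 100, indexCleaner);
>     }
> }{code}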
> Another alternative is to have an internal queue, like in the case of TTL in
> Ignite 2. This would be the simplest API to implement:
> {code:java}
> @Nullable Pair<RowId, BinaryRow> enqueue(HybridTimestamp threshold);{code}
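> Such a queue would be drained by an external worker, roughly like this (the
> Pair accessors, the storage reference and the loop shape are illustrative; only
> enqueue(threshold) comes from the proposal):
> {code:java}
> // Hypothetical drain loop over the queue-style API.
> Pair<RowId, BinaryRow> garbage;
>
> while ((garbage = storage.enqueue(lowWatermark)) != null) {
>     // Index entries for the collected version are cleaned outside the storage,
>     // because the storage itself knows nothing about table schemas.
>     indexCleaner.accept(garbage.getFirst(), garbage.getSecond());
> }{code}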
> h3. Different engines
> This is the part where I express my thoughts on what can (and will) go wrong
> in the different storages.
> Generally speaking, right now there are three implementations that must be
> compatible in behavior:
> # the test storage, which uses a ConcurrentSkipListMap and actual version
> chains, represented as singly-linked lists
> # the B+Tree-based storages, two of them. They also use chains represented as
> singly-linked lists, but off-heap this time
> # the RocksDB-based storage. It uses a flat structure, so it's just a big set
> of (RowId, Timestamp?) pairs in the keyset
> Let's discuss them one by one:
> # This is the easiest one. Currently, links to the next chain elements are
> final; they must become volatile. There's no manual "freeing" of the memory
> used by chains, so it'll just work.
> # This one is hard (in my opinion). Imagine having a "RowVersion" instance
> and trying to read the "next" row versions, while either looking for a
> specific version or performing an index build/cleanup - it doesn't matter
> which. How do you guarantee that the link is valid and has not been deleted
> while you were doing your work?
> Out of the box, there's no way to tell that an item in a data page has been
> deleted (to be completely honest, there _might_ be a loophole that exploits
> the fact that there's only a single _write_ thread that inserts anything and
> maybe a bunch of GC threads that delete data, but I wouldn't go that route).
> This is because _links_ have no rotation id - it's just not there. And we will
> not add one for this reason alone: it would complicate the design too much and
> possibly even compromise a lot of things, like data density.
> For this reason, some external synchronization is required. Links must never
> be de-referenced without 100% certainty that they actually hold the data that
> developers expect them to have. There are no exceptions to this rule.
> The exact way to synchronize things is up to whoever implements GC in the
> engine.
> (There's also a problem with completely deleting rows from the tree: it may
> clash with iterators that cache data from the pages they have already read,
> which may also result in obsolete links.)
> There are two directions for searching for an answer; the truth lies somewhere
> in between:
> - update closures guarantee isolation, but they affect throughput
> - explicit latches for row ids also guarantee isolation, but they also
> affect throughput
> Ignite 2 takes the second route, but the reasoning is different: in that
> particular case we already have latches because of the atomic protocol, and
> they're simply reused.
> # RocksDB has an interesting property - it effectively creates a snapshot for
> each cursor that's created. I don't think that we have ever really explored the
> realm of bugs that can occur because of that. Things that may break include:
> - transactional scans
> - index cleanup
> - GC
> I believe that an occasional call to {{RocksIteratorInterface#refresh}} will
> help here (see the sketch below). But before doing that, we should clearly
> identify the problematic places and write tests that actually show a problem.
> Doing such things blindly is dangerous.
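> As a purely illustrative sketch of the refresh idea (the column family handle,
> the refresh interval and the method shape are assumptions, not actual Ignite
> code; only the RocksDB iterator API itself is real):
> {code:java}
> import org.rocksdb.ColumnFamilyHandle;
> import org.rocksdb.RocksDB;
> import org.rocksdb.RocksDBException;
> import org.rocksdb.RocksIterator;
>
> class GcScanSketch {
>     /** Walks a partition column family, refreshing the iterator roughly every 1000 keys. */
>     static void gcPass(RocksDB db, ColumnFamilyHandle partitionCf) throws RocksDBException {
>         try (RocksIterator it = db.newIterator(partitionCf)) {
>             int sinceRefresh = 0;
>
>             it.seekToFirst();
>
>             while (it.isValid()) {
>                 if (++sinceRefresh >= 1_000) {
>                     byte[] key = it.key();
>
>                     it.refresh(); // invalidates the iterator, but re-pins it to the latest snapshot
>                     it.seek(key); // reposition at the first key >= the one we were about to process
>
>                     sinceRefresh = 0;
>
>                     continue; // re-check validity against the fresh snapshot
>                 }
>
>                 // ... inspect the (RowId, Timestamp) key here and remove the version if it is garbage ...
>
>                 it.next();
>             }
>         }
>     }
> }{code}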
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)