[ https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov updated IGNITE-17571:
-----------------------------------
    Issue Type: Epic  (was: Improvement)

Implement GC for MV storages
----------------------------

                Key: IGNITE-17571
                URL: https://issues.apache.org/jira/browse/IGNITE-17571
            Project: Ignite
         Issue Type: Epic
           Reporter: Ivan Bessonov
           Priority: Major
             Labels: ignite-3

h3. Basics

The current implementation can only work with infinite storage space, because the storage operates in append-only mode (except for transaction rollbacks).

There's a value in the system called the "low watermark" (LW). It is guaranteed that no new RO transaction will be started at a time earlier than the LW. The existence of such a value allows us to safely delete all versions that precede it. We will not discuss the mechanics of acquiring this value, assuming that it is simply given.

It's expected that deletion will be performed by background workers, without holding any transaction locks. The process runs independently on each RAFT group member. If possible, GC shouldn't affect transactions in any way.

h3. API

The originally intended design looks like the following:

{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}

Now, I don't think that this is necessarily a good design. The main problem with it is the existence of bounds:
* First of all, why not just have an inclusive lower bound and an exclusive upper bound, like almost all methods with bounds in existence?
* Second, I believe that this API was proposed under the assumption that row ids have a uniform distribution, so that every "range" cleanup would result in a roughly equal amount of work. This is simply not true in the current implementation: RowId is a timestamp-based value that exists in a very narrow time slice, making most ranges empty and meaningless.

Besides, the way RowId objects are stored is actually up to the storage.
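The non-uniformity argument can be made concrete with a toy sketch (everything here is illustrative - the id layout, class and method names are assumptions, not Ignite code): ids minted from timestamps within a short interval all land in a tiny region of the id space, so a uniform partitioning into ranges leaves almost every range empty.

{code:java}
import java.util.TreeSet;

// Toy model (not Ignite code): "row ids" derived from timestamps within
// roughly one second. The exact bit layout is made up; only the clustering
// effect matters for the argument.
class RowIdRangeSketch {
    static int nonEmptyRanges(TreeSet<Long> ids, int buckets) {
        long step = Long.MAX_VALUE / buckets; // uniform split of the positive id space
        int nonEmpty = 0;
        for (int b = 0; b < buckets; b++) {
            long lo = step * (long) b;
            if (!ids.subSet(lo, lo + step).isEmpty()) { // half-open [lo, lo + step)
                nonEmpty++;
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        TreeSet<Long> ids = new TreeSet<>();
        long baseMillis = 1_660_000_000_000L; // ids minted within ~one second
        for (long i = 0; i < 1_000; i++) {
            ids.add(((baseMillis + i) << 20) | i); // hypothetical timestamp-based id
        }
        // Almost all of the 1024 uniform ranges come out empty:
        System.out.println("non-empty ranges: " + nonEmptyRanges(ids, 1024) + " / 1024");
    }
}
{code}

Under this toy layout, all 1,000 ids fall into a single one of the 1,024 uniform ranges, which is why a bounded range-cleanup API gives no useful control over the amount of work per call.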
There are no restrictions on byte order when physically storing RowId objects.

h3. Problems

There's one thing that worries me: indexes. Because storages are unaware of table schemas, it's impossible to clean up indexes from within the storage. This gets me thinking: the API should be flexible enough that index cleanup can be performed as an external operation over the storage. This will also reduce the amount of work needed for the implementation.

To be specific, it feels like the method would look like this:

{code:java}
RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows, BiConsumer<RowId, BinaryRow> indexCleaner);{code}

An explanation is required:
* {{threshold}} represents the same thing as before - the low watermark.
* {{startId}} - the row that should be the first to iterate in the current batch.
* {{numRows}} - the number of rows that should be cleaned in the current batch. By this I mean individual versions, not chains.
* the {{indexCleaner}} closure must be used to clean indexes after every individual version removal. Right now it doesn't look optimal to me, but I struggle to find a good solution for efficient index cleanup.
* the next RowId is returned, which should be used as {{startId}} in the next call. "null" is returned if the cleanup is complete; in that case it can be restarted from the beginning or simply postponed until a new low watermark value is available.

How to operate it:
* {{numRows}} has two strategic purposes:
** to control the size of batches in "runConsistently" closures;
** to be able to run many cleanups in parallel, avoiding pool starvation - every task is split into small chunks.
* cleanup should start from the smallest possible row id.
* the low watermark value can be changed in-between calls. This is fine and even preferable.

h3. Alternative

Some might find the given API too complicated. There's another way that involves fewer actions from the "client".
{code:java}
void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, BinaryRow> indexCleaner);{code}

In a way, both of these methods are equivalent; the difference is minuscule, and one can be transformed into the other with relative ease. If anything, this one is even preferable as a first implementation, because it's smaller and there's less room for errors.

h3. Different engines

This is the part where I express my thoughts on what can (and will) go wrong in different storages. Generally speaking, right now there are three implementations that must be compatible in behavior:
# the test storage, which uses a ConcurrentSkipListMap and actual version chains, represented as singly-linked lists;
# the B+Tree-based storages, two of them. They also use chains represented as singly-linked lists, but off-heap this time;
# the RocksDB-based storage. It uses a flat structure, so it's just a big mess of (RowId, Timestamp?) pairs in the keyset.

Let's discuss them one by one:
# This is the easiest one. Currently, links to the next chain elements are final; they must become volatile. There's no manual "freeing" of the memory used by chains, so it'll just work.
# This one is hard (in my opinion). Imagine having a "RowVersion" instance and trying to read the "next" row version, while either looking for a specific version or performing index build/cleanup - it doesn't matter which. How do you guarantee that the link is valid and has not been deleted while you were doing your thing? Out of the box, there's no way to tell that an item in a data page has been deleted (to be completely honest, there _might_ be a loophole that exploits the fact that there's only a single _write_ thread that inserts anything and maybe a bunch of GC threads that delete data, but I wouldn't go that route). This is because _links_ have no rotation id - it's just not there.
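To illustrate the hazard with a deliberately simplified model (none of these names are Ignite APIs; a "link" here is just a slot index into a page): once GC frees a slot and an insert reuses it, a reader holding the stale link silently reads unrelated data, and nothing in the link itself can expose that.

{code:java}
// Toy model (not Ignite code): a data page as a slot array, a "link" as a
// slot index. Without a rotation id, a stale link cannot detect slot reuse.
class StaleLinkSketch {
    static String[] page = new String[4]; // simplified "data page"

    static int insert(String rowVersion) { // returns a "link" (slot index)
        for (int slot = 0; slot < page.length; slot++) {
            if (page[slot] == null) {
                page[slot] = rowVersion;
                return slot;
            }
        }
        throw new IllegalStateException("page full");
    }

    static void gcFree(int link) { // GC deletes the version, freeing the slot
        page[link] = null;
    }

    public static void main(String[] args) {
        int link = insert("row v1");     // a reader remembers this link
        gcFree(link);                    // GC frees the slot concurrently
        int other = insert("other row"); // a later insert reuses the same slot

        // The stale link now dereferences to unrelated data, and the reader
        // has no way to tell: the link carries no reuse counter.
        System.out.println(link == other);   // true: the slot was reused
        System.out.println(page[link]);      // "other row", not "row v1"
    }
}
{code}

A rotation id (a per-slot reuse counter embedded in the link) would let the reader detect the mismatch; the point above is that the current link format has no room for one.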
And we will not add one for this reason only - it would complicate the design too much and possibly even compromise a lot of things, like data density. For this reason, some external synchronization is required: links must never be de-referenced without 100% certainty that they actually hold the data the developer expects them to have. There are no exceptions to this rule. The exact way to synchronize things is up to whoever implements GC in the engine.
(There's also a problem with completely deleting rows from the tree. It may clash with iterators that cache stuff from the pages they have read; this may also result in obsolete links.)
# RocksDB has an interesting thing about it - it effectively creates a snapshot for each cursor that's created. I don't think that we have ever really explored the realm of bugs that can occur because of that. Things that may break include:
- transactional scans
- index cleanup
- GC
I believe that an occasional call of {{RocksIteratorInterface#refresh}} will help here. But before doing that, we should clearly identify the problematic places and write tests that actually show a problem. Doing such things blindly is dangerous.

h3. Random notes

The passed closure should only be executed when the specific version has already been deleted. Otherwise the data in the row store would still match the indexes, so the index entries would not be cleaned, which is wrong. There must be a test for it, of course.

Unsurprisingly, integration of this method into the raft state machine is out of scope. I don't think that LW propagation is implemented at this point; I may be wrong. Anyway, we'd better discuss it with the guys who are implementing transactions right now.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)