Ivan Bessonov created IGNITE-25030:
--------------------------------------
Summary: Metastorage compaction blocks other threads by reading
from drive
Key: IGNITE-25030
URL: https://issues.apache.org/jira/browse/IGNITE-25030
Project: Ignite
Issue Type: Bug
Reporter: Ivan Bessonov
h3. Problem description
This happened:
{code:java}
2025-04-02 17:30:25:607 +0000
[WARNING][%node18%common-scheduler-0][FailureManager] Possible failure
suppressed according to a configured handler [hnd=NoOpFailureHandler
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
failureCtx=SYSTEM_WORKER_BLOCKED] org.apache.ignite.lang.IgniteException:
IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825 A critical thread is
blocked for 46737 ms that is more than the allowed 500 ms, it is
"%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned by
"%node18%metastorage-compaction-executor-0" Id=324{code}
The fact that we're doing it in a network thread will be addressed in a
separate issue. All we need to know in this context is that:
* When we're doing {{{}RocksDbKeyValueStorage.timestampByRevision{}}}, we're
waiting for read lock, which is kind of stupid. This problem actually scales to
all the other read operations. I'm calling it stupid because meta-storage is a
multi-versioned storage by design, and such locks should not exist.
* When we're holding a write lock in compaction thread, or potentially any
other write operation in a meta-storage FSM thread, we might spend an
unexpectedly long amount of time there, which is simply unacceptable. Reading
data from the storage should be avoided when we're holding a lock like that.
I created this issue as a *Bug* because it is a serious design flaw, and it
makes the system completely unresponsive in certain cases.
h3. Potential improvements
Here I will present a semi-organized list of thoughts, that might be converted
into a design later on.
* RW lock in its current form must be removed.
* Every time we read the data, we:
** Fix the revision number for the operation.
** Read the consistent cut.
** Read optimistically, assuming that compaction doesn't happen.
** If we figured that compaction has indeed happen when we finished reading
data, and the data might be invalid, we throw compaction exception.
* Resolving timestamp into a revision (or the opposite) should never require
lock either.
** We should use the same optimistic strategy as with regular reads.
** We should introduce an on-hep cache for the most recent values, like we did
for terms in raft log manager.
*** While doing that, we should ensure that it'll make sense. In order to do
that, we need to check that we are more likely to resolve recent
timestamps/revisions.
*** Unlike term values in raft log manager, here we might want to cache a few
hundred values. We also need a concurrent access for reading the cache, meaning
that the implementation will different. Just make sure to make it {*}non
blocking{*}.
*** _Should be done in a separate Jira_
* Same for reading index/term/confguration/etc. There must be a simple way to
make it non blocking either.
* The fun part - FSM operations. Several ways to improve them:
** Remove the write lock. Use read-evaluate-write approach.
** When we read the data in FSM, we should not fix the revision, we should
read the "latest" revision, it is more efficient. _Might be done in a separate
Jira._
** We should always remember - there are only two writing agents here. A state
machine and a compaction process.
** State machine is a latency-sensitive part of the system. We should avoid
blocking it with concurrent compaction. This restriction dictates the rest of
the design.
** When we read a list of existing revisions for an entry and prepend a new
revision to it, we don't lock anything. This means that by the time we write
this list back into a storage, it might contain some compacted revisions. *This
is fine.* The compactor itself should keep this race in mind and expect such a
trash in revisions list. If the value is not found - it's been compacted
previously. In my understanding, this is the only adequate approach for the
task.
** By the way, we should also expect that we have some trash there while just
reading the data. If something is in the process of being compacted, we
shouldn't even try to read it, but if we do, we should be ready to the data
that concurrently disappeared.
* Compactor:
** Here we can't allow to write trash data into a revisions list of any entry,
because (unlike in FSM) that would lead to inconsistent data.
** I propose doing the
{{db.write(writeOptions, batch);}} calls while holding an exclusive lock.
** FSM just acquires the lock and writes the data, without any retries,
because it has a priority.
** Compactor should have a check, like "is it safe to write a batch that I
prepared":
*** If the answer is "yes", we proceed with the write.
*** If the answer is "no" or "idk", we:
**** Don't proceed with writing the batch.
**** Refresh the cursor.
**** Re-read the data and prepare a new batch.
**** Repeat.
*** The goal is to have a check that will have a small likelihood of "maybe".
I propose having a sufficiently large bloom filter and update it synchronously.
It should be cleared every time we refresh the iterator.
** Such an approach will save us from constant "refresh" calls on iterator,
which are slow. They will become rare. They will almost disappear if there's
not much concurrent load.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)