[jira] [Created] (IGNITE-25030) Metastorage compaction blocks other threads by reading from drive

Ivan Bessonov (Jira) Fri, 04 Apr 2025 02:21:33 -0700

Ivan Bessonov created IGNITE-25030:
--------------------------------------

             Summary: Metastorage compaction blocks other threads by reading 
from drive
                 Key: IGNITE-25030
                 URL: https://issues.apache.org/jira/browse/IGNITE-25030
             Project: Ignite
          Issue Type: Bug
            Reporter: Ivan Bessonov



h3. Problem description

This happened:

 
{code:java}
2025-04-02 17:30:25:607 +0000 
[WARNING][%node18%common-scheduler-0][FailureManager] Possible failure 
suppressed according to a configured handler [hnd=NoOpFailureHandler 
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
failureCtx=SYSTEM_WORKER_BLOCKED] org.apache.ignite.lang.IgniteException: 
IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825 A critical thread is 
blocked for 46737 ms that is more than the allowed 500 ms, it is 
"%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned by 
"%node18%metastorage-compaction-executor-0" Id=324{code}
 

The fact that we're doing it in a network thread will be addressed in a 
separate issue. All we need to know in this context is that:
 * When we're doing {{{}RocksDbKeyValueStorage.timestampByRevision{}}}, we're 
waiting for read lock, which is kind of stupid. This problem actually scales to 
all the other read operations. I'm calling it stupid because meta-storage is a 
multi-versioned storage by design, and such locks should not exist.
 * When we're holding a write lock in compaction thread, or potentially any 
other write operation in a meta-storage FSM thread, we might spend an 
unexpectedly long amount of time there, which is simply unacceptable. Reading 
data from the storage should be avoided when we're holding a lock like that.

I created this issue as a *Bug* because it is a serious design flaw, and it 
makes the system completely unresponsive in certain cases.
h3. Potential improvements

Here I will present a semi-organized list of thoughts, that might be converted 
into a design later on.
 * RW lock in its current form must be removed.
 * Every time we read the data, we:
 ** Fix the revision number for the operation.
 ** Read the consistent cut.
 ** Read optimistically, assuming that compaction doesn't happen.
 ** If we figured that compaction has indeed happen when we finished reading 
data, and the data might be invalid, we throw compaction exception.
 * Resolving timestamp into a revision (or the opposite) should never require 
lock either.

 ** We should use the same optimistic strategy as with regular reads.
 ** We should introduce an on-hep cache for the most recent values, like we did 
for terms in raft log manager.
 *** While doing that, we should ensure that it'll make sense. In order to do 
that, we need to check that we are more likely to resolve recent 
timestamps/revisions.
 *** Unlike term values in raft log manager, here we might want to cache a few 
hundred values. We also need a concurrent access for reading the cache, meaning 
that the implementation will different. Just make sure to make it {*}non 
blocking{*}.
 *** _Should be done in a separate Jira_
 * Same for reading index/term/confguration/etc. There must be a simple way to 
make it non blocking either.
 * The fun part - FSM operations. Several ways to improve them:
 ** Remove the write lock. Use read-evaluate-write approach.
 ** When we read the data in FSM, we should not fix the revision, we should 
read the "latest" revision, it is more efficient. _Might be done in a separate 
Jira._
 ** We should always remember - there are only two writing agents here. A state 
machine and a compaction process.
 ** State machine is a latency-sensitive part of the system. We should avoid 
blocking it with concurrent compaction. This restriction dictates the rest of 
the design.
 ** When we read a list of existing revisions for an entry and prepend a new 
revision to it, we don't lock anything. This means that by the time we write 
this list back into a storage, it might contain some compacted revisions. *This 
is fine.* The compactor itself should keep this race in mind and expect such a 
trash in revisions list. If the value is not found - it's been compacted 
previously. In my understanding, this is the only adequate approach for the 
task.
 ** By the way, we should also expect that we have some trash there while just 
reading the data. If something is in the process of being compacted, we 
shouldn't even try to read it, but if we do, we should be ready to the data 
that concurrently disappeared.
 * Compactor:
 ** Here we can't allow to write trash data into a revisions list of any entry, 
because (unlike in FSM) that would lead to inconsistent data.
 ** I propose doing the 
{{db.write(writeOptions, batch);}} calls while holding an exclusive lock.
 ** FSM just acquires the lock and writes the data, without any retries, 
because it has a priority.
 ** Compactor should have a check, like "is it safe to write a batch that I 
prepared":
 *** If the answer is "yes", we proceed with the write.
 *** If the answer is "no" or "idk", we:
 **** Don't proceed with writing the batch.
 **** Refresh the cursor.
 **** Re-read the data and prepare a new batch.
 **** Repeat.
 *** The goal is to have a check that will have a small likelihood of "maybe". 
I propose having a sufficiently large bloom filter and update it synchronously. 
It should be cleared every time we refresh the iterator.
 ** Such an approach will save us from constant "refresh" calls on iterator, 
which are slow. They will become rare. They will almost disappear if there's 
not much concurrent load.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (IGNITE-25030) Metastorage compaction blocks other threads by reading from drive

Reply via email to