[
https://issues.apache.org/jira/browse/IGNITE-20072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-20072:
-----------------------------------
Description:
h3. The problem
For every row that we update, we must take a lock for the rowId. This
guarantees consistency of data and indexes. But there are downsides:
* we must allocate locks, as well as a collection for locks;
* we may wait for some time while GC or another background process does its
thing.
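For illustration, the pattern in question looks roughly like this (a minimal
sketch with hypothetical names, not the actual Ignite code):
{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a lock table keyed by row id. Every update
// allocates both a lock object and a map entry, and the caller may
// block while GC or a snapshot reader holds the same row's lock.
class RowLockTable {
    private final ConcurrentMap<UUID, ReentrantLock> locks = new ConcurrentHashMap<>();

    void runConsistently(UUID rowId, Runnable update) {
        ReentrantLock lock = locks.computeIfAbsent(rowId, id -> new ReentrantLock());
        lock.lock();
        try {
            update.run(); // update the version chain and indexes
        } finally {
            lock.unlock();
            // Entries are not removed here; a real table also needs eviction,
            // which is part of the memory-consumption problem described above.
        }
    }
}
{code}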
Ideally, no locks should be used. We must investigate the issue and reduce the
waiting time or eliminate it completely. The following is a short explanation of
what's going on right now, and some ideas about possible fixes.
h3. Full state transfer (snapshot) lock
{{OutgoingSnapshot#acquireMvLock}}
This lock is required to coordinate the partition listener and the snapshot
output process. The idea is the following:
* "snapshot reader" iterates the partition
* if there's a concurrent load, we "notify" the reader that certain version
chains were updated and that their previous state must be preserved
Both of these steps take an exclusive lock to process data. This is not optimal.
There is (almost) a way around it: persisting the update timestamp together with
write intents. This way the snapshot effectively becomes equivalent to a "scan by
timestamp" that also includes write intents.
One part that makes this process suboptimal is updating/committing current
write intents. These two operations replace the head of the version chain,
losing its previous value, which makes a "pure" timestamp scan impossible.
Another point is the time that we hold the lock in the snapshot reader. It must
be a single lock per single entry at most. Ideally, the partition listener
should have higher priority while acquiring that lock. This is easy to do with
the row-level locks that we use for GC.
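One possible shape of such a prioritized lock (a sketch under the assumption
that best-effort priority is enough; all names are made up):
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical row lock that gives the partition listener priority: the
// snapshot reader only grabs the lock when no updater is waiting, so an
// update never queues behind a long snapshot read. Best-effort: a writer
// arriving between the check and tryLock() only waits for one short read.
class PriorityRowLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicInteger waitingWriters = new AtomicInteger();

    /** Partition listener path: always allowed to wait for the lock. */
    void writeLock() {
        waitingWriters.incrementAndGet();
        try {
            lock.lock();
        } finally {
            waitingWriters.decrementAndGet();
        }
    }

    /** Snapshot reader path: backs off whenever an updater is waiting. */
    boolean tryReadLock() {
        return waitingWriters.get() == 0 && lock.tryLock();
    }

    void unlock() {
        lock.unlock();
    }
}
{code}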
There is also a way to avoid reading the entire version chain under a single
lock acquisition, by doing it in several sessions, and we should do this. The
Justin Bieber Problem (a single hot row accumulating an enormous version chain)
is not a joke. The time that we hold any lock must be predictable and limited by
some known bounds.
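For example, a bounded-session read could look like this (a sketch reusing the
{{Version}} and {{PriorityRowLock}} shapes from above; it assumes timestamps
strictly decrease along the chain and that GC doesn't outrun the reader):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class BoundedChainReader {
    // Copies at most `batch` versions per lock acquisition and resumes
    // strictly below the last copied timestamp, so the lock hold time is
    // bounded regardless of the chain length.
    static List<Version> readChain(PriorityRowLock lock, Supplier<Version> head, int batch) {
        List<Version> result = new ArrayList<>();
        long resumeBelow = Long.MAX_VALUE;

        while (true) {
            while (!lock.tryReadLock()) {
                Thread.onSpinWait(); // back off while updaters have priority
            }
            try {
                Version v = head.get();
                // Skip versions already copied, plus anything added since.
                while (v != null && v.timestamp() >= resumeBelow) {
                    v = v.next();
                }
                for (int copied = 0; v != null && copied < batch; v = v.next(), copied++) {
                    result.add(v);
                    resumeBelow = v.timestamp();
                }
                if (v == null) {
                    return result; // reached the tail, the whole chain is read
                }
            } finally {
                lock.unlock();
            }
        }
    }
}
{code}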
One thing to consider: we should probably perform write intent resolution in the
snapshot reader. Otherwise the receiving node will have to do it on its own
later, with no explicit notification that some of the transactions have already
been committed or rolled back.
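A sketch of what that resolution might look like on the sending side (the
tx-state lookups are stand-ins for whatever API actually exists, and
{{Version}} is the hypothetical shape from above):
{code:java}
import java.util.UUID;
import java.util.function.Function;

enum TxState { PENDING, COMMITTED, ABORTED }

class WriteIntentResolver {
    // Resolves a write intent before it is shipped, so the receiver never
    // has to discover by itself that the transaction already finished.
    static Version resolve(Version intent,
                           Function<UUID, TxState> txStateOf,
                           Function<UUID, Long> commitTsOf) {
        UUID txId = intent.txId();
        return switch (txStateOf.apply(txId)) {
            // Already committed: ship as a regular committed version.
            case COMMITTED -> new Version(null, commitTsOf.apply(txId), intent.payload(), intent.next());
            // Rolled back: drop the intent and ship the rest of the chain.
            case ABORTED -> intent.next();
            // Still running: ship the write intent as-is.
            case PENDING -> intent;
        };
    }
}
{code}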
h3. GC locks
Earlier I mentioned that these locks can be reused in snapshots. This makes it
even harder to get rid of them.
So, what's the actual problem?
# Indexes
# Surprisingly, the aforementioned snapshots
They are required for the following reasons:
* write batches in RocksDB are isolated, which makes reading data written by
another "runConsistently" impossible. Thus we use external synchronization to
resolve race conditions, because lock-free algorithms won't work without the
data being accessible
* concurrent index update and index cleanup may break the index:
** let's imagine there is a chain [\{PK, A}, \{PK, B}] and an index that
contains [A, B]
** somebody adds another \{PK, B} to the head, and at the same time GC removes
\{PK, B} from the tail
** GC sees that the only surviving value is A, and plans to remove B from the
index
** after that, the update may insert the value \{PK, B} at the head, but GC
won't notice it
** so when GC then removes B from the index, we end up with inconsistent data
There may be other similar scenarios. Fixing such a race doesn't seem easy.
* concurrent version chain truncation and iteration may result in reading
invalid data in the page memory engine
** this is related to rebalance and to tombstone removal in GC. While the second
part can be rewritten (I think), during rebalance we must read the entire
version chain. If we do this "version by version", iterating through the
underlying linked list, we may end up with a pointer that has already been
de-allocated, with no way to know that. A workaround is to stop GC until the
snapshot is complete, or at least to not GC unprocessed data (see the sketch
after this list). But that only solves this particular issue, without addressing
everything else.
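The "don't GC unprocessed data" workaround could be as simple as this sketch
(assuming the snapshot iterates rows in a well-defined rowId order; all names
are hypothetical):
{code:java}
import java.util.UUID;
import java.util.concurrent.atomic.AtomicReference;

class SnapshotAwareGc {
    // Last row id the snapshot reader has fully streamed, or null when no
    // snapshot is in progress (a completed snapshot should reset it to null).
    private final AtomicReference<UUID> snapshotPosition = new AtomicReference<>();

    /** Called by the snapshot reader as it advances through the partition. */
    void onSnapshotAdvanced(UUID lastStreamedRowId) {
        snapshotPosition.set(lastStreamedRowId);
    }

    /** Called by GC: a chain may be truncated only after the snapshot passed it. */
    boolean canGc(UUID rowId) {
        UUID pos = snapshotPosition.get();
        return pos == null || rowId.compareTo(pos) <= 0;
    }
}
{code}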
One problem that can easily be solved is the memory consumption of these
dynamic locks. A striped lock would solve it, but it may (or will) add
contention.
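A minimal striped-lock sketch for reference; memory use becomes constant, at
the price of false contention between unrelated rows that hash to the same
stripe:
{code:java}
import java.util.UUID;
import java.util.concurrent.locks.ReentrantLock;

class StripedRowLock {
    private final ReentrantLock[] stripes;

    StripedRowLock(int concurrency) {
        stripes = new ReentrantLock[concurrency];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    // A fixed number of locks shared by all rows: no per-row allocation,
    // but rows colliding on a stripe now block each other.
    ReentrantLock stripeFor(UUID rowId) {
        return stripes[Math.floorMod(rowId.hashCode(), stripes.length)];
    }
}
{code}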
There are at least two issues in one, as stated above, and they may be
optimized independently.
> Reduce waiting time in partition listener
> -----------------------------------------
>
> Key: IGNITE-20072
> URL: https://issues.apache.org/jira/browse/IGNITE-20072
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>