[
https://issues.apache.org/jira/browse/IGNITE-20072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-20072:
-----------------------------------
Description:
h3. The problem
For every row that we update, we must take a lock for the rowId. This
guarantees consistency of data and indexes. But there are downsides:
* we must allocate locks, as well as a collection for locks;
* we may wait for some time while GC or another background process does its
thing.
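For illustration, the pattern in question looks roughly like this (a minimal
sketch with hypothetical names, not the actual Ignite code):
{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a lock table keyed by row id. Every update
// allocates both a lock object and a map entry, and the caller may
// block while GC or a snapshot reader holds the same row's lock.
class RowLockTable {
    private final ConcurrentMap<UUID, ReentrantLock> locks = new ConcurrentHashMap<>();

    void runConsistently(UUID rowId, Runnable update) {
        ReentrantLock lock = locks.computeIfAbsent(rowId, id -> new ReentrantLock());
        lock.lock();
        try {
            update.run(); // update the version chain and indexes
        } finally {
            lock.unlock();
            // Entries are not removed here; a real table also needs eviction,
            // which is part of the memory-consumption problem described above.
        }
    }
}
{code}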
Ideally, no locks should be used. We must investigate the issue and reduce the
waiting time or eliminate it completely. The following is a short explanation of
what's going on right now, and some ideas about possible fixes.
h3. Full state transfer (snapshot) lock
{{OutgoingSnapshot#acquireMvLock}}
This lock is required to coordinate the partition listener and the snapshot
output process. The idea is the following:
* "snapshot reader" iterates the partition
* if there's a concurrent load, we "notify" the reader that certain version
chains were updated and that their previous state must be preserved
Both of these steps take an exclusive lock to process data. This is not optimal.
There is (almost) a way around it: persisting the update timestamp together with
write intents. This way the snapshot effectively becomes equivalent to a "scan by
timestamp" that also includes write intents.
One part that makes this process suboptimal is updating/committing current
write intents. These two operations replace the head of the version chain,
losing its previous value, which makes a "pure" timestamp scan impossible.
Another point is the time that we hold the lock in the snapshot reader. It must
be a single lock per single entry at most. Ideally, the partition listener
should have higher priority while acquiring that lock. This is easy to do with
the row-level locks that we use for GC.
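One possible shape of such a prioritized lock (a sketch under the assumption
that best-effort priority is enough; all names are made up):
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical row lock that gives the partition listener priority: the
// snapshot reader only grabs the lock when no updater is waiting, so an
// update never queues behind a long snapshot read. Best-effort: a writer
// arriving between the check and tryLock() only waits for one short read.
class PriorityRowLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicInteger waitingWriters = new AtomicInteger();

    /** Partition listener path: always allowed to wait for the lock. */
    void writeLock() {
        waitingWriters.incrementAndGet();
        try {
            lock.lock();
        } finally {
            waitingWriters.decrementAndGet();
        }
    }

    /** Snapshot reader path: backs off whenever an updater is waiting. */
    boolean tryReadLock() {
        return waitingWriters.get() == 0 && lock.tryLock();
    }

    void unlock() {
        lock.unlock();
    }
}
{code}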
There is also a way to avoid reading the entire version chain under a single
lock acquisition, by doing it in several sessions, and we should do this. The
Justin Bieber Problem (a single hot row accumulating an enormous version chain)
is not a joke. The time that we hold any lock must be predictable and limited by
some known bounds.
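For example, a bounded-session read could look like this (a sketch reusing the
{{Version}} and {{PriorityRowLock}} shapes from above; it assumes timestamps
strictly decrease along the chain and that GC doesn't outrun the reader):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class BoundedChainReader {
    // Copies at most `batch` versions per lock acquisition and resumes
    // strictly below the last copied timestamp, so the lock hold time is
    // bounded regardless of the chain length.
    static List<Version> readChain(PriorityRowLock lock, Supplier<Version> head, int batch) {
        List<Version> result = new ArrayList<>();
        long resumeBelow = Long.MAX_VALUE;

        while (true) {
            while (!lock.tryReadLock()) {
                Thread.onSpinWait(); // back off while updaters have priority
            }
            try {
                Version v = head.get();
                // Skip versions already copied, plus anything added since.
                while (v != null && v.timestamp() >= resumeBelow) {
                    v = v.next();
                }
                for (int copied = 0; v != null && copied < batch; v = v.next(), copied++) {
                    result.add(v);
                    resumeBelow = v.timestamp();
                }
                if (v == null) {
                    return result; // reached the tail, the whole chain is read
                }
            } finally {
                lock.unlock();
            }
        }
    }
}
{code}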
One thing to consider: we should probably perform write intent resolution in the
snapshot reader. Otherwise the receiving node will have to do it on its own
later, with no explicit notification that some of the transactions have already
been committed or rolled back.
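A sketch of what that resolution might look like on the sending side (the
tx-state lookups are stand-ins for whatever API actually exists, and
{{Version}} is the hypothetical shape from above):
{code:java}
import java.util.UUID;
import java.util.function.Function;

enum TxState { PENDING, COMMITTED, ABORTED }

class WriteIntentResolver {
    // Resolves a write intent before it is shipped, so the receiver never
    // has to discover by itself that the transaction already finished.
    static Version resolve(Version intent,
                           Function<UUID, TxState> txStateOf,
                           Function<UUID, Long> commitTsOf) {
        UUID txId = intent.txId();
        return switch (txStateOf.apply(txId)) {
            // Already committed: ship as a regular committed version.
            case COMMITTED -> new Version(null, commitTsOf.apply(txId), intent.payload(), intent.next());
            // Rolled back: drop the intent and ship the rest of the chain.
            case ABORTED -> intent.next();
            // Still running: ship the write intent as-is.
            case PENDING -> intent;
        };
    }
}
{code}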
h3. GC locks
Earlier I mentioned that these locks can be reused in snapshots. This makes it
even harder to get rid of them.
So, what's the actual problem?
# Indexes
# Surprisingly, the aforementioned snapshots
They are required for the following reasons:
* write batches in RocksDB are isolated, which makes reading data written by
another "runConsistently" impossible. Thus we use external synchronization to
resolve race conditions, because lock-free algorithms won't work without the
data being accessible
* concurrent index update and index cleanup may break the index:
** let's imagine there is a chain [\{PK, A}, \{PK, B}] and an index that
contains [A, B]
** somebody adds another \{PK, B} to the head, and at the same time GC removes
\{PK, B} from the tail
** GC sees that the only surviving value is A, and plans to remove B from the
index
** after that, the update may insert the value \{PK, B} at the head, but GC
won't notice it
** so when GC then removes B from the index, we end up with inconsistent data
There may be other similar scenarios. Fixing such a race doesn't seem easy.
* concurrent version chain truncation and iteration may result in reading
invalid data in the page memory engine
** this is related to rebalance and to tombstone removal in GC. While the second
part can be rewritten (I think), during rebalance we must read the entire
version chain. If we do this "version by version", iterating through the
underlying linked list, we may end up with a pointer that has already been
de-allocated, with no way to know that. A workaround is to stop GC until the
snapshot is complete, or at least to not GC unprocessed data (see the sketch
after this list). But that only solves this particular issue, without addressing
everything else.
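The "don't GC unprocessed data" workaround could be as simple as this sketch
(assuming the snapshot iterates rows in a well-defined rowId order; all names
are hypothetical):
{code:java}
import java.util.UUID;
import java.util.concurrent.atomic.AtomicReference;

class SnapshotAwareGc {
    // Last row id the snapshot reader has fully streamed, or null when no
    // snapshot is in progress (a completed snapshot should reset it to null).
    private final AtomicReference<UUID> snapshotPosition = new AtomicReference<>();

    /** Called by the snapshot reader as it advances through the partition. */
    void onSnapshotAdvanced(UUID lastStreamedRowId) {
        snapshotPosition.set(lastStreamedRowId);
    }

    /** Called by GC: a chain may be truncated only after the snapshot passed it. */
    boolean canGc(UUID rowId) {
        UUID pos = snapshotPosition.get();
        return pos == null || rowId.compareTo(pos) <= 0;
    }
}
{code}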
One problem that can easily be solved is the memory consumption of these
dynamic locks. A striped lock would solve it, but it may (or will) add
contention.
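A minimal striped-lock sketch for reference; memory use becomes constant, at
the price of false contention between unrelated rows that hash to the same
stripe:
{code:java}
import java.util.UUID;
import java.util.concurrent.locks.ReentrantLock;

class StripedRowLock {
    private final ReentrantLock[] stripes;

    StripedRowLock(int concurrency) {
        stripes = new ReentrantLock[concurrency];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    // A fixed number of locks shared by all rows: no per-row allocation,
    // but rows colliding on a stripe now block each other.
    ReentrantLock stripeFor(UUID rowId) {
        return stripes[Math.floorMod(rowId.hashCode(), stripes.length)];
    }
}
{code}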
There are at least two issues in one, as stated above, and they may be
optimized independently.
> Reduce waiting time in partition listener
> -----------------------------------------
>
> Key: IGNITE-20072
> URL: https://issues.apache.org/jira/browse/IGNITE-20072
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>