Hi Igniters,
I would like to propose an implementation approach for SELECT ... FOR UPDATE in
the Calcite engine.
The target syntax is:
SELECT ...
FROM ...
WHERE ...
FOR UPDATE [WAIT <N> | NOWAIT | SKIP LOCKED]
The main goal is to provide useful row-level locking semantics for simple
transactional queries without changing the existing transaction protocol.
The proposed implementation is built on top of the current key-based
locking mechanism.
Proposed restrictions for the first implementation:
Unsupported if the query contains:
- JOIN (It is likely that this limitation will not be significant in the
future, because some databases use FOR UPDATE together with JOIN specifying
a table for which locks will be taken.)
- DISTINCT
- HAVING
- GROUP BY
- OVER/window functions
- using in subqueries
Unsupported if the query is not executed inside an explicit:
- PESSIMISTIC
- READ_COMMITTED
transaction
In these cases we can throw something like:
UnsupportedException("FOR UPDATE is unsupported with the clause...")
Proposed algorithm:
1.
Validate the query.
If the query contains unsupported clauses, or there is no active PESSIMISTIC
READ_COMMITTED transaction, fail fast.
2.
Run a distributed scan.
The query is executed as a distributed task. Each participating node
scans local data and finds rows matching the WHERE condition.
For each matching row, the node returns to the initiator:
key
GridCacheVersion
fields required by SELECT / ORDER BY
If ORDER BY is specified, the initiator preserves the required result
ordering.
3.
Try to acquire locks.
The initiator groups keys by node and sends lock requests to the
corresponding nodes.
Keys should be locked in a stable global order, for example by sorting
them by binary representation. This should prevent deadlocks between
concurrent SELECT ... FOR UPDATE queries that try to lock the same set
of keys.
A lock is considered successful only if the current GridCacheVersion is
equal to the version observed during the scan phase.
The modifiers are handled at this stage:
WAIT <N> wait up to N
NOWAIT do not wait, fail on conflict
SKIP LOCKED do not wait, skip locked rows
For successfully locked keys, the node returns the records. These
records must contain all fields required by the original SELECT list.
4.
Retry on failure.
If not all required locks were acquired, we clean up the intermediate
transaction state:
- remove NearTx state related to keys we attempted to lock
- send unlock requests if some locks were already acquired
Then we retry from step 2.
This is important because the result of the first scan may already be
stale. A row may no longer match the WHERE condition, or another row may
now match it.
5.
Return rows to the user.
Once all required locks are acquired and versions are verified, the
initiator starts returning rows to the user in the order required by the
original query.
Example
accounts
id=1, balance=1200, version=v10
id=2, balance=1500, version=v20
Transaction 1 (T1)
SELECT *
FROM accounts
WHERE balance >= 1000
FOR UPDATE WAIT 5;
Transaction 2 (T2)
UPDATE accounts
SET balance = 500
WHERE id = 1;
time ---------------------------------------------------------------->
T1:
attempt #1: distributed scan
sees:
id=1, balance=1200, version=v10 -> matches WHERE
id=2, balance=1500, version=v20 -> matches WHERE
candidates:
key=1, version=v10
key=2, version=v20
T2:
lock key=1
id=1: balance 1200 -> 500
version v10 -> v11
commit / unlock
T1:
lock phase
lock key=1 only if
version == v10
lock key=2 only if
version == v20
Node:
key=1:
current version = v11
scanned version = v10
version mismatch
key=2:
may be locked
successfully
T1:
attempt #1 failed
cleanup:
unlock key=2 if it was
locked
clean NearTx state for
key=1/key=2
retry
After retry:
T1:
attempt #2: distributed scan
sees current state:
id=1, balance=500, version=v11 -> does not match WHERE
id=2, balance=1500, version=v20 -> matches WHERE
candidates:
key=2, version=v20
lock phase:
lock key=2 only if version == v20
Node:
key=2 is free
current version = v20
version matches
lock acquired
T1:
returns:
id=2, balance=1500
lock on key=2 is kept until COMMIT/ROLLBACK
This avoids returning stale rows. The row id=1, balance=1200 was seen
during the first scan, but it was not returned because the version changed
before the lock was acquired.
I also understand that this algorithm may not perform well under high
contention because version mismatches can lead to repeated retries. It
probably makes sense to limit the number of retries or bind the retry loop
to the query timeout / transaction timeout.
Another concern is that the algorithm requires two network round trips:
first to collect candidate keys and versions, and then to acquire locks and
fetch the final rows. This is not ideal from a performance perspective, but
it follows from the current transaction model: the near node has to know
the keys before it can request locks for them on remote nodes.
I would also like to explicitly ask whether the community has any concerns
about this algorithm.
--
Vladislav Pyatkov