Hi Igniters,

I would like to propose an implementation approach for SELECT ... FOR UPDATE in
the Calcite engine.

The target syntax is:

SELECT ...
FROM ...
WHERE ...
FOR UPDATE [WAIT <N> | NOWAIT | SKIP LOCKED]


The main goal is to provide useful row-level locking semantics for simple
transactional queries without changing the existing transaction protocol.
The proposed implementation is built on top of the current key-based
locking mechanism.

Proposed restrictions for the first implementation:

Unsupported if the query contains:
- JOIN (It is likely that this limitation will not be significant in the
future, because some databases use FOR UPDATE together with JOIN specifying
a table for which locks will be taken.)
- DISTINCT
- HAVING
- GROUP BY
- OVER/window functions
- using in subqueries
Unsupported if the query is not executed inside an explicit:
- PESSIMISTIC
- READ_COMMITTED
transaction

In these cases we can throw something like:

UnsupportedException("FOR UPDATE is unsupported with the clause...")


Proposed algorithm:

   1.

   Validate the query.

   If the query contains unsupported clauses, or there is no active PESSIMISTIC
   READ_COMMITTED transaction, fail fast.
   2.

   Run a distributed scan.

   The query is executed as a distributed task. Each participating node
   scans local data and finds rows matching the WHERE condition.

   For each matching row, the node returns to the initiator:
   key
   GridCacheVersion
   fields required by SELECT / ORDER BY
   If ORDER BY is specified, the initiator preserves the required result
   ordering.
   3.

   Try to acquire locks.

   The initiator groups keys by node and sends lock requests to the
   corresponding nodes.

   Keys should be locked in a stable global order, for example by sorting
   them by binary representation. This should prevent deadlocks between
   concurrent SELECT ... FOR UPDATE queries that try to lock the same set
   of keys.

   A lock is considered successful only if the current GridCacheVersion is
   equal to the version observed during the scan phase.

   The modifiers are handled at this stage:
   WAIT <N>     wait up to N
   NOWAIT       do not wait, fail on conflict
   SKIP LOCKED  do not wait, skip locked rows
   For successfully locked keys, the node returns the records. These
   records must contain all fields required by the original SELECT list.
   4.

   Retry on failure.

   If not all required locks were acquired, we clean up the intermediate
   transaction state:
   - remove NearTx state related to keys we attempted to lock
   - send unlock requests if some locks were already acquired

   Then we retry from step 2.

   This is important because the result of the first scan may already be
   stale. A row may no longer match the WHERE condition, or another row may
   now match it.
   5.

   Return rows to the user.
   Once all required locks are acquired and versions are verified, the
   initiator starts returning rows to the user in the order required by the
   original query.


Example

accounts
id=1, balance=1200, version=v10
id=2, balance=1500, version=v20

Transaction 1 (T1)
SELECT *
FROM accounts
WHERE balance >= 1000
FOR UPDATE WAIT 5;

Transaction 2 (T2)
UPDATE accounts
SET balance = 500
WHERE id = 1;

time ---------------------------------------------------------------->

T1:
    attempt #1: distributed scan

    sees:
      id=1, balance=1200, version=v10  -> matches WHERE
      id=2, balance=1500, version=v20  -> matches WHERE

    candidates:
      key=1, version=v10
      key=2, version=v20

T2:
                         lock key=1
                         id=1: balance 1200 -> 500
                         version v10 -> v11
                         commit / unlock

T1:
                                                  lock phase

                                                  lock key=1 only if
version == v10
                                                  lock key=2 only if
version == v20

Node:
                                                  key=1:
                                                    current version = v11
                                                    scanned version = v10
                                                    version mismatch

                                                  key=2:
                                                    may be locked
successfully

T1:
                                                  attempt #1 failed

                                                  cleanup:
                                                    unlock key=2 if it was
locked
                                                    clean NearTx state for
key=1/key=2

                                                  retry

After retry:
T1:
    attempt #2: distributed scan

    sees current state:
      id=1, balance=500,  version=v11  -> does not match WHERE
      id=2, balance=1500, version=v20  -> matches WHERE

    candidates:
      key=2, version=v20

    lock phase:
      lock key=2 only if version == v20

Node:
    key=2 is free
    current version = v20
    version matches
    lock acquired

T1:
    returns:
      id=2, balance=1500

    lock on key=2 is kept until COMMIT/ROLLBACK

This avoids returning stale rows. The row id=1, balance=1200 was seen
during the first scan, but it was not returned because the version changed
before the lock was acquired.

I also understand that this algorithm may not perform well under high
contention because version mismatches can lead to repeated retries. It
probably makes sense to limit the number of retries or bind the retry loop
to the query timeout / transaction timeout.
Another concern is that the algorithm requires two network round trips:
first to collect candidate keys and versions, and then to acquire locks and
fetch the final rows. This is not ideal from a performance perspective, but
it follows from the current transaction model: the near node has to know
the keys before it can request locks for them on remote nodes.
I would also like to explicitly ask whether the community has any concerns
about this algorithm.
-- 
Vladislav Pyatkov

Reply via email to