[
https://issues.apache.org/jira/browse/HBASE-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870199#action_12870199
]
Michael Dalton commented on HBASE-2332:
---------------------------------------
Here's a rough sketch of the algorithm, sorry if it's a bit verbose. I take
into account that row locks are per-RS. To give you a quick overview, we
support reading, updating, and renaming a value V in a row. Renames move value
V from one row to another. All reads occur without row locks (unless a rename
failure occurs), and all updates (put/delete), which only affect a single row,
use HBase CAS. The only situation where I need locking is to deal with renames
and rename failure recovery. I need rename to appear to be atomic from the
client's perspective, but I can perform rename failure recovery so long as it
appears transparent.
Rename of value V from row A to row B is accomplished in 5 steps:
(1) Acquire row locks on A and B in row key order (so that a deadly embrace cannot occur where another client locks B and then A).
(2) Update value V in row A using CAS to V~src~, which marks V as a rename source and includes the row key of B as the destination.
(3) Update B using CAS (using an empty value as the comparison value) with a modified copy of V, V~dst~, which marks B as the destination (and includes A's row key as the source).
(4) Delete V from row A.
(5) Put the original value V in row B (this time not marked as a rename source or destination, just the exact original bytes).
In steps 2-5, each of the CAS, Delete, and Put operations uses the appropriate row lock, so row lock A is used for any modifications to row A.
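To make the happy path concrete, here is a minimal single-process sketch of the five steps, using an in-memory dict as a stand-in for the table. The RowStore class and the PENDING_SRC/PENDING_DST markers are illustrative assumptions, not HBase API; step 1 (locking) is elided since there is only one actor.

```python
class RowStore:
    """Toy single-process stand-in for an HBase table with per-row CAS."""
    def __init__(self):
        self.rows = {}  # row key -> value (absent/None means empty)

    def check_and_put(self, row, expected, new):
        # Compare-and-swap: write only if the current value matches expected.
        if self.rows.get(row) != expected:
            return False
        self.rows[row] = new
        return True

    def delete(self, row):
        self.rows.pop(row, None)

    def put(self, row, value):
        self.rows[row] = value


def rename(store, src, dst, v):
    # Step 1 (row locks on src and dst, in key order) elided in this sketch.
    # Step 2: CAS V -> V~src~, marking the source and recording dst's key.
    v_src = ("PENDING_SRC", dst, v)
    if not store.check_and_put(src, v, v_src):
        return False
    # Step 3: CAS the destination from empty to V~dst~, recording src's key.
    v_dst = ("PENDING_DST", src, v)
    if not store.check_and_put(dst, None, v_dst):
        return False
    # Step 4: delete the marked value from the source row.
    store.delete(src)
    # Step 5: overwrite the destination with the original, unmarked bytes.
    store.put(dst, v)
    return True


store = RowStore()
store.put("rowA", "V")
assert rename(store, "rowA", "rowB", "V")
assert store.rows == {"rowB": "V"}
```

Note that a crash between any two steps leaves at least one row carrying a pending marker that names the other row, which is exactly what the recovery procedure below relies on.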
At any one of these 5 steps, row locks may be lost or the machine performing the
rename may die. This leaves one or both of the rename source/destination in
the 'pending rename' state. Thus we must be able to recover from any prefix of
the rename operations executed by a failed rename; recovery is triggered when a
read encounters a value marked as a pending rename source (V~src~) or pending
rename destination (V~dst~).
Initial reads occur without locks. However, reads are _not_ allowed to return
'pending rename' values. Updates (Put/Delete) are not affected by renames: all
updates use CAS, and since no 'read value' operation is permitted to return a
'pending rename' value, an update's comparison value can never match one. Thus
no update will successfully modify a value marked 'pending rename'; the CAS
simply fails.
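A small sketch of why this holds, continuing the in-memory simulation (function names and the pending markers are illustrative assumptions, not HBase API): the read path rejects pending values, so the expected value an updater passes to CAS is always an ordinary value, and the CAS fails harmlessly against a pending marker.

```python
PENDING = ("PENDING_SRC", "PENDING_DST")  # illustrative rename markers

def read(rows, key):
    """Read that refuses to return 'pending rename' values."""
    v = rows.get(key)
    if isinstance(v, tuple) and v[0] in PENDING:
        raise RuntimeError("pending rename encountered: run recovery first")
    return v

def cas_update(rows, key, expected, new):
    # The caller's `expected` came from read(), so it is never a pending
    # marker; if the row is mid-rename the comparison fails and nothing moves.
    if rows.get(key) != expected:
        return False
    rows[key] = new
    return True

# A row mid-rename: the updater's stale expected value "V" cannot match the
# marker tuple, so the update fails instead of clobbering the pending state.
rows = {"rowA": ("PENDING_SRC", "rowB", "V")}
assert not cas_update(rows, "rowA", "V", "V2")
```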
If a pending rename value is encountered, the read must recover from the rename
failure. The recovery method acquires locks on A _and_ B (which it can do
because pending values include the row key of the 'other' row, i.e. V~src~
includes B's row key). The recovery process then performs an undo or a redo
depending on whether V~dst~ was ever written: if V~dst~ was never written, it
undoes the rename by writing the original value V back to row A; otherwise it
redoes the rename by re-executing steps (4)-(5). In either case, all
modifications are performed with the appropriate row lock.
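The undo/redo decision can be sketched as follows, again over the in-memory dict simulation with illustrative marker names (locking elided; in the real algorithm every mutation here would hold the appropriate row lock):

```python
def recover(rows, key):
    """Roll a failed rename back (undo) or forward (redo), given that a read
    of `key` returned a pending rename value."""
    v = rows.get(key)
    if not (isinstance(v, tuple) and v[0] in ("PENDING_SRC", "PENDING_DST")):
        return  # nothing pending: another recoverer may have finished already
    # Pending values carry the 'other' row's key, so we learn both rows.
    if v[0] == "PENDING_SRC":
        src, dst, orig = key, v[1], v[2]
    else:
        src, dst, orig = v[1], key, v[2]
    dst_val = rows.get(dst)
    if isinstance(dst_val, tuple) and dst_val[0] == "PENDING_DST":
        # V~dst~ was written: redo by re-executing steps (4)-(5).
        rows.pop(src, None)
        rows[dst] = orig
    else:
        # V~dst~ was never written: undo by restoring the original V at src.
        rows[src] = orig

# Failure after step 2 only: undo restores row A.
rows = {"rowA": ("PENDING_SRC", "rowB", "V")}
recover(rows, "rowA")
assert rows == {"rowA": "V"}

# Failure after step 3: redo completes the move to row B.
rows = {"rowA": ("PENDING_SRC", "rowB", "V"),
        "rowB": ("PENDING_DST", "rowA", "V")}
recover(rows, "rowB")
assert rows == {"rowB": "V"}
```

The same function also handles a failure between steps 4 and 5 (source already deleted, destination still marked): the redo branch's delete is a no-op and the final put lands the original value.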
The critical benefit locks provide is that once the rename recovery procedure
has acquired row locks on A and B, it is guaranteed that the original rename
procedure (or previous failed rename recovery procedures) will never 'wake up'
later and successfully perform mutations related to the failed rename (as their
locks will have expired). I can provide more details about this algorithm in
general if the above is unclear.
This algorithm seems to work correctly given that locks may be lost at any
time. If I understand what you said correctly, the major issue is that locks
are not currently persisted in the HLog. However, the only risk presented by
this lack of persistence is that locks may be lost before lease expiration,
correct? It is still the case that at most one client can hold a lock on a
row -- it's just that a client may lose its lock even before the lease expires
due to the lack of HLog persistence (and all future operations performed with
the 'lost' lock will fail).
I don't see this as a huge drawback: anyone dealing with a critical section
guarded by HBase row locks _must_ already handle the case that locks may be
lost, because lock leases may expire. Thus any critical section may fail at an
arbitrary point, completing only a prefix of its operations. From a correctness
perspective, ensuring that programs transparently handle lock failure due to
lease expiration should also cover losing locks due to the lack of HLog
persistence. Please correct me if I'm missing something here concerning HLog
durability, or if the effect differs from what I've described.
> Remove client-exposed row locks from region server
> --------------------------------------------------
>
> Key: HBASE-2332
> URL: https://issues.apache.org/jira/browse/HBASE-2332
> Project: Hadoop HBase
> Issue Type: Improvement
> Components: client, regionserver
> Reporter: Todd Lipcon
> Fix For: 0.22.0
>
>
> Discussion in HBASE-2294 has surfaced that the client-exposed row lock
> feature in the HBase API may not be scalable/necessary. Additionally there
> are some benefits we can reap by removing the feature (or pushing it into the
> client).