[
https://issues.apache.org/jira/browse/HBASE-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032513#comment-16032513
]
stack commented on HBASE-18144:
-------------------------------
h1. Setup
In hbase-1.1.x and earlier, hbase took an exclusive lock on the WHOLE row whether the operation was a read, a write, or an increment (read-modify-write). Absolute single-threaded access crudely ensured we returned a consistent view. The hbase-1.1.x system was also indiscriminate about which queued handler next got access; i.e. it was not ‘fair’.
In hbase-1.2.x, the hbase row lock was moved to use a reentrant read/write native Java lock as its core (HBASE-12751); we took the read lock whenever we read or mutated the row – MVCC ensured a consistent view while the row was under modification – and the write lock was taken only for the usually ‘rare’ checkAndPut-type operations; i.e. checkAndPut, checkAndDelete, Increment, and Append. The non-exclusive access to the row upped throughput (in part because we were also able to simplify our read/write path using the new approach). The fairness helped latencies. hbase-1.2.x has been out for a good while now. Up to this, no one to my knowledge has complained about the new-style locking.
What happens, though, when you have a loading that is mostly increments, all on a single row; i.e. lots of exclusive write locks concurrent with lots of read locks?
We found that the old exclusive lock performs way better than the hbase-1.2.x read/write locking in at least one extreme, so much so that we need to bring it back as an option, if only so we can carry such 'extreme' increment-type workloads on newer hbases (we should work on getting write-only increments working as first class in hbase, but that is for another issue). This issue is about adding back the exclusive lock as a table option.
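To make the contrast concrete, here is a minimal sketch of the two regimes (illustrative only; the class and method names are mine, not the HRegion code): the old style funnels every operation through one exclusive lock per row, while the 1.2.x style hands out the shared read lock for ordinary reads and mutations and reserves the exclusive write lock for the read-modify-write operations.
{code}
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch only; names and structure are illustrative, not the HRegion implementation.
class RowLockSketch {
  // Old style (<= 1.1.x): one exclusive, 'unfair' lock per row; every access serializes.
  private final ReentrantLock exclusive = new ReentrantLock(/* fair= */ false);

  // New style (1.2.x): fair read/write lock per row.
  private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock(/* fair= */ true);

  Lock oldStyleLock() {
    return exclusive;                  // reads, writes, increments all take this
  }

  Lock newStyleLockFor(boolean readModifyWrite) {
    return readModifyWrite
        ? rw.writeLock()               // checkAndPut, checkAndDelete, Increment, Append
        : rw.readLock();               // plain reads/mutations; MVCC keeps the view consistent
  }
}
{code}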
h1. Problem
In essence, at a user's deploy, all threads were in variations of the below:
{code}
"RpcServer.FifoWFPBQ.default.handler=190,queue=10,port=60020" #243 daemon
prio=5 os_prio=0 tid=0x00007fbb58691800 nid=0x2d2527 waiting on condition
[0x00007fbb4ca49000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007c6001b38> (a
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
at
org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
at org.apache.hadoop.hbase.regionserver.HRegion.doIncrement(HRegion.java:7453)
{code}
Or this:
{code}
"RpcServer.FifoWFPBQ.default.handler=180,queue=0,port=60020" #233 daemon prio=5
os_prio=0 tid=0x00007fbb586ed800 nid=0x2d251d waiting on condition
[0x00007fbb4d453000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000354976c00> (a
java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
at
org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
at
org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3017)
…
{code}
All threads are stuck trying to get the row lock.
Various diggings had it that all 200 handlers (200 in this case) were NOT
trying to get to the same row but that there were about 16 rows being addressed
with varying numbers of handlers backed up (from 50 handlers in one case down
to 10 or so at the other extreme).
It looked like a deadlock. You’d have expected one of the handlers to be inside holding the lock, but in this thread dump there were none.
Thread dumps would stay in this stasis for minutes. Progress seemed to be going
on because the players changed across thread dumps (we figured the thread dump
at ‘safe point’ was not giving us a true picture of the actual state of
affairs). Eventually we got over the hump and the RegionServer came alive again.
Looking at the same load up on an hbase-1.0.x, we’d see handlers stuck trying to get locks, but at peak only about 1/5th of all handlers would be stuck.
h1. Investigation
It took a while to arrive back at locking as the culprit. The tea leaves as noted above pointed at locking but we wouldn’t buy it: “hbase-1.2.x has been shipping for ages…”, “Measurement shows the new locking taking 10-20% longer but lock-acquisition-time is so short it's probably immeasurable”, etc.
A big problem was our inability to faithfully reproduce the loading locally ([~esteban] and I made distinct attempts in two different environments). While eventually in simulations we could reproduce the ‘weird’ thread dump the user was seeing, 1.2.x always had overall better throughput than 1.0.x (lock-acquisition-time was indeed longer in 1.2.x... by about 20%).
Here are some of the paths we took following the wild goose (we spent a good
while on this issue – here is a sketch of what we were at):
h2. Row Key Hash
We spent time on the key used to finger locks. It was made of a hash of the row key bytes. The hashing algorithm is poor. We speculated that clashes were broadening the scope of the lock, or that there was a scenario where we obtained the lock with one hash but a clashing hash somehow unlocked another row's lock, leaving locks in place.
Interesting findings included the fact that hashing took longer the longer the row key was (the user's keys were 150-200 characters). When we found that the row key hash in 1.1.x was the same as in 1.2.x, this put an end to this particular pursuit.
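For illustration, a sketch of the kind of keying involved, assuming a String.hashCode-style rolling hash over the row key bytes (the class and details here are mine, not the actual HBase hash): the cost is linear in key length, which fits the finding that long keys hashed slowly, and distinct rows can clash on the hash value.
{code}
import java.util.Arrays;

// Illustrative only: a String.hashCode-style rolling hash over row key bytes,
// wrapped so it can be used as a map key for per-row locks.
final class HashedRowKey {
  private final byte[] row;
  private final int hash;

  HashedRowKey(byte[] row) {
    this.row = row;
    int h = 0;
    for (byte b : row) {       // O(length of row key): longer keys hash slower
      h = 31 * h + b;
    }
    this.hash = h;
  }

  @Override public int hashCode() { return hash; }

  @Override public boolean equals(Object o) {
    // In this sketch, hash clashes are resolved by comparing the actual bytes,
    // so a clash only costs extra comparisons.
    return o instanceof HashedRowKey && Arrays.equals(row, ((HashedRowKey) o).row);
  }
}
{code}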
h2. Compare
We expended a bunch of effort comparing locking in 1.0.x and 1.2.x. We measured how long lock acquisition was taking (fair vs unfair, exclusive vs read/write) and how many locks were being taken. It looked like 1.2.x was taking many more locks than 1.0.x. Comparing code bases, it looked like we’d dropped an optimization going to 1.2 here:
{code}
commit b15a05d7b611726ff9b2dd1a3c5c9448b41e1474
Author: Enis Soztutar <[email protected]>
Date: Wed Oct 28 18:16:32 2015 -0700
HBASE-14689 Addendum and unit test for HBASE-13471
{code}
… a patch that was subsequently messily reverted then backported here:
{code}
commit bae2a970817d3693946c5b1d4e2382e56b1e94b8
Author: stack <[email protected]>
Date: Wed Sep 30 11:48:46 2015 -0700
HBASE-14465 Backport 'Allow rowlock to be reader/write' to branch-1
// If we haven't got any rows in our batch, we should block to
// get the next one.
- boolean shouldBlock = numReadyToWrite == 0;
RowLock rowLock = null;
try {
- rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
+ rowLock = getRowLock(mutation.getRow(), true);
} catch (IOException ioe) {
LOG.warn("Failed getting lock in batch put, row="
+ Bytes.toStringBinary(mutation.getRow()), ioe);
}
if (rowLock == null) {
// We failed to grab another lock
- assert !shouldBlock : "Should never fail to get lock when blocking";
+ assert false: "Should never fail to get lock when blocking";
{code}
It turned out we were counting locks incorrectly and there was a misunderstanding around lock aggregation (HBASE-18074).
We even compared running on different JVM versions.
h2. Improving Lock-access time
The idea was to improve the 1.2 locking regime. We did profiling and then experiments: refactors of lock acquisition making do with fewer Atomics and volatiles, and investigation of sorting rows so we could take the lock once for a group of edits (see the sketch below)…. We gave up on this route after seeing how little improvement we could make, and after figuring that this code, which originally came from FB, had already had a lot of thought applied, so pickings would be slim.
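For the record, a hypothetical sketch of the row-sorting idea (the RowLocks helper and all names are made up, not HRegion code): sort the batch so edits against the same row are adjacent, then acquire each row lock once per distinct row rather than once per edit.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.Lock;

// Hypothetical sketch of the 'sort rows, take the lock once per group of edits' idea.
final class BatchLocking {
  interface RowLocks {
    Lock lockFor(byte[] row);   // assumed lookup of the per-row lock
  }

  static List<Lock> acquireOncePerRow(List<byte[]> rows, RowLocks locks) {
    List<byte[]> sorted = new ArrayList<>(rows);
    sorted.sort(BatchLocking::compareRows);          // group identical rows together

    List<Lock> held = new ArrayList<>();
    byte[] prev = null;
    for (byte[] row : sorted) {
      if (prev != null && Arrays.equals(prev, row)) {
        continue;                                    // same row as the last edit: reuse its lock
      }
      Lock lock = locks.lockFor(row);
      lock.lock();                                   // one acquisition per distinct row
      held.add(lock);
      prev = row;
    }
    return held;                                     // caller unlocks when the batch is done
  }

  private static int compareRows(byte[] a, byte[] b) {
    return Arrays.compare(a, b);                     // lexicographic byte order (Java 9+)
  }
}
{code}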
h1. Culprit: ReentrantReadWriteLock performance goes south if writes are frequent
RRWL is from circa 2005 (JDK5?). It is no use looking in the bug database for issues in 2017. But from the javadoc:
"ReentrantReadWriteLocks can be used to improve concurrency in some uses of
some kinds of Collections. This is typically worthwhile only when the
collections are expected to be large, and entail operations with overhead that
outweighs synchronization overhead."
And “...if updates become frequent then the data spends most of its time being
exclusively locked and there is little, if any increase in concurrency.”
Or Ben Manes in
https://groups.google.com/forum/#!topic/guava-discuss/Vqu34OODWdw "Read/Write
(shared) locks are expensive to maintain, much more so than exclusive locks. If
the lock hold time is small then most often a regular lock (synchronized or
ReentrantLock) is faster. It is only when the lock hold time is large does the
benefit of a shared lock become apparent."
This article on RWLock being AWFUL --
https://www.javacodegeeks.com/2014/06/java-8-stampedlocks-vs-readwritelocks-and-synchronized.html
-- gets dinged in the comments section for its testing technique, but its setup seemed to resemble the loading of the user under scrutiny (hundreds of concurrent threads with a good portion taking write locks); i.e. a read from hbase can turn into 1k Puts followed by 1k Increments (IIUC), all on the one row.
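To make that concrete, here is a rough microbenchmark sketch along those lines (the thread count, iteration count, and read/write mix are made-up assumptions; a proper harness such as JMH would be needed for numbers you could trust): many threads hammering one fair lock with tiny hold times, a portion of them taking the write lock.
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.IntFunction;

// Rough sketch only; use JMH or similar for numbers you would trust.
public class LockContentionSketch {
  static final int THREADS = 200;          // made up: roughly 'all handlers on one row'
  static final int OPS_PER_THREAD = 10_000; // keep modest: fair locking is slow under contention

  public static void main(String[] args) throws InterruptedException {
    // Fair locks, to match the FairSync seen in the thread dumps.
    ReentrantLock exclusive = new ReentrantLock(true);
    ReentrantReadWriteLock rw = new ReentrantReadWriteLock(true);

    // Exclusive lock: every op takes the same lock.
    System.out.println("exclusive:  " + run(i -> exclusive) + " ms");

    // Read/write lock: every 5th op takes the write lock (an 'increment'), the rest read.
    System.out.println("read/write: " + run(i -> (i % 5 == 0) ? rw.writeLock() : rw.readLock()) + " ms");
  }

  static long run(IntFunction<Lock> lockForOp) throws InterruptedException {
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(THREADS);
    for (int t = 0; t < THREADS; t++) {
      new Thread(() -> {
        try {
          start.await();
          for (int i = 0; i < OPS_PER_THREAD; i++) {
            Lock lock = lockForOp.apply(i);
            lock.lock();
            try {
              // tiny hold time: nothing to do, mirroring a short critical section
            } finally {
              lock.unlock();
            }
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      }).start();
    }
    long begin = System.nanoTime();
    start.countDown();
    done.await();
    return (System.nanoTime() - begin) / 1_000_000;
  }
}
{code}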
So, we decided to try forward-porting the old mechanism, an exclusive (reentrant) lock rather than an RRWL, as our locking basis.
Testing in my setup, though, the exclusive lock seemed to perform worse, but deployed at the user's production, it does much better than the 1.2.x default locking regime.
Let me forward-port the old locking mechanism in case anyone else has a workload like the above user's.
> Forward-port the old exclusive row lock; there are scenarios where it
> performs better
> -------------------------------------------------------------------------------------
>
> Key: HBASE-18144
> URL: https://issues.apache.org/jira/browse/HBASE-18144
> Project: HBase
> Issue Type: Bug
> Components: Increment
> Affects Versions: 1.2.5
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0, 1.3.2, 1.2.7
>
>
> Description to follow.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)