[ https://issues.apache.org/jira/browse/HBASE-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032513#comment-16032513 ]
stack commented on HBASE-18144:
-------------------------------

h1. Setup

In hbase-1.1.x and earlier, hbase took an exclusive lock on the WHOLE row whether the operation was a read, a write, or an increment (read-modify-write). Absolute single-threaded access crudely ensured we returned a consistent view. The hbase-1.1.x system was also indiscriminate about which queued handler next got access; i.e. it was not 'fair'. In hbase-1.2.x, the hbase row lock was moved to use a reentrant read/write native java lock as its core (HBASE-12751); we took the read lock whenever we read or mutated the row -- MVCC ensured a consistent view while the row was under modification -- and the write lock was taken only for the usually 'rare' checkAndPut-type operations; i.e. checkAndPut, checkAndDelete, Increment, and Append. The non-exclusive access to the row upped throughput (in part because we were also able to simplify our read/write path using the new approach). The fairness helped latencies.

hbase-1.2.x has been out for a good while now. To my knowledge, no one up to this point has complained about the new-style locking. What happens, though, when you have a loading that is mostly increments, all on a single row; i.e. lots of write (exclusive) locks concurrent with lots of read locks? We found that the old exclusive lock performs way better than the hbase-1.2.x read/write locking in at least one extreme -- so much so that we need to bring it back as an option, if only so we can carry such 'extreme' increment-type workloads on newer hbases (we should work on making write-only increments first class in hbase, but that is for another issue). This issue is about adding back the exclusive lock as a table option.

h1. Problem

In essence, at the user's deploy, all threads were in variations of the below:

{code}
"RpcServer.FifoWFPBQ.default.handler=190,queue=10,port=60020" #243 daemon prio=5 os_prio=0 tid=0x00007fbb58691800 nid=0x2d2527 waiting on condition [0x00007fbb4ca49000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000007c6001b38> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
        at org.apache.hadoop.hbase.regionserver.HRegion.doIncrement(HRegion.java:7453)
{code}

Or this:

{code}
"RpcServer.FifoWFPBQ.default.handler=180,queue=0,port=60020" #233 daemon prio=5 os_prio=0 tid=0x00007fbb586ed800 nid=0x2d251d waiting on condition [0x00007fbb4d453000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000354976c00> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3017)
        ...
{code}

All threads
are stuck trying to get the row lock. Various diggings showed that all 200 handlers (200 in this case) were NOT trying to get to the same row; rather, about 16 rows were being addressed, with varying numbers of handlers backed up on each (from 50 handlers in one case down to 10 or so at the other extreme). It looks like a deadlock. You'd have expected some of the handlers to be inside holding the lock, but in this thread dump there were none. Thread dumps would stay in this stasis for minutes. Progress seemed to be going on because the players changed across thread dumps (we figured the thread dump taken at 'safe point' was not giving us a true picture of the actual state of affairs). Eventually we got over the hump and the RegionServer came alive again. Looking at the same load on an hbase-1.0.x, we'd see handlers stuck trying to get locks, but only about 1/5th of all handlers would be stuck at peak.

h1. Investigation

It took a while to arrive back at locking as the culprit. The tea leaves as noted above pointed at locking, but we wouldn't buy it: "hbase-1.2.x has been shipping for ages...", "Measurement shows the new locking taking 10-20% longer, but lock-acquisition time is so short it is probably immeasurable", etc. A big problem was our inability to faithfully reproduce the loading locally ([~esteban] and I made distinct attempts in two different environments). While eventually in simulations we could reproduce the 'weird' thread dump the user was seeing, 1.2.x always had overall better throughput than 1.0.x in our simulations (lock-acquisition time was indeed longer in 1.2.x... by about 20%). Here are some of the paths we took following the wild goose (we spent a good while on this issue -- here is a sketch of what we were at):

h2. Row Key Hash

We spent time on the key used to finger locks. It was made from a hash of the row key bytes. The hashing algorithm is poor.
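For illustration of the hypothesis we were chasing: row locks are fingered by a hash of the row key rather than by the key bytes themselves, so in principle distinct rows can land on the same lock entry. Below is a minimal toy sketch of that idea -- hypothetical names throughout, a {{ConcurrentHashMap}} standing in for the real lock pool, and a deliberately bad hash; this is NOT the HBase implementation, just the shape of the failure mode we speculated about:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Toy model of hash-fingered row locks. The lock table is keyed by a
 * hash of the row key, not the key itself, so two distinct rows whose
 * hashes collide contend on the very same lock object.
 * All names here are hypothetical, for illustration only.
 */
public class RowLockStriping {
    private final ConcurrentHashMap<Integer, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    // A deliberately poor hash: it only looks at the first byte, so long
    // keys sharing a prefix all collide -- the kind of weakness that
    // could broaden the effective scope of a row lock.
    public static int poorHash(byte[] rowKey) {
        return rowKey.length == 0 ? 0 : rowKey[0];
    }

    public ReentrantReadWriteLock lockFor(byte[] rowKey) {
        return locks.computeIfAbsent(poorHash(rowKey), h -> new ReentrantReadWriteLock());
    }

    public static void main(String[] args) {
        RowLockStriping striping = new RowLockStriping();
        byte[] rowA = "user123-profile".getBytes();
        byte[] rowB = "user456-counter".getBytes();
        // Distinct rows, but the poor hash collides on the shared 'u'
        // prefix: both rows are gated by the same lock object.
        System.out.println(striping.lockFor(rowA) == striping.lockFor(rowB)); // prints "true"
    }
}
```

In a model like this, a writer locking one row would also stall every reader and writer of any other row that hashes to the same entry.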
We speculated that clashes were broadening the scope of the lock, or that there was a scenario where we obtained the lock under one hash but a clashing hash somehow unlocked another, leaving locks in place. Interesting findings included the fact that hashing took longer the longer the row key was (the user's keys were 150-200 characters). When we found that the row key hash in 1.1.x was the same as in 1.2.x, that put an end to this particular pursuit.

h2. Compare

We expended a bunch of effort comparing locking in 1.0.x and 1.2.x. We measured how long lock acquisition was taking (fair vs unfair, exclusive vs read/write) and how many locks were being taken. It looked like 1.2.x was taking many more than 1.0.x. Comparing code bases, it looked like we'd dropped an optimization going to 1.2 here:

{code}
commit b15a05d7b611726ff9b2dd1a3c5c9448b41e1474
Author: Enis Soztutar <e...@apache.org>
Date:   Wed Oct 28 18:16:32 2015 -0700

    HBASE-14689 Addendum and unit test for HBASE-13471
{code}

... a patch that was subsequently messily reverted, then backported here:

{code}
commit bae2a970817d3693946c5b1d4e2382e56b1e94b8
Author: stack <st...@apache.org>
Date:   Wed Sep 30 11:48:46 2015 -0700

    HBASE-14465 Backport 'Allow rowlock to be reader/write' to branch-1

       // If we haven't got any rows in our batch, we should block to
       // get the next one.
-      boolean shouldBlock = numReadyToWrite == 0;
       RowLock rowLock = null;
       try {
-        rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
+        rowLock = getRowLock(mutation.getRow(), true);
       } catch (IOException ioe) {
         LOG.warn("Failed getting lock in batch put, row=" + Bytes.toStringBinary(mutation.getRow()), ioe);
       }
       if (rowLock == null) {
         // We failed to grab another lock
-        assert !shouldBlock : "Should never fail to get lock when blocking";
+        assert false : "Should never fail to get lock when blocking";
{code}

It turned out we were counting locking incorrectly, and there was a misunderstanding around lock aggregation (HBASE-18074).
We even compared running on different JVM versions.

h2. Improving Lock-access Time

The idea here was to improve the 1.2 locking regime. We did profiling and then experiments: refactors of lock acquisition making do with fewer Atomics and volatiles, and an investigation of row sorting so the lock could be taken once for a group of edits. We gave up on this route after seeing how little improvement we could make, and after figuring that this code, which originally came from FB, had already had a lot of thought applied, so pickings would be slim.

h1. Culprit: ReentrantReadWriteLock performance goes south if writes dominate

RRWL is from circa 2005 (JDK5?). There is no good to be had looking in the bug database for issues in 2017. But from the javadoc: "ReentrantReadWriteLocks can be used to improve concurrency in some uses of some kinds of Collections. This is typically worthwhile only when the collections are expected to be large, and entail operations with overhead that outweighs synchronization overhead." And: "...if updates become frequent then the data spends most of its time being exclusively locked and there is little, if any increase in concurrency." Or Ben Manes in https://groups.google.com/forum/#!topic/guava-discuss/Vqu34OODWdw: "Read/Write (shared) locks are expensive to maintain, much more so than exclusive locks. If the lock hold time is small then most often a regular lock (synchronized or ReentrantLock) is faster. It is only when the lock hold time is large does the benefit of a shared lock become apparent." This article on RWLock being AWFUL -- https://www.javacodegeeks.com/2014/06/java-8-stampedlocks-vs-readwritelocks-and-synchronized.html -- gets dinged in the comments section for its testing technique, but its setup was the only one that resembled the loading of the user under scrutiny (hundreds of concurrent threads with a good portion taking write locks); i.e. a read from hbase can turn into 1k Puts followed by 1k Increments (IIUC), all on the one row.
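To show the shape of the comparison rather than claim numbers, here is a minimal, hedged contention sketch -- not a rigorous benchmark and not HBase code. It drives an all-write load (every operation a read-modify-write increment on one shared counter, standing in for the single hot row) through a plain exclusive {{ReentrantLock}} and through a {{ReentrantReadWriteLock}}'s write lock; with no readers to share with, the RRWL buys no concurrency but still carries the shared-lock bookkeeping the javadoc warns about. All names are made up for this sketch:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Crude contention sketch: N threads each perform M "increments" on a
 * shared counter guarded by the supplied lock. Timings are illustrative
 * only; JIT warmup, safepoints, etc. are deliberately ignored.
 */
public class LockContentionSketch {
    public static long counter;

    public static long run(Lock lock, int threads, int incrementsPerThread)
            throws InterruptedException {
        counter = 0;
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    lock.lock();
                    try {
                        counter++;   // stand-in for the Increment's read-modify-write
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 16, increments = 5_000;
        // Fair locks, matching the ReentrantReadWriteLock$FairSync seen
        // parked in the thread dumps above.
        long exclusiveMs = run(new ReentrantLock(true), threads, increments);
        long rwWriteMs   = run(new ReentrantReadWriteLock(true).writeLock(), threads, increments);
        System.out.println("exclusive: " + exclusiveMs + "ms, rw-write: " + rwWriteMs + "ms");
        // Both regimes produce the same correct count; only the cost differs.
        System.out.println(counter == (long) threads * increments); // prints "true"
    }
}
```

Whether the write lock actually loses, and by how much, depends on thread count, fairness, and hold time -- which is consistent with our simulations disagreeing with what the user's production showed.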
So, we decided to try forward-porting the old mechanism -- an exclusive (reentrant) lock rather than a RRWL as our locking basis. Testing in my setup, the exclusive lock seemed to perform worse, but deployed at the user's production, it does much better than the 1.2.x default locking regime. Let me forward-port the old locking mechanism in case anyone else has a workload like the above user's.

> Forward-port the old exclusive row lock; there are scenarios where it
> performs better
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-18144
>                 URL: https://issues.apache.org/jira/browse/HBASE-18144
>             Project: HBase
>          Issue Type: Bug
>          Components: Increment
>    Affects Versions: 1.2.5
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0, 1.3.2, 1.2.7
>
>
> Description to follow.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)