[
https://issues.apache.org/jira/browse/HBASE-14689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015279#comment-16015279
]
stack commented on HBASE-14689:
-------------------------------
[~enis] I'm having an issue where branch-1.2, with lots of concurrent increments and
doMiniBatchMutation on the same row, can back up on tryLock. Throughput goes to
almost zero. I see this sort of thing:
{code}
"RpcServer.FifoWFPBQ.default.handler=190,queue=10,port=60020" #243 daemon prio=5 os_prio=0 tid=0x00007fbb58691800 nid=0x2d2527 waiting on condition [0x00007fbb4ca49000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000007c6001b38> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
    at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
    at org.apache.hadoop.hbase.regionserver.HRegion.doIncrement(HRegion.java:7453)
    ...
{code}
and
{code}
"RpcServer.FifoWFPBQ.default.handler=180,queue=0,port=60020" #233 daemon prio=5 os_prio=0 tid=0x00007fbb586ed800 nid=0x2d251d waiting on condition [0x00007fbb4d453000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000354976c00> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
    at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
    at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3017)
    ...
{code}
At the extreme, the thread dump is all threads doing tryLock.
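Both dumps are parked in a timed tryLock on a fair ReentrantReadWriteLock, one on the
write lock (doIncrement) and one on the read lock (doMiniBatchMutation). For
illustration only, here is a standalone toy of that kind of timed tryLock traffic
against a single fair row lock; this is not HBase code, and the thread counts,
timeouts, and names are made up:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RowLockContentionSketch {
  public static void main(String[] args) throws Exception {
    // Fair, as in the thread dumps above.
    ReentrantReadWriteLock rowLock = new ReentrantReadWriteLock(true);
    LongAdder acquired = new LongAdder();
    LongAdder timedOut = new LongAdder();

    // Half the threads take the write lock (increment-like path), half the read
    // lock (batch-mutation-like path); both use a timed tryLock.
    Thread[] handlers = new Thread[200];
    for (int i = 0; i < handlers.length; i++) {
      Lock lock = (i % 2 == 0) ? rowLock.writeLock() : rowLock.readLock();
      handlers[i] = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            if (lock.tryLock(100, TimeUnit.MILLISECONDS)) {
              try {
                acquired.increment();   // "did the mutation"
              } finally {
                lock.unlock();
              }
            } else {
              timedOut.increment();     // where the parked handlers in the dump end up
            }
          } catch (InterruptedException e) {
            return;
          }
        }
      }, "handler-" + i);
      handlers[i].start();
    }

    Thread.sleep(5000);
    for (Thread t : handlers) t.interrupt();
    for (Thread t : handlers) t.join();
    System.out.println("acquired=" + acquired.sum() + ", timedOut=" + timedOut.sum());
  }
}
{code}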
Looking at the change made by this issue:
{code}
  ...
        // If we haven't got any rows in our batch, we should block to
        // get the next one.
-       boolean shouldBlock = numReadyToWrite == 0;
        RowLock rowLock = null;
        try {
-         rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
+         rowLock = getRowLock(mutation.getRow(), true);
        } catch (IOException ioe) {
          LOG.warn("Failed getting lock in batch put, row="
            + Bytes.toStringBinary(mutation.getRow()), ioe);
        }
        if (rowLock == null) {
          // We failed to grab another lock
  ..
{code}
... it looks to me like we dropped an optimization that had us get the lock once;
instead we now take a lock for EACH mutation in the batch, where before we just
got the row lock with the first one. Post HBASE-12751 we are taking read locks
only, but they still have a cost, and the sheer number of lock acquisitions
probably makes the accounting take longer.
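For illustration, here is a toy version of the per-row (rather than per-mutation)
locking I mean. It is a standalone sketch, not the HRegion code; the class and
method names are made up, and a plain fair ReentrantReadWriteLock stands in for
the real row lock:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PerRowBatchLockSketch {
  // Stand-in for the region's per-row lock table.
  private final Map<String, ReentrantReadWriteLock> rowLocks = new HashMap<>();

  private ReentrantReadWriteLock lockFor(String row) {
    return rowLocks.computeIfAbsent(row, r -> new ReentrantReadWriteLock(true));
  }

  /** Applies a mini-batch, acquiring each distinct row's read lock exactly once. */
  public void applyBatch(List<String> mutationRows) {
    Map<String, ReentrantReadWriteLock> held = new HashMap<>();
    try {
      for (String row : mutationRows) {
        // One acquisition per distinct row in the batch, not one per mutation.
        held.computeIfAbsent(row, r -> {
          ReentrantReadWriteLock lock = lockFor(r);
          lock.readLock().lock();
          return lock;
        });
        // ... apply this mutation under the already-held row lock ...
      }
    } finally {
      for (ReentrantReadWriteLock lock : held.values()) {
        lock.readLock().unlock();
      }
    }
  }

  public static void main(String[] args) {
    PerRowBatchLockSketch region = new PerRowBatchLockSketch();
    // A batch of 1000 increments to the same row hits the row lock once, not 1000 times.
    List<String> batch = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      batch.add("row-1");
    }
    region.applyBatch(batch);
    System.out.println("applied " + batch.size() + " mutations");
  }
}
{code}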
I'll dig more tomorrow. You've probably forgotten what this issue was all about,
but I'm dropping a comment here in the meantime in case you have any remarks.
Thanks.
> Addendum and unit test for HBASE-13471
> --------------------------------------
>
> Key: HBASE-14689
> URL: https://issues.apache.org/jira/browse/HBASE-14689
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3, 0.98.16, 0.98.17
>
> Attachments: hbase-14689-after-revert.patch,
> hbase-14689-after-revert.patch, hbase-14689_v1-branch-1.1.patch,
> hbase-14689_v1-branch-1.1.patch, hbase-14689_v1.patch
>
>
> One of our customers ran into HBASE-13471, which resulted in all the handlers
> getting blocked and various other issues. While backporting the issue, I
> noticed that there is one more case where we might go into infinite loop. In
> case a row lock cannot be acquired (due to a previous leak for example which
> we have seen in Phoenix before) this will cause similar infinite loop.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)