[ https://issues.apache.org/jira/browse/HBASE-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051701#comment-16051701 ]

Allan Yang commented on HBASE-18144:
------------------------------------

Hi [~stack], after a lot of debugging and logging, I finally figured out why a 
disordered batch causes this situation. 
For example (the UT in DisorderedBatchAndIncrementUT.patch):
handler 1 is doing a batch put of rows (1,2,3,4,5,6,7,8,9). At the same time, 
handler 4 is doing a batch put of the same rows but in reversed order (9,8,7,6,5,4,3,2,1). 
1. handler 1 has acquired the read locks for rows 1,2,3,4,5,6,7 and is about to try 
row 8's read lock
2. handler 4 has acquired the read locks for rows 9,8,7,6,5,4,3 and is about to try 
row 2's read lock
3. At the same time, handler 0 is serving a request to increment row 2; it needs 
row 2's write lock, but it has to wait because handler 1 already holds row 2's 
read lock (handler 0 blocked)
4. Since handler 0 is waiting for row 2's write lock, handler 4's attempt to take 
row 2's read lock has to queue behind it (handler 4 blocked)
5. At the same time, handler 3 is serving a request to increment row 8; it needs 
row 8's write lock, but it has to wait because handler 4 already holds row 8's 
read lock (handler 3 blocked)
6. Since handler 3 is waiting for row 8's write lock, handler 1's attempt to take 
row 8's read lock has to queue behind it (handler 1 blocked)

At this point, handlers 0, 1, 3 and 4 are all blocked!!!! Nothing moves until one 
thread times out after rowLockWaitDuration:
{code}
if (!result.getLock().tryLock(this.rowLockWaitDuration, TimeUnit.MILLISECONDS)) {
  if (traceScope != null) {
    traceScope.getSpan().addTimelineAnnotation("Failed to get row lock");
  }
  result = null;
  // Clean up the counts just in case this was the thing keeping the context alive.
  rowLockContext.cleanUp();
  throw new IOException("Timed out waiting for lock for row: " + rowKey);
}
{code}
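
To make the blocking in steps 4 and 6 concrete, here is a standalone sketch (plain java.util.concurrent locks, not HBase's RowLockContext) showing that once a writer is queued on a ReentrantReadWriteLock, a later timed tryLock on the read lock waits behind it instead of sharing the lock with the reader that already holds it:
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksReaders {
  public static void main(String[] args) throws Exception {
    ReentrantReadWriteLock rowLock = new ReentrantReadWriteLock();
    CountDownLatch readerHoldsLock = new CountDownLatch(1);

    // "handler 1": takes the row's read lock and keeps it for the whole test.
    Thread firstBatch = new Thread(() -> {
      rowLock.readLock().lock();
      readerHoldsLock.countDown();
      try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
    });
    firstBatch.setDaemon(true);
    firstBatch.start();
    readerHoldsLock.await();

    // "handler 0": an increment asking for the write lock; it queues behind the reader.
    Thread increment = new Thread(() -> rowLock.writeLock().lock());
    increment.setDaemon(true);
    increment.start();
    Thread.sleep(200); // give the writer time to enqueue

    // "handler 4": another batch trying the same row's read lock from a different thread.
    // Although a read lock is currently held, the queued writer makes this attempt wait
    // until the timeout expires, just like the rowLockWaitDuration timeout shown above.
    boolean gotReadLock = rowLock.readLock().tryLock(1, TimeUnit.SECONDS);
    System.out.println("second reader acquired the read lock? " + gotReadLock); // false
  }
}
{code}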

So, if the rows of every batch are sorted before locking, all handlers acquire the row locks in the same order and there will be no such problem!
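
For illustration only (a hypothetical class, not the HBASE-17924 patch itself), sorting each batch by row key before taking the locks gives every handler the same global acquisition order, so the circular wait above cannot form:
{code}
import java.util.Arrays;
import java.util.Comparator;

public class SortedBatchOrder {
  // Compare row keys as unsigned bytes, i.e. the lexicographic order HBase uses for rows.
  static final Comparator<byte[]> ROW_ORDER = (a, b) -> {
    for (int i = 0, n = Math.min(a.length, b.length); i < n; i++) {
      int cmp = (a[i] & 0xff) - (b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  };

  public static void main(String[] args) {
    byte[][] handler1Batch = rows("1", "2", "3", "4", "5", "6", "7", "8", "9");
    byte[][] handler4Batch = rows("9", "8", "7", "6", "5", "4", "3", "2", "1");

    Arrays.sort(handler1Batch, ROW_ORDER);
    Arrays.sort(handler4Batch, ROW_ORDER);

    // Both handlers now try the row locks in the identical order 1..9, so one of them
    // simply queues behind the other instead of forming a cycle with the incrementers.
    System.out.println(Arrays.deepEquals(handler1Batch, handler4Batch)); // true
  }

  private static byte[][] rows(String... keys) {
    byte[][] out = new byte[keys.length][];
    for (int i = 0; i < keys.length; i++) {
      out[i] = keys[i].getBytes();
    }
    return out;
  }
}
{code}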

**Why branch-1.1 doesn't have this kind of problem**
This is because it simply doesn't wait for the lock:
{code}
        // If we haven't got any rows in our batch, we should block to
        // get the next one.
        boolean shouldBlock = numReadyToWrite == 0;
        RowLock rowLock = null;
        try {
          rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
        } catch (IOException ioe) {
          LOG.warn("Failed getting lock in batch put, row="
            + Bytes.toStringBinary(mutation.getRow()), ioe);
        }
{code}
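
In other words (a hypothetical paraphrase using a plain ReentrantLock, not the actual branch-1.1 internals), the handler only blocks for a row lock while its mini-batch is still empty; otherwise it tries once and skips the row, so it never parks long enough to join the cycle above:
{code}
import java.util.concurrent.locks.ReentrantLock;

final class NonBlockingRowLocking {
  // Hypothetical paraphrase of the branch-1.1 behaviour quoted above: block only while
  // the mini-batch has nothing to write yet (shouldBlock == true); otherwise try once
  // and let the caller skip the row, to be retried in a later mini-batch.
  static boolean acquire(ReentrantLock rowLock, boolean shouldBlock) {
    if (shouldBlock) {
      rowLock.lock();           // numReadyToWrite == 0: wait so the handler makes progress
      return true;
    }
    return rowLock.tryLock();   // rows already queued: do not wait, skip on failure
  }
}
{code}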

**Conclusion**
1. Commit patch HBASE-17924 to branch-1.2.
2. We shouldn't wait for the lock in doMiniBatchMutation (like branch-1.1 did); I will 
open another issue to discuss this.

> Forward-port the old exclusive row lock; there are scenarios where it 
> performs better
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-18144
>                 URL: https://issues.apache.org/jira/browse/HBASE-18144
>             Project: HBase
>          Issue Type: Bug
>          Components: Increment
>    Affects Versions: 1.2.5
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0, 1.3.2, 1.2.7
>
>         Attachments: DisorderedBatchAndIncrementUT.patch, 
> HBASE-18144.master.001.patch
>
>
> Description to follow.


