[ https://issues.apache.org/jira/browse/HBASE-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051701#comment-16051701 ]
Allan Yang commented on HBASE-18144:
------------------------------------

Hi [~stack], after a lot of debugging and logging, I finally figured out why a disordered batch causes this situation. For example (the UT in DisorderedBatchAndIncrementUT.patch): handler 1 is doing a batch put of rows (1,2,3,4,5,6,7,8,9). At the same time, handler 4 is doing a batch put of the same rows in reversed order (9,8,7,6,5,4,3,2,1).

1. Handler 1 has acquired the read locks for rows 1,2,3,4,5,6,7 and is about to try row 8's read lock.
2. Handler 4 has acquired the read locks for rows 9,8,7,6,5,4,3 and is about to try row 2's read lock.
3. Meanwhile, handler 0 is serving a request to increment row 2. It needs row 2's write lock, but it has to wait, since handler 1 already holds row 2's read lock (handler 0 blocked).
4. Since handler 0 is queued for row 2's write lock, handler 4's attempt to take row 2's read lock has to wait behind it (handler 4 blocked).
5. Meanwhile, handler 3 is serving a request to increment row 8. It needs row 8's write lock, but it has to wait, since handler 4 already holds row 8's read lock (handler 3 blocked).
6. Since handler 3 is queued for row 8's write lock, handler 1's attempt to take row 8's read lock has to wait behind it (handler 1 blocked).

At this point, handlers 0, 1, 3, and 4 are all blocked! They stay blocked until one thread times out after rowLockWaitDuration:
{code}
if (!result.getLock().tryLock(this.rowLockWaitDuration, TimeUnit.MILLISECONDS)) {
  if (traceScope != null) {
    traceScope.getSpan().addTimelineAnnotation("Failed to get row lock");
  }
  result = null;
  // Clean up the counts just in case this was the thing keeping the context alive.
  rowLockContext.cleanUp();
  throw new IOException("Timed out waiting for lock for row: " + rowKey);
}
{code}
So, if all batches are sorted, there is no such problem!

**Why branch-1.1 doesn't have this kind of problem**

Because it simply doesn't wait for the lock:
{code}
// If we haven't got any rows in our batch, we should block to
// get the next one.
boolean shouldBlock = numReadyToWrite == 0;
RowLock rowLock = null;
try {
  rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
} catch (IOException ioe) {
  LOG.warn("Failed getting lock in batch put, row="
      + Bytes.toStringBinary(mutation.getRow()), ioe);
}
{code}

**Conclusion**
1. Commit patch HBASE-17924 to branch-1.2.
2. We shouldn't wait for the lock in doMiniBatchMutation (as branch-1.1 does); will open another issue to discuss.

> Forward-port the old exclusive row lock; there are scenarios where it
> performs better
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-18144
>                 URL: https://issues.apache.org/jira/browse/HBASE-18144
>             Project: HBase
>          Issue Type: Bug
>          Components: Increment
>    Affects Versions: 1.2.5
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0, 1.3.2, 1.2.7
>
>         Attachments: DisorderedBatchAndIncrementUT.patch,
>                      HBASE-18144.master.001.patch
>
>
> Description to follow.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
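The "sort the batches" argument above can be sketched as a standalone simulation using plain `java.util.concurrent` locks rather than HBase internals. Everything here is made up for illustration: the class name `SortedLockDemo`, integer row keys, the per-row lock map, and the iteration counts. Fair-mode `ReentrantReadWriteLock` is used because fair queuing is what makes new readers wait behind a queued writer, which is the ingredient of the cycle described in steps 3-6; with every batch acquiring its locks in sorted order, no cycle of waits can form and all four "handlers" run to completion.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SortedLockDemo {

    // One fair read/write lock per row key. Fair mode makes an arriving
    // reader queue behind an already-waiting writer, mirroring the
    // blocking described for handlers 4 and 1 above.
    static final ConcurrentHashMap<Integer, ReentrantReadWriteLock> LOCKS =
            new ConcurrentHashMap<>();

    static ReentrantReadWriteLock lockFor(int row) {
        return LOCKS.computeIfAbsent(row, r -> new ReentrantReadWriteLock(true));
    }

    // A "batch put": take the read lock of every row, then release them.
    // The essential fix is the Arrays.sort call: every batch acquires its
    // locks in the same global order, so no batch can hold a lock that
    // another batch is waiting for while also waiting on that batch.
    static void batchPut(int[] rows) {
        int[] sorted = rows.clone();
        Arrays.sort(sorted);
        for (int row : sorted) lockFor(row).readLock().lock();
        try {
            // ... apply the mutations here ...
        } finally {
            for (int row : sorted) lockFor(row).readLock().unlock();
        }
    }

    // An "increment": take the write lock of a single row. A writer only
    // ever holds one lock, so it can never be part of a wait cycle.
    static void increment(int row) {
        ReentrantReadWriteLock l = lockFor(row);
        l.writeLock().lock();
        try {
            // ... read-modify-write the row here ...
        } finally {
            l.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] forward = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        int[] reverse = {9, 8, 7, 6, 5, 4, 3, 2, 1};
        // The four handlers from the scenario above.
        Thread h1 = new Thread(() -> { for (int i = 0; i < 500; i++) batchPut(forward); });
        Thread h4 = new Thread(() -> { for (int i = 0; i < 500; i++) batchPut(reverse); });
        Thread h0 = new Thread(() -> { for (int i = 0; i < 500; i++) increment(2); });
        Thread h3 = new Thread(() -> { for (int i = 0; i < 500; i++) increment(8); });
        Thread[] handlers = {h1, h4, h0, h3};
        for (Thread t : handlers) t.start();
        for (Thread t : handlers) t.join(30_000);
        boolean done = true;
        for (Thread t : handlers) done &= !t.isAlive();
        System.out.println(done ? "no deadlock" : "deadlock");
    }
}
```

Removing the `Arrays.sort` line restores the disordered acquisition order, and the run can then hang exactly as in the four-handler trace above.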