[
https://issues.apache.org/jira/browse/HBASE-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051701#comment-16051701
]
Allan Yang edited comment on HBASE-18144 at 6/16/17 10:01 AM:
--------------------------------------------------------------
Hi [~stack], after a lot of debugging and logging, I finally figured out why a
disordered batch causes this situation.
For example (see the UT in DisorderedBatchAndIncrementUT.patch):
*handler 1* is doing a batch put of rows (1,2,3,4,5,6,7,8,9). At the same time,
*handler 4* is doing a batch put of the same rows but with the keys reversed (9,8,7,6,5,4,3,2,1).
1. *handler 1* has got the read lock for rows 1,2,3,4,5,6,7 and is about to try row
8's read lock
2. *handler 4* has got the read lock for rows 9,8,7,6,5,4,3 and is about to try row
2's read lock
3. At the same time, *handler 0* is serving a request to increment row 2. It
needs row 2's write lock, but it has to wait because *handler 1* already holds
row 2's read lock (*handler 0* blocked)
4. Since *handler 0* is waiting for row 2's write lock, *handler 4*'s attempt
to take row 2's read lock has to wait behind it (*handler 4* blocked)
5. At the same time, *handler 3* is serving a request to increment row 8. It
needs row 8's write lock, but it has to wait because *handler 4* already holds
row 8's read lock (*handler 3* blocked)
6. Since *handler 3* is waiting for row 8's write lock, *handler 1*'s attempt to
take row 8's read lock has to wait behind it (*handler 1* blocked)
At this point, handlers 0, 1, 3, and 4 are all blocked!!!! They stay blocked until
one of them times out after rowLockWaitDuration:
{code}
if (!result.getLock().tryLock(this.rowLockWaitDuration, TimeUnit.MILLISECONDS)) {
  if (traceScope != null) {
    traceScope.getSpan().addTimelineAnnotation("Failed to get row lock");
  }
  result = null;
  // Clean up the counts just in case this was the thing keeping the context alive.
  rowLockContext.cleanUp();
  throw new IOException("Timed out waiting for lock for row: " + rowKey);
}
{code}
So, if all batches are sorted, there will be no such problem!
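To make steps 4 and 6 concrete, here is a minimal standalone sketch (plain JDK, not HBase code; I'm only assuming the row lock behaves like the default non-fair {{ReentrantReadWriteLock}} it is built on in this branch) showing that a reader's timed tryLock does not barge past a writer that is already queued, even though another reader still holds the lock:
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReaderBehindQueuedWriter {
  public static void main(String[] args) throws Exception {
    ReentrantReadWriteLock rowLock = new ReentrantReadWriteLock();
    CountDownLatch readHeld = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // "handler 1": takes the row's read lock and keeps holding it.
    Thread firstReader = new Thread(() -> {
      rowLock.readLock().lock();
      readHeld.countDown();
      try {
        release.await();
      } catch (InterruptedException ignored) {
      } finally {
        rowLock.readLock().unlock();
      }
    });
    firstReader.start();
    readHeld.await();

    // "handler 0": an increment wants the write lock; it queues behind the reader.
    Thread writer = new Thread(() -> {
      rowLock.writeLock().lock();
      rowLock.writeLock().unlock();
    });
    writer.start();
    while (!rowLock.hasQueuedThread(writer)) {
      Thread.sleep(10); // wait until the writer is really parked in the lock's queue
    }

    // "handler 4": a different thread now tries the read lock with a timeout.
    // A writer is already queued, so the timed tryLock waits and finally gives up,
    // just like steps 4 and 6 above before rowLockWaitDuration expires.
    boolean got = rowLock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
    System.out.println("second reader acquired read lock? " + got); // prints false

    release.countDown(); // let "handler 1" drop its read lock so everyone finishes
    firstReader.join();
    writer.join();
  }
}
{code}
Running this prints {{false}}: the second reader waits out the whole timeout exactly like handler 4 waits on row 2, and with two batches in opposite key order you get two such chains pointing at each other.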
*Why branch-1.1 doesn't have this kind of problem*
This is because it simply doesn't wait for the lock!
{code}
// If we haven't got any rows in our batch, we should block to
// get the next one.
boolean shouldBlock = numReadyToWrite == 0;
RowLock rowLock = null;
try {
  rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
} catch (IOException ioe) {
  LOG.warn("Failed getting lock in batch put, row="
    + Bytes.toStringBinary(mutation.getRow()), ioe);
}
{code}
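Just to illustrate the idea (a hypothetical sketch, not the actual branch-1.1 {{getRowLockInternal}}, which uses the old latch-based exclusive lock): when acquisition is non-blocking, a contended row is simply skipped and left for a later mini-batch, so a handler never parks behind another handler and the circular wait above cannot form.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Purely illustrative: lock the batch's rows without waiting; a contended row is
// skipped and left for a later mini-batch, so no handler parks behind another.
public class NonBlockingRowLockSketch {

  static List<Lock> tryLockRows(List<ReentrantReadWriteLock> rows) {
    List<Lock> acquired = new ArrayList<>();
    for (ReentrantReadWriteLock row : rows) {
      Lock readLock = row.readLock();
      if (!readLock.tryLock()) {   // non-blocking: fails immediately if contended
        break;                     // stop; remaining rows go into the next mini-batch
      }
      acquired.add(readLock);
    }
    return acquired;
  }

  public static void main(String[] args) throws Exception {
    List<ReentrantReadWriteLock> rows =
        Arrays.asList(new ReentrantReadWriteLock(), new ReentrantReadWriteLock());
    CountDownLatch row2Held = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(1);

    // Another handler owns row 2's write lock (think of handler 0's increment).
    Thread otherHandler = new Thread(() -> {
      rows.get(1).writeLock().lock();
      row2Held.countDown();
      try {
        done.await();
      } catch (InterruptedException ignored) {
      } finally {
        rows.get(1).writeLock().unlock();
      }
    });
    otherHandler.start();
    row2Held.await();

    List<Lock> got = tryLockRows(rows);  // locks row 1, skips row 2 without waiting
    System.out.println("locked " + got.size() + " of " + rows.size() + " rows"); // 1 of 2
    got.forEach(Lock::unlock);

    done.countDown();
    otherHandler.join();
  }
}
{code}
Here the batch locks row 1, gives up on row 2 right away, and just processes what it has, which is, as far as I can tell, roughly what doMiniBatchMutation does when shouldBlock is false.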
*Conclusion*
1. Commit patch HBASE-17924 to branch-1.2
2. We shouldn't wait for the lock in doMiniBatchMutation (like branch-1.1 did);
I will open another issue to discuss this.
> Forward-port the old exclusive row lock; there are scenarios where it
> performs better
> -------------------------------------------------------------------------------------
>
> Key: HBASE-18144
> URL: https://issues.apache.org/jira/browse/HBASE-18144
> Project: HBase
> Issue Type: Bug
> Components: Increment
> Affects Versions: 1.2.5
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0, 1.3.2, 1.2.7
>
> Attachments: DisorderedBatchAndIncrementUT.patch,
> HBASE-18144.master.001.patch
>
>
> Description to follow.