apurtell commented on a change in pull request #2574:
URL: https://github.com/apache/hbase/pull/2574#discussion_r512965703
##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
##########
@@ -4569,13 +4665,29 @@ private void doMiniBatchMutate(BatchOperation<?> batchOp) throws IOException {
      // We should record the timestamp only after we have acquired the rowLock,
      // otherwise, newer puts/deletes/increment/append are not guaranteed to have a newer
      // timestamp
+
+     // Check for thread interrupt status in case we have been signaled from
+     // #interruptRegionOperation.
+     checkInterrupt();
+
      long now = EnvironmentEdgeManager.currentTime();
      batchOp.prepareMiniBatchOperations(miniBatchOp, now, acquiredRowLocks);
      // STEP 3. Build WAL edit
+
+     // Check for thread interrupt status in case we have been signaled from
+     // #interruptRegionOperation.
+     checkInterrupt();
+
      List<Pair<NonceKey, WALEdit>> walEdits = batchOp.buildWALEdits(miniBatchOp);
      // STEP 4. Append the WALEdits to WAL and sync.
+
+     // Check for thread interrupt status in case we have been signaled from
+     // #interruptRegionOperation. This is the last place we can do it "safely" before
+     // WAL appends.
+     checkInterrupt();
+
Review comment:
Since we are on the topic of WAL syncs...
We could protect against interrupting a sync using a couple of different
approaches. The simplest, in the context of this proposed change, would be to
remove the current handler thread from the regionLockHolders set whenever we
decide interrupting it is no longer the best choice, effectively marking a
point of no return on the mutation code path. For example:
    startRegionOperation(Operation.PUT); // add thread to the set so it is interruptible
    try {
      ...
      startWALOperation(); // remove thread from the set so it will not be interrupted
      try {
        ...
      } finally {
        endWALOperation(); // add thread to the set so it is eligible to be interrupted again
      }
      ...
    } finally {
      endRegionOperation(); // remove thread from the set as it is no longer actively handling RPC
    }
That said, we already have cases where a WAL sync can be cut short, so this
change does not introduce a new failure mode. In other words, if it is
problematic that a WAL sync can be aborted before it completes, that already
happens for other reasons: sync timeouts (presumably because HDFS is slow),
HDFS-level pipeline recovery failures, or server shutdown.
Thinking this through brings to mind the English expression "opening a can of
worms"; I don't know of a Chinese equivalent. I'm going to implement the
proposal above out of an abundance of caution, and we can decide on further
review whether it makes sense.
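For reference, here is roughly the kind of check I would expect the
checkInterrupt() calls in the diff above to perform. This is a sketch only;
the actual helper in this PR may differ, for example by also consulting the
region's closing/closed state:

    import java.io.InterruptedIOException;

    // Sketch only: one plausible shape for a checkInterrupt() helper.
    private void checkInterrupt() throws InterruptedIOException {
      // Thread.interrupted() clears the interrupt status as it reads it, so
      // the flag does not leak into later blocking calls such as the WAL sync.
      if (Thread.interrupted()) {
        throw new InterruptedIOException("Interrupted while processing a region operation");
      }
    }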
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]