apurtell commented on pull request #2574:
URL: https://github.com/apache/hbase/pull/2574#issuecomment-716699603
I have found one test where interrupt by default causes a repeatable
problem.
TestSyncReplicationActive
[ERROR] TestSyncReplicationActive.testActive:99
Expected: a string containing "only marker edit is allowed"
but: was "Failed after attempts=1, exceptions:
2020-10-25T02:46:59.367Z, java.io.InterruptedIOException:
java.io.InterruptedIOException
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.convertInterruptedExceptionToIOException(AbstractFSWAL.java:878)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:866)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:710)
at org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:9031)
at org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:8624)
at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4674)
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4594)
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4522)
at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4992)
at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4987)
at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4983)
at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3302)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.put(RSRpcServices.java:3031)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2994)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45251)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:397)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:855)
... 17 more
Thinking about what to do here, it occurred to me that we do not need to be so
greedy about interrupting handlers by default. The original motivation for
interrupting RPCs in flight was to address the case where we get stuck closing
the region, so we can be less aggressive and wait until we actually seem to be
stuck. In an early version of this patch, the tryLock was attempted in a loop
that would wait the entire configured wait interval before triggering the
abort. @bharathv rightly pointed out that the loop wasn't providing any
advantage, especially considering we crash the RS if interrupted, but I think
we should bring it back to do this:
    waitTime = <some significant fraction of total wait interval>
    do {
      start = EnvironmentEdgeManager.getCurrentTime();
      acquired = tryLock(waitTime, TimeUnit.MILLISECONDS);
      end = EnvironmentEdgeManager.getCurrentTime();
      totalWaitTime += end - start;
      waitTime -= end - start;
      if (!acquired) {
        interruptRegionOperations();
      }
    } while (!acquired && waitTime > 0);
This will cause us to begin interrupting region lock holders only after we have
already waited for some significant fraction of the total wait interval. This
also has the benefit (IMHO) of potentially issuing more than one interrupt to a
handler if an earlier interrupt was somehow swallowed by code we don't
control, such as a Hadoop library or the HDFS client.
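For illustration only, here is a minimal, self-contained Java sketch of the escalation loop described above. It is not the patch itself. The class name `CloseLockSketch`, the method `acquireWithEscalation`, its parameters, and the `Runnable` standing in for interruptRegionOperations() are all hypothetical, and a plain `ReentrantLock` with `System.nanoTime()` stands in for the region lock and EnvironmentEdgeManager. One assumption worth calling out: the sketch keeps the per-attempt timeout separate from the overall deadline. In the pseudocode, waitTime serves as both, so a single failed tryLock would use up the whole budget and leave no room for a second interrupt.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class CloseLockSketch {

  /**
   * Tries to take {@code lock} within {@code totalWaitMs}. Each attempt blocks for at
   * most {@code sliceMs}; after every failed attempt {@code interruptHolders} runs
   * (a hypothetical stand-in for interruptRegionOperations()).
   *
   * @return true if the lock was acquired before the overall deadline
   */
  static boolean acquireWithEscalation(Lock lock, long totalWaitMs, long sliceMs,
      Runnable interruptHolders) throws InterruptedException {
    long sliceNanos = TimeUnit.MILLISECONDS.toNanos(sliceMs);
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalWaitMs);
    boolean acquired;
    do {
      long remaining = deadline - System.nanoTime();
      // Never wait past the overall deadline. With a non-positive timeout,
      // tryLock still makes one attempt but does not block.
      acquired = lock.tryLock(Math.min(sliceNanos, remaining), TimeUnit.NANOSECONDS);
      if (!acquired) {
        // We have waited a full slice without progress: nudge the lock holders.
        interruptHolders.run();
      }
    } while (!acquired && deadline - System.nanoTime() > 0);
    return acquired;
  }

  public static void main(String[] args) throws Exception {
    Lock lock = new ReentrantLock();
    CountDownLatch held = new CountDownLatch(1);
    AtomicInteger interrupts = new AtomicInteger();

    // Simulated RPC handler: holds the lock until it is interrupted.
    Thread handler = new Thread(() -> {
      lock.lock();
      try {
        held.countDown();
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException e) {
        // Interrupted by the closer: fall through and release the lock.
      } finally {
        lock.unlock();
      }
    });
    handler.setDaemon(true);
    handler.start();
    held.await();

    // Wait up to 2 s in total, interrupting the handler after each 200 ms slice.
    boolean acquired = acquireWithEscalation(lock, 2000, 200, () -> {
      interrupts.incrementAndGet();
      handler.interrupt();
    });
    System.out.println("acquired=" + acquired + ", interrupts=" + interrupts.get());
  }
}
```

In this sketch an uncontended close never interrupts anyone, while a handler that swallows the first interrupt is interrupted again after each further slice until the overall deadline passes.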
We do want to interrupt even things like WAL append if the handler is
holding us up closing the region, but upon reflection I do not believe we
should be so aggressive as to proactively issue interrupts as soon as we
want to close.