Xiaolin Ha created HBASE-26552:
----------------------------------
Summary: Introduce retry to logroller when encounters IOException
Key: HBASE-26552
URL: https://issues.apache.org/jira/browse/HBASE-26552
Project: HBase
Issue Type: Improvement
Components: wal
Affects Versions: 2.0.0, 3.0.0-alpha-1
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha
When RollController#rollWal is called in AbstractWALRoller, the regionserver may
abort if an exception is thrown:
{code:java}
...
} catch (FailedLogCloseException | ConnectException e) {
  abort("Failed log close in log roller", e);
} catch (IOException ex) {
  // Abort if we get here. We probably won't recover an IOE. HBASE-1132
  abort("IOE in log roller",
    ex instanceof RemoteException ? ((RemoteException) ex).unwrapRemoteException() : ex);
} catch (Exception ex) {
  LOG.error("Log rolling failed", ex);
  abort("Log rolling failed", ex);
}
{code}
I think we should support retrying rollWal here, so that we do not have to
recover the service by killing the regionserver. Restarting a regionserver is
costly and hurts availability.
When a new writer is created for the WAL in
FanOutOneBlockAsyncDFSOutputHelper#createOutput, addBlock is already retried,
and this is configured by "hbase.fs.async.create.retries".
But initializing a new WAL writer also includes flushing the write buffer and
waiting for that flush to complete in
AsyncProtobufLogWriter#writeMagicAndWALHeader, which can also fail for
hardware reasons. The regionserver has connected to the datanodes after
addBlock, but that does not mean the magic and header can be flushed successfully.
{code:java}
protected long writeMagicAndWALHeader(byte[] magic, WALHeader header) throws IOException {
  return write(future -> {
    output.write(magic);
    try {
      header.writeDelimitedTo(asyncOutputWrapper);
    } catch (IOException e) {
      // should not happen
      throw new AssertionError(e);
    }
    addListener(output.flush(false), (len, error) -> {
      if (error != null) {
        future.completeExceptionally(error);
      } else {
        future.complete(len);
      }
    });
  });
}
{code}
In our production clusters, we have seen regionservers abort because of
"IOE in log roller". In our experience, just one more retry of rollWal is
enough for the WAL roll to complete and for the regionserver to continue
serving.
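As a rough illustration of the proposal (not a patch), a bounded retry around the roll could look like the sketch below. RollRetrySketch, RollAction, rollWithRetry, maxAttempts and pauseMs are all hypothetical names chosen for this example and are not existing HBase APIs or configs. The sketch retries on any IOException; a real change would probably need to decide which exception types are worth retrying.

```java
import java.io.IOException;

public class RollRetrySketch {

  /** Hypothetical stand-in for the roll operation; not an HBase interface. */
  @FunctionalInterface
  public interface RollAction {
    void roll() throws IOException;
  }

  /**
   * Runs the action up to maxAttempts times, sleeping pauseMs between attempts.
   * Returns the number of the attempt that succeeded. If every attempt fails,
   * the last IOException is rethrown so the caller can still abort as before.
   */
  public static int rollWithRetry(RollAction action, int maxAttempts, long pauseMs)
      throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        action.roll();
        return attempt;
      } catch (IOException e) {
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(pauseMs);
          } catch (InterruptedException ie) {
            // Preserve the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting to retry WAL roll", ie);
          }
        }
      }
    }
    // Retries exhausted: surface the last failure to the caller.
    throw last;
  }
}
```

With a shape like this, the existing abort("IOE in log roller", ...) path would only be reached after rollWithRetry has exhausted its attempts, for example with maxAttempts set to 2 to get the single extra retry described above.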