[ https://issues.apache.org/jira/browse/HBASE-26552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502337#comment-17502337 ]
Hudson commented on HBASE-26552:
--------------------------------
Results for branch master
[build #528 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/528/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/528/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/528/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Introduce retry to logroller to avoid abort
> -------------------------------------------
>
> Key: HBASE-26552
> URL: https://issues.apache.org/jira/browse/HBASE-26552
> Project: HBase
> Issue Type: Improvement
> Components: wal
> Affects Versions: 3.0.0-alpha-2, 2.4.10
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.11
>
>
> When calling RollController#rollWal in AbstractWALRoller, the regionserver
> may abort when it encounters an exception:
> {code:java}
> ...
> } catch (FailedLogCloseException | ConnectException e) {
>   abort("Failed log close in log roller", e);
> } catch (IOException ex) {
>   // Abort if we get here. We probably won't recover an IOE. HBASE-1132
>   abort("IOE in log roller",
>     ex instanceof RemoteException ? ((RemoteException) ex).unwrapRemoteException() : ex);
> } catch (Exception ex) {
>   LOG.error("Log rolling failed", ex);
>   abort("Log rolling failed", ex);
> }
> {code}
> I think we should support retrying rollWal here, to avoid recovering the
> service by killing the regionserver; restarting a regionserver is costly and
> very unfriendly to availability.
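> A minimal sketch of the idea, assuming a bounded retry budget; the config key
> "hbase.regionserver.logroll.retries" and the default below are illustrative
> assumptions for this example, not the committed implementation:
> {code:java}
> // Sketch only: retry rollWal a bounded number of times inside the roller
> // loop, and abort only once the retries are exhausted.
> int maxRetries = conf.getInt("hbase.regionserver.logroll.retries", 1); // assumed key
> for (int attempt = 0;; attempt++) {
>   try {
>     controller.rollWal(EnvironmentEdgeManager.currentTime());
>     break; // roll succeeded, the regionserver keeps serving
>   } catch (IOException ioe) {
>     if (attempt >= maxRetries) {
>       // Out of retries: fall back to the existing behavior and abort.
>       abort("IOE in log roller", ioe instanceof RemoteException
>         ? ((RemoteException) ioe).unwrapRemoteException() : ioe);
>       break;
>     }
>     LOG.warn("WAL roll failed, will retry, attempt={}", attempt, ioe);
>   }
> }
> {code}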
> I find that when creating a new writer for the WAL in
> FanOutOneBlockAsyncDFSOutputHelper#createOutput, retrying addBlock is already
> supported via the config "hbase.fs.async.create.retries". The idea of retrying
> the WAL roll is similar: both try their best to make the roll succeed.
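> For reference, that existing addBlock retry budget is an ordinary HBase
> configuration value; a small example of raising it (the value 10 is only an
> example, not a recommendation):
> {code:java}
> // Example only: org.apache.hadoop.conf.Configuration and
> // org.apache.hadoop.hbase.HBaseConfiguration are the standard entry points.
> Configuration conf = HBaseConfiguration.create();
> conf.setInt("hbase.fs.async.create.retries", 10);
> {code}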
> But the initialization of a new WAL writer also includes flushing the write
> buffer and waiting for that flush to complete, in
> AsyncProtobufLogWriter#writeMagicAndWALHeader, and this step can also fail for
> hardware reasons. The regionserver is connected to the datanodes after
> addBlock, but that does not mean the magic and header can be flushed
> successfully; a writer-level retry sketch follows the snippet below.
> {code:java}
> protected long writeMagicAndWALHeader(byte[] magic, WALHeader header) throws IOException {
>   return write(future -> {
>     output.write(magic);
>     try {
>       header.writeDelimitedTo(asyncOutputWrapper);
>     } catch (IOException e) {
>       // should not happen
>       throw new AssertionError(e);
>     }
>     addListener(output.flush(false), (len, error) -> {
>       if (error != null) {
>         future.completeExceptionally(error);
>       } else {
>         future.complete(len);
>       }
>     });
>   });
> }
> {code}
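> So a retry needs to cover the whole writer initialization, not only addBlock.
> A sketch under that assumption; createWriterInstance is assumed to be the step
> that creates the output and writes the magic and header, and the wrapper name
> createWriterWithRetries is hypothetical:
> {code:java}
> // Sketch only: re-create the whole writer when initialization fails, since a
> // failed header flush also surfaces as an IOException from writer creation.
> private Writer createWriterWithRetries(Path path, int maxRetries) throws IOException {
>   IOException lastException = null;
>   for (int attempt = 0; attempt <= maxRetries; attempt++) {
>     try {
>       return createWriterInstance(path); // creates output, writes magic + header
>     } catch (IOException e) {
>       lastException = e;
>       LOG.warn("Failed to initialize WAL writer, attempt={}", attempt, e);
>     }
>   }
>   throw lastException;
> }
> {code}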
> We have found that in our production clusters regionservers do abort because
> of "IOE in log roller", and in practice just one more retry of rollWal is
> enough to complete the WAL roll and continue serving.
>
>