[jira] [Commented] (HBASE-26552) Introduce retry to logroller to avoid abort

Hudson (Jira) Mon, 07 Mar 2022 23:46:12 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-26552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502777#comment-17502777
 ]


Hudson commented on HBASE-26552:
--------------------------------

Results for branch branch-2.4
        [build #302 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/302/]:
 (/) *{color:green}+1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/302/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/302/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/302/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/302/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Introduce retry to logroller to avoid abort
> -------------------------------------------
>
>                 Key: HBASE-26552
>                 URL: https://issues.apache.org/jira/browse/HBASE-26552
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>    Affects Versions: 3.0.0-alpha-2, 2.4.10
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.11
>
>
> When AbstractWALRoller calls RollController#rollWal, the regionserver 
> may abort when it encounters an exception:
> {code:java}
> ...
> } catch (FailedLogCloseException | ConnectException e) {
>   abort("Failed log close in log roller", e);
> } catch (IOException ex) {
>   // Abort if we get here. We probably won't recover an IOE. HBASE-1132
>   abort("IOE in log roller",
>     ex instanceof RemoteException ? ((RemoteException) ex).unwrapRemoteException() : ex);
> } catch (Exception ex) {
>   LOG.error("Log rolling failed", ex);
>   abort("Log rolling failed", ex);
> } {code}
> I think we should support retrying rollWal here, so that the service does 
> not have to be recovered by killing the regionserver. Restarting a 
> regionserver is costly and hurts availability.
> When creating a new writer for the WAL, 
> FanOutOneBlockAsyncDFSOutputHelper#createOutput already supports retrying 
> addBlock through the config "hbase.fs.async.create.retries". Retrying the 
> WAL roll follows the same idea: both make a best effort to let the WAL roll 
> succeed. 
> But initializing the new WAL writer also includes flushing the write 
> buffer and waiting for that flush to complete in 
> AsyncProtobufLogWriter#writeMagicAndWALHeader, which can also fail for 
> hardware reasons. The regionserver is connected to the datanodes after 
> addBlock, but that does not mean the magic and header can be flushed 
> successfully.
> {code:java}
> protected long writeMagicAndWALHeader(byte[] magic, WALHeader header)
>     throws IOException {
>   return write(future -> {
>     output.write(magic);
>     try {
>       header.writeDelimitedTo(asyncOutputWrapper);
>     } catch (IOException e) {
>       // should not happen
>       throw new AssertionError(e);
>     }
>     addListener(output.flush(false), (len, error) -> {
>       if (error != null) {
>         future.completeExceptionally(error);
>       } else {
>         future.complete(len);
>       }
>     });
>   });
> }{code}
> In our production clusters we have seen regionservers abort with 
> "IOE in log roller". In practice, just one more retry of rollWal can let 
> the WAL roll complete and the regionserver continue serving.
>  
>  
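
The retry idea described in the issue can be sketched as a bounded loop around the roll call. This is only an illustration under stated assumptions, not the change that was committed for HBASE-26552: the class RollRetrySketch, the WalRoll interface, and the maxAttempts and pauseMillis parameters are all hypothetical names, and the real roller may treat some exception types differently.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these names are taken from the HBase code base. */
final class RollRetrySketch {

  /** Stand-in for the roll operation (RollController#rollWal in the issue). */
  @FunctionalInterface
  interface WalRoll {
    void roll() throws IOException;
  }

  /**
   * Runs the roll up to maxAttempts times, pausing between failed attempts.
   * Returns the number of attempts used. If every attempt fails, the last
   * IOException is rethrown so the caller can still decide to abort.
   */
  static int rollWithRetry(WalRoll roll, int maxAttempts, long pauseMillis)
      throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        roll.roll();
        return attempt;
      } catch (IOException e) {
        last = e;
        // Only pause if another attempt will follow.
        if (attempt < maxAttempts) {
          Thread.sleep(pauseMillis);
        }
      }
    }
    throw last;
  }
}
```

With a wrapper like this, the abort path is reached only after the attempt budget is exhausted, which matches the observation above that one more retry can be enough for the roll to complete.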



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
