[ 
https://issues.apache.org/jira/browse/HBASE-25887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-25887:
----------------------------------------
    Fix Version/s: 2.4.4
                   2.3.6
                   2.5.0
                   3.0.0-alpha-1

> Corrupt wal while region server is aborting.
> --------------------------------------------
>
>                 Key: HBASE-25887
>                 URL: https://issues.apache.org/jira/browse/HBASE-25887
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver, wal
>    Affects Versions: 3.0.0-alpha-1, 1.6.0, 2.5.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.6, 2.4.4
>
>
> We have seen a case in our production cluster where we ended up with a 
> corrupt wal. WALSplitter logged the error below:
> {noformat}
> 2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer - ABORTING region server HOST-B,60020,1620779441888: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
>         at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145)
>         at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150)
>         at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408)
>         at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261)
>         at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
>         at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Looking at the raw wal file, we could see that the last WALEdit contained the 
> region id, table name, and sequence number, but its cells were not persisted.
> Looking at the logs of the RS that generated that corrupt wal file:
> {noformat}
> 2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046
> 2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping rpc client
> 2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A/:60020 closing leases
> 2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A:/HOST-A:60020 closed leases
> 2021-05-11 23:29:22,198 WARN  [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL
> java.nio.channels.ClosedChannelException
>         at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>         at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
>         at org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
>         at org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
>         at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> These two lines are interesting:
> {quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046
>  ....
>  ....
>  2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL
>  java.nio.channels.ClosedChannelException
> {quote}
> The append thread encountered java.nio.channels.ClosedChannelException while 
> writing to the wal file because the file had already been closed.
> This is the order in which threads are shut down when the RS aborts:
> {noformat}
>       // With disruptor down, this is safe to let go.
>       if (this.appendExecutor !=  null) this.appendExecutor.shutdown();
>       // Tell our listeners that the log is closing
>        ...
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
>       }
>       if (this.writer != null) {
>         this.writer.close();
>         this.writer = null;
>       }
>     }
> {noformat}
> 1. Shut down the appendExecutor (but we *don't* wait for the currently 
> running tasks to complete; the standalone demo below illustrates this).
> 2. Close the writer. This in turn closes the wal log file.
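> To make step 1 concrete, here is a small standalone demo using plain 
> java.util.concurrent (no HBase classes): shutdown() returns while a 
> submitted task is still running, so a close that follows immediately can 
> race the in-flight append. This is only an illustration of the executor 
> semantics, not HBase code.
> {noformat}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class ShutdownRaceDemo {
>   public static void main(String[] args) throws Exception {
>     ExecutorService appendPool = Executors.newSingleThreadExecutor();
>     appendPool.submit(() -> {
>       try {
>         Thread.sleep(2000); // stands in for an in-flight WAL append
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>       System.out.println("append finished");
>     });
>
>     appendPool.shutdown(); // returns immediately; the task is still running
>     System.out.println("after shutdown(): terminated=" + appendPool.isTerminated());
>
>     // Only an explicit wait guarantees the append is done before we move on.
>     appendPool.awaitTermination(10, TimeUnit.SECONDS);
>     System.out.println("after awaitTermination(): terminated=" + appendPool.isTerminated());
>   }
> }
> {noformat}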
> We need to wait for the appendExecutor to terminate properly and only then 
> close the writer, roughly as sketched below.
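> A minimal sketch of that ordering, based on the FSHLog snippet quoted above 
> (this is not the actual patch; the timeout value and log message are 
> placeholders):
> {noformat}
>       // Stop accepting new append tasks.
>       if (this.appendExecutor != null) {
>         this.appendExecutor.shutdown();
>         try {
>           // Wait for the in-flight append to drain before the underlying
>           // stream is closed, so the last WALEdit is either written
>           // completely (key + cells) or not at all.
>           if (!this.appendExecutor.awaitTermination(5, TimeUnit.SECONDS)) {
>             LOG.warn("Append executor did not terminate; WAL tail may be incomplete");
>           }
>         } catch (InterruptedException ie) {
>           Thread.currentThread().interrupt();
>         }
>       }
>       // Only now is it safe to close the writer (and the wal file under it).
>       if (this.writer != null) {
>         this.writer.close();
>         this.writer = null;
>       }
> {noformat}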
> The code has changed a lot in the master branch. We need to study whether 
> this bug also exists in master/branch-2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
