Rushabh Shah created HBASE-25887:
------------------------------------
Summary: Corrupt wal while region server is aborting.
Key: HBASE-25887
URL: https://issues.apache.org/jira/browse/HBASE-25887
Project: HBase
Issue Type: Improvement
Components: regionserver, wal
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah
We have seen a case in our production cluster where we ended up in corrupt wal.
WALSplitter logged the below error
{noformat}
2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer -
ABORTING region server HOST-B,60020,16207794418
88: Caught throwable while processing event RS_LOG_REPLAY
java.lang.NullPointerException
at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
at
org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145)
at
org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150)
at
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408)
at
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261)
at
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
at
org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
Looking at the raw wal file, we could see that the last WALEdit contains the
region id, tablename and sequence number but cells were not persisted.
Looking at the logs of the RS that generated that corrupt wal file,
{noformat}
2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer
in /hbase/WALs/HOST-A,60020,1620774393046
2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping
rpc client
2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases -
regionserver/HOST-A/:60020 closing leases
2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases -
regionserver/HOST-A:/HOST-A:60020 closed leases
2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append
sequenceId=7147823, requesting roll of WAL
java.nio.channels.ClosedChannelException
at
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
at
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
at
org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
at
org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941)
at
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
at
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
These 2 lines are interesting.
{quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL
writer in /hbase/WALs/HOST-A,60020,1620774393046
....
....
2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append
sequenceId=7147823, requesting roll of WAL
java.nio.channels.ClosedChannelException
{quote}
The append thread encountered java.nio.channels.ClosedChannelException while
writing to wal file because the wal file was already closed.
This is the sequence of shutting down of threads when RS aborts.
{noformat}
// With disruptor down, this is safe to let go.
if (this.appendExecutor != null) this.appendExecutor.shutdown();
// Tell our listeners that the log is closing
...
if (LOG.isDebugEnabled()) {
LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
}
if (this.writer != null) {
this.writer.close();
this.writer = null;
}
}
{noformat}
1. Shutdown the appendExecutor (but we *don't* wait for the currently running
tasks to complete)
2. Close the writer. This will in turn close the wal log file.
We need to wait for appendExecutor to terminate properly and then only close
the writer.
Code has been changed a lot in master branch. Need to study whether this bug
exists in master/branch-2
--
This message was sent by Atlassian Jira
(v8.3.4#803005)