[
https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025829#comment-17025829
]
Pankaj Kumar commented on HBASE-23633:
--------------------------------------
I also observed this problem during testing: many regions *FAILED* to open due to
CorruptHFileException.
{noformat}
2020-01-29 07:07:13,911 | INFO | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Validating hfile at hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 for inclusion in store family region usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845. | org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:730)
2020-01-29 07:07:13,930 | ERROR | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Failed open of region=usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845., starting to roll back the global memstore size. | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:386)
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793
    at org.apache.hadoop.hbase.io.hfile.HFile.openReader(HFile.java:503)
    at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:562)
    at org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:732)
    at org.apache.hadoop.hbase.regionserver.HRegion.loadRecoveredHFilesIfAny(HRegion.java:4905)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:863)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:824)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7023)
{noformat}
After digging more into the logs, I observed that this problem occurred when the
"split-log-closeStream" thread was splitting the WAL into hfiles and the Region Server
aborted for some reason. So the "split-log-closeStream" thread was interrupted
and left the recovered hfile in an intermediate state.
{noformat}
2020-01-28 23:01:04,962 | WARN | RS_LOG_REPLAY_OPS-8-5-179-5:RS-PORT-0 | log splitting of WALs/RS-IP,RS-PORT,1580220469213-splitting/RS-IP%2CRS-PORT%2C1580220469213.1580222580793 interrupted, resigning | org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
java.io.InterruptedIOException
    at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:186)
    at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.close(BoundedRecoveredHFilesOutputSink.java:155)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:404)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:225)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
    at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
    at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:179)
    ... 9 more
{noformat}
Further, I checked and confirmed from the NN audit log that the file was not written
completely before the RS went down:
{noformat}
2020-01-28 23:01:04,946 | INFO | IPC Server handler 125 on 25000 | BLOCK* allocate blk_1092127264_18392260, replicas=DN-IP1:DN-PORT, DN-IP2:DN-PORT, DN-IP3:DN-PORT for /hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 | FSDirWriteFileOp.java:856
----
2020-01-29 00:01:04,956 | INFO | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@862fb5 | Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1098699935_1, pending creates: 21], src=/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 | FSNamesystem.java:3344
2020-01-29 00:01:04,957 | WARN | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@862fb5 | DIR* NameSystem.internalReleaseLease: File /hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 has not been closed. Lease recovery is in progress. RecoveryId = 18395023 for block blk_1092127264_18392260 | FSNamesystem.java:3470
2020-01-29 00:01:14,504 | INFO | IPC Server handler 0 on 25006 | commitBlockSynchronization(oldBlock=BP-2062589142-192.168.250.11-1574429102552:blk_1092127264_18392260, file=/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793, newgenerationstamp=18395023, newlength=0, newtargets=[]) successful | FSNamesystem.java:3748
{noformat}
Since the WAL split was interrupted, HMaster will recover it by resubmitting the
WAL split task, so there should not be any data loss IMO.
We should clean up such corrupted hfiles. What do you think [~zghao] [~stack] sir?
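To make the cleanup idea concrete, here is a rough sketch (not a patch) of what it could look like around the validation that {{HStore#assertBulkLoadHFileOk}} already does when the region opens. The config key {{hbase.hregion.recovered.hfiles.cleanup.corrupt}} and the helper method below are just placeholders I made up (modeled on {{hbase.hregion.edits.replay.skip.errors}}), and the {{HFile.createReader}} signature is the one on recent branches:
{code:java}
// Sketch only: one possible way to clean up a corrupt recovered hfile during
// region open. Config key and helper names are hypothetical placeholders.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.CorruptHFileException;
import org.apache.hadoop.hbase.io.hfile.HFile;

public class RecoveredHFileCleanupSketch {

  // Hypothetical flag, modeled on hbase.hregion.edits.replay.skip.errors.
  private static final String CLEANUP_CORRUPT_KEY =
      "hbase.hregion.recovered.hfiles.cleanup.corrupt";

  /**
   * Returns true if the recovered hfile is usable. If the trailer cannot be
   * read (file left half-written by an interrupted WAL split) and cleanup is
   * enabled, the file is moved aside so the region can still open.
   */
  public static boolean validateOrSideline(FileSystem fs, Path recoveredHFile,
      Configuration conf) throws IOException {
    try {
      // Same check assertBulkLoadHFileOk does today: opening the reader fails
      // with CorruptHFileException when the trailer is missing.
      HFile.createReader(fs, recoveredHFile, new CacheConfig(conf), true, conf)
          .close();
      return true;
    } catch (CorruptHFileException e) {
      if (!conf.getBoolean(CLEANUP_CORRUPT_KEY, false)) {
        throw e; // keep today's behaviour: fail the region open
      }
      // Sideline rather than delete, so the file can still be inspected.
      Path sideline = new Path(recoveredHFile.getParent(),
          recoveredHFile.getName() + ".corrupt");
      if (!fs.rename(recoveredHFile, sideline)) {
        throw new IOException("Failed to sideline " + recoveredHFile, e);
      }
      return false;
    }
  }
}
{code}
Sidelining instead of deleting keeps the partial file around for inspection, and since HMaster resubmits the interrupted WAL split task anyway, the edits would be rewritten by the re-split.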
> Find a way to handle the corrupt recovered hfiles
> -------------------------------------------------
>
> Key: HBASE-23633
> URL: https://issues.apache.org/jira/browse/HBASE-23633
> Project: HBase
> Issue Type: Sub-task
> Reporter: Guanghao Zhang
> Priority: Major
>
> Copy the comment from PR review.
>
> If the file is a corrupt HFile, an exception will be thrown here, which will
> cause the region to fail to open.
> Maybe we can add a new parameter to control whether to skip the exception,
> similar to recover edits which has a parameter
> "hbase.hregion.edits.replay.skip.errors";
>
> Regions that can't be opened because of detached References or corrupt hfiles
> are a fact-of-life. We need work on this issue. This will be a new variant on
> the problem -- i.e. bad recovered hfiles.
> On adding a config to ignore bad files and just open, that's a bit dangerous,
> as per @infraio, as it could mean silent data loss.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)