[ https://issues.apache.org/jira/browse/HBASE-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437333#comment-17437333 ]

Bryan Beaudreault commented on HBASE-26408:
-------------------------------------------

Actually I realized that the exceptions I was seeing were coming from the 
previous append calls. When an append fails, the exception gets wrapped in a 
DamagedWALException and stashed in the RingBufferEventHandler. The next time 
sync is called, if an exception is stashed, the sync fails with that exception.
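
For anyone unfamiliar with that code path, here is a minimal sketch of the 
stash-and-rethrow pattern described above. This is not FSHLog's actual 
implementation; the class and method names below are illustrative only:

{code:java}
import java.io.IOException;

/** Sketch only: a failed append is wrapped and stashed, and the failure
 *  surfaces on the *next* sync call rather than on the append itself. */
class StashAndRethrowSketch {

  /** Stand-in for org.apache.hadoop.hbase.regionserver.wal.DamagedWALException. */
  static class DamagedWalSketchException extends RuntimeException {
    DamagedWalSketchException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  private Throwable stashedException; // set by a failed append

  void append(byte[] edit) {
    try {
      writeToPipeline(edit); // e.g. fails during HDFS pipeline recovery
    } catch (IOException e) {
      // The failure is wrapped and stashed rather than thrown to the caller.
      stashedException = new DamagedWalSketchException(
        "Append failed, requesting roll of WAL", e);
    }
  }

  void sync() throws IOException {
    if (stashedException != null) {
      // The earlier append failure surfaces here, which is why the abort
      // added in HBASE-26195 fires on sync rather than on append.
      throw new IOException("WAL sync failed", stashedException);
    }
    // ... flush outstanding edits to the filesystem ...
  }

  private void writeToPipeline(byte[] edit) throws IOException {
    // placeholder for the real DFSOutputStream write
  }
}
{code}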

I think it might just make sense to skip aborting on DamagedWALException. 
Example exception:

FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server rs1,60020,1635818350839: WAL sync failed, aborting to preserve WAL as source of truth
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=428338, requesting roll of WAL
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1940)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1793)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1703)
 at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.18.48.231:50010,DS-731493c0-d6e0-4b9c-805a-5fffc343c9e1,DISK], DatanodeInfoWithStorage[172.18.64.13:50010,DS-62ecc5f9-feac-4c2d-92f0-c5cb14c50ac8,DISK]], original=[DatanodeInfoWithStorage[172.18.48.231:50010,DS-731493c0-d6e0-4b9c-805a-5fffc343c9e1,DISK], DatanodeInfoWithStorage[172.18.64.13:50010,DS-62ecc5f9-feac-4c2d-92f0-c5cb14c50ac8,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1309)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1374)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
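
As an aside, the exception message itself points at 
'dfs.client.block.write.replace-datanode-on-failure.policy'. For reference, a 
sketch of tuning those HDFS client settings is below; the property names are 
real HDFS configs, but the values shown are illustrative, and best-effort mode 
trades durability for availability, so it may be a poor fit for WAL traffic:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ReplaceDatanodePolicyTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Keep datanode replacement enabled with the DEFAULT policy...
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    // ...but let the stream continue on the surviving datanodes when no
    // replacement can be found, instead of failing like the trace above.
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
    return conf;
  }
}
{code}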

> Aborting to preserve WAL as source of truth can abort in recoverable 
> situations
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26408
>                 URL: https://issues.apache.org/jira/browse/HBASE-26408
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> HBASE-26195 added an important feature to avoid data corruption by preserving 
> the WAL as a source of truth when WAL sync fails. See that issue for 
> background.
> That issue's primary driver was a TimeoutIOException, but the solution was to 
> catch and abort on Throwable. The idea here was that we can't anticipate all 
> possible failures, so we should err on the side of data correctness. As 
> pointed out by [~rushabh.shah] in his comments, this solution has the 
> potential to lose HBase capacity quickly in "not very grave" situations. It 
> would be good to add an escape hatch for those explicitly known cases, one of 
> which I just encountered:
> I recently rolled this out to some of our test clusters, most of which are 
> small. Afterward, doing a rolling restart of DataNodes caused the following 
> IOException: "Failed to replace a bad datanode on the existing pipeline due 
> to no more good datanodes being available to try..."
> If you're familiar with HDFS pipeline recovery, you will recognize this error. 
> Basically, the restarted DataNodes caused pipeline failures; those datanodes 
> were added to an internal exclude list that never gets cleared, and eventually 
> there were no more nodes to choose from, resulting in the error above.
> This error is pretty explicit, and at this point the DFSOutputStream for the 
> WAL is dead. I think this error is a reasonable one to simply bubble up and 
> not abort the RegionServer on, instead just failing and rolling back the 
> writes.
> What do people think about starting an allowlist of known good error messages 
> for which we do not trigger an abort of the RS? Something like this:
> {code:java}
> } catch (Throwable t) {
>   // WAL sync failed. Aborting to avoid a mismatch between the memstore, WAL,
>   // and any replicated clusters.
>   if (!walSyncSuccess && !allowedException(t)) {
>     rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);
>   }
> }
>
> ... snip ...
>
> private boolean allowedException(Throwable t) {
>   return t.getMessage().startsWith("Failed to replace a bad datanode");
> }
> {code}
> We could of course make this configurable if people like, or just add to it 
> over time.
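>
> For illustration, here is a minimal sketch of what that configurable allowlist 
> could look like. The property name 
> hbase.regionserver.wal.sync.allowed-exception-prefixes is hypothetical, and 
> this version also walks the cause chain and guards against null messages, 
> which the inline snippet above does not:
> {code:java}
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
>
> public class WalSyncAllowlist {
>   // Hypothetical property; not an existing HBase configuration key.
>   private static final String KEY =
>     "hbase.regionserver.wal.sync.allowed-exception-prefixes";
>
>   private final List<String> allowedPrefixes;
>
>   public WalSyncAllowlist(Configuration conf) {
>     this.allowedPrefixes =
>       Arrays.asList(conf.getStrings(KEY, "Failed to replace a bad datanode"));
>   }
>
>   /** Returns true if this failure should not abort the RegionServer. */
>   public boolean allowedException(Throwable t) {
>     // Walk the cause chain so wrappers like DamagedWALException do not
>     // hide the underlying HDFS message.
>     for (Throwable cur = t; cur != null; cur = cur.getCause()) {
>       String msg = cur.getMessage();
>       if (msg == null) {
>         continue;
>       }
>       for (String prefix : allowedPrefixes) {
>         if (msg.startsWith(prefix)) {
>           return true;
>         }
>       }
>     }
>     return false;
>   }
> }
> {code}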



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
