[ 
https://issues.apache.org/jira/browse/HBASE-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436141#comment-17436141
 ] 

Bryan Beaudreault commented on HBASE-26408:
-------------------------------------------

I can work around this by setting 
dfs.client.block.write.replace-datanode-on-failure.min-replication to 2, or by 
using dfs.client.block.write.replace-datanode-on-failure.best-effort. But both 
of those reduce the potential redundancy of our WAL, and I'd rather just not 
abort the RS in this case.
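
For reference, the first workaround is a client-side setting in hdfs-site.xml (or hbase-site.xml, since it is read by the HDFS client embedded in the RegionServer). A minimal sketch, with the caveat noted above that it weakens the pipeline's replacement guarantee:

```xml
<!-- Allow the write pipeline to continue as long as at least 2 replicas
     remain, instead of requiring a replacement datanode. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.min-replication</name>
  <value>2</value>
</property>
```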

> Aborting to preserve WAL as source of truth can abort in recoverable 
> situations
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26408
>                 URL: https://issues.apache.org/jira/browse/HBASE-26408
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> HBASE-26195 added an important feature to avoid data corruption by preserving 
> the WAL as a source of truth when WAL sync fails. See that issue for 
> background.
> That issue's primary driver was a TimeoutIOException, but the solution was to 
> catch and abort on Throwable. The idea here was that we can't anticipate all 
> possible failures, so we should err on the side of data correctness. As 
> pointed out by [~rushabh.shah] in his comments, this solution has the 
> potential to lose HBase capacity quickly in "not very grave" situations.
> I recently rolled this out to some of our test clusters, most of which are 
> small. Afterward, doing a rolling restart of DataNodes caused the following 
> IOException: "Failed to replace a bad datanode on the existing pipeline due 
> to no more good datanodes being available to try..."
> If you're familiar with HDFS pipeline recovery, this error will be familiar. 
> Basically, the restarted DataNodes caused pipeline failures; those datanodes 
> were added to an internal exclude list that never gets cleared, and 
> eventually there were no more nodes to choose from, resulting in an error.
> This error is not recoverable, so at this point the DFSOutputStream for the 
> WAL is dead. I think this error is a reasonable one to simply bubble up and 
> not abort the RegionServer on.
> What do people think about starting an allowlist of known good error messages 
> for which we do not trigger an abort of the RS? Something like this:
>  
> {code:java}
> } catch (Throwable t) {
>   // WAL sync failed. Aborting to avoid a mismatch between the memstore, WAL,
>   // and any replicated clusters.
>   if (!walSyncSuccess && !allowedException(t)) {
>     rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);
>   }
> ... snip ...
> private boolean allowedException(Throwable t) {
>   String message = t.getMessage();
>   // getMessage() can be null; treat that as not allowed so we still abort.
>   return message != null && message.startsWith("Failed to replace a bad datanode");
> }
> {code}
> We could of course make this configurable if people like, or just add to it 
> over time.
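
If we did make the allowlist configurable, a minimal sketch could look like the following. The class name, constructor shape, and the idea of a comma-separated prefix list are all hypothetical illustrations, not existing HBase API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a configurable allowlist of WAL-sync error-message
// prefixes that should NOT trigger a RegionServer abort. In real HBase code
// the prefix list would presumably come from a Configuration key.
public class WalSyncAbortPolicy {
  private final List<String> allowedPrefixes;

  public WalSyncAbortPolicy(String commaSeparatedPrefixes) {
    this.allowedPrefixes = Arrays.asList(commaSeparatedPrefixes.split("\\s*,\\s*"));
  }

  /** Returns true if the throwable's message matches a known-recoverable prefix. */
  public boolean allowedException(Throwable t) {
    String msg = t.getMessage();
    if (msg == null) {
      return false; // unknown cause: err on the side of aborting
    }
    for (String prefix : allowedPrefixes) {
      if (msg.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}
```

A null or unrecognized message still aborts, which keeps the existing fail-safe behavior as the default.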



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
