[
https://issues.apache.org/jira/browse/HBASE-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436141#comment-17436141
]
Bryan Beaudreault commented on HBASE-26408:
-------------------------------------------
I can work around this by setting
dfs.client.block.write.replace-datanode-on-failure.min-replication to 2 or
using dfs.client.block.write.replace-datanode-on-failure.best-effort. But both
of those reduce the potential redundancy of our WAL, and I'd rather just not
abort the RS in this case.
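
For reference, a rough sketch of the two workarounds as client-side settings. It uses java.util.Properties as a stand-in so that it runs without Hadoop on the classpath; in practice these keys would be set in hdfs-site.xml or on the Hadoop Configuration used by the RegionServer's DFS client. The class and method names are hypothetical, and the value 2 is just the one mentioned above.

```java
import java.util.Properties;

/** Hypothetical holder for the two workaround settings; not HBase or HDFS code. */
final class WalPipelineWorkarounds {
  static final String MIN_REPLICATION =
      "dfs.client.block.write.replace-datanode-on-failure.min-replication";
  static final String BEST_EFFORT =
      "dfs.client.block.write.replace-datanode-on-failure.best-effort";

  /** Option 1: keep writing while at least 2 datanodes remain in the pipeline. */
  static Properties minReplicationOfTwo() {
    Properties props = new Properties();
    props.setProperty(MIN_REPLICATION, "2");
    return props;
  }

  /** Option 2: keep writing even if a replacement datanode cannot be found. */
  static Properties bestEffort() {
    Properties props = new Properties();
    props.setProperty(BEST_EFFORT, "true");
    return props;
  }
}
```

Either option lets the write continue on a shorter pipeline, which is the redundancy trade-off described above.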
> Aborting to preserve WAL as source of truth can abort in recoverable
> situations
> -------------------------------------------------------------------------------
>
> Key: HBASE-26408
> URL: https://issues.apache.org/jira/browse/HBASE-26408
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Bryan Beaudreault
> Priority: Major
>
> HBASE-26195 added an important feature to avoid data corruption by preserving
> the WAL as a source of truth when WAL sync fails. See that issue for
> background.
> That issue's primary driver was a TimeoutIOException, but the solution was to
> catch and abort on Throwable. The idea here was that we can't anticipate all
> possible failures, so we should err on the side of data correctness. As
> pointed out by [~rushabh.shah] in his comments, this solution has the
> potential to lose HBase capacity quickly in "not very grave" situations.
> I recently rolled this out to some of our test clusters, most of which are
> small. Afterward, doing a rolling restart of DataNodes caused the following
> IOException: "Failed to replace a bad datanode on the existing pipeline due
> to no more good datanodes being available to try..."
> If you're familiar with HDFS pipeline recovery, you will recognize this error.
> Basically, the restarted DataNodes caused pipeline failures and were added to
> an internal exclude list that never gets cleared. Eventually there were no
> more nodes to choose from, resulting in the error above.
> This error is not recoverable, so at this point the DFSOutputStream for the
> WAL is dead. I think this error is a reasonable one to simply bubble up and
> not abort the RegionServer on.
> What do people think about starting an allowlist of known good error messages
> for which we do not trigger an abort of the RS? Something like this:
>
> {{} catch (Throwable t) {}}
> {{  // WAL sync failed. Aborting to avoid a mismatch between the memstore, WAL,}}
> {{  // and any replicated clusters.}}
> {{  if (!walSyncSuccess && !allowedException(t)) {}}
> {{    rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);}}
> {{  }}}
> {{... snip ..}}
> {{private boolean allowedException(Throwable t) {}}
> {{  String message = t.getMessage(); // may be null}}
> {{  return message != null && message.startsWith("Failed to replace a bad datanode");}}
> {{}}}
> We could of course make this configurable if people like, or just add to it over
> time.
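
A minimal, self-contained sketch of the allowlist idea from the description, for discussion only. The class name WalSyncAbortPolicy, the method shouldAbort, and passing the prefixes in as a list (for example, read from a configuration value) are assumptions, not existing HBase code. It treats a null exception message as not allowlisted, since Throwable.getMessage() can return null.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper illustrating the proposal; not part of HBase. */
final class WalSyncAbortPolicy {
  // Message prefixes for which a failed WAL sync would not abort the RS.
  private final List<String> allowedPrefixes;

  WalSyncAbortPolicy(List<String> allowedPrefixes) {
    this.allowedPrefixes = List.copyOf(allowedPrefixes);
  }

  /** Returns true if the throwable's message starts with an allowlisted prefix. */
  boolean allowedException(Throwable t) {
    String message = t.getMessage(); // may be null
    if (message == null) {
      return false;
    }
    for (String prefix : allowedPrefixes) {
      if (message.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  /** Mirrors the condition in the proposed catch block. */
  boolean shouldAbort(boolean walSyncSuccess, Throwable t) {
    return !walSyncSuccess && !allowedException(t);
  }

  public static void main(String[] args) {
    WalSyncAbortPolicy policy =
        new WalSyncAbortPolicy(List.of("Failed to replace a bad datanode"));
    Throwable pipeline = new IOException(
        "Failed to replace a bad datanode on the existing pipeline");
    Throwable other = new IOException("some other failure");
    System.out.println(policy.shouldAbort(false, pipeline)); // false
    System.out.println(policy.shouldAbort(false, other));    // true
  }
}
```

Matching on message prefixes is fragile if the HDFS wording changes, so a configurable list at least lets operators adjust it without a code change.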