[
https://issues.apache.org/jira/browse/HBASE-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437418#comment-17437418
]
Rushabh Shah commented on HBASE-26408:
--------------------------------------
> I agree that it's possible for postWALWrite to fail, and that should also
> probably not abort.
[~bbeaudreault] Trying to understand why it shouldn't abort. postWALWrite
failed, but the entry has already been written to HDFS/WAL. In HRegion#append
the write will then fail and be rolled back from the memstore, so the primary
and replicated clusters will again be out of sync.
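As a toy illustration of the ordering in question (plain Java lists standing in
for the WAL and memstore; none of the names below are real HBase APIs): the WAL
append and sync succeed, the post-write hook then throws, and only the memstore
is rolled back, so the WAL and anything replicating it end up ahead of the
primary's memstore.

import java.util.ArrayList;
import java.util.List;

// Toy model only, not real HBase code: shows how a failure after a durable
// WAL append leaves the WAL and the memstore out of sync.
public class PostWalWriteOrdering {
  static final List<String> wal = new ArrayList<>();      // durable WAL, also what peers replicate
  static final List<String> memstore = new ArrayList<>(); // in-memory store on the primary

  static void postWALWrite(String edit) {
    throw new RuntimeException("post-write hook failed after the WAL entry was synced");
  }

  static void append(String edit) {
    wal.add(edit);        // entry appended and synced to HDFS/WAL
    memstore.add(edit);   // applied to the memstore
    try {
      postWALWrite(edit); // fails *after* the entry is durable
    } catch (RuntimeException e) {
      memstore.remove(edit); // rolled back from the memstore only
      throw e;
    }
  }

  public static void main(String[] args) {
    try {
      append("row1/cf:q=v1");
    } catch (RuntimeException ignored) {
      // the caller just sees a failed write
    }
    // The WAL (and any cluster replicating it) still has the edit; the memstore does not.
    System.out.println("wal=" + wal + " memstore=" + memstore);
  }
}
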
> Aborting to preserve WAL as source of truth can abort in recoverable
> situations
> -------------------------------------------------------------------------------
>
> Key: HBASE-26408
> URL: https://issues.apache.org/jira/browse/HBASE-26408
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Bryan Beaudreault
> Priority: Major
>
> HBASE-26195 added an important feature to avoid data corruption by preserving
> the WAL as a source of truth when WAL sync fails. See that issue for
> background.
> That issue's primary driver was a TimeoutIOException, but the solution was to
> catch and abort on Throwable. The idea here was that we can't anticipate all
> possible failures, so we should err on the side of data correctness. As
> pointed out by [~rushabh.shah] in his comments, this solution has the
> potential to lose HBase capacity quickly in "not very grave" situations. It
> would be good to add an escape hatch for those explicitly known cases, one
> of which I encountered:
> I recently rolled this out to some of our test clusters, most of which are
> small. Afterward, doing a rolling restart of DataNodes caused the following
> IOException: "Failed to replace a bad datanode on the existing pipeline due
> to no more good datanodes being available to try..."
> If you're familiar with HDFS pipeline recovery, this error will be familiar.
> Basically, the restarted DataNodes caused pipeline failures; those DataNodes
> were added to an internal exclude list that never gets cleared, and
> eventually there were no more nodes to choose from, resulting in the error
> above.
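> As an aside, that replacement behavior is governed by the HDFS client's
> dfs.client.block.write.replace-datanode-on-failure.* settings. A minimal
> sketch of those knobs follows (values are illustrative only, not a
> recommendation, and tuning them is not what this issue proposes):
>
> import org.apache.hadoop.conf.Configuration;
>
> // Standard HDFS client settings that control datanode replacement on
> // pipeline failure (see hdfs-default.xml). Values here are illustrative.
> public class PipelineReplacementConfig {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Whether the client tries to replace a failed datanode in the pipeline.
>     conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
>     // When replacement is attempted: DEFAULT, ALWAYS, or NEVER.
>     conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
>     // If no replacement node can be found, keep writing with the remaining
>     // datanodes instead of throwing the IOException quoted above.
>     conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
>     System.out.println(conf.get("dfs.client.block.write.replace-datanode-on-failure.policy"));
>   }
> }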
> This error is pretty explicit, and at this point the DFSOutputStream for the
> WAL is dead. I think this error is a reasonable one to simply bubble up and
> not abort the RegionServer on, instead just failing and rolling back the
> writes.
> What do people think about starting an allowlist of known good error messages
> for which we do not trigger an abort of the RS? Something like this:
> } catch (Throwable t) {
>   // WAL sync failed. Aborting to avoid a mismatch between the memstore,
>   // WAL, and any replicated clusters.
>   if (!walSyncSuccess && !allowedException(t)) {
>     rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);
>   }
> }
> ... snip ...
> private boolean allowedException(Throwable t) {
>   return t.getMessage().startsWith("Failed to replace a bad datanode");
> }
> We could of course make it configurable if people like, or just add to it
> over time.
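> For illustration only, a configurable version might read a list of message
> prefixes from the Configuration. A minimal sketch (the property name
> "hbase.regionserver.wal.sync.abort.allowlist" and this helper class are
> hypothetical, not an existing HBase setting):
>
> import java.util.Arrays;
> import org.apache.hadoop.conf.Configuration;
>
> // Hypothetical sketch of a configurable allowlist of WAL-sync error message
> // prefixes that should not abort the RegionServer.
> public class WalSyncAbortAllowlist {
>   private final String[] allowedPrefixes;
>
>   public WalSyncAbortAllowlist(Configuration conf) {
>     // Made-up property name, defaulting to the datanode-replacement message.
>     this.allowedPrefixes = conf.getStrings(
>       "hbase.regionserver.wal.sync.abort.allowlist",
>       "Failed to replace a bad datanode");
>   }
>
>   /** True if the throwable's message starts with any configured prefix. */
>   public boolean allowedException(Throwable t) {
>     String msg = t.getMessage();
>     return msg != null
>       && Arrays.stream(allowedPrefixes).anyMatch(msg::startsWith);
>   }
> }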
--
This message was sent by Atlassian Jira
(v8.3.4#803005)