Bryan Beaudreault created HBASE-26408:
-----------------------------------------

             Summary: Aborting to preserve WAL as source of truth can abort in 
recoverable situations
                 Key: HBASE-26408
                 URL: https://issues.apache.org/jira/browse/HBASE-26408
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Bryan Beaudreault


HBASE-26195 added an important feature to avoid data corruption by preserving 
the WAL as a source of truth when WAL sync fails. See that issue for background.

That issue's primary driver was a TimeoutIOException, but the solution was to 
catch and abort on Throwable. The idea here was that we can't anticipate all 
possible failures, so we should err on the side of data correctness. As pointed 
out by [~rushabh.shah] in his comments, this solution has the potential to lose 
HBase capacity quickly in "not very grave" situations.

I recently rolled this out to some of our test clusters, most of which are 
small. Afterward, doing a rolling restart of DataNodes caused the following 
IOException: "Failed to replace a bad datanode on the existing pipeline due to 
no more good datanodes being available to try..."

If you know HDFS pipeline recovery, this error will look familiar. 
Basically the restarted DataNodes caused pipeline failures, those DataNodes 
were added to an internal exclude list that never gets cleared, and eventually 
there were no nodes left to choose from, which produced the error above.
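
To make the mechanism concrete, here is a toy model of it in Java. This is only a sketch of the behavior described above, not HDFS's actual code: the class `ToyPipeline`, its methods, and the node names are all made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a write pipeline with a sticky exclude list (hypothetical, not HDFS code). */
final class ToyPipeline {
  private final List<String> clusterNodes;
  // Nodes that failed once; in this model the set is never cleared.
  private final Set<String> excluded = new HashSet<>();
  private final List<String> pipeline = new ArrayList<>();

  ToyPipeline(List<String> clusterNodes, int replication) {
    this.clusterNodes = new ArrayList<>(clusterNodes);
    // Build the initial pipeline from the first `replication` nodes.
    for (String node : clusterNodes) {
      if (pipeline.size() < replication) {
        pipeline.add(node);
      }
    }
  }

  /** Drops a failed node from the pipeline and tries to swap in a replacement. */
  void onNodeFailure(String node) throws IOException {
    if (!pipeline.remove(node)) {
      return; // node was not part of this pipeline
    }
    excluded.add(node);
    for (String candidate : clusterNodes) {
      if (!excluded.contains(candidate) && !pipeline.contains(candidate)) {
        pipeline.add(candidate);
        return;
      }
    }
    // Every remaining node is either excluded or already in the pipeline.
    throw new IOException("no replacement node available for " + node);
  }

  List<String> pipeline() {
    return new ArrayList<>(pipeline);
  }
}
```

With four nodes and a replication of three, restarting the first node succeeds because the fourth node can be swapped in. Restarting a second node then throws, even if the first node is healthy again, because it is still on the exclude list. That is why small clusters hit this so easily during a rolling restart.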

This error is not recoverable, so at this point the DFSOutputStream for the WAL 
is dead. I think it is reasonable to simply let this error bubble up rather 
than abort the RegionServer on it.

What do people think about starting an allowlist of known good error messages 
for which we do not trigger an abort of the RS? Something like this:

 

{{} catch (Throwable t) {}}
{{  // WAL sync failed. Aborting to avoid a mismatch between the memstore, WAL,}}
{{  // and any replicated clusters.}}
{{  if (!walSyncSuccess && !allowedException(t)) {}}
{{    rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);}}
{{  }}}

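{{... snip ...}}

{{private boolean allowedException(Throwable t) {}}
{{  // getMessage() can return null, so guard before matching.}}
{{  String message = t.getMessage();}}
{{  return message != null && message.startsWith("Failed to replace a bad datanode");}}
{{}}}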

We could of course make this configurable if people like, or just add to it 
over time.
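
If we went the configurable route, a minimal sketch could look like the following. Everything here is hypothetical: the class name `WalSyncAbortAllowlist`, the idea of a comma-separated setting, and walking the cause chain (in case the IOException arrives wrapped) are my assumptions, not existing HBase code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: decides whether a WAL sync failure may skip the RegionServer abort. */
final class WalSyncAbortAllowlist {
  /** Default prefix, taken from the error message quoted in this issue. */
  static final String DEFAULT_PREFIXES = "Failed to replace a bad datanode";

  /** Bound on cause-chain traversal so an odd cyclic chain cannot loop forever. */
  private static final int MAX_CAUSE_DEPTH = 16;

  private final List<String> allowedPrefixes;

  /** @param commaSeparatedPrefixes value of a (hypothetical) configuration setting */
  WalSyncAbortAllowlist(String commaSeparatedPrefixes) {
    this.allowedPrefixes = Arrays.stream(commaSeparatedPrefixes.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  /** Returns true if t, or an exception in its cause chain, has an allowlisted message prefix. */
  boolean isAllowed(Throwable t) {
    Throwable cur = t;
    for (int depth = 0; cur != null && depth < MAX_CAUSE_DEPTH; depth++, cur = cur.getCause()) {
      String message = cur.getMessage();
      if (message == null) {
        continue; // nothing to match on at this level
      }
      for (String prefix : allowedPrefixes) {
        if (message.startsWith(prefix)) {
          return true;
        }
      }
    }
    return false;
  }
}
```

The catch block above would then call something like `!allowlist.isAllowed(t)` in place of `!allowedException(t)`. Matching on message prefixes is brittle if HDFS ever rewords the message, so this is only meant as a starting point for discussion.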



