[ https://issues.apache.org/jira/browse/HBASE-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437393#comment-17437393 ]

Rushabh Shah commented on HBASE-26408:
--------------------------------------

Thank you [~bbeaudreault] for reporting the issue.

This is the code snippet from 
[+FSHLog#RingBufferEventHandler#append+|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java#L2081-L2129]
 method
{code:java}
    void append(final FSWALEntry entry) throws Exception {
      // TODO: WORK ON MAKING THIS APPEND FASTER. DOING WAY TOO MUCH WORK WITH CPs, PBing, etc.
      atHeadOfRingBufferEventHandlerAppend();

      long start = EnvironmentEdgeManager.currentTime();
      byte [] encodedRegionName = entry.getKey().getEncodedRegionName();
      long regionSequenceId = WALKey.NO_SEQUENCE_ID;
      try {
        regionSequenceId = entry.getKey().getSequenceId();
        // Edits are empty, there is nothing to append.  Maybe empty when we are looking for a
        // region sequence id only, a region edit/sequence id that is not associated with an actual
        // edit. It has to go through all the rigmarole to be sure we have the right ordering.
        if (entry.getEdit().isEmpty()) {
          return;
        }

        // Coprocessor hook.
        if (!coprocessorHost.preWALWrite(entry.getHRegionInfo(), entry.getKey(),
            entry.getEdit())) {
          if (entry.getEdit().isReplay()) {
            // Set replication scope null so that this won't be replicated
            entry.getKey().setScopes(null);
          }
        }
        if (!listeners.isEmpty()) {
          for (WALActionsListener i: listeners) {
            // TODO: Why does listener take a table description and CPs take a regioninfo?  Fix.
            i.visitLogEntryBeforeWrite(entry.getHTableDescriptor(), entry.getKey(),
              entry.getEdit());
          }
        }

        writer.append(entry);
        assert highestUnsyncedSequence < entry.getSequence();
        highestUnsyncedSequence = entry.getSequence();
        sequenceIdAccounting.update(encodedRegionName, entry.getFamilyNames(), regionSequenceId,
          entry.isInMemstore());
        coprocessorHost.postWALWrite(entry.getHRegionInfo(), entry.getKey(), entry.getEdit());
        // Update metrics.
        postAppend(entry, EnvironmentEdgeManager.currentTime() - start);
      } catch (Exception e) {
        String msg = "Append sequenceId=" + regionSequenceId + ", requesting roll of WAL";
        LOG.warn(msg, e);
        requestLogRoll();
        throw new DamagedWALException(msg, e);
      }
      numEntries.incrementAndGet();
    }
{code}

We catch every Exception thrown out of the append method and wrap it in a blanket DamagedWALException. It is possible that writer#append succeeds at L#2114 but something fails afterwards, for example in the coprocessorHost#postWALWrite method. If that happens, the primary and the replicated cluster will end up in an inconsistent state. So instead of skipping the abort on every DamagedWALException, we should unwrap the DamagedWALException, inspect the underlying cause, and skip aborting only if that cause is in the allowedList.
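The unwrap-and-inspect idea above could be sketched roughly like this. This is only an illustrative, standalone sketch: {{WalAbortPolicy}}, {{shouldSkipAbort}}, and the allowlist entries are hypothetical names, and the nested {{DamagedWALException}} here is a stand-in for the real HBase class, not the actual implementation.

```java
import java.util.Arrays;
import java.util.List;

public class WalAbortPolicy {
  // Stand-in for org.apache.hadoop.hbase.regionserver.wal.DamagedWALException.
  static class DamagedWALException extends RuntimeException {
    DamagedWALException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  // Illustrative allowlist of message prefixes considered recoverable; in a
  // real patch this could come from configuration.
  private static final List<String> ALLOWED_MESSAGE_PREFIXES =
      Arrays.asList("Failed to replace a bad datanode");

  // Walk the cause chain of the (possibly wrapped) exception and report
  // whether the underlying failure matches a known-recoverable message,
  // i.e. whether the RegionServer abort can be skipped.
  static boolean shouldSkipAbort(Throwable t) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      String msg = cur.getMessage();
      if (msg == null) {
        continue;
      }
      for (String prefix : ALLOWED_MESSAGE_PREFIXES) {
        if (msg.startsWith(prefix)) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Throwable recoverable = new DamagedWALException(
        "Append sequenceId=5, requesting roll of WAL",
        new java.io.IOException(
            "Failed to replace a bad datanode on the existing pipeline"));
    Throwable fatal = new DamagedWALException(
        "Append sequenceId=5, requesting roll of WAL",
        new java.io.IOException("All datanodes are bad"));
    System.out.println(shouldSkipAbort(recoverable)); // true
    System.out.println(shouldSkipAbort(fatal));       // false
  }
}
```

The point of walking the full cause chain is that the message we care about usually lives on the wrapped IOException, not on the DamagedWALException itself.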


> Aborting to preserve WAL as source of truth can abort in recoverable 
> situations
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26408
>                 URL: https://issues.apache.org/jira/browse/HBASE-26408
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> HBASE-26195 added an important feature to avoid data corruption by preserving 
> the WAL as a source of truth when WAL sync fails. See that issue for 
> background.
> That issue's primary driver was a TimeoutIOException, but the solution was to 
> catch and abort on Throwable. The idea here was that we can't anticipate all 
> possible failures, so we should err on the side of data correctness. As 
> pointed out by [~rushabh.shah] in his comments, this solution has the 
> potential to lose HBase capacity quickly in "not very grave" situations. It 
> would be good to add an escape hatch for those explicit known cases, of which 
> I recently encountered:
> I recently rolled this out to some of our test clusters, most of which are 
> small. Afterward, doing a rolling restart of DataNodes caused the following 
> IOException: "Failed to replace a bad datanode on the existing pipeline due 
> to no more good datanodes being available to try..."
> If you're familiar with HDFS pipeline recovery, this error will be familiar. 
> Basically the restarted DataNodes caused pipeline failures, those datanodes 
> were added to an internal exclude list that never gets cleared, and 
> eventually there were no more nodes to choose from resulting in an error.
> This error is pretty explicit, and at this point the DFSOutputStream for the 
> WAL is dead. I think this error is a reasonable one to simply bubble up and 
> not abort the RegionServer on, instead just failing and rolling back the 
> writes.
> What do people think about starting an allowlist of known good error messages 
> for which we do not trigger an abort of the RS? Something like this:
> {code:java}
> } catch (Throwable t) {
>   // WAL sync failed. Aborting to avoid a mismatch between the memstore, WAL,
>   // and any replicated clusters.
>   if (!walSyncSuccess && !allowedException(t)) {
>     rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);
>   }
> }
>
> // ... snip ...
>
> private boolean allowedException(Throwable t) {
>   return t.getMessage().startsWith("Failed to replace a bad datanode");
> }
> {code}
> We could of course make this configurable if people like, or just add to it over 
> time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
