[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956635#comment-16956635 ]
Michael Stack commented on HBASE-23181:
---------------------------------------
This is a candidate:
{code}
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -2606,6 +2606,11 @@ public class HRegion implements HeapSize, PropagatingConfigurationObserver, Regi
     byte[] encodedRegionName = getRegionInfo().getEncodedNameAsBytes();
     try {
       if (wal != null) {
+        // Get the next sequenceid. The method will not return until mvcc has caught up to the
+        // returned flushOpSeqId. This means all edits have been committed up to this point.
+        // This must be done before we call wal#startCacheFlush else edits may complete AFTER
+        // the call to startCacheFlush.
+        flushOpSeqId = getNextSequenceId(wal);
         Long earliestUnflushedSequenceIdForTheRegion =
             wal.startCacheFlush(encodedRegionName, flushedFamilyNamesToSeq);
         if (earliestUnflushedSequenceIdForTheRegion == null) {
@@ -2616,7 +2621,6 @@ public class HRegion implements HeapSize, PropagatingConfigurationObserver, Regi
             new FlushResultImpl(FlushResult.Result.CANNOT_FLUSH, msg, false),
             myseqid);
         }
-        flushOpSeqId = getNextSequenceId(wal);
         // Back up 1, minus 1 from oldest sequence id in memstore to get last 'flushed' edit
         flushedSeqId =
             earliestUnflushedSequenceIdForTheRegion.longValue() == HConstants.NO_SEQNUM?
{code}
HDFS is usually struggling when this issue hits (loaded cluster). I notice
after study that the 'stuck' sequenceid that SequenceIdAccounting pukes out when
the PURGE code runs (see attached patch) is usually one less than the sequenceid
used to stamp the last successful flush (this flush is usually the flush
that happens on region close). Somehow the 'stuck' sequenceid is getting added
to the SequenceIdAccounting lowest entries for the region AFTER the flush's
sequenceid.
There is a hole between wal.startCacheFlush, where we do the sequenceid
accounting for the current memstore flush, and the following getNextSequenceId,
which waits until all sequenceids have 'completed' and then gives out the next
highest. The supposition is that the flush begins, the last appends complete
(and get added to SequenceIdAccounting) after startCacheFlush has taken its
accounting, and only then do we do the wait on all edits in flight.
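To illustrate the supposition, here is a toy sketch of the interleaving. It is
NOT the real classes: the map and the append/flush steps below are simplified
stand-ins I made up for SequenceIdAccounting, wal#startCacheFlush, and
getNextSequenceId.
{code}
// Toy model only: nextSeqId stands in for the WAL sequenceid source and
// lowestUnflushed for a per-region lowest-unflushed-sequenceid map.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class FlushRaceSketch {
  static final AtomicLong nextSeqId = new AtomicLong(100);
  static final ConcurrentHashMap<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

  // An append takes a sequenceid and records it as unflushed for the region.
  static long append(String region) {
    long seq = nextSeqId.getAndIncrement();
    lowestUnflushed.putIfAbsent(region, seq);
    return seq;
  }

  public static void main(String[] args) {
    String region = "8ee433ad59526778c53cc85ed3762d0b";
    append(region);                                  // edit committed before the flush starts
    // Old ordering: snapshot-and-clear the accounting first (~ wal.startCacheFlush)...
    Long coveredUpTo = lowestUnflushed.remove(region);
    // ...then an in-flight append completes and re-populates the accounting...
    long late = append(region);
    // ...and only now do we wait on edits and take the flush sequenceid (~ getNextSequenceId).
    long flushOpSeqId = nextSeqId.getAndIncrement();
    // The flush covers up to 'coveredUpTo'; the late entry (flushOpSeqId - 1) stays behind.
    // If the region then closes, nothing ever flushes it out, so it pins the WALs.
    System.out.println("flush covers " + coveredUpTo + ", flushOpSeqId=" + flushOpSeqId
        + ", stuck lowest entry=" + lowestUnflushed.get(region) + " (late append " + late + ")");
  }
}
{code}
In this model the stuck entry comes out one less than the flush's sequenceid,
which lines up with what the PURGE dump shows. With the candidate ordering
above, the getNextSequenceId wait happens before the accounting is taken, so
the late append is already in the map when startCacheFlush clears it.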
Testing.
> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it
> is not online on us"
> --------------------------------------------------------------------------------------------------
>
> Key: HBASE-23181
> URL: https://issues.apache.org/jira/browse/HBASE-23181
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.2.1
> Reporter: Michael Stack
> Priority: Major
>
> On a heavily loaded cluster, WAL count keeps rising and we can get into a
> state where we are not rolling the logs off fast enough. In particular, there
> is this interesting state at the extreme where we pick a region to flush
> because 'Too many WALs' but the region is actually not online. As the WAL
> count rises, we keep picking a region-to-flush that is no longer on the
> server. This condition blocks our being able to clear WALs; eventually WALs
> climb into the hundreds and the RS goes zombie with a full Call queue that
> starts throwing CallQueueTooLargeExceptions (bad if this server is the one
> carrying hbase:meta): i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will
> break the bind.
> Here is how it looks in the log:
> {code}
> # Here is region closing....
> 2019-10-16 23:10:55,897 INFO
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed
> 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is
> not online on us
> 2019-10-16 23:11:45,006 INFO
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
> count=45, max=32; forcing flush of 1 regions(s):
> 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
> count=542, max=32; forcing flush of 1 regions(s):
> 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is
> not online on us
> {code}
> I've seen these runaway WALs in 2.2.1. I've regularly seen runaway WALs in a
> 1.2.x version that had the HBASE-16721 fix in it, but can't say yet if it was
> for the same reason as above.