[ 
https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959387#comment-16959387
 ] 

Michael Stack commented on HBASE-23181:
---------------------------------------

My first cut at a patch here where the fix is a 'one-liner' changing order in 
which we do things I now realize as wrong (as [~zhangduo] speculated). It was 
making sure the MVCC engine was caught-up when our problem is order of appends 
to WAL completing (and their reflection in SequenceIdAccounting).

[~zhangduo] in PR, you say "Issuing a wal sync while holding the writeLock of 
updateLock is not good, and we have been doing a lot of works to avoid this..." 
What are you referring to here? Is it the effort at doing the minimum possible 
while the locks are held so as to minimize the time that we are not taking 
writes? If so, makes sense. I had been thinking of adding a wait on a sync of 
the flush edit append.

> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it 
> is not online on us"
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23181
>                 URL: https://issues.apache.org/jira/browse/HBASE-23181
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.1
>            Reporter: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.1.8, 2.2.3
>
>
> On a heavily loaded cluster, WAL count keeps rising and we can get into a 
> state where we are not rolling the logs off fast enough. In particular, there 
> is this interesting state at the extreme where we pick a region to flush 
> because 'Too many WALs' but the region is actually not online. As the WAL 
> count rises, we keep picking a region-to-flush that is no longer on the 
> server. This condition blocks our being able to clear WALs; eventually WALs 
> climb into the hundreds and the RS goes zombie with a full Call queue that 
> starts throwing CallQueueTooLargeExceptions (bad if this servers is the one 
> carrying hbase:meta): i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will 
> break the bind.
> Here is how it looks in the log:
> {code}
> # Here is region closing....
> 2019-10-16 23:10:55,897 INFO 
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 
> 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is 
> not online on us
> 2019-10-16 23:11:45,006 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; 
> count=45, max=32; forcing flush of 1 regions(s): 
> 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; 
> count=542, max=32; forcing flush of 1 regions(s): 
> 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is 
> not online on us
> {code}
> I've seen this runaway WALs 2.2.1. I've seen runaway WALs in a 1.2.x version 
> regularly that had HBASE-16721 fix in it, but can't say yet if it was for 
> same reason as above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to