[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956985#comment-16956985 ]

Duo Zhang commented on HBASE-23181:
-----------------------------------

If we do not use ASYNC_WAL then I do not think this could be the problem. Here 
we hold the writeLock of updateLock, so if we can reach this code, all the 
updates to the region should have been done, which means there should be no 
entries left in the ring buffer of the WAL. Do you see any SyncFuture timeout 
exceptions, like "Failed to get sync result after..."? If that happens then it 
could be a problem, as we release the updateLock before actually waiting for 
the WAL entry to be flushed out, but the default timeout is 5 minutes, so 
timing out should be very rare I think?
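
For illustration, here is a minimal sketch of the ordering described above. 
All class and method names here are hypothetical simplifications, not the 
actual HBase internals: the point is only that the write lock guarantees no 
appends are in flight, while the wait on the sync result happens after the 
lock is released, bounded by the roughly 5 minute timeout.

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical simplification of the ordering described above; not real
// HBase code.
class WalSyncSketch {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

  // Writers append under the read lock, so holding the write lock below
  // guarantees no append to this region is still in flight.
  CompletableFuture<Void> append(byte[] entry) {
    updatesLock.readLock().lock();
    try {
      return publishToRingBuffer(entry); // completes once the entry is flushed
    } finally {
      updatesLock.readLock().unlock();
    }
  }

  void writeFlushMarker() throws Exception {
    CompletableFuture<Void> syncFuture;
    updatesLock.writeLock().lock();
    try {
      // With the write lock held, all prior updates are done, so the ring
      // buffer should drain down to just this marker entry.
      syncFuture = publishToRingBuffer(new byte[0]);
    } finally {
      updatesLock.writeLock().unlock(); // released BEFORE waiting for the sync
    }
    try {
      // The default wait is 5 minutes; timing out here is the
      // "Failed to get sync result after..." case asked about above.
      syncFuture.get(5, TimeUnit.MINUTES);
    } catch (TimeoutException e) {
      throw new Exception("Failed to get sync result after 300000 ms", e);
    }
  }

  private CompletableFuture<Void> publishToRingBuffer(byte[] entry) {
    return CompletableFuture.completedFuture(null); // stand-in for the WAL ring buffer
  }
}
{code}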

And I think a more general solution is in HBASE-23157, which would also cover 
ASYNC_WAL, where the mvcc is completed without waiting for the WAL entry to be 
flushed out.



> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it 
> is not online on us"
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23181
>                 URL: https://issues.apache.org/jira/browse/HBASE-23181
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.1
>            Reporter: Michael Stack
>            Priority: Major
>
> On a heavily loaded cluster, the WAL count keeps rising and we can get into a 
> state where we are not rolling the logs off fast enough. In particular, there 
> is an interesting state at the extreme where we pick a region to flush 
> because of 'Too many WALs' but the region is actually not online. As the WAL 
> count rises, we keep picking a region-to-flush that is no longer on the 
> server. This condition blocks our ability to clear WALs; eventually WALs 
> climb into the hundreds and the RS goes zombie with a full Call queue that 
> starts throwing CallQueueTooLargeExceptions (bad if this server is the one 
> carrying hbase:meta); i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will 
> break the bind.
> Here is how it looks in the log:
> {code}
> # Here is region closing....
> 2019-10-16 23:10:55,897 INFO 
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 
> 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is 
> not online on us
> 2019-10-16 23:11:45,006 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; 
> count=45, max=32; forcing flush of 1 regions(s): 
> 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; 
> count=542, max=32; forcing flush of 1 regions(s): 
> 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is 
> not online on us
> {code}
> I've seen these runaway WALs in 2.2.1. I've also seen runaway WALs regularly 
> in a 1.2.x version that had the HBASE-16721 fix in it, but can't say yet 
> whether it was for the same reason as above. A sketch of the stuck loop 
> follows.
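> To make the stuck state concrete, here is a minimal sketch of the loop. All 
> names are hypothetical simplifications, not the real LogRoller/AbstractFSWAL 
> code: the oldest WAL cannot be archived until every region with unflushed 
> edits in it has been flushed, and a region that is no longer online here can 
> never be flushed.
> {code}
> import java.util.HashSet;
> import java.util.Set;
> 
> // Hypothetical simplification of the blocked-archive loop; not real HBase code.
> class WalArchiveSketch {
>   private static final int MAX_WALS = 32;
>   private int walCount = 45;
>   // regions whose edits live only in the oldest WAL file
>   private final Set<String> regionsWithUnflushedEdits = new HashSet<>();
>   private final Set<String> onlineRegions = new HashSet<>();
> 
>   void onLogRoll() {
>     if (walCount <= MAX_WALS) {
>       return; // nothing to do; old WALs can be archived normally
>     }
>     System.out.println("Too many WALs; count=" + walCount + ", max=" + MAX_WALS);
>     for (String region : regionsWithUnflushedEdits) {
>       if (onlineRegions.contains(region)) {
>         requestFlush(region); // flushing lets the oldest WAL be archived
>       } else {
>         // The stuck state from the log above: the flush is never scheduled,
>         // the oldest WAL keeps its unflushed edits, and walCount only grows.
>         System.out.println("Failed to schedule flush of " + region
>             + ", because it is not online on us");
>       }
>     }
>   }
> 
>   private void requestFlush(String region) {
>     // would flush the region's memstore so the old WAL can be archived
>   }
> }
> {code}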


