[
https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953404#comment-16953404
]
Michael Stack commented on HBASE-23181:
---------------------------------------
[~busbey] Yeah, it should have (and has) flushed already, but as part of close we should
be removing the region from accounting. Not sure why it is not being removed.
No complaints around close. Let me at least add a continue if the region is not online
so we don't get stuck like this. It will be a workaround until we figure out why this is
happening. It is catastrophic when it does.
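The proposed workaround (a continue when the picked region is not online) can be sketched as a minimal, self-contained loop. To be clear, the class and method names below (WalRollSketch, scheduleFlushes, onlineRegions) are hypothetical stand-ins for illustration, not the actual LogRoller/AbstractFSWAL code:

```java
import java.util.*;

// Hedged sketch only: when "Too many WALs" forces flushes, skip regions
// that are no longer online instead of re-picking them forever, which is
// what blocks WAL archival in this issue. Names are hypothetical.
public class WalRollSketch {
    // Stand-in for the regionserver's online-regions map.
    static Set<String> onlineRegions = new HashSet<>(Set.of("region-a", "region-b"));

    // Returns the regions actually scheduled for flush; offline ones are skipped.
    static List<String> scheduleFlushes(List<String> regionsToFlush) {
        List<String> scheduled = new ArrayList<>();
        for (String region : regionsToFlush) {
            if (!onlineRegions.contains(region)) {
                // The workaround: log and move on rather than getting stuck
                // on a region that already closed.
                System.out.println("Failed to schedule flush of " + region
                        + ", because it is not online on us; skipping");
                continue;
            }
            scheduled.add(region);
        }
        return scheduled;
    }

    public static void main(String[] args) {
        // The closed region from the logs is skipped; the online one is flushed.
        System.out.println("scheduled=" + scheduleFlushes(
                List.of("8ee433ad59526778c53cc85ed3762d0b", "region-a")));
    }
}
```

The real fix would still need to explain why the region stayed in sequence id accounting after close; the continue only prevents the roller from wedging.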
The Region should be cleared from sequence id accounting in the close flush. I
see flush message here:
2019-10-16 23:10:55,884 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Finished flush of dataSize ~4.30 MB/4511054, heapSize ~4.33 MB/4543520,
currentSize=0 B/0 for 8ee433ad59526778c53cc85ed3762d0b in 47ms,
sequenceid=271148, compaction requested=true
... just before the closed region message here...
2019-10-16 23:10:55,897 INFO
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed
8ee433ad59526778c53cc85ed3762d0b
so the region should have been removed from sequence id accounting.
[~gxcheng] No ASYNC_WAL in the mix here, sir. Thanks for the intercession.
> Blocked WAL archive: "LogRoller: Failed to schedule flush of
> 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us"
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-23181
> URL: https://issues.apache.org/jira/browse/HBASE-23181
> Project: HBase
> Issue Type: Bug
> Reporter: Michael Stack
> Priority: Major
>
> On a heavily loaded cluster, the WAL count keeps rising and we can get into a
> state where we are not rolling the logs off fast enough. In particular, there
> is this interesting state at the extreme where we pick a region to flush
> because 'Too many WALs' but the region is actually not online. As the WAL
> count rises, we keep picking a region-to-flush that is no longer on the
> server. This condition blocks our being able to clear WALs; eventually WALs
> climb into the hundreds and the RS goes zombie with a full Call queue that
> starts throwing CallQueueTooLargeExceptions (bad if this server is the one
> carrying hbase:meta).
> Here is how it looks in the log:
> {code}
> # Here is region closing....
> 2019-10-16 23:10:55,897 INFO
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed
> 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is
> not online on us
> 2019-10-16 23:11:45,006 INFO
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
> count=45, max=32; forcing flush of 1 regions(s):
> 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
> count=542, max=32; forcing flush of 1 regions(s):
> 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is
> not online on us
> {code}
> I've seen runaway WALs like this in old 1.2.x hbase; this exception is from
> 2.2.1.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)