[ 
https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956374#comment-16956374
 ] 

Michael Stack commented on HBASE-23181:
---------------------------------------

Studying a live 1.2.x deploy (trying to dig up clues on why we go bad), I see 
that flushing went bad on a region split; thereafter we could no longer flush 
the parent, yet we kept trying to so that we could clear the oldest WAL.

2019-10-18 20:15:03,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: 
Region split, hbase:meta updated, and report to master. 
Parent=XYZ,7dd2c3e4-a908-42cd-9dd7-bf20651acdbb,1522152449885.0f4cd0e561a660ea6b11d4703258cd60.,
 new regions: 
XYZ,7dd2c3e4-a908-42cd-9dd7-bf20651acdbb,1571429702384.b912e2a50897e5ca002107dd472efcfe.,
 
XYZ,7dd6df6c-1949-4116-a5ed-008f6d4ae35a,1571429702384.0890b696df81a28a1ba4e9e00e8c43c0..
 Split took 0sec

Then later... 


2019-10-18 20:19:33,622 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Too many WALs; count=33, max=32; forcing flush of 8 regions(s): 
4ab5672c35baa260ce3742e403c03d3b, 0890b696df81a28a1ba4e9e00e8c43c0, 
89ca70c06e66c0fe8732d7ebfc9e6eff, f7b98bf63c74325a7ee703be48cbd991, 
0cba425e6b3caec2af1ff0d52eaffd92, 0f4cd0e561a660ea6b11d4703258cd60, 
9758363e8d237b8559385b4f2a2da78d, b912e2a50897e5ca002107dd472efcfe

but...

2019-10-18 20:19:33,623 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
Failed to schedule flush of 0f4cd0e561a660ea6b11d4703258cd60, region=null, 
requester=null

And so on.

So, it seems like split can be a problem in branch-1; i.e. we may not clear the 
parent region from the SequenceIdAccounting on split.
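
To make the suspicion concrete, here is a toy model of the failure mode. It is 
not HBase code: the class, its methods, and the region names are all made up 
for illustration, and it assumes (unconfirmed) that the parent's entry lingers 
in the per-region sequence id bookkeeping after the split takes it offline.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model only; NOT HBase code. Every name here is hypothetical. */
class StaleWalAccountingSketch {
  // Encoded region name -> lowest sequence id that has not been flushed yet.
  private final Map<String, Long> lowestUnflushedSeqId = new TreeMap<>();
  // Regions currently online on this (imaginary) server.
  private final Set<String> onlineRegions = new HashSet<>();

  void open(String region) {
    onlineRegions.add(region);
  }

  void append(String region, long seqId) {
    // Only the first unflushed edit matters for deciding what pins a WAL.
    lowestUnflushedSeqId.putIfAbsent(region, seqId);
  }

  /** Regions whose unflushed edits still pin a WAL whose highest sequence id is given. */
  List<String> regionsPinningWal(long walMaxSeqId) {
    List<String> pinned = new ArrayList<>();
    for (Map.Entry<String, Long> e : lowestUnflushedSeqId.entrySet()) {
      if (e.getValue() <= walMaxSeqId) {
        pinned.add(e.getKey());
      }
    }
    return pinned;
  }

  /** Returns false when the region is not online, mirroring "Failed to schedule flush". */
  boolean requestFlush(String region) {
    if (!onlineRegions.contains(region)) {
      return false;
    }
    lowestUnflushedSeqId.remove(region);
    return true;
  }

  /** Takes the parent offline; clearAccounting=false models the suspected bug. */
  void split(String parent, String daughterA, String daughterB, boolean clearAccounting) {
    onlineRegions.remove(parent);
    onlineRegions.add(daughterA);
    onlineRegions.add(daughterB);
    if (clearAccounting) {
      lowestUnflushedSeqId.remove(parent);
    }
  }

  /** One round of forced flushes; true if nothing pins the oldest WAL afterwards. */
  boolean tryArchiveOldestWal(long walMaxSeqId) {
    for (String region : regionsPinningWal(walMaxSeqId)) {
      requestFlush(region);
    }
    return regionsPinningWal(walMaxSeqId).isEmpty();
  }

  static boolean demo(boolean clearAccounting) {
    StaleWalAccountingSketch rs = new StaleWalAccountingSketch();
    rs.open("parent");
    rs.append("parent", 10L);
    rs.split("parent", "daughterA", "daughterB", clearAccounting);
    rs.append("daughterA", 20L);
    // The oldest WAL holds edits up to sequence id 15.
    return rs.tryArchiveOldestWal(15L);
  }

  public static void main(String[] args) {
    System.out.println("archived with stale parent entry: " + demo(false));
    System.out.println("archived with parent entry cleared: " + demo(true));
  }
}
```

With the stale entry left behind, every round picks the offline parent again, 
the flush request is refused, and the oldest WAL can never be archived. With 
the entry cleared on split, the same round succeeds.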

> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it 
> is not online on us"
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23181
>                 URL: https://issues.apache.org/jira/browse/HBASE-23181
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.1
>            Reporter: Michael Stack
>            Priority: Major
>
> On a heavily loaded cluster, WAL count keeps rising and we can get into a 
> state where we are not rolling the logs off fast enough. In particular, there 
> is this interesting state at the extreme where we pick a region to flush 
> because 'Too many WALs' but the region is actually not online. As the WAL 
> count rises, we keep picking a region-to-flush that is no longer on the 
> server. This condition blocks our being able to clear WALs; eventually WALs 
> climb into the hundreds and the RS goes zombie with a full Call queue that 
> starts throwing CallQueueTooBigExceptions (bad if this server is the one 
> carrying hbase:meta): i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will 
> break the bind.
> Here is how it looks in the log:
> {code}
> # Here is region closing....
> 2019-10-16 23:10:55,897 INFO 
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 
> 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is 
> not online on us
> 2019-10-16 23:11:45,006 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; 
> count=45, max=32; forcing flush of 1 regions(s): 
> 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; 
> count=542, max=32; forcing flush of 1 regions(s): 
> 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: 
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is 
> not online on us
> {code}
> I've seen these runaway WALs in 2.2.1. I've also regularly seen runaway WALs 
> in a 1.2.x version that had the HBASE-16721 fix in it, but can't say yet if 
> it was for the same reason as above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
