[https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960504#comment-16960504]
Hudson commented on HBASE-23181:
--------------------------------
Results for branch branch-2.1
[build #1691 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691/]:
(/) *{color:green}+1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691//General_Nightly_Build_Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
> --------------------------------------------------------------------------------------------------
>
> Key: HBASE-23181
> URL: https://issues.apache.org/jira/browse/HBASE-23181
> Project: HBase
> Issue Type: Bug
> Components: regionserver, wal
> Affects Versions: 2.2.1
> Reporter: Michael Stack
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0, 2.3.0, 2.1.8, 2.2.3
>
>
> On a heavily loaded cluster, the WAL count keeps rising and we can get into a
> state where we are not rolling the logs off fast enough. In particular, there
> is an interesting state at the extreme where we pick a region to flush
> because of 'Too many WALs', but the region is actually no longer online. As
> the WAL count rises, we keep picking a region-to-flush that is no longer on
> the server. This condition prevents us from ever clearing WALs; eventually the
> WAL count climbs into the hundreds and the RS goes zombie with a full call
> queue that starts throwing CallQueueTooBigException (bad if this server is the
> one carrying hbase:meta): i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will
> break the bind.
> Here is how it looks in the log:
> {code}
> # Here is the region closing....
> 2019-10-16 23:10:55,897 INFO
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed
> 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is
> not online on us
> 2019-10-16 23:11:45,006 INFO
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
> count=45, max=32; forcing flush of 1 regions(s):
> 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
> count=542, max=32; forcing flush of 1 regions(s):
> 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
> Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is
> not online on us
> {code}
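> To make the feedback loop concrete, here is a minimal, self-contained Java
> sketch of the mechanism (a toy model with hypothetical names, not the actual
> LogRoller/AbstractFSWAL code): the oldest WALs stay pinned by a region that
> has already been closed, the forced flush is dropped because the region is
> not online, and so the WAL count only grows.
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
>
> // Toy model of the stuck state described above; all names are hypothetical.
> public class StuckWalRollSketch {
>   static final int MAX_WALS = 32;
>
>   public static void main(String[] args) {
>     // Regions currently open on this server. The region pinning the oldest
>     // WALs has already been closed, so it is not in this set.
>     Set<String> onlineRegions = new HashSet<>();
>     onlineRegions.add("someOtherRegion");
>
>     // The region whose unflushed edits pin the oldest WALs.
>     String pinningRegion = "8ee433ad59526778c53cc85ed3762d0b";
>
>     int walCount = 45;
>     for (int attempt = 0; attempt < 3; attempt++) {
>       if (walCount > MAX_WALS) {
>         System.out.printf("Too many WALs; count=%d, max=%d; forcing flush of: %s%n",
>             walCount, MAX_WALS, pinningRegion);
>         if (!onlineRegions.contains(pinningRegion)) {
>           // The flush request is dropped, so the oldest WALs stay pinned and
>           // can never be archived; nothing breaks the loop until a restart.
>           System.out.println("Failed to schedule flush of " + pinningRegion
>               + ", because it is not online on us");
>         }
>       }
>       walCount += 100; // new writes keep rolling new WALs in the meantime
>     }
>   }
> }
> {code}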
> I've seen this runaway-WAL condition on 2.2.1. I've also regularly seen
> runaway WALs on a 1.2.x version that had the HBASE-16721 fix in it, but can't
> say yet whether it was for the same reason as above.