[ https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528433#comment-15528433 ]
Jerry He commented on HBASE-16721:
----------------------------------

I had a similar problem on a customer production cluster recently. The WALs for one of the region servers (server 11) kept accumulating, and these log messages showed up repeatedly:
{code}
2016-09-03 14:37:15,989 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Too many hlogs: logs=817, maxlogs=32; forcing flush of 2 regions(s): 1b86c057f80721d4fde43a303f63ebde, 32d36d4864259dc9d984326bf27dcc5e
2016-09-03 14:37:15,990 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 1b86c057f80721d4fde43a303f63ebde, region=null, requester=null
2016-09-03 14:37:15,990 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 32d36d4864259dc9d984326bf27dcc5e, region=null, requester=null
{code}
It turned out that the two regions were open and hosted on other region servers, not on this one. After I manually moved the complained-about regions from the other region servers back to server 11, server 11 was able to finish the flushes, and its WAL file count came down right after that. I didn't have a chance to look into the root cause. Some of the region servers had crashed before that.

> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
>                 Key: HBASE-16721
>                 URL: https://issues.apache.org/jira/browse/HBASE-16721
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 1.2.4
>
> I'm inspecting an interesting case where, in a production cluster, some regionservers end up accumulating hundreds of WAL files, even with force flushes going on due to max logs. This happened multiple times on the cluster, but not on other clusters. The cluster has the periodic memstore flusher disabled; however, this still does not explain why the force flush of regions due to the max-logs limit is not working. I think the periodic memstore flusher just masks the underlying problem, which is why we do not see this in other clusters.
> The problem starts like this:
> {code}
> 2016-09-21 17:49:18,272 INFO [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> Then it continues until the RS is restarted:
> {code}
> 2016-09-23 17:43:49,356 INFO [regionserver//10.2.0.55:16020.logRoller] wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN [regionserver//10.2.0.55:16020.logRoller] regionserver.LogRoller: Failed to schedule flush of d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already split some time ago, and it was able to flush its data and split without any problems. However, the FSHLog still thinks that there is some unflushed data for this region.
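
For readers following along, here is a minimal, self-contained sketch of what the {{region=null, requester=null}} warning implies, assuming the log roller resolves the named region through the server's own online-region map. Class and method names below are hypothetical, not the actual LogRoller/FSHLog source; this only illustrates why a region hosted elsewhere can never be flushed locally.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the roller looks up the region by encoded name in the
// server's map of online regions; if the region is not online on *this*
// server, the lookup returns null, only a WARN is logged, and no flush happens.
public class FlushScheduleSketch {

  // Stand-in for the region server's online-region map (encoded name -> region).
  private final Map<String, Object> onlineRegions = new ConcurrentHashMap<>();

  /** Returns true if a flush was actually scheduled for the named region. */
  public boolean scheduleFlush(String encodedRegionName) {
    Object region = onlineRegions.get(encodedRegionName);
    if (region == null) {
      // The situation in the logs above: the WAL still tracks unflushed edits
      // for the region, but the region is hosted elsewhere (or was split away),
      // so there is nothing local to flush and the WAL count keeps growing.
      System.out.println("WARN Failed to schedule flush of " + encodedRegionName
          + ", region=" + region + ", requester=null");
      return false;
    }
    // In the real server a flush requester would be asked to flush the region here.
    System.out.println("INFO Scheduled flush of " + encodedRegionName);
    return true;
  }

  public static void main(String[] args) {
    FlushScheduleSketch roller = new FlushScheduleSketch();
    // The region is not in this server's online map, so the flush is skipped.
    roller.scheduleFlush("1b86c057f80721d4fde43a303f63ebde");
  }
}
{code}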
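Similarly, a hedged sketch of the per-region unflushed sequence-id accounting the description refers to (the map and method names are illustrative, not the FSHLog internals): a WAL file can only be archived once every region with edits in it has flushed past them, so one stale entry that is never cleared pins old WAL files and keeps the same region showing up in the "forcing flush" list even after it has been split away.
{code}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of per-region "lowest unflushed sequence id" accounting.
public class UnflushedSeqIdSketch {

  // encoded region name -> lowest sequence id not yet flushed for that region
  private final Map<String, Long> lowestUnflushedSeqIds = new ConcurrentHashMap<>();

  /** Record the first unflushed edit appended to the WAL for the given region. */
  public void onAppend(String encodedRegionName, long seqId) {
    lowestUnflushedSeqIds.putIfAbsent(encodedRegionName, seqId);
  }

  /** Called when a region's flush completes; clears its accounting entry. */
  public void onFlushCompleted(String encodedRegionName) {
    lowestUnflushedSeqIds.remove(encodedRegionName);
  }

  /** Regions that still hold unflushed edits and therefore block WAL archival. */
  public Set<String> regionsBlockingWalArchival() {
    return new HashSet<>(lowestUnflushedSeqIds.keySet());
  }

  public static void main(String[] args) {
    UnflushedSeqIdSketch wal = new UnflushedSeqIdSketch();
    wal.onAppend("d4cf39dc40ea79f5da4d0cf66d03cb1f", 100L);
    // If the flush-completion bookkeeping is lost (e.g. a race around a split
    // or an aborted flush), onFlushCompleted() is never called and the region
    // stays in the map even after it no longer exists on any server:
    System.out.println("Blocking archival: " + wal.regionsBlockingWalArchival());
  }
}
{code}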