[
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15539487#comment-15539487
]
Hudson commented on HBASE-16721:
--------------------------------
FAILURE: Integrated in Jenkins build HBase-1.2-JDK8 #33 (See
[https://builds.apache.org/job/HBase-1.2-JDK8/33/])
HBASE-16721 Concurrency issue in WAL unflushed seqId tracking - ADDENDUM (enis:
rev 77e25d32b3ad8863625c9d25e3ecd7526608acf6)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WAL.java
> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
> Key: HBASE-16721
> URL: https://issues.apache.org/jira/browse/HBASE-16721
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 1.0.0, 1.1.0, 1.2.0
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Priority: Critical
> Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.4, 1.1.8
>
> Attachments: hbase-16721_addendum.patch,
> hbase-16721_v1.branch-1.patch, hbase-16721_v2.branch-1.patch,
> hbase-16721_v2.master.patch
>
>
> I'm inspecting an interesting case where, in a production cluster, some
> regionservers end up accumulating hundreds of WAL files, even with force
> flushes going on due to max logs. This happened multiple times on that
> cluster, but not on other clusters. The cluster has the periodic memstore
> flusher disabled; however, this still does not explain why the force flush
> of regions due to the max-logs limit is not working. I think the periodic
> memstore flusher just masks the underlying problem, which is why we do not
> see this in other clusters.
> The problem starts like this:
> {code}
> 2016-09-21 17:49:18,272 INFO [regionserver//10.2.0.55:16020.logRoller]
> wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN [regionserver//10.2.0.55:16020.logRoller]
> regionserver.LogRoller: Failed to schedule flush of
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> Then it continues until the RS is restarted:
> {code}
> 2016-09-23 17:43:49,356 INFO [regionserver//10.2.0.55:16020.logRoller]
> wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN [regionserver//10.2.0.55:16020.logRoller]
> regionserver.LogRoller: Failed to schedule flush of
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already
> split some time ago, and it was able to flush its data and split without
> any problems. However, FSHLog still thinks that there is some unflushed
> data for this region.
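> A rough sketch of the failure mode as I understand it (this is not the
> actual FSHLog / sequence-id accounting code; the class, field and method
> names below are made up for illustration): a racing append re-creates the
> per-region lowest-unflushed-seqId entry after the flush has removed it, and
> once the region splits and closes, nothing ever removes the stale entry
> again, so the log roller keeps nominating a region that no longer exists:
> {code}
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical, simplified model of per-region unflushed seqId accounting.
> // None of these names are the real HBase fields or methods.
> public class SeqIdAccountingSketch {
>   // Lowest unflushed sequence id currently known for each region.
>   private final Map<String, Long> lowestUnflushedSeqIds = new ConcurrentHashMap<>();
>
>   // Called for every WAL append of the region.
>   public void onAppend(String encodedRegionName, long seqId) {
>     lowestUnflushedSeqIds.putIfAbsent(encodedRegionName, seqId);
>   }
>
>   // Called when a flush of the region completes successfully.
>   public void completeCacheFlush(String encodedRegionName) {
>     lowestUnflushedSeqIds.remove(encodedRegionName);
>   }
>
>   // Regions still pinning old WAL files, as seen by the log roller.
>   public Set<String> findRegionsToForceFlush() {
>     return lowestUnflushedSeqIds.keySet();
>   }
>
>   public static void main(String[] args) {
>     SeqIdAccountingSketch acct = new SeqIdAccountingSketch();
>     acct.onAppend("d4cf39dc40ea79f5da4d0cf66d03cb1f", 100L);
>     acct.completeCacheFlush("d4cf39dc40ea79f5da4d0cf66d03cb1f");
>     // Race: a late append for the already-flushed region re-creates the
>     // entry; the region then splits and closes, but nothing removes the
>     // entry, so the roller forever nominates a region that no longer exists.
>     acct.onAppend("d4cf39dc40ea79f5da4d0cf66d03cb1f", 101L);
>     System.out.println("Force flush candidates: " + acct.findRegionsToForceFlush());
>   }
> }
> {code}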
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)