[ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-16721:
----------------------------------
    Attachment: hbase-16721_addendum2.branch-1.1.patch

This addendum is needed for 1.1 without a proper fix for HBASE-16820. It is 
basically undoing the HRegion changes, but adding a 
{{waitForPreviousTransactionsComplete()}} call at the beginning instead. We end 
up doing two mvcc transactions, but at the end of the flush, we are guaranteed 
to have the mvcc read point to be advanced to the flush sequence id. 

> Concurrency issue in WAL unflushed seqId tracking
> -------------------------------------------------
>
>                 Key: HBASE-16721
>                 URL: https://issues.apache.org/jira/browse/HBASE-16721
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.4, 1.1.8
>
>         Attachments: hbase-16721_addendum.patch, 
> hbase-16721_addendum2.branch-1.1.patch, hbase-16721_v1.branch-1.patch, 
> hbase-16721_v2.branch-1.patch, hbase-16721_v2.master.patch
>
>
> I'm inspecting an interesting case where in a production cluster, some 
> regionservers ends up accumulating hundreds of WAL files, even with force 
> flushes going on due to max logs. This happened multiple times on the 
> cluster, but not on other clusters. The cluster has periodic memstore flusher 
> disabled, however, this still does not explain why the force flush of regions 
> due to max limit is not working. I think the periodic memstore flusher just 
> masks the underlying problem, which is why we do not see this in other 
> clusters. 
> The problem starts like this: 
> {code}
> 2016-09-21 17:49:18,272 INFO  [regionserver//10.2.0.55:16020.logRoller] 
> wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN  [regionserver//10.2.0.55:16020.logRoller] 
> regionserver.LogRoller: Failed to schedule flush of 
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> then, it continues until the RS is restarted: 
> {code}
> 2016-09-23 17:43:49,356 INFO  [regionserver//10.2.0.55:16020.logRoller] 
> wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN  [regionserver//10.2.0.55:16020.logRoller] 
> regionserver.LogRoller: Failed to schedule flush of 
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} is already 
> split some time ago, and was able to flush its data and split without any 
> problems. However, the FSHLog still thinks that there is some unflushed data 
> for this region. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to