[
https://issues.apache.org/jira/browse/HBASE-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yunfan Zhong updated HBASE-10466:
---------------------------------
Description:
When there are failed flushes, the data to be flushed is kept in each
MemStore's snapshot, and the next flush attempt continues from that snapshot
first. However, after a flush succeeds, the counter of total memstore size in
HRegion is always decremented by the sum of the current memstore sizes. This
calculation is wrong if the previous flush failed.
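For illustration, here is a minimal sketch of this bookkeeping problem,
assuming a region-level AtomicLong counter; the class, field, and parameter
names below (FlushAccountingSketch, totalMemstoreSize, bytesActuallyFlushed)
are hypothetical and not taken from the 0.89-fb source:
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; the names here are hypothetical and do not come
// from the 0.89-fb source.
class FlushAccountingSketch {
  private final AtomicLong totalMemstoreSize = new AtomicLong();

  // Called after a flush succeeds. currentSizeAtFlushStart is the sum of the
  // current memstore sizes; bytesActuallyFlushed is what was really written
  // out (only the old snapshot when the previous flush had failed).
  void onFlushSuccess(long currentSizeAtFlushStart, long bytesActuallyFlushed) {
    // Accounting described above: always subtracts the whole current size,
    // even though part of it may never have been flushed.
    totalMemstoreSize.addAndGet(-currentSizeAtFlushStart);

    // A correct version would subtract only what was actually flushed, e.g.:
    // totalMemstoreSize.addAndGet(-bytesActuallyFlushed);
  }
}
{code}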
When the region is closing, two flushes are performed. While some data was
still in the snapshot and the memstore size counter was therefore incorrect,
the first flush successfully wrote the snapshot data, but the counter was
reduced to 0 or even below. This prevented the second flush, because
HRegion.internalFlushcache() returns immediately when the total memstore size
is not greater than 0. As a result, the data remaining in the memstores was
lost.
This can cause massive data loss, up to the size limit of the memstores.
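As a rough worked example of the failure mode on close (all names and sizes
below are hypothetical; only the "not greater than 0" guard mirrors the check
mentioned above):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Self-contained illustration of the two close-time flushes; all names and
// sizes are hypothetical, not HBase code.
public class CloseTimeFlushDemo {
  static final AtomicLong totalMemstoreSize = new AtomicLong();

  public static void main(String[] args) {
    long snapshotBytes = 300L << 20; // left in snapshot by an earlier failed flush
    long liveBytes     = 200L << 20; // newer writes still in the memstores
    totalMemstoreSize.set(snapshotBytes + liveBytes);

    // First flush on close: writes only the old snapshot, but subtracts the
    // full current size, driving the counter to 0.
    long sizeAtFlushStart = totalMemstoreSize.get();
    totalMemstoreSize.addAndGet(-sizeAtFlushStart);

    // Second flush on close: skipped by the "greater than 0" guard, so the
    // remaining live bytes never reach an HFile (they exist only in the HLog).
    if (totalMemstoreSize.get() <= 0) {
      System.out.println("Flush skipped; " + liveBytes + " bytes left unflushed");
    }
  }
}
{code}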
was:
When there are failed flushes, the data to be flushed is kept in each
MemStore's snapshot, and the next flush attempt continues from that snapshot
first. However, after a flush succeeds, the counter of total memstore size in
HRegion is always decremented by the sum of the current memstore sizes. This
calculation is wrong if the previous flush failed.
When the server is shutting down, two flushes are performed. During the
period when this issue occurred, the first flush successfully wrote the
snapshot data, but the size counter was reduced to 0 or even below. This
prevented the second flush, because HRegion.internalFlushcache() returns
immediately when the total memstore size is not greater than 0. As a result,
the data remaining in the memstores was lost.
This can cause massive data loss, up to the size limit of each memstore. For
example, a region had 516.3M of data in its memstores (the size limit is
516M) because flushes had been failing for more than an hour. After the
region was closed, these KVs were missing from the HFiles but still existed
in the HLog.
> Wrong calculation of total memstore size in HRegion which could cause data
> loss
> -------------------------------------------------------------------------------
>
> Key: HBASE-10466
> URL: https://issues.apache.org/jira/browse/HBASE-10466
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.89-fb
> Reporter: Yunfan Zhong
> Priority: Critical
> Fix For: 0.89-fb
>
>
> When there are failed flushes, the data to be flushed is kept in each
> MemStore's snapshot, and the next flush attempt continues from that snapshot
> first. However, after a flush succeeds, the counter of total memstore size
> in HRegion is always decremented by the sum of the current memstore sizes.
> This calculation is wrong if the previous flush failed.
> When the region is closing, two flushes are performed. While some data was
> still in the snapshot and the memstore size counter was therefore incorrect,
> the first flush successfully wrote the snapshot data, but the counter was
> reduced to 0 or even below. This prevented the second flush, because
> HRegion.internalFlushcache() returns immediately when the total memstore
> size is not greater than 0. As a result, the data remaining in the memstores
> was lost.
> This can cause massive data loss, up to the size limit of the memstores.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)