[ https://issues.apache.org/jira/browse/HBASE-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yunfan Zhong updated HBASE-10466:
---------------------------------

    Description: 
When a flush fails, the data to be flushed is kept in each MemStore's 
snapshot, and the next flush attempt continues with that snapshot first. 
However, after a flush succeeds, the counter of total memstore size in HRegion 
is always reduced by the sum of the current memstore sizes. This calculation is 
wrong if the previous flush failed.
When the region is closing, there are two flushes. While some data was still in 
the snapshot and the memstore size counter was incorrect, the first flush 
successfully saved the data in the snapshot, but the memstore size counter was 
reduced to 0 or even less. This prevented the second flush, since 
HRegion.internalFlushcache() returns immediately when the total memstore size 
is not greater than 0. As a result, the data in the memstores was lost.
This could cause mass data loss, up to the size limit of the memstores.

  was:
When a flush fails, the data to be flushed is kept in each MemStore's 
snapshot, and the next flush attempt continues with that snapshot first. 
However, after a flush succeeds, the counter of total memstore size in HRegion 
is always reduced by the sum of the current memstore sizes. This calculation is 
wrong if the previous flush failed.
When the server is shutting down, there are two flushes. During the missing KV 
issue period, the first flush successfully saved the data in the snapshot, but 
the size counter was reduced to 0 or even less. This prevented the second 
flush, since HRegion.internalFlushcache() returns immediately when the total 
memstore size is not greater than 0. As a result, the data in the memstores 
was lost.
This could cause mass data loss, up to the size limit of each memstore. For 
example, a region had 516.3M of data (the size limit is 516M) in its memstore 
at that moment because flushes had been failing for more than one hour. After 
the region was closed, these KVs were missing from the HFiles but existed in 
the HLog.


> Wrong calculation of total memstore size in HRegion which could cause data 
> loss
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-10466
>                 URL: https://issues.apache.org/jira/browse/HBASE-10466
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.89-fb
>            Reporter: Yunfan Zhong
>            Priority: Critical
>             Fix For: 0.89-fb
>
>
> When a flush fails, the data to be flushed is kept in each MemStore's 
> snapshot, and the next flush attempt continues with that snapshot first. 
> However, after a flush succeeds, the counter of total memstore size in 
> HRegion is always reduced by the sum of the current memstore sizes. This 
> calculation is wrong if the previous flush failed.
> When the region is closing, there are two flushes. While some data was still 
> in the snapshot and the memstore size counter was incorrect, the first flush 
> successfully saved the data in the snapshot, but the memstore size counter 
> was reduced to 0 or even less. This prevented the second flush, since 
> HRegion.internalFlushcache() returns immediately when the total memstore size 
> is not greater than 0. As a result, the data in the memstores was lost.
> This could cause mass data loss, up to the size limit of the memstores.
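
To make the sequence described above concrete, here is a minimal toy model in 
Java. It is not HBase code: the class MemstoreSizeSketch and all of its members 
are hypothetical names, the region has a single store, and "sum of the current 
memstore sizes" is assumed to mean everything the store holds when the flush 
starts (leftover snapshot plus newly written data). The "fixed" branch, which 
subtracts only the bytes actually written, is one possible correction and is 
not taken from the actual patch.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of one region with one store. Hypothetical names; not HBase code. */
final class MemstoreSizeSketch {
    private long active;    // bytes accepted since the last snapshot
    private long snapshot;  // bytes frozen for flushing; survives a failed flush
    private final AtomicLong regionCounter = new AtomicLong(); // region-wide total
    private final boolean buggy; // true: subtract everything held at flush start

    MemstoreSizeSketch(boolean buggy) {
        this.buggy = buggy;
    }

    void put(long bytes) {
        active += bytes;
        regionCounter.addAndGet(bytes);
    }

    /** Returns true if the flush ran and succeeded, false if it was skipped or failed. */
    boolean flush(boolean ioSucceeds) {
        if (regionCounter.get() <= 0) {
            return false; // early return: the counter claims there is nothing to flush
        }
        // Assumed reading of "sum of current memstore sizes": all bytes held right now.
        long heldAtStart = active + snapshot;
        if (snapshot == 0) {
            // No leftover snapshot, so freeze the active data for this flush.
            snapshot = active;
            active = 0;
        }
        if (!ioSucceeds) {
            return false; // snapshot is kept for the next attempt, counter untouched
        }
        long written = snapshot; // only the snapshot reaches disk
        snapshot = 0;
        regionCounter.addAndGet(-(buggy ? heldAtStart : written));
        return true;
    }

    long unflushedBytes() {
        return active + snapshot;
    }

    /** Replays: failed flush, more writes, then the two flushes done on close. */
    static long simulateClose(boolean buggy) {
        MemstoreSizeSketch region = new MemstoreSizeSketch(buggy);
        region.put(100);
        region.flush(false); // fails: 100 bytes stay in the snapshot
        region.put(50);      // new writes land in the active memstore
        region.flush(true);  // first close-time flush writes the old snapshot
        region.flush(true);  // second close-time flush should write the rest
        return region.unflushedBytes();
    }

    public static void main(String[] args) {
        System.out.println("buggy accounting, bytes left unflushed: " + simulateClose(true));
        System.out.println("fixed accounting, bytes left unflushed: " + simulateClose(false));
    }
}
```

In this model the buggy accounting drives the counter to 0 after the first 
close-time flush, so the second flush is skipped and 50 bytes never reach disk. 
Subtracting only the written bytes leaves the counter at 50, and the second 
flush runs and leaves nothing behind.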



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
