[
https://issues.apache.org/jira/browse/HBASE-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yunfan Zhong updated HBASE-10466:
---------------------------------
Description:
During region close, two flushes are performed to ensure that no data remains
only in memory. When there is data in the current memstore alone, one flush is
sufficient. When there is also data in the memstore's snapshot, two flushes are
essential; otherwise we lose data. However, we recently found two bugs that
cause at least one of these flushes to be skipped, resulting in data loss.
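For context, a minimal sketch of the two-flush close sequence this describes
(class, method, and flag names here are illustrative stand-ins, not the actual
HRegion code):
{code:java}
// Sketch of the two-flush close protocol; names are illustrative.
class CloseSketch {
    boolean abortRequested;

    void close() {
        if (!abortRequested) {
            internalFlushcache(); // flush 1 ("pre-flush"): drains any snapshot
                                  // left over from an earlier failed flush
        }
        // ... block new writes, wait for in-flight operations ...
        internalFlushcache();     // flush 2: drains the current memstore
    }

    void internalFlushcache() { /* snapshot the memstore and persist it */ }
}
{code}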
Bug 1: Wrong calculation of HRegion.memstoreSize
When a flush fails, the data to be flushed is kept in each MemStore's snapshot
and waits for the next flush attempt to continue with it. But when that next
flush succeeds, the counter of total memstore size in HRegion is always
decremented by the sum of the current memstore sizes, not by the sizes of the
snapshots left over from the previous failed flush. The calculation is wrong:
almost every time a flush fails, HRegion.memstoreSize is subsequently reduced
by an incorrect amount. If region flushes fail for a few cycles in a row, the
current memstores can grow much larger than the snapshots, drifting
memstoreSize far below its true value. In the extreme case, if the error
accumulates to exceed the amount of data actually held in the memstores (at
most, roughly HRegion's memstore size limit), memstoreSize drops to zero or
below and any further flush is skipped, because flush does nothing unless
memstoreSize is greater than 0.
When the region is closing, if both flushes are skipped and data is left in the
current memstore and/or the snapshot, we can lose up to the region's memstore
size limit worth of data.
The fix is to deduct from memstoreSize the size of the data that is actually
flushed.
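To make the accounting error concrete, suppose a failed flush retains a 10 MB
snapshot while writes push the current memstore to 50 MB: the next successful
flush writes out 10 MB but decrements the counter by 50 MB, leaving it 40 MB
below reality. A minimal sketch of the two accounting variants (all names are
simplified stand-ins for the HRegion/MemStore logic, not the actual code):
{code:java}
// Simplified model of the per-region size accounting; names are illustrative.
class AccountingSketch {
    long memstoreSize;  // HRegion's running total of in-memory bytes
    long currentSize;   // bytes in the active memstore
    long snapshotSize;  // bytes retained in the snapshot by a failed flush

    // Buggy: after flushing the retained snapshot, deduct the *current* size.
    void onFlushSucceededBuggy() {
        memstoreSize -= currentSize; // wrong whenever currentSize != snapshotSize
        snapshotSize = 0;
    }

    // Fixed: deduct exactly the bytes that were flushed (the snapshot here).
    void onFlushSucceededFixed(long flushedBytes) {
        memstoreSize -= flushedBytes;
        snapshotSize = 0;
    }

    // The guard that turns the drift into data loss: once the counter hits
    // 0 or below, every flush becomes a silent no-op.
    boolean flushWouldRun() {
        return memstoreSize > 0;
    }
}
{code}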
Bug 2: Conditions for the first flush of region close (so-called pre-flush)
If memstoreSize is smaller than a certain threshold, or if a flush is already
in progress when region close starts, the first flush is skipped and only the
second flush takes place. However, two flushes are required whenever the
previous flush failed and left data in the snapshot: one flush drains the
snapshot, and the other drains the current memstore. With only one flush, the
data in the current memstore is lost.
The fix is to remove all of these conditions except the abort check, so that
region close always performs two flushes. See the sketch below.
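In sketch form, the change amounts to dropping every skip condition except the
abort check (field names are illustrative, not the actual HRegion code):
{code:java}
// Pre-flush decision on region close; names are illustrative.
class PreFlushSketch {
    boolean abortRequested;
    boolean flushInProgress;
    long memstoreSize;
    long preFlushSizeThreshold;

    // Before: skipped for "small" regions or when a flush is already running,
    // even though a failed flush may have left data in the snapshot.
    boolean shouldPreFlushBuggy() {
        return !abortRequested
            && memstoreSize > preFlushSizeThreshold
            && !flushInProgress;
    }

    // After: always pre-flush unless the server is aborting, so region close
    // is guaranteed two flushes (snapshot first, then current memstore).
    boolean shouldPreFlushFixed() {
        return !abortRequested;
    }
}
{code}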
was:
When a flush fails, the data to be flushed is kept in each MemStore's snapshot,
and the next flush attempt continues from the snapshot first. However, the
counter of total memstore size in HRegion is always decremented by the sum of
the current memstore sizes after a flush succeeds. This calculation is wrong
whenever the previous flush failed.
When the region is closing, there are two flushes. While some data sat in the
snapshot and the memstore size counter was incorrect, the first flush
successfully saved the snapshot data, but the counter was reduced to 0 or
below. This prevented the second flush, since HRegion.internalFlushcache()
returns immediately while the total memstore size is not greater than 0. As a
result, the data still in the memstores was lost.
The loss can be as large as the size limit of the memstores.
Summary: Bugs that cause flushes to be skipped during HRegion close
could cause data loss (was: Wrong calculation of total memstore size in
HRegion which could cause data loss)
> Bugs that cause flushes to be skipped during HRegion close could cause data
> loss
> ---------------------------------------------------------------------------------
>
> Key: HBASE-10466
> URL: https://issues.apache.org/jira/browse/HBASE-10466
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.89-fb
> Reporter: Yunfan Zhong
> Priority: Critical
> Fix For: 0.89-fb
>
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)