[
https://issues.apache.org/jira/browse/HBASE-10557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903056#comment-13903056
]
Feng Honghua commented on HBASE-10557:
--------------------------------------
Sorry, this one duplicates HBASE-10556. I pressed the 'create' button a
second time after getting no response for a long time due to a poor network :-(
> Possible data loss due to non-handled DroppedSnapshotException for
> user-triggered flush from client/shell
> ---------------------------------------------------------------------------------------------------------
>
> Key: HBASE-10557
> URL: https://issues.apache.org/jira/browse/HBASE-10557
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: Feng Honghua
> Assignee: Feng Honghua
> Priority: Critical
>
> A code review during the investigation of HBASE-10499 exposed a possible
> data loss caused by an unhandled DroppedSnapshotException in user-triggered
> flushes.
> Data loss can happen as follows:
> # A flush for some region is triggered via HBaseAdmin or shell
> # The request reaches the regionserver and eventually
> HRegion.internalFlushcache is called. It fails while persisting the
> memstore's snapshot to an hfile, DroppedSnapshotException is thrown, and the
> snapshot is left uncleared.
> # DroppedSnapshotException is not handled in HRegion; it is just wrapped in
> a ServiceException and returned to the client
> # After a while, some new writes are handled and put in the current memstore,
> then a new flush is triggered for the region because memstoreSize exceeds
> the flush threshold
> # This second (new) flush succeeds. For the HStore that failed in the
> previous user-triggered flush, the leftover non-empty snapshot is used rather
> than a new snapshot made from the current memstore, but HLog's latest
> sequenceId is used for the resulting hfiles. The sequenceId attached to the
> hfiles says that all edits with sequenceId <= it have been persisted, which
> is not true for the edits still in the current memstore
> # Now the regionserver hosting this region dies
> # During the replay phase of failover, the edits that were in the memstore
> and not actually persisted in hfiles when the previous regionserver died are
> ignored, since comparison with the hfiles' latest sequenceId deems them
> persisted. These edits are lost...
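The sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions: ToyStore, its fields, and the use of IllegalStateException in place of DroppedSnapshotException are all hypothetical stand-ins and not HBase classes or APIs. Edits are represented by their sequence ids alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the scenario above; none of these types are HBase classes. */
class StaleSnapshotDemo {

    /** Hypothetical stand-in for a store: memstore, snapshot, and flushed files. */
    static class ToyStore {
        List<Long> memstore = new ArrayList<>();
        List<Long> snapshot = new ArrayList<>();
        List<Long> persisted = new ArrayList<>();
        long maxFlushedSeqId = -1;

        /** Flushes, reusing a leftover snapshot if a previous flush failed. */
        void flush(long latestLogSeqId, boolean failWhilePersisting) {
            if (snapshot.isEmpty()) {
                snapshot = memstore;
                memstore = new ArrayList<>();
            }
            if (failWhilePersisting) {
                // Stands in for DroppedSnapshotException; the snapshot stays in place.
                throw new IllegalStateException("simulated flush failure");
            }
            persisted.addAll(snapshot);
            snapshot = new ArrayList<>();
            // The problem: this id also covers edits that are still in the memstore.
            maxFlushedSeqId = latestLogSeqId;
        }
    }

    /** Replays the log, skipping edits deemed persisted; returns the lost edits. */
    static List<Long> lostAfterReplay(List<Long> log, ToyStore store) {
        List<Long> recovered = new ArrayList<>(store.persisted);
        for (long seqId : log) {
            if (seqId > store.maxFlushedSeqId) {
                recovered.add(seqId);
            }
        }
        List<Long> lost = new ArrayList<>(log);
        lost.removeAll(recovered);
        return lost;
    }

    static List<Long> run() {
        List<Long> log = new ArrayList<>();
        ToyStore store = new ToyStore();

        // Edits 1 and 2 are written, then the user-triggered flush fails.
        for (long id = 1; id <= 2; id++) { log.add(id); store.memstore.add(id); }
        try {
            store.flush(2, true);
        } catch (IllegalStateException unhandled) {
            // Mirrors the report: the failure goes back to the client, nothing aborts.
        }

        // Edit 3 arrives, then a threshold-triggered flush succeeds.
        log.add(3L);
        store.memstore.add(3L);
        store.flush(3, false);

        // The server dies: the memstore is gone, only the log and files remain.
        return lostAfterReplay(log, store);
    }

    public static void main(String[] args) {
        System.out.println("lost edits: " + run());
    }
}
```

In this model the second flush persists only edits 1 and 2 but records 3 as the flushed sequence id, so replay skips edit 3 even though it only ever lived in the memstore.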
> For the second flush, we also can't discard the leftover snapshot and make a
> new one from the current memstore, because that would lose the data in the
> leftover snapshot. We should abort the regionserver immediately and rely on
> failover to replay the log for data safety.
> DroppedSnapshotException is correctly handled in MemStoreFlusher for
> internally triggered flushes (which are generated by flush-size / rollWriter /
> periodicFlusher). But a user-triggered flush is processed directly by
> HRegionServer->HRegion without putting a flush entry into flushQueue, and
> hence is not handled by MemStoreFlusher.
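The proposed handling (abort on a failed user-triggered flush) might look roughly like the sketch below. It is not the actual patch: ToyServer, ToyRegion, FlushFailedException, and handleUserFlush are hypothetical names invented for illustration, with FlushFailedException standing in for DroppedSnapshotException.

```java
/** Sketch only: every type here is a hypothetical stand-in, not an HBase class. */
class UserFlushAbortSketch {

    /** Stand-in for DroppedSnapshotException. */
    static class FlushFailedException extends Exception {
        FlushFailedException(String msg) { super(msg); }
    }

    /** Stand-in for a region whose flush may fail while persisting the snapshot. */
    interface ToyRegion {
        void flush() throws FlushFailedException;
    }

    /** Stand-in for the regionserver side of a user-triggered flush. */
    static class ToyServer {
        boolean aborted = false;
        String abortReason = null;

        void abort(String reason) {
            aborted = true;
            abortReason = reason;
        }

        /** Flushes on behalf of a client request and aborts if the flush fails. */
        void handleUserFlush(ToyRegion region) throws FlushFailedException {
            try {
                region.flush();
            } catch (FlushFailedException e) {
                // A snapshot may be left behind, so stop serving and let
                // failover replay the log instead of flushing again later.
                abort("flush failed, log replay required: " + e.getMessage());
                throw e; // the client still sees the failure
            }
        }
    }

    static boolean demo() {
        ToyServer server = new ToyServer();
        try {
            server.handleUserFlush(() -> {
                throw new FlushFailedException("simulated");
            });
        } catch (FlushFailedException expected) {
            // The caller is still told that the flush failed.
        }
        return server.aborted;
    }

    public static void main(String[] args) {
        System.out.println("server aborted: " + demo());
    }
}
```

The point of the sketch is only that the user-triggered path gets the same abort-on-failure treatment that the description attributes to MemStoreFlusher for internally triggered flushes.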
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)