Lars Hofhansl created HBASE-12363:
-------------------------------------

             Summary: KEEP_DELETED_CELLS considered harmful?
                 Key: HBASE-12363
                 URL: https://issues.apache.org/jira/browse/HBASE-12363
             Project: HBase
          Issue Type: Sub-task
            Reporter: Lars Hofhansl


Brainstorming...

This morning in the train (of all places) I realized a fundamental issue in how 
KEEP_DELETED_CELLS is implemented.

The problem is around knowing when it is safe to remove a delete marker (we 
cannot remove it unless all cells affected by it are removed as well).
This was particularly hard for family markers, since they sort before all cells 
of a row, and hence scanning forward through an HFile you cannot know whether 
the family markers are still needed until at least the entire row is scanned.

My solution was to keep the TS of the oldest put in any given HFile, and only 
remove delete markers older than that TS.
That sounds good on the face of it... But now imagine you write a version of 
ROW 1 and then never update it again. Then later you write a billion other rows 
and delete them all. Since the TS of the cells in ROW 1 is older than all the 
delete markers for the other billion rows, these markers will never be 
collected... At least for the region that hosts ROW 1 after a major compaction.
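
To make the failure mode concrete, here is a minimal sketch of the rule 
described above. It is not HBase code: the class and method names are 
hypothetical, timestamps are plain longs, and a "file" is reduced to its 
earliest put TS plus a list of delete marker timestamps.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model (not HBase code) of the "earliest put TS" rule. */
public class MarkerExpirySketch {

    /**
     * A delete marker at markerTs can only affect puts with ts <= markerTs,
     * so it is safe to drop only if it is older than every put in the file.
     */
    public static boolean canDropMarker(long markerTs, long earliestPutTs) {
        return markerTs < earliestPutTs;
    }

    /** Counts the delete markers that survive a compaction under that rule. */
    public static int survivingMarkers(long earliestPutTs, List<Long> markerTimestamps) {
        int kept = 0;
        for (long ts : markerTimestamps) {
            if (!canDropMarker(ts, earliestPutTs)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long row1PutTs = 10L; // ROW 1: written once, never updated again

        // Markers for the other rows, all written after ROW 1's put.
        List<Long> markers = new ArrayList<>();
        for (long ts = 1_000L; ts < 2_000L; ts++) {
            markers.add(ts);
        }

        // Every marker is newer than ROW 1's put, so none can be dropped.
        System.out.println(survivingMarkers(row1PutTs, markers));
    }
}
```

The single old put pins `earliestPutTs` at 10, so every later marker is kept 
even though it belongs to a different row.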

I don't see a good way out of this. In the parent issue I outlined these four 
options:
# Only allow the new flag set on CFs with TTL set. MIN_VERSIONS would not apply 
to deleted rows or delete marker rows (wouldn't know how long to keep family 
deletes in that case). (MAX)VERSIONS would still be enforced on all row types 
except for family delete markers.
# Translate family delete markers to column delete markers at (major) compaction 
time.
# Change HFileWriterV* to keep track of the earliest put TS in a store and 
write it to the file metadata. Use that to expire delete markers that are older 
and hence can't affect any puts in the file.
# Have Store.java keep track of the earliest put in internalFlushCache and 
compactStore and then append it to the file metadata. That way HFileWriterV* 
would not need to know about KVs.

And I implemented #4.
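
As a rough illustration of the bookkeeping in option 4, the sketch below 
tracks the earliest put TS while cells are written and then records it in a 
metadata map. It is a stand-in only: the class, the metadata key, and the map 
are assumptions made for this example and are not the actual Store.java or 
HFile writer APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of tracking the earliest put TS during flush/compaction. */
public class EarliestPutTracker {

    // Hypothetical metadata key, chosen for this sketch only.
    public static final String EARLIEST_PUT_TS_KEY = "EARLIEST_PUT_TS";

    // Long.MAX_VALUE means "no put seen": every marker is then older than it.
    private long earliestPutTs = Long.MAX_VALUE;

    /** Called once per cell as it is written out; delete markers are ignored. */
    public void track(boolean isPut, long timestamp) {
        if (isPut && timestamp < earliestPutTs) {
            earliestPutTs = timestamp;
        }
    }

    public long getEarliestPutTs() {
        return earliestPutTs;
    }

    /** Stand-in for appending the value to the file metadata after writing. */
    public void appendTo(Map<String, Long> fileMetadata) {
        fileMetadata.put(EARLIEST_PUT_TS_KEY, earliestPutTs);
    }

    public static void main(String[] args) {
        EarliestPutTracker tracker = new EarliestPutTracker();
        tracker.track(false, 5L);  // delete marker, not counted
        tracker.track(true, 42L);  // put
        tracker.track(true, 17L);  // older put

        Map<String, Long> fileMetadata = new HashMap<>();
        tracker.appendTo(fileMetadata);
        System.out.println(fileMetadata);
    }
}
```

Because the tracker only needs a put/delete flag and a timestamp, the writer 
that persists the metadata does not have to understand the cells themselves, 
which is the point of option 4.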

I'd love to get input on ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
