[
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689160#comment-13689160
]
Feng Honghua commented on HBASE-8721:
-------------------------------------
Here are some merits of the behavior 'Delete can't mask puts that happen after
the delete':
1) It avoids the inconsistency I mentioned above. With our patch, the user can
always read the put written at step 4>, which is more natural and intuitive.
Under the current behavior the result depends on whether a major compaction ran:
1> put a kv (timestamp = T0), and flush;
2> delete that kv using a DeleteColumn type kv with timestamp T0 (or any
timestamp >= T0), and flush;
3> a major compaction occurs [or not];
4> put that kv again (timestamp = T0);
5> read that kv;
===>
a) if a major compaction occurs at step 3>, then step 5> gets the put
written at step 4>
b) if no major compaction occurs at step 3>, then step 5> gets nothing
2) It provides a strong guarantee for this operation: "I don't know which or
how many versions are in a cell; I just want to remove all existing ones, put a
new version, and be sure that only this new put is in the cell, regardless of
how its ts compares with the old ones" (I think this operation/guarantee is
useful in many scenarios). The current delete behavior can't provide such a
guarantee.
3) 'delete latest version' (deleteColumn() without ts) can be tuned to skip the
read that looks up the latest version's ts during 'deleteColumn'. The current
delete behavior can't be tuned to remove that read.
4) 'A new put can't be masked (made to disappear) by an old/existing delete' is
itself a merit for many use cases / applications, since it's more natural and
intuitive. I have explained the old version/delete semantics to different
customers many times, and without exception their first response was
"weird... why so?"
Per my understanding, which differs from that of [~lhofhansl] and [~sershe],
'timestamp' is just a long used to order versions by the rule 'the bigger/later
wins'. It happens that a timestamp in the 'time' sense is also a long, that a
new put stamped with the 'current' time gets a bigger timestamp, and that in
most cases new put versions therefore knock out older ones. For many use cases
the time semantics of 'timestamp' are enough for the real-world requirement,
but by design that's not always the case; otherwise the timestamp wouldn't be
exposed for the user to set explicitly.
In a word, as long as the user knows that 'timestamp' is just a long-typed
dimension that determines version ordering by the rule 'the bigger wins', he can
reason out the result of any operation sequence. In essence, 'timestamp as a
dimension for version ordering' is unrelated to the delete semantics.
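
As a small illustration of 'the bigger wins', the sketch below picks the
winning version purely by comparing the long values, independent of the order
in which the versions were written. The latest_version helper and the sample
data are hypothetical; ties between equal timestamps are not addressed here.

```python
def latest_version(versions):
    """Return the (timestamp, value) pair with the largest timestamp.

    `versions` is a list of (timestamp, value) tuples in write order.
    Returns None when the cell holds no versions.
    """
    return max(versions, key=lambda v: v[0], default=None)


# Write order is first, second, third; the timestamps were set explicitly
# by the writer and do not follow that order.
versions = [(5, "written first"), (9, "written second"), (7, "written third")]
winner = latest_version(versions)
```

Here winner is (9, "written second"): the version written last does not win,
because only the timestamp value takes part in the ordering.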
-- I know my understanding is arguable for many people, since the old delete
semantics and behavior have existed for so long that everybody already takes
them for granted (I mean no offence here).
Finally, here are the downsides of the alternative solutions suggested to me:
A> 'KEEP_DELETED_CELLS' is definitely a nice feature, but many users don't need
it (to time-travel or trace back action history), and it prevents major
compaction from shrinking the data set by collecting deleted cells.
B> Disallowing the user from setting the timestamp explicitly limits HBase's
schema flexibility and prohibits many innovative designs such as Facebook's
message search index. It also can't guarantee unique timestamps, so it can
still lead to tricky / confusing behavior.
> Deletes can mask puts that happen after the delete
> --------------------------------------------------
>
> Key: HBASE-8721
> URL: https://issues.apache.org/jira/browse/HBASE-8721
> Project: HBase
> Issue Type: Improvement
> Components: regionserver
> Reporter: Feng Honghua
> Attachments: HBASE-8721-0.94-V0.patch
>
>
> this fix aims for bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
> "Deletes mask puts, even puts that happened after the delete was entered.
> Remember that a delete writes a tombstone, which only disappears after then
> next major compaction has run. Suppose you do a delete of everything <= T.
> After this you do a new put with a timestamp <= T. This put, even if it
> happened after the delete, will be masked by the delete tombstone. Performing
> the put will not fail, but when you do a get you will notice the put did have
> no effect. It will start working again after the major compaction has run.
> These issues should not be a problem if you use always-increasing versions
> for new puts to a row. But they can occur even if you do not care about time:
> just do delete and put immediately after each other, and there is some chance
> they happen within the same millisecond."