[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689160#comment-13689160
 ] 

Feng Honghua commented on HBASE-8721:
-------------------------------------

Here are some merits of the behavior 'a delete can't mask puts that happen 
after the delete':

1) It avoids the inconsistency I mentioned above. With our patch, the user can 
always read the put written at step 4>, which is more natural and intuitive:

  1> put a kv (timestamp = T0), and flush;
  2> delete that kv using a DeleteColumn type kv with timestamp T0 (or any 
timestamp >= T0), and flush;
  3> a major compact occurs [or not];
  4> put that kv again (timestamp = T0);
  5> read that kv;
  ===>
  a) if a major compact occurs at step 3>, then step 5> will get the put 
written at step 4>
  b) if no major compact occurs at step 3>, then step 5> gets nothing

2) It can provide a strong guarantee for this operation: "I don't know which 
or how many versions are in a cell; I just want to remove all existing ones, 
put a new version, and be sure that only this new put is in the cell, 
regardless of how its ts compares with the old ones" (I think this 
operation/guarantee is useful in many scenarios). The current delete behavior 
can't provide such a guarantee.

3) 'Delete latest version' (deleteColumn() without a ts) can be tuned to drop 
the read that looks up the latest version's ts during 'deleteColumn'. The 
current delete behavior can't be tuned to drop that read.

4) 'A new put can't be masked (made to disappear) by an old/existing delete' is 
itself a merit for many use cases / applications, since it's more natural and 
intuitive. I have explained the old version/delete semantics to different 
customers many times, and without exception their first response is 
"weird... why so?"
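
To make merits 1) and 2) concrete, here is a minimal sketch in Python. It is a 
toy model of a single column, not HBase code and not the attached patch: the 
class ToyColumn, the flag mask_later_puts, the write counter seq, and the 
"cover everything" timestamp 10**18 are all assumptions made up for 
illustration. A tombstone covers versions with timestamp <= its own; with 
mask_later_puts=True it also masks puts written after it (current behavior), 
with False it masks only puts written before it (proposed behavior).

```python
from dataclasses import dataclass, field


@dataclass
class ToyColumn:
    """Toy model of one column (not HBase code).

    mask_later_puts=True models the current behavior, False the proposed one.
    """
    mask_later_puts: bool = True
    cells: list = field(default_factory=list)  # (kind, ts, seq) tuples
    seq: int = 0                               # hypothetical write counter

    def _write(self, kind, ts):
        self.seq += 1
        self.cells.append((kind, ts, self.seq))

    def put(self, ts):
        self._write("put", ts)

    def delete_column(self, ts):
        # Tombstone covering every version with timestamp <= ts.
        self._write("delete", ts)

    def read(self):
        """Return (ts, seq) of every visible put, biggest timestamp first."""
        deletes = [(ts, seq) for kind, ts, seq in self.cells if kind == "delete"]

        def masked(ts, seq):
            return any(
                ts <= dts and (self.mask_later_puts or seq < dseq)
                for dts, dseq in deletes
            )

        puts = [(ts, seq) for kind, ts, seq in self.cells
                if kind == "put" and not masked(ts, seq)]
        return sorted(puts, reverse=True)

    def major_compact(self):
        # Drop tombstones together with the puts they mask.
        self.cells = [("put", ts, seq) for ts, seq in self.read()]


def merit_1(mask_later_puts, compact):
    """Steps 1> to 5>: is the put from step 4> readable at step 5>?"""
    col = ToyColumn(mask_later_puts)
    col.put(100)                # 1> put at T0 (flushes are not modelled)
    col.delete_column(100)      # 2> DeleteColumn at T0
    if compact:
        col.major_compact()     # 3> major compact occurs [or not]
    col.put(100)                # 4> put again at T0
    return len(col.read()) > 0  # 5> read


def merit_2(mask_later_puts):
    """Wipe unknown old versions, then write exactly one new version."""
    col = ToyColumn(mask_later_puts)
    col.put(50)
    col.put(200)
    col.delete_column(10**18)   # hypothetical "cover everything" timestamp
    col.put(100)                # new version with an explicit, smaller ts
    return [ts for ts, _ in col.read()]
```

In this model merit_1(True, compact) changes with compact, which is the 
inconsistency in a) and b), while merit_1(False, compact) does not. Likewise 
merit_2(True) returns no versions at all, while merit_2(False) returns only the 
new one.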

As I understand it, contrary to [~lhofhansl] and [~sershe], a 'timestamp' is 
just a long used to order versions by the rule 'the bigger/later wins'. It 
happens that a timestamp in the 'time' sense is also a long, and a new put 
stamped with the 'current' time has a bigger timestamp, so in most cases new 
put versions knock out older ones. For many use cases the time semantic for 
'timestamp' is enough for real-world requirements, but by design that's not 
always the case; otherwise the timestamp wouldn't be exposed for the user to 
set explicitly.

In a word, as long as the user knows that 'timestamp' is just a dimension of 
long type that determines version ordering by the rule 'the bigger wins', he 
can reason out the result of any sequence of operations. In essence, 'timestamp 
as a dimension for version ordering' isn't related to the delete semantic.
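
As a small illustration of 'the bigger wins', the sketch below picks the 
winning version from a list of puts whose timestamps are application-chosen 
revision numbers rather than clock readings. The function winning_version is 
hypothetical, and letting the later write win among equal timestamps is an 
assumption of this sketch.

```python
def winning_version(puts):
    """Pick the version a read returns from write-ordered (ts, value) puts.

    The biggest ts wins; among equal ts the later write wins (an assumption
    of this sketch). Write order does not matter otherwise.
    """
    best = None
    for ts, value in puts:
        if best is None or ts >= best[0]:
            best = (ts, value)
    return best


# Timestamps used as revision numbers, written out of order.
puts = [(7, "rev-7"), (3, "rev-3"), (5, "rev-5")]
winner = winning_version(puts)
```

Here winner is the put with timestamp 7 even though it was written first: the 
ordering comes only from the long value, not from when the put happened.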

-- I know my understanding is arguable to many people, since the old delete 
semantic and behavior have existed for so long that everybody takes them for 
granted (I mean no offence here)


Finally, I also list the downsides of the alternative solutions I received:

A> 'KEEP_DELETED_CELLS' is definitely a nice feature, but many users don't need 
it (to time-travel or trace back the action history), and it prevents major 
compaction from shrinking the data set by collecting deleted cells.

B> Disallowing the user from explicitly setting the timestamp: this limits 
HBase's schema flexibility and prohibits many innovative designs such as 
Facebook's message search index, and in the end it still can't guarantee unique 
timestamps, so it can still lead to tricky / confusing behavior.
                
> Deletes can mask puts that happen after the delete
> --------------------------------------------------
>
>                 Key: HBASE-8721
>                 URL: https://issues.apache.org/jira/browse/HBASE-8721
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Feng Honghua
>         Attachments: HBASE-8721-0.94-V0.patch
>
>
> This fix targets the bug described in http://hbase.apache.org/book.html 5.8.2.1:
> "Deletes mask puts, even puts that happened after the delete was entered. 
> Remember that a delete writes a tombstone, which only disappears after then 
> next major compaction has run. Suppose you do a delete of everything <= T. 
> After this you do a new put with a timestamp <= T. This put, even if it 
> happened after the delete, will be masked by the delete tombstone. Performing 
> the put will not fail, but when you do a get you will notice the put did have 
> no effect. It will start working again after the major compaction has run. 
> These issues should not be a problem if you use always-increasing versions 
> for new puts to a row. But they can occur even if you do not care about time: 
> just do delete and put immediately after each other, and there is some chance 
> they happen within the same millisecond."

