[ 
https://issues.apache.org/jira/browse/HBASE-8721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685655#comment-13685655
 ] 

Feng Honghua commented on HBASE-8721:
-------------------------------------

Thanks guys for your feedback: [~apurtell], [~sershe], [~stack], [~lhofhansl]

I summarize the issues and proposals below:

A). We all agree this IS a bug:
  1> put a kv (timestamp = T0), and flush;
  2> delete that kv using a DeleteColumn type kv with timestamp T0 (or any 
timestamp >= T0), and flush;
  3> a major compaction occurs [or not];
  4> put that kv again (timestamp = T0);
  5> read that kv;

  a) if a major compaction occurs at step 3>, then step 5> returns the put 
written at step 4>;
  b) if no major compaction occurs at step 3>, then step 5> returns nothing.
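
To make the two outcomes concrete, here is a minimal toy model of the scenario 
above. It is only a sketch, not HBase code: the ToyColumn class and 
run_scenario function are hypothetical names, flushes are ignored, and a 
DeleteColumn marker is assumed to mask every version with timestamp <= its own 
until a major compaction drops it.

```python
class ToyColumn:
    """Toy model of one column: puts and DeleteColumn tombstones by timestamp."""

    def __init__(self):
        self.puts = {}            # timestamp -> value
        self.tombstones = set()   # timestamps of DeleteColumn markers

    def put(self, ts, value):
        self.puts[ts] = value

    def delete_column(self, ts):
        # Assumption: the marker masks every version with timestamp <= ts.
        self.tombstones.add(ts)

    def major_compact(self):
        # Drop the masked puts, then drop the tombstones themselves.
        if self.tombstones:
            cutoff = max(self.tombstones)
            self.puts = {ts: v for ts, v in self.puts.items() if ts > cutoff}
            self.tombstones.clear()

    def get(self):
        # Return the newest put that no tombstone masks, or None.
        cutoff = max(self.tombstones, default=None)
        visible = [ts for ts in self.puts if cutoff is None or ts > cutoff]
        return self.puts[max(visible)] if visible else None


def run_scenario(compact_at_step_3):
    t0 = 100
    col = ToyColumn()
    col.put(t0, "v")              # step 1
    col.delete_column(t0)         # step 2
    if compact_at_step_3:
        col.major_compact()       # step 3
    col.put(t0, "v")              # step 4
    return col.get()              # step 5


print(run_scenario(True))   # case a): the put from step 4 is visible
print(run_scenario(False))  # case b): the put from step 4 is masked
```

In this model the same sequence of client operations gives different read 
results depending only on whether the compaction happened to run.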

B). [~stack] proposes keeping all deleted cells. This can be achieved either by 
turning on KeepDeletedCells for the ColumnFamilies or by degrading major 
compaction to minor compaction (I guess you mean the former). But both options 
result in a larger data size than expected.

C). [~lhofhansl] suggests introducing a config for a Table/CF that disallows 
clients from setting timestamps on puts. Because it is a config, clients can 
still create tables/CFs that allow timestamps to be set explicitly, and for 
those tables/CFs the bug in A) still exists.

D). As [~lhofhansl] said, the timestamp is part of the schema. It is visible to 
and can be set by the client, so the client can exploit it for more general 
usage. By 'general' I mean it is not limited to 'time' semantics but serves as 
an ordinary dimension of a cell's coordinate. Such treatment can lead to many 
innovative schema designs that address more complicated real-world problems.

  Facebook uses msg-id as the timestamp in their message search index CF. When 
the timestamp is used as an ordinary dimension of a cell's coordinate, each 
cell naturally has only one 'version' in the app context, and the CF's 
MaxVersions in the HBase context is usually set to the maximum so that it can 
accommodate as many different cells as possible. A client that uses the 
timestamp in this general way takes care of all the subtleties that come with 
this semantic change.
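
As a rough illustration of using the timestamp as an ordinary coordinate, the 
sketch below keys each cell by (row, qualifier, version) and passes a message 
id where a time value would normally go. The row/qualifier layout (user id and 
search word), the sample values, and the function names are assumptions made 
for the example, not Facebook's actual schema or the HBase API.

```python
# Toy cell store: (row, qualifier) -> {version: value}
index = {}


def put(row, qualifier, version, value):
    # 'version' occupies the timestamp slot but carries a message id here.
    index.setdefault((row, qualifier), {})[version] = value


def get_versions(row, qualifier, max_versions=None):
    # Return (version, value) pairs, highest version first, like a
    # multi-version read bounded by MaxVersions.
    cells = sorted(index.get((row, qualifier), {}).items(), reverse=True)
    return cells if max_versions is None else cells[:max_versions]


# Hypothetical data: user "u1" has three messages containing the word "hbase".
put("u1", "hbase", 1001, "offset=3")
put("u1", "hbase", 1007, "offset=12")
put("u1", "hbase", 1004, "offset=0")

print(get_versions("u1", "hbase"))
print(get_versions("u1", "hbase", max_versions=2))
```

Each message id is a distinct cell in the app's view, which is why such a CF 
needs a large MaxVersions: a small limit would silently drop older message ids.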

  For details of Facebook's design, see the book 'HBase: The Definitive Guide' 
(Chapter 9, Advanced Usage, Search Integration, page 374) or the blog post: 
http://www.facebook.com/notes/facebook-engineering/inside-facebook-messages-application-server/10150162742108920

  Disallowing client-set timestamps, or limiting the timestamp to 'time' 
semantics only, would prohibit such innovative usage. As said, a good 
language/platform/product encourages and enables innovative extensions and 
usage beyond the original designer's imagination. We do expect HBase to be such 
a good platform/product, right?

E). [~apurtell] said: "This section of the book describes expected behavior. 
This is not a bug."

  I disagree. That section's title explicitly says these are 'current 
limitations' and explains why in detail. By nature this is not acceptable 
behaviour. It is counter to common sense and counter-intuitive. It seems like 
'expected behaviour' now JUST because it has existed from the very beginning.

F). [~lhofhansl] said: "HBase allows you to set the timestamps to influence the 
logical order in which things (are declared to have) happened. If you do not 
want strange behavior do not date Deletes into the future and Puts into the 
past. Period."

  As the bug in A) shows, strange behaviour occurs even when the Delete and the 
Put are dated with the same timestamp, not only when one is dated into the 
future and the other into the past. (We allow setting the timestamp, and we do 
set it.) We get strange (buggy) behaviour when we "put - delete - put - get" 
that very same KV with that same timestamp. Isn't it weird?

G). [~lhofhansl] said: "If we did not have that as-of-time queries would be 
broken and we would break the idempotent nature of operations in HBase"

  As for the "idempotent nature of operations in HBase", my understanding is 
that a series of Puts (or Deletes) for the same cell (exactly the same 
coordinate:value) eventually produces the same result. But this is expected to 
break when the Puts are interleaved with Deletes (or the Deletes with Puts). In 
my opinion such a break of idempotency is acceptable.
  Even if we don't change the behaviour "Deletes can mask puts that happen 
after the delete", the scenario in A) still breaks idempotency: we put the same 
cell multiple times, but the results can turn out to be different when the puts 
are interleaved with Deletes (together with the effect of major compaction).

H). Since HBase is modeled after BigTable, it makes sense to align the Delete 
behaviour here with BigTable, right?

I). Finally, I think we need to keep an open mind on this issue, rather than 
just suggesting a workaround at the cost of HBase's inherent flexibility.
                
> Deletes can mask puts that happen after the delete
> --------------------------------------------------
>
>                 Key: HBASE-8721
>                 URL: https://issues.apache.org/jira/browse/HBASE-8721
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Feng Honghua
>         Attachments: HBASE-8721-0.94-V0.patch
>
>
> this fix aims for bug mentioned in http://hbase.apache.org/book.html 5.8.2.1:
> "Deletes mask puts, even puts that happened after the delete was entered. 
> Remember that a delete writes a tombstone, which only disappears after then 
> next major compaction has run. Suppose you do a delete of everything <= T. 
> After this you do a new put with a timestamp <= T. This put, even if it 
> happened after the delete, will be masked by the delete tombstone. Performing 
> the put will not fail, but when you do a get you will notice the put did have 
> no effect. It will start working again after the major compaction has run. 
> These issues should not be a problem if you use always-increasing versions 
> for new puts to a row. But they can occur even if you do not care about time: 
> just do delete and put immediately after each other, and there is some chance 
> they happen within the same millisecond."

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira