@Srivas - totally agree that B is the correct thing to do.

One way we have talked about implementing this is using the memstore-ts.
Every insert of a KV into the memstore is given a memstore-ts. These are
persisted only till they are needed (to ensure read atomicity for
scanners) and then that value is zeroed out on a subsequent compaction
(saves space). If we retained the memstore-ts even beyond these
compactions, we could get a deterministic order for the puts and deletes
(first insert ts < del ts < second insert ts).
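
To make the difference concrete, here is a minimal toy sketch in plain
Java. It is not HBase code: the class MemstoreTsSketch, the seq field
(standing in for a retained memstore-ts) and the masking rule are all
assumptions made up for illustration. It replays the ins/del/ins sequence
from Srivas's mail on a single row, with and without a compaction in the
middle, first using only the client TS and then also using the arrival
order.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one row. Hypothetical illustration only, not HBase internals.
class MemstoreTsSketch {
    /** A put, or a row delete marker when value is null. */
    static final class Cell {
        final long ts;      // client-settable timestamp
        final long seq;     // assumed stand-in for a retained memstore-ts
        final String value; // null marks a delete
        Cell(long ts, long seq, String value) {
            this.ts = ts;
            this.seq = seq;
            this.value = value;
        }
    }

    private final List<Cell> cells = new ArrayList<>();
    private final boolean useSeq; // false: mask by TS only; true: also require earlier arrival
    private long nextSeq = 1;

    MemstoreTsSketch(boolean useSeq) { this.useSeq = useSeq; }

    void put(long ts, String value) { cells.add(new Cell(ts, nextSeq++, value)); }

    void delete(long ts) { cells.add(new Cell(ts, nextSeq++, null)); }

    /** A put is masked by a delete with TS >= put TS (and, with useSeq, a later seq). */
    private boolean masked(Cell put) {
        for (Cell c : cells) {
            if (c.value != null || put.ts > c.ts) continue; // not a delete, or put is newer by TS
            if (!useSeq || put.seq < c.seq) return true;
        }
        return false;
    }

    /** Toy major compaction: drop masked puts, then drop all delete markers. */
    void compact() {
        List<Cell> kept = new ArrayList<>();
        for (Cell c : cells) {
            if (c.value != null && !masked(c)) kept.add(c);
        }
        cells.clear();
        cells.addAll(kept);
    }

    /** Returns the visible value with the highest (ts, seq), or null if nothing is visible. */
    String read() {
        Cell best = null;
        for (Cell c : cells) {
            if (c.value == null || masked(c)) continue;
            if (best == null || c.ts > best.ts || (c.ts == best.ts && c.seq > best.seq)) best = c;
        }
        return best == null ? null : best.value;
    }

    /** Replays: ins val1 @6, del row @6, optional compaction, ins val2 @6, read. */
    static String run(boolean useSeq, boolean compactInMiddle) {
        MemstoreTsSketch row = new MemstoreTsSketch(useSeq);
        row.put(6, "val1");
        row.delete(6);
        if (compactInMiddle) row.compact();
        row.put(6, "val2");
        return row.read();
    }

    public static void main(String[] args) {
        System.out.println("TS only, no compaction:   " + run(false, false)); // null
        System.out.println("TS only, with compaction: " + run(false, true));  // val2
        System.out.println("TS+seq,  no compaction:   " + run(true, false));  // val2
        System.out.println("TS+seq,  with compaction: " + run(true, true));   // val2
    }
}
```

With TS-only masking the read depends on whether the compaction ran; once
the arrival order is kept, both sequences return val2, which is the
behavior option B asks for. The real change would of course have to deal
with column/family markers, multiple versions and the space cost of
keeping the extra field.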

Thanks
Karthik


On 1/17/12 2:14 PM, "M. C. Srivas" <[email protected]> wrote:

>On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <[email protected]>
>wrote:
>
>> Yeah, it's confusing if one expects it to work like in a relational
>> database.
>> You can even do worse. If you accidentally place a delete in the future,
>> all current inserts will be hidden until the next major compaction. :)
>> I got confused about this myself just recently (see my mail on the
>> dev-list).
>>
>>
>> In the end this is a pretty powerful feature and core to how HBase works
>> (not saying that it is not confusing, though).
>>
>>
>> If one keeps the following two points in mind it makes more sense:
>> 1. Delete just sets a tombstone marker at a specific TS (marking
>> everything at or before that TS as deleted).
>> 2. Everything is versioned; if no version is specified, the current time
>> (at the regionserver) is used.
>>
>> In your example1 below t3 > 6, hence the insert is hidden.
>> In example2 both delete and insert TS are 6, hence the insert is hidden.
>>
>
>Let's consider my example2 for a little longer. Sequence of events:
>
>   1.  ins  val1  with TS=6 set by client
>   2.  del  entire row at TS=6 set by client
>   3.  ins  val2  with TS=6  set by client
>   4.  read row
>
>The row returns nothing even though the insert at step 3 happened after the
>delete at step 2. (step 2 masks even future inserts)
>
>Now, the same sequence with a compaction thrown in the middle:
>
>   1.  ins  val1  with TS=6 set by client
>   2.  del  entire row at TS=6 set by client
>   3.  ---- table is compacted -----
>   4.  ins  val2  with TS=6  set by client
>   5.  read row
>
>The row returns val2.  (the delete at step 2 got lost due to compaction).
>
>So we have different results depending upon whether an internal
>re-organization (like a compaction) happened or not. If we want both
>sequences to behave exactly the same, then we need to first choose what is
>the proper (and deterministic) behavior.
>
>A.  if we think that the first sequence is the correct one, then the delete
>at step 2 needs to be preserved forever.
>
>or,
>
>B. if we think that the second sequence is the correct behavior (i.e., a
>read always produces the same results independent of compaction), then the
>record needs a second "internal TS" field to allow the RS to distinguish
>the real sequence of events, and not rely upon the TS field which is
>settable by the client.
>
>My opinion:
>
>We should do B.  It is normal for someone to write code that says  "if old
>exists, delete it;  add new". A subsequent read should always reliably
>return "new".
>
>The current way of relying on a client-settable TS field to determine
>causal order results in quirky behavior, and quirky is not good.
>
>
>
>> Look at these two examples:
>>
>> 1. insert Val1  at real time t1
>> 2. <del>  at real time t2 > t1
>> 3. insert  Val2 at real time  t3 > t2
>>
>> 1. insert Val1  with TS=1 at real time t1
>> 2. <del>  with TS = 2 at real time t2 > t1
>>
>> 3. insert  Val2 with TS = 3 at real time  t3 > t2
>>
>>
>> In both cases Val2 is visible.
>>
>> If your code sets its own timestamps, you'd better know what you're
>> doing :)
>>
>> Note that my examples below are confusing even if you know how deletion
>> in HBase works.
>> You have to look at Delete.java to figure out what is happening.
>> OK, since there were no objections in two days, I will commit my
>> proposed change in HBASE-5205.
>>
>>
>> -- Lars
>>
>> ________________________________
>> From: M. C. Srivas <[email protected]>
>> To: [email protected]; lars hofhansl <[email protected]>
>> Sent: Tuesday, January 17, 2012 8:13 AM
>> Subject: Re: Delete client API.
>>
>>
>> Delete seems to be confusing in general. Here are some examples that make
>> me scratch my head (key is the same in all the examples):
>>
>> Example1:
>> ----------------
>> 1. insert Val3  with TS=3  at real time t1
>> 2. insert Val5  with TS=5  at real time t2 > t1
>> 3. <del>    at real time t3 > t2
>> 4. insert  Val6  with TS=6  at real time  t4 > t3
>>
>> What does a read return?  (I would expect Val6, since it was done last.)
>> But depending upon whether compaction happened or not between steps 3 and
>> 4, I get either Val6 or nothing.
>>
>> Example 2:
>> -----------------
>> 1. insert Val3  with TS=3  at real time t1
>> 2. insert Val5  with TS=5  at real time t2 > t1
>> 3. <del>  TS=6  at real time t3 > t2
>> 4. insert  Val6  with TS=6  at real time  t4 > t3
>>
>> Note the difference: in step 3, this time a TS was specified by the
>> client.
>>
>> What does a read return?  Again, I expect Val6 to be returned. But
>> depending upon what's going on, I seem to get either Val5 or Val6.
>>
>>
>>
>>
>>
>> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <[email protected]>
>> wrote:
>>
>> There are some confusing parts about the Delete client API:
>> >1. calling deleteFamily removes all prior column or columns markers
>> without checking the TS.
>> >2. delete{Column|Columns|Family} do not use the timestamp passed to
>> Delete at construction time, but instead default to LATEST_TIMESTAMP.
>> >
>> >  Delete d = new Delete(R,T);
>> >  d.deleteFamily(CF);
>> >
>> >Does not do what you expect (won't use T for the family delete, but
>> rather the current time).
>> >
>> >Neither does
>> >  d.deleteColumns(CF, C1, T2);
>> >  d.deleteFamily(CF, T1); // T1 < T2
>> >
>> >
>> >(the columns marker will be removed)
>> >
>> >
>> >#1 prevents Delete from adding a family marker F for time T1 and a
>> column/columns marker for columns of F at T2 even if T2 > T1.
>> >#2 is just unexpected and different from what Put is doing.
>> >
>> >In HBASE-5205 I propose a simple patch to fix this.
>> >
>> >Since this is a (slight) API change, please provide feedback.
>> >
>> >Thanks.
>> >
>> >-- Lars
>> >
>> >
>>
