@Srivas - totally agree that B is the correct thing to do. One way we have talked about implementing this is to use the memstore-ts. Every KV inserted into the memstore is given a memstore-ts. These are persisted only as long as they are needed (to ensure read atomicity for scanners), and the value is then zeroed out on a subsequent compaction to save space. If we retained the memstore-ts beyond these compactions, we could get a deterministic order for the puts and deletes (first insert memstore-ts < delete memstore-ts < second insert memstore-ts).
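To make the ordering idea concrete, here is a minimal sketch in plain Java. It is a toy model, not HBase code: the class `MemstoreOrderSketch`, its `seq` field (standing in for a retained memstore-ts), and the masking rule are all assumptions made for illustration. In this model a put is hidden only by a delete whose TS covers it and whose `seq` is larger, i.e. a delete that arrived later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of one row/column; none of these names exist in HBase.
final class MemstoreOrderSketch {
    enum Type { PUT, DELETE }

    static final class Cell {
        final Type type;
        final long ts;      // client-settable timestamp
        final long seq;     // assumed stand-in for a retained memstore-ts
        final String value; // null for delete markers

        Cell(Type type, long ts, long seq, String value) {
            this.type = type;
            this.ts = ts;
            this.seq = seq;
            this.value = value;
        }
    }

    private final List<Cell> cells = new ArrayList<>();
    private long nextSeq = 1; // assigned in arrival order

    void put(long ts, String value) {
        cells.add(new Cell(Type.PUT, ts, nextSeq++, value));
    }

    void delete(long ts) {
        cells.add(new Cell(Type.DELETE, ts, nextSeq++, null));
    }

    // A put is masked only by a delete that covers its TS AND arrived after it.
    private boolean masked(Cell put) {
        for (Cell d : cells) {
            if (d.type == Type.DELETE && d.ts >= put.ts && d.seq > put.seq) {
                return true;
            }
        }
        return false;
    }

    // Returns the newest visible value (by ts, then by seq), or null if none.
    String read() {
        Cell best = null;
        for (Cell c : cells) {
            if (c.type != Type.PUT || masked(c)) {
                continue;
            }
            if (best == null || c.ts > best.ts
                    || (c.ts == best.ts && c.seq > best.seq)) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        MemstoreOrderSketch row = new MemstoreOrderSketch();
        row.put(6, "val1");
        row.delete(6);
        row.put(6, "val2");
        System.out.println(row.read()); // prints val2
    }
}
```

Because `seq` never changes after arrival, the result does not depend on whether the delete marker has been compacted away, provided the cells it masked are dropped along with it.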
Thanks
Karthik

On 1/17/12 2:14 PM, "M. C. Srivas" <[email protected]> wrote:

>On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <[email protected]>
>wrote:
>
>> Yeah, it's confusing if one expects it to work like in a relational
>> database.
>> You can even do worse. If you by accident place a delete in the future,
>> all current inserts will be hidden until the next major compaction. :)
>> I got confused about this myself just recently (see my mail on the
>> dev-list).
>>
>> In the end this is a pretty powerful feature and core to how HBase works
>> (not saying that it is not confusing, though).
>>
>> If one keeps the following two points in mind it makes more sense:
>> 1. Delete just sets a tombstone marker at a specific TS (marking
>>    everything older as deleted).
>> 2. Everything is versioned; if no version is specified the current time
>>    (at the regionserver) is used.
>>
>> In your example1 below t3 > 6, hence the insert is hidden.
>> In example2 both delete and insert TS are 6, hence the insert is hidden.
>>
>
>Let's consider my example2 for a little longer. Sequence of events:
>
> 1. ins val1 with TS=6 set by client
> 2. del entire row at TS=6 set by client
> 3. ins val2 with TS=6 set by client
> 4. read row
>
>The row returns nothing even though the insert at step 3 happened after
>the delete at step 2. (step 2 masks even future inserts)
>
>Now, the same sequence with a compaction thrown in the middle:
>
> 1. ins val1 with TS=6 set by client
> 2. del entire row at TS=6 set by client
> 3. ---- table is compacted -----
> 4. ins val2 with TS=6 set by client
> 5. read row
>
>The row returns val2. (the delete at step 2 got lost due to compaction).
>
>So we have different results depending upon whether an internal
>re-organization (like a compaction) happened or not. If we want both
>sequences to behave exactly the same, then we need to first choose what is
>the proper (and deterministic) behavior.
>
>A. if we think that the first sequence is the correct one, then the
>delete at step 2 needs to be preserved forever.
>
>or,
>
>B. if we think that the second sequence is the correct behavior (ie, a
>read always produces the same results independent of compaction), then the
>record needs a second "internal TS" field to allow the RS to distinguish
>the real sequence of events, and not rely upon the TS field which is
>settable by the client.
>
>My opinion:
>
>We should do B. It is normal for someone to write code that says "if old
>exists, delete it; add new". A subsequent read should always reliably
>return "new".
>
>The current way of relying on a client-settable TS field to determine
>causal order results in quirky behavior, and quirky is not good.
>
>
>> Look at these two examples:
>>
>> 1. insert Val1 at real time t1
>> 2. <del> at real time t2 > t1
>> 3. insert Val2 at real time t3 > t2
>>
>> 1. insert Val1 with TS=1 at real time t1
>> 2. <del> with TS = 2 at real time t2 > t1
>> 3. insert Val2 with TS = 3 at real time t3 > t2
>>
>> In both cases Val2 is visible.
>>
>> If your code sets your own timestamps, you better know what you're
>> doing :)
>>
>> Note that my examples below are confusing even if you know how deletion
>> in HBase works.
>> You have to look at Delete.java to figure out what is happening.
>> OK, since there were no objections in two days, I will commit my
>> proposed change in HBASE-5205.
>>
>> -- Lars
>>
>> ________________________________
>> From: M. C. Srivas <[email protected]>
>> To: [email protected]; lars hofhansl <[email protected]>
>> Sent: Tuesday, January 17, 2012 8:13 AM
>> Subject: Re: Delete client API.
>>
>> Delete seems to be confusing in general. Here are some examples that
>> make me scratch my head (key is same in all the examples):
>>
>> Example1:
>> ----------------
>> 1. insert Val3 with TS=3 at real time t1
>> 2. insert Val5 with TS=5 at real time t2 > t1
>> 3. <del> at real time t3 > t2
>> 4. insert Val6 with TS=6 at real time t4 > t3
>>
>> What does a read return? (I would expect Val6, since it was done
>> last).
>> But depending upon whether compaction happened or not between steps 3
>> and 4, I get either Val6 or nothing.
>>
>> Example 2:
>> -----------------
>> 1. insert Val3 with TS=3 at real time t1
>> 2. insert Val5 with TS=5 at real time t2 > t1
>> 3. <del> TS=6 at real time t3 > t2
>> 4. insert Val6 with TS=6 at real time t4 > t3
>>
>> Note the difference in step 3 is this time a TS was specified by the
>> client.
>>
>> What does a read return? Again, I expect Val6 to be returned. But
>> depending upon what's going on, I seem to get either Val5 or Val6.
>>
>>
>> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <[email protected]>
>> wrote:
>>
>> >There are some confusing parts about the Delete client API:
>> >1. calling deleteFamily removes all prior column or columns markers
>> >   without checking the TS.
>> >2. delete{Column|Columns|Family} do not use the timestamp passed to
>> >   Delete at construction time, but instead default to LATEST_TIMESTAMP.
>> >
>> > Delete d = new Delete(R,T);
>> > d.deleteFamily(CF);
>> >
>> >Does not do what you expect (won't use T for the family delete, but
>> >rather the current time).
>> >
>> >Neither does
>> > d.deleteColumns(CF, C1, T2);
>> > d.deleteFamily(CF, T1); // T1 < T2
>> >
>> >(the columns marker will be removed)
>> >
>> >#1 prevents Delete from adding a family marker F for time T1 and a
>> >column/columns marker for columns of F at T2 even if T2 > T1.
>> >#2 is just unexpected and different from what Put is doing.
>> >
>> >In HBASE-5205 I propose a simple patch to fix this.
>> >
>> >Since this is a (slight) API change, please provide feedback.
>> >
>> >Thanks.
>> >
>> >-- Lars
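
For reference, the two TS=6 sequences discussed in this thread (with and without a compaction in the middle) can be reproduced with a toy model. This is a sketch in plain Java, not HBase code: `TombstoneSketch`, `majorCompact` and `run` are hypothetical names, and the model assumes only what the thread describes, namely that a delete marker hides every put with a TS at or below its own regardless of arrival order, and that a major compaction drops both the hidden puts and the markers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of one row/column; none of these names exist in HBase.
final class TombstoneSketch {
    private static final class Put {
        final long ts;
        final String value;

        Put(long ts, String value) {
            this.ts = ts;
            this.value = value;
        }
    }

    private final List<Put> puts = new ArrayList<>();
    private final List<Long> deleteTs = new ArrayList<>();

    void put(long ts, String value) {
        puts.add(new Put(ts, value));
    }

    void delete(long ts) {
        deleteTs.add(ts);
    }

    // A marker hides every put at or below its TS, whenever the put arrived.
    private boolean masked(Put p) {
        for (long d : deleteTs) {
            if (p.ts <= d) {
                return true;
            }
        }
        return false;
    }

    // Assumed model of a major compaction: drop hidden puts, then the markers.
    void majorCompact() {
        puts.removeIf(this::masked);
        deleteTs.clear();
    }

    // Returns the visible value with the highest TS (latest arrival wins ties).
    String read() {
        Put best = null;
        for (Put p : puts) {
            if (!masked(p) && (best == null || p.ts >= best.ts)) {
                best = p;
            }
        }
        return best == null ? null : best.value;
    }

    // Replays the thread's sequence: ins val1, del, [compact], ins val2, read.
    static String run(boolean compactInMiddle) {
        TombstoneSketch row = new TombstoneSketch();
        row.put(6, "val1");
        row.delete(6);
        if (compactInMiddle) {
            row.majorCompact();
        }
        row.put(6, "val2");
        return row.read();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints null
        System.out.println(run(true));  // prints val2
    }
}
```

The only difference between the two runs is the internal compaction, which is exactly the nondeterminism that options A and B above are meant to remove.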
