On Tue, Jan 17, 2012 at 8:56 PM, lars hofhansl <[email protected]> wrote:
> The memstoreTS is used for visibility during an intra-row transaction.
> Are you proposing to do this only if the deletes/puts did not use the
> current time?
>
> The ability to define timestamps for all operations is crucial to HBase.
> o It ensures that HTable.batch works correctly (which reorders Deletes
>   w.r.t. Puts at the Region Server).
> o It ensures that replication works correctly.
> o many other scenarios
>
> If you do not use application defined timestamps, the current time is used
> and everything works as expected.
> If you use application defined timestamps you are asking for a delete to
> be either in the future or the past, and you have to understand what that
> means.
> Maybe we should document the behavior better.
>

I guess I am saying that I *do* understand the current "delete with TS"
behavior, and I find the current implementation unstable and
non-deterministic. Documenting it more thoroughly does not make it less
quirky or more stable. I propose fixing it along the lines suggested in
option B. Karthik seems to agree.

>
> -- Lars
>
>
> ----- Original Message -----
> From: Karthik Ranganathan <[email protected]>
> To: "[email protected]" <[email protected]>; lars hofhansl
> <[email protected]>
> Cc:
> Sent: Tuesday, January 17, 2012 3:27 PM
> Subject: Re: Delete client API.
>
>
> @Srivas - totally agree that B is the correct thing to do.
>
> One way we have talked about implementing this is using the memstore ts.
> Every insert of a KV into the memstore is given a memstore-ts. These are
> persisted only till they are needed (to ensure read atomicity for
> scanners) and then that value is zeroed out on a subsequent compaction
> (saves space). If we retained the memstore-ts even beyond these
> compactions, we could get a deterministic order for the puts and deletes
> (first insert ts < del ts < second insert ts).
>
> Thanks
> Karthik
>
>
> On 1/17/12 2:14 PM, "M. C. Srivas" <[email protected]> wrote:
>
> >On Tue, Jan 17, 2012 at 10:07 AM, lars hofhansl <[email protected]> wrote:
> >
> >> Yeah, it's confusing if one expects it to work like in a relational
> >> database.
> >> You can even do worse. If you by accident place a delete in the future
> >> all current inserts will be hidden until the next major compaction. :)
> >> I got confused about this myself just recently (see my mail on the
> >> dev-list).
> >>
> >>
> >> In the end this is a pretty powerful feature and core to how HBase works
> >> (not saying that it is not confusing, though).
> >>
> >>
> >> If one keeps the following two points in mind it makes more sense:
> >> 1. Delete just sets a tombstone marker at a specific TS (marking
> >>    everything older as deleted).
> >> 2. Everything is versioned, if no version is specified the current time
> >>    (at the regionserver) is used.
> >>
> >> In your example1 below t3 > 6, hence the insert is hidden.
> >> In example2 both delete and insert TS are 6, hence the insert is hidden.
> >>
> >
> >Let's consider my example2 for a little longer. Sequence of events:
> >
> >   1. ins val1 with TS=6 set by client
> >   2. del entire row at TS=6 set by client
> >   3. ins val2 with TS=6 set by client
> >   4. read row
> >
> >The row returns nothing even though the insert at step 3 happened after
> >the delete at step 2. (step 2 masks even future inserts)
> >
> >Now, the same sequence with a compaction thrown in the middle:
> >
> >   1. ins val1 with TS=6 set by client
> >   2. del entire row at TS=6 set by client
> >   3. ---- table is compacted -----
> >   4. ins val2 with TS=6 set by client
> >   5. read row
> >
> >The row returns val2. (the delete at step 2 got lost due to compaction).
> >
> >So we have different results depending upon whether an internal
> >re-organization (like a compaction) happened or not. If we want both
> >sequences to behave exactly the same, then we need to first choose what is
> >the proper (and deterministic) behavior.
> >
> >A. if we think that the first sequence is the correct one, then the
> >delete at step 2 needs to be preserved forever.
> >
> >or,
> >
> >B. if we think that the second sequence is the correct behavior (i.e., a
> >read always produces the same results independent of compaction), then the
> >record needs a second "internal TS" field to allow the RS to distinguish
> >the real sequence of events, and not rely upon the TS field which is
> >settable by the client.
> >
> >My opinion:
> >
> >We should do B. It is normal for someone to write code that says "if old
> >exists, delete it; add new". A subsequent read should always reliably
> >return "new".
> >
> >The current way of relying on a client-settable TS field to determine
> >causal order results in quirky behavior, and quirky is not good.
> >
> >
> >
> >> Look at these two examples:
> >>
> >> 1. insert Val1 at real time t1
> >> 2. <del> at real time t2 > t1
> >> 3. insert Val2 at real time t3 > t2
> >>
> >> 1. insert Val1 with TS=1 at real time t1
> >> 2. <del> with TS = 2 at real time t2 > t1
> >> 3. insert Val2 with TS = 3 at real time t3 > t2
> >>
> >>
> >> In both cases Val2 is visible.
> >>
> >> If your code sets its own timestamps, you better know what you're
> >> doing :)
> >>
> >> Note that my examples below are confusing even if you know how deletion
> >> in HBase works.
> >> You have to look at Delete.java to figure out what is happening.
> >> OK, since there were no objections in two days, I will commit my
> >> proposed change in HBASE-5205.
> >>
> >>
> >> -- Lars
> >>
> >> ________________________________
> >> From: M. C. Srivas <[email protected]>
> >> To: [email protected]; lars hofhansl <[email protected]>
> >> Sent: Tuesday, January 17, 2012 8:13 AM
> >> Subject: Re: Delete client API.
> >>
> >>
> >> Delete seems to be confusing in general.
> >> Here are some examples that make
> >> me scratch my head (key is same in all the examples):
> >>
> >> Example1:
> >> ----------------
> >> 1. insert Val3 with TS=3 at real time t1
> >> 2. insert Val5 with TS=5 at real time t2 > t1
> >> 3. <del> at real time t3 > t2
> >> 4. insert Val6 with TS=6 at real time t4 > t3
> >>
> >> What does a read return? (I would expect Val6, since it was done
> >> last).
> >> But depending upon whether compaction happened or not between steps 3
> >> and 4, I get either Val6 or nothing.
> >>
> >> Example 2:
> >> -----------------
> >> 1. insert Val3 with TS=3 at real time t1
> >> 2. insert Val5 with TS=5 at real time t2 > t1
> >> 3. <del> TS=6 at real time t3 > t2
> >> 4. insert Val6 with TS=6 at real time t4 > t3
> >>
> >> Note the difference in step 3 is this time a TS was specified by the
> >> client.
> >>
> >> What does a read return? Again, I expect Val6 to be returned. But
> >> depending upon what's going on, I seem to get either Val5 or Val6.
> >>
> >>
> >> On Sun, Jan 15, 2012 at 7:21 PM, lars hofhansl <[email protected]>
> >> wrote:
> >>
> >> >There are some confusing parts about the Delete client API:
> >> >1. calling deleteFamily removes all prior column or columns markers
> >> >   without checking the TS.
> >> >2. delete{Column|Columns|Family} do not use the timestamp passed to
> >> >   Delete at construction time, but instead default to LATEST_TIMESTAMP.
> >> >
> >> >    Delete d = new Delete(R,T);
> >> >    d.deleteFamily(CF);
> >> >
> >> >Does not do what you expect (won't use T for the family delete, but
> >> >rather the current time).
> >> >
> >> >Neither does
> >> >    d.deleteColumns(CF, C1, T2);
> >> >    d.deleteFamily(CF, T1); // T1 < T2
> >> >
> >> >(the columns marker will be removed)
> >> >
> >> >
> >> >#1 prevents Delete from adding a family marker F for time T1 and a
> >> >column/columns marker for columns of F at T2 even if T2 > T1.
> >> >#2 is just unexpected and different from what Put is doing.
> >> >
> >> >In HBASE-5205 I propose a simple patch to fix this.
> >> >
> >> >Since this is a (slight) API change, please provide feedback.
> >> >
> >> >Thanks.
> >> >
> >> >-- Lars
> >> >
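
For anyone following the thread, the two sequences from Srivas's example2 can be replayed with a small self-contained toy model in Java. This is only a sketch and not HBase code: `ToyRow`, `Entry`, `seq`, `majorCompact`, and `run` are hypothetical names invented for illustration. The option-B rule used here (a tombstone hides only puts that arrived before it, judged by a server-assigned sequence number) is one assumed reading of the "internal TS" / retained memstore-ts idea discussed above, not a description of any actual patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-row store. Not HBase code; every name here is hypothetical. */
class ToyRow {
    /** A put or a row tombstone, tagged with client TS and arrival order. */
    static final class Entry {
        final long ts;            // client-settable timestamp
        final long seq;           // assumed server-assigned "internal TS"
        final boolean tombstone;  // true for a row delete marker
        final String value;       // null for tombstones
        Entry(long ts, long seq, boolean tombstone, String value) {
            this.ts = ts; this.seq = seq; this.tombstone = tombstone; this.value = value;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final boolean useSeq;  // false: TS-only masking, true: assumed option B
    private long nextSeq = 1;

    ToyRow(boolean useSeq) { this.useSeq = useSeq; }

    void put(long ts, String value) { entries.add(new Entry(ts, nextSeq++, false, value)); }
    void deleteRow(long ts) { entries.add(new Entry(ts, nextSeq++, true, null)); }

    /** True if some tombstone currently in the store hides the given put. */
    private boolean masked(Entry put) {
        for (Entry e : entries) {
            if (!e.tombstone || e.ts < put.ts) continue;
            // TS-only: any put with TS <= marker TS is hidden, even a later one.
            // Option B (assumed): the put must also have arrived before the marker.
            if (!useSeq || e.seq > put.seq) return true;
        }
        return false;
    }

    /** Returns the visible value with the highest TS (ties: latest arrival), or null. */
    String read() {
        Entry best = null;
        for (Entry e : entries) {
            if (e.tombstone || masked(e)) continue;
            if (best == null || e.ts > best.ts || (e.ts == best.ts && e.seq > best.seq)) best = e;
        }
        return best == null ? null : best.value;
    }

    /** Drops hidden puts and all tombstones; surviving puts keep their seq. */
    void majorCompact() {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.tombstone && !masked(e)) kept.add(e);
        }
        entries.clear();
        entries.addAll(kept);
    }

    /** Replays example2: ins val1, del row, optional compaction, ins val2, read. */
    static String run(boolean useSeq, boolean compact) {
        ToyRow row = new ToyRow(useSeq);
        row.put(6, "val1");
        row.deleteRow(6);
        if (compact) row.majorCompact();
        row.put(6, "val2");
        return row.read();
    }

    public static void main(String[] args) {
        System.out.println("TS only,  no compaction: " + run(false, false)); // null
        System.out.println("TS only,  compaction:    " + run(false, true));  // val2
        System.out.println("option B, no compaction: " + run(true, false));  // val2
        System.out.println("option B, compaction:    " + run(true, true));   // val2
    }
}
```

With TS-only masking the read depends on whether the compaction ran, which is the quirk described above. With the assumed sequence-number rule both orderings return val2, matching the "if old exists, delete it; add new" expectation.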
