That code was written to handle the pathological case where the same timestamp would be in memstore and hfile and we'd get a bad read and base counter updates on the wrong value. That is we'd have 2 KVs: KV1 ts=1 value=1500 KV2 ts=1 value=500
and we'd base the increment on KV2, thus 'losing' 1000 increments. With seqid we can determine that KV1 > KV2 even though ts=1 for both and resolve without doing the Wrong Thing. That we don't use the same KV is because you don't want to mutate a KV that someone else in a different thread (log flushing) might depend on. Unfortunately updating timestamp which is a long isnt atomic in KV, since you are updating 8 byte fields one at a time. Increment has always been atomic, and making multi increment atomic will be a little trickier, the pseudo code is: - beginMemstoreInsert(seqid); - store.memstore.add(kv); - commitMemstoreInsert(seqid); - store.memstore.remove_versions(kv); the last being necessary to reduce version bloat. On Wed, Jan 12, 2011 at 12:14 PM, Stack <[email protected]> wrote: > On Wed, Jan 12, 2011 at 12:02 PM, Jonathan Gray <[email protected]> wrote: >> I think there are some hacky fixes in there to prevent duplicate timestamps. >> > > Ok. > >> With the new seqid, this should not be an issue. I think we should then be >> able to generate a single now() value and use it everywhere (MemStore, >> HLog). Duplicate timestamps won't be a problem as the seqid ordering will >> take care of it and we can still use the same logic to cleanup old values. >> > > Right. > > OK if I strip back all I regard as 'workaround' code as part of my > first cut at 2856? > >> A separate issue is how "atomic" we want counters to be. Currently the >> multi-column Increment operation does not utilize any of the RWCC stuff. I >> guess here we should just use seqid now? Is it one seqid per KV or per "row >> transaction"? >> > > I chatted w/ Ryan on this yesterday. For single increment, it uses > RWCC. We'd need to change the ordering of RWCC updates a little so > that the increment was immediately available (previous we had a > memstoreTs set to 0 as a means for doing this -- this goes away now). > I'll have to look at the multi-increment. Sounds like it should go > under RWCC controls. > > Currently planning one sequence number per batch of edits, per write to WAL. > > St.Ack >
