That code was written to handle the pathological case where the same
timestamp would be in memstore and hfile and we'd get a bad read and
base counter updates on the wrong value.  That is we'd have 2 KVs:
KV1 ts=1  value=1500
KV2 ts=1  value=500

and we'd base the increment on KV2, thus 'losing' 1000 increments.
With seqid we can determine that KV1 > KV2 even though ts=1 for both
and resolve without doing the Wrong Thing.  That we don't use the same
KV is because you don't want to mutate a KV that someone else in a
different thread (log flushing) might depend on.  Unfortunately
updating timestamp which is a long isnt atomic in KV, since you are
updating 8 byte fields one at a time.

Increment has always been atomic, and making multi increment atomic
will be a little trickier, the pseudo code is:
- beginMemstoreInsert(seqid);
- store.memstore.add(kv);
- commitMemstoreInsert(seqid);
- store.memstore.remove_versions(kv);

the last being necessary to reduce version bloat.


On Wed, Jan 12, 2011 at 12:14 PM, Stack <[email protected]> wrote:
> On Wed, Jan 12, 2011 at 12:02 PM, Jonathan Gray <[email protected]> wrote:
>> I think there are some hacky fixes in there to prevent duplicate timestamps.
>>
>
> Ok.
>
>> With the new seqid, this should not be an issue.  I think we should then be 
>> able to generate a single now() value and use it everywhere (MemStore, 
>> HLog).  Duplicate timestamps won't be a problem as the seqid ordering will 
>> take care of it and we can still use the same logic to cleanup old values.
>>
>
> Right.
>
> OK if I strip back all I regard as 'workaround' code as part of my
> first cut at 2856?
>
>> A separate issue is how "atomic" we want counters to be.  Currently the 
>> multi-column Increment operation does not utilize any of the RWCC stuff.  I 
>> guess here we should just use seqid now?  Is it one seqid per KV or per "row 
>> transaction"?
>>
>
> I chatted w/ Ryan on this yesterday.  For single increment, it uses
> RWCC.  We'd need to change the ordering of RWCC updates a little so
> that the increment was immediately available (previous we had a
> memstoreTs set to 0 as a means for doing this -- this goes away now).
> I'll have to look at the multi-increment.  Sounds like it should go
> under RWCC controls.
>
> Currently planning one sequence number per batch of edits, per write to WAL.
>
> St.Ack
>

Reply via email to