Okay, so here's what I'm thinking now after reading through some of
the M/R docs. Suppose I did this.
1. Create 2 buckets:
   - one for K/V pairs
   - one for changed keys, keyed by a timestamp or bin or something
     (populated in a post-commit hook on the source colo).
2. Replicate both buckets to the remote colo.
3. Use a key filter with M/R to get the keys changed since some time in
   the past.
4. Run the M/R regularly to publish key changes (probably to a rabbit
   queue).
5. Have a local consumer read the key changes, then grab the updated
   values from the first bucket.
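
A rough sketch of what the M/R job with a key filter might look like,
built as the JSON body that would be POSTed to Riak's HTTP /mapred
endpoint. The bucket name "changed_keys" and the helper name are made up
for illustration, and it assumes the keys in the change bucket are
decimal epoch-second strings and that each value is a JSON document
naming the changed key. Nothing here talks to a cluster; it only builds
the job description.

```python
import json


def build_changed_keys_job(change_bucket, since_ts):
    """Describe an M/R job selecting change entries newer than since_ts.

    Assumes change-bucket keys are integer timestamps stored as strings,
    so the key filter converts them before comparing.
    """
    return {
        "inputs": {
            "bucket": change_bucket,
            "key_filters": [
                ["string_to_int"],             # transform: key string -> integer
                ["greater_than", since_ts],    # predicate: keep keys > since_ts
            ],
        },
        "query": [
            # Built-in JavaScript map function that returns each value
            # parsed as JSON (assumes the values are JSON documents).
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
        ],
    }


if __name__ == "__main__":
    # "changed_keys" is a hypothetical bucket name.
    print(json.dumps(build_changed_keys_job("changed_keys", 1333555200), indent=2))
```

The output of a job like this is what would get pushed onto the rabbit
queue in the following step.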
I think this will all work. I'm not totally sure about the key filtering,
but it seems like a second bucket with time-based keys would work best. I
plan to serialize all writes to each bucket, as that is a requirement for
auditing, so a single integer key holding the time the entry was written
will probably work, combined with a key filter doing a simple greater-than.
I can even overlap the time windows to pick up any late additions caused
by backups in replication, since I only keep track of changed keys and
always read the most current value. I guess the timestamp-based bucket
could end up replicating faster than the data bucket, causing data drift.
Hmm, that could be an issue.
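
To make the overlap idea concrete, here is a small in-memory sketch. The
function names, the window sizes, and the list of (timestamp, key) tuples
standing in for the change bucket are all hypothetical. It only shows
that re-reading the tail of the previous window picks up an entry that
arrived late, and that seeing a key twice is harmless because the
consumer always fetches the current value. It does nothing about the
drift case where the change entry arrives before the value itself.

```python
def poll_windows(start, end, interval, overlap):
    """Yield (lower, upper) bounds; each window reaches back `overlap` units."""
    upper = start
    while upper < end:
        lower = upper - overlap
        upper = min(upper + interval, end)
        yield lower, upper


def changed_keys(change_log, lower, upper):
    """Return the set of keys whose timestamp ts satisfies lower < ts <= upper."""
    return {key for ts, key in change_log if lower < ts <= upper}


if __name__ == "__main__":
    # The entry at ts=8 is imagined to have replicated late, after the
    # first poll already ran; the second window still covers it.
    log = [(3, "user:1"), (8, "user:2"), (12, "user:3")]
    for lower, upper in poll_windows(0, 20, 10, 5):
        print((lower, upper), sorted(changed_keys(log, lower, upper)))
```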
Maybe a secondary index on time might work better. I believe I need
some sort of secondary index, as otherwise iterating over all the entries
in a bucket would be costly. I don't know exact numbers, but I would guess
I'm looking at a worst case of several million K/V pairs per bucket, so
maybe M/R on that isn't so bad. Is there any speedup from combining 2i
with a key filter (can you even create a key filter based on 2i)?
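
For comparison, a sketch of the 2i variant, assuming a secondary index
range can be given directly as the M/R input in place of a key filter.
The index name "modified_int" and the bucket name are hypothetical, and
each object would need to be written with that integer index set to its
modification time. This only builds the job description, and it does not
settle whether key filters and 2i can be combined.

```python
import json


def build_index_range_job(bucket, index, start, end):
    """Describe an M/R job whose inputs come from a 2i range query."""
    return {
        "inputs": {
            "bucket": bucket,
            "index": index,   # hypothetical integer index, e.g. "modified_int"
            "start": start,   # lower bound of the range
            "end": end,       # upper bound of the range
        },
        "query": [
            # Pass the matching bucket/key pairs through unchanged.
            {"reduce": {"language": "erlang",
                        "module": "riak_kv_mapreduce",
                        "function": "reduce_identity"}},
        ],
    }


if __name__ == "__main__":
    job = build_index_range_job("kv_pairs", "modified_int", 1333555200, 1333558800)
    print(json.dumps(job, indent=2))
```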
Anyway, still searching for a way to do this efficiently,
-Anthony
On Wed, Apr 04, 2012 at 09:20:04AM -0700, Anthony Molinaro wrote:
>
> On Wed, Apr 04, 2012 at 08:10:29AM -0600, Jon Meredith wrote:
> > Riak does have a last modified field, but it's last modified by client so
> > is deliberately left untouched on replication. Similarly the vclock is not
> > incremented either (the vclocks/siblings from both sides are resolved using
> > the two vclocks).
>
> That's great, as I'd want to know on the far end when the client modified
> it.
>
> > There are no obvious mechanisms for doing what you want currently. I'll
> > think about options and somebody will get back to you.
>
> Is it not possible to use the last modified field in a Map/Reduce? I've
> not actually played with M/R in Riak yet (as I've only ever used it
> previously as a Key/Value store). I'll try to dig into it a bit today
> but I assumed I could do something to map over all records in a bucket
> checking last modified, and return the set modified since a certain
> time (or better yet put them in a rabbit queue to be consumed by my
> systems which will cache the data).
>
> Alternatively, I could maybe have a second bucket representing the changed
> keys, where each time a key is changed in the primary bucket, I could
> add an entry to the other bucket. I could then replicate that bucket
> and just list keys on the remote side (maybe also deleting so subsequent
> list keys only get changes, but then I think the replicator will replace
> those keys, so I'd have to have some sort of bidirectional replication
> for those buckets, sounds messy).
>
> Anyway, hopefully someone will have an idea,
>
> -Anthony
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro <[email protected]>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
--
------------------------------------------------------------------------
Anthony Molinaro <[email protected]>