Hi,
For Riak Mobile, I occasionally need to use a M/R job to scan all key/values in
a bucket and compute a content-hash for each object in a bucket. This works
fine, but ...
I'd like to be able to do a "consistent map/reduce" job i.e., with "R=2
semantics" for an "N=3 bucket". Maybe other people have the same need, but I
can't see if this is possible ... perhaps with the new riak_pipe infrastructure?
This is my idea:
The map function yields {Key, [{VectorClock,1,Hash}]} for each replica, but
needs to run on *all* replicas of objects in a given Bucket. Hash is the real
value I'm interested in i.e., the content-hash for the object; but it could be
some other "map" function output.
Then, the reduce phase needs to "merge" a list of {VectorClock,N,Hash} tuples,
by considering the VectorClocks to determine if results are in "conflict", or
if one is before/after the other. N is reduced to the sum of all elements with
equal Hash value.
For each output of the reduce phase I'll then have, for each key, a list of
{VC,N,Hash}. If one of those N values are >= quorum, then I have a consistent
output value (Hash).
Questions:
- How can I have a M/R job run on *all* vnodes? Not just for objects that are
owned by a primary?
- The M/R "input" is essentially listkeys(Bucket) ... can this be done using
"async keylisting", so that the operation does not hold up the vnode while
listing?
If someone can sketch a solution, I'd be happy to go hacking on it ...
Kresten
Mobile: + 45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
Trifork A/S | Margrethepladsen 4 | DK- 8000 Aarhus C | Phone : +45 8732
8787 | www.trifork.com
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com