Hello,

We store the following in our Riak cluster:
- feeds, as a list of 10 keys to entries. Every entry key has the form feedKey-entryKey.
- entries, as complex JSON objects.
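For concreteness, here is a minimal sketch of the two kinds of objects (the field names and sample values are illustrative assumptions, not our actual schema):

```python
# A feed object: a list of (up to) 10 keys pointing at entry objects.
# Every entry key has the form "feedKey-entryKey".
feed = {
    "entries": [
        "feed42-entry001",
        "feed42-entry002",
        # ... up to 10 keys
    ],
}

# An entry object: a complex JSON document (fields here are made up).
entry = {
    "id": "feed42-entry001",
    "title": "some title",
    "body": "some body",
}
```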
We try to avoid losing track of any entryKey by deleting it from the feed object only once the corresponding entry object has been deleted. However, due to a bug in our implementation, we have "lost" some entries: some feedKey-entryKey objects are not referenced by any feed object. We are now trying to find the best way to clean up that mess :)

Our initial solution was to list all the feed keys and then, for each one, issue a MapReduce job listing all entries whose key starts with feedKey. We can then compare the expected list of entryKeys (stored in the feed object) with the actual list of feedKey-* objects and delete the extras. In practice, that would mean about 500,000 MapReduce jobs. We suspect that is not the right solution: with each MapReduce job taking about 10 seconds, it could literally take weeks to complete.

We are now wondering whether there is a better way. Perhaps a single MapReduce job that iterates over all the entry keys and only keeps track of the feedKeys that have more than 10 elements? That would probably cut down the number of per-feed MapReduce jobs very significantly, since we would only run them on the few feedKeys (maybe 1%?) that have "lost" entries.

Is there a better way? Any ideas?

Thanks
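To illustrate the single-pass idea, here is a sketch in plain Python rather than an actual Riak MapReduce job (the function names and the assumption that feedKey contains no '-' are ours): one pass groups every entry key by its feedKey prefix and flags only the feeds holding more than the expected 10 entries, and a second helper shows the per-feed cleanup as a simple set difference.

```python
from collections import Counter

EXPECTED_ENTRIES_PER_FEED = 10  # from our data model: 10 entry keys per feed

def suspect_feeds(entry_keys):
    """One pass over all entry keys: count entries per feedKey prefix
    and return only the feeds holding more than the expected 10.
    Assumes feedKey itself contains no '-' separator."""
    counts = Counter(key.split("-", 1)[0] for key in entry_keys)
    return {feed for feed, n in counts.items() if n > EXPECTED_ENTRIES_PER_FEED}

def orphaned_entries(expected_entry_keys, actual_entry_keys):
    """Per-feed cleanup: the extra feedKey-entryKey objects are the
    set difference between what is stored and what the feed references."""
    return set(actual_entry_keys) - set(expected_entry_keys)
```

For example, with one feed holding 12 entries and another holding exactly 10, only the first is flagged, so the expensive per-feed comparison runs on roughly the 1% of feeds that actually have orphans:

```python
keys = ["feedA-e%02d" % i for i in range(12)] + ["feedB-e%02d" % i for i in range(10)]
suspect_feeds(keys)  # -> {"feedA"}
```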
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
