I've just inserted some data into a six-node Riak 1.3.1 EE cluster. The
keys are all SHA256s. The bucket previously held somewhere in
the vicinity of 1 million objects.
An MR job using the $key 2i with a range of '0' to 'Z', which should
cover all possible SHA256s, returned a count of around 750K with both
techniques I tried: riak_kv_mapreduce:reduce_count_inputs, and streaming
the keys with reduce_identity and counting them client side. That count
is now somewhat suspect, though.
The objects I inserted overlap somewhat with the previously existing
objects, but not completely. Overlapping objects were merged. I
inserted 2,521,799 objects.
When I execute the MR count job against the bucket now, it reports
1,604,783 objects, again using both techniques (reduce_count_inputs, and
reduce_identity plus client-side counting).
Given the discrepancy, I queried the bucket for each of the 2,521,799
objects I thought I had inserted and verified that the system thinks
they are all there.
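(For reference, the verification was roughly the following; inserted_keys
stands in for my local list of the keys I wrote:)

    # Rough sketch of the verification step. inserted_keys is my local
    # list of the 2,521,799 keys I wrote.
    bucket = client.bucket(bucket_name)
    missing = inserted_keys.reject { |key| bucket.exists?(key) }
    puts "missing: #{missing.length}"  # came back 0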
What gives? Why is MR returning an incorrect result? Does the 2i query
somehow miss some possible keys?
This is what the job looks like in Ruby:
    Riak::MapReduce.new(client).
      index(bucket_name, "$key", '0'..'Z').
      reduce(['riak_kv_mapreduce', 'reduce_count_inputs'],
             :keep => true,
             :arg  => { "reduce_phase_batch_size" => 1000, "do_prereduce" => true }).
      timeout(86400000).
      run
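The reduce_identity variant is essentially the same job, streaming the
results back and counting client side. Roughly (I'm assuming run yields
each chunk of phase output when given a block):

    count = 0
    Riak::MapReduce.new(client).
      index(bucket_name, "$key", '0'..'Z').
      reduce(['riak_kv_mapreduce', 'reduce_identity'], :keep => true).
      timeout(86400000).
      run { |phase, data| count += data.length }  # data is a chunk of [bucket, key] pairs
    puts count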
As a side question, does do_prereduce have any effect here? I'm thinking
it does not. The docs indicate do_prereduce is a map phase argument, not a
reduce phase one. That raises the question of how to enable prereduce for
an MR job without a map phase, other than setting mapred_always_prereduce =
true in the config file.
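In other words, if the job had a map phase, I believe per-job prereduce
would be requested something like this (assuming the Ruby client passes
:arg through as the map phase's static argument; map_identity is just a
pass-through from riak_kv_mapreduce):

    Riak::MapReduce.new(client).
      index(bucket_name, "$key", '0'..'Z').
      map(['riak_kv_mapreduce', 'map_identity'], :arg => { "do_prereduce" => true }).
      reduce(['riak_kv_mapreduce', 'reduce_count_inputs'], :keep => true).
      timeout(86400000).
      run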
Elias