Hi everyone, I am trying to learn about Riak MapReduce and compare it with Hadoop MapReduce. There are some details I am interested in that are not covered in the online documentation, so I hope to get some help here with the following questions. Thanks in advance!
1. For a given MapReduce request (that is, a job), how does Riak decide how many mappers to use? For example, if I have 8 nodes and my data is distributed across all nodes with an "N" value of 2, will I have 4 mappers running on 4 nodes concurrently? Is it possible to have multiple mappers (e.g., 4 or even 6) for the same MR job running on each node, for better processing speed?

2. If I run a MapReduce job over the results of a Riak Search query, how does Riak schedule the mappers based on the search results?

3. How does Riak handle intermediate data generated by mappers? Specifically:
(1) In Hadoop MapReduce, the output of the mappers is <key, value> pairs, and the output from all mappers is first grouped by key and then handed over to the reducer. Does Riak do a similar grouping of intermediate data?
(2) How are mapper outputs transmitted to the reducer? Does Riak use local disks on the mapper nodes or reducer nodes to store the intermediate data temporarily?

4. According to the document http://docs.basho.com/riak/latest/dev/advanced/mapreduce/#How-Phases-Work , each MR job schedules only one reducer, which runs on the coordinating node. Is there any way to configure an MR job to use multiple reducers?

Best regards,
Xiaoming
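
P.S. To make question 3 (1) concrete, here is a minimal single-process sketch of the Hadoop-style grouping I have in mind. It is only an illustration, not Riak or Hadoop code: the function names (map_words, group_by_key, reduce_counts, run_job) and the word-count example are made up for this post.

```python
from collections import defaultdict


def map_words(document):
    """Mapper (hypothetical): emit one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in document.split()]


def group_by_key(pairs):
    """Grouping step: collect all values emitted under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """Reducer (hypothetical): sum the values collected for one key."""
    return key, sum(values)


def run_job(documents):
    """Run map, group, and reduce in sequence inside one process."""
    intermediate = []
    for document in documents:
        intermediate.extend(map_words(document))
    grouped = group_by_key(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["riak hadoop riak", "hadoop mapreduce"]))
```

In Hadoop the grouping step happens between the map and reduce phases, so each reducer call sees one key together with all of its values. My question is whether Riak does anything equivalent, or whether the reduce function simply receives a flat list of mapper outputs.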
