Thanks for the insight into this.

---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop

On Thu, Feb 14, 2013 at 4:40 AM, Christian Dahlqvist <[email protected]> wrote:

> Hi OJ,
>
> The do_prereduce parameter makes it possible to have the first iteration
> of the reduce phase execute where the preceding map phase generated output.
> This can, as in the example I provided, be used to reduce the amount of
> data that needs to be sent across the network. This is described in greater
> detail here:
> http://docs.basho.com/riak/latest/references/appendices/MapReduce-Implementation/
>
> As it is possible to set it to be enabled by default in the app.config, it
> should be fine to always specify it for reduce phases preceded by a map
> phase.
>
> Best regards,
>
> Christian
>
>
> On 14 Feb 2013, at 12:21, OJ Reeves <[email protected]> wrote:
>
> Chris,
>
> I've never heard of do_prereduce before. What kind of effect does this
> have? That is, if someone were to use it all the time, regardless of the
> amount of data being returned, would this be a bad thing?
>
> Thanks.
> OJ
>
> On Thu, Feb 14, 2013 at 6:19 PM, Christian Dahlqvist
> <[email protected]> wrote:
>
>> Hi,
>>
>> For buckets with a significant number of records, it makes a lot of sense
>> to run the example I provided with 'do_prereduce' enabled as it will result
>> in considerably less data being sent between the nodes. This can be enabled
>> as follows:
>>
>> curl -XPOST http://localhost:8098/mapred \
>>   -H 'Content-Type: application/json' \
>>   -d '{"inputs":{
>>          "bucket":"goog",
>>          "index":"$bucket",
>>          "key":"goog"
>>        },
>>        "query":[{"reduce":{"language":"erlang",
>>                            "module":"riak_kv_mapreduce",
>>                            "function":"reduce_count_inputs",
>>                            "arg":{"do_prereduce":true}}}]}'
>>
>> Best regards,
>>
>> Christian
>>
>>
>> On 14 Feb 2013, at 08:01, Christian Dahlqvist <[email protected]>
>> wrote:
>>
>> Hi Jeremiah,
>>
>> It does indeed not seem to be documented on the main docs site, and I
>> will try to correct this. The only place I have found it described is on
>> the wiki for the Ruby client (
>> https://github.com/basho/riak-ruby-client/wiki/Secondary-Indexes).
>>
>> Below is also an example of a simple mapreduce job that shows how to
>> count the number of records in the 'goog' bucket based on the $bucket
>> secondary index:
>>
>> curl -XPOST http://localhost:8098/mapred \
>>   -H 'Content-Type: application/json' \
>>   -d '{"inputs":{
>>          "bucket":"goog",
>>          "index":"$bucket",
>>          "key":"goog"
>>        },
>>        "query":[{"reduce":{"language":"erlang",
>>                            "module":"riak_kv_mapreduce",
>>                            "function":"reduce_count_inputs"}}]}'
>>
>> I hope this helps.
>>
>> Best regards,
>>
>> Christian
>>
>>
>> On 13 Feb 2013, at 18:12, Jeremiah Peschka <[email protected]>
>> wrote:
>>
>> Is this documented anywhere on the docs.basho.com site?
>>
>> Searching for $bucket produces search results just for "bucket" and
>> Google says "No results found for *site:docs.basho.com $bucket*."
>>
>> ---
>> Jeremiah Peschka - Founder, Brent Ozar Unlimited
>> MCITP: SQL Server 2008, MVP
>> Cloudera Certified Developer for Apache Hadoop
>>
>>
>> On Wed, Feb 13, 2013 at 10:08 AM, Christian Dahlqvist <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> In addition to the $key index, there is also a $bucket index available
>>> by default. This contains the name of the bucket, and can be used to get
>>> all keys in a specific bucket.
>>>
>>> Best regards,
>>>
>>> Christian
>>>
>
> --
>
> OJ Reeves
> +61 431 952 586
> http://buffered.io/
>
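
For anyone scripting this rather than calling curl by hand, here is a minimal Python sketch of the counting job quoted above, with do_prereduce as an optional switch. The endpoint, bucket name, and JSON fields are taken from the curl examples in this thread; the helper names build_count_job and run_job are hypothetical, invented for illustration only, and run_job assumes a Riak node with secondary indexes enabled is listening on localhost:8098.

```python
import json
from urllib import request

# Assumed endpoint, copied from the curl examples in this thread.
MAPRED_URL = "http://localhost:8098/mapred"


def build_count_job(bucket, prereduce=False):
    """Build the job body that counts keys through the $bucket index.

    Hypothetical helper: it only mirrors the JSON from the curl examples.
    """
    reduce_phase = {
        "language": "erlang",
        "module": "riak_kv_mapreduce",
        "function": "reduce_count_inputs",
    }
    if prereduce:
        # Same phase argument as in the second curl example.
        reduce_phase["arg"] = {"do_prereduce": True}
    return {
        "inputs": {"bucket": bucket, "index": "$bucket", "key": bucket},
        "query": [{"reduce": reduce_phase}],
    }


def run_job(job, url=MAPRED_URL):
    """POST the job to the /mapred endpoint and return the decoded JSON reply."""
    req = request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the request body only; call run_job(...) against a live node.
    print(json.dumps(build_count_job("goog", prereduce=True), indent=2))
```

Building the body separately from sending it makes it easy to compare the payload with and without do_prereduce before pointing it at a cluster.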
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
