What's the underlying goal of getting this count of records in a bucket? Do
you just want a live count, or will you eventually be applying additional
filters to the count?

One option might be to use counters [1] to hold these counts, instead of
attempting to compute them on the fly.
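
A minimal sketch of that approach over the HTTP interface, assuming the
counter endpoints introduced in Riak 1.4 and a node listening on
localhost:8098 as in your scripts. The bucket name entry_counts, the key
entries, and the helper function names are all hypothetical. Counters also
need allow_mult enabled on the bucket, which the first helper sets.

```shell
#!/bin/sh
# Sketch only: bucket "entry_counts" and key "entries" are made-up names.
RIAK_URL="${RIAK_URL:-http://localhost:8098}"

# Build the counter URL for a bucket ($1) and counter key ($2).
counter_url() {
  printf '%s/buckets/%s/counters/%s' "$RIAK_URL" "$1" "$2"
}

# Counters require allow_mult=true on the bucket ($1).
counter_enable() {
  curl -s -XPUT "$RIAK_URL/buckets/$1/props" \
    -H 'Content-Type: application/json' \
    -d '{"props":{"allow_mult":true}}'
}

# Add $3 (default 1) to the counter; call this whenever a record is written.
counter_incr() {
  curl -s -XPOST "$(counter_url "$1" "$2")" -d "${3:-1}"
}

# Print the current value of the counter.
counter_get() {
  curl -s "$(counter_url "$1" "$2")"
}
```

You would run `counter_enable entry_counts` once, call
`counter_incr entry_counts entries` on each insert (and pass -1 as the third
argument on each delete), then read the total with
`counter_get entry_counts entries` without scanning the cluster. The count is
only as accurate as the application is disciplined about updating it.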

In direct answer to your question - there's no faster way to make this
happen apart from speeding up the disks and maybe playing around with some
of the MapReduce arguments, like enabling pre-reduce. You're always going to
have to scan the cluster to find keys matching your criteria (at least with
LevelDB).

[1]: http://basho.com/counters-in-riak-1-4/



---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Thu, Aug 1, 2013 at 12:01 PM, Christian Rosnes <
[email protected]> wrote:

>
>
>
> On Wed, Jul 31, 2013 at 9:54 AM, Christian Rosnes <
> [email protected]> wrote:
>
>
>> I have a 4-node Riak 1.4 test cluster on Azure
>> (Large: 4 cores, 7 GB RAM instances).
>>
>>
> I ran 7 slightly different Erlang map-reduce jobs overnight to count the
> 118 million records in the 'entries' bucket. There were no other user
> requests running at the time of testing. Please take the test results with
> a grain of salt, YMMV.
> The scripts used are listed below.
>
> Christian
> @NorSoulx
>
> *Here are the results:*
>
> ----
> Running script *count.all.records.in.bucket.1.sh*
> Counting all records in bucket: entries (Thu Aug  1 09:07:53 UTC 2013)
> [118 553 863]
> real    *201m46.355s*
> user    0m0.199s
> sys     0m0.419s
> Done: Thu Aug  1 12:29:39 UTC 2013
>
> ----
> Running script *count.all.records.in.bucket.2.sh*
> Counting all records in bucket: entries (Wed Jul 31 19:24:40 UTC 2013)
> [118 553 863]
> real    *148m33.854s* (ran this a second time and the result was then *144m*)
> user    0m0.185s
> sys     0m0.423s
> Done: Wed Jul 31 21:53:13 UTC 2013
>
> ----
> Running script *count.all.records.in.bucket.3.sh*
> Counting all records in bucket: entries (Wed Jul 31 21:53:13 UTC 2013)
> [118 553 863]
> real    *129m51.310s*
> user    0m0.136s
> sys     0m0.327s
> Done: Thu Aug  1 00:03:05 UTC 2013
>
> ----
> Running script *count.all.records.in.bucket.4.sh*
> Counting all records in bucket: entries (Thu Aug  1 00:03:05 UTC 2013)
> [118 553 863]
> real    *138m29.816s*
> user    0m0.105s
> sys     0m0.464s
> Done: Thu Aug  1 02:21:35 UTC 2013
>
> ----
> Running script *count.all.records.in.bucket.5.sh*
> Counting all records in bucket: entries (Thu Aug  1 02:21:35 UTC 2013)
> [118 553 863]
> real    *132m10.353s*
> user    0m0.129s
> sys     0m0.337s
> Done: Thu Aug  1 04:33:45 UTC 2013
>
> ----
> Running script *count.all.records.in.bucket.6.sh*
> Counting all records in bucket: entries (Thu Aug  1 04:33:45 UTC 2013)
> [118 553 863]
> real    *137m16.386s*
> user    0m0.122s
> sys     0m0.363s
> Done: Thu Aug  1 06:51:01 UTC 2013
>
> ----
> Running script *count.all.records.in.bucket.7.sh*
> Counting all records in bucket: entries (Thu Aug  1 06:51:01 UTC 2013)
> [118 553 863]
> real    *136m51.149s*
> user    0m0.297s
> sys     0m0.225s
> Done: Thu Aug  1 09:07:53 UTC 2013
>
> =============================
>
> *Scripts:*
>
> count.all.records.in.bucket.1.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":"entries",
>        "query":[{"map":{"language":"erlang",
>                         "module":"riak_mapreduce_utils",
>                         "function":"map_id",
>                         "keep":false}},
>                 {"reduce":{"language":"erlang",
>                            "module":"riak_kv_mapreduce",
>                            "function":"reduce_count_inputs"}}],
>        "timeout": 90000000}'
>
>
> count.all.records.in.bucket.2.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":{
>            "bucket":"entries",
>            "index":"$bucket",
>            "key":"entries"
>        },
>        "query":[{"reduce":{"language":"erlang",
>                            "module":"riak_kv_mapreduce",
>                            "function":"reduce_count_inputs",
>                             "arg":{"reduce_phase_batch_size":1000}
>                           }
>                }],
>        "timeout": 90000000}'
>
>
> count.all.records.in.bucket.3.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":"entries",
>
>       "query":[{"reduce":{"language":"erlang",
>                           "module":"riak_kv_mapreduce",
>                           "function":"reduce_count_inputs",
>                           "arg":{"do_prereduce":true}
>                           }
>               }],
>       "timeout": 90000000}'
>
>
> count.all.records.in.bucket.4.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":"entries",
>
>     "query":[{"reduce":{"language":"erlang",
>                         "module":"riak_kv_mapreduce",
>                         "function":"reduce_count_inputs",
>                         "arg":{"reduce_phase_batch_size":100000,"do_prereduce":true}
>                         }
>             }],
>     "timeout": 90000000}'
>
>
> count.all.records.in.bucket.5.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":{
>            "bucket":"entries",
>            "index":"$bucket",
>            "key":"entries"
>        },
>        "query":[{"reduce":{"language":"erlang",
>                            "module":"riak_kv_mapreduce",
>                            "function":"reduce_count_inputs",
>                            "arg":{"do_prereduce":true}
>                           }
>                }],
>        "timeout": 90000000}'
>
> count.all.records.in.bucket.6.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":{
>            "bucket":"entries",
>            "index":"$bucket",
>            "key":"entries"
>        },
>        "query":[{"reduce":{"language":"erlang",
>                            "module":"riak_kv_mapreduce",
>                            "function":"reduce_count_inputs",
>                             "arg":{"do_prereduce":false}
>                           }
>                }],
>        "timeout": 90000000}'
>
>
> count.all.records.in.bucket.7.sh
>
> --------------------------------
> time curl -XPOST http://localhost:8098/mapred \
>   -H 'Content-Type: application/json' \
>   -d '{"inputs":{
>            "bucket":"entries",
>            "index":"$bucket",
>            "key":"entries"
>        },
>        "query":[{"reduce":{"language":"erlang",
>                            "module":"riak_kv_mapreduce",
>                            "function":"reduce_count_inputs",
>                            "arg":{"reduce_phase_batch_size":10000}
>                           }
>                }],
>        "timeout": 90000000}'
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>