All,
Just copying this from my stackoverflow post, as the riak tag doesn't get much
love over there :) It's fine to just outright say that Riak is never going to
work efficiently in this case because I'm inherently depending on MapReduce.
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire
bucket and that there are other ways of achieving your goals. I'm not sure how,
though. I'm also not clear on why using an entire bucket is slow, if you only
have one bucket in the entire system, so either way, you need to go over all
the entries. Passing in a list of key-bucket pairs has the same effect. Maybe
the rule should be "don't use MapReduce with more than a handful of keys".
Which makes me wonder (apart from link traversal) what use it really has in the
real world.
I have a list of 500K+ documents that represent sales data. I need to view this
data in different ways: for example, how much revenue was made in each month
the business was operating? How much revenue did each product raise? How many
of each product were sold in a given month? I always thought MapReduce was
supposed to be good at solving these types of aggregate problems. That sounds
like a myth now though, unless we're just looking at Hadoop.
My documents are all in a bucket named 'sales' and they are records with the
following fields (as native Erlang records, not JSON):
{"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00",
"created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each
product in each month over the past 4 years (that's basically the entire
bucket, but that's just the requirement), how does one use Riak's MapReduce to
do that efficiently? Even just trying to use an identity map operation on the
data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language
is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
Even with secondary indexes, there's still a MapReduce involved in this query,
once you have the list of keys to process.
(Ordering not important right now).
Cheers,
Chris_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com