Using Riak to perform aggregate queries

Chris Corbyn Sun, 14 Apr 2013 16:49:05 -0700

All,

Just copying this from my stackoverflow post, as the riak tag doesn't get much 
love over there :) It's fine to just outright say that Riak is never going to 
work efficiently in this case because I'm inherently depending on MapReduce.


Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire 
bucket and that there are other ways of achieving your goals. I'm not sure how, 
though. I'm also not clear on why using an entire bucket is slow, if you only 
have one bucket in the entire system, so either way, you need to go over all 
the entries. Passing in a list of key-bucket pairs has the same effect. Maybe 
the rule should be "don't use MapReduce with more than a handful of keys". 
Which makes me wonder (apart from link traversal) what use it really has in the 
real world.

I have a list of 500K+ documents that represent sales data. I need to view this 
data in different ways: for example, how much revenue was made in each month 
the business was operating? How much revenue did each product raise? How many 
of each product were sold in a given month? I always thought MapReduce was 
supposed to be good at solving these types of aggregate problems. That sounds 
like a myth now though, unless we're just looking at Hadoop.

My documents are all in a bucket named 'sales' and they are records with the 
following fields (as native Erlang records, not JSON):

    {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", 
"created_at": 1365931758}.

Let's take the example where I need to report the total revenue for each 
product in each month over the past 4 years (that's basically the entire 
bucket, but that's just the requirement), how does one use Riak's MapReduce to 
do that efficiently? Even just trying to use an identity map operation on the 
data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.

I'm doing this in Erlang (using the protocol buffers client), but any language 
is fine for an explanation.

The equivalent SQL (MySQL) would be:

  SELECT SUM(price)                         AS revenue,
         FROM_UNIXTIME(created_at, '%Y-%m') AS month,
         product_key
    FROM sales
GROUP BY month, product_key
ORDER BY month ASC;

Even with secondary indexes, there's still a MapReduce involved in this query, 
once you have the list of keys to process.

(Ordering not important right now).

Cheers,

Chris

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Using Riak to perform aggregate queries

Reply via email to