On Nov 12, 2011, at 2:32 PM, Gordon Tillman wrote:
> Keith I have an idea that might work for you. This is a bit vague but I
> would be glad to put together a more concrete example if you like.
Okay, thanks! Not sure I understand everything, though.
> Use secondary indexes to tag each entry with the device id.
I get the tagging part, but I'm not sure what the bucket and key being tagged
would look like. Are you talking about a single bucket for all data?
put /buckets/mydata/keys/<device>-<timestamp>
x-riak-index-device_bin: FF06541287AB
Something like that?
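Or, spelled out as an actual request (untested; assuming Riak's default HTTP
interface on localhost:8098, a made-up epoch-seconds timestamp, and the blob
sitting in a local file, slice.bin):

  # store one second's blob, tagged with the device id it came from
  curl -X PUT http://localhost:8098/buckets/mydata/keys/FF06541287AB-1321115520 \
    -H "x-riak-index-device_bin: FF06541287AB" \
    -H "Content-Type: application/octet-stream" \
    --data-binary @slice.bin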
> You can then find all of the entries for a given device by using the
> secondary index to feed into a simple map phase operation that returns only
> the entries that you want; i.e., those that are in a given time range.
This I don't know how to do based on my reading of the docs. Something like:
get /buckets/mydata/index/device_bin/FF345678912
which would return a list of .... what, device-timestamp compound keys? And
then would I feed a potentially huge list of "bucket/key" pairs into a gigantic
javascript query for the map-reduce phase?
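Or can I skip the intermediate key list entirely and hand the index straight
to map/reduce as its input? Something like this is what I'm picturing
(untested; I'm guessing at the "inputs" syntax from my reading of the docs,
and at parsing the timestamp back out of my compound key):

  # all entries for one device, filtered to a one-hour window by the
  # timestamp embedded in the key (the hour boundaries are made up)
  curl -X POST http://localhost:8098/mapred \
    -H "Content-Type: application/json" \
    -d '{"inputs": {"bucket": "mydata", "index": "device_bin", "key": "FF345678912"},
         "query": [{"map": {"language": "javascript", "source":
           "function(v) { var ts = parseInt(v.key.split(\"-\")[1], 10); if (ts >= 1321113600 && ts < 1321117200) return [v.values[0].data]; return []; }"}}]}'

No idea whether pushing 2-4k binary blobs through the javascript VM like that
is sane, but that's the shape of what I mean.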
> In addition, to easily find all of the registered device ids you can
> create one entry for each device. The key can be most anything (even the
> device id if you encode it properly -- hash it), and you could tag each of
> those entries with a secondary index whose field is something like "type" or
> whatever and whose value is "deviceid". The value for each entry could be
> just a simple text/plain value whose content is just the device id of the
> registered device.
Okay, I think I get this:
When a device comes in, just do something like:
put /buckets/devices/keys/<device-id>
x-riak-index-type_bin: device
When I want a list of device IDs, I can:
get /buckets/devices/index/type_bin/device
and get them all, right? This is more efficient than the various list
functions? That makes sense to me.
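Spelled out (again untested, same localhost assumption as above):

  # register a device, keyed by its id and tagged so it can be found by type
  curl -X PUT http://localhost:8098/buckets/devices/keys/FF06541287AB \
    -H "x-riak-index-type_bin: device" \
    -H "Content-Type: text/plain" \
    -d "FF06541287AB"

  # list every registered device: the index query returns the keys, which in
  # this scheme are the device ids themselves
  curl http://localhost:8098/buckets/devices/index/type_bin/device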
I guess I'll have to try a few examples and see what happens. What you're
telling me is that what I want to do is possible, or is at least not pressing
against Riak's particular trade-offs too much. Or at least I hope that's what
you're telling me. ;)
Keith
>
> --gordon
>
> On Nov 12, 2011, at 16:19, Keith Irwin wrote:
>
>> Folks--
>>
>> (Apologies up front for the length of this.)
>>
>> I'm wondering if you can let me know if Riak is a good fit for a simple
>> not-quite-key-value scenario described below. MongoDB or (say) PostgreSQL
>> seem a more natural fit conceptually, but I really, really like Riak's
>> distribution strategy.
>>
>> ## context
>>
>> The basic overview is this:
>>
>> 50K devices push data once a second to web services which need to store that
>> data in short-term storage (Riak). Once an hour, a sweeper needs to take an
>> hour's worth of data per device (if there is any) and ship it off to long
>> term storage, then delete it from short-term storage. Ideally, there'd only
>> ever be slightly more than 1 hour's worth of data still in short-term
>> storage for any given device. The goal is to write down the data as simply
>> and safely as possible, with little or no processing on that data.
>>
>> Each second's worth of data is:
>>
>> * A device identifier
>> * A timestamp (epoch seconds, integer) for the slice of time the data
>> represents
>> * An opaque blob of binary data (2 to 4k)
>>
>> Once an hour, I'd like to do something like:
>>
>> * For each device:
>> * Find (and concat) all the data between time1 and time2 (an hour).
>> * Move that data to long-term storage (not Riak) as a single blob.
>> * Delete that data from Riak.
>>
>> For an SQL db, this is a really simple problem, conceptually. You can have a
>> table with three columns: device-id, timestamp, blob. You can index the
>> first two columns and roll up the data easily enough and then delete it via
>> single SQL statements (or buffer as needed). The harder part is
>> partitioning, replication, etc, etc.
>>
>> For MongoDB, it's also fairly simple. Just use a document with the same
>> device-id, timestamp and binary-array data (as JSON), make sure indexes are
>> declared, and query/delete just as in SQL. MongoDB provides sharding,
>> replica sets, recovery, etc. Setup, while less complicated than an RDBMS,
>> still seems way more complicated than necessary.
>>
>> These solutions also provide sorting (which, while nice, isn't a requirement
>> for my case).
>>
>> ## question
>>
>> I've been reading the Riak docs, and I'm just not sure if this simple
>> "queryable" case can really fit all that well. I'm not so concerned about
>> having to send 50K "deletes" to delete data. I'm more concerned about being
>> able to find it. Given what I've written above, I may be conceptually
>> blocked by that index/query mentality, such that I'm just not seeing the
>> Riak way of doing things.
>>
>> Anyway, I can "tag" (via the secondary index feature) each blob of data with
>> the device-id and the timestamp. I could then do a range query similar to:
>>
>> GET /buckets/devices/index/timestamp/start/end
>>
>> However, this doesn't allow me to group based on device-id. I could create a
>> separate bucket for every device, such that I could do:
>>
>> GET /buckets/device-id/index/timestamp/start/end
>>
>> but if I do this, how can I get a list of the device-ids I need so that I
>> can create that specific URL? The docs say listing buckets and keys is
>> problematic.
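>> (For the range query itself I assume I'd need an integer index rather than
>> a bare "timestamp" field, so more like:
>>
>>   GET /buckets/<device-id>/index/timestamp_int/1321113600/1321117200
>>
>> where those are made-up epoch-second hour boundaries.)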
>>
>> Might be that Riak just isn't a good fit for this sort of thing, especially
>> given I want to use it for short-term transient data, and that's fine. But I
>> wanted to ask you all just to make sure that I'm not missing something
>> somewhere.
>>
>> For instance, might link walking help? How about a map/reduce to find a
>> unique list of device-ids within a given time-horizon, and a streaming map
>> job to gather the data for export? Does that seem pretty reasonable?
>>
>> Thanks!
>>
>> Keith
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com