On Nov 12, 2011, at 2:32 PM, Gordon Tillman wrote:
> Keith I have an idea that might work for you. This is a bit vague but I
> would be glad to put together a more concrete example if you like.
Okay, thanks! Not sure I understand everything, though.
> Use secondary indexes to tag each entry with the device id.
I get the tagging part, but I'm not sure what the bucket and key being tagged
would look like. Are you talking about a single bucket for all data?
put /buckets/mydata/keys/<device>-<timestamp>
x-riak-index-device_bin: FF06541287AB
Something like that?
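Or, spelled out as an actual request (untested; assuming Riak's default HTTP
interface on localhost:8098, a made-up epoch-seconds timestamp, and the blob
sitting in a local file, slice.bin):

  # store one second's blob, tagged with the device id it came from
  curl -X PUT http://localhost:8098/buckets/mydata/keys/FF06541287AB-1321115520 \
    -H "x-riak-index-device_bin: FF06541287AB" \
    -H "Content-Type: application/octet-stream" \
    --data-binary @slice.bin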
> You can then find all of the entries for a given device by using the
> secondary index to feed into a simple map phase operation that returns only
> the entries that you want; i.e., those that are in a given time range.
This I don't know how to do based on my reading of the docs. Something like:
get /buckets/mydata/index/device_bin/FF345678912
which would return a list of .... what, device-timestamp compound keys? And
then would I feed a potentially huge list of "bucket/key" pairs into a gigantic
javascript query for the map-reduce phase?
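Or can I skip the intermediate key list entirely and hand the index straight
to map/reduce as its input? Something like this is what I'm picturing
(untested; I'm guessing at the "inputs" syntax from my reading of the docs,
and at parsing the timestamp back out of my compound key):

  # all entries for one device, filtered to a one-hour window by the
  # timestamp embedded in the key (the hour boundaries are made up)
  curl -X POST http://localhost:8098/mapred \
    -H "Content-Type: application/json" \
    -d '{"inputs": {"bucket": "mydata", "index": "device_bin", "key": "FF345678912"},
         "query": [{"map": {"language": "javascript", "source":
           "function(v) { var ts = parseInt(v.key.split(\"-\")[1], 10); if (ts >= 1321113600 && ts < 1321117200) return [v.values[0].data]; return []; }"}}]}'

No idea whether pushing 2-4k binary blobs through the javascript VM like that
is sane, but that's the shape of what I mean.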
> In addition, to easily find all of the registered device ids you can
> create one entry for each device. The key can be most anything (even the
> device id if you encode it properly -- hash it), and you could tag each of
> those entries with a secondary index whose field is something like "type" or
> whatever and whose value is "deviceid". The value for each entry could be
> just a simple text/plain value whose content is just the device id of the
> registered device.
Okay, I think I get this:
When a device comes in, just do something like:
put /buckets/devices/keys/<device-id>
x-riak-index-type_bin: device
When I want a list of device IDs, I can:
get /buckets/devices/index/type_bin/device
and get them all, right? This is more efficient than the various list
functions? That makes sense to me.
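Spelled out (again untested, same localhost assumption as above):

  # register a device, keyed by its id and tagged so it can be found by type
  curl -X PUT http://localhost:8098/buckets/devices/keys/FF06541287AB \
    -H "x-riak-index-type_bin: device" \
    -H "Content-Type: text/plain" \
    -d "FF06541287AB"

  # list every registered device: the index query returns the keys, which in
  # this scheme are the device ids themselves
  curl http://localhost:8098/buckets/devices/index/type_bin/device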
I guess I'll have to try a few examples and see what happens. What you're
telling me is that what I want to do is possible, or is at least not pressing
against Riak's particular trade-offs too much. Or at least I hope that's what
you're telling me. ;)
Keith
>
> --gordon
>
> On Nov 12, 2011, at 16:19, Keith Irwin wrote:
>
>> Folks--
>>
>> (Apologies up front for the length of this.)
>>
>> I'm wondering if you can let me know if Riak is a good fit for a simple
>> not-quite-key-value scenario described below. MongoDB or (say) PostgreSQL
>> seem a more natural fit conceptually, but I really, really like Riak's
>> distribution strategy.
>>
>> ## context
>>
>> The basic overview is this:
>>
>> 50K devices push data once a second to web services which need to store that
>> data in short-term storage (Riak). Once an hour, a sweeper needs to take an
>> hour's worth of data per device (if there is any) and ship it off to long
>> term storage, then delete it from short-term storage. Ideally, there'd only
>> ever be slightly more than 1 hour's worth of data still in short-term
>> storage for any given device. The goal is to write down the data as simply
>> and safely as possible, with little or no processing on that data.
>>
>> Each second's worth of data is:
>>
>> * A device identifier
>> * A timestamp (epoch seconds, integer) for the slice of time the data
>> represents
>> * An opaque blob of binary data (2 to 4k)
>>
>> Once an hour, I'd like to do something like:
>>
>> * For each device:
>> * Find (and concat) all the data between time1 and time2 (an hour).
>> * Move that data to long-term storage (not Riak) as a single blob.
>> * Delete that data from Riak.
>>
>> For an SQL db, this is a really simple problem, conceptually. You can have a
>> table with three columns: device-id, timestamp, blob. You can index the
>> first two columns and roll up the data easily enough and then delete it via
>> single SQL statements (or buffer as needed). The harder part is
>> partitioning, replication, etc, etc.
>>
>> For MongoDB, it's also fairly simple. Just use a document with the same
>> device-id, timestamp and binary-array data (as JSON), make sure indexes are
>> declared, and query/delete just as in SQL. MongoDB provides sharding,
>> replica sets, recovery, etc. Setup, while less complicated than an RDBMS,
>> still seems way more complicated than necessary.
>>
>> These solutions also provide sorting (which, while nice, isn't a requirement
>> for my case).
>>
>> ## question
>>
>> I've been reading the Riak docs, and I'm just not sure if this simple
>> "queryable" case can really fit all that well. I'm not so concerned about
>> having to send 50K "deletes" to delete data. I'm more concerned about being
>> able to find it. Given what I've written above, I may be conceptually
>> blocked by that index/query mentality, such that I'm just not seeing the
>> Riak way of doing things.
>>
>> Anyway, I can "tag" (via the secondary index feature) each blob of data with
>> the device-id and the timestamp. I could then do a range query similar to:
>>
>> GET /buckets/devices/index/timestamp/start/end
>>
>> However, this doesn't allow me to group based on device-id. I could create a
>> separate bucket for every device, such that I could do:
>>
>> GET /buckets/device-id/index/timestamp/start/end
>>
>> but if I do this, how can I get a list of the device-ids I need so that I
>> can create that specific URL? The docs say listing buckets and keys is
>> problematic.
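>> (For the range query itself I assume I'd need an integer index rather than
>> a bare "timestamp" field, so more like:
>>
>>   GET /buckets/<device-id>/index/timestamp_int/1321113600/1321117200
>>
>> where those are made-up epoch-second hour boundaries.)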
>>
>> Might be that Riak just isn't a good fit for this sort of thing, especially
>> given I want to use it for short-term transient data, and that's fine. But I
>> wanted to ask you all just to make sure that I'm not missing something
>> somewhere.
>>
>> For instance, might link walking help? How about a map/reduce to find a
>> unique list of device-ids within a given time-horizon, and a streaming map
>> job to gather the data for export? Does that seem pretty reasonable?
>>
>> Thanks!
>>
>> Keith
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com