Alexander, the whole batching strategy I described in my initial post is meant to help the problem map better onto a key-value store such as Riak. The plan is for each batch of MaxN events to be stored under a single key, hence avoiding the problem of storing too many tiny values. I'm still surprised by a 450-byte per-value overhead.
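To make the batching idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names `Batcher`, `max_n` and `overhead_per_event`, the `source_id:start_ts` key layout, and the `store` callback (a stand-in for whatever client call would actually write the blob) are all hypothetical, and the 450-byte figure is simply the number quoted in this thread, not something measured here.

```python
from collections import defaultdict

PER_VALUE_OVERHEAD = 450  # bytes; the figure quoted in this thread, not measured here


class Batcher:
    """Collect events per source and emit one (key, events) pair per max_n events."""

    def __init__(self, max_n, store):
        self.max_n = max_n
        self.store = store  # hypothetical callback: store(key, events)
        self.pending = defaultdict(list)

    def add(self, source_id, timestamp, value):
        batch = self.pending[source_id]
        batch.append((timestamp, value))
        if len(batch) >= self.max_n:
            self.flush(source_id)

    def flush(self, source_id):
        # Also callable directly, e.g. on shutdown, to write a partial batch.
        batch = self.pending.pop(source_id, [])
        if batch:
            # Assumed key layout: source id plus the timestamp of the first event.
            key = f"{source_id}:{batch[0][0]}"
            self.store(key, batch)


def overhead_per_event(events_per_batch, overhead=PER_VALUE_OVERHEAD):
    """Per-value overhead amortized over the events stored under one key."""
    return overhead / events_per_batch
```

For example, with `max_n=600` and a dict's `__setitem__` as the `store` callback, feeding 1200 events from one source writes two keys, and `overhead_per_event(600)` puts the amortized overhead at under one byte per event. Anything still sitting in `pending` when the process dies is lost, which is the tradeoff Alexander points out below.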
I do appreciate your reasoning around this, though. In the end Ciprian's earlier suggestion (Riak core + custom storage) would seem to win the day. Not sure if the tradeoff would pay off immediately, though, so I might end up having some strategy for the initial volume expectations with a migration plan to a more advanced solution sometime down the road, I guess.

Regards,

Paul

On Tue, Aug 9, 2011 at 10:43 AM, Alexander Sicular <[email protected]> wrote:

> A couple of thoughts:
>
> -disk io
> -total keys versus memory
> -data on disk overhead
>
> As Jeremiah noted, disk io is crucial. Thankfully, Riak's distributed mesh
> gives you access to a number of spindles limited only by your budget. I
> think that is a critical bonus of a distributed system like Riak that is
> often not fully appreciated. Here Riak is a win for you.
>
> Bitcask needs all keys to fit in memory. We are talking something like:
>
> (key length + overhead) * number of keys * replicas < cluster max available ram.
>
> There is a tool on the wiki which should help figure this out. What that
> basically means for you is that you will have to batch your data by some
> sensor/time granularity metric. Let's say every minute. At 10hz that is a
> 600x reduction in total keys. Of course, this doesn't come for free. Your
> application middleware will have to accommodate. That means you could lose
> up to whatever your time granularity batch is. I.e., you could lose a minute
> of sensor data should your application fail. Here Riak is neutral to
> negative.
>
> Riak data structure is not friendly towards small values. Sensor data
> generally spit out integers or other small data tuples. If you search the
> list archives you will find a magnificent data overhead writeup. IIRC, it
> was something on the order of 450b. What that basically tells you is that
> you can't use bitcask for small values if disk space is a concern, as I
> imagine it to be in this case. Also, sensor data is generally write only,
> i.e., never deleted or modified, so compaction should not be a concern when
> using bitcask. Here Riak is a strong negative.
>
> Data retrieval issues aside (which between Riak Search/secondary
> indexes/third party indexes should not be a major concern), I am of the
> opinion that Riak is not a good fit for high frequency sensor data
> applications.
>
> Cheers,
> Alexander
>
> Sent from my rotary phone.
>
> On Aug 8, 2011 9:40 PM, "Paul O" <[email protected]> wrote:
> > Quite a few interesting points, thanks!
> >
> > On Mon, Aug 8, 2011 at 5:53 PM, Jeremiah Peschka <[email protected]> wrote:
> >
> >> Responses inline
> >>
> >> On Aug 8, 2011, at 1:25 PM, Paul O wrote:
> >>
> >> Will any existing data be imported? If this is totally greenfield, then
> >> you're free to do whatever zany things you want!
> >
> > Almost totally greenfield, yes. Some data will need to be imported but it's
> > already in the format described.
> >
> >> Ah, so you need IOPS throughput, not storage capacity. On the hardware side
> >> make sure your storage subsystem can keep up - don't cheap out on disks just
> >> because you have a lot of nodes. A single rotational HDD can only handle
> >> about 180 IOPS on average. There's a lot you can do on the storage backend
> >> to make sure you're able to keep up there.
> >
> > Indeed, storage capacity is also an issue but IOPS would be important, too.
> > I assume that sending batches to Riak (opaque blobs) would help a lot with
> > the quantity of writes, but it's still a very important point.
> >
> >> You may want to look into ways to force Riak to clean up the bitcask files.
> >> I don't entirely remember how it's going to handle cleaning up deleted
> >> records, but you might run into some tricky situations where compactions
> >> aren't occurring.
> >
> > Hm, any references regarding that? It would be a major snag in the whole
> > schema if Riak doesn't properly reclaim space for deleted records.
> >
> >> Riak is pretty constant time for Bitcask. The tricky part with the amount of
> >> data you're describing is that Bitcask requires (I think) that all keys fit
> >> into memory. As your data volume increases, you'll need to do a combination
> >> of scaling up and scaling out. Scale up RAM in the nodes and then add
> >> additional nodes to handle load. RAM will help with data volume, more nodes
> >> will help with write throughput.
> >
> > Indeed, for high frequency sources that would create lots of bundles even
> > the MaxN to 1 reduction for key names might still generate loads of keys.
> > Any idea how much RAM Riak requires per record, or a reference that would
> > point me to it?
> >
> >> Since you're searching on time series, mostly, you could build time indexes
> >> in your RDBMS. The nice thing is that querying temporal data is well
> >> documented in the relational world, especially in the data warehousing
> >> world. In your case, I'd create a dates table and have a foreign key
> >> relating to my RDBMS index table to make it easy to search for dates.
> >> Querying your time table will be fast which reduces the need for scans in
> >> your index table.
> >>
> >> EXAMPLE:
> >>
> >> CREATE TABLE timeseries (
> >>   time_key INT,
> >>   date TIMESTAMP,
> >>   datestring VARCHAR(30),
> >>   year SMALLINT,
> >>   month TINYINT,
> >>   day TINYINT,
> >>   day_of_week TINYINT
> >>   -- etc
> >> );
> >>
> >> CREATE TABLE riak_index (
> >>   id INT NOT NULL,
> >>   time_key INT NOT NULL REFERENCES timeseries(time_key),
> >>   riak_key VARCHAR(100) NOT NULL
> >> );
> >>
> >> SELECT ri.riak_key
> >> FROM timeseries ts
> >> JOIN riak_index ri ON ts.time_key = ri.time_key
> >> WHERE ts.date BETWEEN '20090702' AND '20100702';
> >
> > My plan was to have the riak_index contain something like: (id, start_time,
> > end_time, source_id, record_count.)
> >
> >> Without going too much into RDBMS fun, this pattern can get your RDBMS
> >> running pretty quickly and then you can combine that with Riak's performance
> >> and have a really good idea of how quick any query will be.
> >
> > That's roughly the plan, thanks again for your contributions to the
> > discussion!
> >
> > Paul
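
For reference, the key-memory rule of thumb quoted above, (key length + overhead) * number of keys * replicas < cluster RAM, can be turned into a quick back-of-the-envelope calculation. The sketch below is an assumption-laden illustration, not Bitcask's actual accounting: the function names are made up, and the per-key overhead is left as a parameter because the thread gives no figure for it (the 40 bytes in the usage note is a placeholder only; the capacity tool on the wiki is the place to get a real value).

```python
def total_keys(sources, rate_hz, seconds, events_per_key=1):
    """Number of keys written when events_per_key events share one key."""
    events = sources * rate_hz * seconds
    return -(-events // events_per_key)  # ceiling division


def keydir_bytes(num_keys, key_len, per_key_overhead, replicas):
    """Left-hand side of the estimate: (key length + overhead) * keys * replicas."""
    return (key_len + per_key_overhead) * num_keys * replicas


def fits_in_ram(num_keys, key_len, per_key_overhead, replicas, cluster_ram_bytes):
    """True when the estimated key memory is below the cluster's available RAM."""
    needed = keydir_bytes(num_keys, key_len, per_key_overhead, replicas)
    return needed < cluster_ram_bytes
```

One sensor at 10 Hz produces `total_keys(1, 10, 86400)` = 864000 keys per day when every event gets its own key, versus `total_keys(1, 10, 86400, events_per_key=600)` = 1440 keys when batched per minute, which is the 600x reduction mentioned above. With 20-byte keys, the placeholder 40 bytes of overhead and 3 replicas, `keydir_bytes(1440, 20, 40, 3)` comes to 259200 bytes for that day of batched data.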
_______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
