Re: High volume data series storage and queries

Jeremiah Peschka Tue, 09 Aug 2011 07:25:02 -0700

---
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
 Microsoft SQL Server MVP


On Aug 8, 2011, at 6:40 PM, Paul O wrote:

> Indeed, storage capacity is also an issue but IOPS would be important, too. I 
> assume that sending batches to Riak (opaque blobs) would help a lot with the 
> quantity of writes, but it's still a very important point.
> 
> You may want to look into ways to force Riak to clean up the bitcask files. I 
> don't entirely remember how it's going to handle cleaning up deleted records, 
> but you might run into some tricky situations where compactions aren't 
> occurring.
> 
> Hm, any references regarding that? It would be a major snag in the whole 
> schema if Riak doesn't properly reclaim space for deleted records.

You might have to tweak the merge settings 
(http://wiki.basho.com/Bitcask-Configuration.html#Disk-Usage-and-Merging-Settings), 
depending on how and when data is deleted. You could also bypass these 
configuration settings by manually running bitcask:merge.

More info here: 
https://help.basho.com/entries/20141178-why-does-it-seem-that-bitcask-merging-is-only-triggered-when-a-riak-node-is-restarted
and here: 
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-July/005055.html

> 
> Riak with Bitcask is pretty much constant time for lookups. The tricky part 
> with the amount of data you're describing is that Bitcask requires (I think) 
> that all keys fit into memory. As your data volume increases, you'll need to 
> do a combination of scaling up and scaling out: scale up RAM in the nodes, 
> then add nodes to handle load. RAM will help with data volume; more nodes 
> will help with write throughput.
> 
> Indeed, for high frequency sources that would create lots of bundles even the 
> MaxN to 1 reduction for key names might still generate loads of keys. Any 
> idea how much RAM Riak requires per record, or a reference that would point 
> me to it?

There's a capacity planning page: 
http://wiki.basho.com/Bitcask-Capacity-Planning.html
And some additional information about RAM and disk requirements here: 
http://wiki.basho.com/Cluster-Capacity-Planning.html
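
As a rough illustration of the arithmetic on the capacity planning page, 
here is a back-of-the-envelope sketch in Python. It is not Basho's 
calculator: the function names and the sample numbers are hypothetical, and 
the per-key overhead is a required argument on purpose, because the right 
value depends on your Bitcask version and should be taken from the capacity 
planning page.

```python
def estimate_keydir_ram_bytes(total_keys, avg_bucket_key_bytes,
                              per_key_overhead_bytes, n_val=3):
    """Approximate cluster-wide RAM for Bitcask's in-memory key directory.

    Formula: (per-key overhead + average bucket+key length)
             * total number of keys * n_val (replicas per key).
    """
    if total_keys < 0 or avg_bucket_key_bytes < 0 or per_key_overhead_bytes < 0:
        raise ValueError("sizes and counts must be non-negative")
    if n_val < 1:
        raise ValueError("n_val must be at least 1")
    return (per_key_overhead_bytes + avg_bucket_key_bytes) * total_keys * n_val


def per_node_ram_bytes(cluster_bytes, node_count):
    """Spread the cluster-wide estimate evenly over the nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return cluster_bytes / node_count


# Hypothetical inputs, for illustration only.
cluster = estimate_keydir_ram_bytes(
    total_keys=100_000_000,      # assumed number of bundles stored
    avg_bucket_key_bytes=40,     # assumed average bucket + key length
    per_key_overhead_bytes=40,   # placeholder, NOT the documented figure
    n_val=3,                     # Riak's default replication factor
)
print(cluster, "bytes for the cluster")
print(per_node_ram_bytes(cluster, node_count=6), "bytes per node")
```

The estimate only covers keys, so it shows why bundling many samples under 
one key matters: halving the number of keys roughly halves the RAM needed 
for the key directory.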

> 
> Since you're searching on time series, mostly, you could build time indexes 
> in your RDBMS. The nice thing is that querying temporal data is well 
> documented in the relational world, especially in the data warehousing world. 
> In your case, I'd create a dates table and have a foreign key relating to my 
> RDBMS index table to make it easy to search for dates. Querying your time 
> table will be fast which reduces the need for scans in your index table.
> 
> EXAMPLE:
> 
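> -- Date dimension: one row per point in time, with precomputed parts.
> CREATE TABLE timeseries (
>  time_key INT NOT NULL PRIMARY KEY, -- target of riak_index.time_key
>  date TIMESTAMP,
>  datestring VARCHAR(30),
>  year SMALLINT,
>  month TINYINT,
>  day TINYINT,
>  day_of_week TINYINT
>  -- etc
> );
> 
> -- Maps each time_key to the Riak key holding the data for that time.
> CREATE TABLE riak_index (
>  id INT NOT NULL,
>  time_key INT NOT NULL REFERENCES timeseries(time_key),
>  riak_key VARCHAR(100) NOT NULL
> );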
> 
> 
> SELECT ri.riak_key
> FROM timeseries ts
> JOIN riak_index ri ON ts.time_key = ri.time_key
> WHERE ts.date BETWEEN '20090702' AND '20100702';
> 
> My plan was to have the riak_index contain something like: (id, start_time, 
> end_time, source_id, record_count.)
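
With start_time and end_time on each row, finding the bundles for a query 
window becomes an interval-overlap test. A minimal sketch, using Python's 
sqlite3 module only so that it runs as-is: the riak_key column, the index 
name, the ISO 8601 text timestamps, and the sample rows are all assumptions 
for illustration, not part of the plan described above.

```python
import sqlite3


def create_index(conn):
    """Create a hypothetical riak_index table with one row per bundle."""
    conn.execute("""
        CREATE TABLE riak_index (
            id           INTEGER PRIMARY KEY,
            source_id    INTEGER NOT NULL,
            start_time   TEXT    NOT NULL,  -- ISO 8601, inclusive
            end_time     TEXT    NOT NULL,  -- ISO 8601, inclusive
            record_count INTEGER NOT NULL,
            riak_key     TEXT    NOT NULL   -- assumed: key of the blob in Riak
        )""")
    conn.execute("""
        CREATE INDEX riak_index_source_time
        ON riak_index (source_id, start_time, end_time)""")


def bundles_overlapping(conn, source_id, query_start, query_end):
    """Return Riak keys of bundles whose time range overlaps the window.

    Two ranges overlap when each one starts before the other ends.
    """
    rows = conn.execute(
        """SELECT riak_key FROM riak_index
           WHERE source_id = ? AND start_time <= ? AND end_time >= ?
           ORDER BY start_time""",
        (source_id, query_end, query_start))
    return [row[0] for row in rows]


# Hypothetical sample data: hourly bundles from two sources.
conn = sqlite3.connect(":memory:")
create_index(conn)
conn.executemany(
    """INSERT INTO riak_index
       (source_id, start_time, end_time, record_count, riak_key)
       VALUES (?, ?, ?, ?, ?)""",
    [
        (1, "2011-08-01T00:00:00", "2011-08-01T00:59:59", 3600, "src1-0001"),
        (1, "2011-08-01T01:00:00", "2011-08-01T01:59:59", 3600, "src1-0002"),
        (1, "2011-08-01T02:00:00", "2011-08-01T02:59:59", 3600, "src1-0003"),
        (2, "2011-08-01T01:00:00", "2011-08-01T01:59:59", 3600, "src2-0001"),
    ])

print(bundles_overlapping(conn, 1, "2011-08-01T00:30:00", "2011-08-01T01:30:00"))
```

The keys that come back are the only blobs you need to fetch from Riak; any 
samples outside the window still have to be trimmed after the bundles are 
unpacked.
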
> 
> Without going too much into RDBMS fun, this pattern can get your RDBMS 
> running pretty quickly and then you can combine that with Riak's performance 
> and have a really good idea of how quick any query will be.
> 
> That's roughly the plan, thanks again for your contributions to the 
> discussion!
> 
> Paul 
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


