---
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP

On Aug 8, 2011, at 6:40 PM, Paul O wrote:

> Indeed, storage capacity is also an issue but IOPS would be important, too. I
> assume that sending batches to Riak (opaque blobs) would help a lot with the
> quantity of writes, but it's still a very important point.
>
> You may want to look into ways to force Riak to clean up the bitcask files. I
> don't entirely remember how it's going to handle cleaning up deleted records,
> but you might run into some tricky situations where compactions aren't
> occurring.
>
> Hm, any references regarding that? It would be a major snag in the whole
> schema if Riak doesn't properly reclaim space for deleted records.

You might have to tweak the merge settings
(http://wiki.basho.com/Bitcask-Configuration.html#Disk-Usage-and-Merging-Settings)
depending on how and when data is deleted. You could bypass these configuration
settings by manually running bitcask:merge. More info here:
https://help.basho.com/entries/20141178-why-does-it-seem-that-bitcask-merging-is-only-triggered-when-a-riak-node-is-restarted
and here:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-July/005055.html

>
> Riak is pretty constant time for Bitcask. The tricky part with the amount of
> data you're describing is that Bitcask requires (I think) that all keys fit
> into memory. As your data volume increases, you'll need to do a combination
> of scaling up and scaling out. Scale up RAM in the nodes and then add
> additional nodes to handle load. RAM will help with data volume, more nodes
> will help with write throughput.
>
> Indeed, for high frequency sources that would create lots of bundles even the
> MaxN to 1 reduction for key names might still generate loads of keys. Any
> idea how much RAM Riak requires per record, or a reference that would point
> me to it?
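
For a rough feel of the numbers, here is a back-of-the-envelope sketch in Python. It only assumes what was said above, that Bitcask keeps every key in memory, and multiplies the key count by an approximate per-key cost and by the replication factor. The function names are made up for illustration, and the per-key overhead is a placeholder assumption, not a figure I'm quoting from Basho; take the real value from the capacity planning documentation before trusting the result.

```python
def estimate_keydir_ram_bytes(num_keys, avg_key_bytes, per_key_overhead_bytes, n_val=3):
    """Rough cluster-wide RAM needed to hold all Bitcask keys in memory.

    num_keys               -- number of distinct keys stored
    avg_key_bytes          -- average bucket + key name length in bytes
    per_key_overhead_bytes -- ASSUMED fixed in-memory cost per key; look up
                              the real figure in the capacity planning docs
    n_val                  -- replication factor (each key is held n_val times)
    """
    if min(num_keys, avg_key_bytes, per_key_overhead_bytes) < 0 or n_val < 1:
        raise ValueError("sizes must be non-negative and n_val at least 1")
    return num_keys * (avg_key_bytes + per_key_overhead_bytes) * n_val


def per_node_ram_bytes(total_bytes, num_nodes):
    """Spread the cluster-wide estimate evenly over the nodes."""
    if num_nodes < 1:
        raise ValueError("need at least one node")
    return total_bytes / num_nodes


if __name__ == "__main__":
    # Hypothetical workload: 100 million bundles, 40-byte key names,
    # and a placeholder 40-byte per-key overhead.
    total = estimate_keydir_ram_bytes(100_000_000, 40, 40, n_val=3)
    print("cluster-wide key RAM (bytes):", total)
    print("per node on 6 nodes (bytes):", per_node_ram_bytes(total, 6))
```

The point of the sketch is just that key RAM grows linearly with the number of keys, so batching many records under one key cuts the requirement by roughly the batch size.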

There's a capacity planning page:
http://wiki.basho.com/Bitcask-Capacity-Planning.html

And some additional information about RAM and disk requirements here:
http://wiki.basho.com/Cluster-Capacity-Planning.html

>
> Since you're searching on time series, mostly, you could build time indexes
> in your RDBMS. The nice thing is that querying temporal data is well
> documented in the relational world, especially in the data warehousing world.
> In your case, I'd create a dates table and have a foreign key relating to my
> RDBMS index table to make it easy to search for dates. Querying your time
> table will be fast which reduces the need for scans in your index table.
>
> EXAMPLE:
>
> CREATE TABLE timeseries (
>     time_key INT NOT NULL PRIMARY KEY, -- must be a key: riak_index references it
>     date TIMESTAMP,
>     datestring VARCHAR(30),
>     year SMALLINT,
>     month TINYINT,
>     day TINYINT,
>     day_of_week TINYINT
>     -- etc
> );
>
> CREATE TABLE riak_index (
>     id INT NOT NULL,
>     time_key INT NOT NULL REFERENCES timeseries(time_key),
>     riak_key VARCHAR(100) NOT NULL
> );
>
>
> SELECT ri.riak_key
> FROM timeseries ts
> JOIN riak_index ri ON ts.time_key = ri.time_key
> WHERE ts.date BETWEEN '20090702' AND '20100702';
>
> My plan was to have the riak_index contain something like: (id, start_time,
> end_time, source_id, record_count).
>
> Without going too much into RDBMS fun, this pattern can get your RDBMS
> running pretty quickly and then you can combine that with Riak's performance
> and have a really good idea of how quick any query will be.
>
> That's roughly the plan, thanks again for your contributions to the
> discussion!
>
> Paul
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
