I want to store lots of time series data in a database. The data can be 
organized into measurements which have values at specific times. There will be 
a *lot* of data but it doesn't need to be accessed very often.
The time value will most likely be something like the seconds since the unix 
epoch. However, more precision would be useful. The time will be of a fixed 
size.
I want to store the data in such a way that losing a node with r=1 will only 
lose data at specific intervals. I'm thinking of doing this by choosing the 
vnode to store the data on as time % num_vnodes, building on riak-core.
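As a concrete sketch of the vnode choice (Python here just for illustration; the real thing would presumably be Erlang on riak-core):

```python
def vnode_for(timestamp, num_vnodes):
    # With r=1, losing vnode k loses only the samples where
    # timestamp % num_vnodes == k -- evenly spaced points in time
    # rather than a contiguous chunk of history.
    return timestamp % num_vnodes
```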

I have an idea for the actual storage format. Each measurement would be stored 
separately. The data would be grouped into files of a specific size.

The data would be grouped into blocks so that each block is a good size for 
compression, and each block would then be compressed.

The directory would look like:
1367419556.keys -> Keys file for time 1367419556 until the next file 
(1367464288).
1367419556.data -> Data file for time 1367419556 until the next file 
(1367464288).
1367419556.updates -> Updates/deletes/random inserts for the data. Searched 
first.
1367464288.keys
1367464288.data


The data file would be
| time | value_size | value | 
repeated for each time/value and grouped into compressed blocks. 

The keys file would contain 
| time of first item in block | offset |
repeated for each block. 

To get a value at a given time, first the file containing it would be found. 
Next, the update file would be searched, if it exists. Then the keys file would 
be read: once an entry's time is greater than the target time, the previous 
block would be opened and scanned until the time is found. Range query support 
will be necessary to achieve reasonable speed.
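The keys-file search could look something like this (a sketch, assuming the entries have already been parsed into a sorted in-memory list):

```python
import bisect

def find_block_offset(key_entries, t):
    # key_entries: sorted (first_time_in_block, offset) pairs read
    # from the .keys file. The candidate block is the last one whose
    # first time is <= t; returns None if t predates all blocks.
    first_times = [ft for ft, _ in key_entries]
    i = bisect.bisect_right(first_times, t) - 1
    return key_entries[i][1] if i >= 0 else None
```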
To insert a key at the end, it would be buffered until a block's worth of 
items has accumulated. Then the data would be written as a block to the current 
keys and data files.
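The buffering itself is simple; here's a sketch where `write_block` stands in for whatever actually serializes a block to the keys and data files, and `block_size` as a sample count is my assumption (the post leaves the exact sizing open):

```python
class BlockBuffer:
    def __init__(self, write_block, block_size=1024):
        # write_block: callable taking a list of (time, value) pairs
        # and writing them out as one compressed block.
        self.write_block = write_block
        self.block_size = block_size
        self.buf = []  # pending (time, value) pairs

    def append(self, t, value):
        self.buf.append((t, value))
        if len(self.buf) >= self.block_size:
            self.flush()

    def flush(self):
        # Flush any partial block (e.g. on shutdown).
        if self.buf:
            self.write_block(self.buf)
            self.buf = []
```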
To update, delete, or insert a key not at the end, it would be written to an 
update file for the file it is stored in (or should be placed in). The update 
file can be processed periodically and integrated into the keys and data 
files.
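The periodic merge could be as simple as folding the update file into the base pairs (a sketch; using None as the delete marker is my invention, not anything from the format above):

```python
def merge_updates(base, updates):
    # base: sorted (time, value) pairs decoded from the data file.
    # updates: {time: value} from the update file, where a value of
    # None means the key was deleted.
    merged = dict(base)
    for t, v in updates.items():
        if v is None:
            merged.pop(t, None)
        else:
            merged[t] = v
    return sorted(merged.items())
```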
Downsampling of the data and keys files (i.e. increasing the block size) could 
also be done at the same time the updates are integrated.
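Taking "increasing the block size" literally, that pass could just concatenate adjacent decompressed blocks before recompressing (a sketch; the `factor` parameter is hypothetical):

```python
def coalesce_blocks(blocks, factor=2):
    # blocks: list of decompressed blocks, each a list of
    # (time, value) pairs. Joins every `factor` adjacent blocks into
    # one, shrinking the keys file by the same factor.
    return [sum(blocks[i:i + factor], [])
            for i in range(0, len(blocks), factor)]
```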

This may be silly but it seems to me like it would be efficient for storing 
lots of time series data. Do you think this is realistic for storing time 
series data?

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
