I want to store lots of time series data in a database. The data can be organized into measurements, each of which has values at specific times. There will be a *lot* of data, but it doesn't need to be accessed very often. The time value will most likely be something like seconds since the unix epoch, though more precision would be useful; either way, the time will be of a fixed size. I want to store the data in such a way that losing a node with r=1 will only lose data at specific intervals. I'm thinking of doing this by choosing the vnode to store each point on via time % num_vnodes, building on riak-core.
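As a quick sketch of that loss pattern (the ring size of 4 and the second-resolution timestamps are just illustrative values, not anything riak-core prescribes):

```python
def vnode_for(timestamp: int, num_vnodes: int) -> int:
    """Pick the vnode index for a sample by time modulo ring size."""
    return timestamp % num_vnodes

# With r=1, losing vnode k loses only samples where t % num_vnodes == k,
# i.e. one sample out of every num_vnodes, at regular intervals.
samples = [1367419556 + i for i in range(8)]
print([vnode_for(t, 4) for t in samples])  # [0, 1, 2, 3, 0, 1, 2, 3]
```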
I have an idea for the actual storage format. Each measurement would be stored separately. The data would be grouped into files of a specific size, and within each file grouped into blocks, so that each block is a good size for compression. Each block would be compressed. The directory would look like:

  1367419556.keys    -> Keys file for time 1367419556 until the next file (1367464288).
  1367419556.data    -> Data file for time 1367419556 until the next file (1367464288).
  1367419556.updates -> Updates/deletes/random inserts for the data. Searched first.
  1367464288.keys
  1367464288.data

The data file would be | time | value_size | value |, repeated for each time/value pair and grouped into compressed blocks. The keys file would contain | time of first item in block | offset |, repeated for each block.

To get a value at a given time, first the file containing it would be found. Next, the updates file would be searched, if one exists. Then the keys file would be read: for each entry, if its time is greater than the target time, the previous block would be opened and read until the target time is found. Range support will be necessary to achieve reasonable speed.

To insert a key at the end, it would be buffered until a block's worth of items has accumulated; then the data would be written as one block to the current keys and data files. To update, delete, or insert a key not at the end, it would be written to an updates file for the file where it is (or should be) stored. The updates file can be processed periodically and integrated into the keys and data files. Downsampling of the data and keys files (i.e. increasing the block size) could also be done at the same time the updates are integrated.

This may be silly, but it seems to me like it would be efficient for storing lots of time series data. Do you think this is realistic for storing time series data?

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
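Here is a rough sketch of the block encoding and lookup path described above. The u64/u32 field widths, zlib compression, and helper names are my own illustrative choices, not part of the proposal, and the updates-file search is omitted:

```python
import struct
import zlib
import bisect

# Records packed as | time (u64) | value_size (u32) | value |, grouped into
# compressed blocks. The keys index holds (first_time, offset) per block.
REC_HDR = struct.Struct(">QI")

def build_block(records):
    """Pack (time, value-bytes) records and compress them into one block."""
    raw = b"".join(REC_HDR.pack(t, len(v)) + v for t, v in records)
    return zlib.compress(raw)

def read_block(block):
    """Decompress a block back into a list of (time, value) records."""
    raw = zlib.decompress(block)
    out, i = [], 0
    while i < len(raw):
        t, size = REC_HDR.unpack_from(raw, i)
        i += REC_HDR.size
        out.append((t, raw[i:i + size]))
        i += size
    return out

def lookup(keys, blocks, t):
    """Binary-search the keys index for the block whose first time is <= t,
    then scan that one block for an exact match."""
    first_times = [ft for ft, _off in keys]
    idx = bisect.bisect_right(first_times, t) - 1
    if idx < 0:
        return None  # t is before the first block
    for rec_t, value in read_block(blocks[idx]):
        if rec_t == t:
            return value
    return None

# Two blocks and their keys index; offsets mimic positions in the data file.
blocks = [build_block([(100, b"a"), (105, b"bb")]),
          build_block([(200, b"ccc")])]
keys = [(100, 0), (200, len(blocks[0]))]
print(lookup(keys, blocks, 105))  # b'bb'
```

In a real implementation the blocks would of course be read from the .data file at the stored offsets rather than held in a list, and the updates file would be consulted before the keys file, as described above.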
