Re: High volume data series storage and queries

Paul O Mon, 08 Aug 2011 18:27:01 -0700

Pablo, are you planning to use lexical range queries on your time series
from Riak Search? I'm not sure I understand the cost of lexical range queries,
but this could indeed simplify the system.


Regards,

Paul

On Mon, Aug 8, 2011 at 5:17 PM, Pablo Chacin <[email protected]> wrote:

> I'm facing a similar (though not as extreme) use case. I'm also considering
> a similar strategy, but I was thinking about using Riak Search instead of an
> RDBMS for the secondary indexes.
>
> On Mon, Aug 8, 2011 at 10:25 PM, Paul O <[email protected]> wrote:
>
>> Hi Jeremiah,
>>
>> This is for a yet-to-exist system, so the existing data characteristics
>> are not that important.
>>
>> The volume of data would be something like: an average of 10 events per
>> second per source, meaning about 320 million events per source per year, for
>> tens of thousands of sources, potentially hundreds of thousands.
>>
>> Data retention policy would be in the range of years, probably 5 years.
>>
>> Most of the figures above are averages; some sources might be sampled
>> even hundreds of times per second. There is also a layer of creating
>> aggregates for "regressive granularity" (a la RRD) but it's a bit less of a
>> concern (i.e. the same strategy I'm describing could be used for storing the
>> aggregates.)
>>
>> The strategy I've described tries to make the most common query (time
>> range per source with a max number of elements) predictable and as
>> performant as possible. I.e. for any range I know at most three batches need
>> to be read from Riak (or equivalent) so I can say that, if reading a batch
>> takes 20 ms and the initial query takes 10 ms I can predictably respond to
>> most such requests under 100 ms.
>>
>> So as long as I can benchmark individual aspects of the strategy I hope to
>> get a predictable query cost and an idea of how to grow the system.
>>
>> As for the read to write ratio, I don't have an exact estimate (the system
>> will be generic and consumption applications will be built on top of it) but
>> the system is expected to be a lot more write intensive than read intensive.
>> Most data might go completely unused, some data might be rather "hot" so
>> additional caching might be implemented later but I'm trying to design the
>> underlying system so at least some performance axioms are computable.
>>
>> Does this clarify things or confuse them further?
>>
>> Regards,
>>
>> Paul
>>
>> On Mon, Aug 8, 2011 at 3:32 PM, Jeremiah Peschka <
>> [email protected]> wrote:
>>
>>> It sounds like a potentially interesting use case.
>>>
>>> The questions that immediately enter my head are:
>>> * How much data do you currently have?
>>> * How much data do you plan to have?
>>> * Do you have a data retention policy? If so, what is it? How do you plan
>>> to implement it?
>>> * What's the anticipated rate of growth per day? Week? Year?
>>> * What type of queries will you have? Is it a fixed set of queries? Is it
>>> a decision support system?
>>> * What does your read to write ratio look like?
>>>
>>> Your plan to support Riak with a hybrid system isn't that out of whack;
>>> it's very doable.
>>>
>>> You can certainly do the type of querying you've described through
>>> careful choice of key names, sorting in memory, and only using the first N
>>> data points in a given Map Reduce query result. The main reason to not
>>> perform range queries in Riak is that they'll result in full key space scans
>>> across the Riak cluster. If you're using bitcask as your backend then it's
>>> an in memory scan, otherwise you're doing a much more costly scan from disk.
>>> And, since key names are hashed as they are partitioned across the cluster,
>>> you're not going to get the benefit of sequential disk scan performance like
>>> you might get with a traditional database.
>>>
>>> The only thing that worries me is the phrase "should grow more than what
>>> a 'vanilla' RDBMS would support". Are you thinking 1TB? 10TB? 50TB? 500TB?
>>> I'm trying to get a handle on what size and performance characteristics
>>> you're looking for before diving into how to look at your system vs. saying
>>> "Hell if I know, does someone else on the list have a good idea?"
>>>
>>> ---
>>> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
>>> Microsoft SQL Server MVP
>>>
>>> On Aug 8, 2011, at 11:21 AM, Paul O wrote:
>>>
>>> > Hello Riak enthusiasts,
>>> >
>>> > I am trying to design a solution for storing time series data coming
>>> from a very large number of potential high-frequency sources.
>>> >
>>> > I thought Riak could be of help, though based on what I read about it I
>>> can't use it without some other layer on top of it.
>>> >
>>> > The problem is I need to be able to do range queries over this data, by
>>> the source. Hence, I want to be able to say "give me the N first data points
>>> for source S between time T1 and time T2."
>>> >
>>> > I need to store this data for a rather long time, and the expected
>>> volume should grow more than what a "vanilla" RDBMS would support.
>>> >
>>> > Another thing to note is that I can restrict the number of data points
>>> to be returned by a query, so no query would return more than MaxN data
>>> points.
>>> >
>>> > I thought about doing this the following way:
>>> >
>>> > 1. Bundle the time series data in batches of MaxN, to ensure that any query
>>> would require reading at most two batches. The batches would be stored inside
>>> Riak.
>>> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL
>>> (or PostgreSQL) DB.
>>> >
>>> > My thinking is such a strategy would allow me to persist data in Riak
>>> and linearly grow with the data, and the index would be kept in an RDBMS for
>>> fast range queries.
>>> >
>>> > Does it sound sensible to use Riak this way? Does this make you
>>> laugh/cry/shake your head in disbelief? Am I overlooking something from Riak
>>> which would make all this much better?
>>> >
>>> > Thanks and best regards,
>>> >
>>> > Paul
>>> > _______________________________________________
>>> > riak-users mailing list
>>> > [email protected]
>>> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>
>
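The batch-plus-index strategy discussed in this thread (batches of at most
MaxN points per source, with the start time and batch ID kept in a separate
index) could be sketched roughly as follows. This is only an illustration
under stated assumptions: a plain dict stands in for Riak, a per-source sorted
list stands in for the MySQL/PostgreSQL index, timestamps per source are
assumed to arrive in strictly increasing order, and all names (BatchStore,
append, query, max_n) are hypothetical.

```python
import bisect
from collections import defaultdict


class BatchStore:
    """Toy model: batches of at most max_n points per source, plus a time index."""

    def __init__(self, max_n):
        self.max_n = max_n
        self.batches = {}               # batch_id -> [(t, value), ...]; stands in for Riak
        self.index = defaultdict(list)  # source -> [(start_time, batch_id), ...]; stands in for the RDBMS
        self.next_id = 0

    def append(self, source, t, value):
        """Append a point, opening a new batch when the current one is full."""
        entries = self.index[source]
        if not entries or len(self.batches[entries[-1][1]]) >= self.max_n:
            batch_id = self.next_id
            self.next_id += 1
            self.batches[batch_id] = []
            entries.append((t, batch_id))
        self.batches[entries[-1][1]].append((t, value))

    def query(self, source, t1, t2, n):
        """Return (first n points with t1 <= t <= t2, number of batches read)."""
        n = min(n, self.max_n)
        entries = self.index.get(source, [])
        starts = [start for start, _ in entries]
        # Last batch whose start time is <= t1; it is the only one that can contain t1.
        i = max(bisect.bisect_right(starts, t1) - 1, 0)
        result, reads = [], 0
        while i < len(entries) and entries[i][0] <= t2 and len(result) < n:
            reads += 1  # one batch fetched from the (simulated) store
            for t, v in self.batches[entries[i][1]]:
                if t1 <= t <= t2 and len(result) < n:
                    result.append((t, v))
            i += 1
        return result, reads


store = BatchStore(max_n=4)
for t in range(10):
    store.append("s", t, t * 10)
points, reads = store.query("s", 2, 9, 4)
```

In this example the ten points land in three batches, and the query for the
first four points from t=2 onward reads two of them. Counting batch reads per
query this way is one option for benchmarking the individual steps separately.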

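The lexical range query idea raised in this thread depends on key names whose
string order matches time order. A minimal sketch, assuming zero-padded epoch
milliseconds in the key: the "source/timestamp" layout, the 13-digit padding
and the helper names are assumptions made for illustration, and a sorted
Python list stands in for whatever index would actually hold the keys. It does
not use any Riak or Riak Search API.

```python
import bisect


def make_key(source, epoch_ms):
    """Build a key whose lexical order matches time order within one source.

    The 13-digit zero padding is an assumed convention; it covers epoch
    milliseconds up to 9999999999999.
    """
    return f"{source}/{epoch_ms:013d}"


def first_n_in_range(sorted_keys, source, t1, t2, n):
    """Return up to n keys for source with t1 <= timestamp <= t2.

    sorted_keys must already be in lexical order.
    """
    lo = bisect.bisect_left(sorted_keys, make_key(source, t1))
    hi = bisect.bisect_right(sorted_keys, make_key(source, t2))
    return sorted_keys[lo:min(hi, lo + n)]


keys = sorted(
    [make_key("s1", t) for t in (5, 100, 20, 1000)] + [make_key("s2", 10)]
)
hits = first_n_in_range(keys, "s1", 10, 1000, 2)
```

Without the padding, "s1/100" would sort before "s1/20" as a string, so a
lexical range over unpadded timestamps would not correspond to a time range.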
