I'm facing a similar (yet not such extreme) use case. I'm also considering a similar strategy, but I was thinking about using riak search instead of a rdbs for the secondary indexes.
On Mon, Aug 8, 2011 at 10:25 PM, Paul O <[email protected]> wrote: > Hi Jeremiah, > > This is for a yet-to-exist system, so the existing data characteristics are > not that important. > > The volume of data would be something like : average 10 events per second > per source meaning about 320 million events per source, for tens of > thousands of sources, potentially hundreds of thousands. > > Data retention policy would be in the range of years, probably 5 years. > > Most of the above-mentioned are averages, some sources might be sampled > even hundreds of times per second. There is also a layer of creating > aggregates for "regressive granularity" (a la RRD) but it's a bit less of a > concern (i.e. the same strategy I'm describing could be used for storing the > aggregates.) > > The strategy I've described tries to make the most common query (time range > per source with a max number of elements) predictable and as performant as > possible. I.e. for any range I know at most three batches need to be read > from Riak (or equivalent) so I can say that, if reading a batch takes 20 ms > and the initial query takes 10 ms I can predictably respond to most such > requests under 100 ms. > > So as long as I can benchmark individual aspects of the strategy I hope to > a predictable query cost and an idea of how to grow the system. > > As for the read to write ration I don't have an exact estimate (the system > will be generic and consumption applications will be built on top of it) but > the system is expected to be a lot more write intensive than read intensive. > Most data might go completely unused, some data might be rather "hot" so > additional caching might be implemented later but I'm trying to design the > underlying system so at least some performance axioms are computable. > > Does this clarify or confuses further? > > Regards, > > Paul > > On Mon, Aug 8, 2011 at 3:32 PM, Jeremiah Peschka < > [email protected]> wrote: > >> It sounds like a potentially interesting use case. >> >> The questions that immediately enter my head are: >> * How much data do you currently have? >> * How much data do you plan to have? >> * Do you have a data retention policy? If so, what is it? How do you plan >> to implement it? >> * What's the anticipated rate of growth per day? Week? Year? >> * What type of queries will you have? Is it a fixed set of queries? Is it >> a decision support system? >> * What does your read to write ratio look like? >> >> Your plan to support Riak with a hybrid system isn't that out of whack; >> it's very doable. >> >> You can certainly do the type of querying you've described through careful >> choice of key names, sorting in memory, and only using the first N data >> points in a given Map Reduce query result. The main reason to not perform >> range queries in Riak is that they'll result in full key space scans across >> the Riak cluster. If you're using bitcask as your backend then it's an in >> memory scan, otherwise you're doing a much more costly scan from disk. And, >> since key names are hashed as they are partitioned across the cluster, >> you're not going to get the benefit of sequential disk scan performance like >> you might get with a traditional database. >> >> The only thing that worries me is the phrase "should grow more than what a >> 'vanilla' RDBMS would support". Are you thinking 1TB? 10TB? 50TB? 500TB? I'm >> trying to get a handle on what size and performance characteristics you're >> looking for before diving into how to look at your system vs. saying "Hell >> if I know, does someone else on the list have a good idea?" >> >> --- >> Jeremiah Peschka - Founder, Brent Ozar PLF, LLC >> Microsoft SQL Server MVP >> >> On Aug 8, 2011, at 11:21 AM, Paul O wrote: >> >> > Hello Riak enthusiasts, >> > >> > I am trying to design a solution for storing time series data coming >> from a very large number of potential high-frequency sources. >> > >> > I thought Riak could be of help, though based on what I read about it I >> can't use it without some other layer on top of it. >> > >> > The problem is I need to be able to do range queries over this data, by >> the source. Hence, I want to be able to say "give me the N first data points >> for source S between time T1 and time T2." >> > >> > I need to store this data for a rather long time, and the expected >> volume should grow more than what a "vanilla" RDBMS would support. >> > >> > Another thing to note is that I can restrict the number of data points >> to be returned by a query, so no query would return more than MaxN data >> points. >> > >> > I thought about doing this the following way: >> > >> > 1. bundle date time series in batches of MaxN, to ensure that any query >> would require reading at most two batches. The batches would be store inside >> Riak. >> > 2. Store the start-time, end-time, size and Riak batch ID in a MySQL (or >> PostgreSQL) DB. >> > >> > My thinking is such a strategy would allow me to persist data in Riak >> and linearly grow with the data, and the index would be kept in a RDBM for >> fast range queries. >> > >> > Does it sound sensible to use Riak this way? Does this make you >> laugh/cry/shake your head in disbelief? Am I overlooking something from Riak >> which would make all this much better? >> > >> > Thanks and best regards, >> > >> > Paul >> > _______________________________________________ >> > riak-users mailing list >> > [email protected] >> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >> >> > > _______________________________________________ > riak-users mailing list > [email protected] > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com > >
_______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
