On 15 Jan 2011, at 1:41, bubba postgres wrote:
> I've been googling, but haven't found a good answer to what I should do if I
> want to store time series in Postgres.
> My current solution is store serialized (compressed) blobs of data.
> (So for example store 1 day worth of 1 minute samples (~1440 samples) stored
> as one row in a bytea. (Plus meta data)
> It would be nice if I could use 1 sample per column,(because updating
> individual columns/samples is clear to me) but postgres doesn't compress the
> row (which is bad because of high amount of repetitive data.. Easily 10X
> bigger.
Not an easy problem to solve without knowing more about your data, like what
the nature of this repetition is, how repetitive it is, etc. It also much
depends on how you intend to use this data later on. Are there gaps in your
data ranges? How accurate does your data need to be stored? Etc.
For a decision you're probably the only person who'll be close enough to the
data and how it will be used.
That said, assuming your data has significant _sequential_ repetition (That's
just how I would phrase it; meaning that a range of values in sequence are the
same), it may be an idea to just store differences larger than 0 and the length
of a sample of equal values. If such ranges are small, this would just add
overhead though.
You could extend that approach by fitting curves to ranges of values with a
similar tendency and store those curves instead. Retrieving single values from
those is of course a little harder, plus you probably will get some inaccuracy
when you can't fit curves exactly.
Hard to say how effective this would be though.
> I've been considering a Double[] array, which would get compressed, but
> before I start down that path (I suppose I need to make some storedprocs to
> update individual samples), has anyone built anything like this? Any open
> source projects I should look at?
This is the safest approach I think. You don't have to rely on any assumptions
on how your data behaves over time. It's also quite predictable in terms of
database and storage requirements, you won't run into surprises here. Just the
usual time-related stuff (DST-changes and such).
One change I'd probably make is to store them like this:
CREATE TABLE sample (
start timestamp,
length int,
values double[]
);
That way you don't need to reserve space for longer gaps. For example, if a day
ends "early" because data stopped coming in, you can just store a shorter day
(length < 1440 minutes), or if you missed data at the start of the day you can
make it start later (once the first sample arrives).
Querying them isn't too tough either, although determining the array index you
need based on a timestamp is somewhat tricky I just found out.
Finding the record is pretty easy though, that's just:
WHERE now() BETWEEN start AND start + length * interval '1 minute';
I'd advise putting an index on the latter formula ;)
There is one possible problem I'm seeing with this approach though: Updating
values requires the whole row to be rewritten (MVCC).
If your arrays are wide enough, then updating those rows will take longer and
longer because the amount of data that needs to be written each time is
increasing.
Seeing that your data-samples apparently arrive at 1-minute intervals you'll
probably be safe, but if you're going to add significant processing around
updating a row, then you could run into a race where the previous version of
the record is still being processed.
There are ways around that, but they only complicate matters more and probably
hurt performance, so that's probably best left for if it's needed at all.
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
!DSPAM:737,4d318b0811872055413410!
--
Sent via pgsql-general mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general