Ramon Bastiaans wrote:

I am doing research for a project of mine where I need to store several billion values for a monitoring and historical tracking system for a big computer system. My current estimate is that I have to store (somehow) around 1 billion values each month (possibly more).

I was wondering if anyone has experience with this kind of data volume in a PostgreSQL database, and how it affects database design and optimization.

What would be the important issues when setting up a database this big, and is it at all doable? Or would it be insane to think about storing up to 5-10 billion rows in a PostgreSQL database?

The database's performance is important. There would be no use in storing the data if a query takes ages. Queries should be quite fast if possible.

I would really like to hear people's thoughts/suggestions or "go see a shrink, you must be mad" statements ;)

It just dawned on me that we're doing something that, while not the same, might be relevant. One of our tables has ~85M rows in it according to the output from an "explain select * from table". I don't plan on trying a select count(*) any time soon :) We add and remove about 25M rows a day to/from this table, which works out to about 750M rows/month. Given our current usage of the database, it could handle a larger row/day rate without too much trouble. (The problem isn't adding rows but deleting rows.)
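
In case it's useful, the quick way to get that kind of row count is to read the planner's estimate instead of counting. A minimal sketch, using "timeseries" as a placeholder for the real table name:

   -- The estimate shows up in the "rows=" figure; no table scan needed.
   explain select * from timeseries;

   -- Roughly the same number straight from the catalog; kept current by vacuum/analyze.
   select reltuples::bigint as estimated_rows
   from pg_class
   where relname = 'timeseries';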


   Column    |   Type   | Modifiers
--------------+----------+-----------
timeseriesid | bigint   |
bindata      | bytea    |
binsize      | integer  |
rateid       | smallint |
ownerid      | smallint |
Indexes:
   "idx_timeseries" btree (timeseriesid)

In this case, each bytea entry is typically about 2KB of data, so the total table size is about 150GB, plus some index overhead.
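
If you want to check an estimate like that against what's actually on disk, the size functions in newer PostgreSQL releases make it easy. A sketch, again with "timeseries" standing in for the real table name:

   -- Heap only:
   select pg_size_pretty(pg_relation_size('timeseries'));

   -- Heap plus indexes and TOAST:
   select pg_size_pretty(pg_total_relation_size('timeseries'));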

A second table has ~100M rows according to explain select *. Again, it has about 30M rows added and removed per day.

  Column   |         Type          | Modifiers
------------+-----------------------+-----------
uniqid     | bigint                |
type       | character varying(50) |
memberid   | bigint                |
tag        | character varying(50) |
membertype | character varying(50) |
ownerid    | smallint              |
Indexes:
   "composite_memberid" btree (memberid)
   "composite_uniqid" btree (uniqid)

There are some additional tables that have a few million rows / day of activity, so call it 60M rows/day added and removed. We run a vacuum every day.
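
For completeness, a sketch of a daily vacuum pass along those lines, with placeholder table names (adding analyze keeps the planner statistics current, and with this much churn it's worth checking that max_fsm_pages is large enough to track the freed pages):

   -- Reclaim the space freed by the day's deletes and refresh planner stats.
   vacuum analyze timeseries;
   vacuum analyze composite;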

The box is a dual Opteron 248 from Sun: Linux 2.6, 8GB of memory. We use reiserfs; we started with XFS but had several instances of file system corruption. Obviously, no RAID 5. The xlog is on a 2-drive mirror and the rest is on a separate mirrored volume. The drives are fiber channel, but that was a mistake as the driver from IBM wasn't very good.

So, while we don't have a billion rows, we do have ~200M total rows across all the tables, and we're certainly running at the daily row rate you'd need to hit. But scaling this sort of thing up can be tricky and your mileage may vary.

In a prior career I ran a "data intensive computing center" and helped do some design work for a high energy physics experiment: petabytes of data, big tape robots, etc., the usual Big Science toys. You might take a look at ROOT and some of the activity from those folks if you don't need transactions and all the features of a general database like PostgreSQL.

-- Alan
