Ramon Bastiaans wrote:
I am doing research for a project of mine where I need to store several billion values for a monitoring and historical tracking system for a big computer system. My currect estimate is that I have to store (somehow) around 1 billion values each month (possibly more).
If you have that 1 billion perfectly distributed over all hours of the day, then you need 1e9/30/24/3600 = 385 transactions per second.
Which I'm pretty sure is possible with postgres, you just need pretty beefy hardware. And like Jeff said, lots of disks for lots of IO. Like a quad opteron, with 16GB of ram, and around 14-20 very fast disks. raid10 not raid5, etc. To improve query performance, you can do some load balancing by having replication machines by using Slony.
Or if you can do batch processing, you could split up the work into a few update machines, which then do bulk updates on the master database. This lets you get more machines into the job, since you can't share a database across multiple machines.
I was wondering if anyone has had any experience with these kind of big numbers of data in a postgres sql database and how this affects database design and optimization.
Well, one of the biggest things is if you can get bulk updates, or if clients can handle data being slightly out of date, so you can use cacheing. Can you segregate your data into separate tables as much as possible? Are your clients okay if aggregate information takes a little while to update?
One trick is to use semi-lazy materialized views to get your updates to be fast.
What would be important issues when setting up a database this big, and is it at all doable? Or would it be a insane to think about storing up to 5-10 billion rows in a postgres database.
I think you if you can design the db properly, it is doable. But if you have a clients saying "I need up to the second information on 1 billion rows", you're never going to get it.
The database's performance is important. There would be no use in storing the data if a query will take ages. Query's should be quite fast if possible.
Again, it depends on the queries being done. There are some nice tricks you can use, like doing a month-by-month partitioning (if you are getting 1G inserts, you might want week-by-week partitioning), and then with a date column index, and a union all view you should be able to get pretty good insert speed, and still keep fast *recent* queries. Going through 1billion rows is always going to be expensive.
I would really like to hear people's thoughts/suggestions or "go see a shrink, you must be mad" statements ;)
I think it would be possible, but there are a lot of design issues with a system like this. You can't go into it thinking that you can design a multi billion row database the same way you would design a million row db.
Description: OpenPGP digital signature