On Sun, Sep 11, 2011 at 6:01 PM, Stephen Frost <[email protected]> wrote:
> * Igor Chudov ([email protected]) wrote:
> > Right now I have a personal (one user) project to create a 5-10
> > Terabyte data warehouse. The largest table will consume the most space
> > and will take, perhaps, 200,000,000 rows.
>
> I run data-warehouse databases on that order (current largest single
> instance is ~4TB running under 9.0.4). If the largest table is only
> 200M rows, PG should handle that quite well. Our data is partitioned by
> month and each month is about 200M records and simple queries can run in
> 15-20 minutes (with a single thread), with complex windowing queries
> (split up and run in parallel) finishing in a couple of hours.

Which brings up a question. Can I partition data by month (or quarter)
without that month being part of the PRIMARY KEY?

If this question sounds weird, I am asking because MySQL enforces this,
which does not fit my data.

If I can keep my primary key as the ID that I want (which comes with the
data), but still partition by month, I will be EXTREMELY happy. (See the
partitioning sketch at the end of this message.)

> > However, while an hour is fine, two weeks per query is NOT fine.
>
> What's really, really, really useful are two things: EXPLAIN, and this
> mailing list. :) Seriously, run EXPLAIN on your queries before you run
> them and see if how the query is going to be executed makes sense.
> Here's a real easy hint: if it says "External Sort" and has big numbers,
> come talk to us here- that's about one of the worst things you can
> possibly do. Of course, PG's going to avoid doing that, but you may
> have written a query (unintentionally) which forces PG to do a sort, or
> something else.

Very good, thanks.

> > I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> > CPU cores.
>
> If you partition up your data and don't mind things running in different
> transactions, you can definitely get a speed boost with PG by running
> things in parallel. PG will handle that very well, in fact, if two
> queries are running against the same table, PG will actually combine
> them and only actually read the data from disk once.
>
> > I cannot shell out $47,000 per CPU for Oracle for this project.
>
> The above data warehouse was migrated from an Oracle-based system. :)

I am wondering, why?

> > To be more specific, the batch queries that I would do, I hope,
> > would either use small JOINS of a small dataset to a large dataset, or
> > just SELECTS from one big table.
>
> Make sure that you set your 'work_mem' correctly- PG will use that to
> figure out if it can hash the small table (you want that to happen,
> trust me..). If you do end up having sorts, it'll also use the work_mem
> value to figure out how much memory to use for sorting.

Could I, say, set work_mem to 30 GB? (64-bit Linux)

> > So... Can Postgres support a 5-10 TB database with the use pattern
> > stated above?
>
> Yes, certainly.

That's great to know.

> Thanks,
>
> Stephen
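
To make the partitioning question concrete: below is a rough sketch of how
month-based partitioning is typically set up on PostgreSQL of this era (9.x),
using table inheritance plus CHECK constraints. The table and column names
(facts, event_time, dim_id) are made up for illustration. Each child table can
carry its own primary key on just the id column, so the partition column does
not have to be part of it; the trade-off is that uniqueness of id is only
enforced within each child, not across all partitions.

    -- Rough sketch of inheritance-based monthly partitioning (PostgreSQL
    -- 9.x style); table and column names are hypothetical.
    CREATE TABLE facts (
        id          bigint      NOT NULL,
        dim_id      integer,
        event_time  timestamp   NOT NULL,
        payload     text
    );

    -- One child table per month, each with a CHECK constraint so that
    -- constraint_exclusion can skip partitions that cannot match a query.
    CREATE TABLE facts_2011_09 (
        CHECK (event_time >= DATE '2011-09-01' AND event_time < DATE '2011-10-01')
    ) INHERITS (facts);

    CREATE TABLE facts_2011_10 (
        CHECK (event_time >= DATE '2011-10-01' AND event_time < DATE '2011-11-01')
    ) INHERITS (facts);

    -- The primary key lives on each child and covers only id; the
    -- partition column is not required in it.  Uniqueness is enforced
    -- per child only.
    ALTER TABLE facts_2011_09 ADD PRIMARY KEY (id);
    ALTER TABLE facts_2011_10 ADD PRIMARY KEY (id);

    -- Inserts against the parent are routed to the right child by a trigger.
    CREATE OR REPLACE FUNCTION facts_insert_router() RETURNS trigger AS $$
    BEGIN
        IF NEW.event_time >= DATE '2011-09-01' AND NEW.event_time < DATE '2011-10-01' THEN
            INSERT INTO facts_2011_09 VALUES (NEW.*);
        ELSIF NEW.event_time >= DATE '2011-10-01' AND NEW.event_time < DATE '2011-11-01' THEN
            INSERT INTO facts_2011_10 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for event_time %', NEW.event_time;
        END IF;
        RETURN NULL;  -- the row was written to a child, not the parent
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER facts_insert BEFORE INSERT ON facts
        FOR EACH ROW EXECUTE PROCEDURE facts_insert_router();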

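And a small, likewise hypothetical illustration of the EXPLAIN / work_mem
advice above, reusing the facts table from the first sketch. Raising work_mem
per session (rather than server-wide) for the big batch queries lets PG hash
the small dimension table and keep sorts in memory; plain EXPLAIN shows the
chosen plan without executing the query, and a sort that spills to disk shows
up in EXPLAIN ANALYZE output as an external merge. Because work_mem is granted
per sort/hash operation in each query, a global 30 GB setting could exhaust
RAM once several such operations run concurrently.

    -- Hypothetical pre-flight check for a small-to-large join against the
    -- facts table sketched above.
    CREATE TABLE small_dim (
        dim_id  integer PRIMARY KEY,
        label   text
    );

    -- Raise work_mem for this session only; it is used per sort/hash node,
    -- so a huge server-wide value is risky under concurrency.
    SET work_mem = '2GB';

    -- Plain EXPLAIN prints the plan without running the query; look for a
    -- Hash Join on the small table rather than a sort-heavy plan.
    EXPLAIN
    SELECT f.id, d.label
    FROM facts AS f
    JOIN small_dim AS d ON d.dim_id = f.dim_id
    WHERE f.event_time >= DATE '2011-09-01'
      AND f.event_time <  DATE '2011-10-01';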