On Sun, Sep 11, 2011 at 6:01 PM, Stephen Frost <[email protected]> wrote:
> * Igor Chudov ([email protected]) wrote:
> > Right now I have a personal (one user) project to create a 5-10
> > Terabyte data warehouse. The largest table will consume the most space
> > and will take, perhaps, 200,000,000 rows.
>
> I run data-warehouse databases on that order (current largest single
> instance is ~4TB running under 9.0.4). If the largest table is only
> 200M rows, PG should handle that quite well. Our data is partitioned by
> month and each month is about 200M records and simple queries can run in
> 15-20 minutes (with a single thread), with complex windowing queries
> (split up and run in parallel) finishing in a couple of hours.

Which brings up a question. Can I partition data by month (or quarter)
without that month being part of the PRIMARY KEY?

If this question sounds weird, I am asking because MySQL enforces this,
which does not fit my data.

If I can keep my primary key as the ID that I want (which comes with the
data), but still partition by month, I will be EXTREMELY happy. (See the
partitioning sketch at the end of this message.)

> > However, while an hour is fine, two weeks per query is NOT fine.
>
> What's really, really, really useful are two things: EXPLAIN, and this
> mailing list. :) Seriously, run EXPLAIN on your queries before you run
> them and see if how the query is going to be executed makes sense.
> Here's a real easy hint: if it says "External Sort" and has big numbers,
> come talk to us here- that's about one of the worst things you can
> possibly do. Of course, PG's going to avoid doing that, but you may
> have written a query (unintentionally) which forces PG to do a sort, or
> something else.

Very good, thanks.

> > I have a server with about 18 TB of storage and 48 GB of RAM, and 12
> > CPU cores.
>
> If you partition up your data and don't mind things running in different
> transactions, you can definitely get a speed boost with PG by running
> things in parallel. PG will handle that very well, in fact, if two
> queries are running against the same table, PG will actually combine
> them and only actually read the data from disk once.
>
> > I cannot shell out $47,000 per CPU for Oracle for this project.
>
> The above data warehouse was migrated from an Oracle-based system. :)

I am wondering, why?

> > To be more specific, the batch queries that I would do, I hope,
> > would either use small JOINS of a small dataset to a large dataset, or
> > just SELECTS from one big table.
>
> Make sure that you set your 'work_mem' correctly- PG will use that to
> figure out if it can hash the small table (you want that to happen,
> trust me..). If you do end up having sorts, it'll also use the work_mem
> value to figure out how much memory to use for sorting.

Could I, say, set work_mem to 30 GB? (64-bit Linux)

> > So... Can Postgres support a 5-10 TB database with the use pattern
> > stated above?
>
> Yes, certainly.

That's great to know.

> Thanks,
>
> Stephen
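
To make the partitioning question concrete: below is a rough sketch of how
month-based partitioning is typically set up on PostgreSQL of this era (9.x),
using table inheritance plus CHECK constraints. The table and column names
(facts, event_time, dim_id) are made up for illustration. Each child table can
carry its own primary key on just the id column, so the partition column does
not have to be part of it; the trade-off is that uniqueness of id is only
enforced within each child, not across all partitions.

    -- Rough sketch of inheritance-based monthly partitioning (PostgreSQL
    -- 9.x style); table and column names are hypothetical.
    CREATE TABLE facts (
        id          bigint      NOT NULL,
        dim_id      integer,
        event_time  timestamp   NOT NULL,
        payload     text
    );

    -- One child table per month, each with a CHECK constraint so that
    -- constraint_exclusion can skip partitions that cannot match a query.
    CREATE TABLE facts_2011_09 (
        CHECK (event_time >= DATE '2011-09-01' AND event_time < DATE '2011-10-01')
    ) INHERITS (facts);

    CREATE TABLE facts_2011_10 (
        CHECK (event_time >= DATE '2011-10-01' AND event_time < DATE '2011-11-01')
    ) INHERITS (facts);

    -- The primary key lives on each child and covers only id; the
    -- partition column is not required in it.  Uniqueness is enforced
    -- per child only.
    ALTER TABLE facts_2011_09 ADD PRIMARY KEY (id);
    ALTER TABLE facts_2011_10 ADD PRIMARY KEY (id);

    -- Inserts against the parent are routed to the right child by a trigger.
    CREATE OR REPLACE FUNCTION facts_insert_router() RETURNS trigger AS $$
    BEGIN
        IF NEW.event_time >= DATE '2011-09-01' AND NEW.event_time < DATE '2011-10-01' THEN
            INSERT INTO facts_2011_09 VALUES (NEW.*);
        ELSIF NEW.event_time >= DATE '2011-10-01' AND NEW.event_time < DATE '2011-11-01' THEN
            INSERT INTO facts_2011_10 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for event_time %', NEW.event_time;
        END IF;
        RETURN NULL;  -- the row was written to a child, not the parent
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER facts_insert BEFORE INSERT ON facts
        FOR EACH ROW EXECUTE PROCEDURE facts_insert_router();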

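And a small, likewise hypothetical illustration of the EXPLAIN / work_mem
advice above, reusing the facts table from the first sketch. Raising work_mem
per session (rather than server-wide) for the big batch queries lets PG hash
the small dimension table and keep sorts in memory; plain EXPLAIN shows the
chosen plan without executing the query, and a sort that spills to disk shows
up in EXPLAIN ANALYZE output as an external merge. Because work_mem is granted
per sort/hash operation in each query, a global 30 GB setting could exhaust
RAM once several such operations run concurrently.

    -- Hypothetical pre-flight check for a small-to-large join against the
    -- facts table sketched above.
    CREATE TABLE small_dim (
        dim_id  integer PRIMARY KEY,
        label   text
    );

    -- Raise work_mem for this session only; it is used per sort/hash node,
    -- so a huge server-wide value is risky under concurrency.
    SET work_mem = '2GB';

    -- Plain EXPLAIN prints the plan without running the query; look for a
    -- Hash Join on the small table rather than a sort-heavy plan.
    EXPLAIN
    SELECT f.id, d.label
    FROM facts AS f
    JOIN small_dim AS d ON d.dim_id = f.dim_id
    WHERE f.event_time >= DATE '2011-09-01'
      AND f.event_time <  DATE '2011-10-01';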