Guy,
The application is fairly straightforward, but as you say, what is
working okay with BigDBMS isn't working as well under PG. I'm going to
try other configuration suggestions made by others before I attempt
logic changes. The core logic is unchangeable; millions of rows of data
in a single table will be updated throughout the day. If PG can't
handle high-volume updates well, this may be a brick wall.
Here are a couple of things I learned.
ANALYZE is VERY important, surprisingly so even for small tables. I had a case last week
where a temporary "scratch" table with just 100 rows was joined to two more
tables of 6 and 12 million rows. You might think that a 100-row table wouldn't need to
be analyzed, but it does: Without the ANALYZE, Postgres generated a horrible plan that
took many minutes to run; with the ANALYZE, it took milliseconds. Any time a table's
contents change dramatically, ANALYZE it, ESPECIALLY if it's a small table. After all,
changing 20 rows in a 100-row table has a much larger effect on its statistics than
changing 20 rows in a million-row table.
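For example, something along these lines is what made the difference for me
(a minimal sketch; big_a, big_b, and scratch_ids are made-up names, not my
actual schema):

    -- A small scratch table of ids, joined against two multi-million-row tables.
    CREATE TEMP TABLE scratch_ids (id integer PRIMARY KEY);

    INSERT INTO scratch_ids
    SELECT id FROM big_a ORDER BY random() LIMIT 100;

    -- Without this, the planner has no statistics for the 100-row table
    -- and can pick a disastrous join plan.
    ANALYZE scratch_ids;

    SELECT a.id, b.val
      FROM scratch_ids s
      JOIN big_a a ON a.id = s.id
      JOIN big_b b ON b.a_id = a.id;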
Postgres aggregate functions like count() and max() are "plug-ins," which has huge architectural advantages.
But in pre-8.1 releases there was a big speed penalty for this: functions like count() and max() were very, very slow,
requiring a full table scan. I think this is vastly improved from 8.0.x to 8.1 and forward; others might be
able to comment on whether count() is now as fast in Postgres as in Oracle. The well-known idiom to replace
max() was "select col from tbl order by col desc limit 1". It worked miracles for my app.
Postgres has explicit garbage collection via VACUUM, and you have to design your
application with this in mind. In Postgres, update is delete+insert, meaning updates
create garbage. If you have very "wide" tables but only a subset of the
columns is updated frequently, put those columns in a separate table with an index to
join the two tables. For example, my original design was something like this:
integer primary key
very large text column
... a bunch of integer columns, float columns, and small text columns
The properties were updated by the application, but the large text column never
changed. This led to huge garbage-collection problems as the large text field
was repeatedly deleted and reinserted by the updates. By separating these into
two tables, one holding the large text column and the other holding the
dynamic but smaller columns, garbage was massively reduced and performance
increased, both immediately (a smaller data set to update) and long term (smaller
vacuums). You can use views to recreate the original combined columns, so the
changes to your app are limited to where updates occur.
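A sketch of what that split looks like (the table and column names here are
made up, not my actual schema):

    -- The "static" half: the huge, effectively write-once text column.
    CREATE TABLE item_static (
        item_id   integer PRIMARY KEY,
        big_text  text
    );

    -- The "dynamic" half: the small columns that change all day long.
    CREATE TABLE item_dynamic (
        item_id   integer PRIMARY KEY REFERENCES item_static(item_id),
        score     float8,
        hits      integer,
        label     varchar(40)
    );

    -- Updates now rewrite only the narrow row, so far less dead space is
    -- created and VACUUM has far less work to do.
    UPDATE item_dynamic SET hits = hits + 1, score = 0.97 WHERE item_id = 42;

    -- A view recreates the original combined shape, so only the code that
    -- performs updates needs to change.
    CREATE VIEW item AS
    SELECT s.item_id, s.big_text, d.score, d.hits, d.label
      FROM item_static s JOIN item_dynamic d USING (item_id);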
If you have a column that is *frequently* updated (say, for example, a user's
last-access timestamp each time s/he hits your web server) then you definitely
want this in its own table, not mixed in with the user's name, address, etc.
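Same idea, sketched with hypothetical names:

    -- Keep the hot timestamp out of the wide users table.
    CREATE TABLE user_last_access (
        user_id     integer PRIMARY KEY,   -- references users(user_id)
        last_access timestamptz NOT NULL
    );

    -- This tiny row is the only thing rewritten on every page hit.
    UPDATE user_last_access SET last_access = now() WHERE user_id = 123;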
Partitioning in Postgres is more powerful than in Oracle. Use it if you can.
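In the 8.x series that means table inheritance plus constraint exclusion;
a minimal sketch with made-up names:

    -- Parent table; queries go against this.
    CREATE TABLE events (
        id         integer NOT NULL,
        logged_on  date    NOT NULL,
        payload    text
    );

    -- One child per month, each with a CHECK constraint so the planner
    -- can skip partitions that can't match.
    CREATE TABLE events_2007_01 (
        CHECK (logged_on >= DATE '2007-01-01' AND logged_on < DATE '2007-02-01')
    ) INHERITS (events);

    CREATE TABLE events_2007_02 (
        CHECK (logged_on >= DATE '2007-02-01' AND logged_on < DATE '2007-03-01')
    ) INHERITS (events);

    SET constraint_exclusion = on;
    SELECT count(*) FROM events WHERE logged_on >= DATE '2007-01-15';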
Partial indexes are VERY nice in Postgres, if your data is unevenly distributed
(for example, a mostly-NULL column with a small percentage of very important
values).
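For example (hypothetical names):

    -- Most rows have priority IS NULL; only the few non-null values are
    -- ever searched, so index just those rows.
    CREATE INDEX orders_priority_idx ON orders (priority)
        WHERE priority IS NOT NULL;

    -- A query repeating the predicate can use the small partial index.
    SELECT * FROM orders WHERE priority IS NOT NULL AND priority > 5;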
I'm sure there are more things that others can contribute.
Craig