Paul Lathrop wrote:
> We run 4 ~25-30Gb databases which cache information from eBay. These
> databases have had performance issues since before I joined the company.
> The databases have gone through a number of iterations. Initially, they
> were deployed as one huge database - performance was apparently
> unacceptable. They were split, and tried on a variety of hardware
> platforms. When I joined the company last year, the databases were
> deployed on 12-disk RAID5 arrays on dual-proc AMD machines with 4Gb of
> RAM, running Debian Woody and Postgres 7.2.

Well, first of all you need to upgrade. 7.2 is old and not supported anymore.

> These systems seemed to
> suffer a gradually decreasing performance accompanied by a gradually
> growing disk space usage. The DBA had come to the conclusion that the
> VACUUM command did/does not work on these systems, because even after a
> VACUUM FULL, the size of the database was continually increasing. So, as
> things stand with the PG7.2 machines, vacuuming is run nightly, and
> whenever the database size reaches 40Gb on disk (the point at which
> performance has degraded below tolerance), the DBA exports the data,
> deletes the database, and then imports the data, shrinking it to the
> actual size of the dataset.

Vacuum didn't reclaim empty index pages until 7.4, so you might be suffering from index bloat. A nightly reindex would help with that.
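For example, something along these lines as part of the nightly maintenance (the table and index names here are made up, substitute your own):

  REINDEX TABLE listings;        -- rebuilds all indexes on the table
  REINDEX INDEX listings_pkey;   -- or rebuild just one index

Keep in mind that REINDEX blocks writes to the table while it runs, so it belongs in the same maintenance window as the nightly vacuum.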

> This process is time-consuming, costly, and the servers that we are
> deployed on do not meet our stability requirements. So, after much
> pushing, I was able to deploy a 12-disk RAID5 dual-proc AMD64 machine
> with 16Gb of RAM running FreeBSD and Postgres 8.1.

You should give 8.2 (now in beta) a try as well. There are some significant performance enhancements; for example, vacuums should run faster.

> 1) How does one define 'performance' anyway? Is it average time to
> complete a query? If so, what kind of query? Is it some other metric?

Performance is really an umbrella term that can mean a lot of things. You'll have to come up with a metric that's most meaningful to you and that you can easily measure. Some typical metrics are:

* Average response time to a query/queries
* Max or 90th percentile response time to a query
* Throughput, in transactions per second

You'll have to start measuring performance somehow. You might find, for example, that performance is only bad during certain hours of the day. For starters, take a look at the log_min_duration_statement parameter available in more recent versions.
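For example, with a purely illustrative threshold of 200 ms:

  -- in postgresql.conf:  log_min_duration_statement = 200
  -- or, as a superuser, for the current session only:
  SET log_min_duration_statement = 200;
  -- every statement that runs longer than 200 ms is now written to the
  -- server log together with its duration; -1 disables it, 0 logs everything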

> 2) I've combed the archives and seen evidence that people out there are
> running much much larger databases on comparable hardware with decent
> performance. Is this true, or is my dataset at about the limit of my
> hardware?

It depends on your load, really. A dataset of ~ 40 GB is certainly not that big compared to what some people have.

> 3) Though this may seem irrelevant, since we are moving away from the
> platform, it would still be good to know - was VACUUM actually
> completely useless on PG7.2 or is there some other culprit on these
> legacy machines?

It's certainly better nowadays. As noted above, plain VACUUM before 7.4 couldn't reclaim empty index pages, so index bloat is a likely culprit on the legacy machines.

> 4) Much of my reading of the PG docs and list archives seems to suggest
> that much of performance tuning is done at the query level - you have to
> know how to ask for information in an efficient way. To that end, I took
> a look at some of the queries we get on a typical day. On average, 24
> times per minute, our application causes a unique key violation. This
> struck me as strange, but the VP of Engineering says this is a
> performance ENHANCEMENT - the code doesn't bother checking for the
> unique key because it depends on the database to enforce that. My
> interpretation of reading the archives & docs seems to indicate that
> this could be causing the constantly increasing database size... so now
> that I've rambled about it, does an INSERT transaction that is rolled
> back due to a unique key violation leave dead rows in the table? If so, why?

The way unique checking works in PostgreSQL is:

1. The row is inserted into the heap.
2. The corresponding index tuple is inserted into the index. While doing that, we check that there's no duplicate key there already.

So yes, a unique key violation will leave the dead tuple in the heap, and it will be removed by vacuum later on.
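You can see this for yourself with something like the following (the table name and values are just an example):

  CREATE TABLE item_cache (item_id integer PRIMARY KEY, payload text);

  INSERT INTO item_cache VALUES (1, 'first copy');   -- succeeds
  INSERT INTO item_cache VALUES (1, 'second copy');  -- fails with a unique key violation

  -- the failed INSERT has still written a heap tuple; vacuum reports it
  -- as a removed dead row version:
  VACUUM VERBOSE item_cache;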

I think it's a valid and sane approach to leave the uniqueness check to the database. Unless a very large proportion of your transactions abort due to unique key violations, the dead rows won't be a problem. The space will be reclaimed by vacuum.

In general, it's normal for there to be some dead rows in the database. As long as you vacuum regularly, the database size should eventually reach a steady state where it doesn't grow anymore, unless the real live dataset size increases.
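If I remember right, 8.1 also has the size functions built in, so you can watch whether the size actually levels off once regular vacuuming is in place (using the same example table as above):

  SELECT pg_size_pretty(pg_database_size(current_database()));
  SELECT pg_size_pretty(pg_relation_size('item_cache'));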

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
