Slight update:

Thanks for the replies; this is starting to make a little more sense...

I've managed to track down the root of the problem to a single query on
a single table.  I have a query that looks like this:
       select count(*) from members where group_id = ? and member_id >

The members table contains about 500k rows.  It has an index on
(group_id, member_id) and on (member_id, group_id).
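For reference, the two indexes would have been created with something like this (the index names here are my guesses, not the actual DDL):

```sql
-- Hypothetical index names; column orders match the description above
CREATE INDEX members_group_member_idx ON members (group_id, member_id);
CREATE INDEX members_member_group_idx ON members (member_id, group_id);
```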

It seems like the statistics are wildly different depending on whether
the last operation on the table was a 'vacuum analyze' or an 'analyze'.
Vacuum or vacuum-analyze puts the correct number (~500k) in
pg_class.reltuples, but analyze puts 7000 in pg_class.reltuples.  The
reltuples numbers associated with this table's indexes are unchanged.
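Here's a sketch of the check I'm using to watch the estimate flip (the values in the comments are what I observe on this table):

```sql
ANALYZE members;
SELECT relname, relpages, reltuples
  FROM pg_class WHERE relname = 'members';   -- reltuples ~7000

VACUUM ANALYZE members;
SELECT relname, relpages, reltuples
  FROM pg_class WHERE relname = 'members';   -- reltuples ~500k
```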

After a vacuum-analyze, the query correctly uses the index on (group_id,
member_id), and runs very fast (sub-millisecond reported by explain
analyze).  After an analyze, the query uses the (member_id, group_id)
index, and the query takes much longer (150ms reported by explain
analyze).  (Yes, I said the 2 databases were using the same query plan;
it turns out they're only sometimes using the same query plan. :( )
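For the record, I'm comparing the plans with something along these lines (the literal values are placeholders standing in for the real bind parameters):

```sql
EXPLAIN ANALYZE
SELECT count(*) FROM members
 WHERE group_id = 1 AND member_id > 0;   -- placeholder values
```

After a vacuum analyze this reports an index scan on (group_id, member_id); after a plain analyze it switches to the (member_id, group_id) index.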

A similar problem happens to some of my other tables (according to
pg_class.reltuples), although I haven't seen query performance change as
a result on those.

Any idea what could cause this bad analyze behavior?  Any guesses why
this has happened in one of my data centers but not both?  (Coincidence
isn't a big stretch here.)  What can I do to stop or change this
behavior?  Apologies if this is a known problem...


-----Original Message-----
[mailto:[EMAIL PROTECTED] On Behalf Of Michael
Sent: Friday, June 04, 2004 10:43 AM
Subject: [PERFORM] postgres performance: comparing 2 data centers

I have two instances of a production application that uses Postgres 7.2,
deployed in two different data centers for about the last 6 months.  The
sizes, schemas, configurations, hardware, and access patterns of the two
databases are nearly identical, but one consistently takes at least 5x
longer than the other for some common operations.  During this time, CPU
usage and IO on the slow database are both high (sustained); I'm not
sure about the fast database.  These common operations are chatty - at
least tens of thousands of queries over a 5 to 60 minute stretch - but
the queries themselves are fairly simple.  The query plans are identical
across both databases, and the data distribution is comparable.  The
tables involved in these common operations change frequently, and are
indexed according to these queries.  The queries use the indexes as
expected.  The tables involved have 50k-500k rows.

We 'vacuum analyze' nightly, and we recently rebuilt the indexes on the
slow database (using reindex table).  This cut the number of index pages
dramatically: from ~1800 to ~50, but didn't noticeably change the time
or CPU utilization for the common operations described above.

When running pgbench, both databases have very similar results (200-260
over multiple runs with 5 concurrent threads).

I know of a few things I can do to make this operation marginally
simpler, but I'm most interested in the difference between the two
databases.

I haven't come up with a theory that explains each of these things. What
are some things I can look into to track this down further?


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match
