Csaba Nagy wrote:
Hmm. You could use something along these lines instead:

1. LOCK TABLE queue_table
2. SELECT * INTO queue_table_new FROM queue_table
3. DROP TABLE queue_table
4. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you just want to remove old tuples.
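
Spelled out as a single transaction, that would be something like this (a sketch; note that SELECT INTO copies none of the indexes, defaults, constraints or permissions of the original table, so those would have to be recreated by hand):

BEGIN;
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
SELECT * INTO queue_table_new FROM queue_table;
DROP TABLE queue_table;
ALTER TABLE queue_table_new RENAME TO queue_table;
COMMIT;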

... and then restart the app so all my pooled connections drop their
cached plans ;-)

Yeah, though Tom's working on plan invalidation for 8.3, so that wouldn't be an issue.

Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
to, it will fail.

Hmm. How about:

1. LOCK TABLE queue_table
2. SELECT * INTO temp_table FROM queue_table
3. TRUNCATE queue_table
4. INSERT INTO queue_table SELECT * FROM temp_table

That way you're copying the rows twice, but if there aren't many live tuples it shouldn't matter too much.
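
For example (a sketch; the temp table name is arbitrary, and any indexes on queue_table stay in place since the table itself is never dropped):

BEGIN;
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
SELECT * INTO TEMP temp_table FROM queue_table;
TRUNCATE queue_table;
INSERT INTO queue_table SELECT * FROM temp_table;
COMMIT;

TRUNCATE is transactional, so if anything fails the whole block rolls back and the old contents are still there.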

As a long-term solution, it would be nice if we had more fine-grained bookkeeping of the snapshots in use in the system. In your case, there are a lot of tuples that are not visible to pg_dump because their xmin is too new, and also not visible to any other transaction because their xmax is too old. If we had a way to recognize situations like that and vacuum those tuples, much of the problem with long-running transactions would go away.

In the general case that won't work either, in a strict MVCC sense... if you have an old transaction, you should never clean up a dead tuple which could still be visible to it.

We wouldn't clean up tuples that are visible to some transaction. But if you have one long-running transaction, like pg_dump, in a database with otherwise short transactions, you'll have a lot of tuples that are not vacuumable because of the long-running process, yet not in fact visible to any transaction: tuples that were inserted too late to be seen by the old transaction, and deleted too long ago to be seen by any of the newer ones. Let me illustrate this with a timeline:

     xmin1    xmax1
     |        |
-----+--X-X+X-+ooooooooooooooXoooooXoXoXXo+------>now
           |                              |
           xmin2                          xmax2

xmin1 and xmax1 are the xmin and xmax of an old, long-running serializable transaction, like pg_dump. The Xs between them are xids of transactions that the old transaction sees as in-progress, IOW the SnapshotData.xip-array.

xmin2 and xmax2 are the xmin and xmax of a newer transaction. Because the old transaction is still running, xmin2 is far behind xmax2, but there's a wide gap between xmin2 and the next transaction that the newer transaction sees as in-progress.

The current rule for determining whether a tuple is dead is to check that the tuple's xmax < OldestXmin, which in this case is xmin1. But in addition to that, any tuple with an xmin > xmax1, and an xmax that's not in the xip-array of any snapshot in use (marked with o above), isn't visible to any current or future transaction and can therefore be safely vacuumed.
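
You can see the current rule biting with a simple test (illustration only; the table and the WHERE clause are made up):

-- session 1: take and hold an old snapshot, like pg_dump does
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM queue_table;

-- session 2: normal churn, then an attempted cleanup
DELETE FROM queue_table WHERE processed;
VACUUM VERBOSE queue_table;

As long as session 1 sits on its snapshot, VACUUM VERBOSE counts the deleted rows under "dead row versions cannot be removed yet", even rows that were inserted and deleted entirely after session 1's snapshot and so are invisible to every snapshot in the system.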

The implementation problem is that we don't have a global view of all the snapshots in the system. If we solved that, we could be more aggressive with vacuuming in the presence of long-running transactions. It's not an easy problem: we don't want to add a lot of accounting overhead, but maybe some kind of approximation of the global state, maintained with little overhead, would give most of the benefit.
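
For what it's worth, the txid functions (in Skytools today and, if I remember right, proposed for 8.3) let a backend at least inspect its own snapshot's xmin, xmax and xip-array:

SELECT txid_snapshot_xmin(txid_current_snapshot());
SELECT txid_snapshot_xmax(txid_current_snapshot());
SELECT * FROM txid_snapshot_xip(txid_current_snapshot());  -- the Xs in the diagram

The missing piece is exactly the global view: each backend knows only its own snapshots, not everyone else's.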

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
