Csaba Nagy wrote:
Hmm. You could use something along these lines instead:

1. LOCK TABLE queue_table
2. SELECT * INTO queue_table_new FROM queue_table
3. DROP TABLE queue_table
4. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you just want to remove old tuples.
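
Spelled out as a single transaction, that would be something like this (a sketch; note that SELECT INTO copies none of the indexes, defaults, constraints or permissions of the original table, so those would have to be recreated by hand):

BEGIN;
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
SELECT * INTO queue_table_new FROM queue_table;
DROP TABLE queue_table;
ALTER TABLE queue_table_new RENAME TO queue_table;
COMMIT;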

... and then restart the app so all my pooled connections drop their
cached plans ;-)

Yeah, though Tom's working on plan invalidation for 8.3, so that wouldn't be an issue.

Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
to, it will fail.

Hmm. How about:

1. LOCK TABLE queue_table
2. SELECT * INTO temp_table FROM queue_table
3. TRUNCATE queue_table
4. INSERT INTO queue_table SELECT * FROM temp_table

That way you're copying the rows twice, but if there aren't many live tuples it shouldn't matter too much.
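
For example (a sketch; the temp table name is arbitrary, and any indexes on queue_table stay in place since the table itself is never dropped):

BEGIN;
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
SELECT * INTO TEMP temp_table FROM queue_table;
TRUNCATE queue_table;
INSERT INTO queue_table SELECT * FROM temp_table;
COMMIT;

TRUNCATE is transactional, so if anything fails the whole block rolls back and the old contents are still there.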

As a long-term solution, it would be nice if we had more fine-grained bookkeeping of the snapshots in use in the system. In your case, there are a lot of tuples that are not visible to pg_dump because their xmin is too new, and also not visible to any other transaction because their xmax is too old. If we had a way to recognize situations like that and vacuum those tuples, much of the problem with long-running transactions would go away.

In the general case that won't work either, in a strict MVCC sense... if you have an old transaction, you should never clean up a dead tuple which could still be visible to it.

We wouldn't clean up tuples that are visible to some transaction. But if you have one long-running transaction, like pg_dump, in a database with otherwise short transactions, you'll have a lot of tuples that are not vacuumable because of the long-running process, yet not in fact visible to any transaction: tuples that were inserted too late to be seen by the old transaction, and deleted too long ago to be seen by any of the newer ones. Let me illustrate this with a timeline:

     xmin1    xmax1
     |        |
-----+--X-X+X-+ooooooooooooooXoooooXoXoXXo+------>now
           |                              |
           xmin2                          xmax2

xmin1 and xmax1 are the xmin and xmax of an old, long-running serializable transaction, like pg_dump. The Xs between them are xids of transactions that the old transaction sees as in-progress, IOW the SnapshotData.xip-array.

xmin2 and xmax2 are the xmin and xmax of a newer transaction. Because the old transaction is still running, xmin2 is far behind xmax2, but there's a wide gap between xmin2 and the next transaction that the newer transaction sees as in-progress.

The current rule for determining whether a tuple is dead is to check that the tuple's xmax < OldestXmin, which in this case is xmin1. But in addition to that, any tuple with an xmin > xmax1, and an xmax that's not in the xip-array of any snapshot in use (marked with o above), isn't visible to any current or future transaction and can therefore be safely vacuumed.
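
You can see the current rule biting with a simple test (illustration only; the table and the WHERE clause are made up):

-- session 1: take and hold an old snapshot, like pg_dump does
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM queue_table;

-- session 2: normal churn, then an attempted cleanup
DELETE FROM queue_table WHERE processed;
VACUUM VERBOSE queue_table;

As long as session 1 sits on its snapshot, VACUUM VERBOSE counts the deleted rows under "dead row versions cannot be removed yet", even rows that were inserted and deleted entirely after session 1's snapshot and so are invisible to every snapshot in the system.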

The implementation problem is that we don't have a global view of all the snapshots in the system. If we solved that, we could be more aggressive with vacuuming in the presence of long-running transactions. It's not an easy problem: we don't want to add a lot of accounting overhead, but maybe some kind of approximation of the global state, maintained with little overhead, would give most of the benefit.
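
For what it's worth, the txid functions (in Skytools today and, if I remember right, proposed for 8.3) let a backend at least inspect its own snapshot's xmin, xmax and xip-array:

SELECT txid_snapshot_xmin(txid_current_snapshot());
SELECT txid_snapshot_xmax(txid_current_snapshot());
SELECT * FROM txid_snapshot_xip(txid_current_snapshot());  -- the Xs in the diagram

The missing piece is exactly the global view: each backend knows only its own snapshots, not everyone else's.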

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
