Added to TODO: Improve CLUSTER performance by sorting to reduce random I/O * http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
--------------------------------------------------------------------------- Gregory Stark wrote: > > One thing that's been annoying me for a while is that our CLUSTER > implementation is really very slow. When I say very slow I mean it's really > very very very slow. > > The problem is that it does a full index scan and looks up each tuple in the > order of the index. That means it a) is doing a lot of random i/o and b) has > to access the same pages over and over again. > > Our query planner knows better and normally does a sequential heap scan and > sort for queries of this type. But CLUSTER has to use the index. > > Just the other day I wanted to randomize the order of a table so I added a > random column, indexed it and started cluster running. After 6 hours I looked > at how far it had gotten and found it was only about 20% done rewriting the > heap (not including rebuilding the indexes). I cancelled it and built a new > table in the desired orderwith CREATE TABLE AS ... ORDER BY in 45 minutes! > > There are a couple problems with this: > > a) We need some way to decide *when* to do a sort and when to do an index > scan. The planner has all this machinery but we don't really have all the > pieces handy to use it in a utility statement. This is especially important > for the case where we're doing a cluster operation on a table that's already > clustered. In that case an index scan could conceivably actually win (though I > kind of doubt it). I don't really have a solution for this. > > b) tuplesort no longer has the pieces needed to sort whole tuples including > visibility info. And actually even the old pieces that were removed had not > quite the right interface and behaviour. We need to preserve t_self for the > heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to > generate the scan keys that match the index. The cleanest approach I think is > to just add back in the old raw tuple methods to tuplesort and tweak them to > preserve t_self and take Scankeys instead of attnums etc. > > c) expression indexes are a problem. We would have to start with constructing > new tuples to hold the expression values but retain the entire original heap > tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store > both in the sorttuple? For now I figure we can just punt on expression indexes > and always do an index sacn. > > I tested out the idea and it works reasonably well. The code might need some > cleanup including, I think refactoring cluster_rel() and possibly tweaking the > interface to tuplesort. But I'm fairly happy with it. > > The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop > using maintenance_work_mem of 768MB and work_mem of 256MB: > > postgres=# \d x > > Table "public.x" > Column | Type | Modifiers > --------+------------------+----------- > i | integer | > r | double precision | > repeat | text | > Indexes: > "xi" btree (i) > "xi2" btree ((i + 0)) > "xir" btree (r) CLUSTER > > postgresql=# CLUSTER xi ON x; > CLUSTER > Time: 736029.322 ms > > postgresql=# CLUSTER xir ON x; > CLUSTER > Time: 723610.507 ms > > postgresql=# CLUSTER xi2 ON x; > CLUSTER > Time: 12842793.346 ms > > > That's 3.5 hours for traditional cluster and 12 minutes for cluster using a > sort... > [ Attachment, skipping... ] > > -- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > Ask me about EnterpriseDB's RemoteDBA services! > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <br...@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers