On May 27, 2007, at 12:34 PM, PFC wrote:
On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby <[EMAIL PROTECTED]> wrote:
On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
This does not run a complete sort on the table, so it would be about as fast as your seq scan disk throughput. Obviously, the end result is not as good as a real CLUSTER, since the table will be made up of several ordered chunks; a range lookup on the clustered columns would therefore need at most N seeks, versus 1 for a truly clustered table. But it only scans the table once and writes it once, even counting the index rebuild.
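One way to read the "several ordered chunks" idea as SQL, assuming a posts table clustered on (topic_id, post_id) — the table name, column names, and chunk boundaries below are illustrative assumptions, not from the original post:

```sql
-- Rewrite the table as a few independently sorted chunks. Each
-- chunk is one slice of the table (here split by post_id, i.e.
-- roughly insertion order), sorted on the cluster key, so no
-- single full-table sort is ever needed.
CREATE TABLE posts_new (LIKE posts INCLUDING ALL);
INSERT INTO posts_new
  SELECT * FROM posts WHERE post_id < 1000000
  ORDER BY topic_id, post_id;
INSERT INTO posts_new
  SELECT * FROM posts WHERE post_id >= 1000000
  ORDER BY topic_id, post_id;
-- ...one INSERT per chunk. A topic spanning several chunks then
-- needs at most one seek per chunk, instead of one per post.
```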
Do you have any data that indicates such an arrangement would be substantially better than less-clustered data?
While the little benchmark that will answer your question is running, I'll add a few comments:
I have been creating a new benchmark for PostgreSQL and MySQL, which I will call the Forum Benchmark. It mimics the activity of a forum.

So far, I have interesting results about Postgres and InnoDB, and will publish an extensive report with lots of nasty stuff in it in, say, 2 weeks, since I'm doing this in my spare time.
Anyway, forums like clustered tables, specifically clustering posts on (topic_id, post_id), in order to be able to display a page with one disk seek instead of one seek per post.
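For reference, the clustering order described here corresponds to an index like the following (table and index names are assumptions; the CLUSTER syntax shown is the pre-8.3 form current at the time):

```sql
-- Assumed schema: keep posts of the same topic together on disk.
CREATE INDEX posts_topic_post_idx ON posts (topic_id, post_id);
-- Pre-8.3 syntax: CLUSTER indexname ON tablename
CLUSTER posts_topic_post_idx ON posts;
```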
PostgreSQL humiliates InnoDB on CPU-bound workloads (about 2x faster, since I run it on a dual core; InnoDB uses only one core). However, InnoDB can automatically cluster tables without maintenance. This means InnoDB will, even though it sucks and is awfully bloated, run a lot faster than Postgres if things become IO-bound, i.e. if the dataset is larger than RAM.
Postgres needs to cluster the posts table in order to keep going. CLUSTER is very slow. I tried inserting into a new posts table, ordering by (post_id, topic_id), then renaming the new table in place of the old. It is faster, but still slow when handling lots of data.
I am trying other approaches, some quite hack-ish, and will report my findings.
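The "copy in order, then swap" variant described above might look like this (names are assumed; constraints, triggers, and grants would also need recreating by hand):

```sql
BEGIN;
-- Write a fully ordered copy of the table in one pass...
CREATE TABLE posts_sorted AS
  SELECT * FROM posts ORDER BY topic_id, post_id;
-- ...then swap it in place of the old table.
DROP TABLE posts;
ALTER TABLE posts_sorted RENAME TO posts;
-- Indexes must be rebuilt afterwards:
CREATE INDEX posts_topic_post_idx ON posts (topic_id, post_id);
COMMIT;
```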
I assume you meant topic_id, post_id. :)
The problem with your proposal is that it does nothing to ensure that posts for a topic stay together as soon as the table is large enough that you can't sort it in a single pass. If you've got a long-running thread, it's still going to get spread out throughout the table.
What you really want is CLUSTER CONCURRENTLY, which I believe is on the TODO list. BUT... there's another caveat here: for any post where the row ends up being larger than 2k, the text is going to get TOASTed anyway, which means it's going to be in a separate table, in a different ordering. I don't know of a good way to address that; you can cluster the TOAST table, but you'll be clustering on an OID, which isn't going to help you.
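If it helps, the TOAST table backing a given relation can be looked up through the system catalogs (reltoastrelid is a standard pg_class column; 'posts' is an assumed table name):

```sql
-- Find the TOAST table that stores the out-of-line values
-- for the posts table.
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'posts';
```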
--
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)