On Sun, 27 May 2007 19:34:30 +0200, PFC <[EMAIL PROTECTED]> wrote:

On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby <[EMAIL PROTECTED]> wrote:

On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
        This does not run a complete sort on the table. It would be about as
fast as your seq scan disk throughput. Obviously, the end result is not as
good as a real CLUSTER, since the table will be made up of several ordered
chunks. Therefore, a range lookup on the clustered columns would need at most
N seeks, versus 1 for a really clustered table. But it only scans the table
once and writes it once, even counting the index rebuild.

Do you have any data that indicates such an arrangement would be
substantially better than less-clustered data?

While the little benchmark that will answer your question is running, I'll add a few comments :

        Alright, so far :

This is a simulated forum workload, so it's mostly post insertions, some edits, and some topic deletes. It will give results applicable to forums, obviously, but also to anything that works on the same schema :
        - topics + posts
        - blog articles + comments
        - e-commerce site where users can enter their reviews
So, with the new trend being to let users participate, this kind of workload will become more and more relevant for websites.

So, how to cluster the posts table on (topic_id, post_id) to get all the posts of the same webpage in 1 seek ?
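
        For reference, here is a minimal version of the schema in question (the table and column names are illustrative, not the actual benchmark schema) :

    CREATE TABLE topics (
        topic_id  SERIAL PRIMARY KEY,
        title     TEXT NOT NULL
    );

    CREATE TABLE posts (
        post_id   SERIAL PRIMARY KEY,
        topic_id  INTEGER NOT NULL REFERENCES topics,
        content   TEXT NOT NULL
    );

    -- the index the "show topic" page uses ; ideally the heap order matches it
    CREATE INDEX posts_topic_post ON posts (topic_id, post_id);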

        I am benchmarking the following :
        - CLUSTER obviously
- Creating a new table with INSERT .. SELECT ORDER BY topic_id, post_id, then reindexing, etc. (see the sketch after this list)
        - not doing anything (just vacuuming all tables)
        - not even vacuuming the posts table.
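
        For the INSERT .. SELECT option, the statement is roughly the following (a sketch only : posts_sorted is a made-up name, and constraints, defaults and triggers would have to be recreated by hand) :

    BEGIN;
    -- rewrite the table in the desired physical order
    CREATE TABLE posts_sorted AS
        SELECT * FROM posts ORDER BY topic_id, post_id;
    CREATE INDEX posts_sorted_topic_post ON posts_sorted (topic_id, post_id);
    ALTER TABLE posts RENAME TO posts_old;
    ALTER TABLE posts_sorted RENAME TO posts;
    COMMIT;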

        I am also trying the following more exotic approaches :

        * chunked sort :

Well, sorting 1GB of data when your work_mem is only 512 MB needs several passes, hence a lot of disk IO. The more data, the more IO.
        So, instead of doing this, I will :
        - grab about 250 MB of posts from the table
        - sort them by (topic_id, post_id)
        - insert them in a new table
        - repeat
        - then reindex, etc., and replace the old table with the new one.
(reindexing is very fast : since the table is nicely defragmented now, I get full disk speed. However, I would like to be able to create 2 indexes with ONE table scan !)
        I'm trying 2 different ways to do that, with plpgsql and cursors.
It is much faster than sorting the whole data set, because the sorts are only done in memory (hence the "chunks"). So far, it seems a database clustered this way is about as fast as one clustered with CLUSTER, but the clustering operation itself is faster.
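
        To give an idea, here is a stripped-down sketch of the chunked approach (not the exact code I'm benchmarking : it slices by post_id range rather than fetching from a cursor, but the principle is the same, and it assumes posts_sorted was created beforehand, e.g. with CREATE TABLE posts_sorted (LIKE posts)) :

    CREATE OR REPLACE FUNCTION cluster_posts_chunked(chunk_size INTEGER)
    RETURNS void AS $$
    DECLARE
        lo      INTEGER;
        max_id  INTEGER;
    BEGIN
        SELECT min(post_id), max(post_id) INTO lo, max_id FROM posts;
        WHILE lo <= max_id LOOP
            -- chunk_size is picked so each chunk sorts entirely in work_mem
            INSERT INTO posts_sorted
                SELECT * FROM posts
                WHERE post_id >= lo AND post_id < lo + chunk_size
                ORDER BY topic_id, post_id;
            lo := lo + chunk_size;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- e.g. SELECT cluster_posts_chunked(1000000);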
        More results in about 3 days when the benchmarks finish.

        * other dumb stuff

I'll try DELETEing the last 250 MB of records, stuffing them in a temp table, vacuuming, and re-inserting them in order.
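
        Something like this (some_cutoff stands for whatever post_id splits off the last ~250 MB ; note that VACUUM must run outside the transaction) :

    BEGIN;
    CREATE TEMP TABLE recent_posts AS
        SELECT * FROM posts WHERE post_id > some_cutoff;
    DELETE FROM posts WHERE post_id > some_cutoff;
    COMMIT;
    VACUUM posts;
    -- re-insert in clustered order : the idea is that the space freed at the
    -- end of the table gets reused, so the tail of the table ends up ordered
    INSERT INTO posts
        SELECT * FROM recent_posts ORDER BY topic_id, post_id;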
