On Sun, 27 May 2007 19:34:30 +0200, PFC <[EMAIL PROTECTED]> wrote:

On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby <[EMAIL PROTECTED]> wrote:

On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
        This does not run a complete sort on the table. It would be about as
fast as your seq scan disk throughput. Obviously, the end result is not as
good as a real CLUSTER, since the table will be made up of several ordered
chunks. Therefore, a range lookup on the clustered columns would need at most
N seeks, versus 1 for a really clustered table. But it only scans the table
once and writes it once, even counting the index rebuild.

Do you have any data that indicates such an arrangement would be
substantially better than less-clustered data?

While the little benchmark that will answer your question is running, I'll add a few comments :

        Alright, so far :

This is a simulated forum workload, so it's mostly post insertions, some edits, and some topic deletes. It will give results applicable to forums, obviously, but also to anything that works on the same schema :
        - topics + posts
        - blog articles + comments
        - e-commerce site where users can enter their reviews
So, with the new trend being to let users participate, this kind of workload will become more and more relevant for websites.

So, how to cluster the posts table on (topic_id, post_id) to get all the posts of the same webpage in 1 seek ?
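
        For reference, here is a minimal version of the schema in question (the table and column names are illustrative, not the actual benchmark schema) :

    CREATE TABLE topics (
        topic_id  SERIAL PRIMARY KEY,
        title     TEXT NOT NULL
    );

    CREATE TABLE posts (
        post_id   SERIAL PRIMARY KEY,
        topic_id  INTEGER NOT NULL REFERENCES topics,
        content   TEXT NOT NULL
    );

    -- the index the "show topic" page uses ; ideally the heap order matches it
    CREATE INDEX posts_topic_post ON posts (topic_id, post_id);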

        I am benchmarking the following :
        - CLUSTER obviously
- Creating a new table with INSERT .. SELECT ORDER BY topic_id, post_id, then reindexing, etc. (see the sketch after this list)
        - not doing anything (just vacuuming all tables)
        - not even vacuuming the posts table.
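
        For the INSERT .. SELECT option, the statement is roughly the following (a sketch only : posts_sorted is a made-up name, and constraints, defaults and triggers would have to be recreated by hand) :

    BEGIN;
    -- rewrite the table in the desired physical order
    CREATE TABLE posts_sorted AS
        SELECT * FROM posts ORDER BY topic_id, post_id;
    CREATE INDEX posts_sorted_topic_post ON posts_sorted (topic_id, post_id);
    ALTER TABLE posts RENAME TO posts_old;
    ALTER TABLE posts_sorted RENAME TO posts;
    COMMIT;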

        I am also trying the following more exotic approaches :

        * chunked sort :

Well, sorting 1GB of data when your work_mem is only 512 MB needs several passes, hence a lot of disk IO. The more data, the more IO.
        So, instead of doing this, I will :
        - grab about 250 MB of posts from the table
        - sort them by (topic_id, post_id)
        - insert them in a new table
        - repeat
        - then reindex, etc., and replace the old table with the new one.
(reindexing is very fast : since the table is nicely defragmented now, I get full disk speed. However, I would like to be able to create 2 indexes with ONE table scan !)
        I'm trying 2 different ways to do that, with plpgsql and cursors.
It is much faster than sorting the whole data set, because the sorts are only done in memory (hence the "chunks"). So far, it seems a database clustered this way is about as fast as one clustered with CLUSTER, but the clustering operation itself is faster.
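
        To give an idea, here is a stripped-down sketch of the chunked approach (not the exact code I'm benchmarking : it slices by post_id range rather than fetching from a cursor, but the principle is the same, and it assumes posts_sorted was created beforehand, e.g. with CREATE TABLE posts_sorted (LIKE posts)) :

    CREATE OR REPLACE FUNCTION cluster_posts_chunked(chunk_size INTEGER)
    RETURNS void AS $$
    DECLARE
        lo      INTEGER;
        max_id  INTEGER;
    BEGIN
        SELECT min(post_id), max(post_id) INTO lo, max_id FROM posts;
        WHILE lo <= max_id LOOP
            -- chunk_size is picked so each chunk sorts entirely in work_mem
            INSERT INTO posts_sorted
                SELECT * FROM posts
                WHERE post_id >= lo AND post_id < lo + chunk_size
                ORDER BY topic_id, post_id;
            lo := lo + chunk_size;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- e.g. SELECT cluster_posts_chunked(1000000);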
        More results in about 3 days when the benchmarks finish.

        * other dumb stuff

I'll try DELETEing the last 250 MB of records, stuffing them in a temp table, vacuuming, and re-inserting them in order.
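
        Something like this (some_cutoff stands for whatever post_id splits off the last ~250 MB ; note that VACUUM must run outside the transaction) :

    BEGIN;
    CREATE TEMP TABLE recent_posts AS
        SELECT * FROM posts WHERE post_id > some_cutoff;
    DELETE FROM posts WHERE post_id > some_cutoff;
    COMMIT;
    VACUUM posts;
    -- re-insert in clustered order : the idea is that the space freed at the
    -- end of the table gets reused, so the tail of the table ends up ordered
    INSERT INTO posts
        SELECT * FROM recent_posts ORDER BY topic_id, post_id;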
