Re: [PATCHES] Maintaining cluster order on insert

Tom Lane Mon, 09 Jul 2007 16:31:14 -0700

[ back to the cluster-order patch ]

Awhile back, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> The performance characteristics of this patch hasn't been thoroughly 
> discussed yet. The reason why you want to cluster your tables is to 
> speed up SELECTs that return a bunch of tuples with similar values, for 
> example range queries. The reason for keeping them clustered on inserts 
> is to reduce the need to run CLUSTER as often.


> It doesn't come without a cost, however. In the worst case, there never 
> is room for new inserts on pages, and each insert needs to do one extra 
> I/O to fetch the optimal heap page where the insert should go, see that 
> there's no room, and then insert somewhere else. Using a non-zero 
> fillfactor helps, but even when there is room on the page, it's often 
> cheaper to just append to the end of the table and running CLUSTER at 
> night for example, than do random access to insert to the "right" pages 
> in the heap.

I looked through the thread and could not find any clear evidence that
anyone had done any performance testing of this patch at all, so I
hacked together a crude test that alternates between inserting/deleting
a random subset of rows and SELECTing a range of rows.  The results
do not look very good: while there is detectable improvement in the
SELECT performance, it's not much, and it comes at a *very* sizable
penalty in INSERT performance.

Attached is a test program, which I ran against today's CVS HEAD with
and without the v8 version of the patch.  The test program sets up
a clustered million-row table with a row width chosen to fit about
20 rows per page.  It then iterates a loop of:

        * delete a random 2% of the table
        * vacuum to recover space
        * insert a random 2% of the table
        * select (about) 1000 consecutively-numbered rows
        * select all the rows (this is just a cross check that the
          number of rows isn't changing too much)

What you would hope to see as the benefit of the patch is that the time
for the range SELECT degrades more slowly as more of the table is
replaced.  Ignoring the first SELECT as being a startup transient, it
looks like HEAD degrades from about 3 msec to 6 msec over 10 iterations
(20% replacement of the table), whereas with the patch it's about 3 msec
to about 4 and a half.  However, the INSERT steps went from around 20
sec each to about twice that.

I used just the following nondefault parameter settings:

shared_buffers = 10000                  # min 128kB or max_connections*16kB
maintenance_work_mem = 160MB            # min 1MB
max_fsm_pages = 2048000                 # min max_fsm_relations*16, 6 bytes each
wal_buffers = 640kB                     # min 32kB
checkpoint_segments = 30                # in logfile segments, min 1, 16MB each
enable_bitmapscan = off
enable_seqscan = off
autovacuum = off                        # enable autovacuum subprocess?

which for the most part are just 10x the factory defaults.  The hardware
is just a Dell x86_64 workstation with crappy IDE disk, so maybe things
would look better elsewhere, but it's all I have to work with.

Considering that the patch is really severely ugly from a modularity and
layering standpoint, I'm now inclined to reject it.  AFAICT this test
case is showing the patch at its best possible advantage; under
real-world conditions the benefit would be less.  It doesn't look to me
like the gain is worth the loss of system understandability and
maintainability that the patch would impose.

                        regards, tom lane

/*
 * derived from testlibpq3.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include "libpq-fe.h"


#define TABLESIZE 1000000               /* rows in table */
#define ITERS 10


static void
exit_nicely(PGconn *conn)
{
        PQfinish(conn);
        exit(1);
}

static char *
elapsed_time(struct timeval start_t, struct timeval elapse_t)
{
        static char buf[64];

        if (elapse_t.tv_usec < start_t.tv_usec)
        {
                elapse_t.tv_sec--;
                elapse_t.tv_usec += 1000000;
        }

        sprintf(buf, "%3ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
                   (long) (elapse_t.tv_usec - start_t.tv_usec));
        return buf;
}

static void
do_cmd(PGconn *conn, const char *cmd)
{
        PGresult   *res;
        struct timeval start_t;
        struct timeval elapse_t;

        gettimeofday(&start_t, NULL);
        res = PQexec(conn, cmd);
        gettimeofday(&elapse_t, NULL);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
                fprintf(stderr, "%s\nCommand failed: %s", cmd, 
PQerrorMessage(conn));
                PQclear(res);
                exit_nicely(conn);
        }
        printf("executed %-.6s in %s sec\n", cmd,
                   elapsed_time(start_t, elapse_t));
        fflush(stdout);
        PQclear(res);
}

static void
do_select(PGconn *conn, const char *cmd)
{
        PGresult   *res;
        struct timeval start_t;
        struct timeval elapse_t;

        gettimeofday(&start_t, NULL);
        res = PQexec(conn, cmd);
        gettimeofday(&elapse_t, NULL);
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
                fprintf(stderr, "%s\nCommand failed: %s", cmd, 
PQerrorMessage(conn));
                PQclear(res);
                exit_nicely(conn);
        }
        printf("retrieved %4s tuples in %s sec\n", PQgetvalue(res, 0, 0),
                   elapsed_time(start_t, elapse_t));
        fflush(stdout);
        PQclear(res);
}

int
main(int argc, char **argv)
{
        const char *conninfo;
        PGconn     *conn;
        PGresult   *res;
        char            cmd[1024];
        int                     i;

        /*
         * If the user supplies a parameter on the command line, use it as the
         * conninfo string; otherwise default to setting dbname=postgres and 
using
         * environment variables or defaults for all other connection 
parameters.
         */
        if (argc > 1)
                conninfo = argv[1];
        else
                conninfo = "dbname = postgres";

        /* Make a connection to the database */
        conn = PQconnectdb(conninfo);

        /* Check to see that the backend connection was successfully made */
        if (PQstatus(conn) != CONNECTION_OK)
        {
                fprintf(stderr, "Connection to database failed: %s",
                                PQerrorMessage(conn));
                exit_nicely(conn);
        }

        /* Ignore any error while dropping old table */
        res = PQexec(conn, "DROP TABLE testtab");
        PQclear(res);

        /* Set up test table */
        do_cmd(conn, "CREATE TABLE testtab(k int, d text)");
        sprintf(cmd, "INSERT INTO testtab "
                        "SELECT x, repeat('x',350) FROM generate_series(1,%d) 
x",
                        TABLESIZE);
        do_cmd(conn, cmd);
        do_cmd(conn, "CREATE INDEX testtabi ON testtab(k)");
        do_cmd(conn, "CLUSTER testtab USING testtabi");
        do_cmd(conn, "VACUUM ANALYZE testtab");

        for (i = 0; i < ITERS; i++)
        {
                int                     st;

                sprintf(cmd, "DELETE FROM testtab WHERE random() < 1.0/50");
                do_cmd(conn, cmd);
                do_cmd(conn, "VACUUM testtab");
                sprintf(cmd, "INSERT INTO testtab "
                                "SELECT (random() * %d)::int, repeat('x',350) "
                                "FROM generate_series(1,%d)",
                                TABLESIZE, TABLESIZE/50);
                do_cmd(conn, cmd);
                st = ((double) random() * (TABLESIZE - 1000)) / 0x7FFFFFFF;
                sprintf(cmd, "SELECT count(*) FROM testtab "
                                "WHERE k BETWEEN %d AND %d",
                                st, st + 1000);
                do_select(conn, cmd);
                do_select(conn, "SELECT count(*) FROM testtab");
        }

        /* close the connection to the database and cleanup */
        PQfinish(conn);

        return 0;
}

executed CREATE in   0.066563 sec
executed INSERT in  40.465653 sec
executed CREATE in   9.152698 sec
executed CLUSTE in  20.036375 sec
executed VACUUM in   1.440232 sec
executed DELETE in  14.770937 sec
executed VACUUM in  10.663301 sec
executed INSERT in   2.449248 sec
retrieved  995 tuples in   0.170685 sec
retrieved 999971 tuples in   0.648237 sec
executed DELETE in  33.760709 sec
executed VACUUM in  28.762174 sec
executed INSERT in  12.212027 sec
retrieved  999 tuples in   0.003139 sec
retrieved 999642 tuples in   0.646765 sec
executed DELETE in  30.541053 sec
executed VACUUM in  13.204475 sec
executed INSERT in  17.972502 sec
retrieved  985 tuples in   0.003621 sec
retrieved 999612 tuples in   0.623749 sec
executed DELETE in  14.911813 sec
executed VACUUM in  27.443921 sec
executed INSERT in  19.125950 sec
retrieved 1002 tuples in   0.004900 sec
retrieved 999784 tuples in   0.667716 sec
executed DELETE in  22.651369 sec
executed VACUUM in  10.743926 sec
executed INSERT in  21.631076 sec
retrieved  987 tuples in   0.004566 sec
retrieved 999711 tuples in   0.632185 sec
executed DELETE in  11.587629 sec
executed VACUUM in  15.278964 sec
executed INSERT in  23.725325 sec
retrieved 1011 tuples in   0.005960 sec
retrieved 999624 tuples in   0.680135 sec
executed DELETE in   3.905152 sec
executed VACUUM in  42.848288 sec
executed INSERT in  18.619609 sec
retrieved 1006 tuples in   0.007106 sec
retrieved 999479 tuples in   0.678316 sec
executed DELETE in  27.288273 sec
executed VACUUM in   9.329839 sec
executed INSERT in  23.361110 sec
retrieved  983 tuples in   0.005997 sec
retrieved 999354 tuples in   0.615004 sec
executed DELETE in  20.416503 sec
executed VACUUM in  17.537463 sec
executed INSERT in  19.781416 sec
retrieved 1021 tuples in   0.006894 sec
retrieved 999688 tuples in   0.670918 sec
executed DELETE in  14.063273 sec
executed VACUUM in  17.439971 sec
executed INSERT in  17.381126 sec
retrieved  987 tuples in   0.006664 sec
retrieved 999930 tuples in   0.632494 sec

executed CREATE in   0.086862 sec
executed INSERT in  50.746362 sec
executed CREATE in  12.115655 sec
executed CLUSTE in  33.656341 sec
executed VACUUM in   4.306563 sec
executed DELETE in  18.062664 sec
executed VACUUM in  28.487570 sec
executed INSERT in  25.638022 sec
retrieved  998 tuples in   1.498475 sec
retrieved 1000019 tuples in   0.624082 sec
executed DELETE in  30.211317 sec
executed VACUUM in  24.147135 sec
executed INSERT in  40.759404 sec
retrieved 1006 tuples in   0.002711 sec
retrieved 1000184 tuples in   0.984822 sec
executed DELETE in  31.616131 sec
executed VACUUM in  22.383567 sec
executed INSERT in  36.174291 sec
retrieved  990 tuples in   0.002840 sec
retrieved 1000275 tuples in   0.723260 sec
executed DELETE in  34.279871 sec
executed VACUUM in  34.855060 sec
executed INSERT in  36.652868 sec
retrieved  990 tuples in   0.003554 sec
retrieved 1000376 tuples in   0.715215 sec
executed DELETE in  37.396236 sec
executed VACUUM in  17.721296 sec
executed INSERT in  32.413756 sec
retrieved  974 tuples in   0.003498 sec
retrieved 1000383 tuples in   0.723111 sec
executed DELETE in  44.696337 sec
executed VACUUM in  13.850628 sec
executed INSERT in  47.649557 sec
retrieved 1002 tuples in   0.004165 sec
retrieved 1000554 tuples in   0.722704 sec
executed DELETE in  17.854653 sec
executed VACUUM in  52.446585 sec
executed INSERT in  41.293512 sec
retrieved 1038 tuples in   0.004169 sec
retrieved 1000422 tuples in   0.692400 sec
executed DELETE in  32.102991 sec
executed VACUUM in  22.018916 sec
executed INSERT in  44.296194 sec
retrieved  991 tuples in   0.004636 sec
retrieved 1000713 tuples in   0.739701 sec
executed DELETE in  31.678825 sec
executed VACUUM in  12.643271 sec
executed INSERT in  53.179401 sec
retrieved 1015 tuples in   0.004745 sec
retrieved 1000797 tuples in   0.708603 sec
executed DELETE in  35.682182 sec
executed VACUUM in  42.612243 sec
executed INSERT in  45.726663 sec
retrieved 1017 tuples in   0.004753 sec
retrieved 1000755 tuples in   0.696441 sec

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Re: [PATCHES] Maintaining cluster order on insert

Reply via email to