Perhaps because the P4 is already doing hardware prefetching?

http://www.tomshardware.com/2000/11/20/intel/page5.html

I ran the test program on a 2.2 GHz Opteron:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 81.166 ms
Sum: -951304192: with prefetch off - duration: 79.769 ms
Sum: -951304192: with prefetch on - duration: 81.173 ms
Sum: -951304192: with prefetch off - duration: 79.718 ms
Sum: -951304192: with prefetch on - duration: 83.293 ms
Sum: -951304192: with prefetch off - duration: 79.731 ms
Sum: -951304192: with prefetch on - duration: 81.227 ms
Sum: -951304192: with prefetch off - duration: 79.851 ms
Sum: -951304192: with prefetch on - duration: 81.003 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.084 ms
Sum: -951304192: with prefetch off - duration: 79.728 ms
Sum: -951304192: with prefetch on - duration: 81.009 ms
Sum: -951304192: with prefetch off - duration: 79.723 ms
Sum: -951304192: with prefetch on - duration: 81.074 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms
Sum: -951304192: with prefetch on - duration: 81.188 ms
Sum: -951304192: with prefetch off - duration: 79.724 ms
Sum: -951304192: with prefetch on - duration: 81.075 ms
Sum: -951304192: with prefetch off - duration: 79.719 ms

That is a small but consistent slowdown with prefetch on.

I ran it on a 650 MHz PIII:

% ./a.out 10 16
Sum: -951304192: with prefetch on - duration: 284.952 ms
Sum: -951304192: with prefetch off - duration: 291.439 ms
Sum: -951304192: with prefetch on - duration: 290.690 ms
Sum: -951304192: with prefetch off - duration: 299.692 ms
Sum: -951304192: with prefetch on - duration: 295.287 ms
Sum: -951304192: with prefetch off - duration: 290.992 ms
Sum: -951304192: with prefetch on - duration: 285.116 ms
Sum: -951304192: with prefetch off - duration: 294.127 ms
Sum: -951304192: with prefetch on - duration: 286.986 ms
Sum: -951304192: with prefetch off - duration: 291.001 ms
Sum: -951304192: with prefetch on - duration: 283.233 ms
Sum: -951304192: with prefetch off - duration: 401.910 ms
Sum: -951304192: with prefetch on - duration: 297.021 ms
Sum: -951304192: with prefetch off - duration: 307.814 ms
Sum: -951304192: with prefetch on - duration: 287.201 ms
Sum: -951304192: with prefetch off - duration: 303.870 ms
Sum: -951304192: with prefetch on - duration: 286.962 ms
Sum: -951304192: with prefetch off - duration: 352.779 ms
Sum: -951304192: with prefetch on - duration: 283.245 ms
Sum: -951304192: with prefetch off - duration: 294.422 ms

That looks like some speedup to me: prefetch on is faster in 9 of the 10 rounds.
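
Since a couple of the "off" rounds on the PIII are clear outliers (401.910 ms and 352.779 ms), a median of the per-round differences is probably a fairer summary than eyeballing or averaging. Below is a minimal sketch of that; the function name median_gain_ms and the MAX_ROUNDS limit are made up for illustration and are not part of the test program. Fed the ten PIII pairs above, it should come out at roughly 10 ms in favour of prefetch.

```c
#include <stdlib.h>

#define MAX_ROUNDS 64   /* arbitrary limit chosen for this sketch */

/* qsort comparator for doubles */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/*
 * Median of (off[i] - on[i]) over n paired rounds, in ms.
 * A positive result means the prefetch-on runs were faster.
 * Returns 0.0 if n is outside 1..MAX_ROUNDS.
 */
double median_gain_ms(const double *on, const double *off, int n)
{
    double diff[MAX_ROUNDS];
    int    i;

    if (n <= 0 || n > MAX_ROUNDS)
        return 0.0;

    for (i = 0; i < n; i++)
        diff[i] = off[i] - on[i];

    qsort(diff, (size_t) n, sizeof(double), cmp_double);

    /* middle element for odd n, mean of the two middle ones for even n */
    return (n % 2) ? diff[n / 2] : (diff[n / 2 - 1] + diff[n / 2]) / 2.0;
}
```

The "on" and "off" arrays are just the duration columns of the output, in the order the rounds were printed.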

On Thu, 08 Dec 2005 Qingqing Zhou wrote:
> 
> I found an interesting paper on improving index speed by prefetching
> memory data into the L1/L2 cache (prefetching disk data into memory was
> discussed several days ago in the "ice-breaker thread"):
> http://www.cs.cmu.edu/~chensm/papers/index_pf_final.pdf
> 
> A related technique is used to speed up memcpy:
> http://people.redhat.com/arjanv/pIII.c
> 
> I wonder if we could use it to speed up in-memory scan operations for
> heap or index. Tom's patch lets a scan handle a whole page (rather than
> a single row) at a time, which is a basis for this optimization.
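
For illustration, here is a rough sketch of what a page-at-a-time scan with a software prefetch hint could look like. It assumes GCC or Clang, whose __builtin_prefetch builtin is used here in place of inline assembly; the function scan_page, its parameters, and the 64-byte CACHE_LINE value are assumptions for this sketch and are not taken from any PostgreSQL code.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed cache line size */

/*
 * Sum the first int of every cache line in a page, issuing a
 * prefetch hint `ahead` bytes in front of the current position.
 */
long long scan_page(const char *page, size_t page_size, size_t ahead)
{
    long long total = 0;
    size_t    off;

    for (off = 0; off + sizeof(int) <= page_size; off += CACHE_LINE)
    {
        int v;

        /* only hint at addresses that are still inside the page */
        if (off + ahead < page_size)
            __builtin_prefetch(page + off + ahead, 0, 0);

        /* "process" the current line; memcpy avoids alignment issues */
        memcpy(&v, page + off, sizeof(v));
        total += v;
    }
    return total;
}
```

The second and third arguments to __builtin_prefetch (0, 0) ask for a read prefetch with no expected temporal locality, which is the closest match to a non-temporal hint. Whether it helps at all depends on the CPU and on how much work is done per cache line.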
> 
> I wrote a program to try to simulate it, but I am not good at
> micro-optimization, and I only get a very weak but fairly stable
> improvement. I wonder if anyone here is interested in taking a look.
> 
> Regards,
> Qingqing
> 
> ----------------------------------
> 
> Test results
> --------------
> Cache line size: 64
> CPU: P4 2.4G
> $#./prefetch 10 16
> Sum: -951304192: with prefetch on - duration: 42.163 ms
> Sum: -951304192: with prefetch off - duration: 42.838 ms
> Sum: -951304192: with prefetch on - duration: 44.044 ms
> Sum: -951304192: with prefetch off - duration: 42.792 ms
> Sum: -951304192: with prefetch on - duration: 42.324 ms
> Sum: -951304192: with prefetch off - duration: 42.803 ms
> Sum: -951304192: with prefetch on - duration: 42.189 ms
> Sum: -951304192: with prefetch off - duration: 42.801 ms
> Sum: -951304192: with prefetch on - duration: 42.155 ms
> Sum: -951304192: with prefetch off - duration: 42.827 ms
> Sum: -951304192: with prefetch on - duration: 42.179 ms
> Sum: -951304192: with prefetch off - duration: 42.798 ms
> Sum: -951304192: with prefetch on - duration: 42.180 ms
> Sum: -951304192: with prefetch off - duration: 42.804 ms
> Sum: -951304192: with prefetch on - duration: 42.193 ms
> Sum: -951304192: with prefetch off - duration: 42.827 ms
> Sum: -951304192: with prefetch on - duration: 42.164 ms
> Sum: -951304192: with prefetch off - duration: 42.810 ms
> Sum: -951304192: with prefetch on - duration: 42.182 ms
> Sum: -951304192: with prefetch off - duration: 42.826 ms
> 
> Test program
> ----------------
> 
> /*
>  * prefetch.c
>  *            PostgreSQL warm-cache sequential scan simulator with prefetch
>  */
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <errno.h>
> #include <sys/time.h>
> 
> typedef char bool;
> #define true  ((bool) 1)
> #define false ((bool) 0)
> 
> #define BLCKSZ        8192
> #define CACHESZ       64
> #define NBLCKS        5000
> 
> int   sum;
> 
> int
> main(int argc, char *argv[])
> {
>       int     i, rounds;
>       char    *blocks;
>       int     cpu_cost;
> 
>       if (argc != 3)
>       {
>               fprintf(stderr, "usage: prefetch <rounds> <cpu_cost [1, 16]>\n");
>               exit(-1);
>       }
> 
>       rounds = atoi(argv[1]);
>       cpu_cost  = atoi(argv[2]);
>       if (cpu_cost < 1 || cpu_cost > 16)
>               exit(-1);
> 
>       for (i = 0; i < 2*rounds; i++)
>       {
>               int     j, k;
>               struct  timeval start_t, stop_t;
>               bool    enable = i%2?false:true;
>               char    *blck;
> 
>               blocks = (char *)malloc(BLCKSZ*NBLCKS);
>               memset(blocks, 'a', BLCKSZ*NBLCKS);
> 
>               sum = 0;
>               gettimeofday(&start_t, NULL);
> 
>               for (j = 0; j < NBLCKS; j++)
>               {
>                       blck = blocks + j*BLCKSZ;
>                       for (k=0; k < BLCKSZ; k+=CACHESZ)
>                       {
>                               int     *s = (int *)(blck + k);
>                               int     u = cpu_cost;
> 
>                               if (enable)
>                               {
>                                       /* prefetch ahead */
>                                       __asm__ __volatile__ (
>                                       "1: prefetchnta 128(%0)\n"
>                                               : : "r" (s) : "memory" );
>                               }
> 
>                               /* pretend to process current tuple */
>                               while (u--) sum += (*(s+u))*(*(s+u));
>                       }
>               }
>               gettimeofday(&stop_t, NULL);
> 
>               free(blocks);
> 
>               /* measure the time */
>               if (stop_t.tv_usec < start_t.tv_usec)
>               {
>                       stop_t.tv_sec--;
>                       stop_t.tv_usec += 1000000;
>               }
>               fprintf (stdout, "Sum: %d: with prefetch %s - duration: %ld.%03ld ms\n",
>                               sum,
>                               enable?"on":"off",
>                               (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
>                                               (stop_t.tv_usec - start_t.tv_usec) / 1000),
>                               (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);
> 
>       }
> 
>       exit(0);
> }
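
As a side note on the timing code, the manual borrow from tv_sec can be avoided by converting to floating-point milliseconds in one step. A minimal sketch, where elapsed_ms is a hypothetical helper name and not something from the program above:

```c
#include <sys/time.h>

/*
 * Elapsed time between two gettimeofday() samples, in milliseconds.
 * A negative microsecond difference is absorbed by the seconds term,
 * so no explicit borrow is needed.
 */
double elapsed_ms(const struct timeval *start, const struct timeval *stop)
{
    return (stop->tv_sec - start->tv_sec) * 1000.0 +
           (stop->tv_usec - start->tv_usec) / 1000.0;
}
```

The result could then be printed with a "%.3f ms" format instead of the two-part "%ld.%03ld".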
