Thanks for attaching the C code for your test. I ran a few tests on a 3 GHz 
Intel Xeon Paxville (dual core) system. I hope the formatting of this table 
comes through:

Method  Size  N=1024*1024       N=1
MEMCPY  63     6964927 us     582494 us
MEMCPY  32     7102497 us     582467 us
MEMCPY  16     7116358 us     582538 us
MEMCPY  8      6965239 us     582796 us
MEMCPY  4      6964722 us     583183 us
STRNCPY 63    10131174 us    8843010 us
STRNCPY 32    10648202 us    9563868 us
STRNCPY 16     9187398 us    7969947 us
STRNCPY 8      9275353 us    8042777 us
STRNCPY 4      9067570 us    8058532 us
STRLCPY 63    15045507 us   14379702 us
STRLCPY 32     8960303 us    8120471 us
STRLCPY 16     7393607 us    4915457 us
STRLCPY 8      7222983 us    3211931 us
STRLCPY 4      7181267 us    1725546 us
LENCPY  63     7608932 us    4416602 us
LENCPY  32     7252849 us    3807535 us
LENCPY  16    11680927 us   10331487 us
LENCPY  8     10409685 us    9660616 us
LENCPY  4     10824632 us    9525082 us

The first column is the copy method, the second column is the source string 
size in bytes (size of -DSTRING), and the third and fourth columns are the 
timings for the two -DN settings.

The memcpy () call is the clear winner at all source string sizes. The strncpy 
() call beats strlcpy () only for the 63-byte string; from 32 bytes down, 
strlcpy () is faster. This is probably due to the zero padding effect of 
strncpy (). The lencpy () call starts out strong, but degrades as the size of 
the string decreases. This was a little surprising and I don't have an 
explanation for this behavior at this time.

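
The zero padding effect can be seen directly: strncpy () always writes the full n bytes, filling everything after the string with NULs, while a strlcpy ()-style copy stops at the terminator. The following is only a minimal sketch. sketch_strlcpy (), bytes_touched () and the touched_by_* helpers are hypothetical names made up for illustration, and the 64-byte buffer is an assumption taken from the 64-byte blocks in the test.

```c
#include <stddef.h>
#include <string.h>

#define BUF 64  /* assumed destination size */

/* Stand-in for strlcpy (), which not every libc provides.  Copies at most
 * size-1 bytes, NUL-terminates when size > 0, returns strlen(src). */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Count how many bytes of a 0xFF-filled buffer were overwritten. */
static size_t bytes_touched(const unsigned char *buf, size_t size)
{
    size_t touched = 0;

    for (size_t i = 0; i < size; i++)
        if (buf[i] != 0xFF)
            touched++;
    return touched;
}

static size_t touched_by_strncpy(const char *src)
{
    unsigned char buf[BUF];

    memset(buf, 0xFF, sizeof buf);
    strncpy((char *) buf, src, sizeof buf);   /* pads with NULs up to BUF */
    return bytes_touched(buf, sizeof buf);
}

static size_t touched_by_strlcpy(const char *src)
{
    unsigned char buf[BUF];

    memset(buf, 0xFF, sizeof buf);
    sketch_strlcpy((char *) buf, src, sizeof buf);
    return bytes_touched(buf, sizeof buf);
}
```

For a 3-character source, touched_by_strncpy () reports all 64 bytes written while touched_by_strlcpy () reports 4, so the shorter the string, the more of strncpy ()'s work is padding.
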
The AMD optimization manuals have some interesting examples for optimizations 
for memcpy, along the lines of cache line copies and prefetching:
There also used to be an interesting article on the SGI web site called 
"Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540", 
but this seems to have been pulled. I did find a copy of the article here:

Obviously, different copy mechanisms suit different data sizes, so I added a 
little debug output to the strlcpy () function that was added to Postgres the 
other day. I ran a test against Postgres for ~15 minutes that used 2 client 
backends and the BG writer; the test generated 8330804 calls to strlcpy ().

Out of the 8330804 calls, 6226616 calls used a maximum copy size of 2213 bytes 
(e.g. strlcpy (dest, src, 2213)) and 2104074 calls used a maximum copy size of 
64 bytes.
I know the 2213 size calls come from the set_ps_display () function. I don't 
know where the 64 size calls come from, yet.
In the 64 size case, all but 35 of the calls copy only 1 byte. I would assume 
this is just the terminating NUL, i.e. an empty source string.
In the 2213 size case, 1933027 calls copy 20 bytes; 2189415 calls copy 5 bytes; 
85550 calls copy 6 bytes and 2018482 calls copy 7 bytes.
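
Counts like the ones above could be gathered with instrumentation along these lines. This is a sketch only: counted_strlcpy (), the counter arrays and the MAX_TRACKED limit are hypothetical names chosen for illustration, not the debug code that was actually added to Postgres.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TRACKED 4096   /* assumed cap; larger values share one bucket */

/* calls_by_limit[n]:  calls made with a maximum copy size of n.
 * calls_by_copied[n]: calls that wrote n bytes, terminating NUL included. */
static unsigned long calls_by_limit[MAX_TRACKED + 1];
static unsigned long calls_by_copied[MAX_TRACKED + 1];
static unsigned long total_calls;

/* strlcpy ()-style copy that also records its size argument and the
 * number of bytes it wrote. */
static size_t counted_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);
    size_t written = 0;

    if (size > 0) {
        size_t n = (len < size - 1) ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
        written = n + 1;
    }

    total_calls++;
    calls_by_limit[size < MAX_TRACKED ? size : MAX_TRACKED]++;
    calls_by_copied[written < MAX_TRACKED ? written : MAX_TRACKED]++;
    return len;
}
```

Since the counters are plain static variables, each backend process would keep its own set, so the numbers would have to be logged per process and summed afterwards.
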
Based on this data, it would seem that either memcpy () or strlcpy () would be 
the better choice, since the source strings are so short.
Calls originating from the set_ps_display () function might be able to use the 
memcpy () call, as the size of the source string should be known. The other 
calls probably need something like strlcpy (), as the source string length might 
not be known, although using memcpy () to copy in XX byte blocks might be 
interesting.
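
When the caller already knows the source length, the bounded copy reduces to a clamp plus memcpy (). A minimal sketch of that idea follows. copy_known_len () is a hypothetical helper, not an existing Postgres function, and whether set_ps_display () really has the length at hand is the assumption stated above.

```c
#include <stddef.h>
#include <string.h>

/* Copy srclen bytes of src into dst (capacity dstsize), truncating if
 * needed and always NUL-terminating when dstsize > 0.
 * Returns the number of characters stored, excluding the NUL. */
static size_t copy_known_len(char *dst, size_t dstsize,
                             const char *src, size_t srclen)
{
    if (dstsize == 0)
        return 0;
    if (srclen >= dstsize)
        srclen = dstsize - 1;   /* truncate to fit */
    memcpy(dst, src, srclen);
    dst[srclen] = '\0';
    return srclen;
}
```

No scan of the source for its terminator is needed here, which is the work strlcpy () cannot avoid.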


Sent: Fri 9/29/2006 2:59 PM
To: Tom Lane
Subject: Re: [HACKERS] Faster StrNCpy

On Fri, Sep 29, 2006 at 05:34:30PM -0400, Tom Lane wrote:
> > If anybody is curious, here are my numbers for an AMD X2 3800+:
> You did not show your C code, so no one else can reproduce the test on
> other hardware.  However, it looks like your compiler has unrolled the
> memcpy into straight-line 8-byte moves, which makes it pretty hard for
> anything operating byte-wise to compete, and is a bit dubious for the
> general case anyway (since it requires assuming that the size and
> alignment are known at compile time).

I did show the .s code. I call into x_memcpy(a, b), meaning that the
compiler can't assume anything. It may happen to be aligned.

Here are results over 64 Mbytes of memory, to ensure that every call is
a cache miss:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE:        767243 us
MEMCPY:     6044137 us
STRNCPY:   10741759 us
STRLCPY:   12061630 us
LENCPY:     9459099 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE:        712193 us
MEMCPY:     6072312 us
STRNCPY:    9982983 us
STRLCPY:    6605052 us
LENCPY:     7128258 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x 
NONE:        708164 us
MEMCPY:     6042817 us
STRNCPY:    8885791 us
STRLCPY:    5592477 us
LENCPY:     6135550 us

At least on my machine, memcpy() still comes out on top. Yes, that assumes
it is aligned correctly for the machine. Here is the unaligned case (all
arrays are stored at a +1 offset in memory):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        790932 us
MEMCPY:     6591559 us
STRNCPY:   10622291 us
STRLCPY:   12070007 us
LENCPY:    10322541 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        764577 us
MEMCPY:     6631731 us
STRNCPY:    9513540 us
STRLCPY:    6615345 us
LENCPY:     7263392 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        825689 us
MEMCPY:     6607777 us
STRNCPY:    8976487 us
STRLCPY:    5878088 us
LENCPY:     6180358 us

Alignment does look like it impacts the results for memcpy(): it goes
from around 6.0 seconds to 6.6 seconds. Overall, though, it is still the
winner in all cases except against strlcpy(), which beats it on very
short strings ("").

Here is the cache hit case including your strlen+memcpy as 'LENCPY':

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        696157 us
MEMCPY:      825118 us
STRNCPY:    7983159 us
STRLCPY:   10787462 us
LENCPY:     6048339 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        700201 us
MEMCPY:      593701 us
STRNCPY:    7577380 us
STRLCPY:    3727801 us
LENCPY:     3169783 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        706283 us
MEMCPY:      792719 us
STRNCPY:    7870425 us
STRLCPY:     681334 us
LENCPY:     2062983 us

In my first set of numbers, every call was a cache hit. In the
N=(1024*1024) runs here, every call is a cache miss, and the 64-byte
blocks are spread equally over 64 Mbytes of memory. I've attached the code
for your consideration. x.c contains the routines I used to perform the
tests. y.c is the main program. strlcpy.c is copied unchanged from the
online reference. The compilation steps are described above. STRING is the
string to try out. N is the number of 64-byte blocks to allocate. ALIGN is
the number of bytes to offset the array by when storing / reading /
writing. ALIGN should be >= 0.
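
For readers without the attachment, the roles of N and ALIGN can be sketched roughly as follows. This is not the attached x.c / y.c: copy_fn, lencpy () as written here, time_copies () and the use of gettimeofday () are all assumptions made for illustration. LENCPY is taken to be strlen () followed by memcpy (), as described in the thread.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BLOCK 64   /* bytes per block, as in the description above */

typedef void (*copy_fn)(char *dst, const char *src);

/* strlen () followed by memcpy (), clamped to one block. */
static void lencpy(char *dst, const char *src)
{
    size_t len = strlen(src);

    if (len > BLOCK - 1)
        len = BLOCK - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

static volatile char sink;   /* receives one byte read back after timing */

/* Copy src into each of n BLOCK-byte slots, `passes` times over, with
 * every slot shifted by `align` bytes.  Returns elapsed microseconds, or
 * -1 on allocation failure.  With n == 1 the same slot is reused on every
 * call; with a large n the slots span n * BLOCK bytes. */
static long time_copies(copy_fn fn, const char *src,
                        size_t n, size_t align, size_t passes)
{
    struct timeval t0, t1;
    char *arena = malloc(n * BLOCK + align);

    if (arena == NULL)
        return -1;
    memset(arena, 0, n * BLOCK + align);   /* touch the pages before timing */

    gettimeofday(&t0, NULL);
    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            fn(arena + align + i * BLOCK, src);
    gettimeofday(&t1, NULL);

    sink = arena[align + (n - 1) * BLOCK];  /* read back the last slot */
    free(arena);
    return (long) (t1.tv_sec - t0.tv_sec) * 1000000L
         + (long) (t1.tv_usec - t0.tv_usec);
}
```

A call such as time_copies (lencpy, "Short sentence.", 1024 * 1024, 1, 16) would correspond to the -DN="(1024*1024)" -DALIGN=1 case, and n = 1 to the all-in-cache case. The number of passes is arbitrary here.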

At N=1, it's all in cache. At N=1024*1024 it is taking up 64 Mbytes of
memory.


.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

