Re: [HACKERS] Faster StrNCpy

mark Fri, 29 Sep 2006 14:23:52 -0700

If anybody is curious, here are my numbers for an AMD X2 3800+:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be slow."' -o x x.c y.c strlcpy.c ; ./x
NONE:        620268 us
MEMCPY:      683135 us
STRNCPY:    7952930 us
STRLCPY:   10042364 us


$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -o x x.c y.c strlcpy.c ; ./x
NONE:        554694 us
MEMCPY:      691390 us
STRNCPY:    7759933 us
STRLCPY:    3710627 us

$ gcc -O3 -std=c99 -DSTRING='""' -o x x.c y.c strlcpy.c ; ./x
NONE:        631266 us
MEMCPY:      775340 us
STRNCPY:    7789267 us
STRLCPY:     550430 us

Each run makes 100 million calls to each of the functions. Each
function takes a 'dst' and a 'src' argument, and assumes that it is
copying 64 bytes from 'src' to 'dst'. The none function does nothing,
the memcpy function calls memcpy(), the strncpy function calls
strncpy(), and the strlcpy function calls the strlcpy() that was
posted from the BSD sources. (glibc doesn't have strlcpy() on my
machine.)
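
For readers without x.c and y.c at hand, here is a minimal sketch of
what such a harness could look like. It is reconstructed from the
description above and the posted assembly, not taken from the original
files. The names sketch_strlcpy() and time_calls() are hypothetical,
the strlcpy stand-in is a simplified reimplementation rather than the
BSD source, and clock() is an assumption since the post does not say
which timer was used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define COPY_LEN 64  /* fixed size, matching the $64 constant in the assembly */

/* Hypothetical stand-in for the BSD strlcpy(); renamed so it cannot
 * clash with a libc that already declares strlcpy(). */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t i = 0;
    if (size > 0) {
        for (; i < size - 1 && src[i] != '\0'; i++)
            dst[i] = src[i];
        dst[i] = '\0';          /* always terminate, never pad */
    }
    while (src[i] != '\0')      /* finish measuring the source */
        i++;
    return i;                   /* length of src */
}

/* The four functions under test. src must point at >= 64 readable bytes,
 * because x_memcpy() always reads the full 64. */
void x_none(char *dst, const char *src)    { (void)dst; (void)src; }
void x_memcpy(char *dst, const char *src)  { memcpy(dst, src, COPY_LEN); }
void x_strncpy(char *dst, const char *src) { strncpy(dst, src, COPY_LEN); }
void x_strlcpy(char *dst, const char *src) { sketch_strlcpy(dst, src, COPY_LEN); }

typedef void (*copy_fn)(char *dst, const char *src);

/* Calls fn n_calls times and returns the elapsed CPU time in microseconds.
 * Going through a function pointer stands in for the separate-file trick,
 * which keeps the compiler from inlining the calls away. */
long time_calls(copy_fn fn, char *dst, const char *src, long n_calls)
{
    assert(n_calls > 0);
    clock_t start = clock();
    for (long i = 0; i < n_calls; i++)
        fn(dst, src);
    clock_t end = clock();
    return (long)((double)(end - start) * 1000000.0 / CLOCKS_PER_SEC);
}
```

A caller would fill a 64-byte 'src' buffer with the test string and
call, for example, time_calls(x_strncpy, dst, src, 100000000L).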

This makes it clear how much overhead the additional logic adds.
memcpy() costs approximately the same as doing nothing at all.
strncpy() is always expensive. strlcpy() is often more expensive than
memcpy(), except in the empty string case.
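
One likely reason strncpy() costs about the same for every input is
that the C standard requires it to pad the destination with NUL bytes
up to the full count, so it always writes all 64 bytes. The sketch
below illustrates that behaviour. The function name is hypothetical,
and it assumes the source string contains no 0x7f bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns how many bytes of a 64-byte buffer strncpy() overwrote.
 * The buffer is pre-filled with 0x7f so that written bytes stand out. */
size_t bytes_touched_by_strncpy(const char *src)
{
    char dst[64];
    size_t touched = 0;

    assert(src != NULL);
    memset(dst, 0x7f, sizeof dst);
    strncpy(dst, src, sizeof dst);   /* copies src, then zero-pads to 64 */

    for (size_t i = 0; i < sizeof dst; i++)
        if (dst[i] != 0x7f)
            touched++;
    return touched;
}
```

Under that assumption the result is 64 whether the source is empty or
long, which fits the nearly constant STRNCPY timings above.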

These tests do not properly model the effects of real memory; they do,
however, model the effects of cache memory. I would suggest that the
results are exaggerated, but not invalid.

For anybody doubting the none vs memcpy numbers, I've included the
generated assembly code below. I chalk it entirely up to fully
utilizing the parallelization capability of the CPU. Although 16 movq
instructions are executed, they can be executed fully in parallel.

It makes it fairly clear to me that all of these operations are
pretty fast. Are we sure this is a real bottleneck? Even the slowest
operation above, strlcpy() on a very long string, appears to execute
about 10 calls per microsecond (10042364 us for 100 million calls, or
roughly 100 ns per call). Perhaps my tests are too easy for my CPU and
I need to make it access many different 64-byte blocks? :-)
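
A rough sketch of what "many different 64-byte blocks" could look like
is below. It is only an illustration: the function name, the stride of
7919 and the fill string are assumptions, and timing is left to
whatever harness wraps the call. With enough blocks the working set no
longer fits in cache, which is the effect the tests above do not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 64

/* Performs n_calls 64-byte copies spread over n_blocks distinct source
 * and destination blocks. Returns the number of destination blocks that
 * received a copy, or 0 if allocation failed. */
size_t copy_across_blocks(size_t n_blocks, long n_calls)
{
    assert(n_blocks > 0 && n_calls > 0);

    char *src = calloc(n_blocks, BLOCK);
    char *dst = calloc(n_blocks, BLOCK);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0;
    }
    for (size_t b = 0; b < n_blocks; b++)
        strcpy(src + b * BLOCK, "Short sentence.");

    /* Jump by a large odd stride so consecutive calls land on
     * different cache lines instead of reusing one hot block. */
    size_t b = 0;
    for (long i = 0; i < n_calls; i++) {
        memcpy(dst + b * BLOCK, src + b * BLOCK, BLOCK);
        b = (b + 7919) % n_blocks;
    }

    /* Count the blocks that were actually written. */
    size_t filled = 0;
    for (b = 0; b < n_blocks; b++)
        if (dst[b * BLOCK] == 'S')
            filled++;

    free(src);
    free(dst);
    return filled;
}
```

Swapping the memcpy() call for strncpy() or strlcpy() would give the
corresponding variants.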

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED]     
__________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/

        .file   "x.c"
        .text
        .p2align 4,,15
.globl x_none
        .type   x_none, @function
x_none:
.LFB14:
        rep ; ret
.LFE14:
        .size   x_none, .-x_none
        .p2align 4,,15
.globl x_strlcpy
        .type   x_strlcpy, @function
x_strlcpy:
.LFB17:
        movl    $64, %edx
        jmp     strlcpy
.LFE17:
        .size   x_strlcpy, .-x_strlcpy
        .p2align 4,,15
.globl x_strncpy
        .type   x_strncpy, @function
x_strncpy:
.LFB16:
        movl    $64, %edx
        jmp     strncpy
.LFE16:
        .size   x_strncpy, .-x_strncpy
        .p2align 4,,15
.globl x_memcpy
        .type   x_memcpy, @function
x_memcpy:
.LFB15:
        movq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        movq    32(%rsi), %rax
        movq    %rax, 32(%rdi)
        movq    40(%rsi), %rax
        movq    %rax, 40(%rdi)
        movq    48(%rsi), %rax
        movq    %rax, 48(%rdi)
        movq    56(%rsi), %rax
        movq    %rax, 56(%rdi)
        ret
.LFE15:
        .size   x_memcpy, .-x_memcpy
        .section        .eh_frame,"a",@progbits
.Lframe1:
        .long   .LECIE1-.LSCIE1
.LSCIE1:
        .long   0x0
        .byte   0x1
        .string "zR"
        .uleb128 0x1
        .sleb128 -8
        .byte   0x10
        .uleb128 0x1
        .byte   0x3
        .byte   0xc
        .uleb128 0x7
        .uleb128 0x8
        .byte   0x90
        .uleb128 0x1
        .align 8
.LECIE1:
.LSFDE1:
        .long   .LEFDE1-.LASFDE1
.LASFDE1:
        .long   .LASFDE1-.Lframe1
        .long   .LFB14
        .long   .LFE14-.LFB14
        .uleb128 0x0
        .align 8
.LEFDE1:
.LSFDE3:
        .long   .LEFDE3-.LASFDE3
.LASFDE3:
        .long   .LASFDE3-.Lframe1
        .long   .LFB17
        .long   .LFE17-.LFB17
        .uleb128 0x0
        .align 8
.LEFDE3:
.LSFDE5:
        .long   .LEFDE5-.LASFDE5
.LASFDE5:
        .long   .LASFDE5-.Lframe1
        .long   .LFB16
        .long   .LFE16-.LFB16
        .uleb128 0x0
        .align 8
.LEFDE5:
.LSFDE7:
        .long   .LEFDE7-.LASFDE7
.LASFDE7:
        .long   .LASFDE7-.Lframe1
        .long   .LFB15
        .long   .LFE15-.LFB15
        .uleb128 0x0
        .align 8
.LEFDE7:
        .ident  "GCC: (GNU) 4.1.1 20060525 (Red Hat 4.1.1-1)"
        .section        .note.GNU-stack,"",@progbits

