On 4/27/07, Davi Arnaut <[EMAIL PROTECTED]> wrote:
On 27/04/2007, at 00:14, Lucian Adrian Grijincu wrote:

> in apr-conv10-faster.patch you added:
>
> static const char digits[] = "0123456789";
> *--p = digits[magnitude % 10];
>
> Why is this faster than:
> *--p = (char) '0' + (magnitude % 10); ?

You have to take into account the entire loop. The fowling:

do {
        u_widest_int new_magnitude = magnitude / 10;
        *--p = (char) (magnitude - new_magnitude * 10 + '0');
        magnitude = new_magnitude;
} while (magnitude);

against:

do {
        *--p = digits[magnitude % 10];
} while ((magnitude /= 10) != 0);

digits is easily cacheable, fewer assignments.

I wasn't promoting the first version, I wanted to say the second one
could be made better this way:
do {
        *--p = (char) '0' + (magnitude % 10);
} while ((magnitude /= 10) != 0);

I don't quite like u_widest_int from the first version from a
portability standpoint, but I see that it's quite used in
conv_10_quad() and the rest :(
http://svn.apache.org/viewvc/apr/apr/trunk/strings/apr_snprintf.c?view=markup

--
Lucian Adrian Grijincu
>
> For your "faster" version, under the hood, the C compiler adds
> (magnitude % 10) to the address of digits and then copies the contents
> of the memory location represented by the sum's result into *--p.
>
> My version just adds (magnitude % 10) to '0' and stores the result
> in *--p.

Talk is cheap, let's benchmark! To see the generated assembly:

gcc -O2 -o bench bench.c -g
objdump -S bench > bench-asm

# Intel(R) Celeron(R) CPU 2.20GHz
[EMAIL PROTECTED] ~]$ gcc -o bench bench.c -O2 # uint32_t
[EMAIL PROTECTED] ~]$ ./bench $RANDOM
conv_1
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
conv_2
cycles:          236
cycles:          220
cycles:          224
cycles:          220
cycles:          224
cycles:          224
cycles:          224
cycles:          224
conv_1
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
cycles:          236
conv_2
cycles:          220
cycles:          224
cycles:          224
cycles:          224
cycles:          224
cycles:          224
cycles:          224
cycles:          224

[EMAIL PROTECTED] ~]$ gcc -o bench bench.c -O2 # uint64_t
[EMAIL PROTECTED] ~]$ ./bench $RANDOM |more
conv_1
cycles:          508
cycles:          532
cycles:          540
cycles:          468
cycles:          468
cycles:          468
cycles:          468
conv_2
cycles:         1188
cycles:          824
cycles:          896
cycles:          828
cycles:          824
cycles:          824
cycles:          820
conv_1
cycles:          524
cycles:          492
cycles:          468
cycles:          504
cycles:          468
cycles:          504
cycles:          468
conv_2
cycles:          768
cycles:          836
cycles:          836
cycles:          820
cycles:          820
cycles:          820
cycles:          820


> Am I missing something here?

Both code, after compiler optimizations, yield similar results but
hurts uint64_t (apr_uint64_t) case quite a bit. "Faster" was a
overstatement, I withdraw apr-conv10-faster.patch.

--
Davi Arnaut



Reply via email to