today I noticed this inefficient implementation for the "find first set bit" function:

This is interesting:

I've tried this tiny benchmark with a[] being filled with the powers of two:

        start = gethrtime();
        for (j = 0; j < iters; j++)
                for (i = 0; i <= 33; i++) {
                        x[i] = ffs(a[i]);
                }
        end = gethrtime();

        printf("Avg ffs time = %lld nsec\n", (end - start) / iters);

        start = gethrtime();
        for (j = 0; j < iters; j++)
                for (i = 0; i <= 33; i++) {
                        y[i] = lowbit(a[i]);
                }
        end = gethrtime();

        printf("Avg lowbit time = %lld nsec\n", (end - start) / iters);

If I just compile with gcc, I get

Avg ffs time = 794 nsec
Avg lowbit time = 111 nsec

but it I compile with gcc -m64 -c -funroll-loops -O2 lowbit_ffs.c, I will not only get unrolled loops, but gcc will also replace the ffs calls with bsfl

    main+0xc4:              48 89 04 d5 00 21  movq   %rax,0x412100(,%rdx,8)    
<x>
                            41 00
    main+0xcc:              49 63 d4           movslq %r12d,%rdx
    main+0xcf:              0f bc 04 d5 60 1e  bsfl   0x411e60(,%rdx,8),%eax    
<a>
                            41 00

so it's not surprising that calling lowbit will be slower, even if it's assembly - just because of the function call (yes, there's -finline-functions, but that's not realistic for the kernel).

Avg ffs time = 62 nsec
Avg lowbit time = 82 nsec

The interesting part is really that I see those calls to (inefficient) ffs with dtrace. To get an idea of the amount of function calls, I provided a comparison with swtch() calls:

r...@haggis:~# dtrace -n 'fbt::ffs:entry { @ffs=count(); } fbt::swtch:entry { @swtch=count(); }'
dtrace: description 'fbt::ffs:entry ' matched 2 probes
^C

            14430
            32989

Note: I suspect this is Xen specific...

r...@haggis:~# uname -a
SunOS haggis 5.11 snv_111 i86pc i386 i86xpv


I am really just playing with this stuff, and I really don't want to judge on how much of a difference this would make for the system as a whole, but it seems like a sensible thing to

 - either replace ffs() with an optimized version or
 - let the compiler optimize ffs()

Would anyone be interested in spending the 30 minutes it takes to integrate this? ;-) [I know it probably needs more effort, yes]

Thanks, Nils
_______________________________________________
opensolaris-code mailing list
opensolaris-code@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code

Reply via email to