bitops: implement __test_bit

Michael S. Tsirkin Tue, 01 Sep 2015 08:04:11 -0700

On Tue, Sep 01, 2015 at 01:39:42PM +0200, Ingo Molnar wrote:
> 
> * Michael S. Tsirkin <[email protected]> wrote:
> 
> > On Tue, Sep 01, 2015 at 11:24:22AM +0200, Ingo Molnar wrote:
> > > 
> > > * Michael S. Tsirkin <[email protected]> wrote:
> > > 
> > > > I applied this patch on top of mine:
> > > 
> > > Yeah, looks similar to the one I sent.
> > > 
> > > > -static inline int __variable_test_bit(long nr, const unsigned long 
> > > > *addr)
> > > > -{
> > > > -       int oldbit;
> > > > -
> > > > -       asm volatile("bt %2,%1\n\t"
> > > > -                    "sbb %0,%0"
> > > > -                    : "=r" (oldbit)
> > > > -                    : "m" (*addr), "Ir" (nr));
> > > > -
> > > > -       return oldbit;
> > > > -}
> > > 
> > > > And the code size went up:
> > > > 
> > > >    134836    2997    8372  146205   23b1d arch/x86/kvm/kvm-intel.ko  ->
> > > >    134846    2997    8372  146215   23b27 arch/x86/kvm/kvm-intel.ko     
> > > > 
> > > >    342690   47640     441  390771   5f673 arch/x86/kvm/kvm.ko ->
> > > >    342738   47640     441  390819   5f6a3 arch/x86/kvm/kvm.ko   
> > > > 
> > > > I tried removing  __always_inline, this had no effect.
> > > 
> > > But code size isn't the only factor.
> > > 
> > > Uros Bizjak pointed out that the reason GCC does not use the "BT reg,mem" 
> > > instruction is that it's highly suboptimal even on recent 
> > > microarchitectures, 
> > > Sandy Bridge is listed as having a 10 cycles latency (!) for this 
> > > instruction:
> > > 
> > >    http://www.agner.org/optimize/instruction_tables.pdf
> > > 
> > > this instruction had bad latency going back to Pentium 4 CPUs.
> > > 
> > > ... so unless something changed in this area with Skylake I think using 
> > > the 
> > > __variable_test_bit() code of the kernel is a bad choice and looking at 
> > > kernel 
> > > size only is misleading.
> > > 
> > > It makes sense for atomics, but not for unlocked access.
> > > 
> > > Thanks,
> > > 
> > >   Ingo
> > 
> > Hmm - so do you take back the ack?
> 
> I have no strong feelings either way, it simply strikes me as misguided to 
> explicitly optimize for something that is listed as a high overhead 
> instruction.
> 
> Assuming it really is high overhead:
> 
> > I compared this:
> > int main(int argc, char **argv)
> > {
> > 
> >         long long int i;
> >         const unsigned long addr = 0;
> >         for (i = 0; i < 1000000000ull; ++i) {
> >                 asm volatile("");
> >                 if (__variable_test_bit(1, &addr))
> >                 asm volatile("");
> >         }
> >         return 0;
> > }
> > 
> > with the __constant_test_bit variant.
> > 
> > __constant_test_bit code does appear to be slower on an i7 processor.



Ouch. I meant the reverse:

 [mst@robin test]$ diff a.c b.c
 31c31
 <               if (__variable_test_bit(1, &addr))
 ---
 >               if (__constant_test_bit(1, &addr))

[mst@robin test]$ gcc -Wall -O2 a.c; time ./a.out

real    0m0.532s
user    0m0.531s
sys     0m0.000s
[mst@robin test]$ gcc -Wall -O2 b.c; time ./a.out

real    0m0.517s
user    0m0.517s
sys     0m0.000s


So __constant_test_bit is faster even though it's using more
instructions
$ gcc -Wall -O2 a.c; -objdump -ld ./a.out


08048414 <main>:
main():
 8048414:       8d 4c 24 04             lea    0x4(%esp),%ecx
 8048418:       83 e4 f8                and    $0xfffffff8,%esp
 804841b:       ff 71 fc                pushl  -0x4(%ecx)
 804841e:       55                      push   %ebp
 804841f:       89 e5                   mov    %esp,%ebp
 8048421:       51                      push   %ecx
 8048422:       83 ec 14                sub    $0x14,%esp
 8048425:       c7 45 ec 00 00 00 00    movl   $0x0,-0x14(%ebp)
 804842c:       c7 45 f0 00 00 00 00    movl   $0x0,-0x10(%ebp)
 8048433:       c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%ebp)
 804843a:       eb 1a                   jmp    8048456 <main+0x42>
 804843c:       8d 45 ec                lea    -0x14(%ebp),%eax
 804843f:       50                      push   %eax
 8048440:       6a 01                   push   $0x1
 8048442:       e8 b4 ff ff ff          call   80483fb <__variable_test_bit>
 8048447:       83 c4 08                add    $0x8,%esp
 804844a:       85 c0                   test   %eax,%eax
 804844c:       74 00                   je     804844e <main+0x3a>
 804844e:       83 45 f0 01             addl   $0x1,-0x10(%ebp)
 8048452:       83 55 f4 00             adcl   $0x0,-0xc(%ebp)
 8048456:       8b 45 f0                mov    -0x10(%ebp),%eax
 8048459:       8b 55 f4                mov    -0xc(%ebp),%edx
 804845c:       83 fa 00                cmp    $0x0,%edx
 804845f:       72 db                   jb     804843c <main+0x28>
 8048461:       83 fa 00                cmp    $0x0,%edx
 8048464:       77 07                   ja     804846d <main+0x59>
 8048466:       3d ff c9 9a 3b          cmp    $0x3b9ac9ff,%eax
 804846b:       76 cf                   jbe    804843c <main+0x28>
 804846d:       b8 00 00 00 00          mov    $0x0,%eax
 8048472:       8b 4d fc                mov    -0x4(%ebp),%ecx
 8048475:       c9                      leave  
 8048476:       8d 61 fc                lea    -0x4(%ecx),%esp
 8048479:       c3                      ret    
 804847a:       66 90                   xchg   %ax,%ax
 804847c:       66 90                   xchg   %ax,%ax
 804847e:       66 90                   xchg   %ax,%ax

$ gcc -Wall -O2 b.c; -objdump -ld ./a.out

080483fb <main>:
main():
 80483fb:       8d 4c 24 04             lea    0x4(%esp),%ecx
 80483ff:       83 e4 f8                and    $0xfffffff8,%esp
 8048402:       ff 71 fc                pushl  -0x4(%ecx)
 8048405:       55                      push   %ebp
 8048406:       89 e5                   mov    %esp,%ebp
 8048408:       51                      push   %ecx
 8048409:       83 ec 24                sub    $0x24,%esp
 804840c:       c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%ebp)
 8048413:       c7 45 f0 00 00 00 00    movl   $0x0,-0x10(%ebp)
 804841a:       c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%ebp)
 8048421:       eb 44                   jmp    8048467 <main+0x6c>
 8048423:       c7 45 ec 01 00 00 00    movl   $0x1,-0x14(%ebp)
 804842a:       8d 45 e4                lea    -0x1c(%ebp),%eax
 804842d:       89 45 e8                mov    %eax,-0x18(%ebp)
 8048430:       8b 45 ec                mov    -0x14(%ebp),%eax
 8048433:       c1 f8 05                sar    $0x5,%eax
 8048436:       8d 14 85 00 00 00 00    lea    0x0(,%eax,4),%edx
 804843d:       8b 45 e8                mov    -0x18(%ebp),%eax
 8048440:       01 d0                   add    %edx,%eax
 8048442:       8b 10                   mov    (%eax),%edx
 8048444:       8b 45 ec                mov    -0x14(%ebp),%eax
 8048447:       83 e0 1f                and    $0x1f,%eax
 804844a:       89 c1                   mov    %eax,%ecx
 804844c:       d3 ea                   shr    %cl,%edx
 804844e:       89 d0                   mov    %edx,%eax
 8048450:       83 e0 01                and    $0x1,%eax
 8048453:       85 c0                   test   %eax,%eax
 8048455:       0f 95 c0                setne  %al
 8048458:       0f b6 c0                movzbl %al,%eax
 804845b:       85 c0                   test   %eax,%eax
 804845d:       74 00                   je     804845f <main+0x64>
 804845f:       83 45 f0 01             addl   $0x1,-0x10(%ebp)
 8048463:       83 55 f4 00             adcl   $0x0,-0xc(%ebp)
 8048467:       8b 45 f0                mov    -0x10(%ebp),%eax
 804846a:       8b 55 f4                mov    -0xc(%ebp),%edx
 804846d:       83 fa 00                cmp    $0x0,%edx
 8048470:       72 b1                   jb     8048423 <main+0x28>
 8048472:       83 fa 00                cmp    $0x0,%edx
 8048475:       77 07                   ja     804847e <main+0x83>
 8048477:       3d ff c9 9a 3b          cmp    $0x3b9ac9ff,%eax
 804847c:       76 a5                   jbe    8048423 <main+0x28>
 804847e:       b8 00 00 00 00          mov    $0x0,%eax
 8048483:       83 c4 24                add    $0x24,%esp
 8048486:       59                      pop    %ecx
 8048487:       5d                      pop    %ebp
 8048488:       8d 61 fc                lea    -0x4(%ecx),%esp
 804848b:       c3                      ret    
 804848c:       66 90                   xchg   %ax,%ax
 804848e:       66 90                   xchg   %ax,%ax


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] x86/bitops: implement __test_bit

Reply via email to