Re: [__mulvti3] register allocator plays shell game

Stefan Kanthak Tue, 27 Oct 2020 03:04:02 -0700

Richard Biener <richard.guent...@gmail.com> wrote:

> On Tue, Oct 27, 2020 at 12:01 AM Stefan Kanthak <stefan.kant...@nexgo.de> 
> wrote:
>>
>> Richard Biener <richard.guent...@gmail.com> wrote:
>>
>>> On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak <stefan.kant...@nexgo.de> 
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> for the AMD64 alias x86_64 platform and the __int128_t [DW]type,
>>>> the first few lines of the __mulvDI3() function from libgcc2.c
>>>>
>>>> | DWtype
>>>> | __mulvDI3 (DWtype u, DWtype v)
>>>> | {
>>>> |   /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
>>>> |      but the checked multiplication needs only two.  */
>>>> |   const DWunion uu = {.ll = u};
>>>> |   const DWunion vv = {.ll = v};
>>>> |
>>>> |   if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
>>>> |     {
>>>> |       /* u fits in a single Wtype.  */
>>>> |       if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 
>>>> 1))
>>>> |  {
>>>> |    /* v fits in a single Wtype as well.  */
>>>> |    /* A single multiplication.  No overflow risk.  */
>>>> |    return (DWtype) uu.s.low * (DWtype) vv.s.low;
>>>> |  }
>>>>
>>>> are compiled to this braindead code (obtained from libgcc.a of
>>>> GCC 10.2.0 installed on Debian):
>>>>
>>>> 0000000000000000 <__mulvti3>:
>>>>    0: 41 55                 push   %r13
>>>>    2: 49 89 cb              mov    %rcx,%r11
>>>>    5: 48 89 d0              mov    %rdx,%rax
>>>>    8: 49 89 d2              mov    %rdx,%r10
>>>>    b: 41 54                 push   %r12
>>>>    d: 49 89 fc              mov    %rdi,%r12
>>>>   10: 48 89 d1              mov    %rdx,%rcx
>>>>   13: 49 89 f0              mov    %rsi,%r8
>>>>   16: 4c 89 e2              mov    %r12,%rdx
>>>>   19: 49 89 f5              mov    %rsi,%r13
>>>>   1c: 53                    push   %rbx
>>>>   1d: 48 89 fe              mov    %rdi,%rsi
>>>>   20: 48 c1 fa 3f           sar    $0x3f,%rdx
>>>>   24: 48 c1 f8 3f           sar    $0x3f,%rax
>>>>   28: 4c 89 df              mov    %r11,%rdi
>>>>   2b: 4c 39 c2              cmp    %r8,%rdx
>>>>   2e: 75 18                 jne    48 <__mulvti3+0x48>
>>>>   30: 4c 39 d8              cmp    %r11,%rax
>>>>   33: 75 6b                 jne    a0 <__mulvti3+0xa0>
>>>>   35: 4c 89 e0              mov    %r12,%rax
>>>>   38: 49 f7 ea              imul   %r10
>>>>   3b: 5b                    pop    %rbx
>>>>   3c: 41 5c                 pop    %r12
>>>>   3e: 41 5d                 pop    %r13
>>>>   40: c3                    retq
>>>> ...
>>>>
>>>> There are EIGHT superfluous MOV instructions here, clobbering the
>>>> non-volatile registers RBX, R12 and R13, plus THREE superfluous
>>>> PUSH/POP pairs.
>>>>
>>>> What stops GCC from generating the following straightforward code
>>>> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)?
>>>>
>>>> .intel_syntax noprefix
>>>> __mulvti3:
>>>>     mov   r8, rdi
>>>>     mov   r9, rdx
>>>>     sra   r8, 63
>>>>     sra   r9, 63
>>>>     cmp   r8, rsi
>>>>     jne   __mulvti3+0x48+65-31
>>>>     cmp   r9, rcx
>>>>     jne   __mulvti3+0xa0+65-31
>>>>     mov   rax, rdi
>>>>     imul  rdx
>>>>     ret
>>>> ...
>>>>
>>>>
>>>> not amused
>>>
>>> can you open a bugreport please?
>>
>>
>> I'd like to discuss alternative implementations of __mulvti3() first:
>> the very first lines of libgcc2.c reads
>>
>> | /* More subroutines needed by GCC output code on some machines.  */
>> | /* Compile this one with gcc.  */
>>
>>
>> Why don't you take advantage of that and implement __mulvDI3() as
>> follows?
>>
>> DWtype
>> __mulvDI3 (DWtype multiplicand, DWtype multiplier)
>> {
>>     DWtype product;
>>
>>     if (__builtin_mul_overflow(multiplicand, multiplier, &product))
>>         abort();
>>
>>     return product;
>> }
> 
> Sure, that's possible - but the main concern is that such fancy builtins
> can end up calling libgcc.


1. these "fancy" builtins are provided by GCC itself.
2. "fancy" builtins like __builtin_clzll() or __builtin_expect() are already
   used in the sources for libgcc.

> Which isn't really relevant for the trap-on-overflow variants though.

Correct.
JFTR: if __builtin_*_overflow() were calling libgcc I would not propose
      to implement libgcc's __OPvMODE() functions using these builtins!

> I guess such change is desirable (also for the other arithmetic ops)

That's why I started this thread in the first place!
See <https://skanthak.homepage.t-online.de/gcc.html> for quite some
more bugs and deficiencies, plus properly optimised assembly code.

> and code generation improvements should be done for the
> __builtin_mul_overflow expansion, outside of libgcc.

Correct.
My second and third example show that the current implementation of
__builtin_mul_overflow() (at least for AMD64) is but RATHER b(raindea)d.
My initial example shows that the register allocator and the optimiser
need quite SOME optimisation, independent of any code generation
improvements for __builtin_*_overflow().
And my web page shows that the code for MANY of the libgcc functions is
FAR from optimised...

> Note __OPvMODE are considered legacy and -ftrapv dysfunctional
> in general ...

That's UNDOCUMENTED and therefore irrelevant for your users!
<https://gcc.gnu.org/onlinedocs/gccint/Integer-library-routines.html>
does NOT contain the word "legacy" at all.
It also misses the synopsis' for __absvti2(), __negvti2(), __addvti3(),
__subvti3() and __mulvti3()

The same holds for the word "dysfunctional", which is not contained in
any of the onlnedocs!

[ source and assembly of __mulvti3() ]

Stefan

Re: [__mulvti3] register allocator plays shell game

Reply via email to