Richard Biener <richard.guent...@gmail.com> wrote: > On Tue, Oct 27, 2020 at 12:01 AM Stefan Kanthak <stefan.kant...@nexgo.de> > wrote: >> >> Richard Biener <richard.guent...@gmail.com> wrote: >> >>> On Sun, Oct 25, 2020 at 8:37 PM Stefan Kanthak <stefan.kant...@nexgo.de> >>> wrote: >>>> >>>> Hi, >>>> >>>> for the AMD64 alias x86_64 platform and the __int128_t [DW]type, >>>> the first few lines of the __mulvDI3() function from libgcc2.c >>>> >>>> | DWtype >>>> | __mulvDI3 (DWtype u, DWtype v) >>>> | { >>>> | /* The unchecked multiplication needs 3 Wtype x Wtype multiplications, >>>> | but the checked multiplication needs only two. */ >>>> | const DWunion uu = {.ll = u}; >>>> | const DWunion vv = {.ll = v}; >>>> | >>>> | if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1)) >>>> | { >>>> | /* u fits in a single Wtype. */ >>>> | if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), >>>> 1)) >>>> | { >>>> | /* v fits in a single Wtype as well. */ >>>> | /* A single multiplication. No overflow risk. */ >>>> | return (DWtype) uu.s.low * (DWtype) vv.s.low; >>>> | } >>>> >>>> are compiled to this braindead code (obtained from libgcc.a of >>>> GCC 10.2.0 installed on Debian): >>>> >>>> 0000000000000000 <__mulvti3>: >>>> 0: 41 55 push %r13 >>>> 2: 49 89 cb mov %rcx,%r11 >>>> 5: 48 89 d0 mov %rdx,%rax >>>> 8: 49 89 d2 mov %rdx,%r10 >>>> b: 41 54 push %r12 >>>> d: 49 89 fc mov %rdi,%r12 >>>> 10: 48 89 d1 mov %rdx,%rcx >>>> 13: 49 89 f0 mov %rsi,%r8 >>>> 16: 4c 89 e2 mov %r12,%rdx >>>> 19: 49 89 f5 mov %rsi,%r13 >>>> 1c: 53 push %rbx >>>> 1d: 48 89 fe mov %rdi,%rsi >>>> 20: 48 c1 fa 3f sar $0x3f,%rdx >>>> 24: 48 c1 f8 3f sar $0x3f,%rax >>>> 28: 4c 89 df mov %r11,%rdi >>>> 2b: 4c 39 c2 cmp %r8,%rdx >>>> 2e: 75 18 jne 48 <__mulvti3+0x48> >>>> 30: 4c 39 d8 cmp %r11,%rax >>>> 33: 75 6b jne a0 <__mulvti3+0xa0> >>>> 35: 4c 89 e0 mov %r12,%rax >>>> 38: 49 f7 ea imul %r10 >>>> 3b: 5b pop %rbx >>>> 3c: 41 5c pop %r12 >>>> 3e: 41 5d pop %r13 >>>> 40: c3 retq >>>> ... >>>> >>>> There are EIGHT superfluous MOV instructions here, clobbering the >>>> non-volatile registers RBX, R12 and R13, plus THREE superfluous >>>> PUSH/POP pairs. >>>> >>>> What stops GCC from generating the following straightforward code >>>> (11 instructions in 31 bytes instead of 25 instructions in 65 bytes)? >>>> >>>> .intel_syntax noprefix >>>> __mulvti3: >>>> mov r8, rdi >>>> mov r9, rdx >>>> sra r8, 63 >>>> sra r9, 63 >>>> cmp r8, rsi >>>> jne __mulvti3+0x48+65-31 >>>> cmp r9, rcx >>>> jne __mulvti3+0xa0+65-31 >>>> mov rax, rdi >>>> imul rdx >>>> ret >>>> ... >>>> >>>> >>>> not amused >>> >>> can you open a bugreport please? >> >> >> I'd like to discuss alternative implementations of __mulvti3() first: >> the very first lines of libgcc2.c reads >> >> | /* More subroutines needed by GCC output code on some machines. */ >> | /* Compile this one with gcc. */ >> >> >> Why don't you take advantage of that and implement __mulvDI3() as >> follows? >> >> DWtype >> __mulvDI3 (DWtype multiplicand, DWtype multiplier) >> { >> DWtype product; >> >> if (__builtin_mul_overflow(multiplicand, multiplier, &product)) >> abort(); >> >> return product; >> } > > Sure, that's possible - but the main concern is that such fancy builtins > can end up calling libgcc.
1. these "fancy" builtins are provided by GCC itself. 2. "fancy" builtins like __builtin_clzll() or __builtin_expect() are already used in the sources for libgcc. > Which isn't really relevant for the trap-on-overflow variants though. Correct. JFTR: if __builtin_*_overflow() were calling libgcc I would not propose to implement libgcc's __OPvMODE() functions using these builtins! > I guess such change is desirable (also for the other arithmetic ops) That's why I started this thread in the first place! See <https://skanthak.homepage.t-online.de/gcc.html> for quite some more bugs and deficiencies, plus properly optimised assembly code. > and code generation improvements should be done for the > __builtin_mul_overflow expansion, outside of libgcc. Correct. My second and third example show that the current implementation of __builtin_mul_overflow() (at least for AMD64) is but RATHER b(raindea)d. My initial example shows that the register allocator and the optimiser need quite SOME optimisation, independent of any code generation improvements for __builtin_*_overflow(). And my web page shows that the code for MANY of the libgcc functions is FAR from optimised... > Note __OPvMODE are considered legacy and -ftrapv dysfunctional > in general ... That's UNDOCUMENTED and therefore irrelevant for your users! <https://gcc.gnu.org/onlinedocs/gccint/Integer-library-routines.html> does NOT contain the word "legacy" at all. It also misses the synopsis' for __absvti2(), __negvti2(), __addvti3(), __subvti3() and __mulvti3() The same holds for the word "dysfunctional", which is not contained in any of the onlnedocs! [ source and assembly of __mulvti3() ] Stefan