On Fri, Sep 1, 2023 at 3:35 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > Hi Manolis, > Many thanks. If you haven't already, could you create/file a > bug report at https://gcc.gnu.org/bugzilla/ which ensures this > doesn't get lost/forgotten. It provides a PR number for tracking > discussions, and patches/fixes with PR numbers are (often) > prioritized during the review and approval process. > Sure, opened as PR111267.
> I'll investigate what's going on. Either my "improvements" > need to be disabled for V2SF arguments, or the middle/back > end needs to figure out how to efficiently shuffle these values, > without reload moving them via integer registers, at least as > efficiently as we were before. As you/clang show, we could > do better. > Sounds good, thanks a lot! Manolis > Thanks again, and sorry for any inconvenience. > Best regards, > Roger > -- > > > -----Original Message----- > > From: Manolis Tsamis <manolis.tsa...@vrull.eu> > > Sent: 01 September 2023 11:45 > > To: Uros Bizjak <ubiz...@gmail.com> > > Cc: Roger Sayle <ro...@nextmovesoftware.com>; gcc-patches@gcc.gnu.org > > Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in > > ix86_expand_move). > > > > Hi Roger, > > > > I've (accidentally) found a codegen regression that I bisected down to this > > patch. > > For these two functions: > > > > typedef struct { > > float minx, miny; > > float maxx, maxy; > > } AABB; > > > > int TestOverlap(AABB a, AABB b) { > > return a.minx <= b.maxx > > && a.miny <= b.maxy > > && a.maxx >= b.minx > > && a.maxx >= b.minx; > > } > > > > int TestOverlap2(AABB a, AABB b) { > > return a.miny <= b.maxy > > && a.maxx >= b.minx; > > } > > > > GCC used to produce this code: > > > > TestOverlap: > > comiss xmm3, xmm0 > > movq rdx, xmm0 > > movq rsi, xmm1 > > movq rax, xmm3 > > jb .L10 > > shr rdx, 32 > > shr rax, 32 > > movd xmm0, eax > > movd xmm4, edx > > comiss xmm0, xmm4 > > jb .L10 > > movd xmm1, esi > > xor eax, eax > > comiss xmm1, xmm2 > > setnb al > > ret > > .L10: > > xor eax, eax > > ret > > TestOverlap2: > > shufps xmm0, xmm0, 85 > > shufps xmm3, xmm3, 85 > > comiss xmm3, xmm0 > > jb .L17 > > xor eax, eax > > comiss xmm1, xmm2 > > setnb al > > ret > > .L17: > > xor eax, eax > > ret > > > > After this patch codegen gets much worse: > > > > TestOverlap: > > movq rax, xmm1 > > movq rdx, xmm2 > > movq rsi, xmm0 > > mov rdi, rax > > movq rax, xmm3 > > mov rcx, rsi > > xchg rdx, rax > > movd xmm1, edx > > mov rsi, rax > > mov rax, rdx > > comiss xmm1, xmm0 > > jb .L10 > > shr rcx, 32 > > shr rax, 32 > > movd xmm0, eax > > movd xmm4, ecx > > comiss xmm0, xmm4 > > jb .L10 > > movd xmm0, esi > > movd xmm1, edi > > xor eax, eax > > comiss xmm1, xmm0 > > setnb al > > ret > > .L10: > > xor eax, eax > > ret > > TestOverlap2: > > movq rdx, xmm2 > > movq rax, xmm3 > > movq rsi, xmm0 > > xchg rdx, rax > > mov rcx, rsi > > mov rsi, rax > > mov rax, rdx > > shr rcx, 32 > > shr rax, 32 > > movd xmm4, ecx > > movd xmm0, eax > > comiss xmm0, xmm4 > > jb .L17 > > movd xmm0, esi > > xor eax, eax > > comiss xmm1, xmm0 > > setnb al > > ret > > .L17: > > xor eax, eax > > ret > > > > I saw that you've been improving i386 argument passing, so maybe this is > > just a > > missed case of these additions? > > > > (Can also be seen here https://godbolt.org/z/E4xrEn6KW) > > > > PS: I found the code that clang generates, with cmpleps + pextrw to avoid > > the fp- > > >int->fp + shr interesting. I wonder if something like this could be added > > >to GCC as > > well. > > > > Thanks! > > Manolis > > > > On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches <gcc- > > patc...@gcc.gnu.org> wrote: > > > > > > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <ro...@nextmovesoftware.com> > > wrote: > > > > > > > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle > > > > > <ro...@nextmovesoftware.com> > > > > > wrote: > > > > > > > > > > > > > > > > > > Passing 128-bit integer (TImode) parameters on x86_64 can > > > > > > sometimes result in surprising code. Consider the example below > > > > > > (from PR > > 43644): > > > > > > > > > > > > __uint128 foo(__uint128 x, unsigned long long y) { > > > > > > return x+y; > > > > > > } > > > > > > > > > > > > which currently results in 6 consecutive movq instructions: > > > > > > > > > > > > foo: movq %rsi, %rax > > > > > > movq %rdi, %rsi > > > > > > movq %rdx, %rcx > > > > > > movq %rax, %rdi > > > > > > movq %rsi, %rax > > > > > > movq %rdi, %rdx > > > > > > addq %rcx, %rax > > > > > > adcq $0, %rdx > > > > > > ret > > > > > > > > > > > > The underlying issue is that during RTL expansion, we generate > > > > > > the following initial RTL for the x argument: > > > > > > > > > > > > (insn 4 3 5 2 (set (reg:TI 85) > > > > > > (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1 > > > > > > (nil)) > > > > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8) > > > > > > (reg:DI 87)) "pr43644-2.c":5:1 -1 > > > > > > (nil)) > > > > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ]) > > > > > > (reg:TI 85)) "pr43644-2.c":5:1 -1 > > > > > > (nil)) > > > > > > > > > > > > which by combine/reload becomes > > > > > > > > > > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ]) > > > > > > (const_int 0 [0])) "pr43644-2.c":5:1 -1 > > > > > > (nil)) > > > > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0) > > > > > > (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal} > > > > > > (expr_list:REG_DEAD (reg:DI 93) > > > > > > (nil))) > > > > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8) > > > > > > (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal} > > > > > > (expr_list:REG_DEAD (reg:DI 94) > > > > > > (nil))) > > > > > > > > > > > > where the heavy use of SUBREG SET_DESTs creates challenges for > > > > > > both combine and register allocation. > > > > > > > > > > > > The improvement proposed here is to avoid these problematic > > > > > > SUBREGs by adding (two) special cases to ix86_expand_move. For > > > > > > insn 4, which sets a TImode destination from a paradoxical > > > > > > SUBREG, to assign the lowpart, we can use an explicit zero > > > > > > extension (zero_extendditi2 was added in July 2022), and for > > > > > > insn 5, which sets the highpart of a TImode register we can use > > > > > > the *insvti_highpart_1 instruction (that was added in May 2023, > > > > > > after > > being approved for stage1 in January). > > > > > > This allows combine to work its magic, merging these insns into > > > > > > a > > > > > > *concatditi3 and from there into other optimized forms. > > > > > > > > > > How about we introduce *insvti_lowpart_1, similar to > > > > > *insvti_highpart_1, in the hope that combine is smart enough to > > > > > also combine these two instructions? IMO, faking insert to lowpart > > > > > of the register with zero_extend is a bit overkill, and could > > > > > hinder some other optimization opportunities (as perhaps hinted by > > > > > failing > > testcases). > > > > > > > > The use of ZERO_EXTEND serves two purposes, both the setting of the > > > > lowpart and of informing the RTL passes that the highpart is dead. > > > > Notice in the original RTL stream, i.e. current GCC, insn 25 is > > > > inserted by the .286r.init-regs pass, clearing the entirety of the > > > > TImode register (like a clobber), and preventing TI:84 from occupying > > > > the > > same registers as DI:93 and DI:94. > > > > > > > > If the middle-end had asked the backend to generate a SET to > > > > STRICT_LOWPART then our hands would be tied, but a paradoxical > > > > SUBREG allows us the freedom to set the highpart bits to a defined > > > > value (we could have used sign extension if that was cheap), which > > > > then simplifies data-flow and liveness analysis. Allowing the > > > > highpart to contain undefined or untouched data is exactly the sort of > > > > security > > side-channel leakage that the clear regs pass attempts to address. > > > > > > > > I can investigate an *insvti_lowpart_1, but I don't think it will > > > > help with this issue, i.e. it won't prevent init-regs from > > > > clobbering/clearing > > TImode parameters. > > > > > > Thanks for the explanation, the patch is OK then. > > > > > > Thanks, > > > Uros. > > > > > > > > > > > > > So for the test case above, we now generate only a single movq: > > > > > > > > > > > > foo: movq %rdx, %rax > > > > > > xorl %edx, %edx > > > > > > addq %rdi, %rax > > > > > > adcq %rsi, %rdx > > > > > > ret > > > > > > > > > > > > But there is a little bad news. This patch causes two (minor) > > > > > > missed optimization regressions on x86_64; > > > > > > gcc.target/i386/pr82580.c and gcc.target/i386/pr91681-1.c. As > > > > > > shown in the test case above, we're no longer generating adcq > > > > > > $0, but instead using xorl. For the other FAIL, register > > > > > > allocation now has more freedom and is (arbitrarily) choosing a > > > > > > register assignment that doesn't match what the test is > > > > > > expecting. These issues are easier to explain and fix once this > > > > > > patch is in > > the tree. > > > > > > > > > > > > The good news is that this approach fixes a number of long > > > > > > standing issues, that need to checked in bugzilla, including PR > > > > > > target/110533 which was just opened/reported earlier this week. > > > > > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > > > > bootstrap and make -k check, both with and without > > > > > > --target_board=unix{-m32} with only the two new FAILs described > > > > > > above. > > Ok for mainline? > > > > > > > > > > > > 2023-07-06 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > > > > > > > gcc/ChangeLog > > > > > > PR target/43644 > > > > > > PR target/110533 > > > > > > * config/i386/i386-expand.cc (ix86_expand_move): Convert > > > > > > SETs of > > > > > > TImode destinations from paradoxical SUBREGs (setting the > > > > > > lowpart) > > > > > > into explicit zero extensions. Use *insvti_highpart_1 > > > > > > instruction > > > > > > to set the highpart of a TImode destination. > > > > > > > > > > > > gcc/testsuite/ChangeLog > > > > > > PR target/43644 > > > > > > PR target/110533 > > > > > > * gcc.target/i386/pr110533.c: New test case. > > > > > > * gcc.target/i386/pr43644-2.c: Likewise. > > > > > > > > > > > > Thanks in advance, > > > > > > Roger > > > > > > -- > > > > > > > > > > >