Hi Manolis, Many thanks. If you haven't already, could you create/file a bug report at https://gcc.gnu.org/bugzilla/ which ensures this doesn't get lost/forgotten. It provides a PR number for tracking discussions, and patches/fixes with PR numbers are (often) prioritized during the review and approval process.
I'll investigate what's going on. Either my "improvements" need to be disabled for V2SF arguments, or the middle/back end needs to figure out how to efficiently shuffle these values, without reload moving them via integer registers, at least as efficiently as we were before. As you/clang show, we could do better. Thanks again, and sorry for any inconvenience. Best regards, Roger -- > -----Original Message----- > From: Manolis Tsamis <manolis.tsa...@vrull.eu> > Sent: 01 September 2023 11:45 > To: Uros Bizjak <ubiz...@gmail.com> > Cc: Roger Sayle <ro...@nextmovesoftware.com>; gcc-patches@gcc.gnu.org > Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in > ix86_expand_move). > > Hi Roger, > > I've (accidentally) found a codegen regression that I bisected down to this > patch. > For these two functions: > > typedef struct { > float minx, miny; > float maxx, maxy; > } AABB; > > int TestOverlap(AABB a, AABB b) { > return a.minx <= b.maxx > && a.miny <= b.maxy > && a.maxx >= b.minx > && a.maxx >= b.minx; > } > > int TestOverlap2(AABB a, AABB b) { > return a.miny <= b.maxy > && a.maxx >= b.minx; > } > > GCC used to produce this code: > > TestOverlap: > comiss xmm3, xmm0 > movq rdx, xmm0 > movq rsi, xmm1 > movq rax, xmm3 > jb .L10 > shr rdx, 32 > shr rax, 32 > movd xmm0, eax > movd xmm4, edx > comiss xmm0, xmm4 > jb .L10 > movd xmm1, esi > xor eax, eax > comiss xmm1, xmm2 > setnb al > ret > .L10: > xor eax, eax > ret > TestOverlap2: > shufps xmm0, xmm0, 85 > shufps xmm3, xmm3, 85 > comiss xmm3, xmm0 > jb .L17 > xor eax, eax > comiss xmm1, xmm2 > setnb al > ret > .L17: > xor eax, eax > ret > > After this patch codegen gets much worse: > > TestOverlap: > movq rax, xmm1 > movq rdx, xmm2 > movq rsi, xmm0 > mov rdi, rax > movq rax, xmm3 > mov rcx, rsi > xchg rdx, rax > movd xmm1, edx > mov rsi, rax > mov rax, rdx > comiss xmm1, xmm0 > jb .L10 > shr rcx, 32 > shr rax, 32 > movd xmm0, eax > movd xmm4, ecx > comiss xmm0, xmm4 > jb .L10 > movd xmm0, esi > movd xmm1, edi > xor eax, eax > comiss xmm1, xmm0 > setnb al > ret > .L10: > xor eax, eax > ret > TestOverlap2: > movq rdx, xmm2 > movq rax, xmm3 > movq rsi, xmm0 > xchg rdx, rax > mov rcx, rsi > mov rsi, rax > mov rax, rdx > shr rcx, 32 > shr rax, 32 > movd xmm4, ecx > movd xmm0, eax > comiss xmm0, xmm4 > jb .L17 > movd xmm0, esi > xor eax, eax > comiss xmm1, xmm0 > setnb al > ret > .L17: > xor eax, eax > ret > > I saw that you've been improving i386 argument passing, so maybe this is just > a > missed case of these additions? > > (Can also be seen here https://godbolt.org/z/E4xrEn6KW) > > PS: I found the code that clang generates, with cmpleps + pextrw to avoid the > fp- > >int->fp + shr interesting. I wonder if something like this could be added to > >GCC as > well. > > Thanks! > Manolis > > On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches <gcc- > patc...@gcc.gnu.org> wrote: > > > > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <ro...@nextmovesoftware.com> > wrote: > > > > > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle > > > > <ro...@nextmovesoftware.com> > > > > wrote: > > > > > > > > > > > > > > > Passing 128-bit integer (TImode) parameters on x86_64 can > > > > > sometimes result in surprising code. Consider the example below > > > > > (from PR > 43644): > > > > > > > > > > __uint128 foo(__uint128 x, unsigned long long y) { > > > > > return x+y; > > > > > } > > > > > > > > > > which currently results in 6 consecutive movq instructions: > > > > > > > > > > foo: movq %rsi, %rax > > > > > movq %rdi, %rsi > > > > > movq %rdx, %rcx > > > > > movq %rax, %rdi > > > > > movq %rsi, %rax > > > > > movq %rdi, %rdx > > > > > addq %rcx, %rax > > > > > adcq $0, %rdx > > > > > ret > > > > > > > > > > The underlying issue is that during RTL expansion, we generate > > > > > the following initial RTL for the x argument: > > > > > > > > > > (insn 4 3 5 2 (set (reg:TI 85) > > > > > (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1 > > > > > (nil)) > > > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8) > > > > > (reg:DI 87)) "pr43644-2.c":5:1 -1 > > > > > (nil)) > > > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ]) > > > > > (reg:TI 85)) "pr43644-2.c":5:1 -1 > > > > > (nil)) > > > > > > > > > > which by combine/reload becomes > > > > > > > > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ]) > > > > > (const_int 0 [0])) "pr43644-2.c":5:1 -1 > > > > > (nil)) > > > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0) > > > > > (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal} > > > > > (expr_list:REG_DEAD (reg:DI 93) > > > > > (nil))) > > > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8) > > > > > (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal} > > > > > (expr_list:REG_DEAD (reg:DI 94) > > > > > (nil))) > > > > > > > > > > where the heavy use of SUBREG SET_DESTs creates challenges for > > > > > both combine and register allocation. > > > > > > > > > > The improvement proposed here is to avoid these problematic > > > > > SUBREGs by adding (two) special cases to ix86_expand_move. For > > > > > insn 4, which sets a TImode destination from a paradoxical > > > > > SUBREG, to assign the lowpart, we can use an explicit zero > > > > > extension (zero_extendditi2 was added in July 2022), and for > > > > > insn 5, which sets the highpart of a TImode register we can use > > > > > the *insvti_highpart_1 instruction (that was added in May 2023, after > being approved for stage1 in January). > > > > > This allows combine to work its magic, merging these insns into > > > > > a > > > > > *concatditi3 and from there into other optimized forms. > > > > > > > > How about we introduce *insvti_lowpart_1, similar to > > > > *insvti_highpart_1, in the hope that combine is smart enough to > > > > also combine these two instructions? IMO, faking insert to lowpart > > > > of the register with zero_extend is a bit overkill, and could > > > > hinder some other optimization opportunities (as perhaps hinted by > > > > failing > testcases). > > > > > > The use of ZERO_EXTEND serves two purposes, both the setting of the > > > lowpart and of informing the RTL passes that the highpart is dead. > > > Notice in the original RTL stream, i.e. current GCC, insn 25 is > > > inserted by the .286r.init-regs pass, clearing the entirety of the > > > TImode register (like a clobber), and preventing TI:84 from occupying the > same registers as DI:93 and DI:94. > > > > > > If the middle-end had asked the backend to generate a SET to > > > STRICT_LOWPART then our hands would be tied, but a paradoxical > > > SUBREG allows us the freedom to set the highpart bits to a defined > > > value (we could have used sign extension if that was cheap), which > > > then simplifies data-flow and liveness analysis. Allowing the > > > highpart to contain undefined or untouched data is exactly the sort of > > > security > side-channel leakage that the clear regs pass attempts to address. > > > > > > I can investigate an *insvti_lowpart_1, but I don't think it will > > > help with this issue, i.e. it won't prevent init-regs from > > > clobbering/clearing > TImode parameters. > > > > Thanks for the explanation, the patch is OK then. > > > > Thanks, > > Uros. > > > > > > > > > > So for the test case above, we now generate only a single movq: > > > > > > > > > > foo: movq %rdx, %rax > > > > > xorl %edx, %edx > > > > > addq %rdi, %rax > > > > > adcq %rsi, %rdx > > > > > ret > > > > > > > > > > But there is a little bad news. This patch causes two (minor) > > > > > missed optimization regressions on x86_64; > > > > > gcc.target/i386/pr82580.c and gcc.target/i386/pr91681-1.c. As > > > > > shown in the test case above, we're no longer generating adcq > > > > > $0, but instead using xorl. For the other FAIL, register > > > > > allocation now has more freedom and is (arbitrarily) choosing a > > > > > register assignment that doesn't match what the test is > > > > > expecting. These issues are easier to explain and fix once this > > > > > patch is in > the tree. > > > > > > > > > > The good news is that this approach fixes a number of long > > > > > standing issues, that need to checked in bugzilla, including PR > > > > > target/110533 which was just opened/reported earlier this week. > > > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > > > bootstrap and make -k check, both with and without > > > > > --target_board=unix{-m32} with only the two new FAILs described above. > Ok for mainline? > > > > > > > > > > 2023-07-06 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > > > > > gcc/ChangeLog > > > > > PR target/43644 > > > > > PR target/110533 > > > > > * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs > > > > > of > > > > > TImode destinations from paradoxical SUBREGs (setting the > > > > > lowpart) > > > > > into explicit zero extensions. Use *insvti_highpart_1 > > > > > instruction > > > > > to set the highpart of a TImode destination. > > > > > > > > > > gcc/testsuite/ChangeLog > > > > > PR target/43644 > > > > > PR target/110533 > > > > > * gcc.target/i386/pr110533.c: New test case. > > > > > * gcc.target/i386/pr43644-2.c: Likewise. > > > > > > > > > > Thanks in advance, > > > > > Roger > > > > > -- > > > > > > > >