On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <ro...@nextmovesoftware.com> > > wrote: > > > > > > > > > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes > > > result in surprising code. Consider the example below (from PR 43644): > > > > > > __uint128 foo(__uint128 x, unsigned long long y) { > > > return x+y; > > > } > > > > > > which currently results in 6 consecutive movq instructions: > > > > > > foo: movq %rsi, %rax > > > movq %rdi, %rsi > > > movq %rdx, %rcx > > > movq %rax, %rdi > > > movq %rsi, %rax > > > movq %rdi, %rdx > > > addq %rcx, %rax > > > adcq $0, %rdx > > > ret > > > > > > The underlying issue is that during RTL expansion, we generate the > > > following initial RTL for the x argument: > > > > > > (insn 4 3 5 2 (set (reg:TI 85) > > > (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1 > > > (nil)) > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8) > > > (reg:DI 87)) "pr43644-2.c":5:1 -1 > > > (nil)) > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ]) > > > (reg:TI 85)) "pr43644-2.c":5:1 -1 > > > (nil)) > > > > > > which by combine/reload becomes > > > > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ]) > > > (const_int 0 [0])) "pr43644-2.c":5:1 -1 > > > (nil)) > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0) > > > (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal} > > > (expr_list:REG_DEAD (reg:DI 93) > > > (nil))) > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8) > > > (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal} > > > (expr_list:REG_DEAD (reg:DI 94) > > > (nil))) > > > > > > where the heavy use of SUBREG SET_DESTs creates challenges for both > > > combine and register allocation. > > > > > > The improvement proposed here is to avoid these problematic SUBREGs by > > > adding (two) special cases to ix86_expand_move. For insn 4, which > > > sets a TImode destination from a paradoxical SUBREG, to assign the > > > lowpart, we can use an explicit zero extension (zero_extendditi2 was > > > added in July 2022), and for insn 5, which sets the highpart of a > > > TImode register we can use the *insvti_highpart_1 instruction (that > > > was added in May 2023, after being approved for stage1 in January). > > > This allows combine to work its magic, merging these insns into a > > > *concatditi3 and from there into other optimized forms. > > > > How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in > > the > > hope that combine is smart enough to also combine these two instructions? > > IMO, > > faking insert to lowpart of the register with zero_extend is a bit > > overkill, and could > > hinder some other optimization opportunities (as perhaps hinted by failing > > testcases). > > The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart > and of informing the RTL passes that the highpart is dead. Notice in the > original > RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs > pass, clearing > the entirety of the TImode register (like a clobber), and preventing TI:84 > from > occupying the same registers as DI:93 and DI:94. > > If the middle-end had asked the backend to generate a SET to STRICT_LOWPART > then our hands would be tied, but a paradoxical SUBREG allows us the freedom > to set the highpart bits to a defined value (we could have used sign > extension if > that was cheap), which then simplifies data-flow and liveness analysis. > Allowing the > highpart to contain undefined or untouched data is exactly the sort of > security > side-channel leakage that the clear regs pass attempts to address. > > I can investigate an *insvti_lowpart_1, but I don't think it will help with > this > issue, i.e. it won't prevent init-regs from clobbering/clearing TImode > parameters.
Thanks for the explanation, the patch is OK then. Thanks, Uros. > > > > So for the test case above, we now generate only a single movq: > > > > > > foo: movq %rdx, %rax > > > xorl %edx, %edx > > > addq %rdi, %rax > > > adcq %rsi, %rdx > > > ret > > > > > > But there is a little bad news. This patch causes two (minor) missed > > > optimization regressions on x86_64; gcc.target/i386/pr82580.c and > > > gcc.target/i386/pr91681-1.c. As shown in the test case above, we're > > > no longer generating adcq $0, but instead using xorl. For the other > > > FAIL, register allocation now has more freedom and is (arbitrarily) > > > choosing a register assignment that doesn't match what the test is > > > expecting. These issues are easier to explain and fix once this patch > > > is in the tree. > > > > > > The good news is that this approach fixes a number of long standing > > > issues, that need to checked in bugzilla, including PR target/110533 > > > which was just opened/reported earlier this week. > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > > and make -k check, both with and without --target_board=unix{-m32} > > > with only the two new FAILs described above. Ok for mainline? > > > > > > 2023-07-06 Roger Sayle <ro...@nextmovesoftware.com> > > > > > > gcc/ChangeLog > > > PR target/43644 > > > PR target/110533 > > > * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of > > > TImode destinations from paradoxical SUBREGs (setting the lowpart) > > > into explicit zero extensions. Use *insvti_highpart_1 instruction > > > to set the highpart of a TImode destination. > > > > > > gcc/testsuite/ChangeLog > > > PR target/43644 > > > PR target/110533 > > > * gcc.target/i386/pr110533.c: New test case. > > > * gcc.target/i386/pr43644-2.c: Likewise. > > > > > > Thanks in advance, > > > Roger > > > -- > > > >