On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
> > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <ro...@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > > result in surprising code.  Consider the example below (from PR 43644):
> > >
> > > __uint128 foo(__uint128 x, unsigned long long y) {
> > >   return x+y;
> > > }
> > >
> > > which currently results in 6 consecutive movq instructions:
> > >
> > > foo:    movq    %rsi, %rax
> > >         movq    %rdi, %rsi
> > >         movq    %rdx, %rcx
> > >         movq    %rax, %rdi
> > >         movq    %rsi, %rax
> > >         movq    %rdi, %rdx
> > >         addq    %rcx, %rax
> > >         adcq    $0, %rdx
> > >         ret
> > >
> > > The underlying issue is that during RTL expansion, we generate the
> > > following initial RTL for the x argument:
> > >
> > > (insn 4 3 5 2 (set (reg:TI 85)
> > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > >      (nil))
> > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > >      (nil))
> > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > >      (nil))
> > >
> > > which by combine/reload becomes
> > >
> > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > >      (nil))
> > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > >      (expr_list:REG_DEAD (reg:DI 93)
> > >         (nil)))
> > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > >      (expr_list:REG_DEAD (reg:DI 94)
> > >         (nil)))
> > >
> > > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > > combine and register allocation.
> > >
> > > The improvement proposed here is to avoid these problematic SUBREGs by
> > > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > > sets a TImode destination from a paradoxical SUBREG, to assign the
> > > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > > added in July 2022), and for insn 5, which sets the highpart of a
> > > TImode register we can use the *insvti_highpart_1 instruction (that
> > > was added in May 2023, after being approved for stage1 in January).
> > > This allows combine to work its magic, merging these insns into a
> > > *concatditi3 and from there into other optimized forms.
> >
> > How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in 
> > the
> > hope that combine is smart enough to also combine these two instructions? 
> > IMO,
> > faking insert to lowpart of the register with zero_extend is a bit 
> > overkill, and could
> > hinder some other optimization opportunities (as perhaps hinted by failing
> > testcases).
>
> The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
> and of informing the RTL passes that the highpart is dead.  Notice in the 
> original
> RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs 
> pass, clearing
> the entirety of the TImode register (like a clobber), and preventing TI:84 
> from
> occupying the same registers as DI:93 and DI:94.
>
> If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
> then our hands would be tied, but a paradoxical SUBREG allows us the freedom
> to set the highpart bits to a defined value (we could have used sign 
> extension if
> that was cheap), which then simplifies data-flow and liveness analysis.  
> Allowing the
> highpart to contain undefined or untouched data is exactly the sort of 
> security
> side-channel leakage that the clear regs pass attempts to address.
>
> I can investigate an *insvti_lowpart_1, but I don't think it will help with 
> this
> issue, i.e. it won't prevent init-regs from clobbering/clearing TImode 
> parameters.

Thanks for the explanation, the patch is OK then.

Thanks,
Uros.

>
> > > So for the test case above, we now generate only a single movq:
> > >
> > > foo:    movq    %rdx, %rax
> > >         xorl    %edx, %edx
> > >         addq    %rdi, %rax
> > >         adcq    %rsi, %rdx
> > >         ret
> > >
> > > But there is a little bad news.  This patch causes two (minor) missed
> > > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > > no longer generating adcq $0, but instead using xorl.  For the other
> > > FAIL, register allocation now has more freedom and is (arbitrarily)
> > > choosing a register assignment that doesn't match what the test is
> > > expecting.  These issues are easier to explain and fix once this patch
> > > is in the tree.
> > >
> > > The good news is that this approach fixes a number of long standing
> > > issues, that need to checked in bugzilla, including PR target/110533
> > > which was just opened/reported earlier this week.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with only the two new FAILs described above.  Ok for mainline?
> > >
> > > 2023-07-06  Roger Sayle  <ro...@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR target/43644
> > >         PR target/110533
> > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> > >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> > >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> > >         to set the highpart of a TImode destination.
> > >
> > > gcc/testsuite/ChangeLog
> > >         PR target/43644
> > >         PR target/110533
> > >         * gcc.target/i386/pr110533.c: New test case.
> > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
>

Reply via email to