Re: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).

Manolis Tsamis Fri, 01 Sep 2023 06:03:29 -0700

On Fri, Sep 1, 2023 at 3:35 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> Hi Manolis,
> Many thanks.  If you haven't already, could you create/file a
> bug report at https://gcc.gnu.org/bugzilla/ which ensures this
> doesn't get lost/forgotten.  It provides a PR number for tracking
> discussions, and patches/fixes with PR numbers are (often)
> prioritized during the review and approval process.
>
Sure, opened as PR111267.


> I'll investigate what's going on.  Either my "improvements"
> need to be disabled for V2SF arguments, or the middle/back
> end needs to figure out how to efficiently shuffle these values,
> without reload moving them via integer registers, at least as
> efficiently as we were before.  As you/clang show, we could
> do better.
>
Sounds good, thanks a lot!
Manolis

> Thanks again, and sorry for any inconvenience.
> Best regards,
> Roger
> --
>
> > -----Original Message-----
> > From: Manolis Tsamis <manolis.tsa...@vrull.eu>
> > Sent: 01 September 2023 11:45
> > To: Uros Bizjak <ubiz...@gmail.com>
> > Cc: Roger Sayle <ro...@nextmovesoftware.com>; gcc-patches@gcc.gnu.org
> > Subject: Re: [x86_64 PATCH] Improve __int128 argument passing (in
> > ix86_expand_move).
> >
> > Hi Roger,
> >
> > I've (accidentally) found a codegen regression that I bisected down to this 
> > patch.
> > For these two functions:
> >
> > typedef struct {
> >   float minx, miny;
> >   float maxx, maxy;
> > } AABB;
> >
> > int TestOverlap(AABB a, AABB b) {
> >   return a.minx <= b.maxx
> >       && a.miny <= b.maxy
> >       && a.maxx >= b.minx
> >       && a.maxx >= b.minx;
> > }
> >
> > int TestOverlap2(AABB a, AABB b) {
> >   return a.miny <= b.maxy
> >       && a.maxx >= b.minx;
> > }
> >
> > GCC used to produce this code:
> >
> > TestOverlap:
> >         comiss  xmm3, xmm0
> >         movq    rdx, xmm0
> >         movq    rsi, xmm1
> >         movq    rax, xmm3
> >         jb      .L10
> >         shr     rdx, 32
> >         shr     rax, 32
> >         movd    xmm0, eax
> >         movd    xmm4, edx
> >         comiss  xmm0, xmm4
> >         jb      .L10
> >         movd    xmm1, esi
> >         xor     eax, eax
> >         comiss  xmm1, xmm2
> >         setnb   al
> >         ret
> > .L10:
> >         xor     eax, eax
> >         ret
> > TestOverlap2:
> >         shufps  xmm0, xmm0, 85
> >         shufps  xmm3, xmm3, 85
> >         comiss  xmm3, xmm0
> >         jb      .L17
> >         xor     eax, eax
> >         comiss  xmm1, xmm2
> >         setnb   al
> >         ret
> > .L17:
> >         xor     eax, eax
> >         ret
> >
> > After this patch codegen gets much worse:
> >
> > TestOverlap:
> >         movq    rax, xmm1
> >         movq    rdx, xmm2
> >         movq    rsi, xmm0
> >         mov     rdi, rax
> >         movq    rax, xmm3
> >         mov     rcx, rsi
> >         xchg    rdx, rax
> >         movd    xmm1, edx
> >         mov     rsi, rax
> >         mov     rax, rdx
> >         comiss  xmm1, xmm0
> >         jb      .L10
> >         shr     rcx, 32
> >         shr     rax, 32
> >         movd    xmm0, eax
> >         movd    xmm4, ecx
> >         comiss  xmm0, xmm4
> >         jb      .L10
> >         movd    xmm0, esi
> >         movd    xmm1, edi
> >         xor     eax, eax
> >         comiss  xmm1, xmm0
> >         setnb   al
> >         ret
> > .L10:
> >         xor     eax, eax
> >         ret
> > TestOverlap2:
> >         movq    rdx, xmm2
> >         movq    rax, xmm3
> >         movq    rsi, xmm0
> >         xchg    rdx, rax
> >         mov     rcx, rsi
> >         mov     rsi, rax
> >         mov     rax, rdx
> >         shr     rcx, 32
> >         shr     rax, 32
> >         movd    xmm4, ecx
> >         movd    xmm0, eax
> >         comiss  xmm0, xmm4
> >         jb      .L17
> >         movd    xmm0, esi
> >         xor     eax, eax
> >         comiss  xmm1, xmm0
> >         setnb   al
> >         ret
> > .L17:
> >         xor     eax, eax
> >         ret
> >
> > I saw that you've been improving i386 argument passing, so maybe this is 
> > just a
> > missed case of these additions?
> >
> > (Can also be seen here https://godbolt.org/z/E4xrEn6KW)
> >
> > PS: I found the code that clang generates, with cmpleps + pextrw to avoid 
> > the fp-
> > >int->fp + shr interesting. I wonder if something like this could be added 
> > >to GCC as
> > well.
> >
> > Thanks!
> > Manolis
> >
> > On Thu, Jul 6, 2023 at 5:21 PM Uros Bizjak via Gcc-patches <gcc-
> > patc...@gcc.gnu.org> wrote:
> > >
> > > On Thu, Jul 6, 2023 at 3:48 PM Roger Sayle <ro...@nextmovesoftware.com>
> > wrote:
> > > >
> > > > > On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle
> > > > > <ro...@nextmovesoftware.com>
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Passing 128-bit integer (TImode) parameters on x86_64 can
> > > > > > sometimes result in surprising code.  Consider the example below 
> > > > > > (from PR
> > 43644):
> > > > > >
> > > > > > __uint128 foo(__uint128 x, unsigned long long y) {
> > > > > >   return x+y;
> > > > > > }
> > > > > >
> > > > > > which currently results in 6 consecutive movq instructions:
> > > > > >
> > > > > > foo:    movq    %rsi, %rax
> > > > > >         movq    %rdi, %rsi
> > > > > >         movq    %rdx, %rcx
> > > > > >         movq    %rax, %rdi
> > > > > >         movq    %rsi, %rax
> > > > > >         movq    %rdi, %rdx
> > > > > >         addq    %rcx, %rax
> > > > > >         adcq    $0, %rdx
> > > > > >         ret
> > > > > >
> > > > > > The underlying issue is that during RTL expansion, we generate
> > > > > > the following initial RTL for the x argument:
> > > > > >
> > > > > > (insn 4 3 5 2 (set (reg:TI 85)
> > > > > >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> > > > > >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> > > > > >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > >
> > > > > > which by combine/reload becomes
> > > > > >
> > > > > > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> > > > > >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> > > > > >      (nil))
> > > > > > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> > > > > >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > > > >      (expr_list:REG_DEAD (reg:DI 93)
> > > > > >         (nil)))
> > > > > > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> > > > > >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> > > > > >      (expr_list:REG_DEAD (reg:DI 94)
> > > > > >         (nil)))
> > > > > >
> > > > > > where the heavy use of SUBREG SET_DESTs creates challenges for
> > > > > > both combine and register allocation.
> > > > > >
> > > > > > The improvement proposed here is to avoid these problematic
> > > > > > SUBREGs by adding (two) special cases to ix86_expand_move.  For
> > > > > > insn 4, which sets a TImode destination from a paradoxical
> > > > > > SUBREG, to assign the lowpart, we can use an explicit zero
> > > > > > extension (zero_extendditi2 was added in July 2022), and for
> > > > > > insn 5, which sets the highpart of a TImode register we can use
> > > > > > the *insvti_highpart_1 instruction (that was added in May 2023, 
> > > > > > after
> > being approved for stage1 in January).
> > > > > > This allows combine to work its magic, merging these insns into
> > > > > > a
> > > > > > *concatditi3 and from there into other optimized forms.
> > > > >
> > > > > How about we introduce *insvti_lowpart_1, similar to
> > > > > *insvti_highpart_1, in the hope that combine is smart enough to
> > > > > also combine these two instructions? IMO, faking insert to lowpart
> > > > > of the register with zero_extend is a bit overkill, and could
> > > > > hinder some other optimization opportunities (as perhaps hinted by 
> > > > > failing
> > testcases).
> > > >
> > > > The use of ZERO_EXTEND serves two purposes, both the setting of the
> > > > lowpart and of informing the RTL passes that the highpart is dead.
> > > > Notice in the original RTL stream, i.e. current GCC, insn 25 is
> > > > inserted by the .286r.init-regs pass, clearing the entirety of the
> > > > TImode register (like a clobber), and preventing TI:84 from occupying 
> > > > the
> > same registers as DI:93 and DI:94.
> > > >
> > > > If the middle-end had asked the backend to generate a SET to
> > > > STRICT_LOWPART then our hands would be tied, but a paradoxical
> > > > SUBREG allows us the freedom to set the highpart bits to a defined
> > > > value (we could have used sign extension if that was cheap), which
> > > > then simplifies data-flow and liveness analysis.  Allowing the
> > > > highpart to contain undefined or untouched data is exactly the sort of 
> > > > security
> > side-channel leakage that the clear regs pass attempts to address.
> > > >
> > > > I can investigate an *insvti_lowpart_1, but I don't think it will
> > > > help with this issue, i.e. it won't prevent init-regs from 
> > > > clobbering/clearing
> > TImode parameters.
> > >
> > > Thanks for the explanation, the patch is OK then.
> > >
> > > Thanks,
> > > Uros.
> > >
> > > >
> > > > > > So for the test case above, we now generate only a single movq:
> > > > > >
> > > > > > foo:    movq    %rdx, %rax
> > > > > >         xorl    %edx, %edx
> > > > > >         addq    %rdi, %rax
> > > > > >         adcq    %rsi, %rdx
> > > > > >         ret
> > > > > >
> > > > > > But there is a little bad news.  This patch causes two (minor)
> > > > > > missed optimization regressions on x86_64;
> > > > > > gcc.target/i386/pr82580.c and gcc.target/i386/pr91681-1.c.  As
> > > > > > shown in the test case above, we're no longer generating adcq
> > > > > > $0, but instead using xorl.  For the other FAIL, register
> > > > > > allocation now has more freedom and is (arbitrarily) choosing a
> > > > > > register assignment that doesn't match what the test is
> > > > > > expecting.  These issues are easier to explain and fix once this 
> > > > > > patch is in
> > the tree.
> > > > > >
> > > > > > The good news is that this approach fixes a number of long
> > > > > > standing issues, that need to checked in bugzilla, including PR
> > > > > > target/110533 which was just opened/reported earlier this week.
> > > > > >
> > > > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > > > bootstrap and make -k check, both with and without
> > > > > > --target_board=unix{-m32} with only the two new FAILs described 
> > > > > > above.
> > Ok for mainline?
> > > > > >
> > > > > > 2023-07-06  Roger Sayle  <ro...@nextmovesoftware.com>
> > > > > >
> > > > > > gcc/ChangeLog
> > > > > >         PR target/43644
> > > > > >         PR target/110533
> > > > > >         * config/i386/i386-expand.cc (ix86_expand_move): Convert 
> > > > > > SETs of
> > > > > >         TImode destinations from paradoxical SUBREGs (setting the 
> > > > > > lowpart)
> > > > > >         into explicit zero extensions.  Use *insvti_highpart_1 
> > > > > > instruction
> > > > > >         to set the highpart of a TImode destination.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog
> > > > > >         PR target/43644
> > > > > >         PR target/110533
> > > > > >         * gcc.target/i386/pr110533.c: New test case.
> > > > > >         * gcc.target/i386/pr43644-2.c: Likewise.
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Roger
> > > > > > --
> > > > > >
> > > >
>

Re: [x86_64 PATCH] Improve __int128 argument passing (in ix86_expand_move).

Reply via email to