[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-12 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #10 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:b37351e3279d192d5d4682f002abe5b2e133bba6

commit r12-4359-gb37351e3279d192d5d4682f002abe5b2e133bba6
Author: Uros Bizjak 
Date:   Tue Oct 12 18:20:38 2021 +0200

i386: Improve workaround for PR82524 LRA limitation [PR85730]

As explained in PR82524, LRA is not able to reload a strict_low_part inout
operand with a matched input operand. The patch introduces a workaround:
we allow LRA to generate an instruction with a non-matched input operand,
which is split post reload into an instruction that inserts the non-matched
input operand into the inout operand and an instruction that uses the
matched operand.

The generated code improves from:

movsbl  %dil, %edx
movl%edi, %eax
sall$3, %edx
movb%dl, %al

to:

movl%edi, %eax
movb%dil, %al
salb$3, %al

which is still not optimal, but the code is one instruction shorter and
does not use a temporary register.

2021-10-12  Uroš Bizjak  

gcc/
PR target/85730
PR target/82524
* config/i386/i386.md (*add<mode>_1_slp): Rewrite as
define_insn_and_split pattern.  Add alternative 1 and split it
post reload to insert operand 1 into the low part of operand 0.
(*sub<mode>_1_slp): Ditto.
(*and<mode>_1_slp): Ditto.
(*<code><mode>_1_slp): Ditto.
(*ashl<mode>3_1_slp): Ditto.
(*<shift_insn><mode>3_1_slp): Ditto.
(*<rotate_insn><mode>3_1_slp): Ditto.
(*neg<mode>_1_slp): New insn_and_split pattern.
(*one_cmpl<mode>_1_slp): Ditto.

gcc/testsuite/
PR target/85730
PR target/82524
* gcc.target/i386/pr85730.c: New test.

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #9 from Uroš Bizjak  ---
An interesting observation with the following testcase:

--cut here--
typedef char V __attribute__((vector_size(4)));

struct S
{
  char val;
  char pad1;
  char pad2;
  char pad3;
};

V
foo (V v)
{
  v[0] <<= 3;

  return v;
}

struct S
bar (struct S a)
{
  a.val <<= 3;

  return a;
}
--cut here--

gcc -O2:

foo:
movsbl  %dil, %edx
movl%edi, %eax
sall$3, %edx
movb%dl, %al
ret

bar:
movl%edi, %eax
salb$3, %al
ret

So, the compiler is able to produce optimal code with an equivalent struct
description, but something (in the middle-end?) prevents the same optimization
with a vector (a.k.a. an array).
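
As a side note, the layout equivalence this comment relies on can be checked
directly; here is an illustrative sketch (mine, not part of the report)
confirming that V and struct S describe the same four bytes, so foo and bar
receive and return the same 32-bit value in the same registers:

--cut here--
typedef char V __attribute__((vector_size(4)));

struct S
{
  char val;
  char pad1;
  char pad2;
  char pad3;
};

/* Both types occupy exactly four bytes...  */
_Static_assert (sizeof (V) == 4, "vector is 4 bytes");
_Static_assert (sizeof (struct S) == 4, "struct is 4 bytes");

/* ...and a.val sits at offset 0, just like v[0], so both functions
   modify the lowest byte of the same 32-bit quantity.  */
_Static_assert (__builtin_offsetof (struct S, val) == 0, "val is low byte");
--cut here--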

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-07 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #8 from Uroš Bizjak  ---
Created attachment 51564
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51564&action=edit
Prototype patch

Attached patch works around reload problem and creates:

movl%edi, %eax
movb%dil, %al
addb%al, %al
ret

for all cases. Please note that some follow-up pass is needed to remove
"movb %dil, %al", since the preceding instruction already moves the full value
from %edi to %eax, so the partial move is not needed.
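
To see why that partial move is dead, here is a scalar sketch (mine, not
from the patch) of the first two instructions: after the full 32-bit copy,
re-inserting the low byte of the source into the destination cannot change
anything.

--cut here--
#include <assert.h>
#include <stdint.h>

int
main (void)
{
  uint32_t edi = 0xdeadbeef;  /* incoming argument register */
  uint32_t eax = edi;         /* movl %edi, %eax: full 32-bit copy */

  /* movb %dil, %al: insert the low byte of %edi into %eax.  Since %eax
     is already an exact copy of %edi, the insertion is a no-op.  */
  uint32_t after_movb = (eax & ~UINT32_C (0xff)) | (edi & 0xff);
  assert (after_movb == eax);

  return 0;
}
--cut here--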

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-04 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #7 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #5)
> Not sure whether targets should have a special-case pattern here or whether
> that's for combine to un-canonicalize it?

Is the shift defined anywhere as the canonical form?

We already get a shift at expand time, e.g. for

  long f(long a) { return a+a; }

I cannot easily find the code that does this, it is quite a maze :-)  There
is code for changing a multiplication by a power of two (which we have in
Gimple already) into a shift, but then there should be something changing
an addition of a value to itself into a multiplication, and I cannot find
that either.

Combine should absolutely not un-canonicalise it.  It could try multiple
ways of writing this, but is it important enough to allow this, to justify
the (potential) combinatorial explosion this causes?

If we want combine to try many ways of writing something (not a bad idea in
itself btw, I support it), we need ways to battle such an explosion, and esp.
to make the amount of garbage RTL created manageable (it is not, currently).
Currently combine has to create GC'ed RTL to recog() it.  Maybe we could
have some "GC stuff created between points X and Y" interface?

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #6 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #5)
> The GIMPLE IL is now using BIT_INSERT_EXPRs consistently for all cases and
> combine does
> 
> Trying 8 -> 11:
>     8: {r90:SI=r89:SI<<0x1;clobber flags:CC;}
>       REG_DEAD r89:SI
>       REG_UNUSED flags:CC
>    11: strict_low_part(r92:V4QI#0)=r90:SI#0
>       REG_DEAD r90:SI
> Failed to match this instruction:
> (set (strict_low_part (subreg:QI (reg:V4QI 92 [ v ]) 0))
>     (ashift:QI (subreg:QI (reg:SI 89 [ v ]) 0)
>         (const_int 1 [0x1])))
> 
> where it fails to try (add:QI (subreg...) (subreg...)) instead of the shift
> by 1.
> 
> Not sure whether targets should have a special-case pattern here or whether
> that's for combine to un-canonicalize it?

We do have this pattern in i386.md, but please see the FIXME:

(define_insn "*ashl3_1_slp"
  [(set (strict_low_part (match_operand:SWI12 0 "register_operand" "+"))
(ashift:SWI12 (match_operand:SWI12 1 "register_operand" "0")
  (match_operand:QI 2 "nonmemory_operand" "cI")))
   (clobber (reg:CC FLAGS_REG))]
  "(!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun))
   /* FIXME: without this LRA can't reload this pattern, see PR82524.  */
   && rtx_equal_p (operands[0], operands[1])"
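
The rtx_equal_p condition restricts the pattern to the case where the inout
destination and the shifted input are the same register.  For illustration, a
hypothetical source shape (my sketch, not the PR82524 testcase) that would
want the non-matched form is one where the inserted byte is computed from a
different register:

--cut here--
typedef char V __attribute__((vector_size(4)));

V
foo (V v, char c)
{
  /* The strict_low_part destination is v's register, while the shift
     operates on c's register, so operands 0 and 1 would not match.  */
  v[0] = c << 3;

  return v;
}
--cut here--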

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

Richard Biener  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
The GIMPLE IL is now using BIT_INSERT_EXPRs consistently for all cases and
combine does

Trying 8 -> 11:
    8: {r90:SI=r89:SI<<0x1;clobber flags:CC;}
      REG_DEAD r89:SI
      REG_UNUSED flags:CC
   11: strict_low_part(r92:V4QI#0)=r90:SI#0
      REG_DEAD r90:SI
Failed to match this instruction:
(set (strict_low_part (subreg:QI (reg:V4QI 92 [ v ]) 0))
    (ashift:QI (subreg:QI (reg:SI 89 [ v ]) 0)
        (const_int 1 [0x1])))

where it fails to try (add:QI (subreg...) (subreg...)) instead of the shift by
1.

Not sure whether targets should have a special-case pattern here or whether
that's for combine to un-canonicalize it?
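
To make the missed equivalence concrete, here is a small check (my example,
not from the thread) of the identity combine would have to exploit: for
QImode values, a left shift by one and an addition of the operand to itself
produce the same result.

--cut here--
#include <assert.h>

int
main (void)
{
  for (int i = 0; i < 256; i++)
    {
      unsigned char x = (unsigned char) i;
      /* (ashift:QI x 1) and (plus:QI x x) agree for every QImode value,
         which is why either RTL form describes the same operation.  */
      assert ((unsigned char) (x << 1) == (unsigned char) (x + x));
    }

  return 0;
}
--cut here--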

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-04 Thread gabravier at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #4 from Gabriel Ravier  ---
That's a bit odd, really - I'm just using the latest released sub-versions of
each of these (except for GCC 6, since I only have access to it through
Godbolt, which doesn't have GCC 6.5), i.e. GCC 6.4, 7.5, 8.5, 9.4, 10.3, 11.2
and trunk. I wouldn't expect that to affect the results this much.

Though I do realize now I had messed up my comment slightly: when saying "GCC 7
also changed bar and baz's code generation" I meant "foo and baz's code
generation".

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-04 Thread zsojka at seznam dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #3 from Zdenek Sojka  ---
(In reply to Gabriel Ravier from comment #2)
> Seems like they've all got identical code generation over here since GCC 7,
> and the GCC 6 code generation is just very bad for bar (although GCC 7 also
> changed bar and baz's code generation, which previously was as bar's is in
> the description)

I've got a different observation (using the branch HEADs, at -O3):
gcc-6 - ICEs
gcc-7, gcc-8, gcc-9 - the same code generated as in comment #1
gcc-10, gcc-11, gcc-12 - all functions are the same, bar() now has the same
assembly as foo() and baz() - but is that the optimal form?

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2021-10-03 Thread gabravier at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

Gabriel Ravier  changed:

   What|Removed |Added

 CC||gabravier at gmail dot com

--- Comment #2 from Gabriel Ravier  ---
Seems like they've all got identical code generation over here since GCC 7, and
the GCC 6 code generation is just very bad for bar (although GCC 7 also changed
bar and baz's code generation, which previously was as bar's is in the
description)

[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector

2018-05-10 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

Marc Glisse  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-05-10
  Component|tree-optimization   |target
 Ever confirmed|0   |1

--- Comment #1 from Marc Glisse  ---
At gimple, the difference is essentially
  BIT_FIELD_REF <v, 8, 0> = _6;
vs
  v_9 = BIT_INSERT_EXPR <v, _6, 0>;

Before combine, that translates to modifying a register directly

(insn 6 3 7 2 (set (reg:SI 93 [ v ])
        (sign_extend:SI (subreg:QI (reg/v:SI 92 [ v ]) 0))) "v.c":6 155 {extendqisi2}
     (nil))
(insn 7 6 8 2 (parallel [
            (set (reg:SI 94)
                (ashift:SI (reg:SI 93 [ v ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "v.c":6 550 {*ashlsi3_1}
     (expr_list:REG_DEAD (reg:SI 93 [ v ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 8 7 13 2 (set (strict_low_part (subreg:QI (reg/v:SI 92 [ v ]) 0))
        (subreg:QI (reg:SI 94) 0)) "v.c":6 101 {*movstrictqi_1}
     (expr_list:REG_DEAD (reg:SI 94)
        (nil)))

or modifying a copy of it

(insn 6 3 7 2 (set (reg:SI 93 [ v ])
        (sign_extend:SI (subreg:QI (reg/v:SI 92 [ v ]) 0))) "v.c":12 155 {extendqisi2}
     (nil))
(insn 7 6 9 2 (parallel [
            (set (reg:SI 94)
                (ashift:SI (reg:SI 93 [ v ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "v.c":12 550 {*ashlsi3_1}
     (expr_list:REG_DEAD (reg:SI 93 [ v ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 9 7 10 2 (set (reg:SI 96 [ v ])
        (reg/v:SI 92 [ v ])) "v.c":12 86 {*movsi_internal}
     (expr_list:REG_DEAD (reg/v:SI 92 [ v ])
        (nil)))
(insn 10 9 15 2 (set (strict_low_part (subreg:QI (reg:SI 96 [ v ]) 0))
        (subreg:QI (reg:SI 94) 0)) "v.c":12 101 {*movstrictqi_1}
     (expr_list:REG_DEAD (reg:SI 94)
        (nil)))

and combine only manages to match in the first case

(insn 8 7 13 2 (parallel [
            (set (strict_low_part (subreg:QI (reg/v:SI 92 [ v ]) 0))
                (ashift:QI (subreg:QI (reg/v:SI 92 [ v ]) 0)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "v.c":6 556 {*ashlqi3_1_slp}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

Operations on partial registers are often not so fast, but in any case it
seems that we should generate the same code for both cases. This is either a
target or an rtl-optimization issue.
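
For readers less familiar with the RTL above: a strict_low_part store such as
the one in insn 8 replaces only the low byte of the wider register and leaves
the remaining bits untouched.  In C terms (an illustrative helper of mine,
not GCC code):

--cut here--
#include <stdint.h>

/* Semantics of (set (strict_low_part (subreg:QI (reg:SI r) 0)) b):
   only the low byte of r changes, the upper 24 bits are preserved.  */
static inline uint32_t
strict_low_byte_store (uint32_t r, uint8_t b)
{
  return (r & ~UINT32_C (0xff)) | b;
}
--cut here--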