[Bug target/85730] complex code for modifying lowest byte in a 4-byte vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730

--- Comment #10 from CVS Commits ---
The master branch has been updated by Uros Bizjak:

https://gcc.gnu.org/g:b37351e3279d192d5d4682f002abe5b2e133bba6

commit r12-4359-gb37351e3279d192d5d4682f002abe5b2e133bba6
Author: Uros Bizjak
Date:   Tue Oct 12 18:20:38 2021 +0200

    i386: Improve workaround for PR82524 LRA limitation [PR85730]

    As explained in PR82524, LRA is not able to reload a strict_low_part
    inout operand with a matched input operand.  The patch introduces a
    workaround: we allow LRA to generate an instruction with a non-matched
    input operand, which is split post-reload into an instruction that
    inserts the non-matched input operand into the inout operand and the
    instruction that uses the matched operand.

    The generated code improves from:

            movsbl  %dil, %edx
            movl    %edi, %eax
            sall    $3, %edx
            movb    %dl, %al

    to:

            movl    %edi, %eax
            movb    %dil, %al
            salb    $3, %al

    which is still not optimal, but the code is one instruction shorter
    and does not use a temporary register.

    2021-10-12  Uroš Bizjak

    gcc/
            PR target/85730
            PR target/82524
            * config/i386/i386.md (*add<mode>_1_slp): Rewrite as
            define_insn_and_split pattern.  Add alternative 1 and split it
            post reload to insert operand 1 into the low part of operand 0.
            (*sub<mode>_1_slp): Ditto.
            (*and<mode>_1_slp): Ditto.
            (*<code><mode>_1_slp): Ditto.
            (*ashl<mode>3_1_slp): Ditto.
            (*<shift_insn><mode>3_1_slp): Ditto.
            (*<rotate_insn><mode>3_1_slp): Ditto.
            (*neg<mode>_1_slp): New insn_and_split pattern.
            (*one_cmpl<mode>_1_slp): Ditto.

    gcc/testsuite/
            PR target/85730
            PR target/82524
            * gcc.target/i386/pr85730.c: New test.
--- Comment #9 from Uroš Bizjak ---
An interesting observation with the following testcase:

--cut here--
typedef char V __attribute__((vector_size (4)));

struct S
{
  char val;
  char pad1;
  char pad2;
  char pad3;
};

V
foo (V v)
{
  v[0] <<= 3;
  return v;
}

struct S
bar (struct S a)
{
  a.val <<= 3;
  return a;
}
--cut here--

gcc -O2:

foo:
        movsbl  %dil, %edx
        movl    %edi, %eax
        sall    $3, %edx
        movb    %dl, %al
        ret

bar:
        movl    %edi, %eax
        salb    $3, %al
        ret

So, the compiler is able to produce optimal code with the equivalent struct
description, but something (in the middle-end?) prevents the same
optimization with a vector (a.k.a. an array).
--- Comment #8 from Uroš Bizjak ---
Created attachment 51564
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51564&action=edit
Prototype patch

The attached patch works around the reload problem and creates:

        movl    %edi, %eax
        movb    %dil, %al
        addb    %al, %al
        ret

for all cases.  Please note that some follow-up pass is needed to remove
"movb %dil, %al", since the preceding instruction already moves the full
value from %edi to %eax, so the partial move is not needed.
--- Comment #7 from Segher Boessenkool ---
(In reply to Richard Biener from comment #5)
> Not sure whether targets should have a special-case pattern here or whether
> that's for combine to un-canonicalize it?

Is the shift defined anywhere as the canonical form?  We already get a shift
at expand time, for e.g.

  long f(long a) { return a+a; }

I cannot find the code that does this easily, it is quite a maze :-)  There
is code for changing a multiplication by a power of two (which we have in
GIMPLE already) into a shift, but then there should be something changing
the addition with self into a multiplication, and I cannot find that either.

Combine should absolutely not un-canonicalise it.  It could try multiple
ways of writing this, but is it important enough to justify the (potential)
combinatorial explosion this causes?  If we want combine to try many ways of
writing something (not a bad idea in itself, btw, I support it), we need
ways to battle such an explosion, and especially to make the amount of
garbage RTL created manageable (it is not, currently).  Currently combine
has to create GC'ed RTL to recog() it.  Maybe we could have some "GC stuff
created between points X and Y" interface?
--- Comment #6 from Uroš Bizjak ---
(In reply to Richard Biener from comment #5)
> The GIMPLE IL is now using BIT_INSERT_EXPRs consistently for all cases and
> combine does
>
> Trying 8 -> 11:
>     8: {r90:SI=r89:SI<<0x1;clobber flags:CC;}
>       REG_DEAD r89:SI
>       REG_UNUSED flags:CC
>    11: strict_low_part(r92:V4QI#0)=r90:SI#0
>       REG_DEAD r90:SI
> Failed to match this instruction:
> (set (strict_low_part (subreg:QI (reg:V4QI 92 [ v ]) 0))
>     (ashift:QI (subreg:QI (reg:SI 89 [ v ]) 0)
>         (const_int 1 [0x1])))
>
> where it fails to try (add:QI (subreg...) (subreg...)) instead of the
> shift by 1.
>
> Not sure whether targets should have a special-case pattern here or
> whether that's for combine to un-canonicalize it?

We do have this pattern in i386.md, but please see the FIXME:

(define_insn "*ashl<mode>3_1_slp"
  [(set (strict_low_part (match_operand:SWI12 0 "register_operand" "+<r>"))
        (ashift:SWI12 (match_operand:SWI12 1 "register_operand" "0")
                      (match_operand:QI 2 "nonmemory_operand" "cI")))
   (clobber (reg:CC FLAGS_REG))]
  "(!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun))
   /* FIXME: without this LRA can't reload this pattern, see PR82524.  */
   && rtx_equal_p (operands[0], operands[1])"
Richard Biener changed:

           What    |Removed |Added
                 CC|        |segher at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
The GIMPLE IL is now using BIT_INSERT_EXPRs consistently for all cases and
combine does

Trying 8 -> 11:
    8: {r90:SI=r89:SI<<0x1;clobber flags:CC;}
      REG_DEAD r89:SI
      REG_UNUSED flags:CC
   11: strict_low_part(r92:V4QI#0)=r90:SI#0
      REG_DEAD r90:SI
Failed to match this instruction:
(set (strict_low_part (subreg:QI (reg:V4QI 92 [ v ]) 0))
    (ashift:QI (subreg:QI (reg:SI 89 [ v ]) 0)
        (const_int 1 [0x1])))

where it fails to try (add:QI (subreg...) (subreg...)) instead of the shift
by 1.

Not sure whether targets should have a special-case pattern here or whether
that's for combine to un-canonicalize it?
--- Comment #4 from Gabriel Ravier ---
That's a bit odd, really.  I'm just using the latest released sub-versions
of each of these (except for GCC 6, since I only have access to it through
Godbolt, which doesn't have GCC 6.5), i.e. GCC 6.4, 7.5, 8.5, 9.4, 10.3,
11.2 and trunk; I wouldn't expect that to make much difference here.

I do realize now that I slightly messed up my earlier comment: where I said
"GCC 7 also changed bar and baz's code generation" I meant "foo and baz's
code generation".
--- Comment #3 from Zdenek Sojka ---
(In reply to Gabriel Ravier from comment #2)
> Seems like they've all got identical code generation over here since GCC
> 7, and the GCC 6 code generation is just very bad for bar (although GCC 7
> also changed bar and baz's code generation, which previously was as bar's
> is in the description)

I've got a different observation (using the branch HEADs, at -O3):

gcc-6                  - ICEs
gcc-7, gcc-8, gcc-9    - the same code generated as in comment #1
gcc-10, gcc-11, gcc-12 - all functions are the same; bar() now has the same
                         assembly as foo() and baz() - but is that the
                         optimal form?
Gabriel Ravier changed:

           What    |Removed |Added
                 CC|        |gabravier at gmail dot com

--- Comment #2 from Gabriel Ravier ---
Seems like they've all got identical code generation over here since GCC 7,
and the GCC 6 code generation is just very bad for bar (although GCC 7 also
changed bar and baz's code generation, which previously was as bar's is in
the description).
Marc Glisse changed:

           What    |Removed           |Added
             Status|UNCONFIRMED       |NEW
   Last reconfirmed|                  |2018-05-10
          Component|tree-optimization |target
     Ever confirmed|0                 |1

--- Comment #1 from Marc Glisse ---
At the GIMPLE level, the difference is essentially

  BIT_FIELD_REF = _6;

vs.

  v_9 = BIT_INSERT_EXPR ;

Before combine, that translates to modifying a register directly

(insn 6 3 7 2 (set (reg:SI 93 [ v ])
        (sign_extend:SI (subreg:QI (reg/v:SI 92 [ v ]) 0))) "v.c":6 155
        {extendqisi2}
     (nil))
(insn 7 6 8 2 (parallel [
            (set (reg:SI 94)
                (ashift:SI (reg:SI 93 [ v ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "v.c":6 550 {*ashlsi3_1}
     (expr_list:REG_DEAD (reg:SI 93 [ v ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 8 7 13 2 (set (strict_low_part (subreg:QI (reg/v:SI 92 [ v ]) 0))
        (subreg:QI (reg:SI 94) 0)) "v.c":6 101 {*movstrictqi_1}
     (expr_list:REG_DEAD (reg:SI 94)
        (nil)))

or modifying a copy of it

        (sign_extend:SI (subreg:QI (reg/v:SI 92 [ v ]) 0))) "v.c":12 155
        {extendqisi2}
     (nil))
(insn 7 6 9 2 (parallel [
            (set (reg:SI 94)
                (ashift:SI (reg:SI 93 [ v ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "v.c":12 550 {*ashlsi3_1}
     (expr_list:REG_DEAD (reg:SI 93 [ v ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 9 7 10 2 (set (reg:SI 96 [ v ])
        (reg/v:SI 92 [ v ])) "v.c":12 86 {*movsi_internal}
     (expr_list:REG_DEAD (reg/v:SI 92 [ v ])
        (nil)))
(insn 10 9 15 2 (set (strict_low_part (subreg:QI (reg:SI 96 [ v ]) 0))
        (subreg:QI (reg:SI 94) 0)) "v.c":12 101 {*movstrictqi_1}
     (expr_list:REG_DEAD (reg:SI 94)
        (nil)))

and combine only manages to match in the first case

(insn 8 7 13 2 (parallel [
            (set (strict_low_part (subreg:QI (reg/v:SI 92 [ v ]) 0))
                (ashift:QI (subreg:QI (reg/v:SI 92 [ v ]) 0)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "v.c":6 556 {*ashlqi3_1_slp}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

Operations on partial registers are often not so fast, but in any case it
seems that we should generate the same code for both cases.  Either target
or rtl-optimization.