[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Richard Biener changed:

           What    |Removed |Added
----------------------------------
  Target Milestone |14.0    |14.2

--- Comment #18 from Richard Biener ---
GCC 14.1 is being released, retargeting bugs to GCC 14.2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #17 from GCC Commits ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:ff8d0ce17fb585a29a83349acbc67b2dd3556629

commit r14-6495-gff8d0ce17fb585a29a83349acbc67b2dd3556629
Author: Roger Sayle
Date:   Wed Dec 13 13:36:44 2023 +

    ARC: Add *extvsi_n_0 define_insn_and_split for PR 110717.

    This patch improves the code generated for bitfield sign extensions on
    ARC cpus without a barrel shifter.  Compiling the following test case:

    int foo(int x) { return (x<<27)>>27; }

    with -O2 -mcpu=em, generates two loops:

    foo:    mov     lp_count,27
            lp      2f
            add     r0,r0,r0
            nop
    2:      # end single insn loop
            mov     lp_count,27
            lp      2f
            asr     r0,r0
            nop
    2:      # end single insn loop
            j_s     [blink]

    and the closely related test case:

    struct S { int a : 5; };
    int bar (struct S *p) { return p->a; }

    generates the slightly better:

    bar:    ldb_s   r0,[r0]
            mov_s   r2,0    ;3
            add3    r0,r2,r0
            sexb_s  r0,r0
            asr_s   r0,r0
            asr_s   r0,r0
            j_s.d   [blink]
            asr_s   r0,r0

    which uses 6 instructions to perform this particular sign extension.
    It turns out that sign extensions can always be implemented using at
    most three instructions on ARC (without a barrel shifter) using the
    idiom ((x)^msb)-msb [as described in section "2-5 Sign Extension" of
    Henry Warren's book "Hacker's Delight"].  Using this, the sign
    extensions above on ARC's EM both become:

            bmsk_s  r0,r0,4
            xor     r0,r0,16
            sub     r0,r0,16

    which takes about 3 cycles, compared to the ~112 cycles for the loops
    in foo.

2023-12-13  Roger Sayle
            Jeff Law

gcc/ChangeLog
        * config/arc/arc.md (*extvsi_n_0): New define_insn_and_split to
        implement SImode sign extract using an AND, XOR and MINUS
        sequence.

gcc/testsuite/ChangeLog
        * gcc.target/arc/extvsi-1.c: New test case.
        * gcc.target/arc/extvsi-2.c: Likewise.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #16 from CVS Commits ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:31cc9824d1cd5e0f7fa145d0831a923479333cd6

commit r14-5013-g31cc9824d1cd5e0f7fa145d0831a923479333cd6
Author: Roger Sayle
Date:   Mon Oct 30 16:17:42 2023 +

    ARC: Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.

    This patch overhauls the ARC backend's insn_cost target hook, and makes
    some related improvements to rtx_costs, BRANCH_COST, etc.  The primary
    goal is to allow the backend to indicate that shifts and rotates are
    slow (discouraged) when the CPU doesn't have a barrel shifter.  I should
    also acknowledge Richard Sandiford for inspiring the use of set_cost
    in this rewrite of arc_insn_cost; this implementation borrows heavily
    from the target hooks for AArch64 and ARM.

    The motivating example is derived from PR rtl-optimization/110717.

    struct S { int a : 5; };
    unsigned int foo (struct S *p) { return p->a; }

    With a barrel shifter, GCC -O2 generates the reasonable:

    foo:    ldb_s   r0,[r0]
            asl_s   r0,r0,27
            j_s.d   [blink]
            asr_s   r0,r0,27

    What's interesting is that during combine, the middle-end actually has
    two shifts by three bits, and a sign-extension from QI to SI.

    Trying 8, 9 -> 11:
        8: r158:SI=r157:QI#0<<0x3
          REG_DEAD r157:QI
        9: r159:SI=sign_extend(r158:SI#0)
          REG_DEAD r158:SI
       11: r155:SI=r159:SI>>0x3
          REG_DEAD r159:SI

    Whilst it's reasonable to simplify this to two shifts by 27 bits when
    the CPU has a barrel shifter, it's actually a significant pessimization
    when these shifts are implemented by loops.  This combination can be
    prevented if the backend provides accurate-ish estimates for insn_cost.

    Previously, without a barrel shifter, GCC -O2 -mcpu=em generates:

    foo:    ldb_s   r0,[r0]
            mov     lp_count,27
            lp      2f
            add     r0,r0,r0
            nop
    2:      # end single insn loop
            mov     lp_count,27
            lp      2f
            asr     r0,r0
            nop
    2:      # end single insn loop
            j_s     [blink]

    which contains two loops and requires about ~113 cycles to execute.
    With this patch to rtx_cost/insn_cost, GCC -O2 -mcpu=em generates:

    foo:    ldb_s   r0,[r0]
            mov_s   r2,0    ;3
            add3    r0,r2,r0
            sexb_s  r0,r0
            asr_s   r0,r0
            asr_s   r0,r0
            j_s.d   [blink]
            asr_s   r0,r0

    which requires only ~6 cycles, for the shorter shifts by 3 and sign
    extension.

2023-10-30  Roger Sayle

gcc/ChangeLog
        * config/arc/arc.cc (arc_rtx_costs): Improve cost estimates.
        Provide reasonable values for SHIFTS and ROTATES by constant
        bit counts depending upon TARGET_BARREL_SHIFTER.
        (arc_insn_cost): Use insn attributes if the instruction is
        recognized.  Avoid calling get_attr_length for type "multi",
        i.e. define_insn_and_split patterns without explicit type.
        Fall back to set_rtx_cost for single_set and pattern_cost
        otherwise.
        * config/arc/arc.h (COSTS_N_BYTES): Define helper macro.
        (BRANCH_COST): Improve/correct definition.
        (LOGICAL_OP_NON_SHORT_CIRCUIT): Preserve previous behavior.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #15 from CVS Commits ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:c572f09a751cbd365e2285b30527de5ab9025972

commit r14-2998-gc572f09a751cbd365e2285b30527de5ab9025972
Author: Roger Sayle
Date:   Fri Aug 4 16:26:06 2023 +0100

    Specify signed/unsigned/dontcare in calls to extract_bit_field_1.

    This patch is inspired by Jakub's work on PR rtl-optimization/110717.
    The bitfield example described in comment #2 looks like:

    struct S { __int128 a : 69; };
    unsigned type bar (struct S *p) { return p->a; }

    which on x86_64 with -O2 currently generates:

    bar:    movzbl  8(%rdi), %ecx
            movq    (%rdi), %rax
            andl    $31, %ecx
            movq    %rcx, %rdx
            salq    $59, %rdx
            sarq    $59, %rdx
            ret

    The ANDL $31 is interesting... we first extract an unsigned 69-bit
    bitfield by masking/clearing the top bits of the most significant
    word, and then it gets sign-extended, by left shifting and arithmetic
    right shifting.  Obviously, this bit-wise AND is redundant; for
    signed bit-fields, we don't require these bits to be cleared, if
    we're about to set them appropriately.

    This patch eliminates this redundancy in the middle-end, during RTL
    expansion, by extending the extract_bit_field APIs so that the
    integer UNSIGNEDP argument takes a special value; 0 indicates the
    field should be sign extended, 1 (any non-zero value) indicates the
    field should be zero extended, but -1 indicates a third option, that
    we don't care how or whether the field is extended.  By passing and
    checking this sentinel value at the appropriate places we avoid the
    useless bit masking (on all targets).

    For the test case above, with this patch we now generate:

    bar:    movzbl  8(%rdi), %ecx
            movq    (%rdi), %rax
            movq    %rcx, %rdx
            salq    $59, %rdx
            sarq    $59, %rdx
            ret

2023-08-04  Roger Sayle

gcc/ChangeLog
        * expmed.cc (extract_bit_field_1): Document that an UNSIGNEDP
        value of -1 is equivalent to don't care.
        (extract_integral_bit_field): Indicate that we don't require
        the most significant word to be zero extended, if we're about
        to sign extend it.
        (extract_fixed_bit_field_1): Document that an UNSIGNEDP value
        of -1 is equivalent to don't care.  Don't clear the most
        significant bits with AND mask when UNSIGNEDP is -1.

gcc/testsuite/ChangeLog
        * gcc.target/i386/pr110717-2.c: New test case.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #14 from Andrew Pinski ---
(In reply to Segher Boessenkool from comment #13)
> So.  Before expand we have
>
> _6 = (__int128) x_3(D);
> x.0_1 = _6 << 59;
> _2 = x.0_1 >> 59;

Jakub is trying to emulate this using shifts, but this is better done
using bitfields, just to get the truncation:

#ifdef __SIZEOF_INT128__
#define type __int128
#else
#define type long long
#endif

struct f {
#ifdef __SIZEOF_INT128__
  __int128 t : (128 - 59);
#else
  long long t : (64 - 27);
#endif
};

unsigned type
foo (unsigned type x)
{
  struct f t;
  t.t = x;
  return t.t;
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #13 from Segher Boessenkool ---
So.  Before expand we have

_6 = (__int128) x_3(D);
x.0_1 = _6 << 59;
_2 = x.0_1 >> 59;
_4 = (__int128 unsigned) _2;
return _4;

That should have been optimised better :-(

The RTL code it expands to sets the same pseudo multiple times.  Bad
bad bad.  This hampers many optimisations.  Like:

(insn 6 3 7 2 (set (reg:DI 124)
        (lshiftrt:DI (reg:DI 129 [ x+8 ])
            (const_int 5 [0x5]))) "110717.c":6:11 299 {lshrdi3}
     (nil))
(insn 7 6 8 2 (set (reg:DI 132)
        (ashift:DI (reg:DI 128 [ x ])
            (const_int 59 [0x3b]))) "110717.c":6:11 289 {ashldi3}
     (nil))
(insn 8 7 9 2 (set (reg:DI 132)
        (ior:DI (reg:DI 124)
            (reg:DI 132))) "110717.c":6:11 233 {*booldi3}
     (nil))

(They are subregs right after expand, totally unreadable; this is after
subreg1, slightly more readable, but essentially the same code still.)

The web pass eventually gets rid of the double set in this case.

Because the shift-left-then-right survives all the way to combine, it
(being the greedy bastard that it is) will use the combiner patterns
rs6000 has for multi-precision shifts, before it would notice the two
(multiprecision!) shifts together are largely a no-op, so you get stuck
at a local optimum.  Par for the course for combine :-/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #12 from Segher Boessenkool ---
(In reply to Jakub Jelinek from comment #9)
> Wonder how many important targets provide double-word shift patterns vs.
> ones which expand it through generic code.

Very long ago rs6000 had special code for this.  That was sub-optimal
in other ways, and the generic code generated almost ideal code
(sometimes an extra data movement insn).

> powerpc probably could be improved:
> foo:
>         srwi 9,4,5
>         mr 10,9
>         rlwimi 4,9,5,0,31-5
>         rlwimi 10,3,27,0,31-27
>         srawi 3,10,27
>         blr

This is hugely worse than what we used to do, it seems?  GCC 8 did

        srdi 9,4,5
        rldimi 9,3,59,0
        rldimi 4,9,5,0
        sradi 3,9,59
        blr

GCC 9 started with the unnecessary move.  But we should get only one
insert insn in any case!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #11 from Jakub Jelinek ---
To handle this in generic code, I think expand_expr_real_2 would need
to notice this case of a << operand of an arithmetic >> by the same
amount and tell that to expand_variable_shift -> expand_shift_1 ->
expand_binop somehow.  Wasn't there a proposal for SEXT_EXPR at some
point?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #10 from Jakub Jelinek ---
Though, grepping tmp-mddump.md files shows only x86 having ashlti3 and
ashrti3 expanders.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------
                CC |        |krebbel at gcc dot gnu.org,
                   |        |law at gcc dot gnu.org,
                   |        |rsandifo at gcc dot gnu.org,
                   |        |segher at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek ---
Wonder how many important targets provide double-word shift patterns
vs. ones which expand it through generic code.

aarch64 looks quite small:
foo:
        extr    x1, x1, x0, 5
        asr     x1, x1, 59
        ret

powerpc probably could be improved:
foo:
        srwi 9,4,5
        mr 10,9
        rlwimi 4,9,5,0,31-5
        rlwimi 10,3,27,0,31-27
        srawi 3,10,27
        blr

ditto s390x:
foo:
        lg      %r1,0(%r3)
        lg      %r3,8(%r3)
        srlg    %r5,%r3,5
        lghi    %r0,31
        sllg    %r1,%r1,59
        ogr     %r1,%r5
        ngr     %r3,%r0
        sllg    %r5,%r5,5
        srag    %r1,%r1,59
        ogr     %r3,%r5
        stg     %r3,8(%r2)
        stg     %r1,0(%r2)
        br      %r14

ditto riscv64:
foo:
        srli    a5,a0,5
        slli    a1,a1,59
        or      a1,a5,a1
        slli    a5,a1,5
        andi    a0,a0,31
        or      a0,a5,a0
        srai    a1,a1,59
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak changed:

           What    |Removed |Added
----------------------------------
          Assignee |ubizjak at gmail dot com |unassigned at gcc dot gnu.org
            Status |ASSIGNED |NEW

--- Comment #8 from Uroš Bizjak ---
(In reply to CVS Commits from comment #7)
> The master branch has been updated by Uros Bizjak :

The patch implements the transform for x86 targets only.  Due to the
eventual STV transformation, x86 targets handle double-word operations
in their own way.  I'll leave the target-independent implementation to
someone else.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #7 from CVS Commits ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:b50a851eef4b70aabf28fa875d9b2a302d17b66a

commit r14-2684-gb50a851eef4b70aabf28fa875d9b2a302d17b66a
Author: Uros Bizjak
Date:   Thu Jul 20 20:54:51 2023 +0200

    i386: Double-word sign-extension missed-optimization [PR110717]

    When sign-extending the value in a double-word register pair using a
    shift and ashiftrt sequence with the same count immediate value less
    than the word width, there is no need to shift the lower word of the
    value.  The sign-extension could be limited to the upper word, but we
    uselessly shift the lower word with it as well:

            movq    %rdi, %rax
            movq    %rsi, %rdx
            shldq   $59, %rdi, %rdx
            salq    $59, %rax
            shrdq   $59, %rdx, %rax
            sarq    $59, %rdx
            ret

    for -m64 and

            movl    4(%esp), %eax
            movl    8(%esp), %edx
            shldl   $27, %eax, %edx
            sall    $27, %eax
            shrdl   $27, %edx, %eax
            sarl    $27, %edx
            ret

    for -m32.

    The patch introduces a new post-reload splitter to provide the
    combined ASHIFTRT/SHIFT instruction pattern.  The instruction is
    split to a sequence of SAL and SAR insns with the same count
    immediate operand:

            movq    %rsi, %rdx
            movq    %rdi, %rax
            salq    $59, %rdx
            sarq    $59, %rdx
            ret

    Some complication is required to properly handle the STV transform,
    where we emit a sequence with DImode PSLLQ and PSRAQ insns for
    32-bit AVX512VL targets when profitable.  The patch also fixes a
    small oversight and enables the STV transform of SImode ASHIFTRT to
    PSRAD also for SSE2 targets.

            PR target/110717

    gcc/ChangeLog:

            * config/i386/i386-features.cc
            (general_scalar_chain::compute_convert_gain): Calculate gain
            for the extend highpart case.
            (general_scalar_chain::convert_op): Handle ASHIFTRT/ASHIFT
            combined RTX.
            (general_scalar_to_vector_candidate_p): Enable ASHIFTRT for
            SImode for SSE2 targets.  Handle ASHIFTRT/ASHIFT combined
            RTX.
            * config/i386/i386.md (*extend2_doubleword_highpart): New
            define_insn_and_split pattern.
            (*extendv2di2_highpart_stv): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr110717.c: New test.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak changed:

           What    |Removed |Added
----------------------------------
  Target Milestone |---     |14.0
                CC |uros at gcc dot gnu.org |

--- Comment #6 from Uroš Bizjak ---
(In reply to Jakub Jelinek from comment #5)
> Thanks.
> Shouldn't
>   INTVAL (operands[2]) < * BITS_PER_UNIT
> be
>   UINTVAL (operands[2]) < * BITS_PER_UNIT
> just to make sure it doesn't trigger for negative?

Ah, yes, I'll change it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #5 from Jakub Jelinek ---
Thanks.
Shouldn't
  INTVAL (operands[2]) < * BITS_PER_UNIT
be
  UINTVAL (operands[2]) < * BITS_PER_UNIT
just to make sure it doesn't trigger for negative?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak changed:

           What    |Removed |Added
----------------------------------
            Status |NEW     |ASSIGNED
          Assignee |unassigned at gcc dot gnu.org |ubizjak at gmail dot com

--- Comment #4 from Uroš Bizjak ---
Created attachment 55578
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55578
Proposed patch

Patch in testing.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Richard Biener changed:

           What    |Removed     |Added
---------------------------------------
            Status |UNCONFIRMED |NEW
  Last reconfirmed |            |2023-07-19
    Ever confirmed |0           |1

--- Comment #3 from Richard Biener ---
Confirmed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #2 from Jakub Jelinek ---
Improved testcase which shows similar behavior also with bitfields:

#ifdef __SIZEOF_INT128__
#define type __int128
#define N 59
#else
#define type long long
#define N 27
#endif

struct S { type a : sizeof (type) * __CHAR_BIT__ - N; };

unsigned type
foo (unsigned type x)
{
  x <<= N;
  return ((type) x) >> N;
}

unsigned type
bar (struct S *p)
{
  return p->a;
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------
                CC |        |uros at gcc dot gnu.org
            Target |        |x86_64-linux
          Keywords |        |missed-optimization

--- Comment #1 from Jakub Jelinek ---
Haven't tried other targets.