[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2024-05-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|14.0|14.2

--- Comment #18 from Richard Biener  ---
GCC 14.1 is being released, retargeting bugs to GCC 14.2.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-12-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #17 from GCC Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:ff8d0ce17fb585a29a83349acbc67b2dd3556629

commit r14-6495-gff8d0ce17fb585a29a83349acbc67b2dd3556629
Author: Roger Sayle 
Date:   Wed Dec 13 13:36:44 2023 +

ARC: Add *extvsi_n_0 define_insn_and_split for PR 110717.

This patch improves the code generated for bitfield sign extensions on
ARC cpus without a barrel shifter.

Compiling the following test case:

int foo(int x) { return (x<<27)>>27; }

with -O2 -mcpu=em, generates two loops:

foo:    mov     lp_count,27
        lp      2f
        add     r0,r0,r0
        nop
2:      # end single insn loop
        mov     lp_count,27
        lp      2f
        asr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

and the closely related test case:

struct S { int a : 5; };
int bar (struct S *p) { return p->a; }

generates the slightly better:

bar:    ldb_s   r0,[r0]
        mov_s   r2,0    ;3
        add3    r0,r2,r0
        sexb_s  r0,r0
        asr_s   r0,r0
        asr_s   r0,r0
        j_s.d   [blink]
        asr_s   r0,r0

which uses 6 instructions to perform this particular sign extension.
It turns out that sign extensions can always be implemented using at
most three instructions on ARC (without a barrel shifter) using the
idiom ((x)^msb)-msb [as described in section "2-5 Sign Extension"
of Henry Warren's book "Hacker's Delight"].  Using this, the sign
extensions above on ARC's EM both become:

bmsk_s  r0,r0,4
xor r0,r0,16
sub r0,r0,16

which takes about 3 cycles, compared to the ~112 cycles for the loops
in foo.
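
For reference, here is that Hacker's Delight idiom in portable C (a
sketch only; the helper name is illustrative and not part of the patch):

/* Sketch of the ((x ^ msb) - msb) idiom: sign-extend the low N bits.
   sext (x, 5) computes the same value as (x << 27) >> 27 on 32-bit int,
   matching the three-instruction bmsk/xor/sub sequence above.  */
static inline int
sext (unsigned int x, int n)
{
  unsigned int msb = 1u << (n - 1);   /* sign bit of the N-bit field    */
  x &= 2 * msb - 1;                   /* bmsk: keep only the low N bits */
  return (int) ((x ^ msb) - msb);     /* flip and subtract the sign bit */
}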

2023-12-13  Roger Sayle  
Jeff Law  

gcc/ChangeLog
* config/arc/arc.md (*extvsi_n_0): New define_insn_and_split to
implement SImode sign extract using an AND, XOR and MINUS sequence.

gcc/testsuite/ChangeLog
* gcc.target/arc/extvsi-1.c: New test case.
* gcc.target/arc/extvsi-2.c: Likewise.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-10-30 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #16 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:31cc9824d1cd5e0f7fa145d0831a923479333cd6

commit r14-5013-g31cc9824d1cd5e0f7fa145d0831a923479333cd6
Author: Roger Sayle 
Date:   Mon Oct 30 16:17:42 2023 +

ARC: Improved ARC rtx_costs/insn_cost for SHIFTs and ROTATEs.

This patch overhauls the ARC backend's insn_cost target hook, and makes
some related improvements to rtx_costs, BRANCH_COST, etc.  The primary
goal is to allow the backend to indicate that shifts and rotates are
slow (discouraged) when the CPU doesn't have a barrel shifter. I should
also acknowledge Richard Sandiford for inspiring the use of set_cost
in this rewrite of arc_insn_cost; this implementation borrows heavily
from the target hooks for AArch64 and ARM.

The motivating example is derived from PR rtl-optimization/110717.

struct S { int a : 5; };
unsigned int foo (struct S *p) {
  return p->a;
}

With a barrel shifter, GCC -O2 generates the reasonable:

foo:    ldb_s   r0,[r0]
        asl_s   r0,r0,27
        j_s.d   [blink]
        asr_s   r0,r0,27

What's interesting is that during combine, the middle-end actually
has two shifts by three bits, and a sign-extension from QI to SI.

Trying 8, 9 -> 11:
    8: r158:SI=r157:QI#0<<0x3
      REG_DEAD r157:QI
    9: r159:SI=sign_extend(r158:SI#0)
      REG_DEAD r158:SI
   11: r155:SI=r159:SI>>0x3
      REG_DEAD r159:SI

Whilst it's reasonable to simplify this to two shifts by 27 bits when
the CPU has a barrel shifter, it's actually a significant pessimization
when these shifts are implemented by loops.  This combination can be
prevented if the backend provides accurate-ish estimates for insn_cost.
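
As a rough illustration, an insn_cost hook can charge roughly one
instruction per bit for constant shifts that will be expanded as loops.
This is only a sketch of the idea, not the actual arc_insn_cost
(TARGET_BARREL_SHIFTER is ARC's flag; the rest is generic RTL cost
machinery):

/* Sketch: penalize constant shifts when there is no barrel shifter,
   at roughly one insn per shifted bit.  */
static int
sketch_insn_cost (rtx_insn *insn, bool speed)
{
  rtx set = single_set (insn);
  if (!set)
    return pattern_cost (PATTERN (insn), speed);

  rtx src = SET_SRC (set);
  if ((GET_CODE (src) == ASHIFT
       || GET_CODE (src) == ASHIFTRT
       || GET_CODE (src) == LSHIFTRT)
      && CONST_INT_P (XEXP (src, 1))
      && !TARGET_BARREL_SHIFTER)
    return COSTS_N_INSNS (INTVAL (XEXP (src, 1)));

  return COSTS_N_INSNS (1);
}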

Previously, without a barrel shifter, GCC -O2 -mcpu=em generates:

foo:    ldb_s   r0,[r0]
        mov     lp_count,27
        lp      2f
        add     r0,r0,r0
        nop
2:      # end single insn loop
        mov     lp_count,27
        lp      2f
        asr     r0,r0
        nop
2:      # end single insn loop
        j_s     [blink]

which contains two loops and requires ~113 cycles to execute.
With this patch to rtx_cost/insn_cost, GCC -O2 -mcpu=em generates:

foo:    ldb_s   r0,[r0]
        mov_s   r2,0    ;3
        add3    r0,r2,r0
        sexb_s  r0,r0
        asr_s   r0,r0
        asr_s   r0,r0
        j_s.d   [blink]
        asr_s   r0,r0

which requires only ~6 cycles for the shorter shifts by 3 and the sign
extension.

2023-10-30  Roger Sayle  

gcc/ChangeLog
* config/arc/arc.cc (arc_rtx_costs): Improve cost estimates.
Provide reasonable values for SHIFTS and ROTATES by constant
bit counts depending upon TARGET_BARREL_SHIFTER.
(arc_insn_cost): Use insn attributes if the instruction is
recognized.  Avoid calling get_attr_length for type "multi",
i.e. define_insn_and_split patterns without explicit type.
Fall back to set_rtx_cost for single_set and pattern_cost
otherwise.
* config/arc/arc.h (COSTS_N_BYTES): Define helper macro.
(BRANCH_COST): Improve/correct definition.
(LOGICAL_OP_NON_SHORT_CIRCUIT): Preserve previous behavior.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-08-04 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #15 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:c572f09a751cbd365e2285b30527de5ab9025972

commit r14-2998-gc572f09a751cbd365e2285b30527de5ab9025972
Author: Roger Sayle 
Date:   Fri Aug 4 16:26:06 2023 +0100

Specify signed/unsigned/dontcare in calls to extract_bit_field_1.

This patch is inspired by Jakub's work on PR rtl-optimization/110717.
The bitfield example described in comment #2 looks like:

struct S { __int128 a : 69; };
unsigned type bar (struct S *p) {
  return p->a;
}

which on x86_64 with -O2 currently generates:

bar:    movzbl  8(%rdi), %ecx
        movq    (%rdi), %rax
        andl    $31, %ecx
        movq    %rcx, %rdx
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

The ANDL $31 is interesting... we first extract an unsigned 69-bit bitfield
by masking/clearing the top bits of the most significant word, and then
it gets sign-extended by left shifting and arithmetic right shifting.
Obviously, this bit-wise AND is redundant: for signed bit-fields, we don't
require these bits to be cleared if we're about to set them appropriately.

This patch eliminates this redundancy in the middle-end, during RTL
expansion, by extending the extract_bit_field APIs so that the integer
UNSIGNEDP argument takes a special value: 0 indicates the field should
be sign extended, 1 (or any other non-zero value) indicates the field
should be zero extended, and -1 indicates a third option, that we don't
care how or whether the field is extended.  By passing and checking this
sentinel value at the appropriate places we avoid the useless bit masking
(on all targets).
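
The redundancy can be seen in plain C as well; in the following
illustration (not from the patch, and assuming 64-bit long with GCC's
arithmetic right shift) both functions compute the same value, because
the left shift discards exactly the bits the mask would clear:

/* Sign-extract the low 5 bits of the high word of the __int128:
   the & 31 is dead, since the << 59 discards the top 59 bits anyway.  */
long ext_with_mask (unsigned long hi)
{
  return (long) ((hi & 31) << 59) >> 59;
}

long ext_without_mask (unsigned long hi)
{
  return (long) (hi << 59) >> 59;
}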

For the test case above, with this patch we now generate:

bar:    movzbl  8(%rdi), %ecx
        movq    (%rdi), %rax
        movq    %rcx, %rdx
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

2023-08-04  Roger Sayle  

gcc/ChangeLog
* expmed.cc (extract_bit_field_1): Document that an UNSIGNEDP
value of -1 is equivalent to don't care.
(extract_integral_bit_field): Indicate that we don't require
the most significant word to be zero extended, if we're about
to sign extend it.
(extract_fixed_bit_field_1): Document that an UNSIGNEDP value
of -1 is equivalent to don't care.  Don't clear the most
significant bits with AND mask when UNSIGNEDP is -1.

gcc/testsuite/ChangeLog
* gcc.target/i386/pr110717-2.c: New test case.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #14 from Andrew Pinski  ---
(In reply to Segher Boessenkool from comment #13)
> So.  Before expand we have
> 
>   _6 = (__int128) x_3(D);
>   x.0_1 = _6 << 59;
>   _2 = x.0_1 >> 59;


Jakub is trying to emulate this using shifts, but this is better done using
bitfields just to get the truncation:


#ifdef __SIZEOF_INT128__
#define type __int128
#else
#define type long long
#endif

struct f
{
#ifdef __SIZEOF_INT128__
  __int128 t:(128-59);
#else
  long long t:(64-27);
#endif
};

unsigned type
foo (unsigned type x)
{
  struct f t;
  t.t = x;
  return t.t;
}

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #13 from Segher Boessenkool  ---
So.  Before expand we have

  _6 = (__int128) x_3(D);
  x.0_1 = _6 << 59;
  _2 = x.0_1 >> 59;
  _4 = (__int128 unsigned) _2;
  return _4;

That should have been optimised better :-(

The RTL code it expands to sets the same pseudo multiple times.  Bad bad bad.
This hampers many optimisations.  Like:
(insn 6 3 7 2 (set (reg:DI 124)
        (lshiftrt:DI (reg:DI 129 [ x+8 ])
            (const_int 5 [0x5]))) "110717.c":6:11 299 {lshrdi3}
     (nil))
(insn 7 6 8 2 (set (reg:DI 132)
        (ashift:DI (reg:DI 128 [ x ])
            (const_int 59 [0x3b]))) "110717.c":6:11 289 {ashldi3}
     (nil))
(insn 8 7 9 2 (set (reg:DI 132)
        (ior:DI (reg:DI 124)
            (reg:DI 132))) "110717.c":6:11 233 {*booldi3}
     (nil))
(They are subregs right after expand, totally unreadable; this is after
subreg1, slightly more readable, but essentially the same code still).
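
Spelled out in C, the three insns are just the generic double-word left
shift: insns 6 and 7 compute the two pieces of the high word, and insn 8
ORs them together.  A rough rendering, valid for shift counts 0 < c < 64:

/* Generic double-word x << 59 with 64-bit halves.  */
void shl128_by_59 (unsigned long lo, unsigned long hi,
                   unsigned long *lo_out, unsigned long *hi_out)
{
  *hi_out = (lo >> (64 - 59)) | (hi << 59);   /* insns 6, 7 and 8 */
  *lo_out = lo << 59;
}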

The web pass eventually gets rid of the double set in this case.

Because the shift-left-then-right survives all the way to combine, it (being
the greedy bastard that it is) will use the combiner patterns rs6000 has for
multi-precision shifts, before it would notice the two (multiprecision!)
shifts together are largely a no-op, so you get stuck at a local optimum.
Par for the course for combine :-/

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #12 from Segher Boessenkool  ---
(In reply to Jakub Jelinek from comment #9)
> Wonder how many important targets provide double-word shift patterns vs.
> ones which expand it through generic code.

Very long ago rs6000 had special code for this.  That was sub-optimal in
other ways, and the generic code generated almost ideal code (sometimes an
extra data movement insn).

> powerpc probably could be improved:
> foo:
> srwi 9,4,5
> mr 10,9
> rlwimi 4,9,5,0,31-5
> rlwimi 10,3,27,0,31-27
> srawi 3,10,27
> blr

This is hugely worse than what we used to do, it seems?

GCC 8 did

srdi 9,4,5
rldimi 9,3,59,0
rldimi 4,9,5,0
sradi 3,9,59
blr

GCC 9 started with the unnecessary move.

But we should get only one insert insn in any case!

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #11 from Jakub Jelinek  ---
To handle this in generic code, I think expand_expr_real_2 would need to
notice the case where the operand of an arithmetic >> is a << by the same
amount, and tell that to expand_variable_shift -> expand_shift_1 ->
expand_binop somehow.
Wasn't there a proposal for SEXT_EXPR at some point?

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-20 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #10 from Jakub Jelinek  ---
Though, grepping tmp-mddump.md files shows only x86 having ashlti3 and ashrti3
expanders.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-20 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Jakub Jelinek  changed:

   What|Removed |Added

 CC||krebbel at gcc dot gnu.org,
   ||law at gcc dot gnu.org,
   ||rsandifo at gcc dot gnu.org,
   ||segher at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek  ---
Wonder how many important targets provide double-word shift patterns vs. ones
which expand it through generic code.
aarch64 looks quite small:
foo:
extr x1, x1, x0, 5
asr x1, x1, 59
ret
powerpc probably could be improved:
foo:
srwi 9,4,5
mr 10,9
rlwimi 4,9,5,0,31-5
rlwimi 10,3,27,0,31-27
srawi 3,10,27
blr
ditto s390x:
foo:
lg %r1,0(%r3)
lg %r3,8(%r3)
srlg %r5,%r3,5
lghi %r0,31
sllg %r1,%r1,59
ogr %r1,%r5
ngr %r3,%r0
sllg %r5,%r5,5
srag %r1,%r1,59
ogr %r3,%r5
stg %r3,8(%r2)
stg %r1,0(%r2)
br %r14
ditto riscv64:
foo:
srli a5,a0,5
slli a1,a1,59
or a1,a5,a1
slli a5,a1,5
andi a0,a0,31
or a0,a5,a0
srai a1,a1,59
ret

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|ubizjak at gmail dot com   |unassigned at gcc dot gnu.org
 Status|ASSIGNED|NEW

--- Comment #8 from Uroš Bizjak  ---
(In reply to CVS Commits from comment #7)
> The master branch has been updated by Uros Bizjak :

The patch implements the transform for x86 targets only.  Due to the
eventual STV transformation, x86 targets handle double-word operations in
their own way.

I'll leave the target-independent implementation to someone else.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-20 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #7 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:b50a851eef4b70aabf28fa875d9b2a302d17b66a

commit r14-2684-gb50a851eef4b70aabf28fa875d9b2a302d17b66a
Author: Uros Bizjak 
Date:   Thu Jul 20 20:54:51 2023 +0200

i386: Double-word sign-extension missed-optimization [PR110717]

When sign-extending the value in a double-word register pair using a
shift and ashiftrt sequence with the same immediate count value less
than the word width, there is no need to shift the lower word of the
value.  The sign-extension could be limited to the upper word, but we
uselessly shift the lower word with it as well:
        movq    %rdi, %rax
        movq    %rsi, %rdx
        shldq   $59, %rdi, %rdx
        salq    $59, %rax
        shrdq   $59, %rdx, %rax
        sarq    $59, %rdx
        ret
for -m64 and
        movl    4(%esp), %eax
        movl    8(%esp), %edx
        shldl   $27, %eax, %edx
        sall    $27, %eax
        shrdl   $27, %edx, %eax
        sarl    $27, %edx
        ret
for -m32.
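
A reproducer in the spirit of the testcase from comment #2, with the
macros expanded for the -m64 case:

unsigned __int128
foo (unsigned __int128 x)
{
  x <<= 59;
  return (__int128) x >> 59;
}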

The patch introduces a new post-reload splitter to provide the combined
ASHIFTRT/SHIFT instruction pattern.  The instruction is split into a
sequence of SAL and SAR insns with the same count immediate operand:
        movq    %rsi, %rdx
        movq    %rdi, %rax
        salq    $59, %rdx
        sarq    $59, %rdx
        ret

Some complication is required to properly handle STV transform, where we
emit a sequence with DImode PSLLQ and PSRAQ insns for 32-bit AVX512VL
targets when profitable.

The patch also fixes a small oversight and enables STV transform of SImode
ASHIFTRT to PSRAD also for SSE2 targets.

PR target/110717

gcc/ChangeLog:

* config/i386/i386-features.cc
(general_scalar_chain::compute_convert_gain): Calculate gain
for extend highpart case.
(general_scalar_chain::convert_op): Handle
ASHIFTRT/ASHIFT combined RTX.
(general_scalar_to_vector_candidate_p): Enable ASHIFTRT for
SImode for SSE2 targets.  Handle ASHIFTRT/ASHIFT combined RTX.
* config/i386/i386.md (*extend2_doubleword_highpart):
New define_insn_and_split pattern.
(*extendv2di2_highpart_stv): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110717.c: New test.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-19 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 CC|uros at gcc dot gnu.org|

--- Comment #6 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #5)
> Thanks.
> Shouldn't
> INTVAL (operands[2]) <  * BITS_PER_UNIT
> be
> UINTVAL (operands[2]) <  * BITS_PER_UNIT
> just to make sure it doesn't trigger for negative?

Ah, yes, I'll change it.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-19 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #5 from Jakub Jelinek  ---
Thanks.
Shouldn't
INTVAL (operands[2]) <  * BITS_PER_UNIT
be
UINTVAL (operands[2]) <  * BITS_PER_UNIT
just to make sure it doesn't trigger for negative?
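
(The point: INTVAL yields a signed HOST_WIDE_INT, so a negative count
would satisfy the signed comparison, while UINTVAL wraps it to a huge
unsigned value that fails the bound check.  A tiny standalone
illustration, with 64 standing in for the elided mode size times
BITS_PER_UNIT:)

#include <stdio.h>

int main (void)
{
  long long count = -1;     /* stands in for INTVAL (operands[2]) */
  long long bound = 64;

  printf ("signed:   %d\n", count < bound);   /* 1: slips through */
  printf ("unsigned: %d\n",
          (unsigned long long) count < (unsigned long long) bound);  /* 0 */
  return 0;
}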

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-19 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #4 from Uroš Bizjak  ---
Created attachment 55578
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55578&action=edit
Proposed patch

Patch in testing.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-19 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-07-19
 Ever confirmed|0   |1

--- Comment #3 from Richard Biener  ---
Confirmed.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-18 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

--- Comment #2 from Jakub Jelinek  ---
Improved testcase which shows similar behavior also with bitfields:

#ifdef __SIZEOF_INT128__
#define type __int128
#define N 59
#else
#define type long long
#define N 27
#endif

struct S { type a : sizeof (type) * __CHAR_BIT__ - N; };

unsigned type
foo (unsigned type x)
{
  x <<= N;
  return ((type) x) >> N;
}

unsigned type
bar (struct S *p)
{
  return p->a;
}

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-18 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Jakub Jelinek  changed:

   What|Removed |Added

 CC||uros at gcc dot gnu.org
 Target||x86_64-linux
   Keywords||missed-optimization

--- Comment #1 from Jakub Jelinek  ---
Haven't tried other targets.