[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-20 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Roger Sayle  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-05-20

--- Comment #2 from Roger Sayle  ---
I can confirm that I can reproduce this and see the same thing.
Adding 
  vi tmp1 = Set_i32(INT32_MAX);
  d_i("tmp1",tmp1.raw);
at multiple places in bug.cc, reveals that sometimes the result is the correct
[0x7ff x 4], and at other places is the incorrect [0x8000 x 4], even
though this affected snippet doesn't involve binary operation simplification.

[Bug target/106060] Inefficient constant broadcast on x86_64

2024-05-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106060

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
  Known to work||15.0
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Roger Sayle  ---
This has now been fixed on mainline for GCC 15.  There are still improvements
that can be made to vector constant materialization/initialization on x86_64,
but the issues/ideas described in this bugzilla PR are all now implemented. 
Thanks.

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-10 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

--- Comment #2 from Roger Sayle  ---
Here's a reduced test case that should be unaffected by the pending changes to
how V8QI shifts are expanded.  Note that the final "t -= t4" is required to
convince the register allocator to "spill".

typedef signed char v16qi __attribute__ ((__vector_size__ (16)));
// sign-extend low 3 bits to a byte.
v16qi foo (v16qi x) {
v16qi t7 = (v16qi){7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
v16qi t4 = (v16qi){4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
v16qi t = x & t7;
t ^= t4;
t -= t4;
return t;
}

which produces:

foo:    movl    $67372036, %eax
        vmovdqa %xmm0, %xmm2
        vpbroadcastd    %eax, %xmm1
        movl    $117901063, %eax
        vpbroadcastd    %eax, %xmm3
        vmovdqa %xmm1, %xmm0
        vmovdqa %xmm3, -24(%rsp)
        vmovdqa -24(%rsp), %xmm4
        vpternlogd      $120, %xmm2, %xmm4, %xmm0
        vpsubb  %xmm1, %xmm0, %xmm0
        ret
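
For readers following the test case, here is a minimal scalar sketch of what
each byte lane of foo() computes, plus a check of the two broadcast constants
(67372036 is 0x04040404 and 117901063 is 0x07070707).  The helper names
sign_extend_low3 and broadcast_byte are made up for illustration and are not
part of the bug report.

```c
#include <stdint.h>

/* Scalar model of one byte lane of foo(): keep the low 3 bits, then
   sign-extend them with the xor/subtract trick ((x & 7) ^ 4) - 4.  */
static int8_t sign_extend_low3(uint8_t x)
{
    int t = x & 7;      /* t is in [0, 7] */
    t ^= 4;             /* flip the 3-bit sign bit */
    t -= 4;             /* result is in [-4, 3] */
    return (int8_t)t;
}

/* Replicate a byte into all four bytes of a 32-bit word; this is the
   value that movl loads before vpbroadcastd copies it to every lane.  */
static uint32_t broadcast_byte(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}
```

For example, sign_extend_low3(7) gives -1 and broadcast_byte(4) gives
67372036, matching the immediate in the first movl.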

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-10 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com
   Last reconfirmed||2024-05-10
 CC||roger at nextmovesoftware dot 
com
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #1 from Roger Sayle  ---
I have a patch for x86 ternlog handling that changes the output for this
testcase (without the pending change to optimize V8QI shifts) to:
foo:movl$67372036, %eax
vpsraw  $5, %xmm0, %xmm0
vpbroadcastd%eax, %xmm1
vpternlogd  $108, .LC0(%rip), %xmm1, %xmm0
vpsubb  %xmm1, %xmm0, %xmm0
ret
.align 16
.LC0:
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7

which at least doesn't construct the vector with a broadcast, and then "spill"
it to the stack before reading it back from memory.   I've no idea if this is
optimal, but it's certainly better than the current "spill".

I'm curious about what has changed to make this code (register allocation)
regress since GCC 13.  It was a patch of mine that changed broadcastb to
broadcastd, but that shouldn't have affected reload/register preferencing.
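
As an aside, the vpternlogd immediates in these listings ($120 and $108) can
be derived mechanically.  The sketch below assumes the usual convention that
the immediate is an 8-entry truth table indexed by (a << 2) | (b << 1) | c,
where a is the destination operand and b and c are the two sources in Intel
operand order.  The names bool3_fn and ternlog_imm are hypothetical, not GCC
or intrinsic APIs.

```c
#include <stdint.h>

/* Hypothetical helper type: a three-input boolean function on single bits. */
typedef unsigned (*bool3_fn)(unsigned a, unsigned b, unsigned c);

/* Build the 8-bit truth table: bit ((a << 2) | (b << 1) | c) of the
   result holds f(a, b, c).  */
static uint8_t ternlog_imm(bool3_fn f)
{
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1, b = (idx >> 1) & 1, c = idx & 1;
        if (f(a, b, c) & 1)
            imm |= (uint8_t)(1u << idx);
    }
    return imm;
}

/* a ^ (b & c): destination holds t4, sources hold t7 and x.  */
static unsigned a_xor_b_and_c(unsigned a, unsigned b, unsigned c)
{
    return a ^ (b & c);
}

/* b ^ (a & c): destination holds x, sources hold t4 and the constant 7s.  */
static unsigned b_xor_a_and_c(unsigned a, unsigned b, unsigned c)
{
    return b ^ (a & c);
}
```

Under that assumption, ternlog_imm(a_xor_b_and_c) is 120 and
ternlog_imm(b_xor_a_and_c) is 108, so both instructions compute
(x & t7) ^ t4 and differ only in which operand is the destination.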

[Bug middle-end/78947] sub-optimal code for (bool)(int ? int : int)

2024-05-06 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78947

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com
 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Roger Sayle  ---
As Andrew mentioned in comment #2, this has been fixed/resolved since GCC v9.
Mainline g++ -O3 currently generates:
condSet(int, int, int):
        test    edi, edi
        cmovne  edx, esi
        test    edx, edx
        setne   al
        ret

[I believe the status change/assignment in comment #3 was due to a typo in the
bugzilla PR number].
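
For reference, a function of roughly the following shape would match the
assembly above.  This is a sketch inferred from the bug title and the
register usage (edi, esi, edx as the three int arguments); the parameter
names are assumptions, and the original test case may differ.

```c
#include <stdbool.h>

/* Assumed shape of the test case: select one of two ints, then convert
   the selected value to bool (non-zero becomes true).  */
bool condSet(int cond, int a, int b)
{
    return cond ? a : b;
}
```

The cmovne performs the selection and the final test/setne pair performs the
conversion to bool, so no branch is needed.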

[Bug middle-end/85559] [meta-bug] Improve conditional move

2024-05-06 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85559
Bug 85559 depends on bug 78947, which changed state.

Bug 78947 Summary: sub-optimal code for (bool)(int ? int : int)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78947

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/113832] [14/15 Regression] 6% exec time regression of 464.h264ref on aarch64

2024-04-30 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113832

--- Comment #5 from Roger Sayle  ---
I'm trying to confirm that there are actually widening multiplications in
464.h264ref (on aarch64), but if anyone's already done an analysis of what
might be causing these performance swings, please do post (a pointer here).

[Bug tree-optimization/113673] [12/13/14/15 Regression] ICE: verify_flow_info failed: BB 5 cannot throw but has an EH edge with -Os -finstrument-functions -fnon-call-exceptions -ftrapv

2024-04-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113673

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com
 Status|NEW |ASSIGNED

--- Comment #6 from Roger Sayle  ---
Created attachment 58051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58051&action=edit
proposed patch

Bootstrapping and regression testing the attached patch.

[Bug target/43644] __uint128_t missed optimizations.

2024-04-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||roger at nextmovesoftware dot 
com
 Status|NEW |RESOLVED
   Target Milestone|--- |14.0

--- Comment #6 from Roger Sayle  ---
This is now fixed on mainline (for GCC 14 and GCC 15).

[Bug rtl-optimization/97756] [11/12/13 Regression] Inefficient handling of 128-bit arguments

2024-04-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

Roger Sayle  changed:

   What|Removed |Added

  Known to work||14.0
Summary|[11/12/13/14/15 Regression] |[11/12/13 Regression]
   |Inefficient handling of |Inefficient handling of
   |128-bit arguments   |128-bit arguments

--- Comment #17 from Roger Sayle  ---
I believe this issue is now fixed on mainline (i.e. for both GCC 14 and GCC
15).
Firstly, many thanks to Jakub for correcting the error in my patch. We now
generate optimal code sequences for the code in comments #3 and #5, and
generate fewer instructions than described in the original description.

The final remaining issue is that with -O3 GCC still uses more instructions
than clang and icc (see Thomas' comments in comments #12 and #13).  The good
news is that this is intentional, compiling with -Os (to optimize for size)
generates the same number of instructions as clang and icc [in fact, using icc
-Os generates larger code!?].  So when optimizing for performance, GCC is
taking the opportunity to use more (cheap) instructions to execute faster (or
that's the theory).

[Bug middle-end/111701] [11/12/13/14 Regression] wrong code for __builtin_signbit(x*x)

2024-04-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111701

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #2 from Roger Sayle  ---
A patch to provide a possible solution/workaround has been proposed at
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/650054.html
With that change, compiling the code in the original description with the
-fsignaling-nans command line option, avoids the abort.

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #5 from Roger Sayle  ---
Another interesting (simpler) case of -ffast-math pessimization is:
void foo(_Complex double *c)
{
for (int i=0; i<16; i++)
  c[i] += __builtin_complex(1.0,0.0);
}

Again without -ffast-math we vectorize consecutive additions, but with
-ffast-math we (not so) cleverly avoid every second addition by producing
significantly larger code that shuffles the real/imaginary parts around.

This even suggests a missed-optimization for:
void bar(_Complex double *c, double x)
{
for (int i=0; i<16; i++)
  c[i] += x;
}

which may be more efficiently implemented (when safe) by:
void bar(_Complex double *c, double x)
{
for (int i=0; i<16; i++)
  c[i] += __builtin_complex(x,0.0);
}

i.e. insert/interleave a no-op zero addition, to simplify the vectorization.

The existence of a suitable identity operation (+0, *1.0, &~0, |0, ^0) can be
used to avoid shuffling/permuting values/lanes out of vectors, when it's
possible for the vector operation to leave the other values unchanged.
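
A small sketch of that identity idea, reusing the shape of bar() above: the
two variants are compared on sample data whose imaginary parts are non-zero.
The names bar_real, bar_padded and bar_variants_agree are hypothetical, and
__builtin_complex is a compiler builtin (GCC, and recent clang), so this is
not portable C.  Presumably the "when safe" caveat covers cases such as
signed zeros, which this sketch deliberately avoids.

```c
#define N 16

/* Add a real scalar: only the real parts need to change.  */
void bar_real(_Complex double *c, double x)
{
    for (int i = 0; i < N; i++)
        c[i] += x;
}

/* Add (x, 0.0): every lane takes part in the addition, the imaginary
   lanes simply add zero.  */
void bar_padded(_Complex double *c, double x)
{
    for (int i = 0; i < N; i++)
        c[i] += __builtin_complex(x, 0.0);
}

/* Returns 1 if both variants give equal results on a sample input
   whose imaginary parts are all non-zero.  */
int bar_variants_agree(double x)
{
    _Complex double p[N], q[N];
    for (int i = 0; i < N; i++)
        p[i] = q[i] = __builtin_complex((double)i, (double)(i + 1));
    bar_real(p, x);
    bar_padded(q, x);
    for (int i = 0; i < N; i++)
        if (p[i] != q[i])
            return 0;
    return 1;
}
```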

[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal

2024-04-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #3 from Roger Sayle  ---
Richard has already changed this from "gfortran" to "tree-optimization", but
for the record, the C equivalent of this test case (with the same issue) is:

void scale_i(_Complex double *c, int n)
{
for (int i=0; i

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-07 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544

Roger Sayle  changed:

   What|Removed |Added

   Last reconfirmed||2024-04-07
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
 CC||roger at nextmovesoftware dot 
com

[Bug middle-end/114552] [13/14 Regression] wrong code at -O1 and above on x86_64-linux-gnu since r13-990

2024-04-02 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114552

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #6 from Roger Sayle  ---
Many thanks Jakub, and my apologies for the breakage/inconvenience.  It looks
like sizeof(k) is 10 bytes, and sizeof(k.b) is 6 bytes, and somehow this code
is getting the constructor for "k" and not for just "k.b".  This is, of course,
fine for memcpy as it can move just the pieces it wants.  I completely
agree that the safe fix is to check that the sizes match; I don't think I ever
considered that they might not be identical when I wrote this code, or assumed
that partial would be non-zero for this case.

[Bug target/114284] [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice) since r14-8319

2024-03-09 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #10 from Roger Sayle  ---
Thanks Jakub.  My apologies for the unintentional breakage.

[Bug target/114187] [14 regression] bizarre register dance on x86_64 for pass-by-value struct since r14-2526

2024-03-01 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com

--- Comment #4 from Roger Sayle  ---
Created attachment 57587
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57587&action=edit
proposed patch

Proposed fix attached.  Currently bootstrapping and regression testing.  The
problematic code (from March 2023) has an interesting history.

[Bug target/114187] [14 regression] bizarre register dance on x86_64 for pass-by-value struct since r14-2526

2024-03-01 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187

Roger Sayle  changed:

   What|Removed |Added

   Last reconfirmed||2024-03-01
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

--- Comment #3 from Roger Sayle  ---
There's a missing simplification in combine:

Trying 6 -> 11:
6: r102:TI=zero_extend(r109:DF#0)<<0x40|zero_extend(r108:DF#0)
  REG_DEAD r108:DF
  REG_DEAD r109:DF
   11: r105:DF=r102:TI#0+r102:TI#8
  REG_DEAD r102:TI
Failed to match this instruction:
(set (reg:DF 105 [ _4 ])
(plus:DF (subreg:DF (ior:TI (ashift:TI (zero_extend:TI (subreg:DI (reg:DF
109) 0))
(const_int 64 [0x40]))
(zero_extend:TI (subreg:DI (reg:DF 108) 0))) 8)
(reg:DF 108)))

where the lowpart is getting simplified to reg:DF 108, but the highpart isn't
getting simplified to reg:DF 109.  i.e.

(subreg:DF (ior:TI (ashift:TI (zero_extend:TI (subreg:DI (reg:DF 109) 0))
  (const_int 64 [0x40]))
  (zero_extend:TI (subreg:DI (reg:DF 108) 0))) 8)
can be simplified to just (reg:DF 109).

I'm looking into why this isn't happening.
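
The identity in question can be checked on ordinary integers.  The sketch
below models the TImode value as an unsigned __int128 (a GCC/Clang extension
on 64-bit targets) and the two DImode pieces as uint64_t; on a little-endian
target, byte offset 8 corresponds to the high half.  The helper names are
made up for illustration.

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* compiler extension, 64-bit targets */

/* Build a 128-bit value as (zero_extend (hi) << 64) | zero_extend (lo).  */
static u128 concat64(uint64_t hi, uint64_t lo)
{
    return ((u128)hi << 64) | (u128)lo;
}

/* Models the subreg at byte offset 0.  */
static uint64_t lowpart(u128 v)
{
    return (uint64_t)v;
}

/* Models the subreg at byte offset 8.  */
static uint64_t highpart(u128 v)
{
    return (uint64_t)(v >> 64);
}
```

For any hi and lo, lowpart(concat64(hi, lo)) is lo and
highpart(concat64(hi, lo)) is hi, which is the pair of simplifications
described above.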

[Bug other/113336] [14 Regression] libatomic (testsuite) regressions on arm

2024-02-17 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Roger Sayle  ---
This should now be fixed on mainline.

[Bug target/106060] Inefficient constant broadcast on x86_64

2024-02-16 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106060

Roger Sayle  changed:

   What|Removed |Added

   Target Milestone|--- |15.0

--- Comment #5 from Roger Sayle  ---
For the record (so it doesn't get lost) the final patch was posted at
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643973.html
and approved (for stage 1) at
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643996.html

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2024-02-16 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Roger Sayle  ---
This should now be fixed on mainline.  The testsuite regressions (on non-x86
targets) are cosmetic, i.e. neither wrong code nor worse performance/size, just
differences in expected code generation.

[Bug target/113690] [13 Regression] ICE: in as_a, at machmode.h:381 with -O2 -fno-dce -fno-forward-propagate -fno-split-wide-types -funroll-loops

2024-02-16 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113690

Roger Sayle  changed:

   What|Removed |Added

Summary|[13/14 Regression] ICE: in  |[13 Regression] ICE: in
   |as_a, at machmode.h:381 |as_a, at machmode.h:381
   |with -O2 -fno-dce   |with -O2 -fno-dce
   |-fno-forward-propagate  |-fno-forward-propagate
   |-fno-split-wide-types   |-fno-split-wide-types
   |-funroll-loops  |-funroll-loops

--- Comment #6 from Roger Sayle  ---
This has now been fixed on mainline.  Please let me know if this is worth
backporting to GCC 13.

[Bug tree-optimization/112508] [14 Regression] Size regression when using -Os starting with r14-4089-gd45ddc2c04e

2024-02-15 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112508

Roger Sayle  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
 CC||roger at nextmovesoftware dot 
com
   Last reconfirmed||2024-02-15

--- Comment #2 from Roger Sayle  ---
The issue appears to be with (poor costing in) loop invariant store motion. 
Adding the command line option "-fno-move-loop-stores" reduces the .s file from
149 lines to 54 lines, and the size of main (as reported by objdump -d) from
317 bytes to 73 bytes.   To confirm that this isn't specific to this (possibly
pathological/obscure) test case, I ran the CSiBE benchmark on x86_64, comparing
"-Os" to "-Os -fno-move-loop-stores", which shows a net saving of 1606 bytes
with -fno-move-loop-stores.  There are cases where -fno-move-loop-stores
reduces code size (on x86_64, and I've not investigated other targets), so I
guess it would be preferable to use more accurate size costs instead of just
disabling this sub-pass. Note that the bigger hammer, -fno-tree-loop-im, also
avoids the code growth, but the more precise/specific -fno-move-loop-stores is
sufficient.

[Bug target/113764] [X86] __builtin_clz generates lzcnt when bsr is sufficient

2024-02-11 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

Roger Sayle  changed:

   What|Removed |Added

Summary|[X86] Generates lzcnt when  |[X86] __builtin_clz
   |bsr is sufficient   |generates lzcnt when bsr is
   ||sufficient

--- Comment #4 from Roger Sayle  ---
Yep, CLZ_DEFINED_VALUE_AT_ZERO really complicates things.  With a single
"global" macro it's currently impossible for a backend to support two different
CLZ instructions; one with defined behavior at zero, and the other with
undefined behavior at zero.

It might just be possible to do something encoding LZCNT patterns in RTL using:
(if_then_else:SI (ne:SI (reg:SI x) (const_int 0))
 (clz:SI (reg:SI x))
 (const_int VALUE))

Additionally on x86_64, the BSR instruction sets the zero flag if its input is
zero, when the destination register becomes undefined, which can be useful with
CMOV, i.e. it's possible to get defined behavior without an additional test and
branch.  But for Pawel's original testcase, __builtin_clz is undefined at zero,
so this really is a missed optimization, with either -Os or a modern -march
such as cascadelake or znver4.

I agree with Jakub, this is a can of worms; potentially a lot of effort for a
marginal improvement.
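
At the source level, the if_then_else form above corresponds to something
like the following sketch.  The function name clz_or and the value_at_zero
parameter (standing in for VALUE) are hypothetical, and the examples assume
a 32-bit unsigned int.

```c
/* Count leading zeros with a caller-chosen result for x == 0, since
   __builtin_clz itself is undefined for a zero argument.  */
static int clz_or(unsigned x, int value_at_zero)
{
    return x != 0 ? __builtin_clz(x) : value_at_zero;
}
```

With value_at_zero set to 32 this has the LZCNT-style result at zero; a
caller that never passes zero does not need the select at all.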

[Bug target/113764] [X86] Generates lzcnt when bsr is sufficient

2024-02-09 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

--- Comment #2 from Roger Sayle  ---
Investigating further, the thinking behind GCC's current behaviour can be found
in Agner Fog's instruction tables; on many architectures BSR is much slower
than LZCNT.

Legacy AMD:       BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:       BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:       BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:     BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:         BSR=1 cycle,   LZCNT=1 cycle
INTEL:            BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING:  BSR=11 cycles, LZCNT=3 cycles

Hence using bsr is only "better" in some (but not all) contexts, and a
reasonable default (for generic tuning) is to ignore BSR when LZCNT is
available, as it's only one extra cycle of latency to perform the XOR.

The correct solution is to add a tuning parameter to the x86 backend, to
control whether it's beneficial to use BSR when LZCNT is available, for example
when optimizing for size with -Os or -Oz.  This is more reasonable now that
current Intel and AMD architectures have the same latency for BSR and LZCNT,
than when LZCNT first appeared (explaining !TARGET_LZCNT in i386.md).

[Bug tree-optimization/113673] [12/13/14 Regression] ICE: verify_flow_info failed: BB 5 cannot throw but has an EH edge with -Os -finstrument-functions -fnon-call-exceptions -ftrapv

2024-02-08 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113673

--- Comment #4 from Roger Sayle  ---
The identified patch implements += the same way as |=.  Presumably a version of
the test case replacing "m += *data++;" with "m |= *data++;" would be more
useful at identifying a patch that actually changed EH edges.

[Bug target/113832] [14 Regression] 6% exec time regression of 464.h264ref on aarch64

2024-02-08 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113832

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #2 from Roger Sayle  ---
Adding myself to Cc list (in case this is confirmed to be a widening multiply
issue).

[Bug target/113764] [X86] Generates lzcnt when bsr is sufficient

2024-02-07 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2024-02-08

--- Comment #1 from Roger Sayle  ---
Confirmed.  This issue has two parts.  The first is that the bsr_1 pattern (and
variants) is (are) conditional on !TARGET_LZCNT, so the bsrl instruction isn't
currently available with -mlzcnt.  The second is that the middle-end doesn't
have a preferred canonical RTL representation for this idiom, but all three of
the following equivalent functions should generate identical code:

unsigned bsr1(unsigned x) { return __builtin_clz(x) ^ 31; }
unsigned bsr2(unsigned x) { return 31 - __builtin_clz(x); }
unsigned bsr3(unsigned x) { return ~__builtin_clz(x) & 31; }

[Note that the tree-ssa optimizers do transform bsr3 into bsr1].
A suitable fix would be to add the equivalent clz(x)^31 variant pattern to
i386.md as a "synonymous" define_insn pattern.
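
The equivalence of the three forms only depends on the clz result c lying in
[0, 31], where 31 is all ones in the low five bits, so c ^ 31, 31 - c and
~c & 31 coincide.  A quick exhaustive check, with hypothetical helper names
and assuming a 32-bit unsigned int:

```c
/* Returns 1 if the three forms agree for a given clz result c.  */
static int bsr_forms_agree(unsigned c)
{
    unsigned f1 = c ^ 31;     /* bsr1 */
    unsigned f2 = 31 - c;     /* bsr2 */
    unsigned f3 = ~c & 31;    /* bsr3 */
    return f1 == f2 && f2 == f3;
}

/* Checks every value __builtin_clz can return for a non-zero input.  */
static int bsr_forms_agree_all(void)
{
    for (unsigned c = 0; c < 32; c++)
        if (!bsr_forms_agree(c))
            return 0;
    return 1;
}
```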

[Bug tree-optimization/113759] [14 regression] ICE when building fdk-aac-2.0.3 since r14-8680

2024-02-06 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113759

--- Comment #9 from Roger Sayle  ---
Many thanks Jakub.  Sorry again for the inconvenience.

[Bug target/113720] [14 Regression] internal compiler error: in extract_insn, at recog.cc:2812 targeting alpha-linux-gnu

2024-02-02 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113720

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #3 from Roger Sayle  ---
Sorry for the inconvenience.  alpha.md's define_expand that creates RTL that
contains a MULT with operands of different modes looks highly suspicious. 
Uros' patch to use the (relatively recently added) UMUL_HIGHPART rtx_code is
certainly a step in the right direction.

[Bug target/113690] [13/14 Regression] ICE: in as_a, at machmode.h:381 with -O2 -fno-dce -fno-forward-propagate -fno-split-wide-types -funroll-loops

2024-02-01 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113690

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com

--- Comment #4 from Roger Sayle  ---
I'm bootstrapping and regression testing a fix.

[Bug target/113701] Issues with __int128 argument passing

2024-02-01 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113701

Roger Sayle  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=106518

--- Comment #5 from Roger Sayle  ---
I like Uros' patch in comment #2.  There have been so many incremental changes
and improvements to x86 TImode and register allocation, that this legacy
heuristic (workaround?) is not only no longer useful, but it actually hurts
register allocation.  *cmp_doubleword appears to be the only (remaining?)
place this idiom is used.

Additionally, I think I've mentioned in the past that it might also be useful
to have a xchg/swap sinking pass, perhaps as part of cprop_hardreg, so that for
example swap followed by swap is eliminated, that swap with one destination
REG_DEAD is transformed into mov, etc.  Swap/xchg is almost always just hard
register renaming, so these should often be eliminatable, but the abstraction
is useful to allow this to happen.

[Bug other/113336] [14 Regression] libatomic (testsuite) regressions on arm

2024-01-28 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com
   Target Milestone|--- |14.0

--- Comment #7 from Roger Sayle  ---
A revised patch has been posted for review/approval to gcc-patches:
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644147.html

[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64

2024-01-28 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com

--- Comment #7 from Roger Sayle  ---
I'm bootstrapping and regression testing a patch.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-27 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #14 from Roger Sayle  ---
My apologies for not keeping folks updated on my thinking. Following Oleg's
feedback, I've decided to slim down my proposed fix to the bare minimum, and
postpone the other rtx_costs improvements until GCC 15 (or later), when I'll
have more time to use CSiBE to demonstrate the benefits/tradeoffs for -Os
and -Oz.  For example, with fwprop about to transition to insn_cost, it would
be good for the SH backend to provide a sh_insn_cost target hook.

The current minimal patch to fix this specific regression is:

diff --git a/gcc/config/sh/sh.cc b/gcc/config/sh/sh.cc
index 2c411c3..fba6c0fd465 100644
--- a/gcc/config/sh/sh.cc
+++ b/gcc/config/sh/sh.cc
@@ -3313,7 +3313,8 @@ sh_rtx_costs (rtx x, machine_mode mode ATTRIBUTE_UNUSED, int outer_code,
{
  *total = sh_address_cost (XEXP (XEXP (x, 0), 0),
GET_MODE (XEXP (x, 0)),
-   MEM_ADDR_SPACE (XEXP (x, 0)), true);
+   MEM_ADDR_SPACE (XEXP (x, 0)), true)
+  + COSTS_N_INSNS (1);
  return true;
}
   return false;

The minor complication is that as explained above this results in:
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times addc 6
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times cmp/pz 25
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times shll 3
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times subc 14

which were failures that were fixed (or silenced) by my solution to PR111267.
I will note that although the scan-assembler-times tests complain, this
tweak to sh_rtx_costs reduces the total number of instructions in pr59533-1.c,
which (normally) indicates that it's an improvement.

*** old.s   Thu Jan 25 22:54:11 2024
--- new.s   Thu Jan 25 22:54:23 2024
***
*** 15,23 
.global test_01
.type   test_01, @function
  test_01:
-   mov.b   @r4,r0
-   extu.b  r0,r0
mov.b   @r4,r1
cmp/pz  r1
mov #0,r1
rts
--- 15,22 
.global test_01
.type   test_01, @function
  test_01:
mov.b   @r4,r1
+   extu.b  r1,r0
cmp/pz  r1
mov #0,r1
rts
...

Hence I'm looking into PR59533, which has separate tests for sh2a and !sh2a,
and my latest discoveries are that -m2a isn't supported if I build gcc using
--target=sh3-linux-gnu, and that --target=sh2a-linux-gnu doesn't automatically
default to --target=sh2aeb-linux-gnu and instead gives a fatal error about
"SH2A does not support little-endian" during the build.  All part (joy?) of the
learning curve.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

Roger Sayle  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=59533

--- Comment #12 from Roger Sayle  ---
It should be mentioned that the fwprop fix for PR111267 also resolved several
FAILs in gcc.target/sh/pr59533.c.  I mention this as tweaking the cost of
SIGN_EXTEND in sh_rtx_costs interacts with the (redundant) extensions mentioned
in the initial description of PR59533.

[Bug other/113336] libatomic (testsuite) regressions on arm

2024-01-25 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|roger at nextmovesoftware dot com  |unassigned at gcc dot 
gnu.org
Summary|libatomic (testsuite)   |libatomic (testsuite)
   |regressions on  |regressions on arm
   |armv6-linux-gnueabihf   |

--- Comment #4 from Roger Sayle  ---
Hi Victor,
Yes, I agree your approach is better/less invasive than mine.  I simply copied
the existing idiom in Makefile.am, not noticing that this adds more
functionality to libatomic than is strictly required. Just adding the
missing/required tas_1_2_.lo is better (and hopefully more acceptable to the
maintainers/reviewers).

[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64

2024-01-24 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

--- Comment #6 from Roger Sayle  ---
In the .optimized dump, we have:
  __int128 unsigned __res;
  __int128 unsigned _12;
  ...
  __res_11 = in_2(D) w* 184467440738;
  _12 = __res_11 & 18446744073709551615;
  __res_7 = _12 * 100;

So the first multiplication is a widening multiplication and expanded using
mulx, but the second multiplication is a full width TImode multiplication,
which is why it has the same RTL expansion as "x * 100".  This is looking like
a tree-level issue and (perhaps) not a target-specific problem.

In fact, it looks like this operation is actually a highpart_multiplication as
only the highpart of the result is required (which should still generate mulx,
but has a different representation at the tree-level).
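
For readers unfamiliar with the term, a highpart multiplication can be
written in C roughly as follows.  This is a generic sketch using the
unsigned __int128 extension, not the code from the bug report, and umulh64
is a made-up name.

```c
#include <stdint.h>

/* Widening 64x64->128 multiply that keeps only the upper 64 bits of
   the product.  */
static uint64_t umulh64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

Because the low 64 bits of the product are discarded, only a single widening
multiply is needed, with no full-width 128-bit multiplication.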

[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64

2024-01-24 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #2 from Roger Sayle  ---
The costs look sane, and I'd expect the synth_mult generated sequence to be
faster, though it would be good to get some microbenchmarking.
A reduced test case is:
__int128 foo(__int128 x) { return x*100; }
The x86 backend thinks that a 128-bit (TImode) multiplication would take 14
cycles, so instead generates:
x2 = x+x        2 cycles
x3 = x2+x       2 cycles
x24 = x3<<3     2 cycles
x25 = x24+x     2 cycles
x100 = x25<<2   2 cycles
which is a total of 10 cycles, and predicted to be faster than the generic
implementation (requiring 2 IMULQ, 1 MULQ and 2 ADDQ) for
__int128 bar(__int128 x, int y) { return x*y; }
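
Written out in C, a shift-and-add sequence of this kind for x*100 looks like
the sketch below: multiply by 3 with two additions, shift left by 3 to get
24x, add x to get 25x, then shift left by 2.  It uses unsigned __int128 (a
compiler extension) so that the shifts are well defined; mul100_shift_add is
a hypothetical name.

```c
typedef unsigned __int128 u128;   /* compiler extension, 64-bit targets */

/* x * 100 using only additions and shifts: 100 = ((3 << 3) + 1) << 2.  */
static u128 mul100_shift_add(u128 x)
{
    u128 x2   = x + x;        /*   2x */
    u128 x3   = x2 + x;       /*   3x */
    u128 x24  = x3 << 3;      /*  24x */
    u128 x25  = x24 + x;      /*  25x */
    u128 x100 = x25 << 2;     /* 100x */
    return x100;
}
```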

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #10 from Roger Sayle  ---
Hi Oleg.  Great question.  The "speed" parameter passed to rtx_costs and
address_cost indicates whether the middle-end is optimizing for performance, and
interested in the number of cycles taken by each instruction, or optimizing
for size, and interested in the number of bytes used to encode the instruction.
Previously, this speed parameter was ignored by the SH backend, so the costs
were the same independent of the objective function.

In my proposed patch, the address cost (1) when optimizing for size attempts to
return the additional size of an instruction based on the addressing mode.  For
register and reg+reg addressing modes there is no size increase (overhead),
and for addressing modes with displacements, and displacements to address
pointers, there is a cost.  (2) When optimizing for speed, address cost remains
between 0 and 3, and is used to prioritize between (equivalent numbers of)
instructions.  Normally, rtx_costs are defined in terms of COSTS_N_INSNS, which
multiplies by 4.  Hence on many platforms a single instruction that references
memory may be encoded as COSTS_N_INSNS(1)+1 (or a more complex addressing mode
as COSTS_N_INSNS(1)+2) to show that this is disfavored to a single instruction
that doesn't reference memory, COSTS_N_INSNS(1)+0.

This is the fix for this particular regression; SIGN_EXTEND of a register now
costs COSTS_N_INSNS(1), and SIGN_EXTEND of a MEM now costs COSTS_N_INSNS(1)+1.
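
As a rough illustration of the scale involved (not the SH backend's actual code), the costs described above work out as follows.  The macro mirrors the multiply-by-4 definition mentioned above; the function sketch_sign_extend_cost is hypothetical.

```c
/* One instruction is worth 4 cost units, leaving values 1..3
   free to break ties between equal instruction counts. */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: cost of a SIGN_EXTEND whose operand is
   either a register (0) or a memory reference (1). */
static int sketch_sign_extend_cost(int operand_is_mem)
{
    return COSTS_N_INSNS(1) + (operand_is_mem ? 1 : 0);
}
```

With these numbers the memory form is slightly disfavored relative to the register form, yet it is still cheaper than any two-instruction sequence.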

A useful way to debug rtx_costs is to use the -dP command line option, and then
look at the [c=X, l=Y] annotations in the assembly language file.  One way to
check/confirm that these are sensible is that ideally they should be correlated
when optimizing for size (with -Os or -Oz).

I've found an interesting table of SH cycle counts (for different CPUs) at
http://www.shared-ptr.com/sh_insns.html and these could be used to improve
sh_rtx_costs further.  For example, SH currently reports multiplications as
a single cycle operation, which doesn't match the hardware specs, and prevents
GCC from using synth_mult to produce faster (or shorter) sequences using shifts
and additions.  Likewise, sh_rtx_costs doesn't distinguish the machine mode,
so the costs of SImode multiplications are the same as DImode multiplications.

In comment #5 you mention GCC's defaults; it turns out that for rtx_costs the
default values that would be provided by the middle-end, may be more accurate
than the values (currently) specified by the backend.

I hope this answers your question.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

--- Comment #8 from Roger Sayle  ---
Created attachment 57190
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57190&action=edit
proposed patch

Proposed patch to provide a sane/saner set of rtx_costs for SH.  There's plenty
more that could be done, but these changes are (more than) sufficient to
resolve the code quality regression caused by improved fwprop.  If someone
could try this out on SH, and report back the results, that would be great.

[Bug rtl-optimization/113542] New: gcc.target/arm/bics_3.c regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113542

Bug ID: 113542
   Summary: gcc.target/arm/bics_3.c regression after change for
pr111267
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: roger at nextmovesoftware dot com
  Target Milestone: ---

This PR is a placeholder for tracking the reported failures of
FAIL: gcc.target/arm/bics_3.c scan-assembler-times bics\tr[0-9]+, r[0-9]+,
r[0-9]+ 2
FAIL: gcc.target/arm/bics_3.c scan-assembler-times bics\tr[0-9]+, r[0-9]+,
r[0-9]+, .sl #2 1
See https://linaro.atlassian.net/browse/GNU-1117

Alas, I've been unable to reproduce the failure on cross compilers to either
arm-linux-gnueabihf or armv8l-unknown-linux-gnueabihf, so I suspect that
there's some configuration option or compile-time flag I'm missing that's
required to trigger these failures (which I'm hoping is "missed optimization"
rather than "wrong code").

Hopefully, filing this PR provides a mechanism to allow folks to help me
investigate this issue.  My apologies for the temporary inconvenience.
Setting the component to "rtl-optimization" until this is confirmed to be a
target (ARM backend) issue.

[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267

2024-01-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533

Roger Sayle  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-22
 Status|UNCONFIRMED |NEW
 CC||roger at nextmovesoftware dot 
com
 Ever confirmed|0   |1

--- Comment #6 from Roger Sayle  ---
To help diagnose the problem, I came up with this simple patch:
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 7872609b336..dc563ac2ca1 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -492,6 +492,9 @@ try_fwprop_subst_pattern (obstack_watermark ,
insn_change _change,
   " (cost %d -> cost %d)\n", old_cost, new_cost);
ok = false;
  }
+   else if (dump_file)
+ fprintf (dump_file, "change is profitable"
+  " (cost %d -> cost %d)\n", old_cost, new_cost);
   }

   if (!ok)

which then helps reveal that on sh3-linux-gnu with -O1 we see:
propagating insn 6 into insn 12, replacing:
(set (reg:SI 174 [ _1 ])
(sign_extend:SI (reg:QI 169 [ *a_7(D) ])))
successfully matched this instruction to *extendqisi2_compact_snd:
(set (reg:SI 174 [ _1 ])
(sign_extend:SI (mem:QI (reg/v/f:SI 168 [ aD.1817 ]) [0 *a_7(D)+0 S1 A8])))
change is profitable (cost 4 -> cost 1)

which confirms Andrew's and Oleg's analyses above; the sh_rtx_costs function is
a little odd... Reading from memory is four times faster than using a pseudo!?
I'm investigating a "costs" patch for the SH backend.  My apologies for the
temporary inconvenience, and thanks to Jeff for catching/spotting this.

[Bug target/91681] Missed optimization for 128 bit arithmetic operations

2024-01-21 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91681

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |14.0

--- Comment #7 from Roger Sayle  ---
This is now fixed on mainline.  GCC is now optimal, and generates one less
instruction than clang.

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2024-01-15 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

--- Comment #10 from Roger Sayle  ---
A revised and improved patch has been posted for review at
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643062.html

[Bug other/113336] libatomic (testsuite) regressions on armv6-linux-gnueabihf

2024-01-14 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336

Roger Sayle  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-14
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com

--- Comment #1 from Roger Sayle  ---
As there's a patch for this regression (awaiting review), I should upgrade the
PR status to ASSIGNED.

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2024-01-14 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com

--- Comment #8 from Roger Sayle  ---
Now we're in stage4, I'll take this.  I'm bootstrapping and regression testing
a variant of my tweak to try_fwprop_subst_pattern.  The change in comment #7
can loop indefinitely if the transformation results in the same cost as
the original, so the logic on when to forward-propagate needed to be tweaked a
little.

[Bug target/106060] Inefficient constant broadcast on x86_64

2024-01-14 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106060

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com
 Status|NEW |ASSIGNED

--- Comment #4 from Roger Sayle  ---
I have a patch for better materialization of vector constants (including
cmpeq+abs and cmpeq+abs), but now that we've transitioned from stage 3 (bug
fixing) to stage 4 (regression fixing), this will have to wait for GCC 15's
stage 1.  I'm happy to post the patch here or to gcc-patches, if anyone would
like to pre-review it and/or benchmark the proposed changes.

[Bug target/112992] Inefficient vector initialization using vec_duplicate/broadcast

2024-01-14 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112992

Roger Sayle  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED
   Target Milestone|--- |14.0

--- Comment #10 from Roger Sayle  ---
This has now been fixed on mainline (we generate identical code for all four
functions in comment #1).

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2024-01-13 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

--- Comment #7 from Roger Sayle  ---
Very many thanks to Jeff Law for pointing me to fwprop.  The following simple
patch also fixes this regression.

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0c588f8..cbba44e 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -449,15 +449,6 @@ try_fwprop_subst_pattern (obstack_watermark ,
insn_
change _change,
   if (prop.num_replacements == 0)
 return false;

-  if (!prop.profitable_p ())
-{
-  if (dump_file && (dump_flags & TDF_DETAILS))
-   fprintf (dump_file, "cannot propagate from insn %d into"
-" insn %d: %s\n", def_insn->uid (), use_insn->uid (),
-"would increase complexity of pattern");
-  return false;
-}
-
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
   fprintf (dump_file, "\npropagating insn %d into insn %d, replacing:\n",

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2024-01-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

--- Comment #6 from Roger Sayle  ---
Sorry for the delay in replying/answering Jakub's questions/comments.  Yes,
using a define_insn_and_split in the backend fixes/works around the issue (and
I agree your implementation/refinement in comment #5 is better than mine in
comment #2), but I've a feeling that this approach isn't the ideal solution. 
Nothing about this split is specific to these x86 instructions or even to the
i386 backend.

A more generic fix might be to teach combine.cc that it can split parallels of
two independent sets, with no interdependencies, into two insns if the total
cost of the two instructions is less than the original two, i.e. a 2 insn ->
2 insn combination.

But then even this doesn't feel like the perfect approach... the reason combine
doesn't already support 2->2 combinations is that they're not normally
required, these types of problems are usually handled by GCSE or CSE or PRE (or
?).

The pattern is insn1 defines REG1 to a complicated expression, that is live in
several locations, so this instruction can't be eliminated.  However, if the
definition of REG1 is provided to insn2 that sets REG2, this second instruction
can be significantly simplified.  This feels like a classic (non-)constant
propagation problem.  I'm thinking perhaps want_to_gcse_p (or somewhere
similar) could be tweaked.

For people just joining the discussion (hopefully Jeff or a Richard):

(set (REG:DI 1) (concat:DI (REG:SI 2) (REG:SI 3)))
...
(set (REG:SI 4) (low_part (REG:DI 1)))

can be simplified so that the second assignment becomes just:
(set (REG:SI 4) (REG:SI 2))
and similarly for high_part vs. low_part.  These don't even
need to be in the same basic block.

In actuality, "concat" is a large ugly expression, and high_part/low_part are
actually SUBREGs (or could be TRUNCATE or SHIFT+TRUNCATE), but the theory
should remain the same.
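
The identity being relied upon can be checked with ordinary integers.  This is a sketch with made-up helper names (concat_di, low_part, high_part), using 32-bit halves of a 64-bit value to stand in for the SImode/DImode registers above.

```c
#include <stdint.h>

/* Build a 64-bit value from two 32-bit halves (first operand is the low half). */
static uint64_t concat_di(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Extract the halves again. */
static uint32_t low_part(uint64_t v)  { return (uint32_t)v; }
static uint32_t high_part(uint64_t v) { return (uint32_t)(v >> 32); }
```

Whatever the halves are, low_part(concat_di(a, b)) is just a and high_part(concat_di(a, b)) is just b, so the use of the wide value can be replaced by a copy of the original narrow register.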

I'm trying to figure out which pass (or cselib?) is normally responsible for
handling this type of pseudo-reg propagation.

But the define_insn_and_split certainly papers over the deficiency in the
middle-end's RTL optimizers and fixes this (very) specific case/regression.

[Bug other/113336] New: libatomic (testsuite) regressions on armv6-linux-gnueabihf

2024-01-11 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336

Bug ID: 113336
   Summary: libatomic (testsuite) regressions on
armv6-linux-gnueabihf
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: roger at nextmovesoftware dot com
  Target Milestone: ---

As suggested by Richard Earnshaw, this opens a bugzilla PR for tracking this
issue.  All the tests in libatomic currently fail on a raspberry pi running
raspbian, but passed back in December 2020.
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642168.html

The regression (which isn't really a regression) was caused by:
2023-09-26  Hans-Peter Nilsson  

PR target/107567
PR target/109166
* builtins.cc (expand_builtin) :
Handle failure from expand_builtin_atomic_test_and_set.
* optabs.cc (expand_atomic_test_and_set): When all attempts fail to
generate atomic code through target support, return NULL
instead of emitting non-atomic code.  Also, for code handling
targetm.atomic_test_and_set_trueval != 1, gcc_assert result
from calling emit_store_flag_force instead of returning NULL.


Prior to this, when -fno-sync-libcalls was specified on the command line, the
__atomic_test_and_set built-in simply expanded to a non-atomic code sequence,
which then passed libatomic's configure tests for HAVE_ATOMIC_TAS.  Now that
this hole/bug/correctness issue has been fixed, and HAVE_ATOMIC_TAS is now
detected as false, libatomic's tas_n.c can no longer implement tas_8_2_.o
without (a missing helper function) tas_1_2_.o.

Hence libatomic has (always?) been broken on armv6, but synchronization
primitives can now be supported with the above change. We've just not noticed
that necessary pieces of the runtime were missing, until the above correctness
fix resulted in a link error.
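
For reference, this is the kind of use the __atomic_test_and_set built-in is meant to support.  A minimal sketch, assuming a GCC-compatible compiler; the helper names try_acquire and release are hypothetical and not part of libatomic.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if the flag was previously clear,
   i.e. the caller now "owns" it.  The built-in returns the old value. */
static bool try_acquire(bool *flag)
{
    return !__atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
}

/* Hypothetical helper: clear the flag so another caller can acquire it. */
static void release(bool *flag)
{
    __atomic_clear(flag, __ATOMIC_RELEASE);
}
```

If the test-and-set silently expands to a non-atomic load and store, two threads can both see the flag as clear, which is why returning NULL from the expander (and failing the configure test) is the correct behaviour.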

[Bug target/113231] x86_64 uses SSE instructions for `*mem <<= const` at -Os

2024-01-09 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113231

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|--- |14.0
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Roger Sayle  ---
This should now be fixed on mainline.

[Bug target/113231] x86_64 uses SSE instructions for `*mem <<= const` at -Os

2024-01-04 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113231

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com

--- Comment #4 from Roger Sayle  ---
I'm testing a patch, for more accurate conversion gains/costs in the
scalar-to-vector pass.  Adding -mno-stv will work around the problem.

[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value

2023-12-24 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914

--- Comment #19 from Roger Sayle  ---
Created attachment 56930
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56930&action=edit
proposed patch

And now for a patch that does (or should) work.  This even contains an
optimization: the middle-end knows we don't need to sign or zero extend if an
insv doesn't modify the sign-bit.  Testing on MIPS would be much appreciated.
TIA.
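
The sign-bit observation can be illustrated in plain C.  This is only a sketch of the idea, not the code in the attached patch; the names insert_field and field_covers_sign_bit are hypothetical, and the helpers assume size >= 1 and pos + size <= 32.

```c
#include <stdint.h>

/* Insert the low `size` bits of `val` into `word` starting at bit `pos`. */
static uint32_t insert_field(uint32_t word, unsigned pos, unsigned size,
                             uint32_t val)
{
    uint32_t ones = (size >= 32) ? 0xffffffffu : ((1u << size) - 1u);
    uint32_t mask = ones << pos;
    return (word & ~mask) | ((val << pos) & mask);
}

/* Under the assumption pos + size <= 32, bit 31 is inside the field
   exactly when the field ends at the top of the word. */
static int field_covers_sign_bit(unsigned pos, unsigned size)
{
    return pos + size == 32;
}
```

When field_covers_sign_bit is false, bit 31 of the result equals bit 31 of the input, so a value that was already correctly extended stays correctly extended after the insertion.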

[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value

2023-12-24 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914

--- Comment #18 from Roger Sayle  ---
Please ignore comment #17, the above patch is completely bogus/broken.

[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value

2023-12-24 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914

--- Comment #17 from Roger Sayle  ---
I think this patch might resolve the problem (or move it somewhere else):

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 9fef2bf6585..218bca905f5 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -6274,10 +6274,7 @@ expand_assignment (tree to, tree from, bool nontemporal)
result = store_expr (from, to_rtx, 0, nontemporal, false);
  else
{
- rtx to_rtx1
-   = lowpart_subreg (subreg_unpromoted_mode (to_rtx),
- SUBREG_REG (to_rtx),
- subreg_promoted_mode (to_rtx));
+ rtx to_rtx1 = gen_reg_rtx (subreg_unpromoted_mode (to_rtx));
  result = store_field (to_rtx1, bitsize, bitpos,
bitregion_start, bitregion_end,
mode1, from, get_alias_set (to),

The motivation/solution comes from a comment in expmed.cc:
/* If the destination is a paradoxical subreg such that we need a
   truncate to the inner mode, perform the insertion on a temporary and
   truncate the result to the original destination.  Note that we can't
   just truncate the paradoxical subreg as (truncate:N (subreg:W (reg:N
   X) 0)) is (reg:N X).  */

The same caveat applies to extensions on MIPS, so we should use a new
pseudo temporary register rather than update the SUBREG in place.

If someone could confirm this fixes the issue on MIPS, I'll try to come up
with a milder form of this fix that checks TARGET_MODE_REP_EXTENDED that'll
limit the churn/impact on other targets.

[Bug rtl-optimization/112380] [14 regression] ICE when building Mesa (in combine, internal compiler error: in simplify_subreg) since r14-2526-g8911879415d6c2

2023-12-16 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112380

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #14 from Roger Sayle  ---
This should now be fixed on mainline, but similar issues may still be latent in
combine (see the longer alternative in comment #12).

[Bug target/112992] New: Inefficient vector initialization using vec_duplicate/broadcast

2023-12-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112992

Bug ID: 112992
   Summary: Inefficient vector initialization using
vec_duplicate/broadcast
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: roger at nextmovesoftware dot com
  Target Milestone: ---

The following four functions should in theory all produce the same code:

typedef unsigned long long v4di __attribute((vector_size(32)));
typedef unsigned int v8si __attribute((vector_size(32)));
typedef unsigned short v16hi __attribute((vector_size(32)));
typedef unsigned char v32qi __attribute((vector_size(32)));

#define MASK  0x01010101
#define MASKL 0x0101010101010101ULL
#define MASKS 0x0101

v4di fooq() {
  return (v4di){MASKL,MASKL,MASKL,MASKL};
}

v8si food() {
  return (v8si){MASK,MASK,MASK,MASK,MASK,MASK,MASK,MASK};
}

v16hi foow() {
  return (v16hi){MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,
 MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS};
}

v32qi foob() {
  return (v32qi){1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
}
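
As a sanity check that the four initializers really do describe the same 32 bytes, here is a small stand-alone sketch using plain arrays rather than vector types.  The function name same_bit_pattern is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Fill four 32-byte arrays with the per-element constants used by
   fooq/food/foow/foob and compare their byte images. */
static int same_bit_pattern(void)
{
    uint64_t q[4];
    uint32_t d[8];
    uint16_t w[16];
    uint8_t  b[32];
    int i;

    for (i = 0; i < 4; i++)  q[i] = 0x0101010101010101ULL;
    for (i = 0; i < 8; i++)  d[i] = 0x01010101u;
    for (i = 0; i < 16; i++) w[i] = 0x0101u;
    for (i = 0; i < 32; i++) b[i] = 1u;

    return memcmp(q, d, 32) == 0
        && memcmp(q, w, 32) == 0
        && memcmp(q, b, 32) == 0;
}
```

Because every byte is 0x01, the comparison holds regardless of endianness, so a single broadcast of the narrowest convenient element would serve all four functions.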

On x86_64 with -mavx, we currently produce very different implementations:

fooq:
        movabs  rax, 72340172838076673
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        mov     QWORD PTR [rsp-8], rax
        vbroadcastsd    ymm0, QWORD PTR [rsp-8]
        leave
        ret
food:
        vbroadcastss    ymm0, DWORD PTR .LC2[rip]
        ret
foow:
        vmovdqa ymm0, YMMWORD PTR .LC3[rip]
        ret
foob:
        vmovdqa ymm0, YMMWORD PTR .LC4[rip]
        ret

clang currently produces the vbroadcastss for all four.
I discovered that some of my "day job" code used the "fooq" idiom, requiring a
stack frame, and both reads and writes to memory [of a compile-time constant].

I suspect the fix is to add a define_insn_and_split or two to i386/sse.md, and
perhaps something can be done in expand, but I'm confused why LRA/reload spills
the DImode component of V4DI to the stack frame, but places the SImode
component of V8SI in the constant pool.

This is related (distantly) to PRs 100865 and 106060, but is potentially target
independent, and seems to be going wrong in LRA/reload's REG_EQUIV elimination.
Thoughts?  Apologies if this is a dup.  I'm happy to work up a patch if someone
could advise on where best this should be fixed.  Perhaps RTL's vec_duplicate
could be canonicalized to the most appropriate vector mode?

[Bug rtl-optimization/112380] [14 regression] ICE when building Mesa (in combine, internal compiler error: in simplify_subreg) since r14-2526-g8911879415d6c2

2023-11-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112380

--- Comment #12 from Roger Sayle  ---
Patch proposed (actually two alternatives proposed) at
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636203.html

[Bug target/110551] [11/12/13 Regression] an extra mov when doing 128bit multiply

2023-11-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551

Roger Sayle  changed:

   What|Removed |Added

   Target Milestone|11.5|14.0
 Resolution|--- |FIXED
Summary|[11/12/13/14 Regression] an |[11/12/13 Regression] an
   |extra mov when doing 128bit |extra mov when doing 128bit
   |multiply|multiply
 Status|NEW |RESOLVED

--- Comment #10 from Roger Sayle  ---
This should now be fixed on mainline.

[Bug rtl-optimization/91865] Combine misses opportunity to remove (sign_extend (zero_extend)) before searching for insn patterns

2023-11-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91865

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com
 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Roger Sayle  ---
This should now be fixed on mainline.

[Bug rtl-optimization/112380] [14 regression] ICE when building Mesa (in combine, internal compiler error: in simplify_subreg) since r14-2526-g8911879415d6c2

2023-11-05 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112380

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at 
nextmovesoftware dot com
 Status|NEW |ASSIGNED

--- Comment #10 from Roger Sayle  ---
combine.cc's expand_field_assignment needs to defend against gen_lowpart (which
is gen_lowpart_for_combine) returning a CLOBBER.  Otherwise, we end up calling
simplify_set on:

(set (reg:DI 134)
(and:DI (subreg:DI (ior:SI (ior:SI (and:SI (subreg:SI (reg/v:TI 114 [
sampler ]) 0)
(const_int -129280 [0xfffe0700]))
(and:SI (clobber:TI (const_int 0 [0]))
(const_int -129025 [0xfffe07ff])))
(and:SI (reg:SI 130)
(const_int 129024 [0x1f800]))) 0)
(const_int 4294967295 [0xffffffff])))

where if you look closely the "(clobber:TI (const_int 0))" causes no end of fun
in simplify_rtx; it's not surprising that an assert is eventually triggered in
simplify_subreg.

I'm testing a patch.

[Bug c++/50755] [avr] ICE: tree check: expected class 'constant', have 'unary' (convert_expr)

2023-11-03 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50755

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #4 from Roger Sayle  ---
This appears to be fixed on mainline.  G-J can you confirm this is resolved?

[Bug target/112298] Poor code for DImode operations on H8 port

2023-10-30 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112298

Roger Sayle  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 CC||roger at nextmovesoftware dot 
com
 Ever confirmed|0   |1
   Last reconfirmed||2023-10-30

[Bug target/112103] [14 regression] gcc.target/powerpc/rlwinm-0.c fails after r14-4941-gd1bb9569d70304

2023-10-26 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112103

Roger Sayle  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||roger at nextmovesoftware dot 
com
   Last reconfirmed||2023-10-26

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2023-10-20 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

--- Comment #3 from Roger Sayle  ---
This patch addresses the regression, but probably isn't the correct fix.

The issue is that the backend now has a way of representing the concatenation
of two registers (for example, TI is constructed from two DI mode registers):

(set (reg:TI 111 [ bD.2764 ])
(ior:TI (ashift:TI (zero_extend:TI (reg:DI 142))
(const_int 64 [0x40]))
(zero_extend:TI (reg:DI 141))))

But combine is unable to cleanly extract the (original) DI mode components back
out of this using SUBREGs.  Currently combine gets confused and attempts to
match things like:

Trying 10 -> 74:
   10: r111:TI=zero_extend(r142:DI)<<0x40|zero_extend(r141:DI)
  REG_DEAD r141:DI
  REG_DEAD r142:DI
   74: r137:DI=r111:TI#0
Failed to match this instruction:
(parallel [
(set (reg:DI 137 [ bD.2764 ])
(reg:DI 141))
(set (reg:TI 111 [ bD.2764 ])
(ior:TI (ashift:TI (zero_extend:TI (reg:DI 142))
(const_int 64 [0x40]))
(zero_extend:TI (reg:DI 141))))
])

which contains the simplification we want, "reg:DI 137 := reg:DI 141", but
along with stuff that combine should really take care of (strip/duplicate). 
I'll work on a more acceptable middle-end fix, but this patch demonstrates
progress, and can be used if a more general solution can't be found.

[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes

2023-10-20 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

--- Comment #2 from Roger Sayle  ---
Created attachment 56162
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56162&action=edit
proof-of-concept patch

[Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply

2023-10-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #4 from Roger Sayle  ---
Patch proposed:
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/67.html

[Bug bootstrap/111812] [14 regression] Can't build with gcc 4.8.5

2023-10-16 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111812

Roger Sayle  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2023-10-16
  Known to work||13.0
  Known to fail||14.0
   Host|powerpc64-linux-gnu |*-linux-gnu
 Status|UNCONFIRMED |NEW
 Target|powerpc64-linux-gnu |
 CC||roger at nextmovesoftware dot 
com
  Build|powerpc64-linux-gnu |

[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu

2023-10-11 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Roger Sayle  ---
This should now be fixed on mainline.

[Bug middle-end/17886] variable rotate and unsigned long long rotate should be better optimized

2023-10-10 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17886

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #27 from Roger Sayle  ---
I believe that this issue has been fixed (for a long time).  For Andi's
testcases in comment #3, -fdump-tree-optimized reveals all these cases are
perceived as rotations by the early middle-end. 

long long unsigned int f (long long unsigned int x, int y)
{
  return x_1(D) r<< y_2(D);
}

unsigned int f2 (unsigned int x, int y)
{
  return x_1(D) r<< y_2(D);
}

long long unsigned int f3 (long long unsigned int x)
{
  return x_1(D) r>> 55;
}

long unsigned int f4 (unsigned int x)
{
  return x_1(D) r>> 22;
}
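
For comparison, source-level idioms of the following shape are the sort of thing the middle-end turns into the r<< and r>> operations shown in the dump.  These are not Andi's testcases from comment #3, just a sketch with hypothetical names (rotl64, rotr32); the masking keeps the shift counts in range so the C code has no undefined behaviour.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by a variable amount. */
static uint64_t rotl64(uint64_t x, unsigned y)
{
    return (x << (y & 63)) | (x >> (-y & 63));
}

/* Rotate a 32-bit value right by a variable amount. */
static uint32_t rotr32(uint32_t x, unsigned y)
{
    return (x >> (y & 31)) | (x << (-y & 31));
}
```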

[Bug tree-optimization/111519] [13/14 Regression] Wrong code at -O3 on x86_64-linux-gnu since r13-455-g1fe04c497d

2023-10-09 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111519

--- Comment #2 from Roger Sayle  ---
Complicated.  Things have gone wrong before the strlen pass which is given:

  _73 = e;
  _72 = *_73;
...
  *_73 = prephitmp_23;
  d = _72;

Here the assignment to *_73 overwrites the value of f (at *e) which then
invalidates the use of _72 resulting in the wrong value for d.  But figuring
out which pass is at fault (perhaps complete loop unrolling?) is tricky.

[Bug target/71749] Define _REENTRANT on ARC when -pthread is passed

2023-09-28 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71749

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com
 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Roger Sayle  ---
This has been fixed on mainline since March 2017 thanks to:
2017-03-28  Claudiu Zissulescu  
Thomas Petazzoni 

[Bug target/91251] Revision 272645 on top of 9.1.0 caused ICE: in extract_insn, at recog.c:2310

2023-09-23 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91251

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #1 from Roger Sayle  ---
I'm unable to reproduce this with mainline gcc, the reduced testcase appears to
compile fine (without an ICE).  Can you confirm that this issue is also fixed
for you?

[Bug target/91591] Arc: ICE in trunc_int_for_mode, at explow.c:60

2023-09-23 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91591

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|--- |8.4
 CC||roger at nextmovesoftware dot 
com
 Status|UNCONFIRMED |RESOLVED

--- Comment #6 from Roger Sayle  ---
As reported by Giulio, this bug has now been fixed.

[Bug target/83409] arc: "internal compiler error: in extract_constrain_insn" with -O3

2023-09-23 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83409

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #3 from Roger Sayle  ---
I'm unable to reproduce this with mainline gcc configured with
--target=arc-elf.  scatterlist.i (and the reduced test case in comment #1),
compile fine with -O2, -O3 and -O3 -fno-strict-aliasing.  I've also tried
-mcpu=em.  Can you confirm that this has been fixed for you (so we can close
the PR)?

[Bug target/43892] PowerPC suboptimal "add with carry" optimization

2023-08-29 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43892

--- Comment #39 from Roger Sayle  ---
My apologies for dropping the ball on this patch (series)... My only access to
PowerPC hardware is/was via the GCC compile farm, which complicates things.

Shortly after David's approval, Segher enquired whether the patch could be
modified to also handle -mcpu=power10 (which represents carry differently):
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586868.html

Trying to (also) address this then opened up a rabbit hole/can of worms
related to how middle-end (and rs6000.md) represents overflow, which included a
combine patch:
https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586572.html

Soon after, GCC entered stage 4 (or stage 3), and the above patches (and an
unsubmitted one for power10) simply got lost in the backlog.  I believe this
patch is sound, but unfortunately I don't have the bandwidth/patience to
(re)check it against mainline on (multiple variants of) rs6000.

If one of the IBM folks could take it from here, that'd be much appreciated.

[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value

2023-08-03 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com

--- Comment #15 from Roger Sayle  ---
Is MIPS64 actually a TRULY_NOOP_TRUNCATION_TARGET?  If SImode is implicitly
assumed to be (sign?) extended, then an arbitrary DImode value/register can't
be used as an SImode value without appropriately setting/clearing the upper
bits, i.e. this integer truncation isn't a no-op.
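
As a rough illustration of the point above, here is a minimal C model of a
64-bit register that is supposed to hold a 32-bit value in sign-extended
form.  The sign-extension convention is the assumption being questioned in
this comment, and the helper names are hypothetical, not taken from the MIPS
backend.

```c
#include <stdint.h>

/* Sign-extend the low 32 bits of a 64-bit register image.  Under the
   assumed convention this is what a DImode -> SImode truncation has to
   do, so it is real work rather than a no-op. */
static uint64_t truncate_di_to_si(uint64_t reg)
{
    uint64_t low = reg & 0xffffffffull;
    /* Flip the sign bit and subtract it again: well-defined unsigned
       arithmetic that yields the sign extension of bit 31. */
    return (low ^ 0x80000000ull) - 0x80000000ull;
}

/* A register image is a valid SImode value (under the assumed
   convention) only if it already equals its own sign extension. */
static int is_canonical_si(uint64_t reg)
{
    return reg == truncate_di_to_si(reg);
}
```

An arbitrary DImode value such as 0x123456789abcdef0 fails is_canonical_si,
so treating it directly as an SImode value breaks the invariant.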

I suspect that the underlying problem is that the backend is relying on
implicit invariants, not explicitly represented in the RTL, and then surprised
when valid RTL transformations don't preserve those invariants/assumptions.

I wonder why the zero_extract followed by sign_extend example mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626137.html isn't already
being considered as a try_combine candidate, allowing the backend to simply
recognize or split it.  I'll investigate.

[Bug target/106222] x86 Better code sequence for __builtin_shuffle

2023-07-30 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106222

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
 CC||roger at nextmovesoftware dot com
   Target Milestone|--- |13.0

--- Comment #3 from Roger Sayle  ---
This was fixed by Hongtao's patch for GCC 13.

[Bug target/110843] ICE in convert_insn, at config/i386/i386-features.cc:1438 since r14-2405-g4814b63c3c2326

2023-07-28 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110843

Roger Sayle  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||roger at nextmovesoftware dot com
   Assignee|unassigned at gcc dot gnu.org  |roger at nextmovesoftware dot com
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-07-28

--- Comment #1 from Roger Sayle  ---
My STV patch should check TARGET_AVX512VL (which is required for V2DI in
VI48_AVX512VL) and not TARGET_AVX512F which is the condition in the define_insn
for vp.  Bootstrapping and regression testing a fix.

[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1

2023-07-27 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Roger Sayle  changed:

   What|Removed |Added

   Assignee|roger at nextmovesoftware dot com  |unassigned at gcc dot gnu.org

--- Comment #16 from Roger Sayle  ---
My patch (in comment #15) is obsoleted by Richard Biener's much better
solution(s):
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625416.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625417.html

[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu

2023-07-27 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701

--- Comment #7 from Roger Sayle  ---
Patch proposed here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625532.html

[Bug target/110792] [13/14 Regression] GCC 13 x86_32 miscompilation of Whirlpool hash function

2023-07-25 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110792

Roger Sayle  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |roger at nextmovesoftware dot com

--- Comment #11 from Roger Sayle  ---
Mine.  Alas the obvious fix of adding an early clobber to the rotate doubleword
from memory alternative generates some truly terrible code (spills via memory
to SSE registers!?), but I've come up with a better solution, forcing reload to
read the operand from memory before the rotate, which appears to fix the
problem without the adverse performance impact.
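
To illustrate the kind of hazard involved, here is a simplified C model of a
doubleword rotate built from 32-bit halves.  It is only a sketch, not the
actual i386.md pattern, and the function names and two-halves interface are
hypothetical.  Each result half needs both original halves, so neither input
may be overwritten before it has been fully read.

```c
#include <stdint.h>

/* Reference: rotate a 64-bit value left by n, for 0 < n < 64. */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}

/* Hypothetical model of a doubleword rotate done on 32-bit halves,
   valid for 0 < n < 32.  Both new halves are computed from the
   original lo and hi before either one is replaced. */
static uint64_t rotl_dword(uint32_t lo, uint32_t hi, unsigned n)
{
    uint32_t new_hi = (hi << n) | (lo >> (32 - n));
    uint32_t new_lo = (lo << n) | (hi >> (32 - n));
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

If new_lo were stored back into lo before new_hi was computed, the high half
would be built from the wrong data, which is the sort of overlap an early
clobber (or reading the operand up front) is meant to rule out.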

[Bug target/110790] [14 Regression] gcc -m32 generates invalid bit test code on gmp-6.2.1

2023-07-25 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110790

--- Comment #5 from Roger Sayle  ---
I'll add this testcase to the testsuite, when I apply a corrected version of my
QImode offset patch to mainline.  On the bright side, we'll be generating more
efficient code for gmp's refmpn_tstbit by using the x86's bt instruction (it
just needs to use setc not setnc in this case).
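
For reference, a bit-test routine of this general shape can be sketched in C
as follows.  This is a hypothetical stand-in, not gmp's actual source; limb_t
and the function names are assumptions.  The x86 bt instruction copies the
selected bit into the carry flag, so returning the bit itself corresponds to
setc, while returning its negation corresponds to setnc.

```c
#include <limits.h>
#include <stddef.h>

typedef unsigned long limb_t;                  /* hypothetical limb type */
#define LIMB_BITS (sizeof(limb_t) * CHAR_BIT)

/* Return 1 if bit number `bit` of the limb array is set, else 0.
   With bt, this is the carry flag itself (setc). */
static int tstbit(const limb_t *p, size_t bit)
{
    return (int)((p[bit / LIMB_BITS] >> (bit % LIMB_BITS)) & 1);
}

/* Return 1 if the bit is clear.  With bt, this is the inverted
   carry flag (setnc). */
static int tstbit_is_clear(const limb_t *p, size_t bit)
{
    return !tstbit(p, bit);
}
```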

[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1

2023-07-25 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #15 from Roger Sayle  ---
Hi Richard,
There's another patch awaiting review at
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625282.html
and I've another follow-up after that currently regression testing...

[Bug target/110787] [14 regression] ICE building SYSTEM.def

2023-07-24 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110787

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com

--- Comment #2 from Roger Sayle  ---
I'm bootstrapping with --enable-languages=all to investigate what's going on. 
I'll revert the patch once I (or anyone) can confirm that this restores
bootstrap, but I'd be happier understanding the actual mechanism (cause and
effect) of this ICE.  Sorry for the temporary inconvenience.

[Bug c++/72825] ICE on invalid C++ code on x86_64-linux-gnu (internal compiler error: tree check: expected array_type, have error_mark in array_ref_low_bound, at tree.c:13013)

2023-07-23 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72825

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED
 CC||roger at nextmovesoftware dot com

--- Comment #4 from Roger Sayle  ---
This issue has been fixed on mainline (for GCC 14), by the patch for PR 110699.

[Bug c/109598] [12/13/14 Regression] ICE: tree check: expected array_type, have error_mark in array_ref_low_bound, at tree.cc

2023-07-23 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109598

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED
 CC||roger at nextmovesoftware dot com

--- Comment #4 from Roger Sayle  ---
This issue has been fixed on mainline (for GCC 14), by the patch for PR 110699.

[Bug c/110699] [12/13/14 Regression] internal compiler error: tree check: expected array_type, have error_mark in array_ref_low_bound, at tree.cc:12754

2023-07-23 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110699

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #3 from Roger Sayle  ---
This issue has been fixed on mainline for GCC 14.

[Bug target/110588] btl (on x86_64) not always generated

2023-07-22 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110588

Roger Sayle  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Roger Sayle  ---
This is now fixed on mainline for GCC 14.

[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu

2023-07-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at nextmovesoftware dot com
 Status|NEW |ASSIGNED

--- Comment #6 from Roger Sayle  ---
I have a fix (to combine.cc's record_dead_and_set_regs_1).  Bootstrapping and
regression testing.

[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu

2023-07-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com

--- Comment #5 from Roger Sayle  ---
nonzero_bits ((reg:DI 92),SImode) is returning 340, so combine (or more
specifically simplify_and_const_int_1) believes that the AND (ZERO_EXTEND)
is unnecessary.  So it's the same nonzero_bits information that allows us to
turn the XOR into IOR (in insn 16) that's incorrectly telling us the AND 340
(or AND 343, or ZERO_EXTEND) is unnecessary (in insn 17).
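
The two simplifications that rely on this information can be sketched in C
as below.  This is an illustrative model only, not combine.cc's code: the
helper names are hypothetical, and the XOR constant used in the usage note
(3) is an assumed value chosen for the example.  A "nonzero bits" mask has a
1 wherever the value might have a 1.

```c
#include <stdint.h>

/* XOR can be rewritten as IOR when no bit can be set in both
   operands, since the two operations then never differ. */
static int xor_is_ior(uint32_t nz_a, uint32_t nz_b)
{
    return (nz_a & nz_b) == 0;
}

/* AND with `mask` is a no-op when every possibly-nonzero bit of the
   operand is already inside the mask. */
static int and_is_redundant(uint32_t nz_x, uint32_t mask)
{
    return (nz_x & ~mask) == 0;
}
```

With nonzero bits of 340, both xor_is_ior(340, 3) and
and_is_redundant(340, 343) hold, so if the 340 is stale or wrong for the
register at that point, both rewrites are unsafe for the same reason.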

[Bug c/89180] [meta-bug] bogus/missing -Wunused warnings

2023-07-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89180
Bug 89180 depends on bug 101090, which changed state.

Bug 101090 Summary: incorrect -Wunused-value warning on remquo with constant values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101090

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

[Bug c/106264] [10/11/12/13 Regression] spurious -Wunused-value on a folded frexp, modf, and remquo calls with unused result since r9-1295-g781ff3d80e88d7d0

2023-07-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106264

--- Comment #9 from Roger Sayle  ---
*** Bug 101090 has been marked as a duplicate of this bug. ***

[Bug c/101090] incorrect -Wunused-value warning on remquo with constant values

2023-07-18 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101090

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 CC||roger at nextmovesoftware dot com
 Status|NEW |RESOLVED

--- Comment #4 from Roger Sayle  ---
Many thanks to Vincent for spotting/confirming that his bug report is a
duplicate of PR 106264, which was fixed in GCC 13.

*** This bug has been marked as a duplicate of bug 106264 ***

[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1

2023-07-17 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at nextmovesoftware dot com

--- Comment #11 from Roger Sayle  ---
My (upcoming) patch for PR88873 dramatically reduces the compile-time (with
-O0) for this test case (by reducing the number of pseudos and reducing the
number of reloads).  But don't let that stop anyone from speeding up
lra_final_code_change.

[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1

2023-07-17 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #9 from Roger Sayle  ---
I'll check whether turning off the insvti_{low,high}part transformations during
lra_in_progress helps compile-time.  I believe every time reload encounters a
TI<->SSE SUBREG, the spill/reload generates two or three additional
instructions.  I'm thinking that perhaps this should ideally be an UNSPEC, that
we can split after reload. As shown in PR 88873, we'd like SSE->TI->SSE to
avoid going via memory [where currently this happens twice]. It looks like
"interval" in pr28071.c suffers from the same x86 ABI issues [i.e. is placed in
scalar TImode, where ideally we'd like V2DI].

[Bug target/110649] [14 Regression] 25% sphinx3 spec2006 regression on Ice Lake and zen between g:acaa441a98bebc52 (2023-07-06 11:36) and g:55900189ab517906 (2023-07-07 00:23)

2023-07-17 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com

--- Comment #12 from Roger Sayle  ---
Hi Jan,
I believe you also need to remove the
   profile_count entry_count = profile_count::zero ();
from tree-ssa-loop-ivcanon.cc's try_peel_loop to avoid a
bootstrap issue with -Werror "variable entry_count set but unused".

[Bug target/110598] [14 Regression] wrong code on llvm-14.0.6 due to memcmp being miscompiled

2023-07-12 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110598

Roger Sayle  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
  Known to work||14.0

--- Comment #7 from Roger Sayle  ---
Many thanks to Sergei for confirming this issue is now resolved.  Sorry again
for the inconvenience.
