[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Roger Sayle changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2024-05-20 --- Comment #2 from Roger Sayle --- I can confirm that I can reproduce this and see the same thing. Adding

vi tmp1 = Set_i32(INT32_MAX);
d_i("tmp1",tmp1.raw);

at multiple places in bug.cc reveals that sometimes the result is the correct [0x7ff x 4], and at other places it is the incorrect [0x8000 x 4], even though the affected snippet doesn't involve binary operation simplification.
[Bug target/106060] Inefficient constant broadcast on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106060 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Known to work||15.0 Status|ASSIGNED|RESOLVED --- Comment #7 from Roger Sayle --- This has now been fixed on mainline for GCC 15. There are still improvements that can be made to vector constant materialization/initialization on x86_64, but the issues/ideas described in this bugzilla PR are all now implemented. Thanks.
[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021 --- Comment #2 from Roger Sayle --- Here's a reduced test case that should be unaffected by the pending changes to how V8QI shifts are expanded. Note that the final "t -= t4" is required to convince the register allocator to "spill".

typedef signed char v16qi __attribute__ ((__vector_size__ (16)));

// sign-extend low 3 bits to a byte.
v16qi foo (v16qi x)
{
  v16qi t7 = (v16qi){7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
  v16qi t4 = (v16qi){4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
  v16qi t = x & t7;
  t ^= t4;
  t -= t4;
  return t;
}

which produces:

foo:
	movl	$67372036, %eax
	vmovdqa	%xmm0, %xmm2
	vpbroadcastd	%eax, %xmm1
	movl	$117901063, %eax
	vpbroadcastd	%eax, %xmm3
	vmovdqa	%xmm1, %xmm0
	vmovdqa	%xmm3, -24(%rsp)
	vmovdqa	-24(%rsp), %xmm4
	vpternlogd	$120, %xmm2, %xmm4, %xmm0
	vpsubb	%xmm1, %xmm0, %xmm0
	ret
[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021 Roger Sayle changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Last reconfirmed||2024-05-10 CC||roger at nextmovesoftware dot com Ever confirmed|0 |1 Status|UNCONFIRMED |NEW --- Comment #1 from Roger Sayle --- I have a patch for x86 ternlog handling that changes the output for this testcase (without the pending change to optimize V8QI shifts) to:

foo:
	movl	$67372036, %eax
	vpsraw	$5, %xmm0, %xmm0
	vpbroadcastd	%eax, %xmm1
	vpternlogd	$108, .LC0(%rip), %xmm1, %xmm0
	vpsubb	%xmm1, %xmm0, %xmm0
	ret
	.align 16
.LC0:
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7
	.byte	7

which at least doesn't construct the vector with a broadcast and then "spill" it to the stack before reading it back from memory. I've no idea if this is optimal, but it's certainly better than the current "spill". I'm curious about what has changed to make this code (register allocation) regress since GCC 13. It was a patch of mine that changed broadcastb to broadcastd, but that shouldn't have affected reload/register preferencing.
[Bug middle-end/78947] sub-optimal code for (bool)(int ? int : int)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78947 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #4 from Roger Sayle --- As Andrew mentioned in comment #2, this has been fixed/resolved since GCC v9. Mainline g++ -O3 currently generates:

condSet(int, int, int):
	test	edi, edi
	cmovne	edx, esi
	test	edx, edx
	setne	al
	ret

[I believe the status change/assignment in comment #3 was due to a typo in the bugzilla PR number].
[Bug middle-end/85559] [meta-bug] Improve conditional move
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85559 Bug 85559 depends on bug 78947, which changed state. Bug 78947 Summary: sub-optimal code for (bool)(int ? int : int) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78947 What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED
[Bug target/113832] [14/15 Regression] 6% exec time regression of 464.h264ref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113832 --- Comment #5 from Roger Sayle --- I'm trying to confirm that there are actually widening multiplications in 464.h264ref (on aarch64), but if anyone's already done an analysis of what might be causing these performance swings, please do post a pointer here.
[Bug tree-optimization/113673] [12/13/14/15 Regression] ICE: verify_flow_info failed: BB 5 cannot throw but has an EH edge with -Os -finstrument-functions -fnon-call-exceptions -ftrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113673 Roger Sayle changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Status|NEW |ASSIGNED --- Comment #6 from Roger Sayle --- Created attachment 58051 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58051&action=edit proposed patch Bootstrapping and regression testing the attached patch.
[Bug target/43644] __uint128_t missed optimizations.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43644 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED CC||roger at nextmovesoftware dot com Status|NEW |RESOLVED Target Milestone|--- |14.0 --- Comment #6 from Roger Sayle --- This is now fixed on mainline (for GCC 14 and GCC 15).
[Bug rtl-optimization/97756] [11/12/13 Regression] Inefficient handling of 128-bit arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756 Roger Sayle changed: What|Removed |Added Known to work||14.0 Summary|[11/12/13/14/15 Regression] |[11/12/13 Regression] |Inefficient handling of |Inefficient handling of |128-bit arguments |128-bit arguments --- Comment #17 from Roger Sayle --- I believe this issue is now fixed on mainline (i.e. for both GCC 14 and GCC 15). Firstly, many thanks to Jakub for correcting the error in my patch. We now generate optimal code sequences for the code in comments #3 and #5, and generate fewer instructions than described in the original description. The final remaining issue is that with -O3 GCC still uses more instructions than clang and icc (see Thomas' comments in comments #12 and #13). The good news is that this is intentional: compiling with -Os (to optimize for size) generates the same number of instructions as clang and icc [in fact, using icc -Os generates larger code!?]. So when optimizing for performance, GCC is taking the opportunity to use more (cheap) instructions to execute faster (or that's the theory).
[Bug middle-end/111701] [11/12/13/14 Regression] wrong code for __builtin_signbit(x*x)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111701 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #2 from Roger Sayle --- A patch to provide a possible solution/workaround has been proposed at https://gcc.gnu.org/pipermail/gcc-patches/2024-April/650054.html With that change, compiling the code in the original description with the -fsignaling-nans command line option avoids the abort.
[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767 --- Comment #5 from Roger Sayle --- Another interesting (simpler) case of -ffast-math pessimization is:

void foo(_Complex double *c)
{
  for (int i=0; i<16; i++)
    c[i] += __builtin_complex(1.0,0.0);
}

Again, without -ffast-math we vectorize consecutive additions, but with -ffast-math we (not so) cleverly avoid every second addition by producing significantly larger code that shuffles the real/imaginary parts around. This even suggests a missed optimization for:

void bar(_Complex double *c, double x)
{
  for (int i=0; i<16; i++)
    c[i] += x;
}

which may be more efficiently implemented (when safe) by:

void bar(_Complex double *c, double x)
{
  for (int i=0; i<16; i++)
    c[i] += __builtin_complex(x,0.0);
}

i.e. insert/interleave a no-op zero addition to simplify the vectorization. The existence of a suitable identity operation (+0, *1.0, &~0, |0, ^0) can be used to avoid shuffling/permuting values/lanes out of vectors, when it's possible for the vector operation to leave the other values unchanged.
[Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #3 from Roger Sayle --- Richard has already changed this from "gfortran" to "tree-optimization", but for the record, the C equivalent of this test case (with the same issue) is: void scale_i(_Complex double *c, int n) { for (int i=0; i
[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544 Roger Sayle changed: What|Removed |Added Last reconfirmed||2024-04-07 Status|UNCONFIRMED |NEW Ever confirmed|0 |1 CC||roger at nextmovesoftware dot com
[Bug middle-end/114552] [13/14 Regression] wrong code at -O1 and above on x86_64-linux-gnu since r13-990
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114552 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #6 from Roger Sayle --- Many thanks Jakub, and my apologies for the breakage/inconvenience. It looks like sizeof(k) is 10 bytes and sizeof(k.b) is 6 bytes, and somehow this code is getting the constructor for "k" and not for just "k.b". This is, of course, fine for memcpy as it can move just the pieces it wants. I completely agree that the safe fix is to check that the sizes match; I don't think I ever considered that they might not be identical when I wrote this code, or I assumed that partial would be non-zero for this case.
[Bug target/114284] [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice) since r14-8319
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #10 from Roger Sayle --- Thanks Jakub. My apologies for the unintentional breakage.
[Bug target/114187] [14 regression] bizarre register dance on x86_64 for pass-by-value struct since r14-2526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- Created attachment 57587 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57587&action=edit proposed patch Proposed fix attached. Currently bootstrapping and regression testing. The problematic code (from March 2023) has an interesting history.
[Bug target/114187] [14 regression] bizarre register dance on x86_64 for pass-by-value struct since r14-2526
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187 Roger Sayle changed: What|Removed |Added Last reconfirmed||2024-03-01 Status|UNCONFIRMED |NEW Ever confirmed|0 |1 --- Comment #3 from Roger Sayle --- There's a missing simplification in combine:

Trying 6 -> 11:
    6: r102:TI=zero_extend(r109:DF#0)<<0x40|zero_extend(r108:DF#0)
      REG_DEAD r108:DF
      REG_DEAD r109:DF
   11: r105:DF=r102:TI#0+r102:TI#8
      REG_DEAD r102:TI
Failed to match this instruction:
(set (reg:DF 105 [ _4 ])
    (plus:DF (subreg:DF (ior:TI (ashift:TI (zero_extend:TI (subreg:DI (reg:DF 109) 0))
                                           (const_int 64 [0x40]))
                                (zero_extend:TI (subreg:DI (reg:DF 108) 0))) 8)
             (reg:DF 108)))

where the lowpart is getting simplified to reg:DF 108, but the highpart isn't getting simplified to reg:DF 109, i.e.

(subreg:DF (ior:TI (ashift:TI (zero_extend:TI (subreg:DI (reg:DF 109) 0))
                              (const_int 64 [0x40]))
                   (zero_extend:TI (subreg:DI (reg:DF 108) 0))) 8)

can be simplified to just (reg:DF 109). I'm looking into why this isn't happening.
[Bug other/113336] [14 Regression] libatomic (testsuite) regressions on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from Roger Sayle --- This should now be fixed on mainline.
[Bug target/106060] Inefficient constant broadcast on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106060 Roger Sayle changed: What|Removed |Added Target Milestone|--- |15.0 --- Comment #5 from Roger Sayle --- For the record (so it doesn't get lost) the final patch was posted at https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643973.html and approved (for stage 1) at https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643996.html
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #16 from Roger Sayle --- This should now be fixed on mainline. The testsuite regressions (on non-x86 targets) are cosmetic, i.e. neither wrong code nor worse performance/size, just differences in expected code generation.
[Bug target/113690] [13 Regression] ICE: in as_a, at machmode.h:381 with -O2 -fno-dce -fno-forward-propagate -fno-split-wide-types -funroll-loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113690 Roger Sayle changed: What|Removed |Added Summary|[13/14 Regression] ICE: in |[13 Regression] ICE: in |as_a, at machmode.h:381 |as_a, at machmode.h:381 |with -O2 -fno-dce |with -O2 -fno-dce |-fno-forward-propagate |-fno-forward-propagate |-fno-split-wide-types |-fno-split-wide-types |-funroll-loops |-funroll-loops --- Comment #6 from Roger Sayle --- This has now been fixed on mainline. Please let me know if this is worth backporting to GCC 13.
[Bug tree-optimization/112508] [14 Regression] Size regression when using -Os starting with r14-4089-gd45ddc2c04e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112508 Roger Sayle changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 CC||roger at nextmovesoftware dot com Last reconfirmed||2024-02-15 --- Comment #2 from Roger Sayle --- The issue appears to be with (poor costing in) loop invariant store motion. Adding the command line option "-fno-move-loop-stores" reduces the .s file from 149 lines to 54 lines, and the size of main (as reported by objdump -d) from 317 bytes to 73 bytes. To confirm that this isn't specific to this (possibly pathological/obscure) test case, I ran the CSiBE benchmark on x86_64, comparing "-Os" against "-Os -fno-move-loop-stores", which shows a net saving of 1606 bytes with -fno-move-loop-stores. There are also cases where -fmove-loop-stores reduces code size (on x86_64; I've not investigated other targets), so I guess it would be preferable to use more accurate size costs instead of just disabling this sub-pass. Note that the bigger hammer, -fno-tree-loop-im, also avoids the code growth, but the more precise/specific -fno-move-loop-stores is sufficient.
[Bug target/113764] [X86] __builtin_clz generates lzcnt when bsr is sufficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764 Roger Sayle changed: What|Removed |Added Summary|[X86] Generates lzcnt when |[X86] __builtin_clz |bsr is sufficient |generates lzcnt when bsr is ||sufficient --- Comment #4 from Roger Sayle --- Yep, CLZ_DEFINED_VALUE_AT_ZERO really complicates things. With a single "global" macro it's currently impossible for a backend to support two different CLZ instructions: one with defined behavior at zero, and the other with undefined behavior at zero. It might just be possible to do something by encoding LZCNT patterns in RTL using:

(if_then_else:SI (ne:SI (reg:SI x) (const_int 0))
                 (clz:SI (reg:SI x))
                 (const_int VALUE))

Additionally, on x86_64 the BSR instruction sets the zero flag if its input is zero, in which case the destination register becomes undefined; this can be useful with CMOV, i.e. it's possible to get defined behavior without an additional test and branch. But for Pawel's original testcase, __builtin_clz is undefined at zero, so this really is a missed optimization, with either -Os or a modern -march such as cascadelake or znver4. I agree with Jakub, this is a can of worms; potentially a lot of effort for a marginal improvement.
[Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764 --- Comment #2 from Roger Sayle --- Investigating further, the thinking behind GCC's current behaviour can be found in Agner Fog's instruction tables; on many architectures BSR is much slower than LZCNT.

Legacy AMD:      BSR=4 cycles,  LZCNT=2 cycles
AMD BOBCAT:      BSR=6 cycles,  LZCNT=5 cycles
AMD JAGUAR:      BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN[1-3]:    BSR=4 cycles,  LZCNT=1 cycle
AMD ZEN4:        BSR=1 cycle,   LZCNT=1 cycle
INTEL:           BSR=3 cycles,  LZCNT=3 cycles
KNIGHTS LANDING: BSR=11 cycles, LZCNT=3 cycles

Hence using bsr is only "better" in some (but not all) contexts, and a reasonable default (for generic tuning) is to ignore BSR when LZCNT is available, as it's only one extra cycle of latency to perform the XOR. The correct solution is to add a tuning parameter to the x86 backend to control whether it's beneficial to use BSR when LZCNT is available, for example when optimizing for size with -Os or -Oz. This is more reasonable now that current Intel and AMD architectures have the same latency for BSR and LZCNT than it was when LZCNT first appeared (explaining the !TARGET_LZCNT in i386.md).
[Bug tree-optimization/113673] [12/13/14 Regression] ICE: verify_flow_info failed: BB 5 cannot throw but has an EH edge with -Os -finstrument-functions -fnon-call-exceptions -ftrapv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113673 --- Comment #4 from Roger Sayle --- The identified patch implements += the same way as |=. Presumably a version of the test case replacing "m += *data++;" with "m |= *data++;" would be more useful at identifying a patch that actually changed EH edges.
[Bug target/113832] [14 Regression] 6% exec time regression of 464.h264ref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113832 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #2 from Roger Sayle --- Adding myself to Cc list (in case this is confirmed to be a widening multiply issue).
[Bug target/113764] [X86] Generates lzcnt when bsr is sufficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113764 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2024-02-08 --- Comment #1 from Roger Sayle --- Confirmed. This issue has two parts. The first is that the bsr_1 pattern (and variants) is (are) conditional on !TARGET_LZCNT, so the bsrl instruction isn't currently available with -mlzcnt. The second is that the middle-end doesn't have a preferred canonical RTL representation for this idiom, but all three of the following equivalent functions should generate identical code:

unsigned bsr1(unsigned x) { return __builtin_clz(x) ^ 31; }
unsigned bsr2(unsigned x) { return 31 - __builtin_clz(x); }
unsigned bsr3(unsigned x) { return ~__builtin_clz(x) & 31; }

[Note that the tree-ssa optimizers do transform bsr3 into bsr1]. A suitable fix would be to add the equivalent clz(x)^31 variant pattern to i386.md as a "synonymous" define_insn pattern.
[Bug tree-optimization/113759] [14 regression] ICE when building fdk-aac-2.0.3 since r14-8680
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113759 --- Comment #9 from Roger Sayle --- Many thanks Jakub. Sorry again for the inconvenience.
[Bug target/113720] [14 Regression] internal compiler error: in extract_insn, at recog.cc:2812 targeting alpha-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113720 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #3 from Roger Sayle --- Sorry for the inconvenience. alpha.md's define_expand that creates RTL that contains a MULT with operands of different modes looks highly suspicious. Uros' patch to use the (relatively recently added) UMUL_HIGHPART rtx_code is certainly a step in the right direction.
[Bug target/113690] [13/14 Regression] ICE: in as_a, at machmode.h:381 with -O2 -fno-dce -fno-forward-propagate -fno-split-wide-types -funroll-loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113690 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- I'm bootstrapping and regression testing a fix.
[Bug target/113701] Issues with __int128 argument passing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113701 Roger Sayle changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106518 --- Comment #5 from Roger Sayle --- I like Uros' patch in comment #2. There have been so many incremental changes and improvements to x86 TImode handling and register allocation that this legacy heuristic (workaround?) is not only no longer useful, but actually hurts register allocation. *cmp_doubleword appears to be the only (remaining?) place this idiom is used. Additionally, I think I've mentioned in the past that it might also be useful to have an xchg/swap sinking pass, perhaps as part of cprop_hardreg, so that, for example, a swap followed by a swap is eliminated, a swap with one destination REG_DEAD is transformed into a mov, etc. Swap/xchg is almost always just hard register renaming, so these should often be eliminable, but the abstraction is useful to allow this to happen.
[Bug other/113336] [14 Regression] libatomic (testsuite) regressions on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Target Milestone|--- |14.0 --- Comment #7 from Roger Sayle --- A revised patch has been posted for review/approval to gcc-patches: https://gcc.gnu.org/pipermail/gcc-patches/2024-January/644147.html
[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #7 from Roger Sayle --- I'm bootstrapping and regression testing a patch.
[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533 --- Comment #14 from Roger Sayle --- My apologies for not keeping folks updated on my thinking. Following Oleg's feedback, I've decided to slim down my proposed fix to the bare minimum, and postpone the other rtx_costs improvements until GCC 15 (or later), when I'll have more time to use CSiBE to demonstrate the benefits/tradeoffs for -Os and -Oz. For example, with fwprop about to transition to insn_cost, it would be good for the SH backend to provide a sh_insn_cost target hook. The current minimal patch to fix this specific regression is:

diff --git a/gcc/config/sh/sh.cc b/gcc/config/sh/sh.cc
index 2c411c3..fba6c0fd465 100644
--- a/gcc/config/sh/sh.cc
+++ b/gcc/config/sh/sh.cc
@@ -3313,7 +3313,8 @@ sh_rtx_costs (rtx x, machine_mode mode ATTRIBUTE_UNUSED, int outer_code,
 	{
 	  *total = sh_address_cost (XEXP (XEXP (x, 0), 0),
 				    GET_MODE (XEXP (x, 0)),
-				    MEM_ADDR_SPACE (XEXP (x, 0)), true);
+				    MEM_ADDR_SPACE (XEXP (x, 0)), true)
+		   + COSTS_N_INSNS (1);
 	  return true;
 	}
       return false;

The minor complication is that, as explained above, this results in:

PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times addc 6
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times cmp/pz 25
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times shll 3
PASS->FAIL: gcc.target/sh/pr59533-1.c scan-assembler-times subc 14

which were failures that were fixed (or silenced) by my solution to PR111267. I will note that although the scan-assembler-times tests complain, this tweak to sh_rtx_costs reduces the total number of instructions in pr59533-1.c, which (normally) indicates that it's an improvement.

*** old.s	Thu Jan 25 22:54:11 2024
--- new.s	Thu Jan 25 22:54:23 2024
***************
*** 15,23 ****
  	.global	test_01
  	.type	test_01, @function
  test_01:
- 	mov.b	@r4,r0
- 	extu.b	r0,r0
  	mov.b	@r4,r1
  	cmp/pz	r1
  	mov	#0,r1
  	rts
--- 15,22 ----
  	.global	test_01
  	.type	test_01, @function
  test_01:
  	mov.b	@r4,r1
+ 	extu.b	r1,r0
  	cmp/pz	r1
  	mov	#0,r1
  	rts
...
Hence I'm looking into PR59533, which has separate tests for sh2a and !sh2a, and my latest discoveries are that -m2a isn't supported if I build gcc using --target=sh3-linux-gnu, and that --target=sh2a-linux-gnu doesn't automatically default to --target=sh2aeb-linux-gnu and instead gives a fatal error about "SH2A does not support little-endian" during the build. All part (joy?) of the learning curve.
[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533 Roger Sayle changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59533 --- Comment #12 from Roger Sayle --- It should be mentioned that the fwprop fix for PR111267 also resolved several FAILs in gcc.target/sh/pr59533.c. I mention this as tweaking the cost of SIGN_EXTEND in sh_rtx_costs interacts with the (redundant) extensions mentioned in the initial description of PR59533.
[Bug other/113336] libatomic (testsuite) regressions on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|NEW Assignee|roger at nextmovesoftware dot com |unassigned at gcc dot gnu.org Summary|libatomic (testsuite) |libatomic (testsuite) |regressions on |regressions on arm |armv6-linux-gnueabihf | --- Comment #4 from Roger Sayle --- Hi Victor, Yes, I agree your approach is better/less invasive than mine. I simply copied the existing idiom in Makefile.am, not noticing that this adds more functionality to libatomic than is strictly required. Just adding the missing/required tas_1_2_.lo is better (and hopefully more acceptable to the maintainers/reviewers).
[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560 --- Comment #6 from Roger Sayle --- In the .optimized dump, we have:

  __int128 unsigned __res;
  __int128 unsigned _12;
  ...
  __res_11 = in_2(D) w* 184467440738;
  _12 = __res_11 & 18446744073709551615;
  __res_7 = _12 * 100;

So the first multiplication is a widening multiplication and is expanded using mulx, but the second multiplication is a full-width TImode multiplication, which is why it has the same RTL expansion as "x * 100". This is looking like a tree-level issue and (perhaps) not a target-specific problem. In fact, it looks like this operation is actually a highpart multiplication, as only the highpart of the result is required (which should still generate mulx, but has a different representation at the tree level).
[Bug target/113560] Strange code generated when optimizing a multiplication on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #2 from Roger Sayle --- The costs look sane, and I'd expect the synth_mult generated sequence to be faster, though it would be good to get some microbenchmarking. A reduced test case is:

__int128 foo(__int128 x) { return x*100; }

The x86 backend thinks that a 128-bit (TImode) multiplication would take 14 cycles, so instead generates:

x2 = x+x	2 cycles
x3 = x2+x	2 cycles
x24 = x3<<3	2 cycles
x25 = x24+x	2 cycles
x100 = x25<<2	2 cycles

which is a total of 10 cycles, and predicted to be faster than the generic implementation (requiring 2 IMULQ, 1 MULQ and 2 ADDQ) for:

__int128 bar(__int128 x, int y) { return x*y; }
[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533 --- Comment #10 from Roger Sayle --- Hi Oleg. Great question. The "speed" parameter passed to rtx_costs and address_cost indicates whether the middle-end is optimizing for performance, and interested in the number of cycles taken by each instruction, or optimizing for size, and interested in the number of bytes used to encode the instruction. Previously, this speed parameter was ignored by the SH backend, so the costs were the same independent of the objective function. In my proposed patch, the address cost (1) when optimizing for size attempts to return the additional size of an instruction based on the addressing mode: for register and reg+reg addressing modes there is no size increase (overhead), and for addressing modes with displacements, and displacements to address pointers, there is a cost; (2) when optimizing for speed, address cost remains between 0 and 3, and is used to prioritize between (equivalent numbers of) instructions. Normally, rtx_costs are defined in terms of COSTS_N_INSNS, which multiplies by 4. Hence on many platforms a single instruction that references memory may be encoded as COSTS_N_INSNS(1)+1 (or, with a more complex addressing mode, COSTS_N_INSNS(1)+2) to show that it is disfavored compared to a single instruction that doesn't reference memory, COSTS_N_INSNS(1)+0. This is the fix for this particular regression; SIGN_EXTEND of a register now costs COSTS_N_INSNS(1), and SIGN_EXTEND of a MEM now costs COSTS_N_INSNS(1)+1. A useful way to debug rtx_costs is to use the -dP command line option, and then look at the [c=X, l=Y] annotations in the assembly language file. One way to check/confirm that the costs are sensible is that ideally the c= costs and l= lengths should be correlated when optimizing for size (with -Os or -Oz). I've found an interesting table of SH cycle counts (for different CPUs) at http://www.shared-ptr.com/sh_insns.html and these could be used to improve sh_rtx_costs further.
For example, SH currently reports multiplications as a single-cycle operation, which doesn't match the hardware specs, and prevents GCC from using synth_mult to produce faster (or shorter) sequences using shifts and additions. Likewise, sh_rtx_costs doesn't distinguish the machine mode, so the costs of SImode multiplications are the same as DImode multiplications. In comment #5 you mention GCC's defaults; it turns out that for rtx_costs the default values that would be provided by the middle-end may be more accurate than the values (currently) specified by the backend. I hope this answers your question.
[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533 --- Comment #8 from Roger Sayle --- Created attachment 57190 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57190&action=edit proposed patch Proposed patch to provide a sane/saner set of rtx_costs for SH. There's plenty more that could be done, but these changes are (more than) sufficient to resolve the code quality regression caused by improved fwprop. If someone could try this out on SH, and report back the results, that would be great.
[Bug rtl-optimization/113542] New: gcc.target/arm/bics_3.c regression after change for pr111267
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113542 Bug ID: 113542 Summary: gcc.target/arm/bics_3.c regression after change for pr111267 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: roger at nextmovesoftware dot com Target Milestone: --- This PR is a placeholder for tracking the reported failures of

FAIL: gcc.target/arm/bics_3.c scan-assembler-times bics\tr[0-9]+, r[0-9]+, r[0-9]+ 2
FAIL: gcc.target/arm/bics_3.c scan-assembler-times bics\tr[0-9]+, r[0-9]+, r[0-9]+, .sl #2 1

See https://linaro.atlassian.net/browse/GNU-1117 Alas, I've been unable to reproduce the failure on cross compilers to either arm-linux-gnueabihf or armv8l-unknown-linux-gnueabihf, so I suspect that there's some configuration option or compile-time flag I'm missing that's required to trigger these failures (which I'm hoping are "missed optimization" rather than "wrong code"). Hopefully, filing this PR provides a mechanism to allow folks to help me investigate this issue. My apologies for the temporary inconvenience. Setting the component to "rtl-optimization" until this is confirmed to be a target (ARM backend) issue.
[Bug rtl-optimization/113533] [14 Regression] Code generation regression after change for pr111267
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533 Roger Sayle changed: What|Removed |Added Last reconfirmed||2024-01-22 Status|UNCONFIRMED |NEW CC||roger at nextmovesoftware dot com Ever confirmed|0 |1 --- Comment #6 from Roger Sayle --- To help diagnose the problem, I came up with this simple patch:

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 7872609b336..dc563ac2ca1 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -492,6 +492,9 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
		     " (cost %d -> cost %d)\n", old_cost, new_cost);
	  ok = false;
	}
+      else if (dump_file)
+	fprintf (dump_file, "change is profitable"
+		 " (cost %d -> cost %d)\n", old_cost, new_cost);
     }

   if (!ok)

which then helps reveal that on sh3-linux-gnu with -O1 we see:

propagating insn 6 into insn 12, replacing:
(set (reg:SI 174 [ _1 ])
     (sign_extend:SI (reg:QI 169 [ *a_7(D) ])))
successfully matched this instruction to *extendqisi2_compact_snd:
(set (reg:SI 174 [ _1 ])
     (sign_extend:SI (mem:QI (reg/v/f:SI 168 [ aD.1817 ]) [0 *a_7(D)+0 S1 A8])))
change is profitable (cost 4 -> cost 1)

which confirms Andrew's and Oleg's analyses above; the sh_rtx_costs function is a little odd... Reading from memory is four times faster than using a pseudo!? I'm investigating a "costs" patch for the SH backend. My apologies for the temporary inconvenience, and thanks to Jeff for catching/spotting this.
[Bug target/91681] Missed optimization for 128 bit arithmetic operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91681 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED Target Milestone|--- |14.0 --- Comment #7 from Roger Sayle --- This is now fixed on mainline. GCC is now optimal, and generates one less instruction than clang.
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #10 from Roger Sayle --- A revised and improved patch has been posted for review at https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643062.html
[Bug other/113336] libatomic (testsuite) regressions on armv6-linux-gnueabihf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336 Roger Sayle changed: What|Removed |Added Last reconfirmed||2024-01-14 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #1 from Roger Sayle --- As there's a patch for this regression (awaiting review), I should upgrade the PR status to ASSIGNED.
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #8 from Roger Sayle --- Now we're in stage4, I'll take this. I'm bootstrapping and regression testing a variant of my tweak to try_fwprop_subst_pattern. The change in comment #7 can loop indefinitely if the transformation results in the same cost as the original, so the logic on when to forward-propagate needed to be tweaked a little.
[Bug target/106060] Inefficient constant broadcast on x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106060 Roger Sayle changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Status|NEW |ASSIGNED --- Comment #4 from Roger Sayle --- I have a patch for better materialization of vector constants (including cmpeq+abs), but now that we've transitioned from stage 3 (bug fixing) to stage 4 (regression fixing), this will have to wait for GCC 15's stage 1. I'm happy to post the patch here or to gcc-patches, if anyone would like to pre-review it and/or benchmark the proposed changes.
[Bug target/112992] Inefficient vector initialization using vec_duplicate/broadcast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112992 Roger Sayle changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED Target Milestone|--- |14.0 --- Comment #10 from Roger Sayle --- This has now been fixed on mainline (we generate identical code for all four functions in comment #1).
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #7 from Roger Sayle --- Very many thanks to Jeff Law for pointing me to fwprop. The following simple patch also fixes this regression.

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0c588f8..cbba44e 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -449,15 +449,6 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
   if (prop.num_replacements == 0)
     return false;

-  if (!prop.profitable_p ())
-    {
-      if (dump_file && (dump_flags & TDF_DETAILS))
-	fprintf (dump_file, "cannot propagate from insn %d into"
-		 " insn %d: %s\n", def_insn->uid (), use_insn->uid (),
-		 "would increase complexity of pattern");
-      return false;
-    }
-
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, "\npropagating insn %d into insn %d, replacing:\n",
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #6 from Roger Sayle --- Sorry for the delay in replying/answering Jakub's questions/comments. Yes, using a define_insn_and_split in the backend fixes/works around the issue (and I agree your implementation/refinement in comment #5 is better than mine in comment #2), but I've a feeling that this approach isn't the ideal solution. Nothing about this split is specific to these x86 instructions or even to the i386 backend. A more generic fix might be to teach combine.cc that it can split parallels of two independent sets, with no inter-dependencies, into two insns if the total cost of the two instructions is less than the original two, i.e. a 2 insn -> 2 insn combination. But then even this doesn't feel like the perfect approach... the reason combine doesn't already support 2->2 combinations is that they're not normally required; these types of problems are usually handled by GCSE or CSE or PRE (or ?). The pattern is: insn1 defines REG1 to a complicated expression that is live in several locations, so this instruction can't be eliminated. However, if the definition of REG1 is provided to insn2 that sets REG2, this second instruction can be significantly simplified. This feels like a classic (non-)constant propagation problem. I'm thinking perhaps want_to_gcse_p (or somewhere similar) could be tweaked. For people just joining the discussion (hopefully Jeff or a Richard):

(set (reg:DI 1) (concat:DI (reg:SI 2) (reg:SI 3)))
...
(set (reg:SI 4) (low_part (reg:DI 1)))

can be simplified so that the second assignment becomes just:

(set (reg:SI 4) (reg:SI 2))

and similarly for high_part vs. low_part. These don't even need to be in the same basic block. In actuality, "concat" is a large ugly expression, and high_part/low_part are actually SUBREGs (or could be TRUNCATE or SHIFT+TRUNCATE), but the theory should remain the same. I'm trying to figure out which pass (or cselib?) 
is normally responsible for handling this type of pseudo-reg propagation. But the define_insn_and_split certainly papers over the deficiency in the middle-end's RTL optimizers and fixes this (very) specific case/regression.
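At the source level, the simplification being requested corresponds to the identity that the low part of a concatenation is just the original low component, e.g. (a sketch using GCC's __int128 on a 64-bit target):

```c
#include <stdint.h>

/* Build a TImode value from two DImode halves, then extract the low half.
   The whole function should simplify to `return lo;` -- the concatenation
   need never be materialized when only one component is extracted.  */
uint64_t concat_then_lowpart (uint64_t hi, uint64_t lo)
{
  unsigned __int128 t = ((unsigned __int128) hi << 64) | lo;
  return (uint64_t) t;          /* low_part of the concat */
}
```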
[Bug other/113336] New: libatomic (testsuite) regressions on armv6-linux-gnueabihf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113336 Bug ID: 113336 Summary: libatomic (testsuite) regressions on armv6-linux-gnueabihf Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: roger at nextmovesoftware dot com Target Milestone: --- As suggested by Richard Earnshaw, this opens a bugzilla PR for tracking this issue. All the tests in libatomic currently fail on a raspberry pi running raspbian, but passed back in December 2020. https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642168.html The regression (which isn't really a regression) was caused by: 2023-09-26 Hans-Peter Nilsson PR target/107567 PR target/109166 * builtins.cc (expand_builtin) : Handle failure from expand_builtin_atomic_test_and_set. * optabs.cc (expand_atomic_test_and_set): When all attempts fail to generate atomic code through target support, return NULL instead of emitting non-atomic code. Also, for code handling targetm.atomic_test_and_set_trueval != 1, gcc_assert result from calling emit_store_flag_force instead of returning NULL. Prior to this, when -fno-sync-libcalls was specified on the command line, the __atomic_test_and_set built-in simply expanded to a non-atomic code sequence, which then passed libatomic's configure tests for HAVE_ATOMIC_TAS. Now that this hole/bug/correctness issue has been fixed, and HAVE_ATOMIC_TAS is now detected as false, libatomic's tas_n.c can no longer implement tas_8_2_.o without (a missing helper function) tas_1_2_.o. Hence libatomic has (always?) been broken on armv6, but synchronization primitives can now be supported with the above change. We've just not noticed that necessary pieces of the runtime were missing, until the above correctness fix resulted in a link error.
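For readers unfamiliar with the built-in involved, here is a minimal (sketch) use of __atomic_test_and_set, the primitive whose expansion was changed:

```c
#include <stdbool.h>

/* A trivial spinlock built on __atomic_test_and_set.  After the 2023-09-26
   change, targets without atomic support expand this to a libatomic call
   (which must exist!) instead of silently emitting a non-atomic sequence.  */
static unsigned char lock_flag;

bool try_lock (void)
{
  /* The built-in returns the previous value: false means we got the lock.  */
  return !__atomic_test_and_set (&lock_flag, __ATOMIC_ACQUIRE);
}

void unlock (void)
{
  __atomic_clear (&lock_flag, __ATOMIC_RELEASE);
}
```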
[Bug target/113231] x86_64 uses SSE instructions for `*mem <<= const` at -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113231 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Target Milestone|--- |14.0 Status|ASSIGNED|RESOLVED --- Comment #6 from Roger Sayle --- This should now be fixed on mainline.
[Bug target/113231] x86_64 uses SSE instructions for `*mem <<= const` at -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113231 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- I'm testing a patch, for more accurate conversion gains/costs in the scalar-to-vector pass. Adding -mno-stv will work around the problem.
[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914 --- Comment #19 from Roger Sayle --- Created attachment 56930 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56930&action=edit proposed patch And now for a patch that does (or should) work. This even contains an optimization: the middle-end knows we don't need to sign or zero extend if an insv doesn't modify the sign-bit. Testing on MIPS would be much appreciated. TIA.
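The optimization mentioned can be illustrated at the C level (the field position and width below are made up for illustration): inserting a bit-field that leaves the sign bit alone preserves an existing sign extension, so no fresh extend is required after the insv.

```c
#include <stdint.h>

/* Insert a 6-bit field at bit 8.  Bit 31 is never modified, so a value
   that was already sign-extended to a wider register remains correctly
   extended after the insertion -- the case the optimization exploits.  */
int32_t insert_field (int32_t word, unsigned field)
{
  return (word & ~(0x3f << 8)) | ((int32_t) (field & 0x3f) << 8);
}
```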
[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914 --- Comment #18 from Roger Sayle --- Please ignore comment #17, the above patch is completely bogus/broken.
[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914 --- Comment #17 from Roger Sayle --- I think this patch might resolve the problem (or move it somewhere else):

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 9fef2bf6585..218bca905f5 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -6274,10 +6274,7 @@ expand_assignment (tree to, tree from, bool nontemporal)
	result = store_expr (from, to_rtx, 0, nontemporal, false);
       else
	{
-	  rtx to_rtx1
-	    = lowpart_subreg (subreg_unpromoted_mode (to_rtx),
-			      SUBREG_REG (to_rtx),
-			      subreg_promoted_mode (to_rtx));
+	  rtx to_rtx1 = gen_reg_rtx (subreg_unpromoted_mode (to_rtx));
	  result = store_field (to_rtx1, bitsize, bitpos,
				bitregion_start, bitregion_end,
				mode1, from, get_alias_set (to),

The motivation/solution comes from a comment in expmed.cc:

/* If the destination is a paradoxical subreg such that we need a
   truncate to the inner mode, perform the insertion on a temporary and
   truncate the result to the original destination.  Note that we can't
   just truncate the paradoxical subreg as (truncate:N (subreg:W (reg:N X) 0))
   is (reg:N X).  */

The same caveat applies to extensions on MIPS, so we should use a new pseudo temporary register rather than update the SUBREG in place. If someone could confirm this fixes the issue on MIPS, I'll try to come up with a milder form of this fix that checks TARGET_MODE_REP_EXTENDED that'll limit the churn/impact on other targets.
[Bug rtl-optimization/112380] [14 regression] ICE when building Mesa (in combine, internal compiler error: in simplify_subreg) since r14-2526-g8911879415d6c2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112380 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #14 from Roger Sayle --- This should now be fixed on mainline, but similar issues may still be latent in combine (see the longer alternative in comment #12).
[Bug target/112992] New: Inefficient vector initialization using vec_duplicate/broadcast
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112992 Bug ID: 112992 Summary: Inefficient vector initialization using vec_duplicate/broadcast Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: roger at nextmovesoftware dot com Target Milestone: --- The following four functions should in theory all produce the same code:

typedef unsigned long long v4di __attribute((vector_size(32)));
typedef unsigned int v8si __attribute((vector_size(32)));
typedef unsigned short v16hi __attribute((vector_size(32)));
typedef unsigned char v32qi __attribute((vector_size(32)));

#define MASK 0x01010101
#define MASKL 0x0101010101010101ULL
#define MASKS 0x0101

v4di fooq() { return (v4di){MASKL,MASKL,MASKL,MASKL}; }
v8si food() { return (v8si){MASK,MASK,MASK,MASK,MASK,MASK,MASK,MASK}; }
v16hi foow() { return (v16hi){MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,
                              MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS,MASKS}; }
v32qi foob() { return (v32qi){1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                              1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}; }

On x86_64 with -mavx, we currently produce very different implementations:

fooq:	movabs	rax, 72340172838076673
	push	rbp
	mov	rbp, rsp
	and	rsp, -32
	mov	QWORD PTR [rsp-8], rax
	vbroadcastsd	ymm0, QWORD PTR [rsp-8]
	leave
	ret
food:	vbroadcastss	ymm0, DWORD PTR .LC2[rip]
	ret
foow:	vmovdqa	ymm0, YMMWORD PTR .LC3[rip]
	ret
foob:	vmovdqa	ymm0, YMMWORD PTR .LC4[rip]
	ret

clang currently produces the vbroadcastss for all four. I discovered that some of my "day job" code used the "fooq" idiom, requiring a stack frame, and both reads and writes to memory [of a compile-time constant]. I suspect the fix is to add a define_insn_and_split or two to i386/sse.md, and perhaps something can be done in expand, but I'm confused why LRA/reload spills the DImode component of V4DI to the stack frame, but places the SImode component of V8SI in the constant pool. 
This is related (distantly) to PRs 100865 and 106060, but is potentially target independent, and seems to be going wrong in LRA/reload's REG_EQUIV elimination. Thoughts? Apologies if this is a dup. I'm happy to work up a patch if someone could advise on where best this should be fixed. Perhaps RTL's vec_duplicate could be canonicalized to the most appropriate vector mode?
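One observation that makes the inconsistency stand out: all four initializers denote the same 256-bit constant (every byte 0x01), which can be verified directly, so a single materialization strategy could serve all of them.

```c
#include <string.h>

typedef unsigned long long v4di __attribute__((vector_size(32)));
typedef unsigned int v8si __attribute__((vector_size(32)));

#define MASK  0x01010101
#define MASKL 0x0101010101010101ULL

/* fooq and food from the report build bit-identical 256-bit values, so in
   principle both can be implemented with the same single broadcast.  */
int same_constant (void)
{
  v4di q = (v4di){MASKL, MASKL, MASKL, MASKL};
  v8si d = (v8si){MASK, MASK, MASK, MASK, MASK, MASK, MASK, MASK};
  return memcmp (&q, &d, sizeof q) == 0;
}
```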
[Bug rtl-optimization/112380] [14 regression] ICE when building Mesa (in combine, internal compiler error: in simplify_subreg) since r14-2526-g8911879415d6c2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112380 --- Comment #12 from Roger Sayle --- Patch proposed (actually two alternatives proposed) at https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636203.html
[Bug target/110551] [11/12/13 Regression] an extra mov when doing 128bit multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551 Roger Sayle changed: What|Removed |Added Target Milestone|11.5|14.0 Resolution|--- |FIXED Summary|[11/12/13/14 Regression] an |[11/12/13 Regression] an |extra mov when doing 128bit |extra mov when doing 128bit |multiply|multiply Status|NEW |RESOLVED --- Comment #10 from Roger Sayle --- This should now be fixed on mainline.
[Bug rtl-optimization/91865] Combine misses opportunity to remove (sign_extend (zero_extend)) before searching for insn patterns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91865 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #7 from Roger Sayle --- This should now be fixed on mainline.
[Bug rtl-optimization/112380] [14 regression] ICE when building Mesa (in combine, internal compiler error: in simplify_subreg) since r14-2526-g8911879415d6c2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112380 Roger Sayle changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Status|NEW |ASSIGNED --- Comment #10 from Roger Sayle --- combine.cc's expand_field_assignment needs to defend against gen_lowpart (which is gen_lowpart_for_combine) returning a CLOBBER. Otherwise, we end up calling simplify_set on:

(set (reg:DI 134)
     (and:DI (subreg:DI (ior:SI (ior:SI (and:SI (subreg:SI (reg/v:TI 114 [ sampler ]) 0)
                                                (const_int -129280 [0xfffe0700]))
                                        (and:SI (clobber:TI (const_int 0 [0]))
                                                (const_int -129025 [0xfffe07ff])))
                                (and:SI (reg:SI 130)
                                        (const_int 129024 [0x1f800]))) 0)
             (const_int 4294967295 [0xffffffff])))

where if you look closely the "(clobber:TI (const_int 0))" causes no end of fun in simplify_rtx; it's not surprising that an assert is eventually triggered in simplify_subreg. I'm testing a patch.
[Bug c++/50755] [avr] ICE: tree check: expected class 'constant', have 'unary' (convert_expr)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50755 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- This appears to be fixed on mainline. G-J can you confirm this is resolved?
[Bug target/112298] Poor code for DImode operations on H8 port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112298 Roger Sayle changed: What|Removed |Added Status|UNCONFIRMED |NEW CC||roger at nextmovesoftware dot com Ever confirmed|0 |1 Last reconfirmed||2023-10-30
[Bug target/112103] [14 regression] gcc.target/powerpc/rlwinm-0.c fails after r14-4941-gd1bb9569d70304
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112103 Roger Sayle changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW CC||roger at nextmovesoftware dot com Last reconfirmed||2023-10-26
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #3 from Roger Sayle --- This patch addresses the regression, but probably isn't the correct fix. The issue is that the backend now has a way of representing the concatenation of two registers (for example, TI is constructed from two DI mode registers):

(set (reg:TI 111 [ bD.2764 ])
     (ior:TI (ashift:TI (zero_extend:TI (reg:DI 142))
                        (const_int 64 [0x40]))
             (zero_extend:TI (reg:DI 141))))

But combine is unable to cleanly extract the (original) DI mode components back out of this using SUBREGs. Currently combine gets confused and attempts to match things like:

Trying 10 -> 74:
   10: r111:TI=zero_extend(r142:DI)<<0x40|zero_extend(r141:DI)
      REG_DEAD r141:DI
      REG_DEAD r142:DI
   74: r137:DI=r111:TI#0
Failed to match this instruction:
(parallel [
        (set (reg:DI 137 [ bD.2764 ])
             (reg:DI 141))
        (set (reg:TI 111 [ bD.2764 ])
             (ior:TI (ashift:TI (zero_extend:TI (reg:DI 142))
                                (const_int 64 [0x40]))
                     (zero_extend:TI (reg:DI 141))))
    ])

which contains the simplification we want, "reg:DI 137 := reg:DI 141", but along with stuff that combine should really take care of (strip/duplicate). I'll work on a more acceptable middle-end fix, but this patch demonstrates progress, and can be used if a more general solution can't be found.
[Bug rtl-optimization/111267] [14 Regression] Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 --- Comment #2 from Roger Sayle --- Created attachment 56162 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56162=edit proof-of-concept patch
[Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- Patch proposed: https://gcc.gnu.org/pipermail/gcc-patches/2023-October/67.html
[Bug bootstrap/111812] [14 regression] Can't build with gcc 4.8.5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111812 Roger Sayle changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2023-10-16 Known to work||13.0 Known to fail||14.0 Host|powerpc64-linux-gnu |*-linux-gnu Status|UNCONFIRMED |NEW Target|powerpc64-linux-gnu | CC||roger at nextmovesoftware dot com Build|powerpc64-linux-gnu |
[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from Roger Sayle --- This should now be fixed on mainline.
[Bug middle-end/17886] variable rotate and unsigned long long rotate should be better optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17886 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #27 from Roger Sayle --- I believe that this issue has been fixed (for a long time). For Andi's testcases in comment #3, -fdump-tree-optimized reveals all these cases are perceived as rotations by the early middle-end:

long long unsigned int f (long long unsigned int x, int y)
{
  return x_1(D) r<< y_2(D);
}

unsigned int f2 (unsigned int x, int y)
{
  return x_1(D) r<< y_2(D);
}

long long unsigned int f3 (long long unsigned int x)
{
  return x_1(D) r>> 55;
}

long unsigned int f4 (unsigned int x)
{
  return x_1(D) r>> 22;
}
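The source-level idiom that the middle-end recognizes and folds to the r<< / r>> seen in these dumps is the usual shift-and-or pattern (a sketch; the masked count keeps the shifts well-defined for any n):

```c
#include <stdint.h>

/* Portable rotate-left that GCC folds to a single rotate operation
   (shown as `r<<` in -fdump-tree-optimized).  Masking the count avoids
   undefined behaviour when n is 0 or a multiple of 64.  */
uint64_t rotl64 (uint64_t x, unsigned n)
{
  return (x << (n & 63)) | (x >> (-n & 63));
}
```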
[Bug tree-optimization/111519] [13/14 Regression] Wrong code at -O3 on x86_64-linux-gnu since r13-455-g1fe04c497d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111519 --- Comment #2 from Roger Sayle --- Complicated. Things have gone wrong before the strlen pass which is given: _73 = e; _72 = *_73; ... *_73 = prephitmp_23; d = _72; Here the assignment to *_73 overwrites the value of f (at *e) which then invalidates the use of _72 resulting in the wrong value for d. But figuring out which pass is at fault (perhaps complete loop unrolling?) is tricky.
[Bug target/71749] Define _REENTRANT on ARC when -pthread is passed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71749 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #6 from Roger Sayle --- This has been fixed on mainline since March 2017 thanks to: 2017-03-28 Claudiu Zissulescu Thomas Petazzoni
[Bug target/91251] Revision 272645 on top of 9.1.0 caused ICE: in extract_insn, at recog.c:2310
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91251 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #1 from Roger Sayle --- I'm unable to reproduce this with mainline gcc, the reduced testcase appears to compile fine (without an ICE). Can you confirm that this issue is also fixed for you?
[Bug target/91591] Arc: ICE in trunc_int_for_mode, at explow.c:60
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91591 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Target Milestone|--- |8.4 CC||roger at nextmovesoftware dot com Status|UNCONFIRMED |RESOLVED --- Comment #6 from Roger Sayle --- As reported by Giulio, this bug has now been fixed.
[Bug target/83409] arc: "internal compiler error: in extract_constrain_insn" with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83409 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #3 from Roger Sayle --- I'm unable to reproduce this with mainline gcc configured with --target=arc-elf. scatterlist.i (and the reduced test case in comment #1), compile fine with -O2, -O3 and -O3 -fno-strict-aliasing. I've also tried -mcpu=em. Can you confirm that this has been fixed for you (so we can close the PR)?
[Bug target/43892] PowerPC suboptimal "add with carry" optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43892 --- Comment #39 from Roger Sayle --- My apologies for dropping the ball on this patch (series)... My only access to PowerPC hardware is/was via the GCC compile farm, which complicates things. Shortly after David's approval, Segher enquired whether the patch could be modified to also handle -mcpu=power10 (which represents carry differently): https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586868.html Trying to (also) address this then opened up a rabbit hole/can of worms related to how the middle-end (and rs6000.md) represents overflow, which included a combine patch: https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586572.html Soon after, GCC entered stage 4 (or stage 3), and the above patches (and an unsubmitted one for power10) simply got lost in the backlog. I believe this patch is sound, but unfortunately I don't have the bandwidth/patience to (re)check it against mainline on (multiple variants of) rs6000. If one of the IBM folks could take it from here, that'd be much appreciated.
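For readers joining the PR late, the shape of code at issue is double-word addition, where the carry out of the low halves must feed the high halves; the goal of the patch series is for PowerPC to use its carry flag (addc/adde) rather than a separate compare-and-add sequence. A portable sketch:

```c
#include <stdint.h>

/* 128-bit add from 64-bit halves.  `lo < alo` computes the carry out of
   the low word; a good target turns this into an add-with-carry pair
   instead of an explicit comparison plus extra add.  */
void add128 (uint64_t alo, uint64_t ahi, uint64_t blo, uint64_t bhi,
             uint64_t *rlo, uint64_t *rhi)
{
  uint64_t lo = alo + blo;
  uint64_t carry = lo < alo;
  *rlo = lo;
  *rhi = ahi + bhi + carry;
}
```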
[Bug rtl-optimization/104914] [MIPS] wrong comparison with scrabbled int value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104914 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #15 from Roger Sayle --- Is MIPS64 actually a TRULY_NOOP_TRUNCATION target? If SImode is implicitly assumed to be (sign?) extended, then an arbitrary DImode value/register can't be used as an SImode value without appropriately setting/clearing the upper bits, i.e. this integer truncation isn't a no-op. I suspect that the underlying problem is that the backend is relying on implicit invariants, not explicitly represented in the RTL, and then surprised when valid RTL transformations don't preserve those invariants/assumptions. I wonder why the zero_extract followed by sign_extend example mentioned in https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626137.html isn't already being considered as a try_combine candidate, allowing the backend to simply recognize or split it. I'll investigate.
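The invariant in question can be phrased in C: if SImode registers must always hold sign-extended 32-bit values, then turning an arbitrary 64-bit value into a valid SImode register value is a genuine operation, not a relabelling (sketch):

```c
#include <stdint.h>

/* Re-establish the "32-bit values live sign-extended in 64-bit registers"
   invariant from an arbitrary 64-bit value.  When the upper bits are not
   already a sign-copy of bit 31, this truncate-then-extend really does
   change the register contents -- i.e. the truncation is not a no-op.  */
int64_t canonical_simode (int64_t x)
{
  return (int32_t) x;
}
```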
[Bug target/106222] x86 Better code squence for __builtin_shuffle
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106222 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED CC||roger at nextmovesoftware dot com Target Milestone|--- |13.0 --- Comment #3 from Roger Sayle --- This was fixed by Hongtao's patch for GCC 13.
[Bug target/110843] ICE in convert_insn, at config/i386/i386-features.cc:1438 since r14-2405-g4814b63c3c2326
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110843 Roger Sayle changed: What|Removed |Added Ever confirmed|0 |1 CC||roger at nextmovesoftware dot com Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2023-07-28 --- Comment #1 from Roger Sayle --- My STV patch should check TARGET_AVX512VL (which is required for V2DI in VI48_AVX512VL) and not TARGET_AVX512F which is the condition in the define_insn for vp. Bootstrapping and regression testing a fix.
[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587 Roger Sayle changed: What|Removed |Added Assignee|roger at nextmovesoftware dot com |unassigned at gcc dot gnu.org --- Comment #16 from Roger Sayle --- My patch (in comment #15) is obsoleted by Richard Biener's much better solution(s): https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625416.html https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625417.html
[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701 --- Comment #7 from Roger Sayle --- Patch proposed here: https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625532.html
[Bug target/110792] [13/14 Regression] GCC 13 x86_32 miscompilation of Whirlpool hash function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110792 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #11 from Roger Sayle --- Mine. Alas the obvious fix of adding an early clobber to the rotate doubleword from memory alternative generates some truly terrible code (spills via memory to SSE registers!?), but I've come up with a better solution, forcing reload to read the operand from memory before the rotate, which appears to fix the problem without the adverse performance impact.
[Bug target/110790] [14 Regression] gcc -m32 generates invalid bit test code on gmp-6.2.1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110790 --- Comment #5 from Roger Sayle --- I'll add this testcase to the testsuite, when I apply a corrected version of my QImode offset patch to mainline. On the bright side, we'll be generating more efficient code for gmp's refmpn_tstbit by using the x86's bt instruction (it just needs to use setc not setnc in this case).
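For reference, the shape of gmp's tstbit operation (a sketch, not gmp's actual source): testing one bit of an array of limbs, which x86 can implement with bt to set the carry flag and setc to materialize it.

```c
#include <stdint.h>
#include <stddef.h>

/* Test bit `bit` of a little-endian array of 64-bit limbs.  On x86 the
   word-index-plus-shift pattern maps onto the bt instruction, with the
   carry flag materialized by setc.  */
int tstbit (const uint64_t *p, size_t bit)
{
  return (p[bit / 64] >> (bit % 64)) & 1;
}
```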
[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587 --- Comment #15 from Roger Sayle --- Hi Richard, There's another patch awaiting review at https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625282.html and I've another follow-up after that currently regression testing...
[Bug target/110787] [14 regression] ICE building SYSTEM.def
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110787 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #2 from Roger Sayle --- I'm bootstrapping with --enable-languages=all to investigate what's going on. I'll revert the patch once I (or anyone) can confirm that this restores bootstrap, but I'd be happier understanding the actual mechanism (cause and effect) of this ICE. Sorry for the temporary inconvenience.
[Bug c++/72825] ICE on invalid C++ code on x86_64-linux-gnu (internal compiler error: tree check: expected array_type, have error_mark in array_ref_low_bound, at tree.c:13013)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72825 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED CC||roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- This issue has been fixed on mainline (for GCC 14), by the patch for PR 110699.
[Bug c/109598] [12/13/14 Regression] ICE: tree check: expected array_type, have error_mark in array_ref_low_bound, at tree.cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109598 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED CC||roger at nextmovesoftware dot com --- Comment #4 from Roger Sayle --- This issue has been fixed on mainline (for GCC 14), by the patch for PR 110699.
[Bug c/110699] [12/13/14 Regression] internal compiler error: tree check: expected array_type, have error_mark in array_ref_low_bound, at tree.cc:12754
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110699 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #3 from Roger Sayle --- This issue has been fixed on mainline for GCC 14.
[Bug target/110588] btl (on x86_64) not always generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110588 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from Roger Sayle --- This is now fixed on mainline for GCC 14.
[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701 Roger Sayle changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com Status|NEW |ASSIGNED --- Comment #6 from Roger Sayle --- I have a fix (to combine.cc's record_dead_and_set_regs_1). Bootstrapping and regression testing.
[Bug rtl-optimization/110701] [14 Regression] Wrong code at -O1/2/3/s on x86_64-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110701 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #5 from Roger Sayle --- nonzero_bits ((reg:DI 92), SImode) is returning 340, so combine (or more specifically simplify_and_const_int_1) believes that the AND (ZERO_EXTEND) is unnecessary. So it's the same nonzero_bits information that allows us to turn the XOR into IOR (in insn 16) that's incorrectly telling us the AND 340 (or AND 343, or ZERO_EXTEND) is unnecessary (in insn 17).
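The XOR-to-IOR transformation mentioned above rests on a simple identity: when the possibly-nonzero bits of two values are disjoint, XOR and IOR produce the same result. A minimal sketch of that identity (values here are arbitrary, not the register contents from the PR):

```c
#include <assert.h>

/* When (x & y) == 0, no bit position has both inputs set, so
   x ^ y and x | y agree bit-for-bit.  This is the identity combine
   uses (via nonzero_bits) to rewrite the XOR in insn 16 as an IOR.  */
unsigned int xor_as_ior (unsigned int x, unsigned int y)
{
  assert ((x & y) == 0);    /* precondition: disjoint nonzero bits */
  return x | y;             /* identical to x ^ y under that precondition */
}
```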
[Bug c/89180] [meta-bug] bogus/missing -Wunused warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89180 Bug 89180 depends on bug 101090, which changed state. Bug 101090 Summary: incorrect -Wunused-value warning on remquo with constant values https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101090 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE
[Bug c/106264] [10/11/12/13 Regression] spurious -Wunused-value on a folded frexp, modf, and remquo calls with unused result since r9-1295-g781ff3d80e88d7d0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106264 --- Comment #9 from Roger Sayle --- *** Bug 101090 has been marked as a duplicate of this bug. ***
[Bug c/101090] incorrect -Wunused-value warning on remquo with constant values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101090 Roger Sayle changed: What|Removed |Added Resolution|--- |DUPLICATE CC||roger at nextmovesoftware dot com Status|NEW |RESOLVED --- Comment #4 from Roger Sayle --- Many thanks to Vincent for spotting/confirming that his bug report is a duplicate of PR 106264, which was fixed in GCC 13. *** This bug has been marked as a duplicate of bug 106264 ***
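The pattern at issue is a call to remquo made only for its side effect of storing the partial quotient, with the returned remainder discarded. A minimal sketch of that shape (the function name is hypothetical; the PR's actual test case may differ):

```c
#include <math.h>

/* remquo is called for its side effect of storing the low bits of the
   rounded quotient in *quo; the returned remainder is deliberately
   ignored.  With constant arguments, older GCC folded the call and
   then spuriously warned that the folded value was unused.  */
int partial_quotient (void)
{
  int quo;
  remquo (10.0, 3.0, &quo);   /* result intentionally unused */
  return quo;
}
```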
[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587 Roger Sayle changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #11 from Roger Sayle --- My (upcoming) patch for PR88873 dramatically reduces the compile-time (with -O0) for this test case (by reducing the number of pseudos and reducing the number of reloads). But don't let that stop anyone from speeding up lra_final_code_change.
[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873 --- Comment #9 from Roger Sayle --- I'll check whether turning off the insvti_{low,high}part transformations during lra_in_progress helps compile-time. I believe every time reload encounters a TI<->SSE SUBREG, the spill/reload generates two or three additional instructions. I'm thinking that perhaps this should ideally be an UNSPEC, that we can split after reload. As shown in PR 88873, we'd like SSE->TI->SSE to avoid going via memory [where currently this happens twice]. It looks like "interval" in pr28071.c suffers from the same x86 ABI issues [i.e. is placed in scalar TImode, where ideally we'd like V2DI].
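The SSE->TI movement being discussed can be sketched at the source level as follows. This is an illustrative shape only (names and types are hypothetical, assuming a little-endian x86_64 target): a 128-bit value held in an SSE vector is reinterpreted as scalar TImode, which ought to be a register-to-register move but is currently spilled through memory.

```c
/* Sketch of the problem shape from PR 88873: reinterpreting a 128-bit
   SSE vector as scalar TImode.  Ideally a no-op move between register
   files; the current spill/reload path sends it via the stack.
   Assumes GCC vector extensions and __int128 on x86_64.  */
typedef unsigned long long v2di __attribute__ ((__vector_size__ (16)));
typedef unsigned __int128 uti;

uti vec_to_ti (v2di x)
{
  union { v2di v; uti t; } u;   /* type-pun 128 bits: SSE -> TImode */
  u.v = x;
  return u.t;
}
```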
[Bug target/110649] [14 Regression] 25% sphinx3 spec2006 regression on Ice Lake and zen between g:acaa441a98bebc52 (2023-07-06 11:36) and g:55900189ab517906 (2023-07-07 00:23)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110649 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #12 from Roger Sayle --- Hi Jan, I believe you also need to remove the profile_count entry_count = profile_count::zero (); from tree-ssa-loop-ivcanon.cc's try_peel_loop to avoid a bootstrap issue with -Werror "variable entry_count set but unused".
[Bug target/110598] [14 Regression] wrong code on llvm-14.0.6 due to memcmp being miscompiled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110598 Roger Sayle changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED Known to work||14.0 --- Comment #7 from Roger Sayle --- Many thanks to Sergei for confirming this issue is now resolved. Sorry again for the inconvenience.