[Bug sanitizer/115172] Invalid -fsanitize=bool sanitization of variable from named address space

2024-05-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115172

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |11.5

--- Comment #14 from Uroš Bizjak  ---
Also fixed for 11.5+ and 12.4+.

[Bug sanitizer/115172] Invalid -fsanitize=bool sanitization of variable from named address space

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115172

--- Comment #5 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #4)
> Created attachment 58261 [details]
> gcc15-pr115172.patch
> 
> Full untested patch.

I can confirm that this patch fixes boot for the kernel config from
PR115172#43.

[Bug sanitizer/115172] Invalid -fsanitize=bool sanitization of variable from named address space

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115172

--- Comment #3 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #1)

> What the kernel does is terrible, why they just don't declare the extern
> with __seg_gs attribute?

This is how the kernel currently handles percpu variables. They are redefined to a
named address space just before use, because the percpu infrastructure can't
handle AS-qualified variables, mainly due to the extensive use of __typeof__;
please see [1].

The solution is to use __typeof_unqual__ with gcc-14. The prototype patch at
[2] declares percpu variables as __seg_gs at declaration time, as you proposed
above.

[1] https://lore.kernel.org/lkml/87bk7ux4e9.ffs@tglx/

[2]
https://lore.kernel.org/lkml/cafuld4z-sthtu2uwv02s+nbx51qqytguo8zew50fc_pbsff...@mail.gmail.com/
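
For illustration, here is a minimal sketch (hypothetical variable name, not taken
from the kernel sources) of why __typeof__ chokes on an AS-qualified percpu
variable and how __typeof_unqual__ avoids it:

--cut here--
/* Sketch only; requires gcc-14 on x86-64.  */
extern int __seg_gs this_cpu_counter;

int read_counter (void)
{
  /* __typeof__ (this_cpu_counter) tmp;  -- the type would keep the __seg_gs
     qualifier, and an automatic variable cannot live in a named AS.  */
  __typeof_unqual__ (this_cpu_counter) tmp;  /* plain int, AS stripped */

  tmp = this_cpu_counter;  /* the load itself still uses the %gs segment */
  return tmp;
}
--cut here--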

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #50 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #49)
> Will do in a moment.

PR 115172

[Bug sanitizer/115172] New: Invalid -fsanitize=bool sanitization of variable from named address space

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115172

Bug ID: 115172
   Summary: Invalid -fsanitize=bool sanitization of variable from
named address space
   Product: gcc
   Version: 14.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org
  Target Milestone: ---

Created attachment 58260
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58260&action=edit
Preprocessed file

Originally reported in PR 111736, comment 42.

Compiling the attached preprocessed file with:

gcc -O2 -fsanitize=kernel-address -fasan-shadow-offset=0xdc00
--param asan-instrumentation-with-call-threshold=1 -fsanitize=bool -S
alternative.i

results in:

movabsq $-2305847407260205056, %rdx
movl    $cpu_tlbstate_shared, %eax
shrq    $3, %rax
movzbl  (%rax,%rdx), %eax
testb   %al, %al
je  .L399
jle .L473
.L399:
movzbl  %gs:cpu_tlbstate_shared(%rip), %r14d
cmpb    $1, %r14b

which is wrong. %gs: prefixed addresses should not be sanitized.

Omitting -fsanitize=bool from the above compiles the preprocessed file to:

movzbl  %gs:cpu_tlbstate_shared(%rip), %eax
testb   %al, %al

where no sanitization is applied to the above variable.
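
For reference, a minimal sketch of the kind of access involved (the variable name
is taken from the assembly above; the declaration shape is an assumption, not the
attached preprocessed file):

--cut here--
/* Sketch only, not the attached testcase.  */
extern __seg_gs struct tlb_state_shared {
  _Bool is_lazy;
} cpu_tlbstate_shared;

int poke (void)
{
  /* With -fsanitize=bool this load of a _Bool gets instrumented even though
     the address is %gs-relative and has no shadow memory.  */
  return cpu_tlbstate_shared.is_lazy;
}
--cut here--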

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Uroš Bizjak  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |FIXED

--- Comment #49 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #48)
> Would have been nice to have a new bugreport for -fsanitize=bool.

Will do in a moment.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #47 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #46)
> Created attachment 58259 [details]
> Preprocessed file
> 
> gcc -O2 -fsanitize=kernel-address -fasan-shadow-offset=0xdc00
> --param asan-instrumentation-with-call-threshold=1 -fsanitize=bool -S
> alternative.i

Omitting -fsanitize=bool compiles to:

movzbl  %gs:cpu_tlbstate_shared(%rip), %eax
testb   %al, %al

where no sanitization is applied to the above variable.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #46 from Uroš Bizjak  ---
Created attachment 58259
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58259&action=edit
Preprocessed file

gcc -O2 -fsanitize=kernel-address -fasan-shadow-offset=0xdc00
--param asan-instrumentation-with-call-threshold=1 -fsanitize=bool -S
alternative.i

results in:

movabsq $-2305847407260205056, %rdx
movl    $cpu_tlbstate_shared, %eax
shrq    $3, %rax
movzbl  (%rax,%rdx), %eax
testb   %al, %al
je  .L399
jle .L473
.L399:
movzbl  %gs:cpu_tlbstate_shared(%rip), %r14d
cmpb    $1, %r14b

which is wrong. %gs: prefixed addresses should not be sanitized.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-05-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Uroš Bizjak  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #45 from Uroš Bizjak  ---
Yes, I can confirm the Oops due to sanitization of %gs: prefixed variable:

  307eee:   48 c7 c0 00 00 00 00    mov    $0x0,%rax
                        307ef1: R_X86_64_32S    cpu_tlbstate_shared
  307ef5:   48 ba 00 00 00 00 00    movabs $0xdc00,%rdx
  307efc:   fc ff df
  307eff:   48 c1 e8 03             shr    $0x3,%rax
  307f03:   0f b6 04 10             movzbl (%rax,%rdx,1),%eax
  307f07:   84 c0                   test   %al,%al
  307f09:   74 06                   je     307f11 <__text_poke+0x4a1>
  307f0b:   0f 8e f0 07 00 00       jle    308701 <__text_poke+0xc91>
  307f11:   65 44 0f b6 35 00 00    movzbl %gs:0x0(%rip),%r14d    # 307f1a <__text_poke+0x4aa>
  307f18:   00 00
                        307f16: R_X86_64_PC32   cpu_tlbstate_shared-0x4

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #17 from Uroš Bizjak  ---
(In reply to Haochen Jiang from comment #15)
> I am doing like this way. Suppose should be same as Comment 8.

Yes, but IMO the patch in Comment #8 better describes where the problem is.

Please note that without VPMOVWB we fall back to the original
ix86_expand_vecop_qihi, where the expansion is implemented in a different way.

(BTW: If there is a better way to emulate VPMOVWB, it should be implemented in
the vec-perm routines, where it will universally benefit this permutation. In that
case, the early exit introduced in the mentioned patch could be removed.)

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #13 from Uroš Bizjak  ---
(In reply to Haochen Jiang from comment #12)
> (In reply to Hongtao Liu from comment #11)
> > (In reply to Haochen Jiang from comment #10)
> > > A patch like Comment 8 could definitely solve the problem. But I need to
> > > test more benchmarks to see if there is surprise.
> > > 
> > > But, yes, as Uros said in Comment 9, maybe there is a chance we could do 
> > > it
> > > better.
> > 
> > Could you add "arch=skylake-avx512" to target_clones and try disable whole
> > ix86_expand_vecop_qihi2 to see if there's any performance improvement?
> > For x86, cross-lane permutation(truncation) is not very efficient(3-4 cycles
> > for both vpermq and vpmovwb).
> 
> When I disable/enable ix86_expand_vecop_qihi2 with arch=skylake-avx512 on
> trunk, there is no performance regression comparing to GCC13 + avx2.
> 
> It seems that the regression only happens when GCC14 + avx2.

This is what the patch in Comment #8 prevents. skylake-avx512 enables
TARGET_AVX512BW, so VPMOVWB is emitted instead of the problematic VPERMQ.

[Bug target/115146] [15 Regression] Incorrect 8-byte vectorization: psrlw/psraw confusion

2024-05-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115146

--- Comment #8 from Uroš Bizjak  ---
(In reply to Levy Hsu from comment #5)
> case E_V16QImode:
>   mode = V8HImode;
>   gen_shr = gen_vlshrv8hi3;
>   gen_shl = gen_vashlv8hi3;

Hm, why a vector-by-vector shift here? Shouldn't there be a call to gen_lshrv8hi3
and gen_ashlv8hi3 instead?

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #9 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #8)
> A better patch:

The real issue is that the following permutation (truncation):

+  for (i = 0; i < d.nelt; ++i)
+   d.perm[i] = i * 2;
+
+  ok = ix86_expand_vec_perm_const_1 (&d);

results in slow code involving VPERMQ. Ideally, ix86_expand_vec_perm_const_1
should emit faster code for this truncation, because that will benefit other code as
well.
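
As an illustration (not taken from the PR), the even-element selection that the
d.perm[i] = i * 2 loop describes corresponds to a generic-vector shuffle like the
sketch below, which can be compiled with -mavx2 to inspect how the truncation is
expanded:

--cut here--
/* Sketch only; assumes GCC 12+ for __builtin_shufflevector.  */
typedef unsigned char v32qi __attribute__ ((vector_size (32)));
typedef unsigned char v16qi __attribute__ ((vector_size (16)));

v16qi
take_even_bytes (v32qi x)
{
  /* Keep elements 0, 2, 4, ..., 30, i.e. perm[i] = i * 2.  */
  return __builtin_shufflevector (x, x,
                                  0, 2, 4, 6, 8, 10, 12, 14,
                                  16, 18, 20, 22, 24, 26, 28, 30);
}
--cut here--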

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2024-05-17

--- Comment #8 from Uroš Bizjak  ---
A better patch:

--cut here--
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 4e16aedc5c1..88bfc43201b 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -24493,6 +24493,10 @@ ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest,
rtx op1, rtx op2)
   bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT;
   bool uns_p = code != ASHIFTRT;

+  /* ??? VPERMQ is slow and VPMOVWB is only available under AVX512BW.  */
+  if (!TARGET_AVX512BW)
+return false;
+
   if ((qimode == V16QImode && !TARGET_AVX2)
   || (qimode == V32QImode && (!TARGET_AVX512BW || !TARGET_EVEX512))
   /* There are no V64HImode instructions.  */
--cut here--

[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069

--- Comment #7 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #5)
> (In reply to Krzysztof Kanas from comment #4)
> > I bisected the issue and it seems that commit
> > 0368fc54bc11f15bfa0ed9913fd0017815dfaa5d introduces regression.
> 
> I guess the real guilty commit is 
> 
> commit 52ff3f7b863da1011b73c0ab3b11f6c78b6451c7
> Author: Uros Bizjak 
> Date:   Thu May 25 19:40:26 2023 +0200
>  
> i386: Use 2x-wider modes when emulating QImode vector instructions
>  
> Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI ->
> V16HImode)
> instructions when available.  Currently, the compiler generates following
> assembly for V16QImode multiplication (-mavx2):

The patch is at:

https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619715.html

As mentioned in Comment #3, it looks like VPERMQ is a problematic insn. This
should be reflected in some cost function. Alternatively, we can simply change
the first line in:

+  if ((qimode == V16QImode && !TARGET_AVX2)
+  || (qimode == V32QImode && !TARGET_AVX512BW)
+  /* There are no V64HImode instructions.  */
+  || qimode == V64QImode)
+ return false;

to check "qimode == V16QImode && !TARGET_AVX512VL" to avoid VPERMQ:

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 4e16aedc5c1..450035ea9e6 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -24493,7 +24493,7 @@ ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest,
rtx op1, rtx op2)
   bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT;
   bool uns_p = code != ASHIFTRT;

-  if ((qimode == V16QImode && !TARGET_AVX2)
+  if ((qimode == V16QImode && !TARGET_AVX512VL)
   || (qimode == V32QImode && (!TARGET_AVX512BW || !TARGET_EVEX512))
   /* There are no V64HImode instructions.  */
   || qimode == V64QImode)

[Bug target/114942] [14/15 Regression] ICE on valid code at -O1 with "-fno-tree-sra -fno-guess-branch-probability": in extract_constrain_insn, at recog.cc:2713

2024-05-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114942

Uroš Bizjak  changed:

   What|Removed |Added

   Keywords||ra

--- Comment #3 from Uroš Bizjak  ---
Reload starts with:

(insn 39 38 48 6 (parallel [
(set (strict_low_part (subreg:QI (reg/v:HI 108 [ f ]) 0))
(ior:QI (subreg:QI (zero_extract:HI (reg/v:HI 108 [ f ])
(const_int 8 [0x8])
(const_int 8 [0x8])) 0)
(reg:QI 121 [ _7 ])))
(clobber (reg:CC 17 flags))
]) "pr114942.c":19:7 626 {*iorqi_exthi_1_slp}
 (expr_list:REG_DEAD (reg:QI 121 [ _7 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil

  Choosing alt 1 in insn 39:  (0) &Q  (1) !qm  (2) Q {*iorqi_exthi_1_slp}

and then allocates:

(insn 39 56 57 6 (parallel [
(set (strict_low_part (reg:QI 2 cx [orig:108 f ] [108]))
(ior:QI (subreg:QI (zero_extract:HI (reg/v:HI 2 cx [orig:108 f
] [108])
(const_int 8 [0x8])
(const_int 8 [0x8])) 0)
(reg:QI 0 ax [orig:121 _7 ] [121])))
(clobber (reg:CC 17 flags))
]) "pr114942.c":19:7 626 {*iorqi_exthi_1_slp}

not taking into account the earlyclobber of operand 0.

[Bug target/114942] [14/15 Regression] ICE on valid code at -O1 with "-fno-tree-sra -fno-guess-branch-probability": in extract_constrain_insn, at recog.cc:2713

2024-05-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114942

--- Comment #2 from Uroš Bizjak  ---
This is the insn in question:

;; Alternative 1 is needed to work around LRA limitation, see PR82524.
 (define_insn_and_split "*qi_ext_1_slp"
   [(set (strict_low_part (match_operand:QI 0 "register_operand" "+Q,"))
 (any_logic:QI
   (subreg:QI
 (match_operator:SWI248 3 "extract_operator"
   [(match_operand 2 "int248_register_operand" "Q,Q")
(const_int 8)
(const_int 8)]) 0)
   (match_operand:QI 1 "nonimmediate_operand" "0,!qm")))
(clobber (reg:CC FLAGS_REG))]

When targeting alternative 1, reload should use some other register for operand
2.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-04-24 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|13.3|11.5
 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #41 from Uroš Bizjak  ---
Fixed everywhere.

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #14 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #13)
> Created attachment 58013 [details]
> gcc14-pr114810.patch
> 
> So like this?  Tried hard to reduce the testcase, but it didn't progress at
> all, so at least tried manually using cvise's clex to rename tokens a little
> bit.

LGTM.

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #12 from Uroš Bizjak  ---
(In reply to Vladimir Makarov from comment #9)
> (In reply to Uroš Bizjak from comment #7)
> >
> > 
> > Please note that the insn is defined as:
> > 
> > (define_insn_and_split "*andn<dwi>3_doubleword_bmi"
> >   [(set (match_operand:<DWI> 0 "register_operand" "=&r,r,r")
> >         (and:<DWI>
> >           (not:<DWI> (match_operand:<DWI> 1 "register_operand" "r,0,r"))
> >           (match_operand:<DWI> 2 "nonimmediate_operand" "ro,ro,0")))
> >    (clobber (reg:CC FLAGS_REG))]
> > 
> > where the problematic alternative (=&r,r,ro) allows a memory input in its
> > operand 2 constraint. The allocator could spill a DImode value to a stack in
> > advance and reload the value from the memory in this particular alternative.
> 
> That is not how LRA (and the old reload) works.  If an operand matches the
> constraint (r in ro), it does not change its location (do reloads).
> 
> In general, it is possible to implement reloads for operands already matched
> to a constraint but this would significantly complicate already too
> complicated code.  And probably heuristics based on reload costs would
> reject such reloads anyway.
> 
> I probably could implement reg starvation recognition in process_alt_operand
> and penalize the alternative and most probably it will not affect other
> targets.  Still it is not easy because of different possible class subsets
> or intersections.
> 
> Still I think Jakub's solution is reasonable at this stage.  If I implement
> my proposed solution we could commit it after the release.

Yes, I agree. Not knowing the internals of the RA, it looked "obvious" that the RA
could use a memory operand here, and using this solution in the
target-independent code could solve the issue also for other register-starved
targets.

Another long-term improvement in the RA could be allocating multi-regs in a
random order. As far as x86 is concerned, apart from passing function
parameters in registers and perhaps the "A" asm constraint, there is no need
for double-word registers to be allocated in any specific order. The
double-word value could be allocated in a {hi,lo} register tuple, where hi and
lo can be any register. Using this approach, the RA could allocate a
double-word value also in the [01]2[34]5 situation.

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #11 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #8)
> (In reply to Uroš Bizjak from comment #7)
> > (define_insn_and_split "*andn<dwi>3_doubleword_bmi"
> >   [(set (match_operand:<DWI> 0 "register_operand" "=&r,r,r")
> >         (and:<DWI>
> >           (not:<DWI> (match_operand:<DWI> 1 "register_operand" "r,0,r"))
> >           (match_operand:<DWI> 2 "nonimmediate_operand" "ro,ro,0")))
> >    (clobber (reg:CC FLAGS_REG))]
> > 
> > where the problematic alternative (=&r,r,ro) allows a memory input in its
> > operand 2 constraint. The allocator could spill a DImode value to a stack in
> > advance and reload the value from the memory in this particular alternative.
> 
> So, given the known ia32 register starvation, can't we split that first
> alternative to
> =&r,r,o with "nox64" isa and =&r,r,ro with "x64" isa?

Yes, IMO this is an acceptable workaround, but please split the constraint to
(=&r,r,r) and (=&r,r,o), with the former limited to the "x64" isa. This is what the
other patterns do; a new mode attribute just hides the obvious fact.

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #7 from Uroš Bizjak  ---
(In reply to Vladimir Makarov from comment #6)
> The problem is that the alternative assumes 3 DI values live simultaneously.
> This means 6 regs and we have only 6 available ones. One input reg is
> assigned to 0 another one is to 3.  So we have [01]2[34]5, where regs in
> brackets are taken by the operands.  Although there are still 2 regs but
> they can not be used as they are not adjacent.
> 
> The one solution is to somehow penalize the chosen alternative by changing
> alternative heuristics in lra-constraints.cc.  But it definitely can affect
> other targets in some unpredicted way.  So the solution is too risky
> especially at this stage.  Also it might be possible that there is no
> alternative with less 3 living pseudos for some different insn case.
> 
> I don't see non-risky solution right now.  I'll be thinking how to better
> fix this.

Please note that the insn is defined as:

(define_insn_and_split "*andn3_doubleword_bmi"
  [(set (match_operand: 0 "register_operand" "=,r,r")
(and:
  (not: (match_operand: 1 "register_operand" "r,0,r"))
  (match_operand: 2 "nonimmediate_operand" "ro,ro,0")))
   (clobber (reg:CC FLAGS_REG))]

where the problematic alternative (=&r,r,ro) allows a memory input in its
operand 2 constraint. The allocator could spill a DImode value to a stack in
advance and reload the value from the memory in this particular alternative.
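
For context, a doubleword andn that goes through this pattern with -m32 -mbmi can
be as small as the following sketch (not the PR testcase, which needs much more to
exhaust the registers):

--cut here--
/* Sketch only; compile with -m32 -O2 -mbmi.  The DImode ~x & y is matched
   by the doubleword andn pattern and split after reload.  */
unsigned long long
andn64 (unsigned long long x, unsigned long long y)
{
  return ~x & y;
}
--cut here--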

[Bug rtl-optimization/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -msta

2024-04-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

Uroš Bizjak  changed:

   What|Removed |Added

   Keywords||ra
  Component|target  |rtl-optimization

--- Comment #5 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #4)

> the compilation succeeds, and a spill to memory is emitted:

I think RA should emit a similar spill with the original instruction when
alternative (=&r,r,ro) is used. There is nothing wrong with the original insn
definition.

Re-confirmed as an RA problem.

[Bug target/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -mstackrealign

2024-04-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

--- Comment #4 from Uroš Bizjak  ---
An interesting observation: when the insn is defined with only the problematic
alternative:

(define_insn_and_split "*andn3_doubleword_bmi"
  [(set (match_operand: 0 "register_operand" "=")
(and:
  (not: (match_operand: 1 "register_operand" "r"))
  (match_operand: 2 "nonimmediate_operand" "ro")))
   (clobber (reg:CC FLAGS_REG))]

the compilation succeeds, and a spill to memory is emitted:


(insn 1170 65 1177 7 (set (mem/c:DI (plus:SI (reg/f:SI 6 bp)
(const_int -168 [0xff58])) [71 %sfp+-144 S8 A64])
(reg:DI 0 ax [orig:217 _13 ] [217])) "pr114810.C":296:36 84
{*movdi_internal}
 (nil))

...

(insn 987 1154  7 (parallel [
(set (reg:DI 3 bx [453])
(and:DI (not:DI (reg:DI 0 ax [452]))
(mem/c:DI (plus:SI (reg/f:SI 6 bp)
(const_int -168 [0xff58])) [71
%sfp+-144 S8 A64])))
(clobber (reg:CC 17 flags))
]) "pr114810.C":296:6 703 {*andndi3_doubleword_bmi}
 (nil))

[Bug target/114810] [14 Regression] internal compiler error: in lra_split_hard_reg_for, at lra-assigns.cc:1868 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -m32 -mstackrealign

2024-04-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114810

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-04-22
 Ever confirmed|0   |1

--- Comment #3 from Uroš Bizjak  ---
It is alternative 0 (=&r,r,ro) that causes the spill failure. The following
definition works OK:

(define_insn_and_split "*andn3_doubleword_bmi"
  [(set (match_operand: 0 "register_operand" "=r,r")
(and:
  (not: (match_operand: 1 "register_operand" "0,r"))
  (match_operand: 2 "nonimmediate_operand" "ro,0")))
   (clobber (reg:CC FLAGS_REG))]

and compiles to:

(insn 1150 1108 987 7 (set (reg:DI 2 cx [453])
(reg:DI 3 bx [452])) "pr114810.C":296:6 84 {*movdi_internal}
 (nil))
(insn 987 1150 1151 7 (parallel [
(set (reg:DI 2 cx [453])
(and:DI (not:DI (reg:DI 2 cx [453]))
(reg:DI 0 ax [orig:217 _13 ] [217])))
(clobber (reg:CC 17 flags))
]) "pr114810.C":296:6 703 {*andndi3_doubleword_bmi}
 (nil))

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #13 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #12)
> short a;
> short c;
> short d;
> void
> foo (short b, short f)
> {
>   c = b + a;
>   d = f + a;
> }
> 
> foo(short, short):
> addw    a(%rip), %di
> addw    a(%rip), %si
> movw    %di, c(%rip)
> movw    %si, d(%rip)
> ret
> 
> this one is bad since gcc10.1 and there's no subreg, The problem is if the
> operand is used by more than 1 insn, and they all support separate m
> constraint, mem_cost is quite small(just 1, reg move cost is 2), and this
> makes RA more inclined to propagate memory across insns. I guess RA assumes
> the separate m means the insn only support memory_operand?

I don't see this as problematic. IIRC, there was a discussion in the past that
a couple (two?) of memory accesses from the same location close to each other can
be faster (so -O2, not -Os) than preloading the value to a register first.

In contrast, the example from Comment #11 already has the correct value in
%eax, so there is no need to reload it again from memory, even in a narrower
mode.

[Bug rtl-optimization/112560] [14 Regression] ICE in try_combine on pr112494.c

2024-04-11 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

--- Comment #13 from Uroš Bizjak  ---
(In reply to Segher Boessenkool from comment #12)
> You cannot use a :CC value as argument of an unspec, as explained before.
> 
> The result of a comparison is expressed as a comparison, in RTL.  This patch
> allows malformed RTL in more places than before, not progress at all.

During our discussion we determined that this form with UNSPEC is actually a
copy operation, so it is not a use [1] of the CC register, because a "use" is in the
form of cc-compared-to-0.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647426.html

Now, let's see the part my patch fixes. The original change that introduced the
functionality (see Comment #1) updates the "use" of the CC register. It assumes
exclusively cc-compared-to-0 uses, but there are several patterns in various
target .md files that use a naked CC register. The "???" comment suggests that
the transformation tripped on this, and thus the "verify the zero_rtx" check was
bolted on. The zero_rtx is an inherent part of the regular CC reg "use", so this
addition _mostly_ weeded out invalid accesses with a naked CC reg.

If the CC reg is used as the source of a copy operation ("move"), with or without
UNSPEC, then the unpatched compiler will blindly use:

SUBST (XEXP (*cc_use_loc, 0), newpat_dest);

which *assumes* a certain form of the changed expression. A failed assumption
will lead to memory corruption, and this is what my patch prevents.

The patched compiler will find the sole use of the naked CC reg (via
find_single_use) in the RTX and update its mode at the right place. If the new
mode is not recognized by the insn pattern, then the combination is rejected.

In any case, we are trading silent memory corruption for a failed combine
attempt. In my rule book, this is "progress" in bold, capital letters.

[Bug rtl-optimization/112560] [14 Regression] ICE in try_combine on pr112494.c

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

--- Comment #11 from Uroš Bizjak  ---
(In reply to Segher Boessenkool from comment #10)
> It is still wrong.  You're trying to sweep tour wrong assumptions under the
> rug,
> but they will only rear up elsewhere.  Just fix the actual *target* problem!

I can't see what could be wrong with:

(define_insn "@pushfl2"
  [(set (match_operand:W 0 "push_operand" "=<")
(unspec:W [(match_operand 1 "flags_reg_operand")]
  UNSPEC_PUSHFL))]
  "GET_MODE_CLASS (GET_MODE (operands[1])) == MODE_CC"
  "pushf{}"
  [(set_attr "type" "push")
   (set_attr "mode" "")])

it is just a push of the flags reg to the stack. If the push can't be
described in this way, then the middle end is at fault; we can't
just change modes at will.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #10 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #5)
> > My experience is memory cost for the operand with rm or separate r, m is
> > different which impacts RA decision.
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
> 
> Change operands[1] alternative 2 from m -> rm, then RA makes perfect
> decision.

Yes, I can confirm this oddity:

movl    v1(%rip), %edx  # 5     [c=6 l=6]  *zero_extendsidi2/3
movq    %rdx, v2(%rip)  # 16    [c=4 l=7]  *movdi_internal/5
movq    %rdx, %rax      # 18    [c=4 l=3]  *movdi_internal/3
ret                     # 21    [c=0 l=1]  simple_return_internal

But even here there is room for improvement. The last move can be eliminated by
allocating %eax in the first instruction.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #8 from Uroš Bizjak  ---
BTW: The reason for the original change:

 (define_insn "*movhi_internal"
-  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r ,r ,m ,*k,*k
,*r,*m,*k,?r,?v,*v,*v,*m")
-   (match_operand:HI 1 "general_operand"  "r
,rn,rm,rn,*r,*km,*k,*k,CBC,v, r, v, m, v"))]
+  [(set (match_operand:HI 0 "nonimmediate_operand"
+"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*v,*v,*v,m")
+   (match_operand:HI 1 "general_operand"
+"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C ,*v,m ,*v"))]

was that (r,r) overrides (r,rn) and (r,rm), so the latter two can be changed
(without introducing any side effect) to (r,n) and (r,m), since (reg,reg) is
always matched by the (r,r) constraint. The different treatment of the two
changed alternatives is confusing, to say the least.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #7 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #5)
> > My experience is memory cost for the operand with rm or separate r, m is
> > different which impacts RA decision.
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
> 
> Change operands[1] alternative 2 from m -> rm, then RA makes perfect
> decision.

Oh, you are also the author of the above patch ;)

Can you please take the issue from here and perhaps review other x86 patterns
for suboptimal constraints? I was always under the impression that rm and separate
"r,m" are treated in the same way...

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #6 from Uroš Bizjak  ---
LRA starts with this:

5: r98:SI=[`v1']
  REG_EQUIV [`v1']
6: [`v2']=zero_extend(r98:SI)
7: r101:HI=r98:SI#0
  REG_DEAD r98:SI
   12: ax:HI=r101:HI
  REG_DEAD r101:HI
   13: use ax:HI

then decides that:

  Removing equiv init insn 5 (freq=1000)
5: r98:SI=[`v1']
  REG_EQUIV [`v1']

and substitutes all follow-up usages of r98 with a memory access. In insn 6, we
have:

(mem/c:SI (symbol_ref:DI ("v1")))

while in insn 7 we have:

(mem/c:HI (symbol_ref:DI ("v1")))

It looks like the different modes of the memory read confuse LRA into not CSEing
the read.

IMO, if the preloaded value is later accessed in different modes, LRA should
leave it. Alternatively, LRA should CSE memory accesses in different modes.
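
A small testcase consistent with the RTL above (a sketch, not necessarily the
exact PR testcase) would be:

--cut here--
/* Sketch reconstructed from the RTL dump: v1 is read once in SImode (for the
   zero-extending store to v2) and once in HImode (for the narrow return).  */
unsigned int v1;
unsigned long long v2;

unsigned short
foo (void)
{
  v2 = v1;
  return v1;
}
--cut here--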

Cc LRA expert ... oh, he already is in the loop.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #3 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #2)
> This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6

Strange, if I revert the constraints to the previous setting with: 

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 10ae3113ae8..262dd25a8e0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2870,9 +2870,9 @@ (define_peephole2

 (define_insn "*movhi_internal"
   [(set (match_operand:HI 0 "nonimmediate_operand"
-"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
+"=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
(match_operand:HI 1 "general_operand"
-"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]
+"r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r  ,C  ,v,m ,x,v"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
&& ix86_hardreg_mov_ok (operands[0], operands[1])"
 {
--cut here--

I still get:

movl    v1(%rip), %eax  # 6     [c=6 l=6]  *zero_extendsidi2/3
movq    %rax, v2(%rip)  # 16    [c=4 l=7]  *movdi_internal/5
movzwl  v1(%rip), %eax  # 7     [c=5 l=7]  *movhi_internal/2

[Bug rtl-optimization/112560] [14 Regression] ICE in try_combine on pr112494.c

2024-04-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

--- Comment #9 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #8)
> Fixed I suppose.

Yes - I plan to backport the patch to at least gcc-13.

[Bug target/114639] [riscv] ICE in create_pre_exit, at mode-switching.cc:451

2024-04-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114639

--- Comment #9 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #2)
> /* If we didn't see a full return value copy, verify that there
>is a plausible reason for this.  If some, but not all of the
>return register is likely spilled, we can expect that there
>is a copy for the likely spilled part.  */

This part of the mode-switching pass is a real PITA. The trick here is the
calculation of forced_late_switch (but please see the N.b. comment at the beginning
of the function, where some failed assumptions are described).

[Bug middle-end/114547] comparison of less than 0 (or greater than or equal to 0) after a subtraction does not use the flags register

2024-04-02 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114547

Uroš Bizjak  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||ubizjak at gmail dot com
   Last reconfirmed||2024-04-02
 Status|UNCONFIRMED |NEW

--- Comment #3 from Uroš Bizjak  ---
A similar testcase with comparison to zero:

int z(int *v, int n) {
*v -= n;
return *v == 0;
}

int nz(int *v, int n) {
*v -= n;
return *v != 0;
}

generates expected code:

z:
xorl    %eax, %eax
subl    %esi, (%rdi)
sete    %al
ret

nz:
xorl    %eax, %eax
subl    %esi, (%rdi)
setne   %al
ret

The middle end expands via standard sequence:

(insn 10 9 11 (set (reg:CCZ 17 flags)
(compare:CCZ (reg:SI 83 [ _2 ])
(const_int 0 [0]))) "s.c":3:12 -1
 (nil))

(insn 11 10 12 (set (reg:QI 91)
(eq:QI (reg:CCZ 17 flags)
(const_int 0 [0]))) "s.c":3:12 -1
 (nil))

(insn 12 11 13 (set (reg:SI 90)
(zero_extend:SI (reg:QI 91))) "s.c":3:12 -1
 (nil))

(insn 13 12 14 (set (reg:SI 85 [  ])
(reg:SI 90)) "s.c":3:12 -1
 (nil))

OTOH, the sign compare is expanded via:

(insn 12 9 13 (parallel [
(set (reg:SI 90)
(lshiftrt:SI (reg:SI 83 [ _2 ])
(const_int 31 [0x1f])))
(clobber (reg:CC 17 flags))
]) "pr114547.c":4:12 -1
 (nil))

(insn 13 12 14 (set (reg:SI 85 [  ])
(reg:SI 90)) "pr114547.c":4:12 -1
 (nil))

(The above shift also interferes with RMW creation, resulting in a suboptimal asm
sequence with two extra moves, and an additional NEG insn in the "ns" case.)

Middle-end expansion should avoid premature optimization in this case, at least
for targets that can merge the sign comparison with the arith instruction.
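
For completeness, the sign-compare form from the bug title (a sketch of the
assumed shape, analogous to the z/nz functions above) is:

--cut here--
/* Sketch only: the < 0 test after the subtraction is expanded via the logical
   shift of the sign bit shown above instead of reusing the flags.  */
int s(int *v, int n) {
    *v -= n;
    return *v < 0;
}
--cut here--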

[Bug target/114487] ICE when building libsdl2 on -mfpmath=sse x86 with LTO

2024-03-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114487

--- Comment #4 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #3)
> (In reply to Uroš Bizjak from comment #2)
> > Adding -msse to the second compilation works OK, removing -mfpmath=sse from
> > the first compilation also works OK.
> 
> Which makes this PR a LTO reincarnation of PR66047.

Please see the FIXME in ix86_function_sseregparm:

  /* Refuse to produce wrong code when local function with SSE enabled
 is called from SSE disabled function.
 FIXME: We need a way to detect these cases cross-ltrans partition
 and avoid using SSE calling conventions on local functions called
 from function with SSE disabled.  For now at least delay the
 warning until we know we are going to produce wrong code.
 See PR66047  */

[Bug target/114487] ICE when building libsdl2 on -mfpmath=sse x86 with LTO

2024-03-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114487

--- Comment #3 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #2)
> Adding -msse to the second compilation works OK, removing -mfpmath=sse from
> the first compilation also works OK.

Which makes this PR a LTO reincarnation of PR66047.

[Bug target/114487] ICE when building libsdl2 on -mfpmath=sse x86 with LTO

2024-03-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114487

--- Comment #2 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #1)

> (insn 6 5 0 (set (reg/v:SF 99 [ gamma ])
> (reg:SF 20 xmm0)) "testautomation-testautomation_pixels.i":15:17 -1
>  (nil))
> 
> I'm not sure what's wrong - looks like a target issue to me.

We are working with SFmode, so -msse is enough to trigger the bug.

This is a known issue. GCC assumes that at least moves of all hard registers
work, which is not the case when LTO-compiling
testautomation-testautomation_pixels.i with SDL_test_fuzzer.o (which enables and
uses XMM registers via -msse -mfpmath=sse).

It looks to me like the compiler hits this part of function_value_32 when
LTO-compiling:

--cut here--
  /* Override FP return register with %xmm0 for local functions when
 SSE math is enabled or for functions with sseregparm attribute.  */
  if ((fn || fntype) && (mode == SFmode || mode == DFmode))
{
  int sse_level = ix86_function_sseregparm (fntype, fn, false);
  if (sse_level == -1)
{
  error ("calling %qD with SSE calling convention without "
 "SSE/SSE2 enabled", fn);
  sorry ("this is a GCC bug that can be worked around by adding "
 "attribute used to function called");
}
  else if ((sse_level >= 1 && mode == SFmode)
   || (sse_level == 2 && mode == DFmode))
regno = FIRST_SSE_REG;
}
--cut here--

Adding -msse to the second compilation works OK, removing -mfpmath=sse from the
first compilation also works OK.
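
A two-TU sketch of the kind of per-TU option mix described above (hypothetical
file names and flags; a sketch of the setup, not a verified reproducer):

--cut here--
/* fuzzer.c:  gcc -m32 -O2 -msse -mfpmath=sse -flto -c fuzzer.c  */
float sse_scale (float x) { return x * 2.0f; }

/* pixels.c:  gcc -m32 -O2 -flto -c pixels.c   (SSE not enabled here)  */
extern float sse_scale (float);
float pixel_gamma (float g) { return sse_scale (g); }

/* link:      gcc -m32 -O2 -flto pixels.o fuzzer.o
   After LTO, sse_scale can become local and get the SSE return value in
   %xmm0, while its caller sits in a partition compiled without SSE --
   the cross-ltrans situation the FIXME above describes.  */
--cut here--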

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-25 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #29 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #28)
> Created attachment 57807 [details]
> gcc14-pr111736-tsan.patch
> 
> Untested patch for tsan.

Yes, this patch fixes the failure for the Linux kernel.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-25 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #26 from Uroš Bizjak  ---
Do we also need to adjust TSAN? There is a bugreport that KCSAN does not work
correctly with the named address spaces.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #19 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #16)
> (In reply to Richard Biener from comment #13)
> > The original testcase is fixed, appearantly slapping 'extern' on the int
> > makes it not effective.
> > 
> > Possibly better amend the
> > 
> >   if (VAR_P (inner) && DECL_HARD_REGISTER (inner))
> > return;
> > 
> > check though.  As indicated my fix fixed only VAR_DECL cases, there's
> > still pointer-based accesses (MEM_REF) to consider.  So possibly even
> > the following is necessary
> 
> I must admit that to create the patch from Comment #11 I just mindlessly
> searched for DECL_THREAD_LOCAL_P in asan.cc and amended the location with
> ADDR_SPACE_GENERIC_P check.
> 
> However, ASAN should back off from annotating *any* gs: prefixed address. 
> 
> I'll test your patch from Comment #13 ASAP.

Weee, it works!

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel (entry_offset: 0x).
[0.00] Linux version 6.8.0-11485-ge1826833c3a9 (uros@localhost) (xgcc
(GCC) 14.0.1 20240321 (experimental) [master r14-9588-g415091f0909], GNU ld
version 2.40-14.fc39) #1 SMP PREEMPT_DYNAMIC Thu Mar 21 09:44:30 CET 2024
...

I have used slightly different patch:

--cut here--
diff --git a/gcc/asan.cc b/gcc/asan.cc
index cfe83106460..026d079a4a1 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -2755,6 +2755,9 @@ instrument_derefs (gimple_stmt_iterator *iter, tree t,
   if (VAR_P (inner) && DECL_HARD_REGISTER (inner))
 return;

+  if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (TREE_TYPE (inner
+return;
+
   poly_int64 decl_size;
   if ((VAR_P (inner)
|| (TREE_CODE (inner) == RESULT_DECL
--cut here--

Hard registers and named address spaces really have nothing in common.

IMO, the fixes here should be applied to all release branches. Running a
KASAN-sanitized kernel with the named AS is the ultimate test for this PR.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #16 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #13)
> The original testcase is fixed, appearantly slapping 'extern' on the int
> makes it not effective.
> 
> Possibly better amend the
> 
>   if (VAR_P (inner) && DECL_HARD_REGISTER (inner))
> return;
> 
> check though.  As indicated my fix fixed only VAR_DECL cases, there's
> still pointer-based accesses (MEM_REF) to consider.  So possibly even
> the following is necessary

I must admit that to create the patch from Comment #11 I just mindlessly
searched for DECL_THREAD_LOCAL_P in asan.cc and amended the location with
ADDR_SPACE_GENERIC_P check.

However, ASAN should back off from annotating *any* gs: prefixed address. 

I'll test your patch from Comment #13 ASAP.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Uroš Bizjak  changed:

   What|Removed |Added

   Priority|P3  |P1

--- Comment #12 from Uroš Bizjak  ---
P1, I would *really* like this PR to be fixed for gcc-14; the new Linux kernel
support for named address spaces depends on this fix.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #11 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #10)
> Huh, is this really fixed?

IMO, this patch is also needed:

--cut here--
diff --git a/gcc/asan.cc b/gcc/asan.cc
index cfe83106460..54dcc3a38db 100644
--- a/gcc/asan.cc
+++ b/gcc/asan.cc
@@ -2764,7 +2764,9 @@ instrument_derefs (gimple_stmt_iterator *iter, tree t,
   && poly_int_tree_p (DECL_SIZE (inner), _size)
   && known_subrange_p (bitpos, bitsize, 0, decl_size))
 {
-  if (VAR_P (inner) && DECL_THREAD_LOCAL_P (inner))
+  if (VAR_P (inner)
+ && (DECL_THREAD_LOCAL_P (inner)
+ || !ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (TREE_TYPE (inner)
return;
   /* If we're not sanitizing globals and we can tell statically that this
 access is inside a global variable, then there's no point adding
--cut here--

But unfortunately, it doesn't result in a bootable kernel.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #10 from Uroš Bizjak  ---
Huh, is this really fixed?

--cut here--
extern int __seg_gs m;

int foo (void)
{
  return m;
}

extern __thread int n;

int bar (void)
{
  return n;
}

extern int o;

int baz (void)
{
  return o;
}
--cut here--

gcc -O2 -fsanitize=address:


foo:
.LASANPC0:
.LFB0:
.cfi_startproc
movl    $m, %eax
movq    %rax, %rdx
andl    $7, %eax
shrq    $3, %rdx
addl    $3, %eax
movzbl  2147450880(%rdx), %edx
cmpb    %dl, %al
jl      .L2
testb   %dl, %dl
jne     .L13
.L2:
movl    %gs:m(%rip), %eax
ret
.L13:
pushq   %rax
.cfi_def_cfa_offset 16
movl    $m, %edi
call    __asan_report_load4
.cfi_endproc
.LFE0:
.size   foo, .-foo
.p2align 4
.globl  bar
.type   bar, @function

The memory access is still annotated with asan code.

I did test the patched gcc by building a kernel with named address spaces, but I'm
not sure I did it correctly anymore - I was not able to boot a recent -tip kernel
with KASAN and named address spaces enabled.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-19 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #22 from Uroš Bizjak  ---
Fixed also for TImode STV.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #18 from Uroš Bizjak  ---
When we split
(insn 37 36 38 10 (set (reg:DI 104 [ _18 ])
(mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 MEM[(struct
SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])) "test.C":22:42 84
{*movdi_internal}
 (expr_list:REG_EH_REGION (const_int -11 [0xfff5])

into

(insn 104 36 37 10 (set (subreg:V2DI (reg:DI 124) 0)
(vec_concat:V2DI (mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6
MEM[(struct SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])
(const_int 0 [0]))) "test.C":22:42 -1
(nil)))
(insn 37 104 105 10 (set (subreg:V2DI (reg:DI 104 [ _18 ]) 0)
(subreg:V2DI (reg:DI 124) 0)) "test.C":22:42 2024 {movv2di_internal}
 (expr_list:REG_EH_REGION (const_int -11 [0xfff5])
(nil)))

we must copy the REG_EH_REGION note to the first insn and split the block
after the newly added insn.  The REG_EH_REGION on the second insn will be
removed later since it no longer traps.

gcc/ChangeLog:

* config/i386/i386-features.cc
(general_scalar_chain::convert_op): Handle REG_EH_REGION note.
(convert_scalars_to_vector): Ditto.
* config/i386/i386-features.h (class scalar_chain): New
member control_flow_insns.

gcc/testsuite/ChangeLog:

* g++.target/i386/pr111822.C: New test.

[Bug testsuite/40130] using RUNTESTFLAGS="--target_board 'foo{-mxyz,-mjkl,}'" screws up ieee.exp (and possibly others?)

2024-03-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40130

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |4.5.0

--- Comment #11 from Uroš Bizjak  ---
Fixed long time ago.

[Bug rtl-optimization/112560] [14 Regression] ICE in try_combine on pr112494.c

2024-03-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

--- Comment #6 from Uroš Bizjak  ---
v3 patch at [1]

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647634.html

[Bug rtl-optimization/112560] [14 Regression] ICE in try_combine on pr112494.c

2024-03-11 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

Uroš Bizjak  changed:

   What|Removed |Added

  Attachment #56705|0   |1
is obsolete||

--- Comment #5 from Uroš Bizjak  ---
Created attachment 57666
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57666&action=edit
Proposed v2 patch

A new version of the patch is in testing.

This version handles a change of the mode of the CC reg outside a comparison. Now we
scan the RTX and change the mode of the CC reg at the proper place. We are
guaranteed by find_single_use that cc_use_loc can be non-NULL only when exactly
one user follows the combined comparison.

In case of an unsupported cc_use_insn the combination will be undone. To avoid a
combine failure, pushfl<mode>2 in i386.md is changed to accept all MODE_CC modes.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #13 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #12)

> > But I think, we could do better. Adding CC.
> 
> We sure could, but I doubt it's too important?  Maybe for Go/Ada.

Preloading stuff is simply loading from the same DImode address, so I'd think
that the EH_NOTE should be moved from the original insn to the new insn without
much trouble.

Please note that on x86_32 the split pass later splits the DImode memory access
into two SImode loads; this looks like a somewhat harder problem as far as EH notes
are concerned than the one above.

I'm not versed in this area, so I'll leave the fix to someone else.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Uroš Bizjak  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #11 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #10)
> The easiest fix would be to refuse applying STV to a insn that
> can_throw_internal () (that's an insn that has associated EH info).  Updating
> in this case would require splitting the BB or at least moving the now
> no longer throwing insn to the next block (along the fallthru edge).

This would be simply:

--cut here--
diff --git a/gcc/config/i386/i386-features.cc
b/gcc/config/i386/i386-features.cc
index 1de2a07ed75..90acb33db49 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -437,6 +437,10 @@ scalar_chain::add_insn (bitmap candidates, unsigned int
insn_uid,
   && !HARD_REGISTER_P (SET_DEST (def_set)))
 bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));

+  if (cfun->can_throw_non_call_exceptions
+  && can_throw_internal (insn))
+return false;
+
   /* ???  The following is quadratic since analyze_register_chain
  iterates over all refs to look for dual-mode regs.  Instead this
  should be done separately for all regs mentioned in the chain once.  */
--cut here--

But I think, we could do better. Adding CC.

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #9 from Uroš Bizjak  ---
The offending insn is emitted in general_scalar_chain::convert_op due to
preloading, but I have no idea how EH should be adjusted.  There is another
instance in timode_scalar_chain::convert_op.

emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0),
   gen_gpr_to_xmm_move_src (vmode, *op)),
  insn);

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-06 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
 Target||i686

--- Comment #29 from Uroš Bizjak  ---
Fixed.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-06 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

Uroš Bizjak  changed:

   What|Removed |Added

  Attachment #57614|0   |1
is obsolete||

--- Comment #27 from Uroš Bizjak  ---
Created attachment 57626
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57626&action=edit
Proposed v2 patch

New version in testing:
  - uses optimize_size to stabilize optab discovery
  - explicitly enables insn pattern for TARGET_SSE2

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #25 from Uroš Bizjak  ---
Created attachment 57614
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57614&action=edit
Proposed patch

Proposed patch that changes optimize_function_for_size_p (cfun) to
optimize_size.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #23 from Uroš Bizjak  ---
(In reply to Jan Hubicka from comment #21)
> Looking at the prototype patch, why need to change also the splitters?

Purely for implementation reasons: we check for a general resp. SSE register in
the operand predicates, so there is no need for additional insn constraints.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #19 from Uroš Bizjak  ---
(In reply to Jan Hubicka from comment #18)
> But the problem here is more that optab initializations happens only at
> the optimization_node changes and not if we switch from hot function to
> cold?

I think solving the optab init problem is a better solution than the target patch
from comment #10. Using optimize_function_for_size_p in the named pattern predicate
would avoid using the non-optimal pattern also in cold functions, and would be
preferable to using optimize_size.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #16 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #15)

> Seems various backends use e.g. optimize_size or !optimize_size or optimize
> > 0 etc. in
> insn-flags.h, so perhaps change optimize_function_for_size_p (cfun) to
> optimize_size?

Won't this cause a mismatch with insn patterns we split to? As said in Comment
#13, these have insn predicates that include optimize_function_for_size_p.

[Bug rtl-optimization/114211] [13 Regression] wrong code with -O -fno-tree-coalesce-vars since r13-1907

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114211

--- Comment #9 from Uroš Bizjak  ---
Noticed this in passing:

--> movq    %rcx, %rdx
addq    v(%rip), %rax
adcq    v+8(%rip), %rdx
vmovq   %rax, %xmm1
vpinsrq $1, %rdx, %xmm1, %xmm0

We could use %rcx instead of %rdx to eliminate the marked move. This is an
artefact of the post-reload split of *add3_doubleword. The _doubleword
patterns use only general regs, so they could be split before reload as well.
IIRC, there were some minor RA problems with the latter approach, but as years
passed and LRA improved, perhaps the split point can be moved before reload.

Something to try (again) for gcc-15. It is just a matter of using
ix86_pre_reload_split instead of reload_completed.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #13 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #12)
> Still, it would be nice to understand what changed
> optimize_function_for_size_p (cfun)
> after IPA.  Is something adjusting node->count or node->frequency?
> Otherwise it should just depend on the optimize_size flag which should not
> change...

The target-dependent issue is with the insn patterns we split to. These are
enabled with "!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun)",
as they are intended to be combined from the movstrict insn pattern (which can
FAIL).

So, the condition for the V2QI insn is now either !TARGET_PARTIAL_REG_STALL or
TARGET_SSE2 (where we disable alternative 0 for TARGET_PARTIAL_REG_STALL
targets, but still emit an SSE instruction).

Please note that while -march=i686 enables TARGET_PARTIAL_REG_STALL, it does
not enable SSE2, so the proposed solution won't have much impact. Also,
V2QImode is not used much, so I guess the proposed solution is the right
compromise.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2024-03-05
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #10 from Uroš Bizjak  ---
Created attachment 57612
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57612=edit
Prototype patch

Let's try this approach.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #9 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #8)
> > grep optimize_ insn-flags.h | wc -l
> 14
> 
> so it's not very many standard patterns that would be affected.  I'd say
> using these kind of flags on standard patterns is at least fragile?

You are right, only the recently added neg, add, sub, ashl, lshr and ashr
V2QImode standard patterns have optimize_function_for_size_p in their
condition.

Any suggestions on how to fix them?

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #5 from Uroš Bizjak  ---
Huh, it looks like optimize_function_for_size_p (cfun) is not stable during
LTO?!

Using:

--cut here--
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 2856ae6ffef..80114494b0b 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -2975,7 +2975,7 @@ (define_insn "v2qi3"
  (match_operand:V2QI 1 "register_operand" "0,0,Yw")
  (match_operand:V2QI 2 "register_operand" "Q,x,Yw")))
(clobber (reg:CC FLAGS_REG))]
-  "!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun)"
+  "!TARGET_PARTIAL_REG_STALL"
   "#"
   [(set_attr "isa" "*,sse2_noavx,avx")
(set_attr "type" "multi,sseadd,sseadd")
@@ -2987,7 +2987,7 @@ (define_split
  (match_operand:V2QI 1 "general_reg_operand")
  (match_operand:V2QI 2 "general_reg_operand")))
(clobber (reg:CC FLAGS_REG))]
-  "(!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun))
+  "!TARGET_PARTIAL_REG_STALL
&& reload_completed"
   [(parallel
  [(set (strict_low_part (match_dup 0))
@@ -3021,7 +3021,7 @@ (define_split
  (match_operand:V2QI 1 "sse_reg_operand")
  (match_operand:V2QI 2 "sse_reg_operand")))
(clobber (reg:CC FLAGS_REG))]
-  "(!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun))
+  "!TARGET_PARTIAL_REG_STALL
&& TARGET_SSE2 && reload_completed"
   [(set (match_dup 0)
 (plusminus:V16QI (match_dup 1) (match_dup 2)))]
--cut here--

So, removing the optimize-size bypass allows the testcase to compile
successfully.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #4 from Uroš Bizjak  ---
(In reply to Sam James from comment #0)

> (insn 160 159 161 26 (parallel [
> (set (reg:V2QI 250 [ vect_patt_207.470_183 ])
> (minus:V2QI (reg:V2QI 251)
> (reg:V2QI 249 [ vect__4.468_451 ])))
> (clobber (reg:CC 17 flags))
> ])

This is the definition of the offending pattern in mmx.md:

(define_insn "v2qi3"
  [(set (match_operand:V2QI 0 "register_operand" "=?Q,x,Yw")
(plusminus:V2QI
  (match_operand:V2QI 1 "register_operand" "0,0,Yw")
  (match_operand:V2QI 2 "register_operand" "Q,x,Yw")))
   (clobber (reg:CC FLAGS_REG))]
  "!TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun)"
  "#"
  [(set_attr "isa" "*,sse2_noavx,avx")
   (set_attr "type" "multi,sseadd,sseadd")
   (set_attr "mode" "QI,TI,TI")])

where -march=i686 (aka pentiumpro) implies TARGET_PARTIAL_REG_STALL, so the
pattern should be disabled unless optimizing for size. I wonder if
optimize_function_for_size_p is stable during the LTO compilation, but we have
plenty of uses like the above throughout the x86 .md files and no problems were
reported.

Another possibility is that the instruction RTX is emitted without checking the
pattern condition; the testcase is compiled with -O3, so
optimize_function_for_size_p should also be false.

I don't see anything wrong with the above pattern. The failure also happens
very early in the RTL part of the compilation (vregs is the first pass that
tries to recognize the pattern), so my bet is on the middle end emitting the
insn pattern without checking for pattern availability.

[Bug rtl-optimization/114211] [13/14 Regression] wrong code with -O -fno-tree-coalesce-vars

2024-03-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114211

Uroš Bizjak  changed:

   What|Removed |Added

  Component|target  |rtl-optimization
   Keywords|needs-bisection |

--- Comment #3 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #2)
> Possibly target independent rtl-optimization issue.

It is the _subreg1 pass that converts:

(insn 10 7 11 2 (set (reg/v:TI 106 [ h ])
(rotate:TI (reg/v:TI 106 [ h ])
(const_int 64 [0x40]))) "pr114211.c":9:5 1042
{rotl64ti2_doubleword}
 (nil))

to:

(insn 39 7 40 2 (set (reg:DI 128 [ h+8 ])
(reg:DI 127 [ h ])) "pr114211.c":9:5 84 {*movdi_internal}
 (nil))
(insn 40 39 11 2 (set (reg:DI 127 [ h ])
(reg:DI 128 [ h+8 ])) "pr114211.c":9:5 84 {*movdi_internal}
 (nil))

Well... this won't swap. Either a parallel should be emitted, or a temporary
should be used.
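
For illustration, a minimal C analogy (my own sketch, not taken from the dump)
of why this pair of moves cannot swap the two halves, and what a correct
sequence needs:

--cut here--
#include <stdint.h>

/* Roughly what insns 39/40 amount to: the first assignment clobbers the
   value the second one needs, so both halves end up with the old low part.  */
void
broken_swap (uint64_t *lo, uint64_t *hi)
{
  *hi = *lo;
  *lo = *hi;
}

/* A correct sequence needs a temporary (or a single parallel set).  */
void
correct_swap (uint64_t *lo, uint64_t *hi)
{
  uint64_t tmp = *lo;
  *lo = *hi;
  *hi = tmp;
}
--cut here--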

Adding -fno-split-wide-types fixes the testcase.

Re-confirmed as rtl-optimization problem.

[Bug target/113720] [14 Regression] internal compiler error: in extract_insn, at recog.cc:2812 targeting alpha-linux-gnu

2024-03-03 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113720

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |RESOLVED

--- Comment #8 from Uroš Bizjak  ---
Assuming fixed, please reopen if not.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #56 from Uroš Bizjak  ---
The testcase is fixed with g:430c772be3382134886db33133ed466c02efc71c

[Bug target/113871] psrlq is not used for PERM

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113871

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|--- |14.0
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Uroš Bizjak  ---
Implemented for gcc-14.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #55 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #53)
> Comment on attachment 57424 [details]
> Proposed testsuite patch
> 
> As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just avx512f
> effective target and testing it at runtime IMHO isn't enough.
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

I think that the addition of

+# if defined(__AVX512VL__)
+want_level = 7, want_b = bit_AVX512VL;
+# elif defined(__AVX512F__)
+want_level = 7, want_b = bit_AVX512F;
+# elif defined(__AVX2__)

to check_vect covers all current uses in gcc.dg/vect.
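
For illustration, a stand-alone sketch (my own, simplified) of how such leaf-7
feature bits are queried with GCC's cpuid.h; it checks only the CPUID bit, not
OS save-state support:

--cut here--
#include <cpuid.h>

/* Illustrative only: query the leaf-7/subleaf-0 EBX bit for AVX512F.  */
int
have_avx512f (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;

  return (ebx & bit_AVX512F) != 0;
}
--cut here--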

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #52 from Uroš Bizjak  ---
Created attachment 57424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57424=edit
Proposed testsuite patch

This patch fixes the failure for me (+ some other dg.exp/vect inconsistencies).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #48 from Uroš Bizjak  ---
The runtime testcase fails on non-AVX512F x86 targets due to:

/* { dg-do run } */
/* { dg-options "-O3" } */
/* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-*
i?86-*-* } } } */

but check_vect() only checks runtime support up to AVX2.

[Bug target/113871] psrlq is not used for PERM

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113871

Uroš Bizjak  changed:

   What|Removed |Added

  Attachment #57417|0   |1
is obsolete||

--- Comment #5 from Uroš Bizjak  ---
Created attachment 57419
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57419=edit
Proposed v2 patch

New version in testing, also handles 32-bit vectors.

[Bug target/113871] psrlq is not used for PERM

2024-02-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113871

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #4 from Uroš Bizjak  ---
Created attachment 57417
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57417=edit
Proposed patch

Patch in testing.

[Bug target/113720] [14 Regression] internal compiler error: in extract_insn, at recog.cc:2812 targeting alpha-linux-gnu

2024-02-03 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113720

--- Comment #5 from Uroš Bizjak  ---
(In reply to Matthias Klose from comment #4)
> Uros proposed patch lets the build succeed.

FTR, the problem was in the umuldi3_highpart expander, which did:

   if (REG_P (operands[2]))
 operands[2] = gen_rtx_ZERO_EXTEND (TImode, operands[2]);

on an operand with the register_operand predicate, which also allows a SUBREG
RTX. So subregs were emitted without the ZERO_EXTEND RTX.

But nowadays we have UMUL_HIGHPART that allows us to fix this issue while also
simplifying the instruction RTX.
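
For reference, a minimal C sketch (illustrative only, not taken from the
report) of the operation umuldi3_highpart implements, i.e. the upper 64 bits
of a 64x64->128-bit unsigned multiply:

--cut here--
/* Illustrative example; any 64-bit target with __int128 support.  */
unsigned long long
umulh (unsigned long long a, unsigned long long b)
{
  /* High part of the unsigned widening multiply.  */
  return (unsigned long long) (((unsigned __int128) a * b) >> 64);
}
--cut here--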

Matthias, can you please run the regression check? The fix is kind of obvious,
but just to be sure.

[Bug target/113720] [14 Regression] internal compiler error: in extract_insn, at recog.cc:2812 targeting alpha-linux-gnu

2024-02-02 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113720

--- Comment #1 from Uroš Bizjak  ---
Created attachment 57292
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57292=edit
Patch that introduces umul_highpart RTX

Please try the attached (untested) patch.

[Bug target/82580] Optimize comparisons for __int128 on x86-64

2024-02-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82580

Uroš Bizjak  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |FIXED

--- Comment #18 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #17)

> A different regression happens with pr82580.c, f0 function. Without the
> patch, the compiler generates:

Fixed by r14-8713-g44764984cf24e27cf7756cffd197283b9c62db8b

[Bug target/113701] Issues with __int128 argument passing

2024-02-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113701

--- Comment #4 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #2)

> > The most problematic function is f3, which regressed noticeably from 
> > gcc-12.3:
> 
> This patch solves the regression:
> 
> --cut here--
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index bac0a6ade67..02fed16db72 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -1632,11 +1632,6 @@ (define_insn_and_split "*cmp_doubleword"
>   (set (match_dup 4) (ior:DWIH (match_dup 4) (match_dup 5)))])]
>  {
>split_double_mode (mode, [0], 2, [0],
> [2]);
> -  /* Placing the SUBREG pieces in pseudos helps reload.  */
> -  for (int i = 0; i < 4; i++)
> -if (SUBREG_P (operands[i]))
> -  operands[i] = force_reg (mode, operands[i]);
> -
>operands[4] = gen_reg_rtx (mode);
>  
>/* Special case comparisons against -1.  */
> --cut here--

Roger, this part was added in [1]. Does this code address some real reload
issue not covered in the testsuite? The testsuite results do not show any
regression when the above code is removed.

BTW: The issue was noticed in gcc.target/i386/pr82580.c test case.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595860.html

[Bug target/113701] Issues with __int128 argument passing

2024-02-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113701

--- Comment #2 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #0)
> Following testcase:
> 
> --cut here--
> typedef unsigned __int128 U;
> 
> U f0 (U x, U y) { return x + y; }
> U f1 (U x, U y) { return x - y; }
> 
> U f2 (U x, U y) { return x | y; }
> 
> int f3 (U x, U y) { return x == y; }
> int f4 (U x, U y) { return x < y; }
> --cut here--
> 
> shows some issues with __int128 parameter passing.
> 
> gcc -O2:
> 
> f0:
> movq%rdx, %rax
> movq%rcx, %rdx
> addq%rdi, %rax
> adcq%rsi, %rdx
> ret
> 
> f1:
> xchgq   %rdi, %rsi
> movq%rdx, %r8
> movq%rsi, %rax
> movq%rdi, %rdx
> subq%r8, %rax
> sbbq%rcx, %rdx
> ret
> 
> f2:
> xchgq   %rdi, %rsi
> movq%rdx, %rax
> movq%rcx, %rdx
> orq %rsi, %rax
> orq %rdi, %rdx
> ret
> 
> f3:
> xchgq   %rdi, %rsi
> movq%rdx, %r8
> movq%rcx, %rax
> movq%rsi, %rdx
> movq%rdi, %rcx
> xorq%rax, %rcx
> xorq%r8, %rdx
> xorl%eax, %eax
> orq %rcx, %rdx
> sete%al
> ret
> 
> f4:
> xorl%eax, %eax
> cmpq%rdx, %rdi
> sbbq%rcx, %rsi
> setc%al
> ret
> 
> Functions f0 and f4 are now optimal.
> 
> Functions f1, f2 and f3 emit extra XCHG, but the swap should be propagated
> to MOV instructions instead.
> 
> The most problematic function is f3, which regressed noticeably from gcc-12.3:

This patch solves the regression:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index bac0a6ade67..02fed16db72 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1632,11 +1632,6 @@ (define_insn_and_split "*cmp_doubleword"
  (set (match_dup 4) (ior:DWIH (match_dup 4) (match_dup 5)))])]
 {
   split_double_mode (mode, [0], 2, [0], [2]);
-  /* Placing the SUBREG pieces in pseudos helps reload.  */
-  for (int i = 0; i < 4; i++)
-if (SUBREG_P (operands[i]))
-  operands[i] = force_reg (mode, operands[i]);
-
   operands[4] = gen_reg_rtx (mode);

   /* Special case comparisons against -1.  */
--cut here--

gcc -O2:

f3:
xchgq   %rdi, %rsi
xorl%eax, %eax
xorq%rsi, %rdx
xorq%rdi, %rcx
orq %rcx, %rdx
sete%al
ret

[Bug target/113701] Issues with __int128 argument passing

2024-02-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113701

Uroš Bizjak  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #1 from Uroš Bizjak  ---
CC added.

[Bug target/113701] New: Issues with __int128 argument passing

2024-02-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113701

Bug ID: 113701
   Summary: Issues with __int128 argument passing
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

Following testcase:

--cut here--
typedef unsigned __int128 U;

U f0 (U x, U y) { return x + y; }
U f1 (U x, U y) { return x - y; }

U f2 (U x, U y) { return x | y; }

int f3 (U x, U y) { return x == y; }
int f4 (U x, U y) { return x < y; }
--cut here--

shows some issues with __int128 parameter passing.

gcc -O2:

f0:
movq%rdx, %rax
movq%rcx, %rdx
addq%rdi, %rax
adcq%rsi, %rdx
ret

f1:
xchgq   %rdi, %rsi
movq%rdx, %r8
movq%rsi, %rax
movq%rdi, %rdx
subq%r8, %rax
sbbq%rcx, %rdx
ret

f2:
xchgq   %rdi, %rsi
movq%rdx, %rax
movq%rcx, %rdx
orq %rsi, %rax
orq %rdi, %rdx
ret

f3:
xchgq   %rdi, %rsi
movq%rdx, %r8
movq%rcx, %rax
movq%rsi, %rdx
movq%rdi, %rcx
xorq%rax, %rcx
xorq%r8, %rdx
xorl%eax, %eax
orq %rcx, %rdx
sete%al
ret

f4:
xorl%eax, %eax
cmpq%rdx, %rdi
sbbq%rcx, %rsi
setc%al
ret

Functions f0 and f4 are now optimal.

Functions f1, f2 and f3 emit extra XCHG, but the swap should be propagated to
MOV instructions instead.

The most problematic function is f3, which regressed noticeably from gcc-12.3:

f3:
xorq%rdx, %rdi
xorq%rcx, %rsi
xorl%eax, %eax
orq %rsi, %rdi
sete%al
ret

[Bug target/113609] EQ/NE comparison between avx512 kmask and -1 can be optimized with kxortest with checking CF.

2024-01-25 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113609

--- Comment #2 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #1)
> Since they're different modes, CCZ for cmp, but CCS for kortest, it could be
> difficult to optimize it at the RA stage by adding alternatives (like we did
> for compared to 0). So the easy way could be adding a peephole to handle that.

You can use a pre-reload split for this. Please see for example how *jcc_bt
and *bt_setcqi provide a compound operation (constructed from a CCZmode
compare) that is later split to a CCCmode operation. You will have to provide
jcc and setcc patterns to fully handle the mode change.
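
For reference, a minimal source-level example (my own illustration, not taken
from the PR) of the comparison the PR title describes, where the kmask is
compared against all-ones and kortest's CF could be checked instead:

--cut here--
#include <immintrin.h>

/* Illustrative testcase; compile with -O2 -mavx512f.  Returns nonzero iff
   all 16 dword elements compare equal, i.e. the kmask is all-ones.  */
int
all_equal (__m512i a, __m512i b)
{
  __mmask16 k = _mm512_cmpeq_epi32_mask (a, b);
  return k == (__mmask16) -1;
}
--cut here--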

[Bug target/82580] Optimize comparisons for __int128 on x86-64

2024-01-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82580

Uroš Bizjak  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #17 from Uroš Bizjak  ---
(In reply to Roger Sayle from comment #16)
> Advance warning that the testcase pr82580.c will start FAILing due to
> differences in register allocation following improvements to __int128
> parameter passing as explained in
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623756.html.
> We might need additional reload alternatives/preferences to ensure that we
> don't generate a movzbl.  Hopefully, Jakub and/or Uros have some suggestions
> for how best this can be fixed.
> 
> Previously, the SUBREGs and CLOBBERs generated by middle-end RTL expansion
> (unintentionally) ensured that rdx and rax would never be used for __int128
> arguments, which conveniently allowed the use of xor eax,eax; setc al in
> peephole2 as AX_REG wasn't live.  Now reload has more freedom, it elects to
> use rax as at this point the backend hasn't expressed any preference that it
> would like eax reserved for producing the result.

A different regression happens with pr82580.c, f0 function. Without the patch,
the compiler generates:

f0:
xorq%rdi, %rdx
xorq%rcx, %rsi
xorl%eax, %eax
orq %rsi, %rdx
sete%al
ret

But with the patch:

f0:
xchgq   %rdi, %rsi
movq%rdx, %r8
movq%rcx, %rax
movq%rsi, %rdx
movq%rdi, %rcx
xorq%rax, %rcx
xorq%r8, %rdx
xorl%eax, %eax
orq %rcx, %rdx
sete%al
ret

It looks to me that *concatditi3_3 ties two registers together so RA now tries
to satisfy *concatditi3_3 constraints *and* *cmpti_doubleword constraints.

gcc.target/i386/pr43644-2.c mitigates this issue with the
*addti3_doubleword_concat pattern, which combines *addti3_doubleword with the
concat insn, but doubleword compares (and other doubleword insns besides
addti3) do not provide these compound instructions.

So, without a common strategy to use doubleword_concat patterns for all
doubleword instructions, it is questionable whether the complications with the
concat insn are worth the pain of providing (many?) doubleword_concat patterns.

The real issue is with x86_64 doubleword arguments. Unfortunately, the ABI
specifies RDI/RSI for passing the doubleword argument, while the compiler's
register allocation order is RSI/RDI. IMO, we could swap RDI and RSI in the
allocation order, and the RA would then allocate registers in the same optimal
way as for x86_32 with -mregparm=3, even without synthetic concat patterns.

[Bug target/45434] x86 missed optimization: use high register (ah, bh, ch, dh) when available to make comparisons

2024-01-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45434

--- Comment #9 from Uroš Bizjak  ---
The current mainline compiles:

--cut here--
_Bool foo(int i)
{
  return (i & 0xFF) == ((i & 0xFF00) >> 8);
}

_Bool bar(int i)
{
  return (i & 0xFF) <= ((i & 0xFF00) >> 8);
}
--cut here--

with -O2 to:

foo:
movl%edi, %eax
sarl$8, %eax
xorl%edi, %eax
testb   %al, %al
sete%al
ret

bar:
movl%edi, %eax
cmpb%al, %ah
setnb   %al
ret

While ._original dump reads:

;; Function foo (null)

{
  return ((i >> 8 ^ i) & 255) == 0;
}


;; Function bar (null)

{
  return (i & 255) <= (i >> 8 & 255);
}

The test for equivalence goes through XOR, which interferes with the combine
pass's ability to form the compare with a zero-extracted argument RTX.

[Bug rtl-optimization/113048] [13/14 Regression] ICE: in lra_split_hard_reg_for, at lra-assigns.cc:1862 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -march=cascadelake since r13

2024-01-15 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113048

--- Comment #8 from Uroš Bizjak  ---
(In reply to Vladimir Makarov from comment #7)
> I believe this PR was recently fixed by
> https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;
> h=a729b6e002fe76208f33fdcdee49d6a310a1940e

Yes, I can confirm that the test compiles OK with a recent build:

xgcc (GCC) 14.0.1 20240115 (experimental) [master r14-7251-g04f22670d32]

[Bug target/113255] [11/12/13/14 Regression] wrong code with -O2 -mtune=k8

2024-01-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113255

Uroš Bizjak  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #6 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #4)

> I can't decipher this from what expand generates but the problem lies
> there (the rep_8bytes expansion).

Let's ask Honza.

[Bug tree-optimization/108477] fwprop over-optimizes conversion from + to |

2024-01-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108477

--- Comment #2 from Uroš Bizjak  ---
If we consider the following testcase:

--cut here--
unsigned int foo (unsigned int a, unsigned int b)
{
  unsigned int r = a & 0x1;
  unsigned int p = b & ~0x3;

  return r + p + 2;
}

unsigned int bar (unsigned int a, unsigned int b)
{
  unsigned int r = a & 0x1;
  unsigned int p = b & ~0x3;

  return r | p | 2;
}
--cut here--

the above testcase compiles (x86_64 -O2) to:

foo:
andl$1, %edi
andl$-4, %esi
orl %esi, %edi
leal2(%rdi), %eax
ret

bar:
andl$1, %edi
andl$-4, %esi
orl %esi, %edi
movl%edi, %eax
orl $2, %eax
ret

So there is no further simplification in either case: we can't combine the OR
with a PLUS in the first case, and we don't have an OR instruction with
multiple inputs in the second case.

If we switch around the logic in the conversion and convert from IOR/XOR to
PLUS, as is the case in the following patch:

--cut here--
diff --git a/gcc/match.pd b/gcc/match.pd
index 7b4b15acc41..deac18a7635 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1830,18 +1830,18 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
&& element_precision (type) <= element_precision (TREE_TYPE (@1)))
(bit_not (rop (convert @0) (convert @1))

-/* If we are XORing or adding two BIT_AND_EXPR's, both of which are and'ing
+/* If we are ORing or XORing two BIT_AND_EXPR's, both of which are and'ing
with a constant, and the two constants have no bits in common,
-   we should treat this as a BIT_IOR_EXPR since this may produce more
+   we should treat this as a PLUS_EXPR since this may produce more
simplifications.  */
-(for op (bit_xor plus)
+(for op (bit_ior bit_xor)
  (simplify
   (op (convert1? (bit_and@4 @0 INTEGER_CST@1))
   (convert2? (bit_and@5 @2 INTEGER_CST@3)))
   (if (tree_nop_conversion_p (type, TREE_TYPE (@0))
&& tree_nop_conversion_p (type, TREE_TYPE (@2))
&& (wi::to_wide (@1) & wi::to_wide (@3)) == 0)
-   (bit_ior (convert @4) (convert @5)
+   (plus (convert @4) (convert @5)

 /* (X | Y) ^ X -> Y & ~ X*/
 (simplify
--cut here--

then the resulting assembly reads:

foo:
andl$-4, %esi
andl$1, %edi
leal2(%rsi,%rdi), %eax
ret

bar:
andl$1, %edi
andl$-4, %esi
leal(%rdi,%rsi), %eax
orl $2, %eax
ret

On x86, the conversion can now use the LEA instruction, which is much more
useful than OR. In the first case, LEA implements a three-input PLUS, while in
the second case, even though the instruction can't be combined with the
follow-up OR, the non-destructive LEA avoids a move.

[Bug tree-optimization/108477] fwprop over-optimizes conversion from + to |

2024-01-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108477

--- Comment #1 from Uroš Bizjak  ---
This conversion happens due to the following code in match.pd:

/* If we are XORing or adding two BIT_AND_EXPR's, both of which are and'ing
   with a constant, and the two constants have no bits in common,
   we should treat this as a BIT_IOR_EXPR since this may produce more
   simplifications.  */
(for op (bit_xor plus)
 (simplify
  (op (convert1? (bit_and@4 @0 INTEGER_CST@1))
  (convert2? (bit_and@5 @2 INTEGER_CST@3)))
  (if (tree_nop_conversion_p (type, TREE_TYPE (@0))
   && tree_nop_conversion_p (type, TREE_TYPE (@2))
   && (wi::to_wide (@1) & wi::to_wide (@3)) == 0)
   (bit_ior (convert @4) (convert @5)

[Bug rtl-optimization/109052] Unnecessary reload with -mfpmath=both

2024-01-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109052

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED
   Target Milestone|--- |14.0

--- Comment #9 from Uroš Bizjak  ---
Implemented for gcc-14.

[Bug target/113255] [11/12/13/14 Regression] wrong code with -O2 -mtune=k8

2024-01-07 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113255

--- Comment #3 from Uroš Bizjak  ---
The _.dse1 pass is removing the store for some reason; -fno-dse "fixes" the
testcase.

Before _.dse1 pass, we have:

(insn 41 40 46 4 (set (mem/c:SI (plus:DI (reg/f:DI 19 frame)
(const_int -36 [0xffdc])) [2 e[1].y+0 S4 A32])
(reg:SI 98 [ e$1$y ])) "pr113255.c":21:9 85 {*movsi_internal}
 (expr_list:REG_DEAD (reg:SI 98 [ e$1$y ])
(nil)))

But _.dse1 pass decides that:

**scanning insn=41
  mem: (plus:DI (reg/f:DI 19 frame)
(const_int -36 [0xffdc]))

   after canon_rtx address: (plus:DI (reg/f:DI 19 frame)
(const_int -36 [0xffdc]))
  gid=1 offset=-36
 processing const base store gid=1[-36..-32)
mems_found = 1, cannot_delete = false

...

Locally deleting insn 41
deferring deletion of insn with uid = 41.

[Bug rtl-optimization/113048] [13/14 Regression] ICE: in lra_split_hard_reg_for, at lra-assigns.cc:1862 (unable to find a register to spill) {*andndi3_doubleword_bmi} with -march=cascadelake since r13

2024-01-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113048

Uroš Bizjak  changed:

   What|Removed |Added

  Component|target  |rtl-optimization

--- Comment #5 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #3)
> Started with r13-1716-gfd3d25d6df1cbd385d2834ff3059dfb6905dd75c

There is nothing wrong with the constraints in *andndi3_doubleword_bmi:

(define_insn_and_split "*andn3_doubleword_bmi"
  [(set (match_operand: 0 "register_operand" "=,r,r")
(and:
  (not: (match_operand: 1 "register_operand" "r,0,r"))
  (match_operand: 2 "nonimmediate_operand" "ro,ro,0")))
   (clobber (reg:CC FLAGS_REG))]

Reconfirmed as RA problem.

[Bug target/113231] x86_64 uses SSE instructions for `*mem <<= const` at -Os

2024-01-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113231

Uroš Bizjak  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com

--- Comment #3 from Uroš Bizjak  ---
CC Roger.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2023-12-29 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug target/113133] [14 Regression] ICE: SIGSEGV in mark_label_nuses(rtx_def*) (emit-rtl.cc:3896) with -O -fno-tree-ter -mavx512f -march=barcelona

2023-12-29 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113133

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #10 from Uroš Bizjak  ---
Fixed by a partial revert of
r14-4499-gc1eef66baa8dde706d7ea6921648e6016dc7c93d.

[Bug target/113133] [14 Regression] ICE: SIGSEGV in mark_label_nuses(rtx_def*) (emit-rtl.cc:3896) with -O -fno-tree-ter -mavx512f -march=barcelona

2023-12-29 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113133

--- Comment #8 from Uroš Bizjak  ---
(In reply to Haochen Jiang from comment #6)
> Aha, I see what happened. x/ymm16+ are usable for AVX512F w/o AVX512VL and
> that is why I added that to allow them.
> 
> Let me find a way to see if we can fix this.

It looks to me that ix86_hard_regno_mode_ok should be fixed to allow x/ymm16+
also with EVEX512. Currently we have:

  /* TODO check for QI/HI scalars.  */
  /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
  if (TARGET_AVX512VL
  && (VALID_AVX256_REG_OR_OI_MODE (mode)
  || VALID_AVX512VL_128_REG_MODE (mode)))
return true;

so the compiler is unable to change some modes of xmm16 to a 128-bit mode
using lowpart_subreg, e.g. DFmode to V4SFmode.

Please also note that your original patch missed adding TARGET_EVEX512 to the
splitter that handles float_truncate with TARGET_USE_VECTOR_FP_CONVERTS.

I propose to proceed with the minimal fix from Comment #3 as a hotfix to
unbreak the testcase in this PR. The real, but more involved, fix is to adjust
ix86_hard_regno_mode_ok, which I'll leave to you.

[Bug target/113133] [14 Regression] ICE: SIGSEGV in mark_label_nuses(rtx_def*) (emit-rtl.cc:3896) with -O -fno-tree-ter -mavx512f -march=barcelona

2023-12-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113133

Uroš Bizjak  changed:

   What|Removed |Added

 CC||haochen.jiang at intel dot com

--- Comment #4 from Uroš Bizjak  ---
Caused by r14-4499-gc1eef66baa8dde706d7ea6921648e6016dc7c93d

[Bug target/113133] [14 Regression] ICE: SIGSEGV in mark_label_nuses(rtx_def*) (emit-rtl.cc:3896) with -O -fno-tree-ter -mavx512f -march=barcelona

2023-12-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113133

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

--- Comment #3 from Uroš Bizjak  ---
This patch also fixes the failure:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index ca6dbf42a6d..cdb9ddc4eb3 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -5210,7 +5210,7 @@ (define_split
&& optimize_insn_for_speed_p ()
&& reload_completed
&& (!EXT_REX_SSE_REG_P (operands[0])
-   || TARGET_AVX512VL || TARGET_EVEX512)"
+   || TARGET_AVX512VL)"
[(set (match_dup 2)
 (float_extend:V2DF
   (vec_select:V2SF
--cut here--

[Bug target/113133] [14 Regression] ICE: SIGSEGV in mark_label_nuses(rtx_def*) (emit-rtl.cc:3896) with -O -fno-tree-ter -mavx512f -march=barcelona

2023-12-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113133

--- Comment #2 from Uroš Bizjak  ---
Another testcase:

--cut here--
void
foo1 (double *d, float f)
{
  register float x __asm ("xmm16") = f;
  asm volatile ("" : "+v" (x));

  *d = x;
}

void
foo2 (float *f, double d)
{
  register double x __asm ("xmm16") = d;
  asm volatile ("" : "+v" (x));

  *f = x;
}
--cut here--

[Bug target/113133] [14 Regression] ICE: SIGSEGV in mark_label_nuses(rtx_def*) (emit-rtl.cc:3896) with -O -fno-tree-ter -mavx512f -march=barcelona

2023-12-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113133

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2023-12-28
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #1 from Uroš Bizjak  ---
Created attachment 56962
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56962=edit
Proposed patch

Patch in testing.

lowpart_subreg can't handle:

lowpart_subreg (V4SFmode, operands[0], DFmode);

and

lowpart_subreg (V2DFmode, operands[0], SFmode);

subreg conversions and will return NULL_RTX for these cases.
