[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa

2024-05-22 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #8 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Using -falign-loops=5 indeed brings back the performance.
With -falign-loops=5 (which reduces the misalignment to 4), the adrp
instruction has the same address (0x4ae784) with and without a2f4be3dae0. So I
guess this really is a code-alignment issue?

(Also in our latest builds the regression has seemingly gone away without any
adjustments to code alignment)
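
As a rough cross-check of the alignment theory, the sketch below computes where each adrp address quoted in this thread falls within a cache line. The 64-byte line size is an assumption (the report does not state it), and the helper name line_offset is made up for illustration.

```python
# Offset of each quoted adrp address within an assumed 64-byte cache line.
# LINE_SIZE is an assumption; the thread does not state the line size.
LINE_SIZE = 64

def line_offset(addr, line_size=LINE_SIZE):
    """Return the byte offset of addr inside its cache line."""
    return addr % line_size

addresses = {
    "82d6d385f97 (before)": 0x4AEC44,
    "a2f4be3dae0 (after)": 0x4AEAE4,
    "-falign-loops=5": 0x4AE784,
}

offsets = {label: line_offset(addr) for label, addr in addresses.items()}
for label, off in offsets.items():
    print(f"{label:22s} offset {off}")
```

Under that assumption the pre-regression address and the -falign-loops=5 address both sit at offset 4, while the regressed address sits at offset 36. This agrees with the "misalignment of 4" figure, but does not by itself show that alignment is the cause.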

Thanks,
Prathamesh

[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa

2024-05-03 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for late response.

perf profile for povray with LTO:

Compiled with 82d6d385f97 (commit before a2f4be3dae0):

  20.03%  pov::All_CSG_Intersect_Intersections
  16.42%  pov::All_Plane_Intersections
  10.29%  pov::All_Sphere_Intersections
  10.10%  pov::Intersect_BBox_Tree

Compiled with a2f4be3dae0:

  19.51%  pov::All_CSG_Intersect_Intersections
  16.91%  pov::All_Plane_Intersections
  12.53%  pov::All_Sphere_Intersections
   9.81%  pov::Intersect_BBox_Tree

I verified there are no code-gen differences for any of the above hot
functions.
Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes
for text section:
Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161

Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections,
which seems to be caused by the following adrp instruction (with no code-gen
changes in All_Sphere_Intersections):

Compiled with 82d6d385f97:
 18.07 │4aec44:   adrp  x0, 4e 
  1.77 │4aec48:   ldr   d28, [x0, #2784]

Compiled with a2f4be3dae0:
 28.93 │4aeae4:   adrp  x0, 4e
  1.27 │4aeae8:   ldr   d28, [x0, #2432]

This seems to come from the following condition in Intersect_Sphere (which gets
inlined into All_Sphere_Intersections):

if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))

As far as I can see, there’s no difference between the two adrp instructions
except the address (4aec44 vs 4aeae4). And as far as I know, adrp only
calculates a pc-relative page address (and does not load any data). To check
for possible icache misses I used the L1I_CACHE_REFILL counter, and it turns
out that there are 64% more L1 icache misses for the above adrp instruction
with a2f4be3dae0 compared to 82d6d385f97, which may (partially) explain the
performance difference?
However, perf stat shows around 7% more L1 icache misses for the whole program
run with 82d6d385f97 compared to a2f4be3dae0.

I could (repeatedly) reproduce the issue on two neoverse-v2 machines.
The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays
-flto -march=native -mcpu=neoverse-v2"

Thanks,
Prathamesh

[Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2

2024-04-26 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

            Bug ID: 114860
           Summary: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto
                    -march=native -mcpu=neoverse-v2
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
It seems the performance of the povray benchmark regresses by ~5.5% with -O3
-flto -march=native -mcpu=neoverse-v2, and by ~1.6% without LTO.

This seems to have happened after the following commit:
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=a2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6

Reverting it brings back performance. I am investigating further.

Thanks,
Prathamesh

[Bug tree-optimization/114736] [13 Regression] ICE during SLP pass with gfortran-13 -O3 -mcpu=neoverse-v2

2024-04-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114736

--- Comment #11 from prathamesh3492 at gcc dot gnu.org ---
Hi Richard,
Thanks for the quick fix! I verified it now compiles the test-case with -O3
-mcpu=neoverse-v2. I suppose this will need backporting to gcc-13 branch. The
test compiles OK with gfortran-12.

Thanks,
Prathamesh

[Bug tree-optimization/114736] ICE during SLP pass with gfortran-13 -O3 -mcpu=neoverse-v2

2024-04-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114736

--- Comment #6 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 57957
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57957&action=edit
SLP dump

[Bug tree-optimization/114736] ICE during SLP pass with gfortran-13 -O3 -mcpu=neoverse-v2

2024-04-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114736

--- Comment #5 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #3)
> Does -fno-cost-model affect the behavior here?

With 43da77a4, it doesn't result in ICE with -fno-vect-cost-model or
-fvect-cost-model=unlimited.

Prior to 43da77a4, it still results in ICE with -fno-vect-cost-model. It only
seems to pass with -fvect-cost-model=very-cheap.

Thanks,
Prathamesh

[Bug tree-optimization/114736] ICE during SLP pass with gfortran-13 -O3 -mcpu=neoverse-v2

2024-04-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114736

--- Comment #2 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 57956
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57956&action=edit
Input to SLP pass (dse4 dump)

[Bug tree-optimization/114736] ICE during SLP pass with gfortran-13 -O3 -mcpu=neoverse-v2

2024-04-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114736

--- Comment #1 from prathamesh3492 at gcc dot gnu.org ---
Investigating this a bit further, the ICE appears with gfortran-13 because, for
the testcase, the length of the postorder traversal over the SLP graph (27)
doesn't match the number of nodes (28), and thus we hit the following assert in
create_partitions:

  /* Calculate a postorder of the graph, ignoring edges that correspond
     to natural latch edges in the cfg.  Reading the vector from the end
     to the beginning gives the reverse postorder.  */
  auto_vec<int> initial_rpo;
  graphds_dfs (m_slpg, &m_leafs[0], m_leafs.length (), &initial_rpo,
               false, NULL, skip_cfg_latch_edges);
  gcc_assert (initial_rpo.length () == m_vertices.length ());

Postorder traversal of graph (initial_rpo) shows:
vertices: [ 0 1 2 3 4 5 6 7 8 9 22 23 24 10 11 12 13 14 15 16 17
18 19 20 21 25 27 ]
Vertex 26 seems to be missing, which corresponds to bb15, and thus
initial_rpo.length() is one less than m_vertices.length().

(If we don't ignore cfg latch edges during dfs walk, then it seems to "work",
but that's not right approach I guess...)
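
To make the mismatch concrete, here is a small sketch. It first recovers the missing vertex from the traversal listed above, then uses a hypothetical three-vertex graph to show how skipping one edge during a DFS can leave a vertex out of the postorder. The postorder helper and the toy graph are illustrative assumptions and do not reproduce GCC's graphds_dfs.

```python
# Vertices reported in initial_rpo (copied from the dump above).
visited = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 22, 23, 24, 10, 11, 12, 13,
           14, 15, 16, 17, 18, 19, 20, 21, 25, 27]
NUM_VERTICES = 28  # m_vertices.length () according to the report
missing = sorted(set(range(NUM_VERTICES)) - set(visited))

def postorder(succs, roots, skip_edge=lambda u, v: False):
    """Depth-first postorder that never follows edges rejected by skip_edge."""
    seen, order = set(), []

    def visit(u):
        seen.add(u)
        for v in succs.get(u, []):
            if v not in seen and not skip_edge(u, v):
                visit(v)
        order.append(u)

    for root in roots:
        if root not in seen:
            visit(root)
    return order

# Hypothetical graph: vertex 2 is reachable only through the edge 1 -> 2,
# which we pretend is a latch edge that the walk must ignore.
toy = {0: [1], 1: [2], 2: []}
full = postorder(toy, [0])
pruned = postorder(toy, [0], skip_edge=lambda u, v: (u, v) == (1, 2))

print(missing, full, pruned)
```

In the toy graph the pruned walk visits only two of the three vertices, which is the same kind of length mismatch that the assert catches.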

The issue doesn't reproduce with master, running git bisect showed it went away
after:
http://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=43da77a4f1636280c4259402c9c2c543e6ec6c0b

With 43da77a4, vect_slp_function splits the region at the offending bb15,
because it is a loop header and its containing loop gets marked as
dont_vectorize by ifcvt.
slp dump shows:
t5.f90:1:21: missed: splitting region at dont-vectorize loop 3 entry at bb15

So, bb15 doesn't get passed to vect_slp_bbs and eventually to
create_partitions, avoiding the assert. So I am wondering if the issue has gone
latent on trunk rather than being fixed, since the presence or absence of
loop->dont_vectorize shouldn't affect the correctness of the BB vectorizer?

Perhaps not relevant, but this issue seems to surface only with -O3
-mcpu=neoverse-v2. It doesn't surface with -O3, or -O3 -mcpu=generic+sve2, or
even when trying out the equivalent -march options corresponding to
neoverse-v2:
-march=armv9-a+rng+crc+i8mm+bf16+sve2-bitperm+memtag+profile.

Thanks,
Prathamesh

[Bug tree-optimization/114736] New: ICE during SLP pass with gfortran-13 -O3 -mcpu=neoverse-v2

2024-04-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114736

            Bug ID: 114736
           Summary: ICE during SLP pass with gfortran-13 -O3
                    -mcpu=neoverse-v2
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
For the following test-case:

SUBROUTINE MY_ROUTINE (N, A, B )
IMPLICIT NONE
INTEGER,   INTENT(IN):: N
COMPLEX,   INTENT(IN):: A(N)
COMPLEX,   INTENT(OUT)   :: B(N)
INTEGER  :: II
B(:) = (1.,0.)
DO II = 1, N-1
B(II) = A(N-II+1) / A(N-II)
ENDDO
END SUBROUTINE MY_ROUTINE

Compiling with gfortran-13 -O3 -mcpu=neoverse-v2 results in following ICE:

during GIMPLE pass: slp
dump file: t5.f90.180t.slp1
t5.f90:1:21:

1 | SUBROUTINE MY_ROUTINE (N, A, B )
  | ^
internal compiler error: in create_partitions, at tree-vect-slp.cc:4226
0x12a4aef vect_optimize_slp_pass::create_partitions()
../../gcc/gcc/tree-vect-slp.cc:4226
0x12a617b vect_optimize_slp_pass::run()
../../gcc/gcc/tree-vect-slp.cc:5642
0x12a626b vect_optimize_slp(vec_info*)
../../gcc/gcc/tree-vect-slp.cc:5666
0x12abdef vect_optimize_slp(vec_info*)
../../gcc/gcc/tree-vect-slp.cc:7486
0x12abdef vect_slp_analyze_bb_1
../../gcc/gcc/tree-vect-slp.cc:7450
0x12abdef vect_slp_region
../../gcc/gcc/tree-vect-slp.cc:7538
0x12adc0b vect_slp_bbs
../../gcc/gcc/tree-vect-slp.cc:7746
0x12adf57 vect_slp_function(function*)
../../gcc/gcc/tree-vect-slp.cc:7847
0x12b949f execute
../../gcc/gcc/tree-vectorizer.cc:1529

Thanks,
Prathamesh

[Bug target/114323] [14 Regression] MVE vector load intrinsic miscompiled since r14-5622-g4d7647edfd7d98

2024-03-14 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114323

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |prathamesh3492 at gcc dot gnu.org

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Just to expand on previous comments:
Before patch, input to dse is:

  uint32x4_t D.13560;
  const uint32_t D.13545[4];
  uint32x4_t V0;
  __simd128_uint32_t _7;

   :
  # .MEM_2 = VDEF <.MEM_1(D)>
  D.13545 = *.LC0;
  # .MEM_8 = VDEF <.MEM_2>
  _7 = __builtin_mve_vld1q_uv4si ();
  # .MEM_6 = VDEF <.MEM_8>
  D.13545 ={v} {CLOBBER(eos)};
  # VUSE <.MEM_6>
  return _7;

In this case, we have following virtual def-use chain:
.MEM_1(D) -> .MEM_2 -> .MEM_8 -> .MEM_6


However after patch, input to dse is:
  const uint32_t D.13539[4];
  uint32x4_t V0;

   :
  # .MEM_2 = VDEF <.MEM_1(D)>
  D.13539 = *.LC0;
  V0_3 = vld1q_u32 ();
  # .MEM_5 = VDEF <.MEM_2>
  D.13539 ={v} {CLOBBER(eos)};
  # VUSE <.MEM_5>
  return V0_3;

There's a missing use of MEM_2 in call to vld1q_u32, and
since the only use of MEM_2 now is in clobber statement,
dse considers it as a dead store, and simplifies it to:

   :
  V0_3 = vld1q_u32 ();
  # .MEM_5 = VDEF <.MEM_1(D)>
  D.13539 ={v} {CLOBBER(eos)};
  # VUSE <.MEM_5>
  return V0_3;

thus passing uninitialized pointer to vld1q_u32.
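
The effect of the missing virtual use can be illustrated with a toy model of the two statement sequences quoted above. This is only a sketch: the tuple layout and the store_looks_dead helper are assumptions made for illustration, and GCC's real DSE pass is considerably more involved.

```python
# Each statement is (description, vdef, vuse); None means "no such operand".
# The operands are copied from the dumps above; the model itself is a toy.
before = [
    ("D.13545 = *.LC0",          ".MEM_2", ".MEM_1"),
    ("_7 = __builtin_mve_vld1q", ".MEM_8", ".MEM_2"),
    ("D.13545 = {CLOBBER}",      ".MEM_6", ".MEM_8"),
    ("return _7",                None,     ".MEM_6"),
]
after = [
    ("D.13539 = *.LC0",          ".MEM_2", ".MEM_1"),
    ("V0_3 = vld1q_u32",         None,     None),      # no use of .MEM_2
    ("D.13539 = {CLOBBER}",      ".MEM_5", ".MEM_2"),
    ("return V0_3",              None,     ".MEM_5"),
]

def store_looks_dead(stmts, store_vdef):
    """Toy rule: a store looks dead if every user of its VDEF is a clobber."""
    users = [desc for desc, _, vuse in stmts if vuse == store_vdef]
    return bool(users) and all("CLOBBER" in desc for desc in users)

dead_before = store_looks_dead(before, ".MEM_2")
dead_after = store_looks_dead(after, ".MEM_2")
print(dead_before, dead_after)
```

In the "before" sequence the load intrinsic sits on the virtual chain and keeps the store alive; in the "after" sequence only the clobber uses .MEM_2, so the toy rule flags the store as removable.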

Thanks,
Prathamesh

[Bug target/112950] gcc.target/aarch64/sve/acle/general/dupq_5.c fails on aarch64_be-linux-gnu

2024-01-29 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112950

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Fixed.

[Bug target/112950] gcc.target/aarch64/sve/acle/general/dupq_5.c fails on aarch64_be-linux-gnu

2023-12-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112950

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org  |prathamesh3492 at gcc dot gnu.org

--- Comment #1 from prathamesh3492 at gcc dot gnu.org ---
Sorry for the breakage, will take a look.

Thanks,
Prathamesh

[Bug middle-end/111754] [14 Regression] ICE: in decompose, at rtl.h:2313 at -O

2023-11-28 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754

--- Comment #15 from prathamesh3492 at gcc dot gnu.org ---
Sorry for the regression, and thanks for the prompt fix!

[Bug middle-end/111754] [14 Regression] ICE: in decompose, at rtl.h:2313 at -O

2023-11-27 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #13 from prathamesh3492 at gcc dot gnu.org ---
Fixed.

[Bug rtl-optimization/111702] [14 Regression] ICE: in insert_regs, at cse.cc:1114 with -O2 -fstack-protector-all -frounding-math

2023-11-10 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111702

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Hi, sorry for the breakage, will take a look.

Thanks,
Prathamesh

[Bug tree-optimization/111648] [14 Regression] Wrong code at -O2/3 on x86_64-linux-gnu since r14-3243-ga7dba4a1c05

2023-10-18 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111648

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from prathamesh3492 at gcc dot gnu.org ---
Fixed.

[Bug middle-end/111754] [14 Regression] ICE: in decompose, at rtl.h:2313 at -O

2023-10-10 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754

--- Comment #7 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Richard Biener from comment #5)
> It seems we have VECTOR_CST_NELTS_PER_PATTERN ({ 9.0e+0, 0.0, 0.0, 0.0 })
> 2 and VECTOR_CST_NPATTERNS == 1.  And the selector { 1, 0, 1, 2 } has
> npatterns == 1 and nelts-per-pattern == 3.
> 
>   /* (1) If SEL is a suitable mask as determined by
>  valid_mask_for_fold_vec_perm_cst_p, then:
>  res_npatterns = max of npatterns between ARG0, ARG1, and SEL
>  res_nelts_per_pattern = max of nelts_per_pattern between
>  ARG0, ARG1 and SEL.
>  (2) If SEL is not a suitable mask, and TYPE is VLS then:
>  res_npatterns = nelts in result vector.
>  res_nelts_per_pattern = 1.
>  This exception is made so that VLS ARG0, ARG1 and SEL work as before. 
> */
>   if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> {
>   res_npatterns
> = std::max (VECTOR_CST_NPATTERNS (arg0),
> std::max (VECTOR_CST_NPATTERNS (arg1),
>   sel.encoding ().npatterns ()));
> 
>   res_nelts_per_pattern
> = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
>   sel.encoding ().nelts_per_pattern ()));
> 
>   res_nelts = res_npatterns * res_nelts_per_pattern;
> 
> this seems to be a case that doesn't fit, so the fix needs to be to
> valid_mask_for_fold_vec_perm_cst_p which really looks a bit
> unwieldly.
valid_mask_for_fold_vec_perm_cst_p incorrectly returns true here,
which is being addressed in PR111648 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631926.html

Even if the vectors had integral element type:
arg0 = arg1 = (v4si){ 9, 0, 0, 0 }  // encoded as {9, 0, ...}
and sel = { 1, 0, 1, 2 }  // encoded as {1, 0, 1, ...}

The pattern in sel {1, 0, 1, ...}
would choose elements from arg0, and
res would have incorrect encoding with step = -9:
res = { arg0[1], arg0[0], arg0[1], ... } 
= { 0, 9, 0, ... }
And res[3] will be incorrectly computed as -9 instead of arg0[2].

However, for floating element types, even if encoding is correct,
I assume it will still ICE when trying to derive elements not present in
encoding since poly_int_cst can only deal with integral elements ?
> 
> An assert that res_nelts is power-of-two would be nice to add.
Sorry, I don't understand. res_nelts for VLA need not be power of 2,
since res_nelts_per_pattern can be 3. The encoding for res is chosen
to be max of npatterns and max of nelts_per_pattern between arg0, arg1, and
sel.
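
The integral example above can be replayed with a small model of how an element is derived from a pattern-based encoding. The function below is a simplified sketch written for this note (it is not GCC's vector_cst_elt), and it assumes plain Python integers with no element-width wrap-around.

```python
def vector_elt(encoded, npatterns, nelts_per_pattern, i):
    """Decode element i from a pattern-based encoding (simplified model)."""
    pattern, count = i % npatterns, i // npatterns
    if count < nelts_per_pattern:
        return encoded[i]                      # explicitly encoded element
    last = encoded[pattern + npatterns * (nelts_per_pattern - 1)]
    if nelts_per_pattern < 3:
        return last                            # final element repeats
    prev = encoded[pattern + npatterns * (nelts_per_pattern - 2)]
    # Stepped pattern: keep adding the step between the last two elements.
    return last + (count - (nelts_per_pattern - 1)) * (last - prev)

arg0 = [9, 0]      # {9, 0, ...}: npatterns = 1, nelts_per_pattern = 2
res = [0, 9, 0]    # {arg0[1], arg0[0], arg0[1], ...} treated as stepped

expected = vector_elt(arg0, 1, 2, 2)       # sel[3] = 2 selects arg0[2]
extrapolated = vector_elt(res, 1, 3, 3)    # what the encoding of res implies
print(expected, extrapolated)
```

The correct value of res[3] is arg0[2] = 0, while the stepped encoding of res extrapolates to -9, matching the description above.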

Thanks,
Prathamesh

[Bug middle-end/111754] [14 Regression] ICE: in decompose, at rtl.h:2313 at -O

2023-10-10 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754

--- Comment #6 from prathamesh3492 at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #4)
> On Tue, 10 Oct 2023, prathamesh3492 at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754
> > 
> > --- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
> > The issue is that we only support integral vector types in 
> > fold_vec_perm_cst,
> > but fail to check for the same before calling it from fold_vec_perm.
> > The following tweak fixes the ICE:
> > 
> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > index 4f8561509ff..a29a8af6d2f 100644
> > --- a/gcc/fold-const.cc
> > +++ b/gcc/fold-const.cc
> > @@ -10801,7 +10801,8 @@ fold_vec_perm (tree type, tree arg0, tree arg1, 
> > const
> > vec_perm_indices )
> >  return NULL_TREE;
> > 
> >if (TREE_CODE (arg0) == VECTOR_CST
> > -  && TREE_CODE (arg1) == VECTOR_CST)
> > +  && TREE_CODE (arg1) == VECTOR_CST
> > +  && INTEGRAL_TYPE_P (TREE_TYPE (type)))
> >  return fold_vec_perm_cst (type, arg0, arg1, sel);
> 
> Huh, that looks wrong.  I fail to see how the element type matters
> at all.

IIUC, the element type matters for VLA folding when sel has a stepped sequence
because in that case we need to derive elements from the encoding using
vector_cst_elt / vector_cst_int_elt, and it gets enforced for VLS vectors too
because they are handled in unified manner in fold_vec_perm_cst.

Another possible approach is to use the "VLS exception" in fold_vec_perm_cst to
encode all the elements:
res_npatterns = res_nelts;
res_nelts_per_pattern = 1;
just like we do if valid_mask_for_fold_vec_perm_cst_p returns false.

Does the following fix look OK instead ?

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 4f8561509ff..356eb052fbc 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10642,6 +10642,11 @@ valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
   if (sel_nelts_per_pattern < 3)
 return true;

+  /* If SEL contains stepped sequence, ensure that we are dealing with
+ integral vector_cst.  */
+  if (!INTEGRAL_TYPE_P (TREE_TYPE (arg0)))
+return false;
+
   for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
 {
   poly_uint64 a1 = sel[pattern + sel_npatterns];

[Bug middle-end/111754] [14 Regression] ICE: in decompose, at rtl.h:2313 at -O

2023-10-10 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
The issue is that we only support integral vector types in fold_vec_perm_cst,
but fail to check for the same before calling it from fold_vec_perm.
The following tweak fixes the ICE:

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 4f8561509ff..a29a8af6d2f 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10801,7 +10801,8 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 return NULL_TREE;

   if (TREE_CODE (arg0) == VECTOR_CST
-  && TREE_CODE (arg1) == VECTOR_CST)
+  && TREE_CODE (arg1) == VECTOR_CST
+  && INTEGRAL_TYPE_P (TREE_TYPE (type)))
 return fold_vec_perm_cst (type, arg0, arg1, sel);

   /* For fall back case, we want to ensure we have VLS vectors

and results in the following .optimized dump:
F bar (F a, F b)
{
  F c;

   [local count: 1073741824]:
  c_2 = VEC_PERM_EXPR ;
  __builtin_logbl (0.0);
  return c_2;

}

F foo ()
{
   [local count: 1073741824]:
  __builtin_logbl (0.0);
  return { 0.0, 9.0e+0, 0.0, 0.0 };

}

[Bug middle-end/111754] [14 Regression] ICE: in decompose, at rtl.h:2313 at -O

2023-10-10 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111754

--- Comment #2 from prathamesh3492 at gcc dot gnu.org ---
Hi,
Sorry for the breakage, will take a look.

Thanks,
Prathamesh

[Bug tree-optimization/111697] New: Sub optimal code gen for initialising vector using loop

2023-10-04 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111697

            Bug ID: 111697
           Summary: Sub optimal code gen for initialising vector using
                    loop
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
For the following test-case:

typedef int v4si __attribute__((vector_size (sizeof (int) * 4)));
v4si f(int x)
{
  v4si v;
  for (int i = 0; i < 4; i++)
    v[i] = x;
  return v;
}

Compiling with -O2 results in following .optimized dump:

v4si f (int x)
{
  v4si v;

   [local count: 214748368]:
  v_16 = BIT_INSERT_EXPR ;
  v_20 = BIT_INSERT_EXPR ;
  v_24 = BIT_INSERT_EXPR ;
  v_2 = BIT_INSERT_EXPR ;
  return v_2;

}

and following code-gen on aarch64:
f:
movi v0.4s, 0
fmov s31, w0
ins v0.s[0], v31.s[0]
ins v0.s[1], v31.s[0]
ins v0.s[2], v31.s[0]
ins v0.s[3], v31.s[0]
ret

which could instead be a single dup instruction:
f:
dup v0.4s, w0
ret

Similarly, code-gen on x86_64:
f:
movd    %edi, %xmm0
movd    %edi, %xmm1
pshufd  $225, %xmm0, %xmm0
movss   %xmm1, %xmm0
pshufd  $225, %xmm0, %xmm0
pshufd  $198, %xmm0, %xmm0
movss   %xmm1, %xmm0
pshufd  $198, %xmm0, %xmm0
pshufd  $39, %xmm0, %xmm0
movss   %xmm1, %xmm0
pshufd  $39, %xmm0, %xmm0
ret

[Bug tree-optimization/111648] [14 Regression] Wrong code at -O2/3 on x86_64-linux-gnu since r14-3243-ga7dba4a1c05

2023-10-03 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111648

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
(In reply to prathamesh3492 from comment #3)
> Created attachment 56037 [details]
> Untested fix
> 
> The issue is that when a1 is a multiple of vector length, we end up creating
> following encoding in result: { base_elem, arg[0], arg[1], ... } where arg
> is chosen input vector, which is incorrect.
> 
> For above case, vectorizer pass creates VEC_PERM_EXPR where:
> arg0: { -16, -9, -10, -11 } 
> arg1: { -12, -5, -6, -7 } 
> sel = { 3, 4, 5, 6 }
> 
> arg0, arg1 and sel are encoded with npatterns = 1 and nelts_per_pattern = 3.
> Since a1 = 4 and arg_len = 4, it ended up creating the result with
> following encoding:
> res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, nelts_per_pattern = 3
> = { -11, -12, -5 }
> 
> So for res[4], it used S = (-5) - (-12) = 7
Typo: I meant res[3], not res[4]. Sorry.
> And hence computed it as -5 + 7 = 2.
> instead of arg1[2], ie, -6.
> which is the difference we see in output at -O0 vs -O2.
> 
> The patch tweaks the constratints in valid_mask_for_fold_vec_perm_cst_p to
> punt if a1 is a multiple of vector length, so a1 ... ae only selects from
> stepped part of the input vector, which seems to fix this issue.
> I will run a proper bootstrap+test and post it upstream.

[Bug tree-optimization/111648] [14 Regression] Wrong code at -O2/3 on x86_64-linux-gnu since r14-3243-ga7dba4a1c05

2023-10-03 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111648

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 56037
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56037&action=edit
Untested fix

The issue is that when a1 is a multiple of vector length, we end up creating
following encoding in result: { base_elem, arg[0], arg[1], ... } where arg is
chosen input vector, which is incorrect.

For above case, vectorizer pass creates VEC_PERM_EXPR where:
arg0: { -16, -9, -10, -11 } 
arg1: { -12, -5, -6, -7 } 
sel = { 3, 4, 5, 6 }

arg0, arg1 and sel are encoded with npatterns = 1 and nelts_per_pattern = 3.
Since a1 = 4 and arg_len = 4, it ended up creating the result with
following encoding:
res = { arg0[3], arg1[0], arg1[1] } // npatterns = 1, nelts_per_pattern = 3
= { -11, -12, -5 }

So for res[4], it used S = (-5) - (-12) = 7
And hence computed it as -5 + 7 = 2.
instead of arg1[2], ie, -6.
which is the difference we see in output at -O0 vs -O2.

The patch tweaks the constraints in valid_mask_for_fold_vec_perm_cst_p to punt
if a1 is a multiple of vector length, so a1 ... ae only selects from stepped
part of the input vector, which seems to fix this issue.
I will run a proper bootstrap+test and post it upstream.
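
The numbers above can be checked with a short sketch that compares an element-wise permute against the value implied by the stepped encoding. The variable names are illustrative and the extrapolation rule is a simplified model of the encoding, not GCC code.

```python
arg0 = [-16, -9, -10, -11]
arg1 = [-12, -5, -6, -7]
sel = [3, 4, 5, 6]

# Element-wise reference: indices below len(arg0) pick from arg0,
# the remaining indices pick from arg1.
both = arg0 + arg1
reference = [both[i] for i in sel]

# What the stepped encoding implies: keep the first three results and
# extrapolate the fourth as last + (last - previous).
encoded = reference[:3]
step = encoded[2] - encoded[1]
extrapolated = encoded + [encoded[2] + step]

print(reference, extrapolated)
```

The two results differ only in the last element (-6 versus 2), which is the -O0 versus -O2 difference described above.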

[Bug tree-optimization/111648] Wrong code at -O2/3 on x86_64-linux-gnu since r14-3243-ga7dba4a1c05

2023-09-30 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111648

--- Comment #1 from prathamesh3492 at gcc dot gnu.org ---
Hi,
Sorry for the breakage, will take a look.

Thanks,
Prathamesh

[Bug tree-optimization/111048] [14 Regression] Wrong AVX2 code on highway-1.0.6 on -O2 and above since r14-3243-ga7dba4a1c05a76

2023-08-21 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111048

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #10 from prathamesh3492 at gcc dot gnu.org ---
Fixed.

[Bug tree-optimization/111048] [14 Regression] Wrong AVX2 code on highway-1.0.6 on -O2 and above since r14-3243-ga7dba4a1c05a76

2023-08-18 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111048

--- Comment #8 from prathamesh3492 at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #7)
> = ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
>   : VECTOR_CST_NPATTERNS (arg1);
> 
> should be q1 & 1 :)

Oops, sorry for the typo :/
And yes, that fixes the issue.

For more context we have following inputs to VEC_PERM_EXPR:
arg0 (1, 1): { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
arg1 (4, 1): { 255, 63, 15, 3, 255, 63, 15, 3, 255, 63, 15, 3, 255, 63, 15, 3 }
sel (2, 3): { 0, 16, 1, 17, 2, 18, ... }
arg0 len: 16
sel nelts: 16

In valid_mask_for_fold_vec_perm_cst_p for the pattern: {16, 17, 18, ...}
arg_npatterns is erroneously set to VECTOR_CST_NPATTERNS (arg0) and we have:
step = 1, arg_npatterns = 1
Thus, step becomes a "multiple" of arg_npatterns and we (wrongly) return true
for this case.

So in the loop below in fold_vec_perm_cst, we have res with following encoding:
res (4, 3): { 0, 255, 0, 63, 0, 15, 0, 3, 0, 255, 0, 63, ... }

Since len = 16, it has to compute the remaining elements.
For index 13, it comes as "a3" in pattern: { 255, 15, 255, ... }
So the step gets computed as: 255 - 15 = 240
And IIUC the next element thus becomes: (255 + 240)%256 = 239.

By correctly setting arg_npatterns = VECTOR_CST_NPATTERNS (arg1) for this
case, arg_npatterns becomes 4.
Since step == 1 is not a multiple of arg_npatterns we return false,
and use the fallback:
res_npatterns = 16, res_nelts_per_pattern = 1.
and the loop below correctly encodes the elements.
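
As a sanity check of the reasoning above, the sketch below models only the arg_npatterns selection and the "step is a multiple of arg_npatterns" test, with a flag that reproduces the q1 & 0 typo. It is a simplified illustration; the function name step_ok and its parameters are assumptions, and the real valid_mask_for_fold_vec_perm_cst_p performs further checks.

```python
def step_ok(a1, a2, arg_len, npatterns0, npatterns1, buggy=False):
    """Return True if the step is a multiple of the chosen input's npatterns."""
    q1 = a1 // arg_len
    bit = 0 if buggy else 1            # the typo tested `q1 & 0`
    arg_npatterns = npatterns0 if (q1 & bit) == 0 else npatterns1
    return (a2 - a1) % arg_npatterns == 0

# Pattern {16, 17, 18, ...} of sel: a1 = 17, a2 = 18, arg_len = 16;
# arg0 has npatterns = 1 and arg1 has npatterns = 4.
with_typo = step_ok(17, 18, 16, 1, 4, buggy=True)
with_fix = step_ok(17, 18, 16, 1, 4)

# Extrapolating pattern { 255, 15, 255, ... } with 8-bit wrap-around.
wrong_elt = (255 + (255 - 15)) % 256

print(with_typo, with_fix, wrong_elt)
```

With the typo the mask is accepted and the stepped extrapolation produces 239; with the fix the mask is rejected, so every element is encoded explicitly.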

I will shortly send a patch after validating it.

Thanks,
Prathamesh

[Bug tree-optimization/111048] [14 Regression] Wrong AVX2 code on highway-1.0.6 on -O2 and above

2023-08-18 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111048

--- Comment #6 from prathamesh3492 at gcc dot gnu.org ---
Sorry for the breakage, I will take a look.

Thanks,
Prathamesh

[Bug rtl-optimization/110867] [14 Regression] ICE in combine after 7cdd0860949c6c3232e6cff1d7ca37bb5234074c

2023-08-13 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110867

--- Comment #10 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Stefan Schulze Frielinghaus from comment #9)
> It looks like as if the first fix didn't entirely solve the problem.  It
> turns out that the normal form of const_int is not always met.  Before
> releasing a new patch, could you test it first in order to make sure that I
> do not break bootstrapping again.  I already gave it a try against the
> reproducer but would like to make sure that the whole bootstrap is
> successful.

Hi Stefan,
I bootstrapped+tested your patch from Comment 8 on arm, and it seems OK.

Thanks,
Prathamesh

[Bug middle-end/110857] aarch64-linux-gnu profiledbootstrap broken

2023-08-04 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110857

--- Comment #6 from prathamesh3492 at gcc dot gnu.org ---
profiledbootstrap now works on aarch64-linux-gnu, thanks!

[Bug middle-end/110857] aarch64-linux-gnu profiledbootstrap broken

2023-08-04 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110857

--- Comment #5 from prathamesh3492 at gcc dot gnu.org ---
Hi Honza,
Sorry for late response, and thanks for the fix! I am currently running
profiledbootstrap on aarch64 with your fix, and will let you know the results
after it completes.

Thanks,
Prathamesh

[Bug rtl-optimization/110867] [14 Regression] ICE in combine after 7cdd0860949c6c3232e6cff1d7ca37bb5234074c

2023-08-01 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110867

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
(In reply to prathamesh3492 from comment #2)
> (In reply to Stefan Schulze Frielinghaus from comment #1)
> > The optimization introduced by r14-2879-g7cdd0860949c6c hits during
> > combination of insn
> > 
> > (insn 31 3 32 2 (set (reg:SI 118 [ _1 ])
> > (mem:SI (reg/v/f:SI 115 [ a ]) [1 *a_4(D)+0 S4 A64])) "t.c":15:7 758
> > {*arm_movsi_vfp}
> >  (nil))
> > 
> > and
> > 
> > (insn 9 32 10 2 (set (reg:CC 100 cc)
> > (compare:CC (reg:SI 118 [ _1 ])
> > (const_int -2147483648 [0x8000]))) "t.c":15:6 272
> > {*arm_cmpsi_insn}
> >  (nil))
> > 
> > The idea of r14-2879-g7cdd0860949c6c is to get rid of large constants while
> > performing an unsigned comparison.  In this case it looks like a 32-bit
> > constant is sign-extended into a 64-bit constant and then a 32-bit
> > comparison is done.  While writing the optimization I always assumed that
> > the constant does fit into int_mode which is apparently not the case here. 
> > Thus one possible solution would be to simply bail out in those cases:
> > 
> > diff --git a/gcc/combine.cc b/gcc/combine.cc
> > index 0d99fa541c5..e46d202d0a7 100644
> > --- a/gcc/combine.cc
> > +++ b/gcc/combine.cc
> > @@ -11998,11 +11998,15 @@ simplify_compare_const (enum rtx_code code,
> > machine_mode mode,
> >   x0 >= 0x40.  */
> >if ((code == LEU || code == LTU || code == GEU || code == GTU)
> >&& is_a  (GET_MODE (op0), _mode)
> > +  && HWI_COMPUTABLE_MODE_P (int_mode)
> >&& MEM_P (op0)
> >&& !MEM_VOLATILE_P (op0)
> >/* The optimization makes only sense for constants which are big
> > enough
> >  so that we have a chance to chop off something at all.  */
> >&& (unsigned HOST_WIDE_INT) const_op > 0xff
> > +  /* Bail out, if the constant does not fit into INT_MODE.  */
> > +  && (unsigned HOST_WIDE_INT) const_op
> > +< ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 1) -
> > 1)
> >/* Ensure that we do not overflow during normalization.  */
> >&& (code != GTU || (unsigned HOST_WIDE_INT) const_op <
> > HOST_WIDE_INT_M1U))
> >  {
> > 
> > Does this resolve the problem for you?
> 
> Yes, it worked thanks! I will do a full bootstrap+test with your fix and let
> you know the results.
Bootstrap+testing works fine with your fix. Thanks!
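
For reference, the added bail-out condition can be replayed on the constant from insn 9. This is a sketch under the assumption that HOST_WIDE_INT is 64 bits wide; the helper names are made up for illustration and are not GCC functions.

```python
HWI_BITS = 64  # assumed width of HOST_WIDE_INT
HWI_MASK = (1 << HWI_BITS) - 1

def mode_mask(precision):
    """((1 << (precision - 1)) << 1) - 1, truncated to HWI_BITS as in C."""
    return (((1 << (precision - 1)) << 1) - 1) & HWI_MASK

def as_unsigned_hwi(value):
    """Reinterpret a signed value as an unsigned HWI_BITS-wide integer."""
    return value & HWI_MASK

const_op = as_unsigned_hwi(-2147483648)   # the SImode constant from insn 9
passes_new_check = const_op < mode_mask(32)

print(hex(const_op), hex(mode_mask(32)), passes_new_check)
```

The sign-extended constant is far larger than the 32-bit mask, so the new condition fails and the optimization is skipped for this comparison. Shifting in two steps also keeps each shift count below the type width when the precision equals the full 64 bits.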
> 
> Thanks,
> Prathamesh

[Bug rtl-optimization/110867] ICE in combine after 7cdd0860949c6c3232e6cff1d7ca37bb5234074c

2023-08-01 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110867

--- Comment #2 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Stefan Schulze Frielinghaus from comment #1)
> The optimization introduced by r14-2879-g7cdd0860949c6c hits during
> combination of insn
> 
> (insn 31 3 32 2 (set (reg:SI 118 [ _1 ])
> (mem:SI (reg/v/f:SI 115 [ a ]) [1 *a_4(D)+0 S4 A64])) "t.c":15:7 758
> {*arm_movsi_vfp}
>  (nil))
> 
> and
> 
> (insn 9 32 10 2 (set (reg:CC 100 cc)
> (compare:CC (reg:SI 118 [ _1 ])
> (const_int -2147483648 [0xffffffff80000000]))) "t.c":15:6 272
> {*arm_cmpsi_insn}
>  (nil))
> 
> The idea of r14-2879-g7cdd0860949c6c is to get rid of large constants while
> performing an unsigned comparison.  In this case it looks like a 32-bit
> constant is sign-extended into a 64-bit constant and then a 32-bit
> comparison is done.  While writing the optimization I always assumed that
> the constant does fit into int_mode which is apparently not the case here. 
> Thus one possible solution would be to simply bail out in those cases:
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index 0d99fa541c5..e46d202d0a7 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -11998,11 +11998,15 @@ simplify_compare_const (enum rtx_code code,
> machine_mode mode,
>   x0 >= 0x40.  */
>if ((code == LEU || code == LTU || code == GEU || code == GTU)
>&& is_a <scalar_int_mode> (GET_MODE (op0), &int_mode)
> +  && HWI_COMPUTABLE_MODE_P (int_mode)
>&& MEM_P (op0)
>&& !MEM_VOLATILE_P (op0)
>/* The optimization makes only sense for constants which are big
> enough
>  so that we have a chance to chop off something at all.  */
>&& (unsigned HOST_WIDE_INT) const_op > 0xff
> +  /* Bail out, if the constant does not fit into INT_MODE.  */
> +  && (unsigned HOST_WIDE_INT) const_op
> +< ((HOST_WIDE_INT_1U << (GET_MODE_PRECISION (int_mode) - 1) << 1) -
> 1)
>/* Ensure that we do not overflow during normalization.  */
>&& (code != GTU || (unsigned HOST_WIDE_INT) const_op <
> HOST_WIDE_INT_M1U))
>  {
> 
> Does this resolve the problem for you?

Yes, it worked thanks! I will do a full bootstrap+test with your fix and let
you know the results.

Thanks,
Prathamesh
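
The bail-out condition in the quoted patch can be tried outside of GCC. The following is only a minimal sketch in plain C, not the combine.cc code: fits_unsigned is a hypothetical helper, uint64_t stands in for unsigned HOST_WIDE_INT, and the precision is assumed to be in the range 1..64. It mirrors the comparison from the patch and lets one check that the sign-extended constant from insn 9 does not pass it for a 32-bit mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the comparison added by the patch:
     const_op < ((1 << (precision - 1) << 1) - 1)
   evaluated in 64-bit unsigned arithmetic.  The shift is split in two
   so that precision == 64 never shifts by the full width of the type.  */
static bool
fits_unsigned (int64_t const_op, unsigned precision)
{
  uint64_t limit = (((uint64_t) 1 << (precision - 1)) << 1) - 1;
  return (uint64_t) const_op < limit;
}
```

With precision 32 the limit is 0xffffffff, so a constant that was sign-extended to 0xffffffff80000000 is rejected, while the same bit pattern zero-extended (0x80000000) is accepted.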

[Bug rtl-optimization/110867] New: ICE in combine after 7cdd0860949c6c3232e6cff1d7ca37bb5234074c

2023-08-01 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110867

Bug ID: 110867
   Summary: ICE in combine after
7cdd0860949c6c3232e6cff1d7ca37bb5234074c
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

For the following test-case adapted from libgcc/fixed-bit.c:

typedef int DItype __attribute__ ((mode (DI)));

void
__gnu_saturate1sq (DItype *a)
{
  DItype max, min;
  max = (DItype)1 << (31 + 0);
  max = max - 1;

  min = (DItype)1 << (2 * (4 * 8) - 1);
  min = min >> (2 * (4 * 8) - 1 - (31 + 0));



  if (*a > max)
    *a = max;
  else if (*a < min)
    *a = min;
}

Compiling with -O2 on armv8l-unknown-linux-gnueabihf results in the following ICE:

during RTL pass: combine
foo.c: In function '__gnu_saturate1sq':
foo.c:19:1: internal compiler error: in decompose, at rtl.h:2297
   19 | }
  | ^
0xaa23e3 wi::int_traits >::decompose(long
long*, unsigned int, std::pair const&)
../../gcc/gcc/rtl.h:2297
0xaf5ab3 wide_int_ref_storage::wide_int_ref_storage
>(std::pair const&)
../../gcc/gcc/wide-int.h:1030
0xaf5023 generic_wide_int
>::generic_wide_int >(std::pair const&)
../../gcc/gcc/wide-int.h:788
0xf916f9 simplify_const_unary_operation(rtx_code, machine_mode, rtx_def*,
machine_mode)
../../gcc/gcc/simplify-rtx.cc:2131
0xf8bad5 simplify_context::simplify_unary_operation(rtx_code, machine_mode,
rtx_def*, machine_mode)
../../gcc/gcc/simplify-rtx.cc:889
0xf8a591 simplify_context::simplify_gen_unary(rtx_code, machine_mode, rtx_def*,
machine_mode)
../../gcc/gcc/simplify-rtx.cc:360
0x9bd1b7 simplify_gen_unary(rtx_code, machine_mode, rtx_def*, machine_mode)
../../gcc/gcc/rtl.h:3520
0x1bd5677 simplify_comparison
../../gcc/gcc/combine.cc:13125
0x1bc2b2b simplify_set
../../gcc/gcc/combine.cc:6848
0x1bc1647 combine_simplify_rtx
../../gcc/gcc/combine.cc:6353
0x1bbf97f subst
../../gcc/gcc/combine.cc:5609
0x1bb864b try_combine
../../gcc/gcc/combine.cc:3302
0x1bb30fb combine_instructions
../../gcc/gcc/combine.cc:1264
0x1bd8d25 rest_of_handle_combine
../../gcc/gcc/combine.cc:15059
0x1bd8dd5 execute
../../gcc/gcc/combine.cc:15103
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.


Thanks,
Prathamesh
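
For reference, the two bounds computed in the test-case reduce to the limits of a signed 32-bit value. This is a minimal sketch, assuming int64_t stands in for DItype; the helper names are illustrative only, and the minimum is written as a negation because shifting 1 into the sign bit and right-shifting a negative value are not portable ISO C.

```c
#include <stdint.h>

/* max = ((DItype) 1 << 31) - 1  */
static int64_t
saturate_max (void)
{
  return ((int64_t) 1 << 31) - 1;
}

/* min = ((DItype) 1 << 63) >> 32, which with an arithmetic right shift
   is -(2^31).  Written here without the non-portable shifts.  */
static int64_t
saturate_min (void)
{
  return -((int64_t) 1 << 31);
}

/* The clamp performed by the test-case.  */
static void
saturate (int64_t *a)
{
  if (*a > saturate_max ())
    *a = saturate_max ();
  else if (*a < saturate_min ())
    *a = saturate_min ();
}
```

Viewed as a 64-bit unsigned value, the minimum is 0xffffffff80000000, which is the sign-extended constant that shows up in the compare insn.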

[Bug middle-end/110857] New: aarch64-linux-gnu profiledbootstrap broken

2023-07-31 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110857

Bug ID: 110857
   Summary: aarch64-linux-gnu profiledbootstrap broken
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Bootstrapping gcc with profiledbootstrap results in the following failure:

during GIMPLE pass: ivcanon
../../gcc/gcc/cfgrtl.cc: In function ‘bool could_fall_through(basic_block,
basic_block)’:
../../gcc/gcc/cfgrtl.cc:670:1: internal compiler error: in operator>, at
profile-count.h:995
  670 | could_fall_through (basic_block src, basic_block target)
  | ^~
0xc6c89f profile_count::operator>(profile_count const&) const
../../gcc/gcc/profile-count.h:995
0xc6c89f profile_count::operator>(profile_count const&) const
../../gcc/gcc/profile-count.h:987
0xc6c89f update_loop_exit_probability_scale_dom_bbs(loop*, edge_def*,
profile_count)
../../gcc/gcc/cfgloopmanip.cc:641
0xc6cb2b scale_loop_profile(loop*, profile_probability, long)
../../gcc/gcc/cfgloopmanip.cc:776
0x1338a5f try_unroll_loop_completely
../../gcc/gcc/tree-ssa-loop-ivcanon.cc:927
0x1338a5f canonicalize_loop_induction_variables
../../gcc/gcc/tree-ssa-loop-ivcanon.cc:1274
0x13396cf canonicalize_induction_variables()
../../gcc/gcc/tree-ssa-loop-ivcanon.cc:1317
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

This is most likely caused by:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=88618fa0211d77d91b70f7af9b02e08a34b57912

Thanks,
Prathamesh

[Bug tree-optimization/110280] [13/14 Regression] internal compiler error: in const_unop, at fold-const.cc:1884

2023-06-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110280

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |prathamesh3492 at gcc dot gnu.org

--- Comment #11 from prathamesh3492 at gcc dot gnu.org ---
Hi, sorry for the breakage, I will take a look.

Thanks,
Prathamesh

[Bug target/107920] [13 Regression] ICE in execute_todo, at passes.cc:2140

2022-12-02 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107920

--- Comment #14 from prathamesh3492 at gcc dot gnu.org ---
Posted patch:
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607714.html

Thanks,
Prathamesh

[Bug target/107920] [13 Regression] ICE in execute_todo, at passes.cc:2140

2022-12-01 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107920

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #53992|0                           |1
        is obsolete|                            |

--- Comment #13 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 54001
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54001&action=edit
untested fix 2

Hi Jakub, thanks for the suggestions. The issue with the previous patch was that
it used gimple_seq_add_stmt and passed the resulting seq to
gsi_replace_with_seq_vops. The attached patch uses
gimple_seq_add_stmt_without_update instead, which resolves the issue without
calling update_ssa(). Does it look OK?

Thanks,
Prathamesh

[Bug target/107920] [13 Regression] ICE in execute_todo, at passes.cc:2140

2022-11-30 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107920

--- Comment #11 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 53992
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53992&action=edit
untested fix

Thanks for the suggestions. The attached patch uses gsi_replace_with_seq_vops
to preserve the VUSE, which prevents the issue and results in the following dump
for fre:

  <bb 2> :
  # VUSE <.MEM_2(D)>
  _5 = MEM  [(signed char * {ref-all})x_3(D)];
  _4 = VEC_PERM_EXPR <_5, _5, { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, ... }>;
  # VUSE <.MEM_2(D)>
  return _4;

and the following code-gen:
test_s8:
        ldr     q0, [x0]
        dup     z0.q, z0.q[0]
        ret

I am not sure, though, whether using update_ssa in the patch is ideal. If not,
could you please suggest a better alternative?

Thanks,
Prathamesh

[Bug tree-optimization/106360] [13 regression] ICE in many test cases after r13-1745-g4c323130257744

2022-07-20 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106360

--- Comment #1 from prathamesh3492 at gcc dot gnu.org ---
Hi,
Sorry for the breakage. I will take a look.

[Bug target/96339] [SVE] Optimise svlast[ab]

2021-10-07 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96339

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Tejas Belagod from comment #3)
> > Are you still working on this PR ? If not, can I assign it to myself ?
> 
> Yes I am - its almost done - just been busy with a few higher priority
> things. I'll find some time to clean it up, test it and post it soon. Just
> curious - is there some urgency for this fix?
> 
> Thanks,
> Tejas.

Hi Tejas,
Thanks for the heads up. No urgency, I was just looking around for missed
optimizations related to SVE in the bugzilla ;-)

Thanks,
Prathamesh

[Bug target/96339] [SVE] Optimise svlast[ab]

2021-10-06 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96339

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |prathamesh3492 at gcc dot gnu.org

--- Comment #2 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Tejas Belagod from comment #1)
> Small correction - the sequence translates to
>   umov    w0, v0.b[1]

Hi Tejas,
Are you still working on this PR ? If not, can I assign it to myself ?

Thanks,
Prathamesh

[Bug target/93183] SVE does not use neg as conditional

2021-10-04 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93183

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > We get:
> > .L3:
> >         ld1b    z0.b, p0/z, [x1, x3]
> >         movprfx z2, z0
> >         and     z2.b, z2.b, #0xc0
> >         neg     z1.b, p1/m, z0.b        ;;;  THIS
> >         cmpeq   p2.b, p1/z, z2.b, #0
> >         sel     z0.b, p2, z0.b, z1.b    ;;;  AND THIS
> >         st1b    z0.b, p0, [x0, x3]
> >         incb    x3
> >         whilelo p0.b, w3, w2
> >         b.any   .L3
> > 
> > The two instructions marked should be combined.
> 
> The problem is that it isn't a straight combination of the
> NEG and SEL, because the condition is the inverse of the one
> that we want for predication.
IIUC, sel is redundant and we could generate the following code
for the inner loop instead?

.L3:
        ld1b    z0.b, p0/z, [x1, x3]
        movprfx z2, z0
        and     z2.b, z2.b, #0xc0
        cmpne   p2.b, p1/z, z2.b, #0
        neg     z0.b, p2/m, z0.b
        st1b    z0.b, p0, [x0, x3]
        incb    x3
        whilelo p0.b, w3, w2
        b.any   .L3
> 
> This is one of the things that the IFN_COND_* functions were
> designed to fix.  We should probably add unary versions of those.

The input to isel pass is:
vect__3.11_39 = .MASK_LOAD (_22, 8B, loop_mask_38);
vect_t1_15.12_41 = vect__3.11_39 & { 192, ... };
vect_t_12.13_42 = -vect__3.11_39;
_44 = vect_t1_15.12_41 == { 0, ... };
vect_iftmp.14_45 = VEC_COND_EXPR <_44, vect__3.11_39, vect_t_12.13_42>;

where vect__3.11_39 and vect_t_12.13_42 are negatives of each other.

I suppose in isel pass if we come across vec_cond_expr of the form:
op2 = vec_cond_expr
then we could lower it to a new internal function say IFN_COND_NEG.

IFN_COND_NEG could use a new optab say cond_neg_optab to expand it to:
movprfx op2, op1
set predicate according to inverted cond
op2 = predicate/m neg op2

Does this look reasonable ?

Thanks,
Prathamesh
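
To make the claimed equivalence concrete, here is a minimal scalar sketch in C of the two forms discussed above. It is not SVE code and not the original test-case: the function names are illustrative, lanes are modelled as uint8_t (matching the { 192, ... } mask in the isel input), and negation is taken modulo 256.

```c
#include <stddef.h>
#include <stdint.h>

/* Select form, as in the VEC_COND_EXPR: negate every lane
   unconditionally, then pick either the original or the negated value.  */
static void
neg_then_select (uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint8_t negated = (uint8_t) -src[i];
      dst[i] = (src[i] & 0xc0) == 0 ? src[i] : negated;
    }
}

/* Predicated form: the inverted condition acts as the predicate, and
   only the active lanes are negated (merging predication).  */
static void
predicated_neg (uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint8_t x = src[i];
      if ((x & 0xc0) != 0)
        x = (uint8_t) -x;
      dst[i] = x;
    }
}
```

Both functions produce the same result for every possible lane value, which is why the separate select is not needed once the negate is predicated on the inverted comparison.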

[Bug target/66791] [ARM] Replace builtins with gcc vector extensions code

2021-06-21 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66791

--- Comment #8 from prathamesh3492 at gcc dot gnu.org ---
Patch committed for vceq:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=316dd79876873222552bdf6aa31338012bc9b955

[Bug target/97903] [ARM NEON] Missed optimization in lowering test operation

2021-05-05 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97903

--- Comment #2 from prathamesh3492 at gcc dot gnu.org ---
Fixed in
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d9937da063e5847f45f7f1f7a02bed7dbc8fb2f6

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #17 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Martin Liška from comment #15)
> I see, so it's a real issue and I support the workaround mentioned in
> Comment 10.
> Please send it to the mailing list.

Patch posted:
https://gcc.gnu.org/pipermail/gcc-patches/2021-January/563848.html

Thanks,
Prathamesh

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #16 from prathamesh3492 at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #14)
> I just ran into the same problem, with a slightly different testcase:
This is a better one to reproduce the issue, thanks! I verified the patch in
comment 10 resolves the ICE for this test-case.

Thanks,
Prathamesh
> 
> > cat crash.c
> 
> #pragma GCC push_options
> #pragma GCC target ("arch=armv8.2-a+fp16")
> #pragma GCC pop_options
> 
> results in the same crash:
> 
> crash.c:3:9: internal compiler error: 'global_options' are modified in local
> context
> 3 | #pragma GCC pop_options
>   | ^~~
> 0x1199c6d cl_optimization_compare(gcc_options*, gcc_options*)
> build-arm-none-eabi/obj/gcc2/gcc/options-save.c:14897
> 0xb38463 handle_pragma_pop_options
> src/gcc/gcc/c-family/c-pragma.c:1092
> 0xb38eef c_invoke_pragma_handler(unsigned int)
> src/gcc/gcc/c-family/c-pragma.c:1515
> 0xa80622 c_parser_pragma
> src/gcc/gcc/c/c-parser.c:12525
> 0xa63dc6 c_parser_external_declaration
> src/gcc/gcc/c/c-parser.c:1758
> 0xa63938 c_parser_translation_unit
> src/gcc/gcc/c/c-parser.c:1650
> 0xaa6139 c_parse_file()
> src/gcc/gcc/c/c-parser.c:21990
> 0xb322f2 c_common_parse_file()
> src/gcc/gcc/c-family/c-opts.c:1211

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #13 from prathamesh3492 at gcc dot gnu.org ---
IIUC, the issue comes from the following mismatch in cl_optimization_compare:

  if (ptr1->x_arm_fp16_format != ptr2->x_arm_fp16_format)
    internal_error ("%<global_options%> are modified in local context");

x_arm_fp16_format is of following type defined in arm-opts.h:

/* Which __fp16 format to use.
   The enumeration values correspond to the numbering for the
   Tag_ABI_FP_16bit_format attribute.
 */
enum arm_fp16_format_type
{
  ARM_FP16_FORMAT_NONE = 0,
  ARM_FP16_FORMAT_IEEE = 1,
  ARM_FP16_FORMAT_ALTERNATIVE = 2
};

For the test-case passing -mfp16-format=alternative results in:
ptr1->x_arm_fp16_format == ARM_FP16_FORMAT_ALTERNATIVE and
ptr2->x_arm_fp16_format == ARM_FP16_FORMAT_IEEE,
and the mismatch results in ICE.

Thanks,
Prathamesh
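
The mismatch described above can be reproduced in isolation with a small sketch. This is not the generated options-save.c: struct options_sketch and fp16_format_mismatch are hypothetical stand-ins, and only the enum is taken from the message.

```c
#include <stdbool.h>

/* Enumeration as quoted from arm-opts.h above.  */
enum arm_fp16_format_type
{
  ARM_FP16_FORMAT_NONE = 0,
  ARM_FP16_FORMAT_IEEE = 1,
  ARM_FP16_FORMAT_ALTERNATIVE = 2
};

/* Hypothetical, cut-down stand-in for struct gcc_options.  */
struct options_sketch
{
  enum arm_fp16_format_type x_arm_fp16_format;
};

/* Returns true in the situation where the quoted check would call
   internal_error.  */
static bool
fp16_format_mismatch (const struct options_sketch *ptr1,
                      const struct options_sketch *ptr2)
{
  return ptr1->x_arm_fp16_format != ptr2->x_arm_fp16_format;
}
```

With one side holding ARM_FP16_FORMAT_ALTERNATIVE and the other ARM_FP16_FORMAT_IEEE, the helper returns true, which corresponds to the ICE path.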

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #12 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 50003
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50003&action=edit
options-save.c

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-18 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #10 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 49997
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49997&action=edit
untested fix

Hi,
Sorry for the late response. The option that seemed to be causing the issue was
arm_fp16_format. The attached patch fixes the issue by excluding it from checks
in cl_optimization_compare, which prevents the ICE for me.
Does this look OK ?

Thanks,
Prathamesh

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-12 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #7 from prathamesh3492 at gcc dot gnu.org ---
I think the error is correct.
CCing Kyrill -- could you please confirm if the error is valid for
above case ?
Thanks!

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-12 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #5 from prathamesh3492 at gcc dot gnu.org ---
Hi,
Unfortunately I am still getting the same ICE with
g:e91910d3576eeac714c93ec25ea3b15012007903.

Thanks,
Prathamesh

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-12 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 49954
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49954&action=edit
Output of passing --verbose

Command line option used to compile:
../arm-stage1-build/gcc/xgcc -B ../arm-stage1-build/gcc 
-mfp16-format=alternative test.c -S -save-temps  --verbose >verbose_output.txt
2>&1

[Bug target/98636] [ARM] ICE on passing incompatible options for fp16 - global_options’ are modified in local context

2021-01-12 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

--- Comment #2 from prathamesh3492 at gcc dot gnu.org ---
Created attachment 49953
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49953&action=edit
Preprocessed test-case

[Bug target/98636] New: [ARM] ICE on passing incompatible options for fp16

2021-01-12 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98636

Bug ID: 98636
   Summary: [ARM] ICE on passing incompatible options for fp16
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

For any test-case that includes arm_neon.h, for instance:
#include <arm_neon.h>
void f() {}

Passing an incompatible fp16 format seems to result in an ICE.
For example, passing -mfp16-format=alternative resulted in:
In file included from test.c:1:
../arm-stage1-build/gcc/include/arm_neon.h:18122:9: error: selected fp16
options are incompatible
18122 | #pragma GCC target ("arch=armv8.2-a+fp16fml")
  | ^~~
../arm-stage1-build/gcc/include/arm_neon.h:18324:9: internal compiler error:
‘global_options’ are modified in local context
18324 | #pragma GCC pop_options
  | ^~~
0xdcb103 cl_optimization_compare(gcc_options*, gcc_options*)
   
/home/bilbo/gnu-toolchain/gcc/vfma/arm-stage1-build/gcc/options-save.c:12555
0x97d54d handle_pragma_pop_options
../../gcc/gcc/c-family/c-pragma.c:1092
0x8f3cbb c_parser_pragma
../../gcc/gcc/c/c-parser.c:12525
0x91aab5 c_parser_external_declaration
../../gcc/gcc/c/c-parser.c:1758
0x91b269 c_parser_translation_unit
../../gcc/gcc/c/c-parser.c:1650
0x91b269 c_parse_file()
../../gcc/gcc/c/c-parser.c:21935
0x97b045 c_common_parse_file()
../../gcc/gcc/c-family/c-opts.c:1211


My built version is configured as:
Using built-in specs.
COLLECT_GCC=../arm-stage1-build/gcc/xgcc
Target: arm-linux-gnueabihf
Configured with: ../gcc/configure --enable-languages=c,c++ --disable-bootstrap
--target=arm-linux-gnueabihf --with-arch=armv7-a --with-fpu=neon
--with-float=hard --with-mode=thumb
--with-sysroot=/home/bilbo/gnu-toolchain/sysroots/arm-linux-gnueabihf
--disable-werror
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20210111 (experimental) (GCC)

Thanks,
Prathamesh

[Bug target/98537] [11 Regression] ICE in emit_move_insn since r11-5839

2021-01-08 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98537

--- Comment #6 from prathamesh3492 at gcc dot gnu.org ---
Thanks for the suggestions, I could reproduce it now.

Input to isel is:
  _1 = a_2(D) == b_3(D);
  c_4 = VEC_COND_EXPR <_1, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }>;
  return c_4;

For the following check added in r11-5839:
  if (integer_minus_onep (op1)
  && integer_zerop (op2)
  && TYPE_MODE (TREE_TYPE (lhs)) == TYPE_MODE (TREE_TYPE (op0))
  && expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))

With -march=skylake-avx512, it seems the TYPE_MODE (TREE_TYPE (lhs))
and TYPE_MODE (TREE_TYPE (op0)) do not agree, and we bail out.

That happens because the mode of lhs (c_4) is V4SI while the mode of op0 (_1) is QI.
Without -march=skylake-avx512, the type mode is V4SI for both lhs and op0.

With -march=skylake-avx512, c_4's type is vector(4) 
and without it, the type is vector(4) .

Thanks,
Prathamesh

[Bug target/98537] [11 Regression] ICE in emit_move_insn since r11-5839

2021-01-07 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98537

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi,
It seems to work on my machine for x86_64.
Compiling with -O3 (or -O2),
.optimized dump shows:

v4si foo (v4si b, v4si a)
{
  v4si c;
  vector(4)  _1;

  <bb 2> [local count: 1073741824]:
  _1 = a_2(D) == b_3(D);
  c_4 = VIEW_CONVERT_EXPR(_1);
  return c_4;

}

I tried on top of af362af18f405c34840d820143aa3a94f72fce4d.

Btw, on ARM it seems to "scalarize" the code,
.optimized dump shows:

  _6 = BIT_FIELD_REF ;
  _7 = BIT_FIELD_REF ;
  _8 = _6 == _7 ? -1 : 0;
  _9 = BIT_FIELD_REF ;
  _10 = BIT_FIELD_REF ;
  _11 = _9 == _10 ? -1 : 0;
  _12 = BIT_FIELD_REF ;
  _13 = BIT_FIELD_REF ;
  _14 = _12 == _13 ? -1 : 0;
  _15 = BIT_FIELD_REF ;
  _16 = BIT_FIELD_REF ;
  _17 = _15 == _16 ? -1 : 0;
  c_4 = {_8, _11, _14, _17};
  return c_4;

Thanks,
Prathamesh

[Bug target/98435] [ARM NEON] Missed optimization in expanding vector constructor

2020-12-23 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98435

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
              Build|                            |x86_64-unknown-linux-gnu
           Keywords|                            |missed-optimization
                 CC|                            |prathamesh3492 at gcc dot gnu.org
             Target|                            |arm-linux-gnueabi
               Host|                            |x86_64-unknown-linux-gnu

[Bug target/98435] New: [ARM NEON] Missed optimization in expanding vector constructor

2020-12-23 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98435

Bug ID: 98435
   Summary: [ARM NEON] Missed optimization in expanding vector
constructor
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

For the following test-case:

#include <arm_neon.h>

bfloat16x4_t f1 (bfloat16_t a)
{
  return vdup_n_bf16 (a);
}

bfloat16x4_t f2 (bfloat16_t a)
{
  return (bfloat16x4_t) {a, a, a, a};
}

Compiling with arm-linux-gnueabi -O3 -mfpu=neon -mfloat-abi=softfp 
-march=armv8.2-a+bf16+fp16 results in f2 not being vectorized:

f1:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vdup.16 d16, r0
        vmov    r0, r1, d16  @ v4bf
        bx      lr


f2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r3, r0  @ __bf16
        adr     r1, .L4
        ldrd    r0, [r1]
        mov     r2, r3  @ __bf16
        mov     ip, r3  @ __bf16
        bfi     r1, r2, #0, #16
        bfi     r0, ip, #0, #16
        bfi     r1, r3, #16, #16
        bfi     r0, r2, #16, #16
        bx      lr


.optimized dump shows:
bfloat16x4_t f1 (bfloat16_t a)
{
  __simd64_bfloat16_t _3;

  <bb 2> [local count: 1073741824]:
  _3 = __builtin_neon_vdup_nv4bf (a_2(D)); [tail call]
  return _3;

}

bfloat16x4_t f2 (bfloat16_t a)
{
  bfloat16x4_t _2;

  <bb 2> [local count: 1073741824]:
  _2 = {a_1(D), a_1(D), a_1(D), a_1(D)};
  return _2;
}

[Bug c/98200] New: [GIMPLE FE] ICE with parsing ternary expr with -fgimple

2020-12-08 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98200

Bug ID: 98200
   Summary: [GIMPLE FE] ICE with parsing ternary expr with
-fgimple
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

The following test-case ICEs with -fgimple:

int __GIMPLE() f(int x, int y)
{
  int a;
  a = (x < y) ? 1 : 2;
  return a;
}


foo.c: In function ‘f’:
foo.c:4:7: error: expected expression before ‘(’ token
4 |   a = (x < y) ? 1 : 2;
  |   ^
foo.c:4:7: internal compiler error: in extract_ops_from_tree, at
gimple-expr.c:556
0x6a1b64 extract_ops_from_tree(tree_node*, tree_code*, tree_node**,
tree_node**, tree_node**)
../../gcc/gcc/gimple-expr.c:556
0xb9ab20 gimple_build_assign(tree_node*, tree_node*)
../../gcc/gcc/gimple.c:436
0x91788a c_parser_gimple_statement
../../gcc/gcc/c/gimple-parser.c:879
0x91788a c_parser_gimple_compound_statement
../../gcc/gcc/c/gimple-parser.c:649
0x91788a c_parser_gimple_compound_statement
../../gcc/gcc/c/gimple-parser.c:381
0x919b77 c_parser_parse_gimple_body(c_parser*, char*, c_declspec_il,
profile_count)
../../gcc/gcc/c/gimple-parser.c:253
0x908e77 c_parser_declaration_or_fndef
../../gcc/gcc/c/c-parser.c:2533
0x9106b3 c_parser_external_declaration
../../gcc/gcc/c/c-parser.c:1777
0x99 c_parser_translation_unit
../../gcc/gcc/c/c-parser.c:1650
0x99 c_parse_file()
../../gcc/gcc/c/c-parser.c:21877
0x970de5 c_common_parse_file()
../../gcc/gcc/c-family/c-opts.c:1211

The ICE does not happen when the parens around x < y are removed.
I guess even if this isn't syntactically valid input to the GIMPLE FE,
it shouldn't ICE during parsing?

Thanks,
Prathamesh

[Bug tree-optimization/97849] [10/11 Regression] aarch64: ICE (segfault) during GIMPLE pass: ifcvt since r10-3543-gf30b3d28

2020-11-23 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97849

--- Comment #3 from prathamesh3492 at gcc dot gnu.org ---
Fixed on trunk.

[Bug target/97906] New: [ARM NEON] Missed optimization in lowering to vcage

2020-11-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97906

Bug ID: 97906
   Summary: [ARM NEON] Missed optimization in lowering to vcage
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
Similar to PR97872 and PR97903, for following test-case:

#include <arm_neon.h>

uint32x2_t f1(float32x2_t a, float32x2_t b)
{
  return vabs_f32 (a) >= vabs_f32 (b);
}

uint32x2_t f2(float32x2_t a, float32x2_t b)
{
  return (uint32x2_t) __builtin_neon_vcagev2sf (a, b);
}

Code-gen:

f2:
        vacge.f32       d0, d0, d1
        bx      lr

f1:
        vabs.f32        d0, d0
        vabs.f32        d1, d1
        sub     sp, sp, #8
        vmov.32 r3, d0[0]
        vmov    s13, r3
        vmov.32 r3, d1[0]
        vmov    s12, r3
        vmov.32 r3, d1[1]
        vcmpe.f32       s12, s13
        vmov    s14, r3
        vmov.32 r3, d0[1]
        vmrs    APSR_nzcv, FPSCR
        vmov    s15, r3
        ite     ls
        movls   r3, #-1
        movhi   r3, #0
        vcmpe.f32       s14, s15
        str     r3, [sp]
        vmrs    APSR_nzcv, FPSCR
        ite     ls
        movls   r3, #-1
        movhi   r3, #0
        str     r3, [sp, #4]
        vldr    d0, [sp]
        add     sp, sp, #8
        @ sp needed
        bx      lr

For f1, it is initially lowered to:

f1 (float32x2_t a, float32x2_t b)
{
  vector(2)  _1;
  vector(2) int _2;
  uint32x2_t _6;
  __simd64_float32_t _7;
  __simd64_float32_t _8;

  <bb 2> [local count: 1073741824]:
  _8 = __builtin_neon_vabsv2sf (a_4(D));
  _7 = __builtin_neon_vabsv2sf (b_5(D));
  _1 = _7 <= _8;
  _2 = VEC_COND_EXPR <_1, { -1, -1 }, { 0, 0 }>;
  _6 = VIEW_CONVERT_EXPR(_2);
  return _6;
}

and veclower seems to "scalarize" the cond_expr op:

f1 (float32x2_t a, float32x2_t b)
{
  vector(2) int _2;
  uint32x2_t _6;
  __simd64_float32_t _7;
  __simd64_float32_t _8;
  float _11;
  float _12;
  int _13;
  float _14;
  float _15;
  int _16;

  <bb 2> [local count: 1073741824]:
  _8 = __builtin_neon_vabsv2sf (a_4(D));
  _7 = __builtin_neon_vabsv2sf (b_5(D));
  _11 = BIT_FIELD_REF <_7, 32, 0>;
  _12 = BIT_FIELD_REF <_8, 32, 0>;
  _13 = _11 <= _12 ? -1 : 0;
  _14 = BIT_FIELD_REF <_7, 32, 32>;
  _15 = BIT_FIELD_REF <_8, 32, 32>;
  _16 = _14 <= _15 ? -1 : 0;
  _2 = {_13, _16};
  _6 = VIEW_CONVERT_EXPR(_2);
  return _6;

}

Thanks,
Prathamesh
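
As a reference for what both f1 and f2 are expected to compute per lane, here is a minimal scalar sketch in plain C. It is an illustration only: abs_f32 and abs_ge_mask are hypothetical names, no NEON intrinsics are used, and abs_f32 is a simplified absolute value that ignores NaN and signed-zero details.

```c
#include <stdint.h>

/* Simplified absolute value, enough for ordinary finite inputs.  */
static float
abs_f32 (float x)
{
  return x < 0.0f ? -x : x;
}

/* Lane-wise |a| >= |b|, producing an all-ones or all-zeros mask per
   lane, which is the result a single absolute-compare instruction
   gives.  */
static void
abs_ge_mask (uint32_t out[2], const float a[2], const float b[2])
{
  for (int i = 0; i < 2; i++)
    out[i] = abs_f32 (a[i]) >= abs_f32 (b[i]) ? UINT32_MAX : 0;
}
```

The scalarized veclower output above performs this same per-lane comparison, but does so by extracting each element to a scalar register instead of staying in the vector unit.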

[Bug target/97903] [ARM NEON] Missed optimization in lowering test operation

2020-11-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97903

prathamesh3492 at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
               Host|                            |x86_64-unknown-linux-gnu
              Build|                            |x86_64-unknown-linux-gnu
                 CC|                            |clyon at gcc dot gnu.org,
                   |                            |prathamesh3492 at gcc dot gnu.org
             Target|                            |arm-linux-gnueabihf
           Assignee|unassigned at gcc dot gnu.org |prathamesh3492 at gcc dot gnu.org
           Severity|normal                      |enhancement

[Bug target/97903] New: [ARM NEON] Missed optimization in lowering test operation

2020-11-19 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97903

Bug ID: 97903
   Summary: [ARM NEON] Missed optimization in lowering test
operation
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
For the following test-case:

#include <arm_neon.h>

uint8x8_t f1(int8x8_t a, int8x8_t b) {
  return (uint8x8_t) ((a & b) != 0);
}

uint8x8_t f2(int8x8_t a, int8x8_t b) {
  return vtst_s8 (a, b);
}

Code-gen:

f2:
        vtst.8  d0, d0, d1
        bx      lr


f1:
        vmov.i32        d16, #0  @ v8qi
        vand    d1, d0, d1
        vmov.i32        d17, #0xffffffff  @ v8qi
        vceq.i8 d1, d1, d16
        vbsl    d1, d16, d17
        vmov    d0, d1  @ v8qi
        bx      lr

The optimized dump for f1 shows:
  _1 = a_4(D) & b_5(D);
  _3 = .VCOND (_1, { 0, 0, 0, 0, 0, 0, 0, 0 }, { -1, -1, -1, -1, -1, -1, -1, -1
}, { 0, 0, 0, 0, 0, 0, 0, 0 }, 113);
  _6 = VIEW_CONVERT_EXPR(_3);

I think we miss an opportunity to combine an AND followed by a VCOND into a
vector test instruction. Should we add a .VTEST internal function that expands
to vtst? Or, alternatively, add a peephole pattern in the backend?

Thanks,
Prathamesh
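
The equivalence that would justify such a combination can be checked with a minimal scalar sketch in plain C. The function names are illustrative and no NEON intrinsics are used: and_then_select models the AND, compare-with-zero and select sequence generated for f1, and test_bits models what a single vector test computes per lane.

```c
#include <stdint.h>

/* f1 as generated: AND, compare the result against zero, then select
   all-zeros where the compare was true and all-ones elsewhere.  */
static void
and_then_select (uint8_t out[8], const int8_t a[8], const int8_t b[8])
{
  for (int i = 0; i < 8; i++)
    {
      int8_t anded = (int8_t) (a[i] & b[i]);
      uint8_t is_zero = anded == 0 ? 0xff : 0x00;
      out[i] = is_zero ? 0x00 : 0xff;
    }
}

/* f2: a lane is all-ones when a and b share at least one set bit.  */
static void
test_bits (uint8_t out[8], const int8_t a[8], const int8_t b[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = (a[i] & b[i]) != 0 ? 0xff : 0x00;
}
```

Both functions yield identical masks for any input, so the three-instruction sequence is a candidate for replacement by one test instruction.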

[Bug target/97872] New: Missed optimization for less-than comparison on vectors

2020-11-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97872

            Bug ID: 97872
           Summary: Missed optimization for less-than comparison on vectors
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
For the following test-case:

#include <arm_neon.h>

uint8x8_t f1(int8x8_t a, int8x8_t b) {
  return a < b;
}

uint8x8_t f2(int8x8_t a, int8x8_t b) {
  return vclt_s8 (a, b);
}

Code-gen for f2 uses the vcgt insn:
f2:
        vcgt.s8 d0, d1, d0
        bx      lr

However code-gen for f1 results in:
f1:
        vmov.i32        d16, #0xffffffff  @ v8qi
        vmov.i32        d17, #0  @ v8qi
        vcgt.s8 d0, d1, d0
        vbsl    d0, d16, d17
        bx      lr

which IIUC is redundant, since vcgt will set all-ones, or all-zeros in d0
depending on the comparison.
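
A minimal scalar sketch of that argument, in plain C. This is a hypothetical
per-lane model rather than GCC code: the helper names bsl, cgt_s8,
lt_with_select, lt_compare_only and count_select_mismatches are invented for
illustration, and vbsl is assumed to pick bits from its first source where the
mask bit is set and from the second source otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of vbsl on one byte (assumed semantics):
   take bits of x where mask is 1, bits of y where mask is 0. */
static uint8_t bsl(uint8_t mask, uint8_t x, uint8_t y)
{
    return (uint8_t)((mask & x) | (~mask & y));
}

/* Hypothetical model of one lane of vcgt.s8:
   all-ones if lhs > rhs (signed), else zero. */
static uint8_t cgt_s8(int8_t lhs, int8_t rhs)
{
    return (lhs > rhs) ? 0xFF : 0x00;
}

/* f1 as currently generated: compare, then select between -1 and 0. */
static uint8_t lt_with_select(int8_t a, int8_t b)
{
    return bsl(cgt_s8(b, a), 0xFF, 0x00);
}

/* f2: the comparison result is used directly. */
static uint8_t lt_compare_only(int8_t a, int8_t b)
{
    return cgt_s8(b, a);
}

/* Count input pairs on which the two forms disagree. */
static int count_select_mismatches(void)
{
    int mismatches = 0;
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++)
            if (lt_with_select((int8_t)a, (int8_t)b)
                != lt_compare_only((int8_t)a, (int8_t)b))
                mismatches++;
    return mismatches;
}
```

Because the mask is already all-ones or all-zeros in each lane, selecting
between -1 and 0 with it reproduces the mask itself, so under these
assumptions the two vmov and the vbsl add nothing.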

This happens because vclt_s8 uses __builtin_neon_vcgtv8qi, which emits
vcgt.s8, while f1 is lowered to VCOND in the optimized dump:

f1 (int8x8_t a, int8x8_t b)
{
  vector(8) signed char _2;
  uint8x8_t _5;

  <bb 2> [local count: 1073741824]:
  _2 = .VCOND (a_3(D), b_4(D), { -1, -1, -1, -1, -1, -1, -1, -1 }, { 0, 0, 0, 0, 0, 0, 0, 0 }, 107);
  _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_2);
  return _5;

}

and correspondingly expanded to:
;; _2 = .VCOND (a_3(D), b_4(D), { -1, -1, -1, -1, -1, -1, -1, -1 }, { 0, 0, 0, 0, 0, 0, 0, 0 }, 107);

(insn 7 6 8 (set (reg:V8QI 117)
        (const_vector:V8QI [
                (const_int -1 [0xffffffffffffffff]) repeated x8
            ])) "foo.c":4:12 -1
     (nil))

(insn 8 7 9 (set (reg:V8QI 118)
        (const_vector:V8QI [
                (const_int 0 [0]) repeated x8
            ])) "foo.c":4:12 -1
     (nil))

(insn 9 8 10 (set (reg:V8QI 119)
        (neg:V8QI (gt:V8QI (reg/v:V8QI 116 [ b ])
                (reg/v:V8QI 115 [ a ])))) "foo.c":4:12 -1
     (nil))

(insn 10 9 0 (set (reg:V8QI 113 [ _2 ])
        (unspec:V8QI [
                (reg:V8QI 119)
                (reg:V8QI 117)
                (reg:V8QI 118)
            ] UNSPEC_VBSL)) "foo.c":4:12 -1
     (nil))

Thanks,
Prathamesh

[Bug tree-optimization/97849] [10/11 Regression] aarch64: ICE (segfault) during GIMPLE pass: ifcvt since r10-3543-gf30b3d28

2020-11-16 Thread prathamesh3492 at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97849

--- Comment #1 from prathamesh3492 at gcc dot gnu.org ---
Hi,
Sorry for the breakage, will take a look.

Regards,
Prathamesh