[Bug target/96893] aarch64:Segmentation fault signal terminated program cc1

2020-09-02 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96893

--- Comment #2 from z.zhanghaijian at huawei dot com  ---
(In reply to Richard Biener from comment #1)
> Please provide a script to generate the testcase.  A backtrace would be nice
> as well, likely the stack blows up.

cat gen.sh
#!/bin/bash
echo 'int f31() { }' > test.c
for i in {30..1}
do
j=$(($i+1))
echo "void f$i() {printf(\"$i\\n\");  f$j(); }" >> test.c
done
echo "int main(int argc, char** argv){   f1(); }" >> test.c

The script can generate the testcase.

[Bug c/96834] [9/10/11 Regression] Segmentation fault signal terminated program cc1

2020-09-02 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96834

--- Comment #7 from z.zhanghaijian at huawei dot com  ---
(In reply to Richard Biener from comment #5)
> (In reply to z.zhanghaij...@huawei.com from comment #4)
> > The case like:
> > test.c:
> > int f31() { }
> >  void f30() {   printf("30\n"); f31(); }
> >  void f29() {   printf("29\n"); f30(); }
> >  void f28() {   printf("28\n"); f29(); }
> >  void f27() {   printf("27\n"); f28(); }
> > ...
> >  void f10() {   printf("10\n"); f11(); }
> >  void f9() {printf("9\n");  f10(); }
> >  void f8() {printf("8\n");  f9(); }
> >  void f7() {printf("7\n");  f8(); }
> >  void f6() {printf("6\n");  f7(); }
> >  void f5() {printf("5\n");  f6(); }
> >  void f4() {printf("4\n");  f5(); }
> >  void f3() {printf("3\n");  f4(); }
> >  void f2() {printf("2\n");  f3(); }
> >  void f1() {printf("1\n");  f2(); }
> >  int main(int argc, char** argv){   f1(); }
> > 
> > This can also produces the error on aarch64:
> > 
> > gcc test.c -S
> > gcc: internal compiler error: Segmentation fault signal terminated program
> > cc1
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See <https://gcc.gnu.org/bugs/> for instructions.
> 
> But that's sth entirely different and not vectorization triggered.

Yes, I'll re-submit a new PR.

[Bug target/96893] New: aarch64:Segmentation fault signal terminated program cc1

2020-09-02 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96893

Bug ID: 96893
   Summary: aarch64:Segmentation fault signal terminated program
cc1
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

A case like this:
test.c:
int f31() { }
 void f30() {   printf("30\n"); f31(); }
 void f29() {   printf("29\n"); f30(); }
 void f28() {   printf("28\n"); f29(); }
 void f27() {   printf("27\n"); f28(); }
...
 void f10() {   printf("10\n"); f11(); }
 void f9() {printf("9\n");  f10(); }
 void f8() {printf("8\n");  f9(); }
 void f7() {printf("7\n");  f8(); }
 void f6() {printf("6\n");  f7(); }
 void f5() {printf("5\n");  f6(); }
 void f4() {printf("4\n");  f5(); }
 void f3() {printf("3\n");  f4(); }
 void f2() {printf("2\n");  f3(); }
 void f1() {printf("1\n");  f2(); }
 int main(int argc, char** argv){   f1(); }

This produces the error on aarch64:

gcc test.c -S -w
gcc: internal compiler error: Segmentation fault signal terminated program cc1
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.

gcc version 11.0.0 20200902 (experimental) (GCC)

[Bug c/96834] [9/10/11 Regression] Segmentation fault signal terminated program cc1

2020-08-29 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96834

z.zhanghaijian at huawei dot com  changed:

   What|Removed |Added

 CC||z.zhanghaijian at huawei dot com

--- Comment #4 from z.zhanghaijian at huawei dot com  ---
A case like this:
test.c:
int f31() { }
 void f30() {   printf("30\n"); f31(); }
 void f29() {   printf("29\n"); f30(); }
 void f28() {   printf("28\n"); f29(); }
 void f27() {   printf("27\n"); f28(); }
...
 void f10() {   printf("10\n"); f11(); }
 void f9() {printf("9\n");  f10(); }
 void f8() {printf("8\n");  f9(); }
 void f7() {printf("7\n");  f8(); }
 void f6() {printf("6\n");  f7(); }
 void f5() {printf("5\n");  f6(); }
 void f4() {printf("4\n");  f5(); }
 void f3() {printf("3\n");  f4(); }
 void f2() {printf("2\n");  f3(); }
 void f1() {printf("1\n");  f2(); }
 int main(int argc, char** argv){   f1(); }

This also produces the error on aarch64:

gcc test.c -S
gcc: internal compiler error: Segmentation fault signal terminated program cc1
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug target/96582] New: aarch64:ICE during GIMPLE pass: veclower

2020-08-12 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96582

Bug ID: 96582
   Summary: aarch64:ICE during GIMPLE pass: veclower
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

For aarch64 SVE,

The case:
typedef unsigned char v32u8 __attribute__ ((vector_size (32)));

unsigned __attribute__((noinline, noclone))
foo(unsigned u)
{
  v32u8 v32u8_0 = (v32u8){} > (v32u8){-u};
  return v32u8_0[31] + v32u8_0[0];
}

This will cause an ICE when compiled with -S -march=armv8.5-a+sve
-msve-vector-bits=512.

Tracing the debug information shows that the error is caused by failing to find
the pattern corresponding to CODE_FOR_vcond_mask_vnx8qivnx8bi.

I tried to extend the mode of this pattern from SVE_FULL to SVE_ALL to fix it.

Proposed patch:
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -6722,11 +6722,11 @@ (define_insn "@aarch64_sve_"
 ;; UNSPEC_SEL operand order: mask, true, false (as for VEC_COND_EXPR)
 ;; SEL operand order:        mask, true, false
 (define_expand "@vcond_mask_<mode><vpred>"
-  [(set (match_operand:SVE_FULL 0 "register_operand")
-	(unspec:SVE_FULL
+  [(set (match_operand:SVE_ALL 0 "register_operand")
+	(unspec:SVE_ALL
 	  [(match_operand:<VPRED> 3 "register_operand")
-	   (match_operand:SVE_FULL 1 "aarch64_sve_reg_or_dup_imm")
-	   (match_operand:SVE_FULL 2 "aarch64_simd_reg_or_zero")]
+	   (match_operand:SVE_ALL 1 "aarch64_sve_reg_or_dup_imm")
+	   (match_operand:SVE_ALL 2 "aarch64_simd_reg_or_zero")]
 	  UNSPEC_SEL))]
   "TARGET_SVE"
   {
@@ -6740,11 +6740,11 @@ (define_expand "@vcond_mask_<mode><vpred>"
 ;; - a duplicated immediate and a register
 ;; - a duplicated immediate and zero
 (define_insn "*vcond_mask_<mode><vpred>"
-  [(set (match_operand:SVE_FULL 0 "register_operand" "=w, w, w, w, ?w, ?&w, ?&w")
-	(unspec:SVE_FULL
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w, w, w, w, ?w, ?&w, ?&w")
+	(unspec:SVE_ALL
 	  [(match_operand:<VPRED> 3 "register_operand" "Upa, Upa, Upa, Upa, Upl, Upl, Upl")
-	   (match_operand:SVE_FULL 1 "aarch64_sve_reg_or_dup_imm" "w, vss, vss, Ufc, Ufc, vss, Ufc")
-	   (match_operand:SVE_FULL 2 "aarch64_simd_reg_or_zero" "w, 0, Dz, 0, Dz, w, w")]
+	   (match_operand:SVE_ALL 1 "aarch64_sve_reg_or_dup_imm" "w, vss, vss, Ufc, Ufc, vss, Ufc")
+	   (match_operand:SVE_ALL 2 "aarch64_simd_reg_or_zero" "w, 0, Dz, 0, Dz, w, w")]
 	  UNSPEC_SEL))]
   "TARGET_SVE
    && (!register_operand (operands[1], <MODE>mode)

Any suggestions?

[Bug target/96581] New: aarch64:ICE during GIMPLE pass: veclower

2020-08-12 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96581

Bug ID: 96581
   Summary: aarch64:ICE during GIMPLE pass: veclower
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

For aarch64 SVE,

The case:
typedef unsigned char v32u8 __attribute__ ((vector_size (32)));

unsigned __attribute__((noinline, noclone))
foo(unsigned u)
{
  v32u8 v32u8_0 = (v32u8){} > (v32u8){-u};
  return v32u8_0[31] + v32u8_0[0];
}

This will cause an ICE when compiled with -S -march=armv8.5-a+sve
-msve-vector-bits=512.

Tracing the debug information shows that the error is caused by failing to find
the pattern corresponding to CODE_FOR_vcond_mask_vnx8qivnx8bi.

I tried to extend the mode of this pattern from SVE_FULL to SVE_ALL to fix it.

Proposed patch:
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -6722,11 +6722,11 @@ (define_insn "@aarch64_sve_"
 ;; UNSPEC_SEL operand order: mask, true, false (as for VEC_COND_EXPR)
 ;; SEL operand order:        mask, true, false
 (define_expand "@vcond_mask_<mode><vpred>"
-  [(set (match_operand:SVE_FULL 0 "register_operand")
-	(unspec:SVE_FULL
+  [(set (match_operand:SVE_ALL 0 "register_operand")
+	(unspec:SVE_ALL
 	  [(match_operand:<VPRED> 3 "register_operand")
-	   (match_operand:SVE_FULL 1 "aarch64_sve_reg_or_dup_imm")
-	   (match_operand:SVE_FULL 2 "aarch64_simd_reg_or_zero")]
+	   (match_operand:SVE_ALL 1 "aarch64_sve_reg_or_dup_imm")
+	   (match_operand:SVE_ALL 2 "aarch64_simd_reg_or_zero")]
 	  UNSPEC_SEL))]
   "TARGET_SVE"
   {
@@ -6740,11 +6740,11 @@ (define_expand "@vcond_mask_<mode><vpred>"
 ;; - a duplicated immediate and a register
 ;; - a duplicated immediate and zero
 (define_insn "*vcond_mask_<mode><vpred>"
-  [(set (match_operand:SVE_FULL 0 "register_operand" "=w, w, w, w, ?w, ?&w, ?&w")
-	(unspec:SVE_FULL
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w, w, w, w, ?w, ?&w, ?&w")
+	(unspec:SVE_ALL
 	  [(match_operand:<VPRED> 3 "register_operand" "Upa, Upa, Upa, Upa, Upl, Upl, Upl")
-	   (match_operand:SVE_FULL 1 "aarch64_sve_reg_or_dup_imm" "w, vss, vss, Ufc, Ufc, vss, Ufc")
-	   (match_operand:SVE_FULL 2 "aarch64_simd_reg_or_zero" "w, 0, Dz, 0, Dz, w, w")]
+	   (match_operand:SVE_ALL 1 "aarch64_sve_reg_or_dup_imm" "w, vss, vss, Ufc, Ufc, vss, Ufc")
+	   (match_operand:SVE_ALL 2 "aarch64_simd_reg_or_zero" "w, 0, Dz, 0, Dz, w, w")]
 	  UNSPEC_SEL))]
   "TARGET_SVE
    && (!register_operand (operands[1], <MODE>mode)

Any suggestions?

[Bug driver/96230] driver: ICE in process_command, at gcc.c:5095

2020-07-17 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96230

--- Comment #2 from z.zhanghaijian at huawei dot com  ---
(In reply to Richard Biener from comment #1)
> But then an empty dumpbase should be OK?

The case ic-misattribution-1.c in the gcc testsuite passes an empty string via
-dumpbase "".

However, that case directly generates an executable, without -S/-c. have_c is
false when -S/-c is not given, so the driver takes the branch "else if (!have_c
&& (!explicit_dumpdir || (dumpbase && !*dumpbase)))" to process the empty
string.

[Bug driver/96230] New: driver: ICE in process_command, at gcc.c:5095

2020-07-16 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96230

Bug ID: 96230
   Summary: driver: ICE in process_command, at gcc.c:5095
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: driver
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

For the case:

cat foo.c

void foo (void)
{
  return;
}

$gcc foo.c -S -dumpbase "" -dumpbase-ext .c -o foo.o

gcc: internal compiler error: in process_command, at gcc.c:5095
0x40ca3b process_command
../.././gcc/gcc.c:5095
0x41335b driver::set_up_specs() const
../.././gcc/gcc.c:8077
0x403b03 driver::main(int, char**)
../.././gcc/gcc.c:7885
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

I think we should also handle the case where dumpbase is an empty string when -S/-c is given.

Proposed patch:
diff --git a/gcc/gcc.c b/gcc/gcc.c
index c0eb3c10cfd..b8a9a8eada9 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -5086,7 +5086,7 @@ process_command (unsigned int decoded_options_count,
      extension from output_name before combining it with dumpdir.  */
   if (dumpbase_ext)
     {
-      if (!dumpbase)
+      if (!dumpbase || !*dumpbase)
	{
	  free (dumpbase_ext);
	  dumpbase_ext = NULL;

Any suggestions?

[Bug target/95523] aarch64:ICE in register_tuple_type,at config/aarch64/aarch64-sve-builtins.cc:3434

2020-06-06 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95523

--- Comment #4 from z.zhanghaijian at huawei dot com  ---
> Could you try setting DECL_USER_ALIGN on the FIELD_DECL?
> that should (hopefully) force the field to keep its
> natural alignment.

Do you mean changing the alignment to the natural alignment while processing
arm_sve.h, and then changing it back after handle_arm_sve_h?

I tracked the calculation of TYPE_ALIGN (tuple_type): it is determined by
maximum_field_alignment in layout_decl, not by DECL_USER_ALIGN.

We can instead reset maximum_field_alignment in sve_switcher to restore the
natural alignment.

like:
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index bdb04e8170d..c49fcebcd43 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -878,6 +878,9 @@ sve_switcher::sve_switcher ()
   aarch64_isa_flags = (AARCH64_FL_FP | AARCH64_FL_SIMD | AARCH64_FL_F16
		       | AARCH64_FL_SVE);
 
+  m_old_maximum_field_alignment = maximum_field_alignment;
+  maximum_field_alignment = 0;
+
   m_old_general_regs_only = TARGET_GENERAL_REGS_ONLY;
   global_options.x_target_flags &= ~MASK_GENERAL_REGS_ONLY;
 
@@ -895,6 +898,7 @@ sve_switcher::~sve_switcher ()
   if (m_old_general_regs_only)
     global_options.x_target_flags |= MASK_GENERAL_REGS_ONLY;
   aarch64_isa_flags = m_old_isa_flags;
+  maximum_field_alignment = m_old_maximum_field_alignment;
 }

 function_builder::function_builder ()
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h b/gcc/config/aarch64/aarch64-sve-builtins.h
index 526d9f55e7b..3ffe2516df9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -658,6 +658,7 @@ public:

 private:
   unsigned long m_old_isa_flags;
+  unsigned int m_old_maximum_field_alignment;
   bool m_old_general_regs_only;
   bool m_old_have_regs_of_mode[MAX_MACHINE_MODE];
 };

[Bug target/95523] aarch64:ICE in register_tuple_type,at config/aarch64/aarch64-sve-builtins.cc:3434

2020-06-04 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95523

--- Comment #2 from z.zhanghaijian at huawei dot com  ---
(In reply to rsand...@gcc.gnu.org from comment #1)
> The reason for the assert is that the alignment is part of the
> ABI of the types and is relied on when using LDR and STR for
> some moves.  Even though -fpack-struct=N changes the ABI in general,
> I don't think it should change it in this particular case.
> 
> I have to wonder why GCC even has -fpack-struct= though.  Do you have
> a specific need for it, or was this caught by option coverage testing?
> 
> If there is no specific need, I'd be tempted to make -fpack-struct
> unsupported for AArch64.  I think there are so many other things
> that could go wrong if it is used.

We use this option in some projects, so dropping support for it would have a
fairly large impact. Instead, I think we can make the error reporting
friendlier, so that the option can still be used normally when SVE is not in
use.

like:
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index bdb04e8170d..e93e766aba6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -3504,6 +3504,12 @@ handle_arm_sve_h ()
   return;
 }

+  if (maximum_field_alignment)
+    {
+      error ("SVE is incompatible with the use of %qs or %qs",
+	     "-fpack-struct", "#pragma pack");
+      return;
+    }
+
   sve_switcher sve;

   /* Define the vector and tuple types.  */

Any suggestions?

[Bug target/95523] New: aarch64:ICE in register_tuple_type,at config/aarch64/aarch64-sve-builtins.cc:3434

2020-06-03 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95523

Bug ID: 95523
   Summary: aarch64:ICE in register_tuple_type,at
config/aarch64/aarch64-sve-builtins.cc:3434
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

ICE issue triggered under option -fpack-struct=n:

Example:
test.c:
#include "arm_sve.h"

gcc test.c -S -fpack-struct=2

In file included from test.c:1:
lib/gcc/aarch64-linux-gnu/11.0.0/include/arm_sve.h:40:9: internal compiler
error: in register_tuple_type, at config/aarch64/aarch64-sve-builtins.cc:3434
   40 | #pragma GCC aarch64 "arm_sve.h"
  | ^~~
0x17ef8b3 register_tuple_type
../.././gcc/config/aarch64/aarch64-sve-builtins.cc:3434
0x17f00ff aarch64_sve::handle_arm_sve_h()
../.././gcc/config/aarch64/aarch64-sve-builtins.cc:3516
0xae927f aarch64_pragma_aarch64
../.././gcc/config/aarch64/aarch64-c.c:281
0xaafe53 c_invoke_pragma_handler(unsigned int)
../.././gcc/c-family/c-pragma.c:1501
0x9f6133 c_parser_pragma
../.././gcc/c/c-parser.c:12509
0x9dc03f c_parser_external_declaration
../.././gcc/c/c-parser.c:1726
0x9dbb43 c_parser_translation_unit
../.././gcc/c/c-parser.c:1618
0xa14a23 c_parse_file()
../.././gcc/c/c-parser.c:21746
0xaa7ed3 c_common_parse_file()
../.././gcc/c-family/c-opts.c:1190

The #pragma GCC aarch64 "arm_sve.h" tells GCC to insert the necessary type and
function definitions. When registering the tuple types, register_tuple_type
asserts TYPE_ALIGN (tuple_type) == 128. The option -fpack-struct=n changes the
alignment of the tuple: with -fpack-struct=2, TYPE_ALIGN (tuple_type) is 16,
which triggers the ICE.

I think there is no need to check TYPE_ALIGN here, even when the type mode is
an SVE vector mode. Any objections?

Proposed patch:
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index bdb04e8170d..5bc5af91016 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -3432,8 +3432,7 @@ register_tuple_type (unsigned int num_vectors, vector_type_index type)
   make_type_sizeless (tuple_type);
   layout_type (tuple_type);
   gcc_assert (VECTOR_MODE_P (TYPE_MODE (tuple_type))
-	      && TYPE_MODE_RAW (tuple_type) == TYPE_MODE (tuple_type)
-	      && TYPE_ALIGN (tuple_type) == 128);
+	      && TYPE_MODE_RAW (tuple_type) == TYPE_MODE (tuple_type));
 
   /* Work out the structure name.  */
   char buffer[sizeof ("svbfloat16x4_t")];

[Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations

2020-06-02 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #5 from z.zhanghaijian at huawei dot com  ---
Created attachment 48659
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48659&action=edit
Fold phi whose incoming args are defined from binary operations

I tried to make a patch to do this optimization (in attachment):

replaces

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
  x1 = a + b;
  goto bb3;
bb2:
  x2 = a + c;
bb3:
  x = PHI <x1(bb1), x2(bb2)>;

with

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
bb2:
bb3:
  x3 = PHI <b(bb1), c(bb2)>;
  x = a + x3;

This patch checks all the phi nodes in bb3 and performs the optimization only
if every one of them can be converted. That avoids most situations in which the
blocks are left non-empty after the optimization, but there are still cases
where a block cannot be emptied.

For example 1:

int f1(int cond, int a, int b, int c, int d, int e, int f, int x, int y, int z,
int w, int m, int n)
{

  if (cond) {
x = e + f;
b = x >> w;
c = m + 12;
a = b + z;
  }
  else {
d = y >> w;
c = n + 12;
a = d + z;
  }
  a = a + 18;
  return c + a;

}

Tree dump before optimization:

   <bb 3> [local count: 536870913]:
  x_13 = e_11(D) + f_12(D);
  b_14 = x_13 >> w_5(D);
  c_16 = m_15(D) + 12;
  a_17 = z_9(D) + b_14;
  goto <bb 5>; [100.00%]

   <bb 4> [local count: 536870913]:
  d_6 = y_4(D) >> w_5(D);
  c_8 = n_7(D) + 12;
  a_10 = d_6 + z_9(D);

   <bb 5> [local count: 1073741824]:
  # a_1 = PHI <a_17(3), a_10(4)>
  # c_2 = PHI <c_16(3), c_8(4)>
  a_18 = a_1 + 18;
  _19 = c_2 + a_18;
  return _19;

Tree dump after optimization:

   <bb 3> [local count: 536870913]:
  x_13 = e_11(D) + f_12(D);
  goto <bb 5>; [100.00%]

   <bb 4> [local count: 536870913]:

   <bb 5> [local count: 1073741824]:
  # _21 = PHI <x_13(3), y_4(D)(4)>
  # _23 = PHI <m_15(D)(3), n_7(D)(4)>
  c_2 = _23 + 12;
  _22 = _21 >> w_5(D);
  a_1 = z_9(D) + _22;
  a_18 = a_1 + 18;
  _19 = c_2 + a_18;
  return _19;

Assembly before optimization:

.LFB0:
.cfi_startproc
ldr w1, [sp, 8]
ldr w2, [sp, 16]
cbz w0, .L2
add w5, w5, w6
ldr w0, [sp, 24]
asr w5, w5, w2
add w1, w5, w1
add w1, w1, 18
add w0, w0, 12
add w0, w1, w0
ret
.p2align 2,,3
.L2:
ldr w0, [sp]
asr w2, w0, w2
ldr w0, [sp, 32]
add w1, w2, w1
add w1, w1, 18
add w0, w0, 12
add w0, w1, w0
ret
.cfi_endproc

Assembly after optimization:

.LFB0:
.cfi_startproc
ldr w1, [sp]
ldr w3, [sp, 8]
ldr w4, [sp, 16]
ldr w2, [sp, 32]
cbz w0, .L2
ldr w2, [sp, 24]
add w1, w5, w6
.L2:
asr w0, w1, w4
add w0, w0, w3
add w0, w0, w2
add w0, w0, 30
ret

Because the statement x_13 = e_11(D) + f_12(D) in bb3 does not feed a phi node
in bb5, bb3 cannot be emptied. I have not found a good way to solve this. Any
suggestions?

Without considering register pressure, example 1 is profitable, and this patch
is effective for 500.perlbench_r in comment 3.

[Bug target/94820] [8/9/10 Regression] pr94780.c fails with ICE on aarch64

2020-04-28 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94820

--- Comment #3 from z.zhanghaijian at huawei dot com  ---
I have an initial fix for aarch64 that is under testing.
Will post to gcc-patches when finished.

[Bug target/94821] New: aarch64: ICE in walk_body at gcc/tree-nested.c:713

2020-04-28 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94821

Bug ID: 94821
   Summary: aarch64: ICE in walk_body at gcc/tree-nested.c:713
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

The case gcc.dg/pr94780.c on aarch64:

_Atomic double x;

double
foo (void)
{
  double bar () { return x; }
  x /= 3;
  return bar ();
}

---
gcc pr94780.c -S

pr94780.c: In function ‘foo’:
pr94780.c:8:1: internal compiler error: Segmentation fault
8 | foo (void)
  | ^~~
0x125cdc3 crash_signal
../.././gcc/toplev.c:328
0x94a4dc tree_check(tree_node*, char const*, int, char const*, tree_code)
../.././gcc/tree.h:3286
0x1356f83 convert_nonlocal_reference_op
../.././gcc/tree-nested.c:1064
0x1677c5f walk_tree_1(tree_node**, tree_node* (*)(tree_node**, int*, void*),
void*, hash_set >*,
tree_node* (*)(tree_node**, int*, tree_node* (*)(tree_node**, int*, void*),
void*, hash_set >*))
../.././gcc/tree.c:12000
0xe0a7b7 walk_gimple_op(gimple*, tree_node* (*)(tree_node**, int*, void*),
walk_stmt_info*)
../.././gcc/gimple-walk.c:268
0xe0b33b walk_gimple_stmt(gimple_stmt_iterator*, tree_node*
(*)(gimple_stmt_iterator*, bool*, walk_stmt_info*), tree_node* (*)(tree_node**,
int*, void*), walk_stmt_info*)
../.././gcc/gimple-walk.c:596
0xe09d83 walk_gimple_seq_mod(gimple**, tree_node* (*)(gimple_stmt_iterator*,
bool*, walk_stmt_info*), tree_node* (*)(tree_node**, int*, void*),
walk_stmt_info*)
../.././gcc/gimple-walk.c:51
0xe0b437 walk_gimple_stmt(gimple_stmt_iterator*, tree_node*
(*)(gimple_stmt_iterator*, bool*, walk_stmt_info*), tree_node* (*)(tree_node**,
int*, void*), walk_stmt_info*)
../.././gcc/gimple-walk.c:605
0xe09d83 walk_gimple_seq_mod(gimple**, tree_node* (*)(gimple_stmt_iterator*,
bool*, walk_stmt_info*), tree_node* (*)(tree_node**, int*, void*),
walk_stmt_info*)
../.././gcc/gimple-walk.c:51
0x1355da7 walk_body
../.././gcc/tree-nested.c:713
0x1355def walk_function
../.././gcc/tree-nested.c:724
0x135613b walk_all_functions
../.././gcc/tree-nested.c:789
0x13607b3 lower_nested_functions(tree_node*)
../.././gcc/tree-nested.c:3551
0xbc6187 cgraph_node::analyze()
../.././gcc/cgraphunit.c:676
0xbc842f analyze_functions
../.././gcc/cgraphunit.c:1227
0xbcd97b symbol_table::finalize_compilation_unit()
../.././gcc/cgraphunit.c:2974
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
---

PR94780 was only fixed on i386, the same error was also reported on aarch64.

I have an initial fix patch that is under testing.

Any suggestions?

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-22 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

z.zhanghaijian at huawei dot com  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|REOPENED|RESOLVED

--- Comment #19 from z.zhanghaijian at huawei dot com  ---
Resolved.

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-22 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #18 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #17)
> [ Please don't add other email addresses for me; I get enough mail already,
>   I don't need all bugzilla mail in duplicate :-) ]


OK

> (In reply to z.zhanghaij...@huawei.com from comment #16)
> > Ok, I will create a new PR to track this bug, and I will submit a bugfix
> > patch whit that PR.
> 
> You can make this PR RESOLVED again, after you made a new PR.


OK, the new PR is PR94708, I will make this PR RESOLVED.

> 
> > In addition, I tracked the process of generating fmaxnm/fminnm and found
> > that it was generated in phiopt (minmax_replacement) and if-conversion
> > (noce_try_minmax). In the rtl combine, only fminnm can be generated. Is it
> > necessary for us to improve this optimization in the rtl combine using the
> > above patch in stage1?
> 
> Yeah, ifcvt will often do it.
> 
> combine can handle max just fine as well; you'll need to track down why
> it doesn't here (I noticed it doesn't as well, it wasn't immediately
> obvious to me what the difference with the min case is).


I will continue to track why fmaxnm is not generated.

[Bug rtl-optimization/94708] New: rtl combine should consider NaNs when generate fp min/max

2020-04-22 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94708

Bug ID: 94708
   Summary: rtl combine should consider NaNs when generate fp
min/max
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

RTL combine should consider NaNs when generating fp min/max.

There is detailed discussion information here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

Proposed patch:

diff --git a/gcc/combine.c b/gcc/combine.c
index cff76cd3303..eaf93a05235 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -6643,7 +6643,8 @@ simplify_if_then_else (rtx x)
 
   /* Look for MIN or MAX.  */
 
-  if ((! FLOAT_MODE_P (mode) || flag_unsafe_math_optimizations)
+  if ((! FLOAT_MODE_P (mode)
+       || (flag_unsafe_math_optimizations && flag_finite_math_only))
       && comparison_p
       && rtx_equal_p (XEXP (cond, 0), true_rtx)
       && rtx_equal_p (XEXP (cond, 1), false_rtx)

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-22 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #16 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #15)
> replacing flag_unsafe_math_operations by flag_finite_math_only isn't correct,
> but you can add it instead, i.e.
> 
> -  if ((! FLOAT_MODE_P (mode) || flag_unsafe_math_optimizations)
> +  if (!FLOAT_MODE_P (mode)
> +  || (flag_unsafe_math_optimizations && flag_finite_math_only))
> 
> or such?
> 
> Thanks for working on a patch!


Ok, I will create a new PR to track this bug, and I will submit a bugfix patch
with that PR.

In addition, I tracked the process of generating fmaxnm/fminnm and found that
it was generated in phiopt (minmax_replacement) and if-conversion
(noce_try_minmax). In the rtl combine, only fminnm can be generated. Is it
necessary for us to improve this optimization in the rtl combine using the
above patch in stage1?

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-21 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #14 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #11)
> Confirmed the comment 4 problem, on all archs.  This is a very old bug.

There are two ways to fix this bug:
1. Change flag_unsafe_math_optimizations to flag_finite_math_only, so that
fmaxnm/fminnm can be generated under -ffinite-math-only;
2. Delete this optimization.
Which one do you prefer?

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-21 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #13 from z.zhanghaijian at huawei dot com  ---
When changed to flag_finite_math_only, the fmaxnm can also be generated with
the patch above (swapping the true_rtx/false_rtx).

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-21 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #12 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #11)
> Confirmed the comment 4 problem, on all archs.  This is a very old bug.

OK. Could this optimization use flag_finite_math_only instead of
flag_unsafe_math_optimizations?

Like the patch:

diff --git a/gcc/combine.c b/gcc/combine.c
index cff76cd3303..f394d8dfd03 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -6643,7 +6643,7 @@ simplify_if_then_else (rtx x)

   /* Look for MIN or MAX.  */

-  if ((! FLOAT_MODE_P (mode) || flag_unsafe_math_optimizations)
+  if ((! FLOAT_MODE_P (mode) || flag_finite_math_only)
       && comparison_p
       && rtx_equal_p (XEXP (cond, 0), true_rtx)
       && rtx_equal_p (XEXP (cond, 1), false_rtx)

Can this fix the bug?

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-20 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #8 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #7)
> Can r94 or r93 be NaN there?
> 
> (I should build an aarch64 compiler...  takes almost a day though :-) )

Yes, r94 and r93 are function arguments; nothing in the example restricts them,
so they may be NaN.

Does -funsafe-math-optimizations allow the optimizer to assume that arguments
are not NaNs, the way -ffinite-math-only does?

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-20 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #6 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #5)
> Can you show the  -fdump-rtl-combine-all  dump where that insn is
> created?
> 
> It is fine to generate min or max insns here; but you need to handle the 
> case where vara is NaN: you should return that NaN then.  Other than that
> your function is just the max of vara, varb, varc.

The dump info:

Trying 39 -> 40:
   39: cc:CCFPE=cmp(r94:SF,r93:SF)
   40: r94:SF={(cc:CCFPE<0)?r94:SF:r93:SF}
  REG_DEAD r93:SF
  REG_DEAD cc:CCFPE
Successfully matched this instruction:
(set (reg:SF 94 [ _4 ])
(smin:SF (reg:SF 94 [ _4 ])
(reg:SF 93 [ _2 ])))
allowing combination of insns 39 and 40
original costs 4 + 4 = 8
replacement cost 8
deferring deletion of insn with uid = 39.
modifying insn i3    40: r94:SF=smin(r94:SF,r93:SF)
  REG_DEAD r93:SF
deferring rescan insn with uid = 40.

[Bug rtl-optimization/94665] missed minmax optimization opportunity for if/else structure.

2020-04-20 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

--- Comment #4 from z.zhanghaijian at huawei dot com  ---
(In reply to Segher Boessenkool from comment #2)
> If vara is a NaN, this is not the same; it needs -ffinite-math-only.
> And in fact adding that option does the trick (on powerpc that is, I
> don't have an aarch64 Fortran handy).
> 
> Could you check this please?

Yes, on aarch64, fmaxnm can be generated with -ffinite-math-only and
-funsafe-math-optimizations.

One question: why is it OK for rtl combine to generate the fminnm here?
Anything I missed?

[Bug rtl-optimization/94665] New: missed minmax optimization opportunity for if/else structure.

2020-04-19 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94665

Bug ID: 94665
   Summary: missed minmax optimization opportunity for if/else
structure.
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

Min/max optimization opportunity for Fortran,
for example:

SUBROUTINE mydepart(vara,varb,varc,res)
  REAL, INTENT(IN) :: vara,varb,varc
  REAL, INTENT(out) :: res

  res = vara
  if (res .lt. varb)  res = varb
  if (res .gt. varc)  res = varc
end SUBROUTINE

On aarch64, compiled with -O2 -S -funsafe-math-optimizations, the asm is:

ldr s2, [x0]
ldr s0, [x1]
ldr s1, [x2]
fcmpe   s2, s0
fcsel   s0, s0, s2, mi
fminnm  s1, s1, s0
str s1, [x3]
ret

The second if statement is optimized to fminnm, but the first cannot be.

In fact, it can be optimized to:

ldr s2, [x0]
ldr s1, [x1]
ldr s0, [x2]
fmaxnm  s1, s2, s1
fminnm  s0, s0, s1
str s0, [x3]

My proposal: I tracked the generation of fminnm down to
simplify_if_then_else. The first if statement is not optimized because these
conditions are not met:
rtx_equal_p (XEXP (cond, 0), true_rtx) && rtx_equal_p (XEXP (cond, 1),
false_rtx).

The RTX:

(if_then_else:SF (lt (reg:SF 92 [ _1 ])
(reg:SF 93 [ _2 ]))
(reg:SF 93 [ _2 ])
(reg:SF 92 [ _1 ]))

We can swap true_rtx/false_rtx and then emit the opposite operation (SMAX
instead of SMIN here).

the patch:

diff --git a/gcc/combine.c b/gcc/combine.c
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -6641,25 +6641,43 @@ simplify_if_then_else (rtx x)

   if ((! FLOAT_MODE_P (mode) || flag_unsafe_math_optimizations)
   && comparison_p
-  && rtx_equal_p (XEXP (cond, 0), true_rtx)
-  && rtx_equal_p (XEXP (cond, 1), false_rtx)
   && ! side_effects_p (cond))
-switch (true_code)
-  {
-  case GE:
-  case GT:
-   return simplify_gen_binary (SMAX, mode, true_rtx, false_rtx);
-  case LE:
-  case LT:
-   return simplify_gen_binary (SMIN, mode, true_rtx, false_rtx);
-  case GEU:
-  case GTU:
-   return simplify_gen_binary (UMAX, mode, true_rtx, false_rtx);
-  case LEU:
-  case LTU:
-   return simplify_gen_binary (UMIN, mode, true_rtx, false_rtx);
-  default:
-   break;
+{
+  int swapped = 0;
+  if (rtx_equal_p (XEXP (cond, 0), false_rtx)
+ && rtx_equal_p (XEXP (cond, 1), true_rtx))
+   {
+ std::swap (true_rtx, false_rtx);
+ swapped = 1;
+   }
+
+  if (rtx_equal_p (XEXP (cond, 0), true_rtx)
+ && rtx_equal_p (XEXP (cond, 1), false_rtx))
+   switch (true_code)
+ {
+ case GE:
+ case GT:
+   return simplify_gen_binary (swapped ? SMIN : SMAX,
+   mode, true_rtx, false_rtx);
+ case LE:
+ case LT:
+   return simplify_gen_binary (swapped ? SMAX : SMIN,
+   mode, true_rtx, false_rtx);
+ case GEU:
+ case GTU:
+   return simplify_gen_binary (swapped ? UMIN : UMAX,
+   mode, true_rtx, false_rtx);
+ case LEU:
+ case LTU:
+   return simplify_gen_binary (swapped ? UMAX : UMIN,
+   mode, true_rtx, false_rtx);
+ default:
+   break;
+ }
+
+  /* Restore if not MIN or MAX.  */
+  if (swapped)
+   std::swap (true_rtx, false_rtx);
   }

   /* If we have (if_then_else COND (OP Z C1) Z) and OP is an identity when its

Any suggestions?

[Bug tree-optimization/94398] ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-30 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398

--- Comment #1 from z.zhanghaijian at huawei dot com  ---
(gdb) bt
#0  aarch64_builtin_support_vector_misalignment (mode=E_VNx4SFmode,
type=0xb79ec2a0, misalignment=-1, is_packed=false)
at ../../gcc-git/gcc/config/aarch64/aarch64.c:17510
#1  0x0220631c in vect_supportable_dr_alignment (dr_info=0x2ef3798,
check_aligned_accesses=false) at ../../gcc-git/gcc/tree-vect-data-refs.c:6618
#2  0x0162fc9c in vectorizable_load (stmt_info=0x2ef3770,
gsi=0xe0b0, vec_stmt=0xdde0, slp_node=0x0,
slp_node_instance=0x0,
cost_vec=0x0) at ../../gcc-git/gcc/tree-vect-stmts.c:9172
#3  0x01635174 in vect_transform_stmt (stmt_info=0x2ef3770,
gsi=0xe0b0, slp_node=0x0, slp_node_instance=0x0)
at ../../gcc-git/gcc/tree-vect-stmts.c:11034
#4  0x0165a340 in vect_transform_loop_stmt (loop_vinfo=0x2ed0ad0,
stmt_info=0x2ef3770, gsi=0xe0b0, seen_store=0xe0a8)
at ../../gcc-git/gcc/tree-vect-loop.c:8307
#5  0x0165b5c4 in vect_transform_loop (loop_vinfo=0x2ed0ad0,
loop_vectorized_call=0x0) at ../../gcc-git/gcc/tree-vect-loop.c:8708
#6  0x01689f08 in try_vectorize_loop_1
(simduid_to_vf_htab=@0xed68: 0x0, num_vectorized_loops=0xed7c,
loop=0xb782,
loop_vectorized_call=0x0, loop_dist_alias_call=0x0) at
../../gcc-git/gcc/tree-vectorizer.c:990
#7  0x0168a184 in try_vectorize_loop
(simduid_to_vf_htab=@0xed68: 0x0, num_vectorized_loops=0xed7c,
loop=0xb782)
at ../../gcc-git/gcc/tree-vectorizer.c:1047
#8  0x0168a330 in vectorize_loops () at
../../gcc-git/gcc/tree-vectorizer.c:1127
#9  0x014e55e4 in (anonymous namespace)::pass_vectorize::execute
(this=0x2d6f860, fun=0xb7817000) at ../../gcc-git/gcc/tree-ssa-loop.c:414
#10 0x0113dec0 in execute_one_pass (pass=0x2d6f860) at
../../gcc-git/gcc/passes.c:2502
#11 0x0113e284 in execute_pass_list_1 (pass=0x2d6f860) at
../../gcc-git/gcc/passes.c:2590
#12 0x0113e2c0 in execute_pass_list_1 (pass=0x2d6f070) at
../../gcc-git/gcc/passes.c:2591
#13 0x0113e2c0 in execute_pass_list_1 (pass=0x2d6dd00) at
../../gcc-git/gcc/passes.c:2591
#14 0x0113e32c in execute_pass_list (fn=0xb7817000, pass=0x2d6db20)
at ../../gcc-git/gcc/passes.c:2601
#15 0x00be2f50 in cgraph_node::expand (this=0xb79dc870) at
../../gcc-git/gcc/cgraphunit.c:2299
#16 0x00be3814 in expand_all_functions () at
../../gcc-git/gcc/cgraphunit.c:2470
#17 0x00be45c4 in symbol_table::compile (this=0xb79ce000) at
../../gcc-git/gcc/cgraphunit.c:2820
#18 0x00be4b14 in symbol_table::finalize_compilation_unit
(this=0xb79ce000) at ../../gcc-git/gcc/cgraphunit.c:3000
#19 0x0129f7dc in compile_file () at ../../gcc-git/gcc/toplev.c:483
#20 0x012a3a14 in do_compile () at ../../gcc-git/gcc/toplev.c:2273
#21 0x012a3de0 in toplev::main (this=0xf148, argc=21,
argv=0xf298) at ../../gcc-git/gcc/toplev.c:2412
#22 0x0224a038 in main (argc=21, argv=0xf298) at
../../gcc-git/gcc/main.c:39
(gdb) p misalignment
$3 = -1
(gdb) p mode
$4 = E_VNx4SFmode

vect_supportable_dr_alignment is expected to return either dr_aligned or
dr_unaligned_supported for masked operations, but it currently only handles
the internal functions IFN_MASK_LOAD and IFN_MASK_STORE. We are emitting a
masked gather load for this test case. Since backends have their own vector
misalignment support policies, I suppose this is better handled in the shared
auto-vectorization code.

Proposed fix by felix.y...@huawei.com:
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 0192aa6..67d3345 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -6509,11 +6509,26 @@ vect_supportable_dr_alignment (dr_vec_info *dr_info,

   /* For now assume all conditional loads/stores support unaligned
  access without any special code.  */
-  if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
-if (gimple_call_internal_p (stmt)
-   && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
-   || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
-  return dr_unaligned_supported;
+  gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
+  if (call && gimple_call_internal_p (call))
+{
+  internal_fn ifn = gimple_call_internal_fn (call);
+  switch (ifn)
+   {
+ case IFN_MASK_LOAD:
+ case IFN_MASK_LOAD_LANES:
+ case IFN_MASK_GATHER_LOAD:
+ case IFN_MASK_STORE:
+ case IFN_MASK_STORE_LANES:
+ case IFN_MASK_SCATTER_STORE:
+   return dr_unaligned_supported;
+ default:
+   break;
+   }
+}
+
+  if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+return dr_unaligned_supported;

   if (loop_vinfo)
 {

[Bug tree-optimization/94398] New: ICE: in vectorizable_load, at tree-vect-stmts.c:9173

2020-03-30 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94398

Bug ID: 94398
   Summary: ICE: in vectorizable_load, at tree-vect-stmts.c:9173
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
CC: rguenther at suse dot de
  Target Milestone: ---

test case: gcc/testsuite/gcc.dg/pr94269.c

Command line: aarch64-linux-gnu-gcc -S -O2 -fopt-info -march=armv8.2-a+sve
-msve-vector-bits=256 -ftree-loop-vectorize -funsafe-math-optimizations
-mstrict-align pr94269.c

pr94269.c:16:9: optimized: loop vectorized using 32 byte vectors
during GIMPLE pass: vect
pr94269.c: In function 'foo':
pr94269.c:5:1: internal compiler error: in vectorizable_load, at
tree-vect-stmts.c:9173
5 | foo(long n, float *x, int inc_x,
  | ^~~
0x162fcc7 vectorizable_load
../../gcc-git/gcc/tree-vect-stmts.c:9173
0x1635173 vect_transform_stmt(_stmt_vec_info*, gimple_stmt_iterator*,
_slp_tree*, _slp_instance*)
../../gcc-git/gcc/tree-vect-stmts.c:11034
0x165a33f vect_transform_loop_stmt
../../gcc-git/gcc/tree-vect-loop.c:8307
0x165b5c3 vect_transform_loop(_loop_vec_info*, gimple*)
../../gcc-git/gcc/tree-vect-loop.c:8708
0x1689f07 try_vectorize_loop_1
../../gcc-git/gcc/tree-vectorizer.c:990
0x168a183 try_vectorize_loop
../../gcc-git/gcc/tree-vectorizer.c:1047
0x168a32f vectorize_loops()
../../gcc-git/gcc/tree-vectorizer.c:1127
0x14e55e3 execute
../../gcc-git/gcc/tree-ssa-loop.c:414
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

With -mstrict-align, aarch64_builtin_support_vector_misalignment returns
false when the misalignment factor is unknown at compile time.
vect_supportable_dr_alignment then returns dr_unaligned_unsupported, which
triggers the ICE.

[Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations

2020-03-25 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #4 from z.zhanghaijian at huawei dot com  ---
(In reply to Richard Biener from comment #2)
> Note that with binary operations you are eventually increasing register
> pressure up to a point where we need to spill so IMHO this should be only
> done if both
> blocks become empty after the transform.

I can try adding a constraint so the fold is only done when both blocks
become empty after the transform. Even with this constraint, the SPEC2017
improvements mentioned above still apply. But do you think we need an option
to control this constraint so users can choose? The register pressure problem
we worry about does not exist in most cases.

[Bug tree-optimization/94274] fold phi whose incoming args are defined from binary operations

2020-03-24 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

--- Comment #3 from z.zhanghaijian at huawei dot com  ---
(In reply to Marc Glisse from comment #1)
> Detecting common beginnings / endings in branches is something gcc does very
> seldom. Even at -Os, for if(cond)f(b);else f(c); we need to wait until
> rtl-optimizations to get a single call to f. (of course the reverse
> transformation of duplicating a statement that was after the branches into
> them, if it simplifies, is nice as well, and they can conflict)
> I don't know if handling one such very specific case (binary operations with
> a common argument) separately is a good idea when we don't even handle unary
> operations.

I tried this fold on SPECint2017 and found some performance gains on
500.perlbench_r. I then compared the assembly and found some improvements.

For example:

S_invlist_max, which is inlined by many functions, such as
S__append_range_to_invlist, S_ssc_anything, Perl__invlist_invert ...

invlist_inline.h:
#define FROM_INTERNAL_SIZE(x) ((x)/ sizeof(UV))

S_invlist_max (inlined by S__append_range_to_invlist, S_ssc_anything,
Perl__invlist_invert, ...):
return SvLEN(invlist) == 0  /* This happens under _new_invlist_C_array */
   ? FROM_INTERNAL_SIZE(SvCUR(invlist)) - 1
   : FROM_INTERNAL_SIZE(SvLEN(invlist)) - 1;

Dump tree phiopt:

   <bb 3> [local count: 536870911]:
  _46 = pretmp_112 >> 3;
  iftmp.1123_47 = _46 + 18446744073709551615;
  goto <bb 5>; [100.00%]

   <bb 4> [local count: 536870911]:
  _48 = _44 >> 3;
  iftmp.1123_49 = _48 + 18446744073709551615;

   <bb 5> [local count: 1073741823]:
  # iftmp.1123_50 = PHI <iftmp.1123_47(3), iftmp.1123_49(4)>

This can be replaced with:

   <bb 3> [local count: 536870912]:

   <bb 4> [local count: 1073741823]:
  # _48 = PHI <_44(2), pretmp_112(3)>
  _49 = _48 >> 3;
  iftmp.1123_50 = _49 + 18446744073709551615;

Assemble:

lsr x5, x6, #3
lsr x3, x3, #3
sub x20, x5, #0x1
sub x3, x3, #0x1
csel    x20, x3, x20, ne

Replaces with:

csel    x3, x3, x4, ne
lsr x3, x3, #3
sub x20, x3, #0x1

This eliminates two instructions.

[Bug tree-optimization/94274] New: fold phi whose incoming args are defined from binary operations

2020-03-23 Thread z.zhanghaijian at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94274

Bug ID: 94274
   Summary: fold phi whose incoming args are defined from binary
operations
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: z.zhanghaijian at huawei dot com
  Target Milestone: ---

For an if/else structure,
Example 1:

int test(int cond, int a, int b, int c)
{
  int result = 0;

  if (cond)
result = a + b;
  else
result = a + c;
  return result;
}

The expressions are binary operations with a common operand "a" and the same
opcode.

E.g. on aarch64, gcc does the binary operations first and then a csel:

cmp w0, 0
add w0, w1, w2
add w1, w1, w3
csel    w0, w1, w0, eq

In fact, it can be optimized to do the csel first and then a single binary
operation:

cmp w0, 0
csel    w2, w2, w3, ne
add w0, w2, w1

This eliminates one instruction. The scenario is very common, and the
switch/case structure presents the same opportunity.

Example 2:

int test(int cond, int a, int b, int c, int d)
{
  int result = 0;

  switch (cond) {
case 1:
  result = a + b;
  break;
case 8:
  result = a + c;
  break;
default:
  result = a + d;
  break;
  }
  return result;
}

gcc does the binary operations first and then a csel:

mov w5, w0
add w0, w1, w2
cmp w5, 1
beq .L1
add w4, w1, w4
cmp w5, 8
add w1, w1, w3
csel    w0, w1, w4, eq
.L1:
ret

which can be further optimized into:

cmp w0, 1
beq .L3
cmp w0, 8
csel    w4, w4, w3, ne
add w0, w1, w4
ret
.L3:
mov w4, w2
add w0, w1, w4
ret

My proposal: fold the merging phi node in tree_ssa_phiopt_worker (ssa-phiopt):

For example 1:

replaces

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
  x1 = a + b;
  goto bb3;
bb2:
  x2 = a + c;
bb3:
  x = PHI <x1(bb1), x2(bb2)>;

with

bb0:
  if (cond) goto bb1; else goto bb2;
bb1:
bb2:
bb3:
  x3 = PHI <b(bb1), c(bb2)>;
  x = a + x3;


For example 2:

replaces

bb0:
  if (cond == 1) goto bb2; else goto bb1;
bb1:
  if (cond == 8) goto bb3; else goto bb4;
bb2:
  x2 = a + b;
  goto bb5;
bb3:
  x3 = a + c;
  goto bb5;
bb4:
  x4 = a + d;
bb5:
  x5 = PHI <x2(bb2), x3(bb3), x4(bb4)>;

with

bb0:
  if (cond == 1) goto bb2; else goto bb1;
bb1:
  if (cond == 8) goto bb3; else goto bb4;
bb2:
bb3:
bb4:
bb5:
  x5 = PHI <b(bb2), c(bb3), d(bb4)>;
  x = a + x5;

I have an initial implementation that is under testing. It is based in part
on LLVM's InstCombine pass (InstCombinePHI.cpp).

Any suggestions?