[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-03-02 Thread bina2374 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #11 from Mel Chen  ---
(In reply to Mel Chen from comment #10)
> (In reply to Richard Biener from comment #9)
> > (In reply to Mel Chen from comment #8)
> > > Sorry for using the bad example to describe the problem I am facing. Let 
> > > me
> > > clarify my question with a more precise example.
> > > 
> > > void array_mul(int N, int *C, short *A, short *B) {
> > >   int i, j;
> > >   for (i = 0; i < N; i++) {
> > > C[i] = 0; // Will be transformed to __builtin_memset
> > > for (j = 0; j < N; j++) {
> > >   C[i] += (int)A[i * N + j] * (int)B[j];
> > > }
> > >   }
> > > }
> > > 
> > > If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the 
> > > store
> > > operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). 
> > > But
> > > without -fno-tree-loop-distribute-patterns, it will be transformed to 
> > > memset
> > > by loop distribution (ldist) because ldist executes before dse3. Finally 
> > > the
> > > memset will not be eliminated.
> > > 
> > > Another point is if there are other operations in the same level loop as 
> > > the
> > > store operation, is it really beneficial to do loop distribution and then
> > > convert to builtin function?
> > 
> > Sure, it shows a cost modeling issue given that usually loop distribution
> > merges partitions which touch the same memory stream (but IIRC maybe only
> > for loads).  But more to the point we're missing to eliminate the dead store
> > which should be apparent at least after PRE - LIM2 applied store motion
> > but only PRE elides the resulting load of C[i].  Usually DCE and DSE come in
> > pairs but after PRE we have DCE, CDDCE w/o accompanying DSE, with the
> > next DSE only happening after loop distribution.
> > 
> > Which means we should eventually do
> > 
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index e9ed3c7bc57..be3a9becde0 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
> >NEXT_PASS (pass_sancov);
> >NEXT_PASS (pass_asan);
> >NEXT_PASS (pass_tsan);
> > +  NEXT_PASS (pass_dse);
> >NEXT_PASS (pass_dce);
> >/* Pass group that runs when 1) enabled, 2) there are loops
> >  in the function.  Make sure to run pass_fix_loops before
> 
> Yes, doing DSE before ldist is a simple and effective way.
> The patch has been verified to work on CoreMark: it improved not only
> performance but also code size.

The test case gcc.dg/tree-ssa/ldist-33.c fails after I added DSE.

/* { dg-do compile { target size32plus } } */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns
-fdump-tree-ldist-details" } */

#define N (1024)
double a[N][N], b[N][N], c[N][N];

void
foo (void)
{
  unsigned i, j, k;

  for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
  {
c[i][j] = 0.0;
for (k = 0; k < N; ++k)
  c[i][j] += a[i][k] * b[k][j];
  }
}

/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 1 loops and 1
library" "ldist" } } */

It is similar to the example I showed earlier: DSE eliminates 'c[i][j] = 0.0',
so no loop is split. My question is how to handle this test case: add
-fno-tree-dse to the dg-options, modify the test case, delete it, or something
else?
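One possible way to keep the test meaningful (my suggestion only; the thread has not settled on an answer) is to add -fno-tree-dse to the options, so the zero-store survives until ldist and the loop nest is still split:

```
/* { dg-do compile { target size32plus } } */
/* Hypothetical adjustment: -fno-tree-dse keeps the c[i][j] = 0.0 store
   alive so ldist can still split the loop nest and emit the memset.  */
/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fno-tree-dse -fdump-tree-ldist-details" } */
```

This keeps the dg-final scan intact, at the cost of no longer testing the default pass pipeline.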

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-25 Thread bina2374 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #10 from Mel Chen  ---
(In reply to Richard Biener from comment #9)
> (In reply to Mel Chen from comment #8)
> > Sorry for using the bad example to describe the problem I am facing. Let me
> > clarify my question with a more precise example.
> > 
> > void array_mul(int N, int *C, short *A, short *B) {
> >   int i, j;
> >   for (i = 0; i < N; i++) {
> > C[i] = 0; // Will be transformed to __builtin_memset
> > for (j = 0; j < N; j++) {
> >   C[i] += (int)A[i * N + j] * (int)B[j];
> > }
> >   }
> > }
> > 
> > If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
> > operation 'C[i] = 0' can be eliminated by dead store elimination (dse3). But
> > without -fno-tree-loop-distribute-patterns, it will be transformed to memset
> > by loop distribution (ldist) because ldist executes before dse3. Finally the
> > memset will not be eliminated.
> > 
> > Another point is if there are other operations in the same level loop as the
> > store operation, is it really beneficial to do loop distribution and then
> > convert to builtin function?
> 
> Sure, it shows a cost modeling issue given that usually loop distribution
> merges partitions which touch the same memory stream (but IIRC maybe only
> for loads).  But more to the point we're missing to eliminate the dead store
> > which should be apparent at least after PRE - LIM2 applied store motion
> > but only PRE elides the resulting load of C[i].  Usually DCE and DSE come in
> > pairs but after PRE we have DCE, CDDCE w/o accompanying DSE, with the
> > next DSE only happening after loop distribution.
> 
> Which means we should eventually do
> 
> diff --git a/gcc/passes.def b/gcc/passes.def
> index e9ed3c7bc57..be3a9becde0 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -254,6 +254,7 @@ along with GCC; see the file COPYING3.  If not see
>NEXT_PASS (pass_sancov);
>NEXT_PASS (pass_asan);
>NEXT_PASS (pass_tsan);
> +  NEXT_PASS (pass_dse);
>NEXT_PASS (pass_dce);
>/* Pass group that runs when 1) enabled, 2) there are loops
>  in the function.  Make sure to run pass_fix_loops before

Yes, doing DSE before ldist is a simple and effective way.
The patch has been verified to work on CoreMark: it improved not only
performance but also code size.

[Bug tree-optimization/94092] Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2021-02-24 Thread bina2374 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

--- Comment #8 from Mel Chen  ---
Sorry for using the bad example to describe the problem I am facing. Let me
clarify my question with a more precise example.

void array_mul(int N, int *C, short *A, short *B) {
  int i, j;
  for (i = 0; i < N; i++) {
C[i] = 0; // Will be transformed to __builtin_memset
for (j = 0; j < N; j++) {
  C[i] += (int)A[i * N + j] * (int)B[j];
}
  }
}

If I compile the case with -O2 -fno-tree-loop-distribute-patterns, the store
'C[i] = 0' can be eliminated by dead store elimination (dse3). But without
-fno-tree-loop-distribute-patterns it is transformed into a memset by loop
distribution (ldist), because ldist runs before dse3, and the memset is never
eliminated.

Another point: if there are other operations at the same loop level as the
store, is it really beneficial to do loop distribution and then convert it to
a builtin function?
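For reference, what ldist does here can be sketched in source form (my paraphrase of the transform, not GCC's actual output):

```c
#include <string.h>

/* Sketch of array_mul after loop distribution + pattern recognition:
   the C[i] = 0 stores become one guarded memset, followed by the
   accumulation loop.  The memset is now much harder to prove dead.  */
void array_mul_distributed(int N, int *C, short *A, short *B) {
  if (N > 0)
    memset(C, 0, (size_t)N * sizeof(int));
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      C[i] += (int)A[i * N + j] * (int)B[j];
}

/* Reference: the original source, for checking equivalence.  */
void array_mul_ref(int N, int *C, short *A, short *B) {
  for (int i = 0; i < N; i++) {
    C[i] = 0;
    for (int j = 0; j < N; j++)
      C[i] += (int)A[i * N + j] * (int)B[j];
  }
}

/* Returns 1 if both versions agree on a small input.  */
int array_mul_check(void) {
  short A[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9}, B[3] = {1, -1, 2};
  int C1[3] = {7, 7, 7}, C2[3] = {7, 7, 7};
  array_mul_distributed(3, C1, A, B);
  array_mul_ref(3, C2, A, B);
  for (int i = 0; i < 3; i++)
    if (C1[i] != C2[i]) return 0;
  return 1;
}
```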

[Bug target/96297] New: Redundant zero extension after inlining the function

2020-07-23 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96297

Bug ID: 96297
   Summary: Redundant zero extension after inlining the function
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bina2374 at gmail dot com
CC: kito at gcc dot gnu.org, wilson at gcc dot gnu.org
  Target Milestone: ---
Target: riscv32-unknown-elf

Command line: bin/riscv64-unknown-elf-gcc -march=rv32imafc -mabi=ilp32f -O3
call_is_digit.c -S

==
 C Source
==
unsigned char is_digit(unsigned char c) {
  return ((c >= '0') & (c <= '9')) ? 1 : 0;
}

int call_is_digit(unsigned char c) {
  if (is_digit(c))
return 0xa;
  else
return 0;
}

=
 GCC asm
=
is_digit:
        addi    a0,a0,-48
        sltiu   a0,a0,10
        ret

call_is_digit:
        addi    a0,a0,-48
        andi    a0,a0,0xff      # redundant zero extension
        li      a5,9
        bleu    a0,a5,.L5
        li      a0,0
        ret
.L5:
        li      a0,10
        ret

The zero-extension instruction in is_digit is eliminated in the combine pass,
but the one in call_is_digit cannot be eliminated.
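The andi really is dead: for every 8-bit input the unsigned comparison gives the same answer with or without the truncation. A quick exhaustive check (my own verification, not part of the report):

```c
#include <stdint.h>

/* After `addi a0,a0,-48`, the register holds c - 48 (sign-extended).
   The later unsigned compare against 9 is unaffected by first masking
   with 0xff: inputs below '0' become either a huge unsigned number or
   208..255, and neither range is <= 9.  */
int zext_is_redundant(void) {
  for (int c = 0; c <= 255; c++) {
    uint32_t unmasked = (uint32_t)(c - 48);
    uint32_t masked = (uint32_t)((c - 48) & 0xff);
    if ((unmasked <= 9) != (masked <= 9))
      return 0;
  }
  return 1;
}
```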

[Bug tree-optimization/95685] New: Loop invariants can't be moved out of the loop

2020-06-15 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95685

Bug ID: 95685
   Summary: Loop invariants can't be moved out of the loop
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bina2374 at gmail dot com
CC: kito at gcc dot gnu.org, wilson at gcc dot gnu.org
  Target Milestone: ---
Target: riscv32-unknown-elf

Command line: bin/riscv64-unknown-elf-gcc -march=rv32imafc -mabi=ilp32f -O2
-funroll-loops bar.c -S

==
 C Source
==
unsigned short bar(unsigned char data, unsigned short crc) {
  unsigned char i = 0;

  for (i = 0; i < 3; i++) {
data >>= 1;
if ((data & 1) == 1)
  crc ^= 0x2001;
  }
  return crc;
}

=
 GCC asm
=
bar:
        srli    a4,a0,1
        andi    t0,a4,1
        mv      a5,a0
        mv      a0,a1
        beq     t0,zero,.L2
        li      t1,8192         #
        addi    t2,t1,1         # 0x2001
        xor     a0,a1,t2
.L2:
        srli    a1,a5,2
        andi    a2,a1,1
        beq     a2,zero,.L3
        li      a3,8192         #
        addi    a6,a3,1         # 0x2001
        xor     a0,a0,a6
.L3:
        srli    a7,a5,3
        andi    t3,a7,1
        beq     t3,zero,.L4
        li      t4,8192         #
        addi    t5,t4,1         # 0x2001
        xor     a0,a0,t5
.L4:
        ret

If I compile it without -funroll-loops, loop invariant code motion works:
bar:
        li      a2,8192         #
        mv      a4,a0
        li      a5,3
        mv      a0,a1
        addi    a2,a2,1         # 0x2001
.L3:
        srli    a4,a4,1
        addi    a5,a5,-1
        andi    a3,a4,1
        andi    a5,a5,0xff
        beq     a3,zero,.L2
        xor     a0,a0,a2
.L2:
        bne     a5,zero,.L3
        ret

I guess the problem is the order of optimization passes: because the cunroll
pass makes the loop no longer a loop, loop2_invariant can't work on it.
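What cunroll leaves behind can be written out in C (a sketch): three straight-line copies of the body, each rematerializing 0x2001, with no loop left for loop2_invariant to hoist anything out of.

```c
#include <stdint.h>

/* bar() after complete unrolling, in source form: the loop structure is
   gone, so the RTL loop-invariant pass never sees the repeated constant.  */
uint16_t bar_unrolled(uint8_t data, uint16_t crc) {
  data >>= 1; if ((data & 1) == 1) crc ^= 0x2001;
  data >>= 1; if ((data & 1) == 1) crc ^= 0x2001;
  data >>= 1; if ((data & 1) == 1) crc ^= 0x2001;
  return crc;
}

/* Reference loop version from the report.  */
uint16_t bar_ref(uint8_t data, uint16_t crc) {
  for (uint8_t i = 0; i < 3; i++) {
    data >>= 1;
    if ((data & 1) == 1)
      crc ^= 0x2001;
  }
  return crc;
}

/* Returns 1 if the unrolled form matches the loop for all byte inputs.  */
int bar_check(void) {
  for (int d = 0; d <= 255; d++)
    if (bar_unrolled((uint8_t)d, 0x1234) != bar_ref((uint8_t)d, 0x1234))
      return 0;
  return 1;
}
```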

===
 Ideal Code Generation 
===
bar:
        srli    a4,a0,1
        andi    t0,a4,1
        mv      a5,a0
        mv      a0,a1
        li      t1,8192         #
        addi    t2,t1,1         # t2 = 0x2001
        beq     t0,zero,.L2
        xor     a0,a1,t2
.L2:
        srli    a1,a5,2
        andi    a2,a1,1
        beq     a2,zero,.L3
        xor     a0,a0,t2        # Use t2
.L3:
        srli    a7,a5,3
        andi    t3,a7,1
        beq     t3,zero,.L4
        xor     a0,a0,t2        # Use t2
.L4:
        ret

[Bug target/95632] Redundant zero extension

2020-06-15 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95632

--- Comment #2 from Mel Chen  ---
(In reply to Jim Wilson from comment #1)
> We sign extend HImode constants as that is the natural thing to do to make
> arithmetic work.  This does mean that unsigned short logical operations need
> a zero extend after the operation which might otherwise be unnecessary. 
> This can't be handled at rtl generation time as we don't know if the
> constant will be used for arithmetic or logicals or signed or unsigned.  But
> maybe an optimization pass could go over the code and convert HImode
> constants to signed or unsigned as appropriate to reduce the number of
> sign/zero extend operations.  We have the ree pass that we might be able to
> extend to handle this.

Extending the ree pass is a good approach, but it currently seems to scan only
*_extend insns. Because the zero_extend has been split into two shift
instructions before the ree pass runs, do we need to keep the zero_extend
until ree? Or is there another way to recognize that the shift pair was a
zero_extend?
> 
> Handling this in combine requires a 4->3 splitter which is something combine
> doesn't do.  We could work around that by not splitting constants before
> combine, but that would be a major change and probably not beneficial, as we
> wouldn't be able to easily optimize the high part of the constants anymore.

I agree. This way is a bit risky.
> 
> Another approach here might be to split the xor along with the constant.  If
> we generated something like
>   srli    a0,a0,1
>   xori    a0,a0,1
>   li      a5,-24576
>   xor     a0,a0,a5
> then we can optimize away the following zero extend with a 3->2 splitter
> which combine already supports via find_split_point.  We can still optimize
> the high part of the constant. Since the immediates are sign extended, if
> the low part of the immediate has the sign bit set, we would have to invert
> the high part of the immediate to get the right result.  At least I think
> that works, I haven't double checked it yet.  This only works for or if the
> low part doesn't have the sign bit set.  And this only works for and if the
> low part does have the sign bit set.

I'm not sure how difficult it is to split one xor into two xors before the
combine pass, but I have another proposal.

The following is from the combine dump:
Trying 8, 9, 10 -> 11:
8: r79:SI=0xa000
9: r78:SI=r79:SI+0x1
  REG_DEAD r79:SI
  REG_EQUAL 0xa001
   10: r77:SI=r72:SI^r78:SI
  REG_DEAD r78:SI
  REG_DEAD r72:SI
   11: r80:SI=zero_extend(r77:SI#0)
  REG_DEAD r77:SI
Failed to match this instruction:
(set (reg:SI 80)
(xor:SI (reg:SI 72 [ _4 ])
(const_int 40961 [0xa001])))

Is it possible to pretend in the combine pass that we have a pattern matching
xor (reg:SI 80), (reg:SI 72), 0xa001? Then, if the constant is too large to
fit in the immediate field, it can be split into two xors in the split pass.
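The split is sound for this constant because xor is associative; a quick exhaustive check over 16-bit values (my verification of the low/high split of 0xa001, with the low part 0x1 fitting a 12-bit immediate):

```c
#include <stdint.h>

/* x ^ 0xa001 == (x ^ 0x1) ^ 0xa000: an xor with the full constant can
   be split into an xori with the low immediate and an xor with the
   high part, the two-instruction shape the split pass could emit.  */
int split_xor_check(void) {
  for (uint32_t x = 0; x <= 0xffff; x++)
    if ((x ^ 0xa001u) != ((x ^ 0x1u) ^ 0xa000u))
      return 0;
  return 1;
}
```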

[Bug target/95632] New: Redundant zero extension

2020-06-10 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95632

Bug ID: 95632
   Summary: Redundant zero extension
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bina2374 at gmail dot com
CC: kito at gcc dot gnu.org, wilson at gcc dot gnu.org
  Target Milestone: ---
Target: riscv32-unknown-elf

Command line: bin/riscv64-unknown-elf-gcc -march=rv32imafc -mabi=ilp32f -O2
foo.c -S

==
 C Source
==
unsigned short foo(unsigned short crc) {
  crc ^= 0x4002;
  crc >>= 1;
  crc |= 0x8000;

  return crc;
}

=
 GCC asm
=
foo:
        li      a5,-24576       #
        addi    a5,a5,1         # a5 = 0xa001
        srli    a0,a0,1
        xor     a0,a0,a5
        slli    a0,a0,16        #
        srli    a0,a0,16        # redundant zero-extension
        ret

===
 Ideal Code Generation 
===
foo:
        li      a5,40960        #
        addi    a5,a5,1         # a5 = 0xa001
        srli    a0,a0,1
        xor     a0,a0,a5
        ret
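The 0xa001 in both sequences is GCC folding the three source operations into a single xor; the identity it relies on can be checked exhaustively (my verification, not from the report):

```c
#include <stdint.h>

/* ((crc ^ 0x4002) >> 1) | 0x8000 == (crc >> 1) ^ 0xa001 for 16-bit crc:
   the 0x4002 becomes 0x2001 after the shift, and since bit 15 of
   crc >> 1 is always 0, or-ing in 0x8000 is the same as xor-ing it.  */
int fold_check(void) {
  for (uint32_t c = 0; c <= 0xffff; c++) {
    uint16_t crc = (uint16_t)c;
    uint16_t folded = (uint16_t)((crc >> 1) ^ 0xa001);
    uint16_t direct = (uint16_t)(((crc ^ 0x4002) >> 1) | 0x8000);
    if (folded != direct)
      return 0;
  }
  return 1;
}
```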

[Bug target/95252] New: testcase gcc.dg/torture/pr67916.c failure when testing with -msave-restore

2020-05-21 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95252

Bug ID: 95252
   Summary: testcase gcc.dg/torture/pr67916.c failure when testing
with -msave-restore
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bina2374 at gmail dot com
CC: kito at gcc dot gnu.org, wilson at gcc dot gnu.org
  Target Milestone: ---
Target: riscv64-unknown-elf

FAIL: gcc.dg/torture/pr67916.c   -O3 -fomit-frame-pointer -funroll-loops
-fpeel-loops -ftracer -finline-functions  execution test

The ABI is lp64d, arch is rv64imafdc, and mcmodel is medany.

The full execution command line is:
/bin/riscv64-unknown-elf-gcc
/gcc/testsuite/gcc.dg/torture/pr67916.c
-fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers
-fdiagnostics-color=never -fdiagnostics-urls=never -msave-restore -O3
-fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions
-lm -o ./pr67916.exe

Execute with qemu:
$ /bin/qemu-riscv64 ./pr67916.exe
Segmentation fault (core dumped)

By the way, I tried to reduce the number of compilation options and found the
smallest combination that still causes the execution error:
/bin/riscv64-unknown-elf-gcc
/gcc/testsuite/gcc.dg/torture/pr67916.c -msave-restore -O3
-funroll-loops -lm  -o ./pr67916.exe

[Bug tree-optimization/88440] size optimization of memcpy-like code

2020-03-09 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Mel Chen  changed:

   What|Removed |Added

 CC||bina2374 at gmail dot com

--- Comment #27 from Mel Chen  ---
Related new bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

[Bug tree-optimization/94092] New: Code size and performance degradations after -ftree-loop-distribute-patterns was enabled at -O[2s]+

2020-03-09 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Bug ID: 94092
   Summary: Code size and performance degradations after
-ftree-loop-distribute-patterns was enabled at -O[2s]+
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bina2374 at gmail dot com
  Target Milestone: ---
Target: riscv64-unknown-elf

Since https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440,
-ftree-loop-distribute-patterns has been enabled at -O2/-Os and above. We
understand that this change is helpful in some cases, but not in all. In the
original bugzilla the loop count is known, and both performance and code size
improve in the end. However, if the loop count is unknown, the results may
become worse.

void init_array(int N, int *array) {
  int i;
  for (i = 0; i < N; i++)
array[i] = 0;
}

init_array:
        mv      a2,a0
        mv      a0,a1
        ble     a2,zero,.L1
        slli    a2,a2,2
        li      a1,0
        tail    memset
.L1:
        ret

For CoreMark this is harmful not only to performance but also to code size.
Therefore, I suggest we discuss how to handle the case of an unknown loop
count.
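In source form the transform is roughly the following (a sketch): the entire loop collapses into a guarded library call, so for small trip counts the call overhead outweighs any benefit of the library memset.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of init_array after ldist pattern recognition.  */
void init_array_ldist(int N, int *array) {
  if (N > 0)
    memset(array, 0, (size_t)N * sizeof(int));
}

/* Reference loop from the report, for comparison.  */
void init_array_ref(int N, int *array) {
  for (int i = 0; i < N; i++)
    array[i] = 0;
}

/* Returns 1 if both versions agree, including the N <= 0 edge case.  */
int init_array_check(void) {
  int a[4] = {1, 2, 3, 4}, b[4] = {1, 2, 3, 4};
  init_array_ldist(0, a);
  init_array_ref(0, b);
  init_array_ldist(3, a);
  init_array_ref(3, b);
  for (int i = 0; i < 4; i++)
    if (a[i] != b[i]) return 0;
  return 1;
}
```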

[Bug rtl-optimization/92656] New: The zero_extend insn can't be eliminated in the combine pass

2019-11-25 Thread bina2374 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92656

Bug ID: 92656
   Summary: The zero_extend insn can't be eliminated in the
combine pass
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bina2374 at gmail dot com
  Target Milestone: ---
Target: RISC-V

I compiled the C function with "-march=rv32imafc -mabi=ilp32f
-mtune=sifive-7-series -O2 -funroll-loops", and there are more slli/srli
instructions than with GCC 8.3.

==
 C Source
==
unsigned short foo(unsigned short val, unsigned short result) {
  unsigned char i = 0;
  unsigned char data = (unsigned char)(val >> 8);

  for (i = 0; i < 3; i++) {
data >>= 1;
if (data & 1)
  result ^= 0x4002;
result >>= 1;
  }

  return result;
}

==
 Assembly
 GCC 9.2
=
foo:
        li      a5,16384
        srli    a4,a0,9
        addi    t0,a5,2
        andi    t1,a4,1
        xor     a3,a1,t0
        srli    a2,a0,10
        bne     t1,zero,1f; mv a3,a1; 1:        # movcc
        srli    t2,a3,1
        andi    a6,a2,1
        xor     a1,t2,t0
        srli    a0,a0,11
        slli    a7,a1,16        ###
        andi    t5,a0,1
        srli    t3,a7,16        ###
        bne     a6,zero,1f; mv t3,t2; 1:        # movcc
        srli    t4,t3,1
        slli    t6,t4,16        ###
        srli    a5,t6,16        ###
        xor     t0,a5,t0
        slli    a4,t0,16        ###
        srli    t1,a4,16        ###
        bne     t5,zero,1f; mv t1,a5; 1:        # movcc
        srli    a0,t1,1
        ret

==
 Assembly
 GCC 8.3
==
foo:
        srli    a0,a0,8
        li      a5,16384
        addi    t0,a5,2
        srli    a2,a0,1
        xor     a3,a1,t0
        andi    t1,a2,1
        bne     t1,zero,1f; mv a3,a1; 1:        # movcc
        srli    a4,a0,2
        srli    t2,a3,1
        andi    a1,a4,1
        xor     a6,t2,t0
        bne     a1,zero,1f; mv a6,t2; 1:        # movcc
        srli    a7,a0,3
        srli    t3,a6,1
        andi    t4,a7,1
        xor     t5,t3,t0
        bne     t4,zero,1f; mv t5,t3; 1:        # movcc
        srli    a0,t5,1
        ret

When the combiner tries to combine the zero_extend insn with another insn, the
substituted pattern cannot be simplified by the rule below because the last
condition (the nonzero_bits check) is not met.
In simplify-rtx.c:
/* (zero_extend:M (subreg:N <X:O>)) is <X:O> (for M == O) or
   (zero_extend:M <X:O>), if X doesn't have any non-zero bits outside
   of mode N.  E.g.
   (zero_extend:SI (subreg:QI (and:SI (reg:SI) (const_int 63)) 0)) is
   (and:SI (reg:SI) (const_int 63)).  */
if (partial_subreg_p (op)
    && is_a <scalar_int_mode> (mode, &int_mode)
    && is_a <scalar_int_mode> (GET_MODE (SUBREG_REG (op)), &op0_mode)
    && GET_MODE_PRECISION (op0_mode) <= HOST_BITS_PER_WIDE_INT
    && GET_MODE_PRECISION (int_mode) >= GET_MODE_PRECISION (op0_mode)
    && subreg_lowpart_p (op)
    && (nonzero_bits (SUBREG_REG (op), op0_mode)
        & ~GET_MODE_MASK (GET_MODE (op))) == 0)
  {
    if (GET_MODE_PRECISION (int_mode) == GET_MODE_PRECISION (op0_mode))
      return SUBREG_REG (op);
    return simplify_gen_unary (ZERO_EXTEND, int_mode, SUBREG_REG (op),
                               op0_mode);
  }
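At the C level the rule says: if the wide value has no nonzero bits outside the narrow mode, truncating it and zero-extending back is a no-op. An illustrative check of the comment's own example (not the RTL code itself):

```c
#include <stdint.h>

/* Models (zero_extend:SI (subreg:QI (and:SI x 63) 0)) == (and:SI x 63):
   x & 63 has no nonzero bits outside QImode, so the QImode truncation
   followed by zero extension changes nothing.  */
int zext_subreg_check(void) {
  for (uint32_t x = 0; x < 0x10000; x++) {
    uint32_t masked = x & 63;                       /* nonzero_bits fit in 6 bits */
    uint32_t roundtrip = (uint32_t)(uint8_t)masked; /* subreg:QI + zero_extend */
    if (roundtrip != masked)
      return 0;
  }
  return 1;
}
```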

By the way, I also noticed this issue could be caused by the 2-to-2
combination change
(https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=263067).