[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-09-25 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #7 from cuilili  ---
(In reply to Martin Jambor from comment #6)
> I believe this has been fixed?

Yes.

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-24 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili  ---
I reproduced S1244 regression on znver3.

Src code:

for (int i = 0; i < LEN_1D-1; i++)
  {
a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
d[i] = a[i] + a[i+1];
  }

Base version:

Loop1:
        vmovsd      0x60c400(%rax),%xmm2
        vmovsd      0x60ba00(%rax),%xmm1
        add         $0x8,%rax
        vaddsd      %xmm1,%xmm2,%xmm0
        vmulsd      %xmm2,%xmm2,%xmm2
        vfmadd132sd %xmm1,%xmm2,%xmm1
        vaddsd      %xmm1,%xmm0,%xmm0
        vmovsd      %xmm0,0x60cdf8(%rax)
        vaddsd      0x60ce00(%rax),%xmm0,%xmm0
        vmovsd      %xmm0,0x60aff8(%rax)
        cmp         $0x9f8,%rax
        jne         Loop1

Base + commit version:

Loop1:
        vmovsd      0x60ba00(%rax),%xmm2
        vmovsd      0x60c400(%rax),%xmm1
        add         $0x8,%rax
        vmovsd      %xmm2,%xmm2,%xmm0
        vfmadd132sd %xmm2,%xmm1,%xmm0
        vfmadd132sd %xmm1,%xmm2,%xmm1
        vaddsd      %xmm1,%xmm0,%xmm0
        vmovsd      %xmm0,0x60cdf8(%rax)
        vaddsd      0x60ce00(%rax),%xmm0,%xmm0
        vmovsd      %xmm0,0x60aff8(%rax)
        cmp         $0x9f8,%rax
        jne         Loop1


In the base version, the multiply and the FMA have a dependency, which increases
the latency of the critical dependency chain. I have not yet determined why
znver3 regresses. The same binary running on ICX has an 11% gain (with
#define iterations 1).

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-09 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

cuilili  changed:

   What|Removed |Added

 CC||lili.cui at intel dot com

--- Comment #2 from cuilili  ---

The commit changed the dependency-chain-breaking function in order to generate
more FMAs. S242 has a chain that needs to be broken; the chain is in a small
loop and involves the loop reduction variable a[i-1].


Src code:

for (int i = 1; i < LEN_1D; ++i) 
   {
 a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
   }

--
Base version:

SSA tree:
  ssa1 = (s1 + s2) + b[i];
  ssa2 = c[i] + d[i];
  ssa3 = ssa1 + ssa2;
  ssa4 = ssa3 + a[i-1];

a[i-1] is kept in xmm1; only 2 instructions using xmm1 carry dependencies
across iterations.

Assembler
Loop1:
        vmovsd 0x60c400(%rax),%xmm0
        vaddsd 0x60b000(%rax),%xmm3,%xmm2
        add    $0x8,%rax
        vaddsd 0x60b9f8(%rax),%xmm0,%xmm0
        vaddsd %xmm2,%xmm0,%xmm0
        vaddsd %xmm0,%xmm1,%xmm1      ---> 1
        vmovsd %xmm1,0x60cdf8(%rax)   ---> 2
        cmp    $0xa00,%rdx
        jne    Loop1

--
Base + commit g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 version:

a[i-1] is kept in xmm0; 4 instructions using xmm0 carry dependencies across
iterations.

SSA tree:
  ssa1 = (s1 + s2) + b[i];
  ssa2 = c[i] + d[i];
  ssa3 = ssa1 + a[i-1];
  ssa4 = ssa2 + ssa3;

Assembler
Loop1:
        vaddsdq 0x60b000(%rax),%xmm0,%xmm0   ---> 1
        vmovsdq 0x60c400(%rax),%xmm1
        add     $0x8,%rax
        vaddsdq 0x60b9f8(%rax),%xmm1,%xmm1
        vaddsd  %xmm2,%xmm0,%xmm0            ---> 2
        vaddsd  %xmm1,%xmm0,%xmm0            ---> 3
        vmovsdq %xmm0,0x60cdf8(%rax)         ---> 4
        cmp     $0xa00,%rdx
        jne     Loop1

[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU

2023-06-06 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #14 from cuilili  ---
This regression has been fixed with the commit below and we can close this
ticket.

https://gcc.gnu.org/g:1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a

[Bug tree-optimization/110038] [14 Regression] ICE: in rewrite_expr_tree_parallel, at tree-ssa-reassoc.cc:5522 with --param=tree-reassoc-width=2147483647

2023-06-06 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038

--- Comment #5 from cuilili  ---
(In reply to Martin Jambor from comment #4)
> So is this now fixed?

Yes, the attachment case has been fixed.

[Bug tree-optimization/110038] [14 Regression] ICE: in rewrite_expr_tree_parallel, at tree-ssa-reassoc.cc:5522 with --param=tree-reassoc-width=2147483647

2023-05-30 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038

--- Comment #2 from cuilili  ---
(In reply to Richard Biener from comment #1)
> Probably best to limit the values to reassoc-width by adding the
> appropriate IntegerRange attribute in params.opt
> 
> IntegerRange(0, 256)
> 
> maybe?

"rewrite_expr_tree_parallel" gets a wrong width from "get_reassociation_width".

The number of ops is 4 and the width is 2147483647.

get_reassociation_width:
...
  width_min = 1;
  while (width > width_min)
    {
      int width_mid = (width + width_min) / 2;   --> (width + width_min) overflows when width == INT_MAX
...

So, as Richard suggested, limiting tree-reassoc-width to IntegerRange(0, 256)
solves the ICE. I also added a width constraint in rewrite_expr_tree_parallel;
here is the patch.


https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620154.html

1. Limit the value of tree-reassoc-width to IntegerRange(0, 256).
2. Add width limit in rewrite_expr_tree_parallel.

[Bug target/104271] [12/13 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU

2022-11-27 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #12 from cuilili  ---
This regression was caused by a store-forwarding issue; by inlining we
eliminate the two redundant pairs of loads and stores that hit it.

This regression has been fixed by 

https://gcc.gnu.org/g:1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2022-07-26 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 105493, which changed state.

Bug 105493 Summary: [12/13 Regression] x86_64 538.imagick_r 6% regressions and 
2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/105493] [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718

2022-07-26 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

cuilili  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #9 from cuilili  ---
This regression was fixed by commit
r13-1021-g269edf4e5e6ab489730038f7e3495550623179fe, now close this ticket.

[Bug target/105493] [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718

2022-05-05 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

--- Comment #2 from cuilili  ---
(In reply to Richard Biener from comment #1)
> Martin is currently re-benchmarking GCC 12 on AMD, so let's see if there's
> anything left on those.

AMD may not have this issue. Richard fixed the AMD regression with commit
r12-7612-g69619acd8d9b5856f5af6e5323d9c7c4ec9ad08f, but Intel was not fixed
because the two targets use different cost tables.

[Bug target/105493] New: [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718

2022-05-05 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

Bug ID: 105493
   Summary: [12/13 Regression] x86_64 538.imagick_r 6% regressions
and 2% 525.x264_r regressions on Alder Lake after
r12-7319-g90d693bdc9d718
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lili.cui at intel dot com
  Target Milestone: ---

Similar issue to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104762;
both are caused by the same commit 90d693bdc9d71841f51d68826ffa5bd685d7f0bc.

options: -march=native -Ofast -flto

Alder Lake single copy:

after Vs. before this commit
525.x264_r   -9.09%
538.imagick_r-25.00%

Alder Lake multicopy:

after Vs. before this commit
525.x264_r   -2.00%
538.imagick_r-6.7%

[Bug target/104723] [12 regression] Redundant usage of stack

2022-04-24 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #11 from cuilili  ---
(In reply to Jakub Jelinek from comment #10)

> And for the backend, the question is how big the penalty for the overlapping
> store is compared to doing multiple non-overlapping stores.  Say for those
> 49 bytes one could do one OI, one TI/V1TI and one QI load/store as opposed to
> one aligned and one misaligned OI load/store.
> 
> For say:
> void
> foo (void *p, void *q)
> {
>   __builtin_memcpy (p, q, 49);
> }
> we emit the 2 overlapping loads/stores for -mavx512f and 4 non-overlapping
> loads/stores with say -mavx2.

I executed both code sequences 10 times on ICX and znver3 machines.

For ICX: 2 overlapping loads/stores are 3.5x faster than 4 non-overlapping
loads/stores.
For Znver3: 2 overlapping loads/stores are 1.39x faster than 4 non-overlapping
loads/stores.


Two overlapping loads/stores:

vmovdqu ymm0, YMMWORD PTR [rsi]
vmovdqu YMMWORD PTR [rdi], ymm0
vmovdqu ymm1, YMMWORD PTR [rsi+17]
vmovdqu YMMWORD PTR [rdi+17], ymm1

Four non-overlapping loads/stores:

vmovdqu xmm0, XMMWORD PTR [rsi]
vmovdqu XMMWORD PTR [rdi], xmm0
vmovdqu xmm1, XMMWORD PTR [rsi+16]
vmovdqu XMMWORD PTR [rdi+16], xmm1
vmovdqu xmm2, XMMWORD PTR [rsi+32]
vmovdqu XMMWORD PTR [rdi+32], xmm2
movzx   eax, BYTE PTR [rsi+48]
mov BYTE PTR [rdi+48], al
---

[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU

2022-04-15 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #9 from cuilili  ---
I really appreciate your reply. I debugged the SRA pass with the small test
case and found that SRA does not handle this situation.

SRA cannot split the callee's first parameter because of "Do not decompose
non-BLKmode parameters in a way that would create a BLKmode parameter.
Especially for pass-by-reference (hence, pointer type parameters), it's not
worth it."

Before inlining:
For the caller
store-1 : 128-bit store of struct "a" (it is an implicit store during the IPA
pass; the store can only be found after a certain pass.)
For the callee
load-1 : 128-bit load of struct "a" for operation "c->a=(*a)"
store-2 : 128-bit store of struct "c->a" for operation "c->a=(*a)"
load-2 : 4 * 32-bit loads for c->a.f1, c->a.f2, c->a.f3 and c->a.f4.
(Because store-2 goes through a vector register, we cannot use the register
directly here.)

After inlining:
For the caller
None.
For the callee
store-2 : 128-bit store of struct c->a for operation "c->a=(*a)"


int callee (struct A *a, struct C *c)
{
  c->a=(*a);   
  if ((c->b + 7) & 17)
{
  c->a.f1 = c->a.f2 + c->a.f3;
  c->a.f2 = c->a.f2 - c->a.f3;
  c->a.f3 = c->a.f2 + c->a.f3;
  c->a.f4 = c->a.f2 - c->a.f3;
  c->b = c->a.f2 + c->a.f4;
  return 0;
}
  return 1;
}

int caller (int d, struct C *c)
{
  struct A a;
  a.f1 = 1 + d;
  a.f2 = 2;
  a.f3 = 12 + d;
  a.f4 = 68 + d;
  if (d > 0)
return callee (&a, c);
  else
return 1;
}
-
In 538.imagick_r (c_ray also has similar code), if we inline the hot function,
the redundant store and load structure's size is 256 bits (4 elements of 64
bits each); inlining eliminates one 256-bit store, one 256-bit load, and four
64-bit loads.
Can I do it like this: compute the total size of all callee arguments whose
redundant loads and stores can be eliminated? Thanks!

[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU

2022-03-29 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #7 from cuilili  ---
Created attachment 52706
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52706&action=edit
Add a heuristic to eliminate redundant loads and stores in the inline pass.

Hi Richard,

Could you help take a look? This is my first time adding code in mid-end, hope
you can give me some advice, thank you!

I added an INLINE_HINT_eliminate_load_and_store hint to the inline pass. When
the callee's memory access is the caller's local memory parameter and the
access size is greater than the target threshold, we enable the hint. With the
hint, inlining_insns_auto enlarges the bound. The target hook is only enabled
for x86 for now.

With the patch applied:

Icelake server: 538.imagick_r gets a 15.18% improvement for multi-copy and a
40.78% improvement for single copy, with no measurable changes in the other
benchmarks.

Cascade Lake: 538.imagick_r gets a 12.4% improvement for multi-copy, with code
size increased by 0.4% and no measurable changes in the other benchmarks.

Znver3 server: 538.imagick_r gets a 9.6% improvement for multi-copy, with code
size increased by 0.5% and no measurable changes in the other benchmarks.

[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU

2022-03-24 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #6 from cuilili  ---
I created a patch to fix this regression. The patch is under performance
testing; I will send it out later.

[Bug target/104723] [12 regression] Redundant usage of stack

2022-03-02 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #9 from cuilili  ---
(In reply to cuilili from comment #3)
> (In reply to Hongtao.liu from comment #1)
> > STF issue here?
> 
Correcting my comment #3:

I used perf to collect the "ld_blocks.store_forward" event for those two test
cases. stlf_64_55_64.S has an STLF issue because the two stores overlap, not
because they cross a cache line.

In this case it has an STLF issue.

$cat stlf_64_55_64.S
...
.LFB0:
.cfi_startproc
vmovdqu   %ymm0, -64(%rsp)
vmovdqu   %ymm1, -55(%rsp)
vmovdqu   -64(%rsp), %ymm0
ret
.cfi_endproc
...

$ perf stat -e ld_blocks.store_forward ./stlf_64_55_64.out
runtime= : 128883744

 Performance counter stats for './stlf_64_55_64.out':

10,000,507  ld_blocks.store_forward:u


In this case it can do STLF.

$ cat stlf_64_128_64.S
...
.LFB0:
.cfi_startproc
vmovdqu   %ymm0, -64(%rsp)
vmovdqu   %ymm1, -128(%rsp)
vmovdqu   -64(%rsp), %ymm0
ret
.cfi_endproc
...

$ perf stat -e ld_blocks.store_forward ./stlf_64_128_64.out
runtime= : 56477424

 Performance counter stats for './stlf_64_128_64.out':

 2  ld_blocks.store_forward:u

   0.022103902 seconds time elapsed
-

[Bug target/104723] [12 regression] Redundant usage of stack

2022-03-01 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #3 from cuilili  ---
(In reply to Hongtao.liu from comment #1)
> STF issue here?

Yes. Since "YMMWORD PTR [rsp-72]" crosses the cache line, it has an STLF issue
here.

vmovdqu64   YMMWORD PTR [rsp-72], ymm31 --> stores 32 bytes from [rsp-72],
crossing a cache line
vmovdqu64   YMMWORD PTR [rsp-55], ymm31 --> overwrites part of YMMWORD PTR
[rsp-72]
vmovdqu64   ymm31, YMMWORD PTR [rsp-72] --> store-to-load forwards from the
first instruction and has a penalty

[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

2022-02-27 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #28 from cuilili  ---
(In reply to H.J. Lu from comment #25)
> Can this be mitigated by removing redundant load and store?
Yes, inlining say_sphere can remove the redundant loads and stores. -O3 does
the inlining, but -O2 is more sensitive to code size and does not inline it.

[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

2022-02-25 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #24 from cuilili  ---
(In reply to cuilili from comment #23)
> (In reply to Richard Biener from comment #17)
> > I do wonder though how CLX is fine with such access pattern ;)  (did you 
> > test
> > with just -O2?)
> 
Sorry, I need to correct the w/ and w/t order:

Actually CLX also has STLF issues; there is a 13.7% regression when comparing
"gcc trunk + -O2" w/t and w/ "-fno-tree-vectorize".

[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

2022-02-25 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

cuilili  changed:

   What|Removed |Added

 CC||lili.cui at intel dot com

--- Comment #23 from cuilili  ---
(In reply to Richard Biener from comment #17)
> I do wonder though how CLX is fine with such access pattern ;)  (did you test
> with just -O2?)

Actually CLX also has STLF issues; there is a 13.7% regression when comparing
"gcc trunk + -O2" w/ and w/t "-fno-tree-vectorize".

[Bug target/95621] New: Add CET(PTA_SHSTK) to march=tigerlake

2020-06-10 Thread lili.cui at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95621

Bug ID: 95621
   Summary: Add CET(PTA_SHSTK) to march=tigerlake
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lili.cui at intel dot com
  Target Milestone: ---

Intel Tiger Lake needs to support CET; add PTA_SHSTK to -march=tigerlake.

[Bug target/95525] Bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG

2020-06-04 Thread lili.cui at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95525

cuilili  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from cuilili  ---
Fixed for GCC 11, GCC 10.

[Bug target/95525] New: Bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG

2020-06-04 Thread lili.cui at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95525

Bug ID: 95525
   Summary: Bitmask conflict between PTA_AVX512VP2INTERSECT  and
PTA_WAITPKG
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lili.cui at intel dot com
  Target Milestone: ---

In GCC trunk, there is a bitmask conflict between PTA_AVX512VP2INTERSECT and
PTA_WAITPKG in gcc/config/i386/i386.h:

const wide_int_bitmask PTA_AVX512VP2INTERSECT (0, HOST_WIDE_INT_1U << 9);
const wide_int_bitmask PTA_WAITPKG (0, HOST_WIDE_INT_1U << 9);