https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117717

--- Comment #6 from Chengyi Zhang <iansseijelly at berkeley dot edu> ---
More bug report information:
GCC version: gcc (conda-forge gcc 14.2.0-1) 14.2.0
system: x86_64 Intel(R) Xeon(R) w5-3425
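GCC was pre-built.
Issue: the binaries built with -O2 and -O3, where autovectorization is enabled, run far slower than the -O0, -O1, and -O3 -fno-tree-vectorize builds (about 6.8 s versus 1.3 s for -O3 with and without vectorization).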
```
gcc sort.c -o sort.bin -g
gcc -O0 sort.c -o sort_0.bin -g
gcc -O1 sort.c -o sort_1.bin -g
gcc -O2 sort.c -o sort_2.bin -g -fopt-info
gcc -O3 sort.c -o sort_3.bin -g -fopt-info
gcc -O3 sort.c -o sort_3_no_vectorize.bin -g -fno-tree-vectorize

---running O0---
3733 ms

---running O1---
1748 ms

---running O2---
6830 ms

---running O3---
6843 ms

---running O3 no vectorize---
1320 ms
```

-fopt-info message:
```
sort.c:56:5: optimized:  Inlining generate_array/15 into sort_array/17.
sort.c:63:5: optimized:  Inlining start/12 into main/18.
sort.c:58:5: optimized:  Inlined stop/13 into sort_array/17 which now has time 753.000147 and size 30, net change of -4.
optimized:  Inlined check_sorted.part.0/23 into check_sorted/16 which now has time 1147.395180 and size 19, net change of -3.
sort.c:57:5: optimized:  Inlined bubble_sort/25 into sort_array/17 which now has time 1626.380314 and size 43, net change of +13.
sort.c:59:5: optimized:  Inlined check_sorted/26 into sort_array/17 which now has time 2760.851677 and size 57, net change of +14.
sort.c:31:26: optimized: basic block part vectorized using 8 byte vectors
sort.c:39:24: optimized: loop vectorized using 16 byte vectors
sort.c:39:24: optimized: loop vectorized using 16 byte vectors
sort.c:31:26: optimized: basic block part vectorized using 8 byte vectors
```
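
For readers without the attachment, here is a minimal sketch of what a benchmark of this shape might look like. The original sort.c is not reproduced in this comment, so everything below is an assumption: only the function names (generate_array, bubble_sort, check_sorted, sort_array) are taken from the -fopt-info output above, while the bodies, the seed, and the clock()-based timing are hypothetical stand-ins. It is not guaranteed to reproduce the slowdown.
```c
/* Hypothetical reconstruction of a sort.c-style benchmark.
   Function names follow the -fopt-info output; the bodies are assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fill arr with pseudo-random values; the fixed seed (an assumption)
   keeps runs comparable across optimization levels. */
static void generate_array(int *arr, size_t n)
{
    srand(42);
    for (size_t i = 0; i < n; i++)
        arr[i] = rand();
}

/* Textbook bubble sort: compare adjacent elements and swap if out of order. */
static void bubble_sort(int *arr, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        for (size_t j = 0; j + 1 < n - i; j++) {
            if (arr[j] > arr[j + 1]) {
                int tmp = arr[j];
                arr[j] = arr[j + 1];
                arr[j + 1] = tmp;
            }
        }
    }
}

/* Return 1 if arr is in non-decreasing order, 0 otherwise. */
static int check_sorted(const int *arr, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (arr[i - 1] > arr[i])
            return 0;
    return 1;
}

/* Sort n pseudo-random ints, print the CPU time spent sorting in ms,
   and return 1 if the result is sorted (0 on failure). */
static int sort_array(size_t n)
{
    int *arr = malloc(n * sizeof *arr);
    if (arr == NULL)
        return 0;
    generate_array(arr, n);
    clock_t t0 = clock();
    bubble_sort(arr, n);
    clock_t t1 = clock();
    printf("%ld ms\n", (long)((t1 - t0) * 1000 / CLOCKS_PER_SEC));
    int ok = check_sorted(arr, n);
    free(arr);
    return ok;
}
```
Calling sort_array with a few tens of thousands of elements from main and building the file with the commands listed earlier would give one timing line per optimization level.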
