https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117717
--- Comment #6 from Chengyi Zhang <iansseijelly at berkeley dot edu> ---
More bug report information:

GCC version: gcc (conda-forge gcc 14.2.0-1) 14.2.0
System: x86_64 Intel(R) Xeon(R) w5-3425
GCC was pre-built.

Issue: the binaries built with autovectorization at -O2 and -O3 run much slower (about 6.8 s) than the -O1 build (about 1.7 s) and the -O3 build with -fno-tree-vectorize (about 1.3 s).

```
gcc sort.c -o sort.bin -g
gcc -O0 sort.c -o sort_0.bin -g
gcc -O1 sort.c -o sort_1.bin -g
gcc -O2 sort.c -o sort_2.bin -g -fopt-info
gcc -O3 sort.c -o sort_3.bin -g -fopt-info
gcc -O3 sort.c -o sort_3_no_vectorize.bin -g -fno-tree-vectorize
---running O0---
3733 ms
---running O1---
1748 ms
---running O2---
6830 ms
---running O3---
6843 ms
---running O3 no vectorize---
1320 ms
```

-fopt-info message:
```
sort.c:56:5: optimized: Inlining generate_array/15 into sort_array/17.
sort.c:63:5: optimized: Inlining start/12 into main/18.
sort.c:58:5: optimized: Inlined stop/13 into sort_array/17 which now has time 753.000147 and size 30, net change of -4.
optimized: Inlined check_sorted.part.0/23 into check_sorted/16 which now has time 1147.395180 and size 19, net change of -3.
sort.c:57:5: optimized: Inlined bubble_sort/25 into sort_array/17 which now has time 1626.380314 and size 43, net change of +13.
sort.c:59:5: optimized: Inlined check_sorted/26 into sort_array/17 which now has time 2760.851677 and size 57, net change of +14.
sort.c:31:26: optimized: basic block part vectorized using 8 byte vectors
sort.c:39:24: optimized: loop vectorized using 16 byte vectors
sort.c:39:24: optimized: loop vectorized using 16 byte vectors
sort.c:31:26: optimized: basic block part vectorized using 8 byte vectors
```
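
sort.c itself is not reproduced in this comment. For readers who want something to compile, here is a minimal sketch of the kind of code the function names in the -fopt-info output suggest. It assumes an adjacent-element compare-and-swap bubble sort over an `int` array. The function bodies, signatures, and element type are all assumptions, not the reporter's source. If the real swap looks like this, the "basic block part vectorized using 8 byte vectors" message may correspond to the two adjacent 4-byte elements being handled as one 8-byte vector, but that is a guess.

```c
#include <stddef.h>

/* Hypothetical stand-in for the bubble_sort named in the -fopt-info
   output; the real sort.c is not shown in this comment. */
void bubble_sort(int *a, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        for (size_t j = 0; j + 1 < n - i; j++) {
            if (a[j] > a[j + 1]) {
                /* Swap two adjacent 4-byte ints, i.e. 8 contiguous bytes. */
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}

/* Hypothetical stand-in for check_sorted: returns 1 if the array is in
   ascending order, 0 otherwise. */
int check_sorted(const int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (a[i - 1] > a[i])
            return 0;
    }
    return 1;
}
```

Building a file like this with `gcc -O2 -fopt-info-vec` and again with `-fno-tree-vectorize` added should show whether the same vectorization messages appear. Whether this sketch reproduces the slowdown reported above has not been verified.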