https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
Bug ID: 119298
Summary: 538.imagick_r is faster when compiled with GCC 14.2 and -Ofast -flto -march=native than with master on Zen5
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, rguenth at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux-gnu
Target: x86_64-linux-gnu
The SPEC FPrate 2017 538.imagick_r benchmark is faster when compiled
with GCC 14.2 and -Ofast -flto -march=native than with trunk/master on
Zen 5 CPUs.
The regression was introduced in r15-3441-g4292297a0f938f (Jan
Hubicka: Zen5 tuning part 5: update instruction latencies in
x86-tune-costs).
The culprit is the modification of "cost of ADDSS/SD SUBSS/SD insns":
bumping it back to COSTS_N_INSNS(3) (instead of COSTS_N_INSNS(2))
makes the regression go away. Nevertheless, Honza claims the new cost
should be correct.
Perf stat of the slow run:
     116866.57 msec task-clock:u              #  1.000 CPUs utilized
             0      context-switches:u        #  0.000 /sec
             0      cpu-migrations:u          #  0.000 /sec
          8347      page-faults:u             # 71.423 /sec
  484499860679      cycles:u                  #  4.146 GHz
   21879349058      stalled-cycles-frontend:u #  4.52% frontend cycles idle
 2030074730877      instructions:u            #   4.19 insn per cycle
                                              #   0.01 stalled cycles per insn
  224436542157      branches:u                #  1.920 G/sec
    1716173329      branch-misses:u           #  0.76% of all branches
 116.881252465 seconds time elapsed
 116.808499000 seconds user
   0.057350000 seconds sys
Perf report of the slow run (annotated assembly attached):
# Samples: 470K of event 'cycles:Pu'
# Event count (approx.): 484158470552
#
# Overhead  Samples  Command          Shared Object                    Symbol
# ........  .......  ...............  ...............................  .............................
#
    44.71%   210348  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.76%   135308  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
    25.50%   120106  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply
Perf stat of the fast run (with just the one cost reverted):
Performance counter stats for 'taskset -c 0 specinvoke':
     108805.48 msec task-clock:u              #  1.000 CPUs utilized
             0      context-switches:u        #  0.000 /sec
             0      cpu-migrations:u          #  0.000 /sec
          8312      page-faults:u             # 76.393 /sec
  450981792793      cycles:u                  #  4.145 GHz
   22610930072      stalled-cycles-frontend:u #  5.01% frontend cycles idle
 1933965750890      instructions:u            #   4.29 insn per cycle
                                              #   0.01 stalled cycles per insn
  224433996552      branches:u                #  2.063 G/sec
    1721069495      branch-misses:u           #  0.77% of all branches
 108.819368844 seconds time elapsed
 108.763582000 seconds user
   0.041314000 seconds sys
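As a rough cross-check of the two perf stat outputs above, the following minimal Python sketch recomputes the slow/fast ratios from the reported counters. The dictionary and function names are illustrative and not part of the original report; only the counter values are taken from it.

```python
# Counter values copied from the two perf stat outputs in this report.
slow = {
    "cycles": 484499860679,
    "instructions": 2030074730877,
    "branches": 224436542157,
    "seconds": 116.881252465,
}
fast = {
    "cycles": 450981792793,
    "instructions": 1933965750890,
    "branches": 224433996552,
    "seconds": 108.819368844,
}


def ratio(key):
    """Slow-run value divided by fast-run value for one counter."""
    return slow[key] / fast[key]


def ipc(run):
    """Instructions per cycle for one run."""
    return run["instructions"] / run["cycles"]


for key in ("seconds", "cycles", "instructions", "branches"):
    print(f"{key:13s} slow/fast = {ratio(key):.4f}")
print(f"IPC slow = {ipc(slow):.2f}, IPC fast = {ipc(fast):.2f}")
```

With these numbers, the slow binary needs roughly 7.4% more cycles and executes roughly 5% more instructions, while the branch count is essentially identical, so the additional instructions are almost entirely non-branch instructions.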
Perf report of the fast run (annotated assembly attached):
# Samples: 427K of event 'cycles:Pu'
# Event count (approx.): 439380128661
#
# Overhead  Samples  Command          Shared Object                    Symbol
# ........  .......  ...............  ...............................  .............................
#
    44.53%   190164  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.13%   120243  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply
    26.20%   111906  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
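To see which functions account for the difference, the per-symbol sample counts of the two perf reports can be compared directly. This is only a sketch: it assumes both profiles were recorded with the same sampling period, so that sample counts are comparable as a proxy for cycles, and the names used in the code are illustrative rather than taken from the report.

```python
# Sample counts copied from the two perf report tables in this report.
slow_samples = {
    "MeanShiftImage": 210348,
    "GetVirtualPixelsFromNexus": 135308,
    "MorphologyApply": 120106,
}
fast_samples = {
    "MeanShiftImage": 190164,
    "GetVirtualPixelsFromNexus": 111906,
    "MorphologyApply": 120243,
}


def extra_samples(symbol):
    """Samples the slow run spent in `symbol` beyond the fast run."""
    return slow_samples[symbol] - fast_samples[symbol]


for symbol in slow_samples:
    delta = extra_samples(symbol)
    pct = 100.0 * delta / fast_samples[symbol]
    print(f"{symbol:26s} {delta:+7d} samples ({pct:+.1f}%)")
```

Under that assumption, the extra samples fall in MeanShiftImage (about +10.6%) and GetVirtualPixelsFromNexus (about +20.9%), while MorphologyApply is essentially unchanged.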
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)