https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82604
Bug ID: 82604
Summary: [8 Regression] SPEC CPU2006 410.bwaves ~50% performance regression with trunk@253679 when ftree-parallelize-loops is used
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---
Minimal options to reproduce the regression (4 threads is just an example; more
threads can be used):
-Ofast -funroll-loops -flto -ftree-parallelize-loops=4
Auto-parallelization became largely ineffective for 410.bwaves after r253679.
CPU time is distributed across the threads as follows:
            Thread0   Thread1   Thread2   Thread3
r253679:     ~91%       ~3%       ~3%       ~3%
r253678:     ~34%      ~22%      ~22%      ~22%
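As a rough reading of these shares, one can back out the fraction P of CPU time spent in parallel regions. The sketch below rests on a hypothetical model, not on any measurement: serial work runs only on Thread0, parallel work is split evenly across all 4 threads, and idle or spin-wait time is not counted. Under that model each helper thread's share is P/4 and Thread0's share is (1 - P) + P/4. The function names are illustrative only.

```python
def parallel_fraction(shares):
    """Estimate the fraction P of total CPU time spent in parallel regions.

    Hypothetical model: serial work runs only on thread 0, parallel work is
    split evenly over all threads, and idle/spin time is not counted, so each
    helper thread's share equals P / n.
    """
    n = len(shares)
    helper_avg = sum(shares[1:]) / (n - 1)  # average helper share = P / n
    return helper_avg * n


def predicted_thread0_share(p, n):
    """Thread 0 runs all serial work (1 - P) plus its own P / n slice."""
    return (1 - p) + p / n


# Approximate per-thread CPU shares from the table (Thread0..Thread3).
profiles = {
    "r253679": [0.91, 0.03, 0.03, 0.03],
    "r253678": [0.34, 0.22, 0.22, 0.22],
}

for rev, shares in profiles.items():
    p = parallel_fraction(shares)
    t0 = predicted_thread0_share(p, len(shares))
    print(f"{rev}: parallel fraction ~{p:.0%}, predicted Thread0 share ~{t0:.0%}")
```

Under these assumptions the parallel fraction falls from about 88% of CPU time (r253678) to about 12% (r253679), and the model reproduces the observed Thread0 shares (0.12 + 0.88/4 = 0.34 and 0.88 + 0.12/4 = 0.91). This estimates CPU-time composition only; it does not by itself predict wall-clock speedup, since total CPU time differs between the two builds.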
Linking with "-fopt-info-loop-optimized" shows that only half as many loops are
parallelized:
---
gfortran -Ofast -funroll-loops -flto -ftree-parallelize-loops=4 -g
-fopt-info-loop-optimized=loop.optimized *.o
grep parallelizing loop.optimized -c
---
r253679: 19
r253678: 38
The most valuable missed parallelization is the one r253678 reports as
"block_solver.f:170:0: note: parallelizing outer loop 2"
in the hottest function, "mat_times_vec".
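The counts above only say how many loops were lost. To list which ones, the two opt-info dumps can be compared by source location. A minimal sketch, assuming each dump (written by the gfortran command above to loop.optimized, one per revision) contains note lines in the same form as the block_solver.f:170 note; the script and function names are hypothetical. Because it deduplicates by (file:line, loop kind), its totals may differ from `grep -c` when several notes share a location.

```python
import re
import sys

# Matches opt-info notes of the form
# "block_solver.f:170:0: note: parallelizing outer loop 2"
NOTE = re.compile(r"^(?P<loc>[^:\s]+:\d+):\d+: note: parallelizing (?P<kind>\w+) loop")


def parallelized_loops(lines):
    """Return the set of (file:line, kind) pairs reported as parallelized."""
    found = set()
    for line in lines:
        m = NOTE.match(line)
        if m:
            found.add((m.group("loc"), m.group("kind")))
    return found


def missed(good_lines, bad_lines):
    """Loops parallelized in the good build but not in the regressed one."""
    return sorted(parallelized_loops(good_lines) - parallelized_loops(bad_lines))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage:
    #   python missed_loops.py r253678/loop.optimized r253679/loop.optimized
    with open(sys.argv[1]) as good, open(sys.argv[2]) as bad:
        for loc, kind in missed(good, bad):
            print(f"{loc}: {kind} loop no longer parallelized")
```

Keying on location rather than on the trailing loop number is a deliberate choice: loop numbers are internal to the compiler and may shift between revisions, while file:line stays stable for unchanged sources.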