https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859
--- Comment #16 from Thorsten Kurth <thorstenkurth at me dot com> ---
FYI, the code is https://github.com/zronaghi/BoxLib.git, branch cpp_kernels_openmp4dot5, file Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp. For example, lines 542 and 543 can be commented in or out, and when the test case is run with them commented in, there is a significant slowdown. I did not map all the scalar variables, so that might be part of the problem, but in my opinion the compiler should not create copies of them at all. Please don't dig into that code right now, because it is a bit convoluted; I just wanted to show that this issue appears.

So with the target section I mentioned above commented in, running:

#!/bin/bash
export OMP_NESTED=false
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_MAX_ACTIVE_LEVELS=1
execpath="/project/projectdirs/mpccc/tkurth/Portability/BoxLib/Tutorials/MultiGrid_C"
exec=`ls -latr ${execpath}/main3d.*.MPI.OMP.ex | awk '{print $9}'`
#execute
${exec} inputs

gives the following:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit)
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
----------------------------------------
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs                = 135.516568492921
MultiGrid: Initial residual           = 135.516568492921
MultiGrid: Iteration   1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration   2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration   3 resid/bnorm = 0.000551321916982188
MultiGrid: Iteration   4 resid/bnorm = 3.55014555643671e-05
MultiGrid: Iteration   5 resid/bnorm = 2.57082340920002e-06
MultiGrid: Iteration   6 resid/bnorm = 1.90970439886018e-07
MultiGrid: Iteration   7 resid/bnorm = 1.44525222814178e-08
MultiGrid: Iteration   8 resid/bnorm = 1.10675190626368e-09
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11 , Solve time: 5.84898591041565, CG time: 0.162226438522339
Converged res < eps_rel*max(bnorm,res_norm)
Run time      : 5.98936820030212
----------------------------------------
Unused ParmParse Variables:
  [TOP]::hypre.solver_flag(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
  [TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
  [TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
  [TOP]::hypre.skip_relax(nvals = 1)  :: [1]
  [TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

When I comment it out and recompile, I get:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit)
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
----------------------------------------
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs                = 135.516568492921
MultiGrid: Initial residual           = 135.516568492921
MultiGrid: Iteration   1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration   2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration   3 resid/bnorm = 0.000551321916981978
MultiGrid: Iteration   4 resid/bnorm = 3.5501455563633e-05
MultiGrid: Iteration   5 resid/bnorm = 2.5708234090034e-06
MultiGrid: Iteration   6 resid/bnorm = 1.90970439781153e-07
MultiGrid: Iteration   7 resid/bnorm = 1.44525225042545e-08
MultiGrid: Iteration   8 resid/bnorm = 1.10675108045705e-09
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11 , Solve time: 0.759385108947754, CG time: 0.14183521270752
Converged res < eps_rel*max(bnorm,res_norm)
Run time      : 0.879786014556885
----------------------------------------
Unused ParmParse Variables:
  [TOP]::hypre.solver_flag(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
  [TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
  [TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
  [TOP]::hypre.skip_relax(nvals = 1)  :: [1]
  [TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

That is roughly a 7.3x slowdown (5.85 s vs. 0.76 s solve time). The smoothing kernel (Gauss-Seidel red-black) is the most expensive kernel in the multigrid code, so the biggest effect shows up there, but the other kernels (prolongation, restriction, dot products, etc.) have slowdowns as well, amounting to more than a 10x slowdown for the whole application.