Hei,

the compilation line is as shown below:

/usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -DBOOST_ALL_NO_LIB -DBOOST_FILESYSTEM_DYN_LINK -DBOOST_MPI_DYN_LINK -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_SERIALIZATION_DYN_LINK -DUSE_CUDA -I/home/roland/Dokumente/C++-Projekte/armadillo_with_PETSc/include -I/opt/intel/compilers_and_libraries_2020.2.254/linux/mkl/include -I/opt/armadillo/include -isystem /opt/petsc_release/include -isystem /opt/fftw3/include -isystem /opt/boost/include -march=native -fopenmp-simd -DMKL_LP64 -m64 -Wall -Wextra -pedantic -fPIC -flto -O2 -funroll-loops -funroll-all-loops -fstrict-aliasing -mavx -march=native -fopenmp -std=gnu++17 -c <source_files> -o <target_files>
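As a side note: this line contains both -fopenmp-simd and (further to the right) -fopenmp. A minimal check along the following lines should show which OpenMP mode is actually active; it is written purely for illustration here and is not taken from my project, and the claim that GCC defines _OPENMP with -fopenmp but not with -fopenmp-simd is my understanding of the compiler behaviour:

// flag_check.cpp -- minimal sketch to see which OpenMP mode a compile line enables
// (illustrative only; file name and contents are not from my project)
//
//   mpicxx -fopenmp-simd flag_check.cpp && ./a.out   -> expected: pragma ignored, one sequential "region"
//   mpicxx -fopenmp      flag_check.cpp && ./a.out   -> expected: one line per OpenMP thread
#include <cstdio>
#ifdef _OPENMP        // defined by -fopenmp, but (to my knowledge) not by -fopenmp-simd
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
    std::printf("_OPENMP = %d\n", _OPENMP);
#else
    std::printf("_OPENMP not defined (e.g. only -fopenmp-simd, or no OpenMP flag at all)\n");
#endif

    // With -fopenmp this becomes a real parallel region; with only -fopenmp-simd
    // the pragma should be ignored and the block should run once, sequentially.
    #pragma omp parallel
    {
#ifdef _OPENMP
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
#else
        std::printf("hello from a plain sequential region\n");
#endif
    }
    return 0;
}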
Regards,
Roland

On 17.02.2021 18:56, Jed Brown wrote:
> It's entirely possible, especially if libgomp is being mixed with libiomp.
>
> Roland hasn't shown us the compilation line (just the linker line), because
> `omp parallel` shouldn't do anything with just -fopenmp-simd and no -fopenmp.
>
> Matthew Knepley <[email protected]> writes:
>
>> Jed, is it possible that this is an oversubscription penalty from bad
>> OpenMP settings? <said by a person who knows less about OpenMP than
>> cuneiform>
>>
>> Thanks,
>>
>> Matt
>>
>> On Wed, Feb 17, 2021 at 12:11 PM Roland Richter <[email protected]>
>> wrote:
>>
>>> My PetscScalar is complex double (i.e. an even higher penalty), but my
>>> matrix has a size of 8kk elements, so that should not be an issue.
>>> Regards,
>>> Roland
>>> ------------------------------
>>> *From:* Jed Brown <[email protected]>
>>> *Sent:* Wednesday, 17 February 2021 17:49:49
>>> *To:* Roland Richter; PETSc
>>> *Subject:* Re: [petsc-users] Explicit linking to OpenMP results in
>>> performance drop and wrong results
>>>
>>> Roland Richter <[email protected]> writes:
>>>
>>>> Hei,
>>>>
>>>> I replaced the linking line with
>>>>
>>>> /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd
>>>> -DMKL_LP64 -m64
>>>> CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
>>>> bin/armadillo_with_PETSc
>>>> -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
>>>> /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
>>>> -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
>>>> -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
>>>> /opt/boost/lib/libboost_filesystem.so.1.72.0
>>>> /opt/boost/lib/libboost_mpi.so.1.72.0
>>>> /opt/boost/lib/libboost_program_options.so.1.72.0
>>>> /opt/boost/lib/libboost_serialization.so.1.72.0
>>>> /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
>>>> /opt/petsc_release/lib/libpetsc.so
>>>> /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>>>>
>>>> and now the results are correct. Nevertheless, when comparing the loop
>>>> in lines 26-28 in file test_scaling.cpp
>>>>
>>>> #pragma omp parallel for
>>>> for(int i = 0; i < r_0 * r_1; ++i)
>>>>     *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>>>>
>>>> the version without #pragma omp parallel for is significantly faster
>>>> (i.e. 18 s vs. 28 s) than the version with it. Why is there still such
>>>> a big difference?
>>> Sounds like you're using a profile to attribute time? Each `omp parallel`
>>> region incurs a cost ranging from about a microsecond to 10 or more
>>> microseconds depending on architecture, number of threads, and OpenMP
>>> implementation.
>>> Your loop (for double precision) operates at around 8 entries per clock
>>> cycle (depending on architecture) if the operands are in cache, so the
>>> loop size r_0 * r_1 should be at least 10000 just to pay off the cost of
>>> `omp parallel`.
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
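To attach rough numbers to the `omp parallel` cost quoted above: with roughly 1-10 microseconds of overhead per region and a core clock of a few GHz, one region costs on the order of 10^3-10^4 cycles, which at ~8 entries per cycle is the time the serial loop would need for on the order of 10^4-10^5 entries, so the "at least 10000" estimate fits. Below is a rough, self-contained timing sketch for exactly this comparison; the sizes, repetition counts and function names are my own choices for illustration and it is not the code from test_scaling.cpp:

// omp_overhead_sketch.cpp -- rough timing sketch for the parallel-region overhead,
// not the original test_scaling.cpp (sizes, names and repetition counts are made up)
// build e.g. with: mpicxx -O2 -march=native -fopenmp omp_overhead_sketch.cpp
#include <chrono>
#include <complex>
#include <cstdio>
#include <vector>

// plain loop, no OpenMP pragma
void scale_serial(const std::complex<double>* in, std::complex<double>* out,
                  int n, std::complex<double> factor) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}

// same loop, wrapped in an OpenMP parallel-for region
void scale_omp(const std::complex<double>* in, std::complex<double>* out,
               int n, std::complex<double> factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}

int main() {
    using clock = std::chrono::steady_clock;
    const std::complex<double> factor{2.0, 0.5};

    // small loop (overhead-dominated) vs. a size comparable to the 8kk-element matrix
    for (int n : {1000, 8000000}) {
        const int repetitions = 80000000 / n;   // keep the total work per measurement constant
        std::vector<std::complex<double>> in(n, {1.0, 1.0}), out(n);

        for (int threaded = 0; threaded < 2; ++threaded) {
            const auto start = clock::now();
            for (int rep = 0; rep < repetitions; ++rep) {
                if (threaded) scale_omp(in.data(), out.data(), n, factor);
                else          scale_serial(in.data(), out.data(), n, factor);
            }
            const std::chrono::duration<double> elapsed = clock::now() - start;
            // print a check value so the compiler cannot drop the loops entirely
            std::printf("n = %8d, omp = %d: %7.3f s (check value %g)\n",
                        n, threaded, elapsed.count(), out.back().real());
        }
    }
    return 0;
}

With -fopenmp and several threads I would expect the n = 1000 case to be dominated by the fork/join overhead of its 80000 parallel regions, while for n = 8000000 the threaded loop should at least break even once it runs into the memory-bandwidth limit.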
