Hei,

that was the reason for the increased run times. When removing #pragma omp parallel for, the loop took ~18 s. When changing it to #pragma omp parallel for num_threads(2) or #pragma omp parallel for num_threads(4) (on an i7-6700), the loop took ~16 s, but when increasing it to #pragma omp parallel for num_threads(8), the loop took 28 s.
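For reference, a minimal, self-contained sketch of the kind of timing behind these numbers (the dimensions, the repetition count and the timing scaffold here are only placeholders, so absolute times will differ from the values above; only the inner loop body is the one from lines 26-28 of test_scaling.cpp quoted below):

// Minimal sketch of the timed loop; dimensions, repetition count and the
// timing scaffold are placeholders, only the inner loop body matches
// test_scaling.cpp.
#include <chrono>
#include <complex>
#include <cstdio>
#include <vector>

int main()
{
    const int r_0 = 2000, r_1 = 4000;   // placeholder dimensions (~8kk elements)
    const int reps = 100;               // placeholder repetition count
    const std::complex<double> scaling_factor(2.0, 0.0);

    std::vector<std::complex<double>> in_mat(r_0 * r_1, std::complex<double>(1.0, 1.0));
    std::vector<std::complex<double>> out_mat(r_0 * r_1);
    const std::complex<double> *in_mat_ptr = in_mat.data();
    std::complex<double> *out_mat_ptr = out_mat.data();

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
#pragma omp parallel for num_threads(4) // also tested: no pragma, num_threads(2), num_threads(8)
        for (int i = 0; i < r_0 * r_1; ++i)
            *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
    }
    const auto stop = std::chrono::steady_clock::now();

    std::printf("loop took %.3f s\n",
                std::chrono::duration<double>(stop - start).count());
    return 0;
}

Compiled with g++ -O3 -fopenmp; without -fopenmp the pragma is simply ignored and the loop runs serially, which gives the baseline timing.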
Regards,

Roland

On 17.02.21 at 18:51, Matthew Knepley wrote:
> Jed, is it possible that this is an oversubscription penalty from bad
> OpenMP settings? <said by a person who knows less about OpenMP than
> cuneiform>
>
>   Thanks,
>
>      Matt
>
> On Wed, Feb 17, 2021 at 12:11 PM Roland Richter <[email protected]> wrote:
>
>     My PetscScalar is complex double (i.e. even higher penalty), but
>     my matrix has a size of 8kk elements, so that should not be an issue.
>
>     Regards,
>
>     Roland
>
>     ------------------------------------------------------------------------
>     From: Jed Brown <[email protected]>
>     Sent: Wednesday, 17 February 2021 17:49:49
>     To: Roland Richter; PETSc
>     Subject: Re: [petsc-users] Explicit linking to OpenMP results in
>     performance drop and wrong results
>
>     Roland Richter <[email protected]> writes:
>
>     > Hei,
>     >
>     > I replaced the linking line with
>     >
>     > /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd
>     > -DMKL_LP64 -m64
>     > CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
>     > bin/armadillo_with_PETSc
>     > -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
>     > /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
>     > -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
>     > -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
>     > /opt/boost/lib/libboost_filesystem.so.1.72.0
>     > /opt/boost/lib/libboost_mpi.so.1.72.0
>     > /opt/boost/lib/libboost_program_options.so.1.72.0
>     > /opt/boost/lib/libboost_serialization.so.1.72.0
>     > /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
>     > /opt/petsc_release/lib/libpetsc.so
>     > /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>     >
>     > and now the results are correct. Nevertheless, when comparing the loop
>     > in lines 26-28 in file test_scaling.cpp
>     >
>     > #pragma omp parallel for
>     >     for(int i = 0; i < r_0 * r_1; ++i)
>     >         *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>     >
>     > the version without #pragma omp parallel for is significantly faster
>     > (i.e. 18 s vs 28 s) compared to the version with omp. Why is there
>     > still such a big difference?
>
>     Sounds like you're using a profile to attribute time? Each `omp
>     parallel` region incurs a cost ranging from about a microsecond to
>     10 or more microseconds depending on architecture, number of
>     threads, and OpenMP implementation. Your loop (for double
>     precision) operates at around 8 entries per clock cycle (depending
>     on architecture) if the operands are in cache, so the loop size r_0
>     * r_1 should be at least 10000 just to pay off the cost of `omp
>     parallel`.
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
>
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
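PS: To get a feel for the per-region cost Jed mentions above, here is a minimal sketch (mine, not from the thread) that times many near-empty `omp parallel` regions and reports the average cost per region; the repetition count is arbitrary:

// Minimal sketch for estimating the per-region cost of 'omp parallel';
// 'sink' only exists so the region is not optimized away.
#include <chrono>
#include <cstdio>

int main()
{
    const int reps = 100000;    // arbitrary repetition count
    long long sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
#pragma omp parallel
        {
#pragma omp atomic
            ++sink;             // trivial work inside the region
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const double total_us =
        std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("~%.2f us per 'omp parallel' region on average (sink = %lld)\n",
                total_us / reps, sink);
    return 0;
}

With the roughly 1-10 us per region that Jed quotes, the trip count r_0 * r_1 indeed has to be at least on the order of 10000 before the parallel region even pays for itself.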
