Hei,

the compilation line is as shown below:

/usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -DBOOST_ALL_NO_LIB -DBOOST_FILESYSTEM_DYN_LINK -DBOOST_MPI_DYN_LINK -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_SERIALIZATION_DYN_LINK -DUSE_CUDA -I/home/roland/Dokumente/C++-Projekte/armadillo_with_PETSc/include -I/opt/intel/compilers_and_libraries_2020.2.254/linux/mkl/include -I/opt/armadillo/include -isystem /opt/petsc_release/include -isystem /opt/fftw3/include -isystem /opt/boost/include -march=native -fopenmp-simd -DMKL_LP64 -m64 -Wall -Wextra -pedantic -fPIC -flto -O2 -funroll-loops -funroll-all-loops -fstrict-aliasing -mavx -march=native -fopenmp -std=gnu++17 -c <source_files> -o <target_files>
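As a side note: this line contains both -fopenmp-simd and (further to the right) -fopenmp. A minimal check along the following lines should show which OpenMP mode is actually active; it is written purely for illustration here and is not taken from my project, and the claim that GCC defines _OPENMP with -fopenmp but not with -fopenmp-simd is my understanding of the compiler behaviour:

// flag_check.cpp -- minimal sketch to see which OpenMP mode a compile line enables
// (illustrative only; file name and contents are not from my project)
//
//   mpicxx -fopenmp-simd flag_check.cpp && ./a.out   -> expected: pragma ignored, one sequential "region"
//   mpicxx -fopenmp      flag_check.cpp && ./a.out   -> expected: one line per OpenMP thread
#include <cstdio>
#ifdef _OPENMP        // defined by -fopenmp, but (to my knowledge) not by -fopenmp-simd
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
    std::printf("_OPENMP = %d\n", _OPENMP);
#else
    std::printf("_OPENMP not defined (e.g. only -fopenmp-simd, or no OpenMP flag at all)\n");
#endif

    // With -fopenmp this becomes a real parallel region; with only -fopenmp-simd
    // the pragma should be ignored and the block should run once, sequentially.
    #pragma omp parallel
    {
#ifdef _OPENMP
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
#else
        std::printf("hello from a plain sequential region\n");
#endif
    }
    return 0;
}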
Regards,
Roland

On 17.02.2021 18:56, Jed Brown wrote:
> It's entirely possible, especially if libgomp is being mixed with libiomp.
>
> Roland hasn't shown us the compilation line (just the linker line), because
> `omp parallel` shouldn't do anything with just -fopenmp-simd and no -fopenmp.
>
> Matthew Knepley <[email protected]> writes:
>
>> Jed, is it possible that this is an oversubscription penalty from bad
>> OpenMP settings? <said by a person who knows less about OpenMP than
>> cuneiform>
>>
>> Thanks,
>>
>> Matt
>>
>> On Wed, Feb 17, 2021 at 12:11 PM Roland Richter <[email protected]>
>> wrote:
>>
>>> My PetscScalar is complex double (i.e. an even higher penalty), but my
>>> matrix has a size of 8kk elements, so that should not be an issue.
>>> Regards,
>>> Roland
>>> ------------------------------
>>> *From:* Jed Brown <[email protected]>
>>> *Sent:* Wednesday, 17 February 2021 17:49:49
>>> *To:* Roland Richter; PETSc
>>> *Subject:* Re: [petsc-users] Explicit linking to OpenMP results in
>>> performance drop and wrong results
>>>
>>> Roland Richter <[email protected]> writes:
>>>
>>>> Hei,
>>>>
>>>> I replaced the linking line with
>>>>
>>>> /usr/lib64/mpi/gcc/openmpi3/bin/mpicxx -march=native -fopenmp-simd
>>>> -DMKL_LP64 -m64
>>>> CMakeFiles/armadillo_with_PETSc.dir/Unity/unity_0_cxx.cxx.o -o
>>>> bin/armadillo_with_PETSc
>>>> -Wl,-rpath,/opt/boost/lib:/opt/fftw3/lib64:/opt/petsc_release/lib
>>>> /usr/lib64/libgsl.so /usr/lib64/libgslcblas.so -lgfortran
>>>> -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_lp64
>>>> -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
>>>> /opt/boost/lib/libboost_filesystem.so.1.72.0
>>>> /opt/boost/lib/libboost_mpi.so.1.72.0
>>>> /opt/boost/lib/libboost_program_options.so.1.72.0
>>>> /opt/boost/lib/libboost_serialization.so.1.72.0
>>>> /opt/fftw3/lib64/libfftw3.so /opt/fftw3/lib64/libfftw3_mpi.so
>>>> /opt/petsc_release/lib/libpetsc.so
>>>> /usr/lib64/gcc/x86_64-suse-linux/9/libgomp.so
>>>>
>>>> and now the results are correct. Nevertheless, when comparing the loop
>>>> in lines 26-28 in file test_scaling.cpp
>>>>
>>>> #pragma omp parallel for
>>>> for(int i = 0; i < r_0 * r_1; ++i)
>>>>     *(out_mat_ptr + i) = (*(in_mat_ptr + i) * scaling_factor);
>>>>
>>>> the version without #pragma omp parallel for is significantly faster
>>>> (i.e. 18 s vs. 28 s) than the version with it. Why is there still such
>>>> a big difference?
>>> Sounds like you're using a profile to attribute time? Each `omp parallel`
>>> region incurs a cost ranging from about a microsecond to 10 or more
>>> microseconds depending on architecture, number of threads, and OpenMP
>>> implementation.
>>> Your loop (for double precision) operates at around 8 entries per clock
>>> cycle (depending on architecture) if the operands are in cache, so the
>>> loop size r_0 * r_1 should be at least 10000 just to pay off the cost of
>>> `omp parallel`.
>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
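To attach rough numbers to the `omp parallel` cost quoted above: with roughly 1-10 microseconds of overhead per region and a core clock of a few GHz, one region costs on the order of 10^3-10^4 cycles, which at ~8 entries per cycle is the time the serial loop would need for on the order of 10^4-10^5 entries, so the "at least 10000" estimate fits. Below is a rough, self-contained timing sketch for exactly this comparison; the sizes, repetition counts and function names are my own choices for illustration and it is not the code from test_scaling.cpp:

// omp_overhead_sketch.cpp -- rough timing sketch for the parallel-region overhead,
// not the original test_scaling.cpp (sizes, names and repetition counts are made up)
// build e.g. with: mpicxx -O2 -march=native -fopenmp omp_overhead_sketch.cpp
#include <chrono>
#include <complex>
#include <cstdio>
#include <vector>

// plain loop, no OpenMP pragma
void scale_serial(const std::complex<double>* in, std::complex<double>* out,
                  int n, std::complex<double> factor) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}

// same loop, wrapped in an OpenMP parallel-for region
void scale_omp(const std::complex<double>* in, std::complex<double>* out,
               int n, std::complex<double> factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}

int main() {
    using clock = std::chrono::steady_clock;
    const std::complex<double> factor{2.0, 0.5};

    // small loop (overhead-dominated) vs. a size comparable to the 8kk-element matrix
    for (int n : {1000, 8000000}) {
        const int repetitions = 80000000 / n;   // keep the total work per measurement constant
        std::vector<std::complex<double>> in(n, {1.0, 1.0}), out(n);

        for (int threaded = 0; threaded < 2; ++threaded) {
            const auto start = clock::now();
            for (int rep = 0; rep < repetitions; ++rep) {
                if (threaded) scale_omp(in.data(), out.data(), n, factor);
                else          scale_serial(in.data(), out.data(), n, factor);
            }
            const std::chrono::duration<double> elapsed = clock::now() - start;
            // print a check value so the compiler cannot drop the loops entirely
            std::printf("n = %8d, omp = %d: %7.3f s (check value %g)\n",
                        n, threaded, elapsed.count(), out.back().real());
        }
    }
    return 0;
}

With -fopenmp and several threads I would expect the n = 1000 case to be dominated by the fork/join overhead of its 80000 parallel regions, while for n = 8000000 the threaded loop should at least break even once it runs into the memory-bandwidth limit.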
