Re: [petsc-users] Strange efficiency in PETSc-dev using OpenMP

Danyang Su Mon, 23 Sep 2013 12:14:54 -0700

Hi Barry,

Another strange problem:

Currently I have the PETSc-3.4.2 MPI version and the PETSc-dev OpenMP version on my computer, with different values of the PETSC_ARCH and PETSC_DIR environment variables. Before I installed the PETSc-dev OpenMP version, the PETSc-3.4.2 MPI version worked fine. After installing the PETSc-dev OpenMP version, the same problem appears in the PETSc-3.4.2 MPI version when run with 1 processor, but not with 2 or more processors.


Thanks,

Danyang

On 23/09/2013 12:01 PM, Danyang Su wrote:

Hi Barry,
Sorry, I forgot to include the answer in the previous email. It is still slow when run without the "-threadcomm_type openmp -threadcomm_nthreads 1" options.
Thanks,

Danyang

On 23/09/2013 11:43 AM, Barry Smith wrote:
    You did not answer my question from yesterday:

  If you run the Openmp compiled version WITHOUT the

-threadcomm_nthreads 1
-threadcomm_type openmp

  command line options is it still slow?


On Sep 23, 2013, at 1:33 PM, Danyang Su <[email protected]> wrote:
Hi Shri,
It seems that the problem does not result from the affinity settings for threads. I have tried several settings, and the threads are bound to different cores, but there is no improvement.
Here is the package, core, and thread map information:

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 6 cores/pkg x 2 threads/core (6 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 4 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 5 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 5 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
And here are the internal thread bindings with different KMP_AFFINITY settings:
1. KMP_AFFINITY=verbose,granularity=thread,compact
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
2. KMP_AFFINITY=verbose,granularity=fine,compact
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
3. KMP_AFFINITY=verbose,granularity=fine,compact,1,0
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
4. KMP_AFFINITY=verbose,scatter
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6,7}
5. KMP_AFFINITY=verbose,compact (For this setting, two threads are assigned to the same core)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,3}
6. KMP_AFFINITY=verbose,granularity=core,compact (For this setting, two threads are assigned to the same core)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,3}
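The first 4 settings bind each thread to a distinct OS proc (and settings 3 and 4 also to a distinct core, since OS procs 0 and 1 share core 0 and OS procs 2 and 3 share core 1), but the problem is not solved.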
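Whether a given binding puts every thread on its own physical core can be checked against the OS proc map above. The following is only a minimal sketch in plain C: it assumes the uniform topology reported in the log (consecutive OS procs share a core, 2 threads per core), and the function names are hypothetical, not part of PETSc or the Intel OpenMP runtime.

```c
#include <assert.h>
#include <stdbool.h>

/* Core that owns an OS proc, ASSUMING consecutive OS procs share a core,
   as in the map above (OS procs 0 and 1 are both on core 0). */
static int core_of_proc(int os_proc, int threads_per_core)
{
    assert(threads_per_core > 0);
    return os_proc / threads_per_core;
}

/* first_proc[t] is the lowest OS proc in the set that internal thread t is
   bound to. Returns true when no two threads are bound to the same core. */
bool threads_on_distinct_cores(const int *first_proc, int nthreads,
                               int threads_per_core)
{
    for (int a = 0; a < nthreads; ++a) {
        for (int b = a + 1; b < nthreads; ++b) {
            if (core_of_proc(first_proc[a], threads_per_core) ==
                core_of_proc(first_proc[b], threads_per_core)) {
                return false;
            }
        }
    }
    return true;
}
```

With 2 threads per core, the bindings {0},{1},{2},{3} share cores pairwise, while {0},{2},{4},{6} and {0,1},{2,3},{4,5},{6,7} give each thread its own core.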
Thanks,

Danyang



On 22/09/2013 8:00 PM, Shri wrote:
I think this is definitely an issue with setting the affinities for threads, i.e., the assignment of threads to cores. Ideally each thread should be assigned to a distinct core but in your case all the 4 threads are getting pinned to the same core resulting in such a massive slowdown. Unfortunately, the thread affinities for OpenMP are set through environment variables. For Intel's OpenMP one needs to define the thread affinities through the environment variable KMP_AFFINITY. See this document here: http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm. Try setting the affinities via KMP_AFFINITY and let us know if it works.
Shri
On Sep 21, 2013, at 11:06 PM, Danyang Su wrote:
Hi Shri,
Thanks for your info. It works with the option -threadcomm_type openmp. But another problem arises, as described below.
The sparse matrix is 53760 x 53760 with 1067392 non-zero entries. If the code is compiled using PETSc-3.4.2, it works fine: the equations are solved quickly and I can see the speedup. But if the code is compiled using PETSc-dev with the OpenMP option, it takes a long time to solve the equations and I cannot see any speedup when more processors are used.
For PETSc-3.4.2, run with "mpiexec -n 4 ksp_inhm_d -log_summary log_mpi4_petsc3.4.2.log", the iteration count and runtimes are:
Iterations     6 time_assembly  0.4137E-01 time_ksp 0.9296E-01
For PETSc-dev, run with "mpiexec -n 1 ksp_inhm_d -threadcomm_type openmp -threadcomm_nthreads 4 -log_summary log_openmp_petsc_dev.log", the iteration count and runtimes are:
Iterations     6 time_assembly  0.3595E+03 time_ksp 0.2907E+00
Most of the time ('time_assembly 0.3595E+03') is spent in the following code:
                 do i = istart, iend - 1
                    ii = ia_in(i+1)
                    jj = ia_in(i+2)
                    call MatSetValues(a, ione, i, jj-ii, ja_in(ii:jj-1)-1, a_in(ii:jj-1), Insert_Values, ierr)
                 end do
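
The index arithmetic in this loop can be sketched in plain C without any PETSc calls. This is only an illustration under stated assumptions: ia and ja are taken to hold 1-based (Fortran-style) CSR offsets and column indices, as the "-1" applied to ja_in suggests, while the row number is 0-based, as it is when passed to MatSetValues. The function name is hypothetical.

```c
#include <assert.h>

/* Copy the 0-based column indices of 0-based row `row` into cols_out and
   return how many there are. ia and ja are stored in C arrays but hold
   1-based values, mirroring ia_in and ja_in in the Fortran loop. */
int csr_row_zero_based(const int *ia, const int *ja, int row, int *cols_out)
{
    int ii = ia[row];      /* Fortran ia_in(i+1): 1-based start of this row */
    int jj = ia[row + 1];  /* Fortran ia_in(i+2): 1-based start of next row */
    int ncols = jj - ii;   /* the count passed to MatSetValues as jj-ii */

    assert(ncols >= 0);
    for (int k = 0; k < ncols; ++k) {
        /* ja_in(ii:jj-1) - 1: shift each column index to 0-based */
        cols_out[k] = ja[ii - 1 + k] - 1;
    }
    return ncols;
}
```

For a 3-row matrix with ia = {1, 3, 4, 6} and ja = {1, 2, 2, 1, 3}, row 0 yields columns {0, 1} and row 2 yields columns {0, 2}.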

The log files for both PETSc-3.4.2 and PETSc-dev are attached.
Is there anything wrong with my code or with the run options? The above code works fine when using MPICH.
Thanks and regards,

Danyang

On 21/09/2013 2:09 PM, Shri wrote:
There are three thread communicator types in PETSc. The default is "no thread" which is basically a non-threaded version. The other two types are "openmp" and "pthread". If you want to use OpenMP then use the option -threadcomm_type openmp.
Shri
On Sep 21, 2013, at 3:46 PM, Danyang Su <[email protected]> wrote:
Hi Barry,

Thanks for the quick reply.

After changing
#if defined(PETSC_HAVE_PTHREADCLASSES) || defined(PETSC_HAVE_OPENMP)
to
#if defined(PETSC_HAVE_PTHREADCLASSES)
and commenting out
#elif defined(PETSC_HAVE_OPENMP)
PETSC_EXTERN PetscStack *petscstack;

It can be compiled and validated with "make test".
But I still have questions about running the examples. After rebuilding the code (e.g., ksp_ex2f.f), I can run it with "mpiexec -n 1 ksp_ex2f", "mpiexec -n 4 ksp_ex2f", or "mpiexec -n 1 ksp_ex2f -threadcomm_nthreads 1", but if I run it with "mpiexec -n 1 ksp_ex2f -threadcomm_nthreads 4", I get a lot of error messages (attached).
The code is not modified and there are no OpenMP routines in it. For the current development in my project, I want to keep the OpenMP code for calculating matrix values, but solve the system with PETSc (OpenMP). Is it possible?
Thanks and regards,

Danyang



On 21/09/2013 7:26 AM, Barry Smith wrote:
   Danyang,
I don't think the || defined(PETSC_HAVE_OPENMP) belongs in the code below.
/* Linux functions CPU_SET and others don't work if sched.h is not included before
   including pthread.h. Also, these functions are active only if either _GNU_SOURCE
   or __USE_GNU is not set (see /usr/include/sched.h and /usr/include/features.h), hence
   set these first.
*/
#if defined(PETSC_HAVE_PTHREADCLASSES) || defined(PETSC_HAVE_OPENMP)
Edit include/petscerror.h and locate these lines and remove that part and then rerun make all. Let us know if it works or not.
    Barry

i.e. replace
#if defined(PETSC_HAVE_PTHREADCLASSES) || defined(PETSC_HAVE_OPENMP)
with

#if defined(PETSC_HAVE_PTHREADCLASSES)
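
The effect of the narrower guard can be sketched as a small standalone C file. This is not the actual content of petscerror.h: the define of PETSC_HAVE_OPENMP stands in for what configure would produce on an OpenMP-only build, and the SKETCH_* marker and the function are hypothetical names added only to make the outcome visible.

```c
/* ASSUMPTION: stand-in for an OpenMP-only configure result, where
   PETSC_HAVE_OPENMP is set and PETSC_HAVE_PTHREADCLASSES is not. */
#define PETSC_HAVE_OPENMP 1

/* Guard after the suggested change: only a pthread build pulls in
   sched.h and pthread.h, so an OpenMP-only build never needs them. */
#if defined(PETSC_HAVE_PTHREADCLASSES)
#  ifndef _GNU_SOURCE
#    define _GNU_SOURCE  /* set before sched.h so CPU_SET etc. are declared */
#  endif
#  include <sched.h>
#  include <pthread.h>
#  define SKETCH_USES_PTHREAD_H 1  /* hypothetical marker */
#else
#  define SKETCH_USES_PTHREAD_H 0  /* hypothetical marker */
#endif

/* Included after the guard so that _GNU_SOURCE, when it is defined above,
   precedes every system header. */
#include <assert.h>

/* Compile-time check that the OpenMP-only configuration skips pthread.h. */
static_assert(SKETCH_USES_PTHREAD_H == 0,
              "OpenMP-only build should not include pthread.h");

/* Returns 1 when the guard above included pthread.h, 0 otherwise. */
int sketch_uses_pthread_header(void)
{
    return SKETCH_USES_PTHREAD_H;
}
```

With the original guard (the one that also tests PETSC_HAVE_OPENMP), the same configuration would take the first branch and require pthread.h, which is the header the Cygwin build failed to find.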

On Sep 21, 2013, at 6:53 AM, Matthew Knepley <[email protected]> wrote:
On Sat, Sep 21, 2013 at 12:18 AM, Danyang Su <[email protected]> wrote:
Hi All,
I got errors when compiling petsc-dev with OpenMP in Cygwin. Previously, I successfully compiled petsc-3.4.2 and it works fine.
The log files have been attached.
The OpenMP configure test is wrong. It clearly fails to find pthread.h, but the test passes. Then in petscerror.h we guard pthread.h using PETSC_HAVE_OPENMP. Can someone who knows OpenMP fix this?
     Matt
  Thanks,

Danyang



--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<error.txt>
<log_mpi4_petsc3.4.2.log><log_openmp_petsc_dev.log>
