Yes, PETSc prefers sequential MKL, as MPI handles the parallelism. One way to trick petsc configure into using threaded MKL is to enable pardiso, e.g.:
--with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl

http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2018/11/15/configure_master_arch-pardiso_grind.log

BLAS/LAPACK: -Wl,-rpath,/soft/com/packages/intel/16/u3/mkl/lib/intel64
-L/soft/com/packages/intel/16/u3/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core
-lmkl_intel_thread -liomp5 -ldl -lpthread

Or you can manually specify the correct MKL library list [with threading] via
the --with-blaslapack-lib option.

Satish

On Fri, 16 Nov 2018, Ivan Voznyuk via petsc-users wrote:

> Hi,
> You were totally right: no miracle, the parallelization does come from
> multithreading. We checked Option 1/: playing with OMP_NUM_THREADS=1
> changed the computational time.
>
> So, I reinstalled everything (starting with Ubuntu and ending with PETSc) and
> configured the following things:
>
> - installed the system's openmpi
> - installed Intel MKL BLAS / LAPACK
> - configured PETSc as ./configure --with-cc=mpicc --with-fc=mpif90
>   --with-cxx=mpicxx --with-blas-lapack-dir=/opt/intel/mkl/lib/intel64
>   --download-scalapack --download-mumps --with-hwloc --with-shared
>   --with-openmp=1 --with-pthread=1 --with-scalar-type=complex
>   hoping that it would take BLAS multithreading into account
> - installed petsc4py
>
> However, I do not get any parallelization...
> What I have tried so far, unsuccessfully:
> - play with OMP_NUM_THREADS
> - reinstall the system
> - ldd PETSc.cpython-35m-x86_64-linux-gnu.so yields lld_result.txt (here
>   attached). I noted the libmkl_sequential.so library there. Do you think
>   this is normal?
> - I found a similar problem reported here:
>   https://lists.mcs.anl.gov/pipermail/petsc-users/2016-March/028803.html
>   To solve this problem, the developers recommended replacing -lmkl_sequential
>   with -lmkl_intel_thread in PETSC_ARCH/lib/conf/petscvariables. However,
>   I did not find anything named like this (it might be a change of version).
> - Anyway, I replaced lmkl_sequential with lmkl_intel_thread in every file of
>   PETSc, but it changed nothing.
>
> As a result, in the new make.log (here attached) I have the parameter
> #define PETSC_HAVE_LIBMKL_SEQUENTIAL 1 and the option -lmkl_sequential
>
> Do you have any idea of what I should change in the initial options in
> order to obtain the BLAS multithreading parallelization?
>
> Thanks a lot for your help!
>
> Ivan
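As a quick check of which MKL layer a given petsc4py build really loads (the question raised by the ldd output above), a small Linux-only diagnostic along the following lines can be used; the file name which_mkl.py and the /proc-based approach are only a sketch, not part of PETSc:

    # which_mkl.py -- hypothetical helper, Linux only.
    # After petsc4py loads, list the MKL shared libraries actually mapped into
    # this process: libmkl_sequential.so means threaded BLAS is not in use,
    # libmkl_intel_thread.so means it is.
    import sys
    import petsc4py
    petsc4py.init(sys.argv)
    from petsc4py import PETSc  # noqa: F401  (forces the PETSc library to load)

    with open("/proc/self/maps") as f:
        mkl_libs = {line.split()[-1] for line in f if "mkl" in line}
    for lib in sorted(mkl_libs):
        print(lib)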
> On Fri, Nov 16, 2018 at 1:25 AM Dave May <[email protected]> wrote:
>
> > On Thu, 15 Nov 2018 at 17:44, Ivan via petsc-users <
> > [email protected]> wrote:
> >
> >> Hi Stefano,
> >>
> >> In fact, yes, we looked at the htop output (and the resulting computational
> >> time, of course).
> >>
> >> In our code we use MUMPS, which indeed depends on BLAS / LAPACK. So I
> >> think this might be it!
> >>
> >> I will definitely check it (I mean the difference between our MUMPS,
> >> BLAS, LAPACK).
> >>
> >> If you have an idea of how we can verify on his PC that the source of his
> >> parallelization does come from BLAS, please do not hesitate to tell me!
> >
> > Option 1/
> > * Set this environment variable
> >     export OMP_NUM_THREADS=1
> > * Re-run your "parallel" test.
> > * If the performance differs (the job runs slower) compared with your previous
> >   run, where you inferred parallelism was being employed, you can safely
> >   assume that the parallelism observed comes from threads.
> >
> > Option 2/
> > * Re-configure PETSc to use a known BLAS implementation which does not
> >   support threads.
> > * Re-compile PETSc.
> > * Re-run your parallel test.
> > * If the performance differs (the job runs slower) compared with your previous
> >   run, where you inferred parallelism was being employed, you can safely
> >   assume that the parallelism observed comes from threads.
> >
> > Option 3/
> > * Use a PC which does not depend on BLAS at all,
> >   e.g. -pc_type jacobi or -pc_type bjacobi.
> > * If the performance differs (the job runs slower) compared with your previous
> >   run, where you inferred parallelism was being employed, you can safely
> >   assume that the parallelism observed comes from BLAS + threads.
> >
> >> Thanks!
> >>
> >> Ivan
> >>
> >> On 15/11/2018 18:24, Stefano Zampini wrote:
> >>
> >> If you say your program is parallel just by looking at the output from
> >> the top command, you are probably linking against a multithreaded BLAS
> >> library.
> >>
> >> Il giorno Gio 15 Nov 2018, 20:09 Matthew Knepley via petsc-users <
> >> [email protected]> ha scritto:
> >>
> >>> On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk <
> >>> [email protected]> wrote:
> >>>
> >>>> Hi Matthew,
> >>>>
> >>>> Does it mean that by just using the command python3 simple_code.py (without
> >>>> mpiexec) you *cannot* obtain a parallel execution?
> >>>
> >>> As I wrote before, it's not impossible. You could be directly calling
> >>> PMI, but I do not think you are doing that.
> >>>
> >>>> It's been 5 days that my colleague and I have been trying to understand how
> >>>> he managed to do so.
> >>>> It means that by simply using python3 simple_code.py he gets 8
> >>>> processors working.
> >>>> By the way, we added a few lines to his code:
> >>>> rank = PETSc.COMM_WORLD.Get_rank()
> >>>> size = PETSc.COMM_WORLD.Get_size()
> >>>> and we got rank = 0, size = 1
> >>>
> >>> This is MPI telling you that you are only running on 1 process.
> >>>
> >>>> However, when the code arrives at KSP.solve(), somehow it turns on 8
> >>>> processors.
> >>>
> >>> Why do you think it's running on 8 processes?
> >>>
> >>>> This problem is solved on his PC in 5-8 sec (in parallel, using *python3
> >>>> simple_code.py*), on mine it takes 70-90 secs (sequentially, but with
> >>>> the same command *python3 simple_code.py*)
> >>>
> >>> I think it's much more likely that there are differences in the solver
> >>> (use -ksp_view to see exactly what solver was used) than
> >>> that it is parallelism. Moreover, you would never ever ever see that
> >>> much speedup on a laptop, since all these computations
> >>> are bandwidth limited.
> >>>
> >>> Thanks,
> >>>
> >>> Matt
> >>>
> >>>> So, the conclusion is that on his computer this code works in the same way
> >>>> as scipy: all the code is executed in sequential mode, but when it comes to
> >>>> the solution of the system of linear equations, it runs on all available
> >>>> processors. All this with just running python3 my_code.py (without any
> >>>> mpi-smth)
> >>>>
> >>>> Is it an exception / abnormal behavior? I mean, is it something
> >>>> irregular that you, the developers, have never seen?
> >>>>
> >>>> Thanks and have a good evening!
> >>>> Ivan
> >>>>
> >>>> P.S. I don't think I know the answer regarding Scipy...
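For reference, a minimal petsc4py sketch of the rank/size check discussed above (the file name mpi_check.py is a placeholder) makes the difference between MPI ranks and BLAS threads visible:

    # mpi_check.py -- hypothetical helper to show the MPI layout of a run.
    #   python3 mpi_check.py               -> reports size = 1 (no MPI parallelism)
    #   mpiexec -n 4 python3 mpi_check.py  -> reports size = 4, ranks 0..3
    import os
    from petsc4py import PETSc

    rank = PETSc.COMM_WORLD.Get_rank()
    size = PETSc.COMM_WORLD.Get_size()

    # Each rank prints one line; a busy htop with size = 1 points to BLAS
    # threads (controlled by OMP_NUM_THREADS), not to MPI.
    PETSc.Sys.syncPrint("rank %d of %d, OMP_NUM_THREADS=%s"
                        % (rank, size, os.environ.get("OMP_NUM_THREADS")))
    PETSc.Sys.syncFlush()

Combined with -ksp_view on the real solve, as suggested above, this usually settles where the apparent parallelism comes from.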
> >>>> On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Hi Matthew,
> >>>>>> Thanks for your reply!
> >>>>>>
> >>>>>> Let me clarify what I mean by asking a few questions:
> >>>>>>
> >>>>>> 1. In order to obtain a parallel execution of simple_code.py, do I
> >>>>>> need to go with mpiexec python3 simple_code.py, or can I just launch
> >>>>>> python3 simple_code.py?
> >>>>>
> >>>>> mpiexec -n 2 python3 simple_code.py
> >>>>>
> >>>>>> 2. This simple_code.py consists of 2 parts: a) preparation of the matrix
> >>>>>> b) solving the system of linear equations with PETSc. If I launch mpirun
> >>>>>> (or mpiexec) -np 8 python3 simple_code.py, I suppose that I will basically
> >>>>>> obtain 8 matrices and 8 systems to solve. However, I need to prepare only
> >>>>>> one matrix, but launch this code in parallel on 8 processors.
> >>>>>
> >>>>> When you create the Mat object, you give it a communicator (here
> >>>>> PETSC_COMM_WORLD). That allows us to distribute the data. This is all
> >>>>> covered extensively in the manual and the online tutorials, as well as the
> >>>>> example code.
> >>>>>
> >>>>>> In fact, here attached you will find a similar code (scipy_code.py)
> >>>>>> with only one difference: the system of linear equations is solved with
> >>>>>> scipy. So when I solve it, I can clearly see that the solution is obtained
> >>>>>> in a parallel way. However, I do not use the command mpirun (or mpiexec). I
> >>>>>> just go with python3 scipy_code.py.
> >>>>>
> >>>>> Why do you think it's running in parallel?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Matt
> >>>>>
> >>>>>> In this case, the first part (creation of the sparse matrix) is not
> >>>>>> parallel, whereas the solution of the system is found in a parallel way.
> >>>>>> So my question is: do you think that it is possible to have the same
> >>>>>> behavior with PETSc? And what do I need for this?
> >>>>>>
> >>>>>> I am asking this because for my colleague it worked! It means that he
> >>>>>> launches simple_code.py on his computer using the command python3
> >>>>>> simple_code.py (and not mpi-smth python3 simple_code.py) and he obtains a
> >>>>>> parallel execution of the same code.
> >>>>>>
> >>>>>> Thanks for your help!
> >>>>>> Ivan
> >>>>>>
> >>>>>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Dear PETSc community,
> >>>>>>>>
> >>>>>>>> I have a question regarding the parallel execution of petsc4py.
> >>>>>>>>
> >>>>>>>> I have a simple code (here attached simple_code.py) which solves a
> >>>>>>>> system of linear equations Ax=b using petsc4py. To execute it, I use the
> >>>>>>>> command python3 simple_code.py, which yields a sequential performance.
> >>>>>>>> With a colleague of mine, we launched this code on his computer, and this
> >>>>>>>> time the execution was in parallel. However, he used the same command
> >>>>>>>> python3 simple_code.py (without mpirun or mpiexec).
> >>>>>>>
> >>>>>>> I am not sure what you mean. To run MPI programs in parallel, you
> >>>>>>> need a launcher like mpiexec or mpirun. There are Python programs (like
> >>>>>>> nemesis) that use the launcher API directly (called PMI), but that is not
> >>>>>>> part of petsc4py.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Matt
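To illustrate the point about the communicator, here is a minimal petsc4py sketch (hypothetical file name, toy tridiagonal system in place of the real problem) in which one global matrix lives on PETSc.COMM_WORLD, so an mpiexec launch distributes the rows of a single system rather than building eight independent copies:

    # distributed_solve.py -- hypothetical sketch; run with, e.g.:
    #     mpiexec -n 8 python3 distributed_solve.py -ksp_view
    from petsc4py import PETSc

    n = 1000
    # One global n x n matrix on the world communicator; each rank owns a
    # contiguous block of rows.
    A = PETSc.Mat().createAIJ([n, n], nnz=(3, 1), comm=PETSc.COMM_WORLD)
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):          # assemble only the local rows
        A[i, i] = 2.0
        if i > 0:
            A[i, i - 1] = -1.0
        if i < n - 1:
            A[i, i + 1] = -1.0
    A.assemble()

    b = A.createVecLeft()
    b.set(1.0)
    x = A.createVecRight()

    ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
    ksp.setOperators(A)
    ksp.setFromOptions()   # honours -ksp_type, -pc_type, -ksp_view, ...
    ksp.solve(b, x)

    PETSc.Sys.Print("solved one %d x %d system on %d MPI ranks"
                    % (n, n, PETSc.COMM_WORLD.Get_size()))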
> >>>>>>>> My configuration: Ubuntu 16.04 x86_64, Intel Core i7, PETSc
> >>>>>>>> 3.10.2, PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in a virtualenv
> >>>>>>>>
> >>>>>>>> In order to parallelize it, I have already tried:
> >>>>>>>> - using 2 different PCs
> >>>>>>>> - using Ubuntu 16.04 and 18.04
> >>>>>>>> - using different architectures (arch-linux2-c-debug,
> >>>>>>>>   linux-gnu-c-debug, etc.)
> >>>>>>>> - of course, using different configurations (my present config can be
> >>>>>>>>   found in the make.log that I attached here)
> >>>>>>>> - MPI from mpich and openmpi
> >>>>>>>>
> >>>>>>>> Nothing worked.
> >>>>>>>>
> >>>>>>>> Do you have any ideas?
> >>>>>>>>
> >>>>>>>> Thanks and have a good day,
> >>>>>>>> Ivan
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Ivan VOZNYUK
> >>>>>>>> PhD in Computational Electromagnetics
> >>>>>>>
> >>>>>>> --
> >>>>>>> What most experimenters take for granted before they begin their
> >>>>>>> experiments is infinitely more interesting than any results to which their
> >>>>>>> experiments lead.
> >>>>>>> -- Norbert Wiener
> >>>>>>>
> >>>>>>> https://www.cse.buffalo.edu/~knepley/
