Yes, PETSc prefers sequential MKL, as MPI handles the parallelism. One way to trick petsc configure into using threaded MKL is to enable pardiso, e.g.:
--with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl

http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2018/11/15/configure_master_arch-pardiso_grind.log

BLAS/LAPACK: -Wl,-rpath,/soft/com/packages/intel/16/u3/mkl/lib/intel64
-L/soft/com/packages/intel/16/u3/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core
-lmkl_intel_thread -liomp5 -ldl -lpthread

Or you can manually specify the correct MKL library list [with threading] via
the --with-blaslapack-lib option.

Satish

On Fri, 16 Nov 2018, Ivan Voznyuk via petsc-users wrote:

> Hi,
> You were totally right: no miracle, the parallelization does come from
> multithreading. We checked Option 1/: playing with OMP_NUM_THREADS=1
> changed the computational time.
>
> So, I reinstalled everything (starting with Ubuntu and ending with PETSc) and
> configured the following things:
>
> - installed the system's openmpi
> - installed Intel MKL BLAS / LAPACK
> - configured PETSc as ./configure --with-cc=mpicc --with-fc=mpif90
>   --with-cxx=mpicxx --with-blas-lapack-dir=/opt/intel/mkl/lib/intel64
>   --download-scalapack --download-mumps --with-hwloc --with-shared
>   --with-openmp=1 --with-pthread=1 --with-scalar-type=complex
>   hoping that it would take BLAS multithreading into account
> - installed petsc4py
>
> However, I do not get any parallelization...
> What I have tried so far, unsuccessfully:
> - play with OMP_NUM_THREADS
> - reinstall the system
> - ldd PETSc.cpython-35m-x86_64-linux-gnu.so yields lld_result.txt (here
>   attached). I noted the libmkl_sequential.so library there. Do you think
>   this is normal?
> - I found a similar problem reported here:
>   https://lists.mcs.anl.gov/pipermail/petsc-users/2016-March/028803.html
>   To solve this problem, the developers recommended replacing -lmkl_sequential
>   with -lmkl_intel_thread in PETSC_ARCH/lib/conf/petscvariables. However,
>   I did not find anything named like this (it might be a change of version).
> - Anyway, I replaced lmkl_sequential with lmkl_intel_thread in every file of
>   PETSc, but it changed nothing.
>
> As a result, in the new make.log (here attached) I have the parameter
> #define PETSC_HAVE_LIBMKL_SEQUENTIAL 1 and the option -lmkl_sequential
>
> Do you have any idea of what I should change in the initial options in
> order to obtain the BLAS multithreading parallelization?
>
> Thanks a lot for your help!
>
> Ivan
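As a quick check of which MKL layer a given petsc4py build really loads (the question raised by the ldd output above), a small Linux-only diagnostic along the following lines can be used; the file name which_mkl.py and the /proc-based approach are only a sketch, not part of PETSc:

    # which_mkl.py -- hypothetical helper, Linux only.
    # After petsc4py loads, list the MKL shared libraries actually mapped into
    # this process: libmkl_sequential.so means threaded BLAS is not in use,
    # libmkl_intel_thread.so means it is.
    import sys
    import petsc4py
    petsc4py.init(sys.argv)
    from petsc4py import PETSc  # noqa: F401  (forces the PETSc library to load)

    with open("/proc/self/maps") as f:
        mkl_libs = {line.split()[-1] for line in f if "mkl" in line}
    for lib in sorted(mkl_libs):
        print(lib)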
> On Fri, Nov 16, 2018 at 1:25 AM Dave May <[email protected]> wrote:
>
> > On Thu, 15 Nov 2018 at 17:44, Ivan via petsc-users <
> > [email protected]> wrote:
> >
> >> Hi Stefano,
> >>
> >> In fact, yes, we looked at the htop output (and the resulting computational
> >> time, of course).
> >>
> >> In our code we use MUMPS, which indeed depends on BLAS / LAPACK. So I
> >> think this might be it!
> >>
> >> I will definitely check it (I mean the difference between our MUMPS,
> >> BLAS, LAPACK).
> >>
> >> If you have an idea of how we can verify on his PC that the source of his
> >> parallelization does come from BLAS, please do not hesitate to tell me!
> >
> > Option 1/
> > * Set this environment variable
> >     export OMP_NUM_THREADS=1
> > * Re-run your "parallel" test.
> > * If the performance differs (the job runs slower) compared with your previous
> >   run, where you inferred parallelism was being employed, you can safely
> >   assume that the parallelism observed comes from threads.
> >
> > Option 2/
> > * Re-configure PETSc to use a known BLAS implementation which does not
> >   support threads.
> > * Re-compile PETSc.
> > * Re-run your parallel test.
> > * If the performance differs (the job runs slower) compared with your previous
> >   run, where you inferred parallelism was being employed, you can safely
> >   assume that the parallelism observed comes from threads.
> >
> > Option 3/
> > * Use a PC which does not depend on BLAS at all,
> >   e.g. -pc_type jacobi or -pc_type bjacobi.
> > * If the performance differs (the job runs slower) compared with your previous
> >   run, where you inferred parallelism was being employed, you can safely
> >   assume that the parallelism observed comes from BLAS + threads.
> >
> >> Thanks!
> >>
> >> Ivan
> >>
> >> On 15/11/2018 18:24, Stefano Zampini wrote:
> >>
> >> If you say your program is parallel just by looking at the output from
> >> the top command, you are probably linking against a multithreaded BLAS
> >> library.
> >>
> >> Il giorno Gio 15 Nov 2018, 20:09 Matthew Knepley via petsc-users <
> >> [email protected]> ha scritto:
> >>
> >>> On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk <
> >>> [email protected]> wrote:
> >>>
> >>>> Hi Matthew,
> >>>>
> >>>> Does it mean that by just using the command python3 simple_code.py (without
> >>>> mpiexec) you *cannot* obtain a parallel execution?
> >>>
> >>> As I wrote before, it's not impossible. You could be directly calling
> >>> PMI, but I do not think you are doing that.
> >>>
> >>>> It's been 5 days that my colleague and I have been trying to understand how
> >>>> he managed to do so.
> >>>> It means that by simply using python3 simple_code.py he gets 8
> >>>> processors working.
> >>>> By the way, we added a few lines to his code:
> >>>> rank = PETSc.COMM_WORLD.Get_rank()
> >>>> size = PETSc.COMM_WORLD.Get_size()
> >>>> and we got rank = 0, size = 1
> >>>
> >>> This is MPI telling you that you are only running on 1 process.
> >>>
> >>>> However, when the code arrives at KSP.solve(), somehow it turns on 8
> >>>> processors.
> >>>
> >>> Why do you think it's running on 8 processes?
> >>>
> >>>> This problem is solved on his PC in 5-8 sec (in parallel, using *python3
> >>>> simple_code.py*), on mine it takes 70-90 secs (sequentially, but with
> >>>> the same command *python3 simple_code.py*)
> >>>
> >>> I think it's much more likely that there are differences in the solver
> >>> (use -ksp_view to see exactly what solver was used) than
> >>> that it is parallelism. Moreover, you would never ever ever see that
> >>> much speedup on a laptop, since all these computations
> >>> are bandwidth limited.
> >>>
> >>> Thanks,
> >>>
> >>> Matt
> >>>
> >>>> So, the conclusion is that on his computer this code works in the same way
> >>>> as scipy: all the code is executed in sequential mode, but when it comes to
> >>>> the solution of the system of linear equations, it runs on all available
> >>>> processors. All this with just running python3 my_code.py (without any
> >>>> mpi-smth)
> >>>>
> >>>> Is it an exception / abnormal behavior? I mean, is it something
> >>>> irregular that you, the developers, have never seen?
> >>>>
> >>>> Thanks and have a good evening!
> >>>> Ivan
> >>>>
> >>>> P.S. I don't think I know the answer regarding Scipy...
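For reference, a minimal petsc4py sketch of the rank/size check discussed above (the file name mpi_check.py is a placeholder) makes the difference between MPI ranks and BLAS threads visible:

    # mpi_check.py -- hypothetical helper to show the MPI layout of a run.
    #   python3 mpi_check.py               -> reports size = 1 (no MPI parallelism)
    #   mpiexec -n 4 python3 mpi_check.py  -> reports size = 4, ranks 0..3
    import os
    from petsc4py import PETSc

    rank = PETSc.COMM_WORLD.Get_rank()
    size = PETSc.COMM_WORLD.Get_size()

    # Each rank prints one line; a busy htop with size = 1 points to BLAS
    # threads (controlled by OMP_NUM_THREADS), not to MPI.
    PETSc.Sys.syncPrint("rank %d of %d, OMP_NUM_THREADS=%s"
                        % (rank, size, os.environ.get("OMP_NUM_THREADS")))
    PETSc.Sys.syncFlush()

Combined with -ksp_view on the real solve, as suggested above, this usually settles where the apparent parallelism comes from.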
> >>>> On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Hi Matthew,
> >>>>>> Thanks for your reply!
> >>>>>>
> >>>>>> Let me clarify what I mean by asking a few questions:
> >>>>>>
> >>>>>> 1. In order to obtain a parallel execution of simple_code.py, do I
> >>>>>> need to go with mpiexec python3 simple_code.py, or can I just launch
> >>>>>> python3 simple_code.py?
> >>>>>
> >>>>> mpiexec -n 2 python3 simple_code.py
> >>>>>
> >>>>>> 2. This simple_code.py consists of 2 parts: a) preparation of the matrix
> >>>>>> b) solving the system of linear equations with PETSc. If I launch mpirun
> >>>>>> (or mpiexec) -np 8 python3 simple_code.py, I suppose that I will basically
> >>>>>> obtain 8 matrices and 8 systems to solve. However, I need to prepare only
> >>>>>> one matrix, but launch this code in parallel on 8 processors.
> >>>>>
> >>>>> When you create the Mat object, you give it a communicator (here
> >>>>> PETSC_COMM_WORLD). That allows us to distribute the data. This is all
> >>>>> covered extensively in the manual and the online tutorials, as well as the
> >>>>> example code.
> >>>>>
> >>>>>> In fact, here attached you will find a similar code (scipy_code.py)
> >>>>>> with only one difference: the system of linear equations is solved with
> >>>>>> scipy. So when I solve it, I can clearly see that the solution is obtained
> >>>>>> in a parallel way. However, I do not use the command mpirun (or mpiexec). I
> >>>>>> just go with python3 scipy_code.py.
> >>>>>
> >>>>> Why do you think it's running in parallel?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Matt
> >>>>>
> >>>>>> In this case, the first part (creation of the sparse matrix) is not
> >>>>>> parallel, whereas the solution of the system is found in a parallel way.
> >>>>>> So my question is: do you think that it is possible to have the same
> >>>>>> behavior with PETSc? And what do I need for this?
> >>>>>>
> >>>>>> I am asking this because for my colleague it worked! It means that he
> >>>>>> launches simple_code.py on his computer using the command python3
> >>>>>> simple_code.py (and not mpi-smth python3 simple_code.py) and he obtains a
> >>>>>> parallel execution of the same code.
> >>>>>>
> >>>>>> Thanks for your help!
> >>>>>> Ivan
> >>>>>>
> >>>>>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Dear PETSc community,
> >>>>>>>>
> >>>>>>>> I have a question regarding the parallel execution of petsc4py.
> >>>>>>>>
> >>>>>>>> I have a simple code (here attached simple_code.py) which solves a
> >>>>>>>> system of linear equations Ax=b using petsc4py. To execute it, I use the
> >>>>>>>> command python3 simple_code.py, which yields a sequential performance.
> >>>>>>>> With a colleague of mine, we launched this code on his computer, and this
> >>>>>>>> time the execution was in parallel. However, he used the same command
> >>>>>>>> python3 simple_code.py (without mpirun or mpiexec).
> >>>>>>>
> >>>>>>> I am not sure what you mean. To run MPI programs in parallel, you
> >>>>>>> need a launcher like mpiexec or mpirun. There are Python programs (like
> >>>>>>> nemesis) that use the launcher API directly (called PMI), but that is not
> >>>>>>> part of petsc4py.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Matt
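To illustrate the point about the communicator, here is a minimal petsc4py sketch (hypothetical file name, toy tridiagonal system in place of the real problem) in which one global matrix lives on PETSc.COMM_WORLD, so an mpiexec launch distributes the rows of a single system rather than building eight independent copies:

    # distributed_solve.py -- hypothetical sketch; run with, e.g.:
    #     mpiexec -n 8 python3 distributed_solve.py -ksp_view
    from petsc4py import PETSc

    n = 1000
    # One global n x n matrix on the world communicator; each rank owns a
    # contiguous block of rows.
    A = PETSc.Mat().createAIJ([n, n], nnz=(3, 1), comm=PETSc.COMM_WORLD)
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):          # assemble only the local rows
        A[i, i] = 2.0
        if i > 0:
            A[i, i - 1] = -1.0
        if i < n - 1:
            A[i, i + 1] = -1.0
    A.assemble()

    b = A.createVecLeft()
    b.set(1.0)
    x = A.createVecRight()

    ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
    ksp.setOperators(A)
    ksp.setFromOptions()   # honours -ksp_type, -pc_type, -ksp_view, ...
    ksp.solve(b, x)

    PETSc.Sys.Print("solved one %d x %d system on %d MPI ranks"
                    % (n, n, PETSc.COMM_WORLD.Get_size()))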
> >>>>>>>> My configuration: Ubuntu 16.04 x86_64, Intel Core i7, PETSc
> >>>>>>>> 3.10.2, PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in a virtualenv
> >>>>>>>>
> >>>>>>>> In order to parallelize it, I have already tried:
> >>>>>>>> - using 2 different PCs
> >>>>>>>> - using Ubuntu 16.04 and 18.04
> >>>>>>>> - using different architectures (arch-linux2-c-debug,
> >>>>>>>>   linux-gnu-c-debug, etc.)
> >>>>>>>> - of course, using different configurations (my present config can be
> >>>>>>>>   found in the make.log that I attached here)
> >>>>>>>> - MPI from mpich and openmpi
> >>>>>>>>
> >>>>>>>> Nothing worked.
> >>>>>>>>
> >>>>>>>> Do you have any ideas?
> >>>>>>>>
> >>>>>>>> Thanks and have a good day,
> >>>>>>>> Ivan
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Ivan VOZNYUK
> >>>>>>>> PhD in Computational Electromagnetics
> >>>>>>>
> >>>>>>> --
> >>>>>>> What most experimenters take for granted before they begin their
> >>>>>>> experiments is infinitely more interesting than any results to which their
> >>>>>>> experiments lead.
> >>>>>>> -- Norbert Wiener
> >>>>>>>
> >>>>>>> https://www.cse.buffalo.edu/~knepley/
