This build is likely using GNU compilers. https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

The above suggests the following might work:

'--with-blaslapack-lib=-L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl'
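So the full configure invocation would look something like this (untested, just a sketch; keep your other options and adjust the MKL path to your install):

./configure --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --with-scalar-type=complex --download-scalapack --download-mumps --with-openmp=1 '--with-blaslapack-lib=-L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl'

Afterwards, 'ldd $PETSC_DIR/$PETSC_ARCH/lib/libpetsc.so | grep mkl' should list mkl_gnu_thread rather than mkl_sequential.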
Satish

On Fri, 16 Nov 2018, Balay, Satish via petsc-users wrote:

> On Fri, 16 Nov 2018, Ivan Voznyuk via petsc-users wrote:
>
> > Hi Satish,
> > Thanks for your reply.
> >
> > Bad news... I tested the 2 solutions that you proposed; neither worked.
> >
> > 1. --with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl installed well, without any problems. However, the code still runs sequentially.
>
> Are you using Intel compilers?
>
> Please send configure.log for this.
>
> Satish
>
> > 2. When I changed -lmkl_sequential to -lmkl_intel_thread -liomp5, it at first did not find libiomp5, so I had to create a symbolic link of libiomp5.so to /lib.
> > At the launching of the .py code I had to go with:
> > export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_core.so:/opt/intel/mkl/lib/intel64/libmkl_sequential.so
> > and
> > export LD_LIBRARY_PATH=/opt/petsc/petsc1/arch-linux2-c-debug/lib/
> >
> > But still it does not solve the given problem, and the code is still running sequentially...
> >
> > Maybe you have some other ideas?
> >
> > Thanks,
> > Ivan
> >
> > On Fri, Nov 16, 2018 at 6:11 PM Balay, Satish <[email protected]> wrote:
> >
> > > Yes, PETSc prefers sequential MKL, as MPI handles the parallelism.
> > >
> > > One way to trick petsc configure into using threaded MKL is to enable pardiso, i.e.:
> > >
> > > --with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl
> > >
> > > http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2018/11/15/configure_master_arch-pardiso_grind.log
> > >
> > > BLAS/LAPACK: -Wl,-rpath,/soft/com/packages/intel/16/u3/mkl/lib/intel64 -L/soft/com/packages/intel/16/u3/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -ldl -lpthread
> > >
> > > Or you can manually specify the correct MKL library list [with threading] via the --with-blaslapack-lib option.
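> > > To double-check at runtime which MKL layer petsc4py actually loads, something like this works on Linux (a sketch; the script name is made up):
> > >
> > > # check_mkl.py: list the MKL shared objects mapped into this process
> > > from petsc4py import PETSc  # importing loads libpetsc and its BLAS/LAPACK
> > >
> > > with open("/proc/self/maps") as f:
> > >     libs = {line.split()[-1] for line in f if "mkl" in line}
> > > for lib in sorted(libs):
> > >     print(lib)
> > >
> > > If libmkl_sequential.so shows up here, the build is not using threaded MKL, whatever the link line claimed.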
> > > Satish
> > >
> > > On Fri, 16 Nov 2018, Ivan Voznyuk via petsc-users wrote:
> > >
> > > > Hi,
> > > > You were totally right: no miracle, the parallelization does come from multithreading. We checked Option 1/: playing with OMP_NUM_THREADS=1 changed the computational time.
> > > >
> > > > So, I reinstalled everything (starting with Ubuntu, ending with petsc) and configured the following things:
> > > >
> > > > - installed the system's openmpi
> > > > - installed Intel MKL BLAS / LAPACK
> > > > - configured PETSc as ./configure --with-cc=mpicc --with-fc=mpif90 --with-cxx=mpicxx --with-blas-lapack-dir=/opt/intel/mkl/lib/intel64 --download-scalapack --download-mumps --with-hwloc --with-shared --with-openmp=1 --with-pthread=1 --with-scalar-type=complex, hoping that it would take blas multithreading into account
> > > > - installed petsc4py
> > > >
> > > > However, I do not get any parallelization...
> > > > What I have tried so far, unsuccessfully:
> > > > - play with OMP_NUM_THREADS
> > > > - reinstall the system
> > > > - ldd PETSc.cpython-35m-x86_64-linux-gnu.so yields lld_result.txt (attached here). I noted the libmkl_sequential.so library there. Do you think this is normal?
> > > > - I found a similar problem reported here: https://lists.mcs.anl.gov/pipermail/petsc-users/2016-March/028803.html To solve it, the developers recommended replacing -lmkl_sequential with -lmkl_intel_thread in PETSC_ARCH/lib/conf/petscvariables. However, I did not find anything named like this (it might be a change of version).
> > > > - Anyway, I replaced lmkl_sequential with lmkl_intel_thread in every file of PETSc, but it changed nothing.
> > > >
> > > > As a result, in the new make.log (attached here) I have the parameter #define PETSC_HAVE_LIBMKL_SEQUENTIAL 1 and the option -lmkl_sequential.
> > > >
> > > > Do you have any idea of what I should change in the initial options in order to obtain the blas multithreading parallelization?
> > > >
> > > > Thanks a lot for your help!
> > > >
> > > > Ivan
> > > >
> > > > On Fri, Nov 16, 2018 at 1:25 AM Dave May <[email protected]> wrote:
> > > >
> > > > > On Thu, 15 Nov 2018 at 17:44, Ivan via petsc-users <[email protected]> wrote:
> > > > >
> > > > >> Hi Stefano,
> > > > >>
> > > > >> In fact, yes, we looked at the htop output (and the resulting computational time, of course).
> > > > >>
> > > > >> In our code we use MUMPS, which indeed depends on blas / lapack. So I think this might be it!
> > > > >>
> > > > >> I will definitely check it (I mean the difference between our MUMPS, blas, and lapack).
> > > > >>
> > > > >> If you have an idea of how we can verify on his PC that the source of his parallelization does come from BLAS, please do not hesitate to tell me!
> > > > >
> > > > > Option 1/
> > > > > * Set this environment variable: export OMP_NUM_THREADS=1
> > > > > * Re-run your "parallel" test (a timing sketch for this is below).
> > > > > * If the performance differs (the job runs slower) compared with your previous run, where you inferred parallelism was being employed, you can safely assume that the parallelism observed comes from threads.
> > > > >
> > > > > Option 2/
> > > > > * Re-configure PETSc to use a known BLAS implementation which does not support threads.
> > > > > * Re-compile PETSc.
> > > > > * Re-run your parallel test.
> > > > > * If the performance differs (the job runs slower) compared with your previous run, where you inferred parallelism was being employed, you can safely assume that the parallelism observed comes from threads.
> > > > >
> > > > > Option 3/
> > > > > * Use a PC which does not depend on BLAS at all, e.g. -pc_type jacobi or -pc_type bjacobi.
> > > > > * If the performance differs (the job runs slower) compared with your previous run, where you inferred parallelism was being employed, you can safely assume that the parallelism observed comes from BLAS + threads.
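> > > > > For Option 1/, a self-contained timing script could look like this (a sketch, untested; the file name timing_test.py is made up). A dense LU factorization goes straight to BLAS/LAPACK, so it exposes whether the linked MKL actually uses threads:
> > > > >
> > > > > import os, time
> > > > > import numpy as np
> > > > > from petsc4py import PETSc
> > > > >
> > > > > n = 2000
> > > > > # Diagonally dominant dense matrix, so the LU factorization is stable.
> > > > > a = (np.random.rand(n, n) + n * np.eye(n)).astype(PETSc.ScalarType)
> > > > > A = PETSc.Mat().createDense([n, n], array=a)
> > > > > b = A.createVecLeft(); b.set(1.0)
> > > > > x = A.createVecRight()
> > > > >
> > > > > ksp = PETSc.KSP().create()
> > > > > ksp.setOperators(A)
> > > > > ksp.setType(PETSc.KSP.Type.PREONLY)   # direct solve: factor + back-substitution
> > > > > ksp.getPC().setType(PETSc.PC.Type.LU)
> > > > > t0 = time.time()
> > > > > ksp.solve(b, x)
> > > > > print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS", "unset"),
> > > > >       "solve time: %.2f s" % (time.time() - t0))
> > > > >
> > > > > Compare OMP_NUM_THREADS=1 python3 timing_test.py against OMP_NUM_THREADS=8 python3 timing_test.py; if the timings are essentially identical, your BLAS is not threading.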
> > > > >> Thanks!
> > > > >>
> > > > >> Ivan
> > > > >>
> > > > >> On 15/11/2018 18:24, Stefano Zampini wrote:
> > > > >>
> > > > >> If you say your program is parallel just by looking at the output of the top command, you are probably linking against a multithreaded blas library.
> > > > >>
> > > > >> On Thu, 15 Nov 2018 at 20:09, Matthew Knepley via petsc-users <[email protected]> wrote:
> > > > >>
> > > > >>> On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk <[email protected]> wrote:
> > > > >>>
> > > > >>>> Hi Matthew,
> > > > >>>>
> > > > >>>> Does it mean that by using just the command python3 simple_code.py (without mpiexec) you *cannot* obtain a parallel execution?
> > > > >>>
> > > > >>> As I wrote before, it's not impossible. You could be directly calling PMI, but I do not think you are doing that.
> > > > >>>
> > > > >>>> It's been 5 days that my colleague and I have been trying to understand how he managed to do so.
> > > > >>>> It means that by simply using python3 simple_code.py he gets 8 processors working.
> > > > >>>> By the way, we wrote a few lines in his code:
> > > > >>>> rank = PETSc.COMM_WORLD.Get_rank()
> > > > >>>> size = PETSc.COMM_WORLD.Get_size()
> > > > >>>> and we got rank = 0, size = 1
> > > > >>>
> > > > >>> This is MPI telling you that you are running on only 1 process.
> > > > >>>
> > > > >>>> However, when the code arrives at KSP.solve(), somehow it turns on 8 processors.
> > > > >>>
> > > > >>> Why do you think it's running on 8 processes?
> > > > >>>
> > > > >>>> This problem is solved on his PC in 5-8 sec (in parallel, using *python3 simple_code.py*); on mine it takes 70-90 secs (sequentially, but with the same command *python3 simple_code.py*).
> > > > >>>
> > > > >>> I think it's much more likely that there are differences in the solver (use -ksp_view to see exactly what solver was used) than that it is parallelism. Moreover, you would never ever ever see that much speedup on a laptop, since all these computations are bandwidth limited.
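> > > > >>> For the option to reach PETSc from Python, initialize petsc4py with the command line before importing PETSc — a minimal sketch:
> > > > >>>
> > > > >>> import sys
> > > > >>> import petsc4py
> > > > >>> petsc4py.init(sys.argv)  # forwards -ksp_view etc. to PETSc
> > > > >>> from petsc4py import PETSc
> > > > >>>
> > > > >>> Then ksp.setFromOptions() picks the options up, and you can run python3 simple_code.py -ksp_view on both machines and compare the output.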
> > > > >>> Thanks,
> > > > >>>
> > > > >>> Matt
> > > > >>>
> > > > >>>> So, the conclusion is that on his computer this code works in the same way as scipy: all the code is executed sequentially, but when it comes to the solution of the system of linear equations, it runs on all available processors. All this with just running python3 my_code.py (without any mpi-smth).
> > > > >>>>
> > > > >>>> Is it an exception / abnormal behavior? I mean, is it something irregular that you, the developers, have never seen?
> > > > >>>>
> > > > >>>> Thanks and have a good evening!
> > > > >>>> Ivan
> > > > >>>>
> > > > >>>> P.S. I don't think I know the answer regarding Scipy...
> > > > >>>>
> > > > >>>> On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley <[email protected]> wrote:
> > > > >>>>
> > > > >>>>> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Matthew,
> > > > >>>>>> Thanks for your reply!
> > > > >>>>>>
> > > > >>>>>> Let me clarify what I mean with a few questions:
> > > > >>>>>>
> > > > >>>>>> 1. In order to obtain a parallel execution of simple_code.py, do I need to go with mpiexec python3 simple_code.py, or can I just launch python3 simple_code.py?
> > > > >>>>>
> > > > >>>>> mpiexec -n 2 python3 simple_code.py
> > > > >>>>>
> > > > >>>>>> 2. This simple_code.py consists of 2 parts: a) preparation of the matrix, and b) solving the system of linear equations with PETSc. If I launch mpirun (or mpiexec) -np 8 python3 simple_code.py, I suppose that I will basically obtain 8 matrices and 8 systems to solve. However, I need to prepare only one matrix, but launch this code in parallel on 8 processors.
> > > > >>>>>
> > > > >>>>> When you create the Mat object, you give it a communicator (here PETSC_COMM_WORLD). That allows us to distribute the data. This is all covered extensively in the manual and the online tutorials, as well as the example code.
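> > > > >>>>> A minimal sketch of the pattern (a toy tridiagonal system for illustration, not your attached code; each rank fills only the rows it owns, so there is one distributed matrix rather than 8 copies):
> > > > >>>>>
> > > > >>>>> from petsc4py import PETSc
> > > > >>>>>
> > > > >>>>> n = 100  # global problem size
> > > > >>>>> A = PETSc.Mat().create(comm=PETSc.COMM_WORLD)
> > > > >>>>> A.setSizes([n, n])
> > > > >>>>> A.setType(PETSc.Mat.Type.AIJ)
> > > > >>>>> A.setPreallocationNNZ(3)  # tridiagonal: at most 3 nonzeros per row
> > > > >>>>> rstart, rend = A.getOwnershipRange()  # this rank's slice of rows
> > > > >>>>> for i in range(rstart, rend):
> > > > >>>>>     if i > 0:
> > > > >>>>>         A.setValue(i, i - 1, -1.0)
> > > > >>>>>     A.setValue(i, i, 2.0)
> > > > >>>>>     if i < n - 1:
> > > > >>>>>         A.setValue(i, i + 1, -1.0)
> > > > >>>>> A.assemble()
> > > > >>>>>
> > > > >>>>> b = A.createVecLeft(); b.set(1.0)
> > > > >>>>> x = A.createVecRight()
> > > > >>>>>
> > > > >>>>> ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
> > > > >>>>> ksp.setOperators(A)
> > > > >>>>> ksp.setFromOptions()
> > > > >>>>> ksp.solve(b, x)
> > > > >>>>> PETSc.Sys.Print("solved on", PETSc.COMM_WORLD.getSize(), "ranks")
> > > > >>>>>
> > > > >>>>> Run it with mpiexec -n 8 and the rows are split across 8 processes; run it with plain python3 and it degenerates to your sequential case.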
> > > > >>>>>> In fact, attached here you will find a similar code (scipy_code.py) with only one difference: the system of linear equations is solved with scipy. So when I solve it, I can clearly see that the solution is obtained in a parallel way. However, I do not use the command mpirun (or mpiexec). I just go with python3 scipy_code.py.
> > > > >>>>>
> > > > >>>>> Why do you think it's running in parallel?
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Matt
> > > > >>>>>
> > > > >>>>>> In this case, the first part (creation of the sparse matrix) is not parallel, whereas the solution of the system is found in a parallel way.
> > > > >>>>>> So my question is: do you think that it is possible to have the same behavior with PETSc? And what do I need for this?
> > > > >>>>>>
> > > > >>>>>> I am asking this because for my colleague it worked! It means that he launches simple_code.py on his computer using the command python3 simple_code.py (and not mpi-smth python3 simple_code.py) and he obtains a parallel execution of the same code.
> > > > >>>>>>
> > > > >>>>>> Thanks for your help!
> > > > >>>>>> Ivan
> > > > >>>>>>
> > > > >>>>>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley <[email protected]> wrote:
> > > > >>>>>>
> > > > >>>>>>> On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Dear PETSc community,
> > > > >>>>>>>>
> > > > >>>>>>>> I have a question regarding the parallel execution of petsc4py.
> > > > >>>>>>>>
> > > > >>>>>>>> I have a simple code (attached here, simple_code.py) which solves a system of linear equations Ax=b using petsc4py. To execute it, I use the command python3 simple_code.py, which yields sequential performance. With a colleague of mine, we launched this code on his computer, and this time the execution was in parallel, although he used the same command python3 simple_code.py (without mpirun or mpiexec).
> > > > >>>>>>>
> > > > >>>>>>> I am not sure what you mean. To run MPI programs in parallel, you need a launcher like mpiexec or mpirun. There are Python programs (like nemesis) that use the launcher API directly (called PMI), but that is not part of petsc4py.
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> Matt
> > > > >>>>>>>
> > > > >>>>>>>> My configuration: x86_64 Ubuntu 16.04, Intel Core i7, PETSc 3.10.2, PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in a virtualenv
> > > > >>>>>>>>
> > > > >>>>>>>> In order to parallelize it, I have already tried:
> > > > >>>>>>>> - using 2 different PCs
> > > > >>>>>>>> - using Ubuntu 16.04 and 18.04
> > > > >>>>>>>> - using different architectures (arch-linux2-c-debug, linux-gnu-c-debug, etc.)
> > > > >>>>>>>> - of course, using different configurations (my present config can be found in the make.log that I attached here)
> > > > >>>>>>>> - mpi from mpich and openmpi
> > > > >>>>>>>>
> > > > >>>>>>>> Nothing worked.
> > > > >>>>>>>>
> > > > >>>>>>>> Do you have any ideas?
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks and have a good day,
> > > > >>>>>>>> Ivan
> > > > >>>>>>>>
> > > > >>>>>>>> --
> > > > >>>>>>>> Ivan VOZNYUK
> > > > >>>>>>>> PhD in Computational Electromagnetics
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> > > > >>>>>>> -- Norbert Wiener
> > > > >>>>>>>
> > > > >>>>>>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
> > > > >>>>>>
> > > > >>>>>> --
> > > > >>>>>> Ivan VOZNYUK
> > > > >>>>>> PhD in Computational Electromagnetics
> > > > >>>>>> +33 (0)6.95.87.04.55
> > > > >>>>>> My webpage <https://ivanvoznyukwork.wixsite.com/webpage>
> > > > >>>>>> My LinkedIn <http://linkedin.com/in/ivan-voznyuk-b869b8106>
> > > > >>>>>
> > > > >>>>> --
> > > > >>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> > > > >>>>> -- Norbert Wiener
> > > > >>>>>
> > > > >>>>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
> > > > >>>>
> > > > >>>> --
> > > > >>>> Ivan VOZNYUK
> > > > >>>> PhD in Computational Electromagnetics
> > > > >>>> +33 (0)6.95.87.04.55
> > > > >>>> My webpage <https://ivanvoznyukwork.wixsite.com/webpage>
> > > > >>>> My LinkedIn <http://linkedin.com/in/ivan-voznyuk-b869b8106>
> > > > >>>
> > > > >>> --
> > > > >>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> > > > >>> -- Norbert Wiener
> > > > >>>
> > > > >>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
