Barry,

I understand what you are saying, but which example/options would then be best for measuring scalability on a multi-core machine? I chose the nonlinear diffusion problem specifically because its inherent stiffness seemed likely to expose noticeable scalability on a multi-core system. From your experience, is there another example program that would demonstrate this more rigorously or clearly?

Btw, I don't get good speedup even for 2 processes with ex20.c, and that was the original motivation for this thread.
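For reference, the comparison I am making is just the following (these are the same runs shown further down in the thread; only the process count changes, and -grid comes from the small patch I sent earlier):

  /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
  /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary

and then I compare the "Time (sec)" line and the per-event timings between the two summaries.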
Satish,

I configured with --download-mpich now, without the mpich-device. The results are given above. I will try the options you provided, although I don't entirely understand what they mean, which kinda bugs me. Also, is OpenMPI the preferred MPI implementation on Ubuntu?

Vijay

On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>    Ok, everything makes sense. Looks like you are using two-level multigrid
> (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
> -mg_coarse_redundant_pc_type lu. This means the coarse grid problem is
> solved redundantly: each process performs the entire coarse-grid solve
> using LU factorization. The time for the factorization (in the two-process
> case) is
>
> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0 74 82  0  0  0  1307
> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1  0  0  0  0  2     0
>
> which is 74 percent of the total solve time (and 82 percent of the flops).
> When 3/4 of the entire run is not parallel at all, you cannot expect much
> speedup. If you run with -snes_view it will display exactly the solver
> being used; you cannot expect to understand the performance if you don't
> understand what the solver is actually doing. Using a 20 by 20 by 20 coarse
> grid is generally a bad idea since the code spends most of its time there;
> stick with something like 5 by 5 by 5.
>
>    I suggest running with the default grid and -dmmg_nlevels 5; then the
> coarse solve will take a trivial percentage of the run time.
>
>    You should get pretty good speedup for 2 processes, but not much better
> speedup for 4 processes, because, as Matt noted, the computation is memory
> bandwidth limited; see
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
> Note also that this is running multigrid, which is a fast solver but does
> not parallel scale as well as many slow algorithms. For example, if you run
> with -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
> processes but crummy overall speed.
>
>    Barry
>
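If I am reading this right, the suggested comparison would be run roughly as follows (the option names are copied from Barry's message above; I may have the exact combination slightly off):

  mpiexec -n 1 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary
  mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary
  mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -pc_type jacobi -log_summary

with -snes_view used to confirm which solver is actually being run, and the jacobi case only as a scaling (not absolute speed) comparison.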
> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>
>> Barry,
>>
>> Please find attached the patch for the minor change to control the number
>> of elements from the command line for snes/ex20.c. I know this can be
>> achieved with -grid_x etc. from the command line, but I thought this made
>> the typing for the refinement process a little easier. I apologize if
>> there was any confusion.
>>
>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>>
>> Vijay
>>
>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>   We need all the information from -log_summary to see what is going on.
>>>
>>>   Not sure what -grid 20 means, but don't expect any good parallel
>>> performance with fewer than about 10,000 unknowns per process.
>>>
>>>   Barry
>>>
>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>
>>>> Here are the performance statistics for the 1- and 2-process runs.
>>>>
>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>
>>>>                          Max       Max/Min        Avg      Total
>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>> MPI Reductions:       4.440e+02      1.00000
>>>>
>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>
>>>>                          Max       Max/Min        Avg      Total
>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>> MPI Reductions:       1.046e+03      1.00000
>>>>
>>>> I am not entirely sure I can make sense of these statistics, but if
>>>> there is something more you need, please feel free to let me know.
>>>>
>>>> Vijay
>>>>
>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>
>>>>>> Matt,
>>>>>>
>>>>>> The --with-debugging=1 option is certainly not meant for performance
>>>>>> studies, but I didn't expect it to yield the same CPU time as a single
>>>>>> process for snes/ex20; i.e., my runs with 1 and 2 processes take
>>>>>> approximately the same amount of time to compute the solution. I am
>>>>>> currently configuring without debugging symbols and shall let you know
>>>>>> what that yields.
>>>>>>
>>>>>> On a similar note, is there something extra that needs to be done to
>>>>>> make use of multi-core machines with MPI? I am not sure whether this
>>>>>> is even related to PETSc; it could be an MPI configuration option that
>>>>>> either I or the configure process is missing. All ideas are much
>>>>>> appreciated.
>>>>>
>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On
>>>>> most cheap multicore machines there is a single memory bus, so using
>>>>> more cores gains you very little extra performance. I still suspect you
>>>>> are not actually running in parallel, because even then you usually see
>>>>> a small speedup. That is why I suggested looking at -log_summary: it
>>>>> tells you how many processes were run and breaks down the time.
>>>>>
>>>>>    Matt
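As a sanity check on the "actually running in parallel" point: if I remember the -log_summary output correctly, its header reports the number of processes, so something along these lines (same executable and options as my runs above)

  /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary | grep processors

should report 2 processors if the ranks really are separate processes. The nonzero MPI message counts in the 2-process summary above also suggest the run is genuinely parallel.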
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to configure my petsc install with an MPI installation
>>>>>>>> to make use of a dual quad-core desktop system running Ubuntu. But
>>>>>>>> even though the configure/make process went through without
>>>>>>>> problems, the scalability of the programs doesn't seem to reflect
>>>>>>>> what I expected. My configure options are
>>>>>>>>
>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>
>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>
>>>>>>>    Matt
>>>>>>>
>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>> programs with mpiexec (-n 1) and (-n 2), but they seem to take
>>>>>>>> approximately the same time, as noted from -log_summary. If it
>>>>>>>> helps, I've been testing with snes/examples/tutorials/ex20.c for all
>>>>>>>> purposes, with a custom -grid parameter from the command line to
>>>>>>>> control the number of unknowns.
>>>>>>>>
>>>>>>>> If there is something you've witnessed before in this configuration,
>>>>>>>> or if you need anything else to analyze the problem, do let me know.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vijay
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>
>> <ex20.patch><ex20_np1.out><ex20_np2.out>
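P.S. The optimized rebuild I mentioned, following Matt's --with-debugging=0 suggestion, is configured roughly as follows. This is an abbreviated version of my original option list with debugging turned off; the -O3 optimization flags are my own guess and may need adjusting:

  ./configure --download-f-blas-lapack=1 --download-mpich=1 \
      --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1 \
      --download-blacs=1 --download-scalapack=1 --with-clanguage=C++ \
      --download-plapack=1 --download-mumps=1 --download-umfpack=yes \
      --with-debugging=0 --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3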
