On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:

> Barry,
>
> I understand what you are saying, but which example/options then are the
> best ones for measuring scalability on a multi-core machine? I chose the
> nonlinear diffusion problem specifically because of its inherent stiffness,
> which could probably provide noticeable scalability on a multi-core system.
> From your experience, do you think there is another example program that
> would demonstrate this more rigorously or clearly? Btw, I don't get good
> speedup even for 2 processes with ex20.c, and that was the original
> motivation for this thread.
   Did you follow my instructions?

   Barry

> Satish, I configured with --download-mpich now, without the
> mpich-device. The results are given above. I will try with the options
> you provided, although I don't entirely understand what they mean, which
> kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
>
> Vijay
>
> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>> Ok, everything makes sense. It looks like you are using two-level multigrid
>> (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>> -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
>> problem redundantly on each process (each process does the entire coarse
>> grid solve using LU factorization). The time for the factorization (in the
>> two-process case) is
>>
>> MatLUFactorNum    14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>> MatILUFactorSym    7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>
>> which is 74 percent of the total solve time (and 82 percent of the flops).
>> When 3/4 of the entire run is not parallel at all, you cannot expect much
>> speedup. If you run with -snes_view it will display exactly the solver
>> being used. You cannot expect to understand the performance if you don't
>> understand what the solver is actually doing. Using a 20 by 20 by 20 coarse
>> grid is generally a bad idea since the code spends most of its time there;
>> stick with something like 5 by 5 by 5.
>>
>> I suggest running with the default grid and -dmmg_nlevels 5; then the time
>> spent in the coarse solve will be a trivial percentage of the run time.
>>
>> You should get pretty good speedup for 2 processes, but not much better
>> speedup for four processes because, as Matt noted, the computation is memory
>> bandwidth limited; see
>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note
>> also that this is running multigrid, which is a fast solver but does not
>> scale in parallel as well as many slower algorithms. For example, if you run
>> with -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
>> processes but crummy overall speed.
>>
>> Barry
>>
>>
>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>
>>> Barry,
>>>
>>> Please find attached the patch for the minor change to control the
>>> number of elements from the command line for snes/ex20.c. I know this
>>> can be achieved with -grid_x etc. from the command line, but I thought
>>> this just made the typing for the refinement process a little easier. I
>>> apologize if there was any confusion.
>>>
>>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>>>
>>> Vijay
>>>
>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>> We need all the information from -log_summary to see what is going on.
>>>>
>>>> Not sure what -grid 20 means, but don't expect any good parallel
>>>> performance with less than at least 10,000 unknowns per process.
>>>>
>>>> Barry
>>>>
>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>
>>>>> Here are the performance statistics for the 1- and 2-processor runs.
>>>>>
>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>>
>>>>>                           Max       Max/Min     Avg        Total
>>>>> Time (sec):           8.452e+00    1.00000   8.452e+00
>>>>> Objects:              1.470e+02    1.00000   1.470e+02
>>>>> Flops:                5.045e+09    1.00000   5.045e+09  5.045e+09
>>>>> Flops/sec:            5.969e+08    1.00000   5.969e+08  5.969e+08
>>>>> MPI Messages:         0.000e+00    0.00000   0.000e+00  0.000e+00
>>>>> MPI Message Lengths:  0.000e+00    0.00000   0.000e+00  0.000e+00
>>>>> MPI Reductions:       4.440e+02    1.00000
>>>>>
>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>>
>>>>>                           Max       Max/Min     Avg        Total
>>>>> Time (sec):           7.851e+00    1.00000   7.851e+00
>>>>> Objects:              2.000e+02    1.00000   2.000e+02
>>>>> Flops:                4.670e+09    1.00580   4.657e+09  9.313e+09
>>>>> Flops/sec:            5.948e+08    1.00580   5.931e+08  1.186e+09
>>>>> MPI Messages:         7.965e+02    1.00000   7.965e+02  1.593e+03
>>>>> MPI Message Lengths:  1.412e+07    1.00000   1.773e+04  2.824e+07
>>>>> MPI Reductions:       1.046e+03    1.00000
>>>>>
>>>>> I am not entirely sure I can make sense of those statistics, but if
>>>>> there is something more you need, please feel free to let me know.
>>>>>
>>>>> Vijay
>>>>>
>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>
>>>>>>> Matt,
>>>>>>>
>>>>>>> The --with-debugging=1 option is certainly not meant for performance
>>>>>>> studies, but I didn't expect it to yield the same CPU time as a single
>>>>>>> processor for snes/ex20; i.e., my runs with 1 and 2 processors take
>>>>>>> approximately the same amount of time to compute the solution. I am
>>>>>>> currently configuring without debugging symbols and shall let you
>>>>>>> know what that yields.
>>>>>>>
>>>>>>> On a similar note, is there something extra that needs to be done to
>>>>>>> make use of multi-core machines while using MPI? I am not sure if
>>>>>>> this is even related to PETSc; it could be an MPI configuration
>>>>>>> option that either I or the configure process is missing. All ideas
>>>>>>> are much appreciated.
>>>>>>
>>>>>> Sparse MatVec (MatMult) is a memory-bandwidth-limited operation. On most
>>>>>> cheap multicore machines there is a single memory bus, and thus using
>>>>>> more cores gains you very little extra performance. I still suspect you
>>>>>> are not actually running in parallel, because you would usually see at
>>>>>> least a small speedup. That is why I suggested looking at -log_summary,
>>>>>> since it tells you how many processes were run and breaks down the time.
>>>>>>
>>>>>>    Matt
>>>>>>
>>>>>>>
>>>>>>> Vijay
>>>>>>>
>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am trying to configure my PETSc install with an MPI installation to
>>>>>>>>> make use of a dual quad-core desktop system running Ubuntu. But even
>>>>>>>>> though the configure/make process went through without problems, the
>>>>>>>>> scalability of the programs doesn't seem to reflect what I expected.
>>>>>>>>> My configure options are
>>>>>>>>>
>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>
>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>
>>>>>>>>    Matt
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is there something else that needs to be done as part of the configure
>>>>>>>>> process to enable decent scaling? I am only comparing programs run with
>>>>>>>>> mpiexec (-n 1) and (-n 2), but they seem to take approximately the same
>>>>>>>>> time, as reported by -log_summary. If it helps, I've been testing with
>>>>>>>>> snes/examples/tutorials/ex20.c for all purposes, with a custom -grid
>>>>>>>>> parameter from the command line to control the number of unknowns.
>>>>>>>>>
>>>>>>>>> If there is something you've witnessed before in this configuration, or
>>>>>>>>> if you need anything else to analyze the problem, do let me know.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>>> their experiments lead.
>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> What most experimenters take for granted before they begin their
>>>>>> experiments is infinitely more interesting than any results to which
>>>>>> their experiments lead.
>>>>>> -- Norbert Wiener
>>>>>>
>>>>
>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>
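
Putting the advice in this thread together, a minimal command sequence for the suggested experiment might look like the sketch below. The build commands and the example location assume the petsc-3.1-era tree discussed above, and the mpiexec path assumes --download-mpich places it under $PETSC_DIR/$PETSC_ARCH/bin; adjust both to your installation. The solver options (-dmmg_nlevels, -snes_view, -pc_type jacobi, -log_summary) are the ones named by Barry and Matt above.

    # optimized build for timing, as Matt suggests
    # (set PETSC_DIR and PETSC_ARCH as configure instructs before running make)
    ./configure --with-debugging=0 --download-f-blas-lapack=1 --download-mpich=1 --with-clanguage=C++
    make all

    # default grid with 5 multigrid levels (small coarse grid); -snes_view prints
    # the solver actually used and -log_summary breaks down where the time goes
    cd src/snes/examples/tutorials && make ex20
    mpiexec -n 1 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary
    mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary

    # Jacobi comparison Barry mentions: better relative speedup, worse absolute time
    mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -pc_type jacobi -log_summary

Comparing the SNESSolve/KSPSolve rows and the MatLUFactorNum row between the one- and two-process logs then shows directly whether the redundant coarse-grid LU solve still dominates the run time.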
