On Wed, Feb 2, 2011 at 11:13 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> Barry,
>
> I understand what you are saying, but which example/options then are the
> best ones to measure scalability on a multi-core machine? I chose the
> nonlinear diffusion problem specifically because of its inherent
> stiffness, which I thought would probably provide noticeable scalability
> on a multi-core system. From your experience, do you think there is
> another example program that will demonstrate this more rigorously or
> clearly? Btw, I don't get good speedup even for 2 processes with ex20.c,
> and that was the original motivation for this thread.
>

Very simply, Barry said your coarse grid is way too big. Make it smaller and
you will see speedup.

   Matt

> Satish, I configured with --download-mpich now, without the mpich-device.
> The results are given above. I will try with the options you provided,
> although I don't entirely understand what they mean, which kinda bugs me.
> Also, is OpenMPI the preferred implementation on Ubuntu?
>
> Vijay
>
> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> >   Ok, everything makes sense. It looks like you are using two-level
> > multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
> > -mg_coarse_redundant_pc_type lu. This means it is solving the coarse-grid
> > problem redundantly on each process (each process performs the entire
> > coarse-grid solve using LU factorization). The time for the factorization
> > (in the two-process case) is
> >
> > MatLUFactorNum    14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
> > MatILUFactorSym    7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
> >
> > which is 74 percent of the total solve time (and 82 percent of the
> > flops). When 3/4 of the entire run is not parallel at all, you cannot
> > expect much speedup. If you run with -snes_view it will display exactly
> > the solver being used. You cannot expect to understand the performance
> > if you don't understand what the solver is actually doing. Using a 20 by
> > 20 by 20 coarse grid is generally a bad idea since the code then spends
> > most of its time there; stick with something like 5 by 5 by 5.
> >
> >   I suggest running with the default grid and -dmmg_nlevels 5; then the
> > coarse solve will be a trivial percentage of the run time.
> >
> >   You should get pretty good speedup for 2 processes, but not much
> > better speedup for four processes, because as Matt noted the computation
> > is memory-bandwidth limited; see
> > http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
> > Note also that this is running multigrid, which is a fast solver but
> > does not parallel-scale as well as many slower algorithms. For example,
> > if you run with -dmmg_nlevels 5 -pc_type jacobi you will get great
> > speedup with 2 processes but crummy speed.
> >
> >   Barry
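In concrete terms, the runs Barry is suggesting look like the following (the
mpiexec path is copied from the log_summary runs quoted later in this thread,
and the options are the ones named in his message; treat the exact
combination as a sketch rather than a verified command line):

    /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary
    /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -pc_type jacobi -log_summary

The first run shows the solver composition (-snes_view) and the timing
breakdown with a small coarse grid; the second is Barry's Jacobi comparison,
which should show good parallel speedup but poor absolute time.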
> >
> > On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
> >
> >> Barry,
> >>
> >> Please find attached the patch for the minor change to control the
> >> number of elements from the command line for snes/ex20.c. I know that
> >> this can be achieved with -grid_x etc. from the command line, but I
> >> thought this just made the typing for the refinement process a little
> >> easier. I apologize if there was any confusion.
> >>
> >> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
> >>
> >> Vijay
> >>
> >> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >>>
> >>>   We need all the information from -log_summary to see what is going on.
> >>>
> >>>   Not sure what -grid 20 means, but don't expect any good parallel
> >>> performance with less than at least 10,000 unknowns per process.
> >>>
> >>>   Barry
> >>>
> >>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
> >>>
> >>>> Here are the performance statistics for the 1 and 2 processor runs.
> >>>>
> >>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
> >>>>
> >>>>                          Max        Max/Min    Avg        Total
> >>>> Time (sec):           8.452e+00   1.00000   8.452e+00
> >>>> Objects:              1.470e+02   1.00000   1.470e+02
> >>>> Flops:                5.045e+09   1.00000   5.045e+09  5.045e+09
> >>>> Flops/sec:            5.969e+08   1.00000   5.969e+08  5.969e+08
> >>>> MPI Messages:         0.000e+00   0.00000   0.000e+00  0.000e+00
> >>>> MPI Message Lengths:  0.000e+00   0.00000   0.000e+00  0.000e+00
> >>>> MPI Reductions:       4.440e+02   1.00000
> >>>>
> >>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
> >>>>
> >>>>                          Max        Max/Min    Avg        Total
> >>>> Time (sec):           7.851e+00   1.00000   7.851e+00
> >>>> Objects:              2.000e+02   1.00000   2.000e+02
> >>>> Flops:                4.670e+09   1.00580   4.657e+09  9.313e+09
> >>>> Flops/sec:            5.948e+08   1.00580   5.931e+08  1.186e+09
> >>>> MPI Messages:         7.965e+02   1.00000   7.965e+02  1.593e+03
> >>>> MPI Message Lengths:  1.412e+07   1.00000   1.773e+04  2.824e+07
> >>>> MPI Reductions:       1.046e+03   1.00000
> >>>>
> >>>> I am not entirely sure I can make sense of those statistics, but if
> >>>> there is something more you need, please feel free to let me know.
> >>>>
> >>>> Vijay
> >>>>
> >>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> >>>>>>
> >>>>>> Matt,
> >>>>>>
> >>>>>> The --with-debugging=1 option is certainly not meant for performance
> >>>>>> studies, but I didn't expect it to yield the same cpu time as a
> >>>>>> single processor for snes/ex20; i.e., my runs with 1 and 2
> >>>>>> processors take approximately the same amount of time to compute the
> >>>>>> solution. But I am currently configuring without debugging symbols
> >>>>>> and shall let you know what that yields.
> >>>>>>
> >>>>>> On a similar note, is there something extra that needs to be done to
> >>>>>> make use of multi-core machines while using MPI? I am not sure if
> >>>>>> this is even related to PETSc; it could be an MPI configuration
> >>>>>> option that either I or the configure process is missing. All ideas
> >>>>>> are much appreciated.
> >>>>>
> >>>>> Sparse MatVec (MatMult) is a memory-bandwidth limited operation. On
> >>>>> most cheap multicore machines there is a single memory bus, and thus
> >>>>> using more cores gains you very little extra performance. I still
> >>>>> suspect you are not actually running in parallel, because you usually
> >>>>> see a small speedup. That is why I suggested looking at -log_summary,
> >>>>> since it tells you how many processes were run and breaks down the
> >>>>> time.
> >>>>>
> >>>>>    Matt
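To make Matt's point concrete, here is a minimal standalone C sketch of the
kind of CSR matrix-vector kernel a sparse MatMult performs (generic CSR, not
PETSc source): every multiply-add streams roughly twelve bytes (an 8-byte
matrix value plus a 4-byte column index) from memory, so the memory bus
saturates long before the cores do.

    /* Generic CSR sparse matrix-vector product y = A*x. The inner loop does
       2 flops per iteration but loads ~12 bytes, so throughput is set by
       memory bandwidth, not by how many cores are multiplying. */
    #include <stdio.h>

    static void csr_matmult(int n, const int *rowptr, const int *col,
                            const double *val, const double *x, double *y)
    {
      for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
          sum += val[j] * x[col[j]];   /* 2 flops per ~12 bytes streamed */
        y[i] = sum;
      }
    }

    int main(void)
    {
      /* 3x3 tridiagonal matrix [2 -1 0; -1 2 -1; 0 -1 2] applied to (1,1,1) */
      int    rowptr[] = {0, 2, 5, 7};
      int    col[]    = {0, 1, 0, 1, 2, 1, 2};
      double val[]    = {2, -1, -1, 2, -1, -1, 2};
      double x[]      = {1, 1, 1}, y[3];
      csr_matmult(3, rowptr, col, val, x, y);
      printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* prints: y = 1 0 1 */
      return 0;
    }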
> >>>>>>
> >>>>>> Vijay
> >>>>>>
> >>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
> >>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am trying to configure my petsc install with an MPI installation
> >>>>>>>> to make use of a dual quad-core desktop system running Ubuntu. But
> >>>>>>>> even though the configure/make process went through without
> >>>>>>>> problems, the scalability of the programs doesn't seem to reflect
> >>>>>>>> what I expected. My configure options are
> >>>>>>>>
> >>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
> >>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
> >>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
> >>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
> >>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
> >>>>>>>> --with-debugging=1 --with-errorchecking=yes
> >>>>>>>
> >>>>>>> 1) For performance studies, make a build using --with-debugging=0
> >>>>>>> 2) Look at -log_summary for a breakdown of performance
> >>>>>>>
> >>>>>>>    Matt
> >>>>>>>
> >>>>>>>> Is there something else that needs to be done as part of the
> >>>>>>>> configure process to enable decent scaling? I am only comparing
> >>>>>>>> programs with mpiexec (-n 1) and (-n 2), but they seem to be taking
> >>>>>>>> approximately the same time as noted from -log_summary. If it
> >>>>>>>> helps, I've been testing with snes/examples/tutorials/ex20.c for
> >>>>>>>> all purposes, with a custom -grid parameter from the command line
> >>>>>>>> to control the number of unknowns.
> >>>>>>>>
> >>>>>>>> If there is something you've witnessed before in this
> >>>>>>>> configuration, or if you need anything else to analyze the problem,
> >>>>>>>> do let me know.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Vijay
> >>>>>>>
> >>>>>>> --
> >>>>>>> What most experimenters take for granted before they begin their
> >>>>>>> experiments is infinitely more interesting than any results to
> >>>>>>> which their experiments lead.
> >>>>>>> -- Norbert Wiener
> >>>>>
> >>>>> --
> >>>>> What most experimenters take for granted before they begin their
> >>>>> experiments is infinitely more interesting than any results to which
> >>>>> their experiments lead.
> >>>>> -- Norbert Wiener
> >>>
> >> <ex20.patch><ex20_np1.out><ex20_np2.out>
> >
>

--
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener
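The ex20.patch attachment referenced above is not preserved in this archive.
As a rough illustration only, a change like the one Vijay describes usually
amounts to a few lines of option handling; the sketch below is hypothetical
(the -grid name comes from the thread, while the variable names, default
value, and the PETSc 3.1-era PetscTruth type are assumptions, not the actual
patch):

    /* Hypothetical sketch, NOT the attached ex20.patch: read a single -grid
       value from the command line to use for all three grid dimensions. */
    PetscInt   grid = 4;             /* assumed default grid size */
    PetscTruth flg  = PETSC_FALSE;
    ierr = PetscOptionsGetInt(PETSC_NULL, "-grid", &grid, &flg);CHKERRQ(ierr);
    /* grid would then replace the hard-wired dimensions the example passes
       when it creates its DA, with -dmmg_nlevels controlling refinement. */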
