Barry,

Sorry about the delay in replying; I did not have access to the system to
test your suggestion until now.

I tried -dmmg_nlevels 5 along with the default setup:

    ./ex20 -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5

    processors   time (s)
    1            114.2
    2             89.45
    4             81.01

The speedup does not seem optimal, even with two processors, and I am
wondering whether the fault is in the MPI configuration itself. Are these
results what you would expect? I can also send you the -log_summary output
for all cases if that will help.

Vijay
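For reference, those timings work out to a speedup of about 1.28x on 2
processes (64 percent parallel efficiency) and 1.41x on 4 processes (35
percent). A minimal stand-alone C sketch (hypothetical, not part of ex20 or
PETSc) that turns such wall-clock times into speedup and efficiency figures:

    #include <stdio.h>

    /* Hypothetical helper: computes speedup and parallel efficiency
       from the wall-clock times reported above. */
    int main(void)
    {
      const int    procs[]  = {1, 2, 4};
      const double time_s[] = {114.2, 89.45, 81.01};
      const int    n        = 3;

      for (int i = 0; i < n; i++) {
        double speedup    = time_s[0] / time_s[i];
        double efficiency = speedup / procs[i];   /* ideal value is 1.0 */
        printf("np=%d  time=%7.2f s  speedup=%.2fx  efficiency=%.0f%%\n",
               procs[i], time_s[i], speedup, 100.0 * efficiency);
      }
      return 0;
    }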
On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>
>> Barry,
>>
>> I understand what you are saying, but which example and options are then
>> best for measuring scalability on a multi-core machine? I chose the
>> nonlinear diffusion problem specifically because its inherent stiffness
>> could probably provide noticeable scalability on a multi-core system.
>> From your experience, is there another example program that demonstrates
>> this more rigorously or clearly? By the way, I don't get good speedup
>> even for 2 processes with ex20.c, and that was the original motivation
>> for this thread.
>
>   Did you follow my instructions?
>
>   Barry
>
>>
>> Satish, I configured with --download-mpich now, without the mpich-device.
>> The results are given above. I will try the options you provided,
>> although I don't entirely understand what they mean, which kinda bugs
>> me. Also, is OpenMPI the preferred implementation on Ubuntu?
>>
>> Vijay
>>
>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>   Ok, everything makes sense. It looks like you are using two-level
>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>>> -mg_coarse_redundant_pc_type lu. This means it is solving the
>>> coarse-grid problem redundantly on each process (each process solves
>>> the entire coarse grid using LU factorization). The time for the
>>> factorization in the two-process case is
>>>
>>> MatLUFactorNum    14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>> MatILUFactorSym    7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>
>>> which is 74 percent of the total solve time (and 82 percent of the
>>> flops). When 3/4 of the entire run is not parallel at all, you cannot
>>> expect much speedup. If you run with -snes_view it will display exactly
>>> the solver being used; you cannot expect to understand the performance
>>> if you don't understand what the solver is actually doing. Using a
>>> 20 by 20 by 20 coarse grid is generally a bad idea since the code
>>> spends most of its time there; stick with something like 5 by 5 by 5.
>>>
>>>   I suggest running with the default grid and -dmmg_nlevels 5; then the
>>> coarse solve will be a trivial percentage of the run time.
>>>
>>>   You should get pretty good speedup for 2 processes but not much
>>> better speedup for four, because, as Matt noted, the computation is
>>> memory bandwidth limited; see
>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>> Note also that this is running multigrid, which is a fast solver but
>>> doesn't parallel scale as well as many slow algorithms. For example, if
>>> you run -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with
>>> 2 processors but crummy speed.
>>>
>>>   Barry
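The bandwidth limit Barry refers to can be measured directly on the
machine. Below is a minimal STREAM-style triad sketch in plain C
(independent of PETSc; the array size is an arbitrary choice, merely large
enough to defeat the caches). The sustained GB/s it reports roughly bounds
sparse MatMult performance no matter how many cores share the memory bus:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* STREAM-style triad: a[i] = b[i] + s*c[i]. The achievable rate is
       set by the memory bus, which is what limits sparse MatMult on a
       multicore box. N = 20M doubles per array (~480 MB total) is an
       arbitrary choice that comfortably exceeds any cache. */
    #define N 20000000L

    int main(void)
    {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      long    i;

      if (!a || !b || !c) return 1;
      for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      clock_t t0 = clock();
      for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];
      double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

      /* 24 bytes move per iteration: two 8-byte loads, one 8-byte store.
         Reading a[N/2] keeps the compiler from discarding the loop. */
      printf("triad: %.2f GB/s (check value %.1f)\n",
             24.0 * N / sec / 1e9, a[N / 2]);
      free(a); free(b); free(c);
      return 0;
    }

If a single process already saturates this number, adding processes on the
same socket cannot make MatMult faster, which is consistent with the jacobi
timings at the top of the thread.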
>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>
>>>> Barry,
>>>>
>>>> Please find attached the patch with the minor change that controls the
>>>> number of elements from the command line for snes/ex20.c. I know this
>>>> can be achieved with -grid_x etc. from the command line, but I thought
>>>> this made the typing for the refinement process a little easier. I
>>>> apologize if there was any confusion.
>>>>
>>>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>>>>
>>>> Vijay
>>>>
>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>>   We need all the information from -log_summary to see what is going on.
>>>>>
>>>>>   Not sure what -grid 20 means, but don't expect any good parallel
>>>>> performance with fewer than at least 10,000 unknowns per process.
>>>>>
>>>>>   Barry
>>>>>
>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>
>>>>>> Here are the performance statistics for the 1- and 2-processor runs.
>>>>>>
>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>>>
>>>>>>                          Max       Max/Min     Avg        Total
>>>>>> Time (sec):           8.452e+00   1.00000   8.452e+00
>>>>>> Objects:              1.470e+02   1.00000   1.470e+02
>>>>>> Flops:                5.045e+09   1.00000   5.045e+09  5.045e+09
>>>>>> Flops/sec:            5.969e+08   1.00000   5.969e+08  5.969e+08
>>>>>> MPI Messages:         0.000e+00   0.00000   0.000e+00  0.000e+00
>>>>>> MPI Message Lengths:  0.000e+00   0.00000   0.000e+00  0.000e+00
>>>>>> MPI Reductions:       4.440e+02   1.00000
>>>>>>
>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>>>
>>>>>>                          Max       Max/Min     Avg        Total
>>>>>> Time (sec):           7.851e+00   1.00000   7.851e+00
>>>>>> Objects:              2.000e+02   1.00000   2.000e+02
>>>>>> Flops:                4.670e+09   1.00580   4.657e+09  9.313e+09
>>>>>> Flops/sec:            5.948e+08   1.00580   5.931e+08  1.186e+09
>>>>>> MPI Messages:         7.965e+02   1.00000   7.965e+02  1.593e+03
>>>>>> MPI Message Lengths:  1.412e+07   1.00000   1.773e+04  2.824e+07
>>>>>> MPI Reductions:       1.046e+03   1.00000
>>>>>>
>>>>>> I am not entirely sure I can make sense of those statistics, but if
>>>>>> there is something more you need, please feel free to let me know.
>>>>>>
>>>>>> Vijay
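The -grid option that the patch adds could be read along these lines; a
minimal sketch assuming the petsc-3.1-era PetscOptionsGetInt signature
(PetscTruth flag) — the actual attached patch may differ:

    /* grid_opt.c: hypothetical sketch of reading a -grid option, in the
       style of the ex20 patch mentioned above. */
    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscInt       grid = 20;  /* default matches the -grid 20 runs above */
      PetscTruth     flg;

      ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
      ierr = PetscOptionsGetInt(PETSC_NULL, "-grid", &grid, &flg);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "grid = %D\n", grid);CHKERRQ(ierr);
      /* ex20 would pass grid to DACreate3d() as the coarse mesh size in
         each dimension, instead of setting -grid_x/-grid_y/-grid_z. */
      ierr = PetscFinalize();CHKERRQ(ierr);
      return 0;
    }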
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Matt,
>>>>>>>>
>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>> performance studies, but I didn't expect it to yield the same cpu
>>>>>>>> time as a single processor for snes/ex20; i.e., my runs with 1 and
>>>>>>>> 2 processors take approximately the same amount of time to compute
>>>>>>>> the solution. I am currently configuring without debugging symbols
>>>>>>>> and shall let you know what that yields.
>>>>>>>>
>>>>>>>> On a similar note, is there something extra that needs to be done
>>>>>>>> to make use of multi-core machines while using MPI? I am not sure
>>>>>>>> whether this is even related to PETSc; it could be an MPI
>>>>>>>> configuration option that either I or the configure process is
>>>>>>>> missing. All ideas are much appreciated.
>>>>>>>
>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On
>>>>>>> most cheap multicore machines there is a single memory bus, so using
>>>>>>> more cores gains you very little extra performance. I still suspect
>>>>>>> you are not actually running in parallel, because you usually see at
>>>>>>> least a small speedup. That is why I suggested looking at
>>>>>>> -log_summary: it tells you how many processes were run and breaks
>>>>>>> down the time.
>>>>>>>
>>>>>>>    Matt
>>>>>>>
>>>>>>>>
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to configure my PETSc install with an MPI
>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>> running Ubuntu. But even though the configure/make process went
>>>>>>>>>> through without problems, the scalability of the programs
>>>>>>>>>> doesn't seem to reflect what I expected. My configure options are
>>>>>>>>>>
>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>
>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>
>>>>>>>>>    Matt
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>>>> programs run with mpiexec (-n 1) and (-n 2), but they seem to
>>>>>>>>>> take approximately the same time as noted from -log_summary. If
>>>>>>>>>> it helps, I've been testing with snes/examples/tutorials/ex20.c
>>>>>>>>>> for all purposes, with a custom -grid parameter from the command
>>>>>>>>>> line to control the number of unknowns.
>>>>>>>>>>
>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>> configuration, or if you need anything else to analyze the
>>>>>>>>>> problem, do let me know.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Vijay
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>> which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>> which their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>
>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
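Matt's suspicion that the runs may not actually be parallel is easy to rule
out before digging into -log_summary. A trivial MPI check, independent of
PETSc, compiled with the same mpicc that built it:

    #include <stdio.h>
    #include <mpi.h>

    /* Each rank reports itself. If "mpiexec -n 2 ./rank_check" prints
       only one line, the MPI installation (not PETSc) is the problem. */
    int main(int argc, char **argv)
    {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }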
