Matt, I apologize for the incomplete information. Please find attached the log_summary output for all the cases.

The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a 2x2GB/2x4GB configuration. I am not sure how to work out the memory bandwidth from that information alone, but if you need anything more, do let me know.
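If I had to guess (assuming the DIMMs run off a single dual-channel DDR3-1333 controller, which may well be wrong for a dual-socket box), the theoretical peak would be roughly

    1333 MT/s x 8 bytes/channel x 2 channels ~= 21.3 GB/s

though something like the STREAM benchmark would give a more realistic sustained number. Please correct me if that arithmetic is off.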

Vijay

On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> wrote:
> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>> Barry,
>>
>> Sorry about the delay in the reply. I did not have access to the system to test out what you said until now.
>>
>> I tried with -dmmg_nlevels 5, along with the default setup:
>>
>>     ./ex20 -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>
>>     processor    time
>>     1            114.2
>>     2            89.45
>>     4            81.01
>
> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from this data.
>
> 2) Do you know the memory bandwidth characteristics of this machine? That is crucial, and you cannot begin to understand speedup on it until you do. Please look this up.
>
> 3) Worrying about specifics of the MPI implementation makes no sense until the basics are nailed down.
>
>    Matt
>
>> The scaleup doesn't seem to be optimal, even with two processors. I am wondering if the fault is in the MPI configuration itself. Are these results what you would expect? I can also send you the log_summary for all cases if that will help.
>>
>> Vijay
>>
>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>> Barry,
>>>>
>>>> I understand what you are saying, but which example/options then are the best ones for measuring scalability on a multi-core machine? I chose the nonlinear diffusion problem specifically because of its inherent stiffness, which could probably provide noticeable scalability on a multi-core system. From your experience, do you think there is another example program that will demonstrate this more rigorously or clearly? By the way, I don't get good speedup even for 2 processes with ex20.c, and that was the original motivation for this thread.
>>>
>>>    Did you follow my instructions?
>>>
>>>    Barry
>>>
>>>> Satish, I configured with --download-mpich now, without the mpich-device option. The results are given above. I will try with the options you provided, although I don't entirely understand what they mean, which kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
>>>>
>>>> Vijay
>>>>
>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>> Ok, everything makes sense. It looks like you are using two-level multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid problem redundantly on each process (each process solves the entire coarse grid problem using LU factorization). The time for the factorization (in the two-process case) is
>>>>>
>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>
>>>>> which is 74 percent of the total solve time (and 84 percent of the flops). When three-quarters of the entire run is not parallel at all, you cannot expect much speedup.
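>>>>>
>>>>> (As a rough Amdahl-type bound: if a fraction s of the work is effectively serial, the best possible speedup on p processes is 1/(s + (1-s)/p); with s around 0.74 and p = 2 that is only about 1.15x. This is just an illustrative estimate.)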
>>>>>
>>>>> If you run with -snes_view it will display exactly the solver being used. You cannot expect to understand the performance if you don't understand what the solver is actually doing. Using a 20 by 20 by 20 coarse grid is generally a bad idea since the code spends most of its time there; stick with something like 5 by 5 by 5.
>>>>>
>>>>> I suggest running with the default grid and -dmmg_nlevels 5; then the coarse solve will be a trivial percentage of the run time.
>>>>>
>>>>> You should get pretty good speedup for 2 processes but not much better speedup for four processes because, as Matt noted, the computation is memory bandwidth limited; see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note also that this is running multigrid, which is a fast solver but does not scale in parallel as well as many slower algorithms. For example, if you run with -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2 processes but crummy overall speed.
>>>>>
>>>>>    Barry
>>>>>
>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>> Barry,
>>>>>>
>>>>>> Please find attached the patch with the minor change that controls the number of elements from the command line for snes/ex20.c. I know this can be achieved with -grid_x etc. from the command line, but I thought it just made the typing for the refinement process a little easier. I apologize if there was any confusion.
>>>>>>
>>>>>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>> We need all the information from -log_summary to see what is going on.
>>>>>>>
>>>>>>> Not sure what -grid 20 means, but don't expect any good parallel performance unless you have at least 10,000 unknowns per process.
>>>>>>>
>>>>>>>    Barry
>>>>>>>
>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>> Here are the performance statistics for the 1 and 2 processor runs.
>>>>>>>>
>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>>>>>
>>>>>>>>                          Max        Max/Min    Avg        Total
>>>>>>>> Time (sec):              8.452e+00  1.00000    8.452e+00
>>>>>>>> Objects:                 1.470e+02  1.00000    1.470e+02
>>>>>>>> Flops:                   5.045e+09  1.00000    5.045e+09  5.045e+09
>>>>>>>> Flops/sec:               5.969e+08  1.00000    5.969e+08  5.969e+08
>>>>>>>> MPI Messages:            0.000e+00  0.00000    0.000e+00  0.000e+00
>>>>>>>> MPI Message Lengths:     0.000e+00  0.00000    0.000e+00  0.000e+00
>>>>>>>> MPI Reductions:          4.440e+02  1.00000
>>>>>>>>
>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>>>>>
>>>>>>>>                          Max        Max/Min    Avg        Total
>>>>>>>> Time (sec):              7.851e+00  1.00000    7.851e+00
>>>>>>>> Objects:                 2.000e+02  1.00000    2.000e+02
>>>>>>>> Flops:                   4.670e+09  1.00580    4.657e+09  9.313e+09
>>>>>>>> Flops/sec:               5.948e+08  1.00580    5.931e+08  1.186e+09
>>>>>>>> MPI Messages:            7.965e+02  1.00000    7.965e+02  1.593e+03
>>>>>>>> MPI Message Lengths:     1.412e+07  1.00000    1.773e+04  2.824e+07
>>>>>>>> MPI Reductions:          1.046e+03  1.00000
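>>>>>>>>
>>>>>>>> If I am reading these correctly, the total flop count nearly doubles (5.045e+09 to 9.313e+09, about 1.85x) in going from 1 to 2 processes, so even though the aggregate flop rate roughly doubles (5.969e+08 to 1.186e+09 flops/sec), the wall time only drops from 8.452 s to 7.851 s. I may well be misreading it, though.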
>>>>>>>>
>>>>>>>> Beyond that, I am not entirely sure I can make sense of these statistics, but if there is something more you need, please feel free to let me know.
>>>>>>>>
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>> Matt,
>>>>>>>>>>
>>>>>>>>>> The --with-debugging=1 option is certainly not meant for performance studies, but I didn't expect it to yield the same CPU time as a single processor for snes/ex20; i.e., my runs with 1 and 2 processors take approximately the same amount of time to compute the solution. But I am currently configuring without debugging symbols and shall let you know what that yields.
>>>>>>>>>>
>>>>>>>>>> On a similar note, is there something extra that needs to be done to make use of multi-core machines while using MPI? I am not sure if this is even related to PETSc, but it could be an MPI configuration option that either I or the configure process is missing. All ideas are much appreciated.
>>>>>>>>>
>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On most cheap multicore machines there is a single memory bus, so using more cores gains you very little extra performance. I still suspect you are not actually running in parallel, because you usually see at least a small speedup. That is why I suggested looking at -log_summary: it tells you how many processes were run and breaks down the time.
>>>>>>>>>
>>>>>>>>>    Matt
>>>>>>>>>
>>>>>>>>>> Vijay
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to configure my PETSc install with an MPI installation to make use of a dual quad-core desktop system running Ubuntu. But even though the configure/make process went through without problems, the scalability of the programs doesn't seem to reflect what I expected. My configure options are
>>>>>>>>>>>>
>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1 --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1 --download-blacs=1 --download-scalapack=1 --with-clanguage=C++ --download-plapack=1 --download-mumps=1 --download-umfpack=yes --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>
>>>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0.
>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance.
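>>>>>>>>>>>
>>>>>>>>>>> For example, reusing your options above, something along these lines (just a sketch, not a tested configure line):
>>>>>>>>>>>
>>>>>>>>>>>     ./configure --download-f-blas-lapack=1 --download-mpich=1 \
>>>>>>>>>>>       --with-clanguage=C++ --with-debugging=0 --COPTFLAGS=-O3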
>>>>>>>>>>>
>>>>>>>>>>>    Matt
>>>>>>>>>>>
>>>>>>>>>>>> Is there something else that needs to be done as part of the configure process to enable decent scaling? I am only comparing programs with mpiexec (-n 1) and (-n 2), but they seem to take approximately the same time, as noted from -log_summary. If it helps, I've been testing with snes/examples/tutorials/ex20.c for all purposes, with a custom -grid parameter from the command line to control the number of unknowns.
>>>>>>>>>>>>
>>>>>>>>>>>> If there is something you've witnessed before in this configuration, or if you need anything else to analyze the problem, do let me know.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Vijay
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>
>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener

[Attachments: ex20_np1.out, ex20_np2.out, ex20_np4.out (application/octet-stream; scrubbed by the list archive)]
