Barry,

Thanks for the quick reply. I ran the benchmark/stream/BasicVersion for one and two processes, and the output is as follows:

-n 1

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 50 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2529 microseconds.
   (= 2529 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        10161.8510       0.0032       0.0031       0.0037
Scale:        9843.6177       0.0034       0.0033       0.0038
Add:         10656.7114       0.0046       0.0045       0.0053
Triad:       10799.0448       0.0046       0.0044       0.0054

-n 2 (each process prints its own report)

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 50 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 4320 microseconds.
   (= 4320 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         5739.9704       0.0058       0.0056       0.0063
Scale:        5839.3617       0.0058       0.0055       0.0062
Add:          6116.9323       0.0081       0.0078       0.0085
Triad:        6021.0722       0.0084       0.0080       0.0088

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 50 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2954 microseconds.
   (= 2954 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         6091.9448       0.0056       0.0053       0.0061
Scale:        5501.1775       0.0060       0.0058       0.0062
Add:          5960.4640       0.0084       0.0081       0.0087
Triad:        5936.2109       0.0083       0.0081       0.0089

I do not have OpenMP installed, so I am not sure if that is what you wanted when you said two threads. I also closed most of the applications that were open before running these tests, so the numbers should hopefully be accurate.

Vijay
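(For reference: the four rates reported above come from kernels of essentially the following shape. This is a minimal sketch of the standard STREAM loops, omitting the timing, repetition, and verification that the real benchmarks/streams code performs; N matches the reported array size.)

/* Sketch of the four STREAM kernels: each loop streams whole arrays
 * through memory while doing at most 2 flops per element, so the
 * reported MB/s measures memory bandwidth, not CPU speed. */
#include <stdlib.h>

#define N 2000000 /* matches "Array size = 2000000" above */

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double  q = 3.0;
  int     j;

  for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

  for (j = 0; j < N; j++) c[j] = a[j];            /* Copy:  16 bytes/element          */
  for (j = 0; j < N; j++) b[j] = q * c[j];        /* Scale: 16 bytes/element, 1 flop  */
  for (j = 0; j < N; j++) c[j] = a[j] + b[j];     /* Add:   24 bytes/element, 1 flop  */
  for (j = 0; j < N; j++) a[j] = b[j] + q * c[j]; /* Triad: 24 bytes/element, 2 flops */

  free(a); free(b); free(c);
  return 0;
}

Note what the numbers say: the aggregate two-process rate (e.g., Triad: 6021 + 5936 = ~11957 MB/s) is only about 10-15% above the single-process rate (~10799 MB/s), so the two processes are largely re-dividing one memory bus rather than each getting its own bandwidth.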
On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>   Vijay,
>
>   Let's just look at a single embarrassingly parallel computation in the run; this computation has NO communication and uses NO MPI and NO synchronization between processes:
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
>  1 process
> VecMAXPY            3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>
>  2 processes
> VecMAXPY            3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>
>   The speedup is 1.7074e+01/1.3861e+01 = 2443/1983 = 1.23, which is terrible! Now why would it be so bad? (Remember, you cannot blame MPI.)
>
> 1) Other processes are running on the machine, sucking up memory bandwidth. Make sure no other compute tasks are running during this time.
>
> 2) The single-process run is able to use almost all of the hardware memory bandwidth, so introducing the second process cannot increase the performance much. This means this machine is terrible for parallelization of sparse iterative solvers.
>
> 3) The machine is somehow misconfigured (beats me how) so that while the one-process job doesn't use more than half of the memory bandwidth, when two processes are run the second process cannot utilize all that additional memory bandwidth.
>
>   In src/benchmarks/streams you can run "make test" and have it generate a report of how the streams benchmark is able to utilize the memory bandwidth. Run that and send us the output (run with just 2 threads).
>
>   Barry
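(For context on the event Barry singles out: VecMAXPY computes y <- y + sum_i alpha_i * x_i over whole vectors. A rough, hypothetical sketch of such a kernel -- not PETSc's actual implementation, which unrolls over several vectors at a time -- shows why it is bandwidth-bound rather than compute-bound:)

/* Rough sketch of a VecMAXPY-like kernel: y += sum_i alpha[i]*x[i].
 * Every 2 flops require streaming at least one fresh 8-byte entry of
 * x from memory, so ~10 GB/s of bandwidth caps the kernel near
 * 2.5 Gflop/s no matter how many cores share the bus -- consistent
 * with the 1983 vs. 2443 Mflop/s figures above. */
void vecmaxpy_sketch(int n, double *y, int nv,
                     const double *alpha, double *const *x)
{
  for (int i = 0; i < nv; i++)
    for (int j = 0; j < n; j++)
      y[j] += alpha[i] * x[i][j];
}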
>
>
> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>
>> Matt,
>>
>> I apologize for the incomplete information. Find attached the log_summary for all the cases.
>>
>> The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a 2x2GB/2x4GB configuration. I do not know how to decipher the memory bandwidth from this information, but if you need anything more, do let me know.
>>
>> Vijay
>>
>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> wrote:
>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>
>>>> Barry,
>>>>
>>>> Sorry about the delay in the reply. I did not have access to the system to test out what you said, until now.
>>>>
>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20 -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>
>>>> processor      time
>>>> 1              114.2
>>>> 2              89.45
>>>> 4              81.01
>>>
>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from this data.
>>> 2) Do you know the memory bandwidth characteristics of this machine? That is crucial, and you cannot begin to understand speedup on it until you do. Please look this up.
>>> 3) Worrying about specifics of the MPI implementation makes no sense until the basics are nailed down.
>>>
>>>    Matt
>>>
>>>>
>>>> The scaleup doesn't seem to be optimal, even with two processors. I am wondering if the fault is in the MPI configuration itself. Are these results as you would expect? I can also send you the log_summary for all cases if that will help.
>>>>
>>>> Vijay
>>>>
>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>
>>>>>> Barry,
>>>>>>
>>>>>> I understand what you are saying, but which example/options then is the best one to compute the scalability on a multi-core machine? I chose the nonlinear diffusion problem specifically because of its inherent stiffness, which could probably provide noticeable scalability on a multi-core system. From your experience, do you think there is another example program that will demonstrate this much more rigorously or clearly? Btw, I don't get good speedup even for 2 processes with ex20.c, and that was the original motivation for this thread.
>>>>>
>>>>>   Did you follow my instructions?
>>>>>
>>>>>   Barry
>>>>>
>>>>>>
>>>>>> Satish, I configured with --download-mpich now, without the mpich device. The results are given above. I will try with the options you provided, although I don't entirely understand what they mean, which kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>>   Ok, everything makes sense. It looks like you are using two-level multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid problem redundantly on each process (each process is solving the entire coarse grid problem using LU factorization). The time for the factorization (in the two-process case) is
>>>>>>>
>>>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>
>>>>>>> which is 74 percent of the total solve time (and 84 percent of the flops). When 3/4 of the entire run is not parallel at all, you cannot expect much speedup. If you run with -snes_view it will display exactly the solver being used. You cannot expect to understand the performance if you don't understand what the solver is actually doing. Using a 20 by 20 by 20 coarse grid is generally a bad idea since the code spends most of the time there; stick with something like 5 by 5 by 5.
>>>>>>>
>>>>>>>   I suggest running with the default grid and -dmmg_nlevels 5; then the coarse solve will be a trivial percentage of the run time.
>>>>>>>
>>>>>>>   You should get pretty good speedup for 2 processes but not much better speedup for four processes because, as Matt noted, the computation is memory bandwidth limited; see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note also that this is running multigrid, which is a fast solver but doesn't parallel-scale as well as many slow algorithms. For example, if you run with -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2 processors but crummy overall speed.
>>>>>>>
>>>>>>>   Barry
>>>>>>>
>>>>>>>
>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>
>>>>>>>> Barry,
>>>>>>>>
>>>>>>>> Please find attached the patch for the minor change to control the number of elements from the command line for snes/ex20.c. I know that this can be achieved with -grid_x etc. from the command line, but I thought this just made the typing for the refinement process a little easier. I apologize if there was any confusion.
>>>>>>>>
>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>>>>>>>>
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>>   We need all the information from -log_summary to see what is going on.
>>>>>>>>>
>>>>>>>>>   Not sure what -grid 20 means, but don't expect any good parallel performance with less than at least 10,000 unknowns per process.
>>>>>>>>>
>>>>>>>>>   Barry
>>>>>>>>>
>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>
>>>>>>>>>> Here are the performance statistics for the 1 and 2 processor runs.
>>>>>>>>>>
>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>>>>>>>
>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>>>>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>>>>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>>>>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>>>>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>> MPI Reductions:       4.440e+02      1.00000
>>>>>>>>>>
>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>>>>>>>
>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>>>>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>>>>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>>>>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>>>>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>>>>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>>>>>>>> MPI Reductions:       1.046e+03      1.00000
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure I can make sense out of those statistics, but if there is something more you need, please feel free to let me know.
>>>>>>>>>>
>>>>>>>>>> Vijay
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Matt,
>>>>>>>>>>>>
>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for performance studies, but I didn't expect it to yield the same CPU time as a single processor for snes/ex20; i.e., my runs with 1 and 2 processors take approximately the same amount of time to compute the solution.
>>>>>>>>>>>> But I am currently configuring without debugging symbols and shall let you know what that yields.
>>>>>>>>>>>>
>>>>>>>>>>>> On a similar note, is there something extra that needs to be done to make use of multi-core machines while using MPI? I am not sure if this is even related to PETSc, but it could be an MPI configuration option that either I or the configure process is missing. All ideas are much appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On most cheap multicore machines there is a single memory bus, and thus using more cores gains you very little extra performance. I still suspect you are not actually running in parallel, because you would usually see at least a small speedup. That is why I suggested looking at -log_summary, since it tells you how many processes were run and breaks down the time.
>>>>>>>>>>>
>>>>>>>>>>>    Matt
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying to configure my PETSc install with an MPI installation to make use of a dual quad-core desktop system running Ubuntu. But even though the configure/make process went through without problems, the scalability of the programs doesn't seem to reflect what I expected. My configure options are
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1 --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1 --download-blacs=1 --download-scalapack=1 --with-clanguage=C++ --download-plapack=1 --download-mumps=1 --download-umfpack=yes --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there something else that needs to be done as part of the configure process to enable decent scaling? I am only comparing runs with mpiexec (-n 1) and (-n 2), but they seem to be taking approximately the same time, as noted from -log_summary. If it helps, I've been testing with snes/examples/tutorials/ex20.c for all purposes, with a custom -grid parameter from the command line to control the number of unknowns.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If there is something you've witnessed before in this configuration, or if you need anything else to analyze the problem, do let me know.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>
>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>
>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
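(A closing illustration of Matt's recurring point that sparse MatVec is memory bandwidth limited. The following is a generic CSR matrix-vector product -- an illustrative sketch, not PETSc's actual MatMult implementation:)

/* Generic CSR sparse matrix-vector product y = A*x.  Each pair of
 * flops consumes a matrix value (8 bytes) and a column index
 * (4 bytes) that are each used exactly once, so the kernel is bounded
 * by how fast memory can deliver the matrix, not by core count. */
void csr_matvec(int nrows, const int *rowptr, const int *colind,
                const double *val, const double *x, double *y)
{
  for (int i = 0; i < nrows; i++) {
    double sum = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
      sum += val[k] * x[colind[k]];
    y[i] = sum;
  }
}

On a desktop whose cores share one memory bus -- like the dual quad-core discussed above -- adding MPI processes to such a kernel mostly re-divides the same bandwidth, which is consistent with the flat timings reported throughout this thread.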
