Barry,

That sucks. I am sure it is not a single-processor machine, although I have not yet opened it up and checked for sure ;) It is dual-booted with Windows, and I am going to use the Intel performance counters to find the bandwidth limit under Windows/Linux. I also found a benchmark for Ubuntu after a bit of searching around and will see if it can provide more details. Here are the links for the benchmarks.

http://software.intel.com/en-us/articles/intel-performance-counter-monitor/
http://manpages.ubuntu.com/manpages/maverick/lmbench.8.html
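
The "Triad" number STREAM reports further down comes from a loop like the one in this sketch (illustrative only, not the benchmark's actual source; the real code also runs Copy/Scale/Add, validates the results, and times more carefully):

    #include <stdio.h>
    #include <sys/time.h>

    #define N 2000000                     /* same array size as the runs below */

    static double a[N], b[N], c[N];

    static double wtime(void)             /* wall clock in seconds */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void)
    {
        double scalar = 3.0, t, best = 1.0e30;
        int i, k;

        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
        for (k = 0; k < 50; k++) {        /* best of 50 trials, as STREAM does */
            t = wtime();
            for (i = 0; i < N; i++)
                a[i] = b[i] + scalar * c[i];   /* Triad: 2 flops per element */
            t = wtime() - t;
            if (t < best) best = t;
        }
        /* three 8-byte arrays cross the memory bus each trial */
        printf("Triad: %9.1f MB/s (check %g)\n",
               3.0 * 8.0 * N / best / 1.0e6, a[N / 2]);
        return 0;
    }

The reported rate is 3 arrays x 8 bytes x N elements divided by the best time, which matches how the MB/s columns below are computed.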
Hopefully the numbers from Windows and Ubuntu will match, and if not, maybe my Ubuntu configuration needs a bit of tweaking to get this right. I will keep you updated if I find something interesting. Thanks for all the helpful comments!

Vijay

On Thu, Feb 3, 2011 at 4:46 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>    Based on these numbers (that is, assuming these numbers are a correct
> accounting of how much memory bandwidth you can get from the system*) you
> essentially have a one-processor machine that they sold to you as an
> 8-processor machine for sparse matrix computation. The one-core run is
> using almost all the memory bandwidth; adding more cores to the
> computation helps very little because it is completely starved for memory
> bandwidth.
>
>    Barry
>
> * perhaps something in the OS is not configured correctly and thus not
> allowing access to all the memory bandwidth, but this seems unlikely.
>
> On Feb 3, 2011, at 4:29 PM, Vijay S. Mahadevan wrote:
>
>> Barry,
>>
>> The outputs are attached. I do not see a big difference from the
>> earlier results, as you mentioned.
>>
>> Let me know if there exists a similar benchmark that might help.
>>
>> Vijay
>>
>> On Thu, Feb 3, 2011 at 4:00 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>    Hmm, just running the basic version with mpiexec -n 2 processes isn't
>>> that useful, because there is nothing to make sure they are both running
>>> at exactly the same time.
>>>
>>>    I've attached a new version of BasicVersion.c that attempts to
>>> synchronize the operations in the two processes using MPI_Barrier();
>>> it is probably not a great way to do it, but better than nothing. Please
>>> try that one.
>>>
>>>     Thanks
>>>
>>>    Barry
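
The synchronization idea amounts to something like the following sketch (a guess at its shape, not the actual BasicVersion.c attachment): a barrier before and after each timed trial, so both processes stress the memory bus simultaneously. Compile with mpicc and run with mpiexec -n 2.

    #include <mpi.h>
    #include <stdio.h>

    #define N 2000000

    static double a[N], b[N], c[N];

    int main(int argc, char **argv)
    {
        double scalar = 3.0, t, best = 1.0e30;
        int i, k, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        for (k = 0; k < 50; k++) {
            MPI_Barrier(MPI_COMM_WORLD);   /* line the processes up before timing */
            t = MPI_Wtime();
            for (i = 0; i < N; i++)
                a[i] = b[i] + scalar * c[i];
            MPI_Barrier(MPI_COMM_WORLD);   /* ...and after, so the trials overlap */
            t = MPI_Wtime() - t;
            if (t < best) best = t;
        }
        if (rank == 0)   /* per-process rate; multiply by np for the aggregate */
            printf("Triad: %9.1f MB/s per process (check %g)\n",
                   3.0 * 8.0 * N / best / 1.0e6, a[N / 2]);
        MPI_Finalize();
        return 0;
    }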
>>> On Feb 3, 2011, at 1:41 PM, Vijay S. Mahadevan wrote:
>>>
>>>> Barry,
>>>>
>>>> Thanks for the quick reply. I ran the benchmark/stream/BasicVersion
>>>> for one and two processes, and the outputs are as follows:
>>>>
>>>> -n 1
>>>> -------------------------------------------------------------
>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>> -------------------------------------------------------------
>>>> Array size = 2000000, Offset = 0
>>>> Total memory required = 45.8 MB.
>>>> Each test is run 50 times, but only
>>>> the *best* time for each is used.
>>>> -------------------------------------------------------------
>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>> Each test below will take on the order of 2529 microseconds.
>>>>    (= 2529 clock ticks)
>>>> Increase the size of the arrays if this shows that
>>>> you are not getting at least 20 clock ticks per test.
>>>> -------------------------------------------------------------
>>>> WARNING -- The above is only a rough guideline.
>>>> For best results, please be sure you know the
>>>> precision of your system timer.
>>>> -------------------------------------------------------------
>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>> Copy:         10161.8510      0.0032       0.0031       0.0037
>>>> Scale:         9843.6177      0.0034       0.0033       0.0038
>>>> Add:          10656.7114      0.0046       0.0045       0.0053
>>>> Triad:        10799.0448      0.0046       0.0044       0.0054
>>>>
>>>> -n 2
>>>> -------------------------------------------------------------
>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>> -------------------------------------------------------------
>>>> Array size = 2000000, Offset = 0
>>>> Total memory required = 45.8 MB.
>>>> Each test is run 50 times, but only
>>>> the *best* time for each is used.
>>>> -------------------------------------------------------------
>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>> Each test below will take on the order of 4320 microseconds.
>>>>    (= 4320 clock ticks)
>>>> Increase the size of the arrays if this shows that
>>>> you are not getting at least 20 clock ticks per test.
>>>> -------------------------------------------------------------
>>>> WARNING -- The above is only a rough guideline.
>>>> For best results, please be sure you know the
>>>> precision of your system timer.
>>>> -------------------------------------------------------------
>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>> Copy:          5739.9704      0.0058       0.0056       0.0063
>>>> Scale:         5839.3617      0.0058       0.0055       0.0062
>>>> Add:           6116.9323      0.0081       0.0078       0.0085
>>>> Triad:         6021.0722      0.0084       0.0080       0.0088
>>>> -------------------------------------------------------------
>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>> -------------------------------------------------------------
>>>> Array size = 2000000, Offset = 0
>>>> Total memory required = 45.8 MB.
>>>> Each test is run 50 times, but only
>>>> the *best* time for each is used.
>>>> -------------------------------------------------------------
>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>> Each test below will take on the order of 2954 microseconds.
>>>>    (= 2954 clock ticks)
>>>> Increase the size of the arrays if this shows that
>>>> you are not getting at least 20 clock ticks per test.
>>>> -------------------------------------------------------------
>>>> WARNING -- The above is only a rough guideline.
>>>> For best results, please be sure you know the
>>>> precision of your system timer.
>>>> -------------------------------------------------------------
>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>> Copy:          6091.9448      0.0056       0.0053       0.0061
>>>> Scale:         5501.1775      0.0060       0.0058       0.0062
>>>> Add:           5960.4640      0.0084       0.0081       0.0087
>>>> Triad:         5936.2109      0.0083       0.0081       0.0089
>>>>
>>>> I do not have OpenMP installed, so I am not sure if that is what you
>>>> wanted when you said two threads. I also closed most of the
>>>> applications that were open before running these tests, so the numbers
>>>> should hopefully be accurate.
>>>>
>>>> Vijay
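
As a rough sanity check on those rates: DDR3-1333 moves 1333e6 transfers/s x 8 bytes = ~10.7 GB/s per channel, so the theoretical peak is about 21.3 GB/s if the 2x2GB/2x4GB DIMMs mentioned below run dual-channel (an assumption; the channel population is not stated in the thread). A single core already streams ~10.8 GB/s (Triad above), and two processes together reach only ~12 GB/s, consistent with a nearly saturated memory bus rather than a per-core limit.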
>>>> On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>>  Vijay,
>>>>>
>>>>>    Let's just look at a single embarrassingly parallel computation in
>>>>> the run; this computation has NO communication and uses NO MPI and NO
>>>>> synchronization between processes:
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flops                              --- Global ---  --- Stage ---   Total
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>  1 process
>>>>> VecMAXPY            3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>>>>>
>>>>>  2 processes
>>>>> VecMAXPY            3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>>>>>
>>>>>    The speedup is 1.7074e+01/1.3861e+01 = 2443./1983. = 1.23, which is
>>>>> terrible! Now why would it be so bad (remember, you cannot blame MPI)?
>>>>>
>>>>> 1) Other processes are running on the machine, sucking up memory
>>>>> bandwidth. Make sure no other compute tasks are running during this time.
>>>>>
>>>>> 2) The single-process run is able to use almost all of the hardware
>>>>> memory bandwidth, so introducing the second process cannot increase the
>>>>> performance much. This means this machine is terrible for parallelization
>>>>> of sparse iterative solvers.
>>>>>
>>>>> 3) The machine is somehow misconfigured (beats me how) so that while the
>>>>> one-process job doesn't use more than half of the memory bandwidth, when
>>>>> two processes are run the second process cannot utilize all that
>>>>> additional memory bandwidth.
>>>>>
>>>>>  In src/benchmarks/streams you can run make test and have it generate a
>>>>> report of how well the streams benchmark is able to utilize the memory
>>>>> bandwidth. Run that and send us the output (run with just 2 threads).
>>>>>
>>>>>    Barry
>>>>>
>>>>> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>>>>>
>>>>>> Matt,
>>>>>>
>>>>>> I apologize for the incomplete information. Find attached the
>>>>>> log_summary for all the cases.
>>>>>>
>>>>>> The dual quad-core system has 12 GB DDR3 SDRAM at 1333 MHz in a
>>>>>> 2x2GB/2x4GB configuration. I do not know how to decipher the memory
>>>>>> bandwidth from this information, but if you need anything more, do let
>>>>>> me know.
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Barry,
>>>>>>>>
>>>>>>>> Sorry about the delay in the reply. I did not have access to the
>>>>>>>> system to test out what you said, until now.
>>>>>>>>
>>>>>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>>>>>>>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>>>>>
>>>>>>>> processor       time
>>>>>>>> 1               114.2
>>>>>>>> 2                89.45
>>>>>>>> 4                81.01
>>>>>>>
>>>>>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything
>>>>>>> from this data.
>>>>>>> 2) Do you know the memory bandwidth characteristics of this machine?
>>>>>>> That is crucial, and you cannot begin to understand speedup on it
>>>>>>> until you do. Please look this up.
>>>>>>> 3) Worrying about specifics of the MPI implementation makes no sense
>>>>>>> until the basics are nailed down.
>>>>>>>
>>>>>>>    Matt
>>>>>>>
>>>>>>>>
>>>>>>>> The scaleup doesn't seem to be optimal, even with two processors. I am
>>>>>>>> wondering if the fault is in the MPI configuration itself. Are these
>>>>>>>> results what you would expect? I can also send you the log_summary
>>>>>>>> for all cases if that will help.
>>>>>>>>
>>>>>>>> Vijay
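
For the record, that timing table works out to a speedup of 114.2/89.45 ~ 1.28 on two processes and 114.2/81.01 ~ 1.41 on four, far from the ideal 2x and 4x, which is what prompts the bandwidth questions above.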
>>>>>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>
>>>>>>>>>> Barry,
>>>>>>>>>>
>>>>>>>>>> I understand what you are saying, but which example/options then is
>>>>>>>>>> the best one to compute the scalability on a multi-core machine? I
>>>>>>>>>> chose the nonlinear diffusion problem specifically because of its
>>>>>>>>>> inherent stiffness, which could probably provide noticeable
>>>>>>>>>> scalability on a multi-core system. From your experience, do you
>>>>>>>>>> think there is another example program that will demonstrate this
>>>>>>>>>> much more rigorously or clearly? Btw, I don't get good speedup even
>>>>>>>>>> for 2 processes with ex20.c, and that was the original motivation
>>>>>>>>>> for this thread.
>>>>>>>>>
>>>>>>>>>    Did you follow my instructions?
>>>>>>>>>
>>>>>>>>>    Barry
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Satish, I configured with --download-mpich now, without the
>>>>>>>>>> mpich-device. The results are given above. I will try with the
>>>>>>>>>> options you provided, although I don't entirely understand what they
>>>>>>>>>> mean, which kinda bugs me. Also, is OpenMPI the preferred
>>>>>>>>>> implementation on Ubuntu?
>>>>>>>>>>
>>>>>>>>>> Vijay
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>>    Ok, everything makes sense. Looks like you are using two-level
>>>>>>>>>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type
>>>>>>>>>>> redundant -mg_coarse_redundant_pc_type lu. This means it is solving
>>>>>>>>>>> the coarse grid problem redundantly on each process (each process is
>>>>>>>>>>> solving the entire coarse grid problem using LU factorization). The
>>>>>>>>>>> time for the factorization in the two-process case is
>>>>>>>>>>>
>>>>>>>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>>>>>
>>>>>>>>>>> which is 74 percent of the total solve time (and 82 percent of the
>>>>>>>>>>> flops). When 3/4 of the entire run is not parallel at all, you cannot
>>>>>>>>>>> expect much speedup. If you run with -snes_view it will display
>>>>>>>>>>> exactly the solver being used. You cannot expect to understand the
>>>>>>>>>>> performance if you don't understand what the solver is actually
>>>>>>>>>>> doing. Using a 20 by 20 by 20 coarse grid is generally a bad idea
>>>>>>>>>>> since the code spends most of the time there; stick with something
>>>>>>>>>>> like 5 by 5 by 5.
>>>>>>>>>>>
>>>>>>>>>>>  Suggest running with the default grid and -dmmg_nlevels 5; now the
>>>>>>>>>>> coarse solve will be a trivial percent of the run time.
>>>>>>>>>>>
>>>>>>>>>>>  You should get pretty good speedup for 2 processes, but not much
>>>>>>>>>>> better speedup for four processes because, as Matt noted, the
>>>>>>>>>>> computation is memory bandwidth limited;
>>>>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>>>>>>>>>> Note also that this is running multigrid, which is a fast solver but
>>>>>>>>>>> doesn't parallel scale as well as many slow algorithms. For example,
>>>>>>>>>>> if you run with -dmmg_nlevels 5 -pc_type jacobi you will get great
>>>>>>>>>>> speedup with 2 processors but crummy speed.
>>>>>>>>>>>
>>>>>>>>>>>  Barry
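
A back-of-the-envelope Amdahl's-law check of that 74 percent figure: with serial fraction s = 0.74, the speedup on p processes is at most 1/(s + (1 - s)/p), i.e. about 1/(0.74 + 0.26/2) ~ 1.15 on two processes, and never better than 1/0.74 ~ 1.35 on any number of processes, which matches the flat timings reported above.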
>>>>>>>>>>>
>>>>>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Barry,
>>>>>>>>>>>>
>>>>>>>>>>>> Please find attached the patch for the minor change to control the
>>>>>>>>>>>> number of elements from the command line for snes/ex20.c. I know
>>>>>>>>>>>> that this can be achieved with -grid_x etc. from the command line,
>>>>>>>>>>>> but I thought this just made the typing for the refinement process
>>>>>>>>>>>> a little easier. I apologize if there was any confusion.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2.
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  We need all the information from -log_summary to see what is
>>>>>>>>>>>>> going on.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Not sure what -grid 20 means, but don't expect any good parallel
>>>>>>>>>>>>> performance with less than at least 10,000 unknowns per process.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are the performance statistics for the 1- and 2-processor runs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20
>>>>>>>>>>>>>> -log_summary
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                          Max       Max/Min     Avg        Total
>>>>>>>>>>>>>> Time (sec):           8.452e+00   1.00000   8.452e+00
>>>>>>>>>>>>>> Objects:              1.470e+02   1.00000   1.470e+02
>>>>>>>>>>>>>> Flops:                5.045e+09   1.00000   5.045e+09  5.045e+09
>>>>>>>>>>>>>> Flops/sec:            5.969e+08   1.00000   5.969e+08  5.969e+08
>>>>>>>>>>>>>> MPI Messages:         0.000e+00   0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>> MPI Message Lengths:  0.000e+00   0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>> MPI Reductions:       4.440e+02   1.00000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20
>>>>>>>>>>>>>> -log_summary
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                          Max       Max/Min     Avg        Total
>>>>>>>>>>>>>> Time (sec):           7.851e+00   1.00000   7.851e+00
>>>>>>>>>>>>>> Objects:              2.000e+02   1.00000   2.000e+02
>>>>>>>>>>>>>> Flops:                4.670e+09   1.00580   4.657e+09  9.313e+09
>>>>>>>>>>>>>> Flops/sec:            5.948e+08   1.00580   5.931e+08  1.186e+09
>>>>>>>>>>>>>> MPI Messages:         7.965e+02   1.00000   7.965e+02  1.593e+03
>>>>>>>>>>>>>> MPI Message Lengths:  1.412e+07   1.00000   1.773e+04  2.824e+07
>>>>>>>>>>>>>> MPI Reductions:       1.046e+03   1.00000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not entirely sure I can make sense of those statistics, but
>>>>>>>>>>>>>> if there is something more you need, please feel free to let me
>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vijay
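
Two things can be read straight off those headers: the -n 2 run really did use two processes (MPI Messages total 1.593e+03, versus 0.000e+00 for -n 1), and the wall-clock speedup was only 8.452/7.851 ~ 1.08.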
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matt,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>>>>>>>>>> performance studies, but I didn't expect it to yield the same
>>>>>>>>>>>>>>>> cpu time as a single processor for snes/ex20; i.e., my runs with
>>>>>>>>>>>>>>>> 1 and 2 processors take approximately the same amount of time to
>>>>>>>>>>>>>>>> compute the solution. But I am currently configuring without
>>>>>>>>>>>>>>>> debugging symbols and shall let you know what that yields.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On a similar note, is there something extra that needs to be
>>>>>>>>>>>>>>>> done to make use of multi-core machines while using MPI? I am
>>>>>>>>>>>>>>>> not sure if this is even related to PETSc; it could be an MPI
>>>>>>>>>>>>>>>> configuration option that either I or the configure process is
>>>>>>>>>>>>>>>> missing. All ideas are much appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation.
>>>>>>>>>>>>>>> On most cheap multicore machines there is a single memory bus,
>>>>>>>>>>>>>>> and thus using more cores gains you very little extra
>>>>>>>>>>>>>>> performance. I still suspect you are not actually running in
>>>>>>>>>>>>>>> parallel, because you would usually see at least a small speedup.
>>>>>>>>>>>>>>> That is why I suggested looking at -log_summary, since it tells
>>>>>>>>>>>>>>> you how many processes were run and breaks down the time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    Matt
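
What MatMult does per nonzero is easy to see in a plain CSR kernel (a sketch of the general technique, not PETSc's actual AIJ implementation):

    /* y = A*x for a CSR matrix: each nonzero costs 2 flops (multiply + add)
     * but streams ~12 bytes of compulsory traffic (an 8-byte value plus a
     * 4-byte column index), so the flop rate is bounded by roughly
     * memory bandwidth / 6, no matter how many cores share the bus. */
    void csr_matmult(int nrows, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[colind[j]];  /* val[] and colind[] read once each */
            y[i] = sum;
        }
    }

At the ~10.8 GB/s Triad rate measured earlier in the thread, that bound works out to roughly 10.8e9 / 12 x 2 ~ 1.8 GFlop/s in total, which is why adding cores buys so little.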
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am trying to configure my petsc install with an MPI
>>>>>>>>>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>>>>>>>>>> running Ubuntu. But even though the configure/make process went
>>>>>>>>>>>>>>>>>> through without problems, the scalability of the programs
>>>>>>>>>>>>>>>>>> doesn't seem to reflect what I expected. My configure options
>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>>>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>>>>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>>>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>>>>>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>>>>>>>>>>>> programs with mpiexec (-n 1) and (-n 2), but they seem to be
>>>>>>>>>>>>>>>>>> taking approximately the same time, as noted from -log_summary.
>>>>>>>>>>>>>>>>>> If it helps, I've been testing with
>>>>>>>>>>>>>>>>>> snes/examples/tutorials/ex20.c for all purposes, with a custom
>>>>>>>>>>>>>>>>>> -grid parameter from the command line to control the number of
>>>>>>>>>>>>>>>>>> unknowns.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>>>>>>>>>> configuration, or if you need anything else to analyze the
>>>>>>>>>>>>>>>>>> problem, do let me know.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>
>>>>>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>
>>>>>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
>>>
>> <basicversion_np1.out><basicversion_np2.out>
>
