On Feb 3, 2011, at 5:31 PM, Vijay S. Mahadevan wrote:

> Barry,
>
> That sucks. I am sure that it is not a single processor machine,
> although I've not yet opened it up and checked it for sure ;)
  I didn't mean that it was literally a single processor machine, just
effectively one for iterative linear solvers.

  Barry

> It is dual booted with windows and I am going to use the Intel
> performance counters to find the bandwidth limit in windows/linux.
> Also, I did find a benchmark for Ubuntu after a bit of searching
> around and will try to see if it can provide more details. Here are
> the links for the benchmarks.
>
> http://software.intel.com/en-us/articles/intel-performance-counter-monitor/
> http://manpages.ubuntu.com/manpages/maverick/lmbench.8.html
>
> Hopefully the numbers from Windows and Ubuntu will match and if not,
> maybe my Ubuntu configuration needs a bit of tweaking to get this
> correct. I will keep you updated if I find something interesting.
> Thanks for all the helpful comments!
>
> Vijay
>
> On Thu, Feb 3, 2011 at 4:46 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>
>>   Based on these numbers (that is, assuming these numbers are a correct
>> accounting of how much memory bandwidth you can get from the system*), you
>> essentially have a one processor machine that they sold to you as an 8
>> processor machine for sparse matrix computation. The one core run is using
>> almost all the memory bandwidth; adding more cores to the computation helps
>> very little because it is completely starved for memory bandwidth.
>>
>>   Barry
>>
>> * Perhaps something in the OS is not configured correctly and thus not
>> allowing access to all the memory bandwidth, but this seems unlikely.
>>
>> On Feb 3, 2011, at 4:29 PM, Vijay S. Mahadevan wrote:
>>
>>> Barry,
>>>
>>> The outputs are attached. I do not see a big difference from the
>>> earlier results, as you mentioned.
>>>
>>> Let me know if there exists a similar benchmark that might help.
>>>
>>> Vijay
>>>
>>> On Thu, Feb 3, 2011 at 4:00 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>
>>>>   Hmm, just running the basic version with mpiexec -n 2 processes isn't
>>>> that useful because there is nothing to make sure they are both running
>>>> at exactly the same time.
>>>>
>>>>   I've attached a new version of BasicVersion.c that attempts to
>>>> synchronize the operations in the two processes using MPI_Barrier();
>>>> it is probably not a great way to do it, but better than nothing.
>>>> Please try that one.
>>>>
>>>>   Thanks
>>>>
>>>>   Barry
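(For concreteness, the barrier-synchronized timing Barry describes could look roughly like the sketch below. This is an illustrative reconstruction, not the actual BasicVersion.c he attached; the array size and the triad kernel are borrowed from the STREAM output that follows.)

  /* Sketch: run the same STREAM-style triad on every MPI process,
     with MPI_Barrier() lining the processes up so they contend for
     memory bandwidth at the same time. */
  #include <mpi.h>
  #include <stdio.h>

  #define N 2000000                      /* matches "Array size" below */

  static double a[N], b[N], c[N];

  int main(int argc, char **argv)
  {
    int    rank, i;
    double t0, t1, scalar = 3.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);         /* start everyone together */
    t0 = MPI_Wtime();
    for (i = 0; i < N; i++) a[i] = b[i] + scalar*c[i];   /* triad */
    t1 = MPI_Wtime();

    /* the triad touches 3 arrays of 8-byte words per pass */
    printf("[%d] Triad: %.1f MB/s\n", rank, 3.0*8.0*N/(t1 - t0)/1.0e6);
    MPI_Finalize();
    return 0;
  }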
>>>>
>>>> On Feb 3, 2011, at 1:41 PM, Vijay S. Mahadevan wrote:
>>>>
>>>>> Barry,
>>>>>
>>>>> Thanks for the quick reply. I ran the benchmark/stream/BasicVersion
>>>>> for one and two processes and the outputs are as follows:
>>>>>
>>>>> -n 1
>>>>> -------------------------------------------------------------
>>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>>> -------------------------------------------------------------
>>>>> Array size = 2000000, Offset = 0
>>>>> Total memory required = 45.8 MB.
>>>>> Each test is run 50 times, but only
>>>>> the *best* time for each is used.
>>>>> -------------------------------------------------------------
>>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>>> Each test below will take on the order of 2529 microseconds.
>>>>>    (= 2529 clock ticks)
>>>>> Increase the size of the arrays if this shows that
>>>>> you are not getting at least 20 clock ticks per test.
>>>>> -------------------------------------------------------------
>>>>> WARNING -- The above is only a rough guideline.
>>>>> For best results, please be sure you know the
>>>>> precision of your system timer.
>>>>> -------------------------------------------------------------
>>>>> Function      Rate (MB/s)   RMS time   Min time   Max time
>>>>> Copy:         10161.8510      0.0032     0.0031     0.0037
>>>>> Scale:         9843.6177      0.0034     0.0033     0.0038
>>>>> Add:          10656.7114      0.0046     0.0045     0.0053
>>>>> Triad:        10799.0448      0.0046     0.0044     0.0054
>>>>>
>>>>> -n 2
>>>>> -------------------------------------------------------------
>>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>>> -------------------------------------------------------------
>>>>> Array size = 2000000, Offset = 0
>>>>> Total memory required = 45.8 MB.
>>>>> Each test is run 50 times, but only
>>>>> the *best* time for each is used.
>>>>> -------------------------------------------------------------
>>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>>> Each test below will take on the order of 4320 microseconds.
>>>>>    (= 4320 clock ticks)
>>>>> Increase the size of the arrays if this shows that
>>>>> you are not getting at least 20 clock ticks per test.
>>>>> -------------------------------------------------------------
>>>>> WARNING -- The above is only a rough guideline.
>>>>> For best results, please be sure you know the
>>>>> precision of your system timer.
>>>>> -------------------------------------------------------------
>>>>> Function      Rate (MB/s)   RMS time   Min time   Max time
>>>>> Copy:          5739.9704      0.0058     0.0056     0.0063
>>>>> Scale:         5839.3617      0.0058     0.0055     0.0062
>>>>> Add:           6116.9323      0.0081     0.0078     0.0085
>>>>> Triad:         6021.0722      0.0084     0.0080     0.0088
>>>>> -------------------------------------------------------------
>>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>>> -------------------------------------------------------------
>>>>> Array size = 2000000, Offset = 0
>>>>> Total memory required = 45.8 MB.
>>>>> Each test is run 50 times, but only
>>>>> the *best* time for each is used.
>>>>> -------------------------------------------------------------
>>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>>> Each test below will take on the order of 2954 microseconds.
>>>>>    (= 2954 clock ticks)
>>>>> Increase the size of the arrays if this shows that
>>>>> you are not getting at least 20 clock ticks per test.
>>>>> -------------------------------------------------------------
>>>>> WARNING -- The above is only a rough guideline.
>>>>> For best results, please be sure you know the
>>>>> precision of your system timer.
>>>>> -------------------------------------------------------------
>>>>> Function      Rate (MB/s)   RMS time   Min time   Max time
>>>>> Copy:          6091.9448      0.0056     0.0053     0.0061
>>>>> Scale:         5501.1775      0.0060     0.0058     0.0062
>>>>> Add:           5960.4640      0.0084     0.0081     0.0087
>>>>> Triad:         5936.2109      0.0083     0.0081     0.0089
>>>>>
>>>>> I do not have OpenMP installed, so I am not sure if that is what you
>>>>> wanted when you said two threads. I also closed most of the
>>>>> applications that were open before running these tests, so the
>>>>> results should hopefully be accurate.
>>>>>
>>>>> Vijay
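(A quick way to read these numbers: the per-kernel byte count below is standard STREAM accounting, and the rest is plain arithmetic on the figures above.)

  Triad (a[i] = b[i] + s*c[i]) moves three 8-byte arrays per pass:
    bytes/pass  = 3 * 8 * 2000000        = 48 MB
    1 process   : 48 MB / 0.0044 s       ~ 10.9 GB/s
    2 processes : ~6.0 GB/s per process  ~ 12.0 GB/s aggregate

So the second process buys only about 10 percent more aggregate bandwidth: one core already saturates the bus, which is exactly the conclusion Barry draws above.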
>>>>>
>>>>> On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>
>>>>>>   Vijay
>>>>>>
>>>>>>   Let's just look at a single embarrassingly parallel computation in the
>>>>>> run; this computation has NO communication and uses NO MPI and NO
>>>>>> synchronization between processes:
>>>>>>
>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>> Event       Count      Time (sec)    Flops                            --- Global ---  --- Stage ---   Total
>>>>>>            Max Ratio  Max     Ratio  Max   Ratio  Mess  Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>> 1 process
>>>>>> VecMAXPY    3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>>>>>>
>>>>>> 2 processes
>>>>>> VecMAXPY    3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>>>>>>
>>>>>>   The speedup is 1.7074e+01/1.3861e+01 = 2443/1983 = 1.23, which is
>>>>>> terrible! Now why would it be so bad? (Remember, you cannot blame MPI.)
>>>>>>
>>>>>> 1) Other processes are running on the machine, sucking up memory
>>>>>> bandwidth. Make sure no other compute tasks are running during this time.
>>>>>>
>>>>>> 2) The single process run is able to use almost all of the hardware
>>>>>> memory bandwidth, so introducing the second process cannot increase the
>>>>>> performance much. This means this machine is terrible for
>>>>>> parallelization of sparse iterative solvers.
>>>>>>
>>>>>> 3) The machine is somehow misconfigured (beats me how) so that while the
>>>>>> one process job doesn't use more than half of the memory bandwidth, when
>>>>>> two processes are run the second process cannot utilize all that
>>>>>> additional memory bandwidth.
>>>>>>
>>>>>>   In src/benchmarks/streams you can run make test and have it generate a
>>>>>> report of how the streams benchmark is able to utilize the memory
>>>>>> bandwidth. Run that and send us the output (run with just 2 threads).
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>>>>>>
>>>>>>> Matt,
>>>>>>>
>>>>>>> I apologize for the incomplete information. Find attached the
>>>>>>> log_summary for all the cases.
>>>>>>>
>>>>>>> The dual quad-core system has 12 GB DDR3 SDRAM at 1333 MHz in a
>>>>>>> 2x2GB/2x4GB configuration. I do not know how to decipher the memory
>>>>>>> bandwidth from this information, but if you need anything more, do let
>>>>>>> me know.
>>>>>>>
>>>>>>> Vijay
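(A rough yardstick for deciphering that spec: the per-channel figure is standard DDR3 arithmetic, but whether this particular board actually runs both channels is an assumption.)

  DDR3-1333 transfers 8 bytes per channel per transfer:
    per channel  : 1333 MT/s * 8 B ~ 10.7 GB/s
    dual channel : 2 * 10.7       ~ 21.3 GB/s theoretical peak

The ~10.8 GB/s STREAM result above sits almost exactly at one channel's peak, so either the machine is effectively running on one channel (the mixed 2 GB/4 GB DIMM pairing could defeat interleaving) or sustained bandwidth is simply about half of peak; either reading is consistent with Barry's options 2 and 3 above.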
>>>>>>>
>>>>>>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Barry,
>>>>>>>>>
>>>>>>>>> Sorry about the delay in the reply. I did not have access to the
>>>>>>>>> system to test out what you said, until now.
>>>>>>>>>
>>>>>>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>>>>>>>>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>>>>>>
>>>>>>>>> processors   time
>>>>>>>>>          1   114.2
>>>>>>>>>          2   89.45
>>>>>>>>>          4   81.01
>>>>>>>>
>>>>>>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything
>>>>>>>> from this data.
>>>>>>>>
>>>>>>>> 2) Do you know the memory bandwidth characteristics of this machine?
>>>>>>>> That is crucial, and you cannot begin to understand speedup on it
>>>>>>>> until you do. Please look this up.
>>>>>>>>
>>>>>>>> 3) Worrying about specifics of the MPI implementation makes no sense
>>>>>>>> until the basics are nailed down.
>>>>>>>>
>>>>>>>>    Matt
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The speedup doesn't seem to be optimal, even with two processors. I am
>>>>>>>>> wondering if the fault is in the MPI configuration itself. Are these
>>>>>>>>> results as you would expect? I can also send you the log_summary for
>>>>>>>>> all cases if that will help.
>>>>>>>>>
>>>>>>>>> Vijay
>>>>>>>>>
>>>>>>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>
>>>>>>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>
>>>>>>>>>>> Barry,
>>>>>>>>>>>
>>>>>>>>>>> I understand what you are saying, but which example/options then is
>>>>>>>>>>> the best one to compute the scalability on a multi-core machine? I
>>>>>>>>>>> chose the nonlinear diffusion problem specifically because of its
>>>>>>>>>>> inherent stiffness, which could probably provide noticeable
>>>>>>>>>>> scalability on a multi-core system. From your experience, do you
>>>>>>>>>>> think there is another example program that will demonstrate this
>>>>>>>>>>> much more rigorously or clearly? Btw, I don't get good speedup even
>>>>>>>>>>> for 2 processes with ex20.c, and that was the original motivation
>>>>>>>>>>> for this thread.
>>>>>>>>>>
>>>>>>>>>>    Did you follow my instructions?
>>>>>>>>>>
>>>>>>>>>>    Barry
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Satish, I configured with --download-mpich now, without the
>>>>>>>>>>> mpich-device. The results are given above. I will try with the
>>>>>>>>>>> options you provided, although I don't entirely understand what
>>>>>>>>>>> they mean, which kinda bugs me. Also, is OpenMPI the preferred
>>>>>>>>>>> implementation in Ubuntu?
>>>>>>>>>>>
>>>>>>>>>>> Vijay
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>    Ok, everything makes sense. Looks like you are using two level
>>>>>>>>>>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type
>>>>>>>>>>>> redundant -mg_coarse_redundant_pc_type lu. This means it is solving
>>>>>>>>>>>> the coarse grid problem redundantly on each process (each process is
>>>>>>>>>>>> solving the entire coarse grid problem using LU factorization). The
>>>>>>>>>>>> time for the factorization (in the two process case) is
>>>>>>>>>>>>
>>>>>>>>>>>> MatLUFactorNum    14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>>>>>>> MatILUFactorSym    7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>>>>>>
>>>>>>>>>>>> which is 74 percent of the total solve time (and 84 percent of the
>>>>>>>>>>>> flops). When 3/4 of the entire run is not parallel at all, you cannot
>>>>>>>>>>>> expect much speedup. If you run with -snes_view it will display
>>>>>>>>>>>> exactly the solver being used. You cannot expect to understand the
>>>>>>>>>>>> performance if you don't understand what the solver is actually
>>>>>>>>>>>> doing. Using a 20 by 20 by 20 coarse grid is generally a bad idea
>>>>>>>>>>>> since the code spends most of the time there; stick with something
>>>>>>>>>>>> like 5 by 5 by 5.
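(To put a rough number on that advice — this is a standard scaling estimate for sparse direct solvers, not something measured in this thread: LU on an n x n x n grid costs between roughly O(n^6) flops with a nested dissection ordering and O(n^7) with a banded/natural ordering, so)

  factorization work, 20^3 versus 5^3 coarse grid:
    nested dissection : (20/5)^6 = 4^6 ~  4,100x
    banded ordering   : (20/5)^7 = 4^7 ~ 16,000x

and with a redundant coarse solve, every process repeats that work.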
>>>>>>>>>>>>
>>>>>>>>>>>>    Suggest running with the default grid and -dmmg_nlevels 5; now
>>>>>>>>>>>> the coarse solve will be a trivial percent of the run time.
>>>>>>>>>>>>
>>>>>>>>>>>>    You should get pretty good speedup for 2 processes but not much
>>>>>>>>>>>> better speedup for four processes because, as Matt noted, the
>>>>>>>>>>>> computation is memory bandwidth limited;
>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>>>>>>>>>>> Note also that this is running multigrid, which is a fast solver but
>>>>>>>>>>>> doesn't scale as well in parallel as many slower algorithms. For
>>>>>>>>>>>> example, if you run -dmmg_nlevels 5 -pc_type jacobi you will get
>>>>>>>>>>>> great speedup with 2 processes but crummy overall speed.
>>>>>>>>>>>>
>>>>>>>>>>>>    Barry
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Barry,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please find attached the patch for the minor change to control the
>>>>>>>>>>>>> number of elements from the command line for snes/ex20.c. I know
>>>>>>>>>>>>> that this can be achieved with -grid_x etc. from the command line,
>>>>>>>>>>>>> but thought this just made the typing for the refinement process a
>>>>>>>>>>>>> little easier. I apologize if there was any confusion.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2.
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    We need all the information from -log_summary to see what is
>>>>>>>>>>>>>> going on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    Not sure what -grid 20 means, but don't expect any good parallel
>>>>>>>>>>>>>> performance with less than at least 10,000 unknowns per process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    Barry
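(If -grid 20 means a 20 x 20 x 20 mesh — an assumption, since the flag comes from Vijay's local patch — the whole problem has only 20^3 = 8,000 unknowns before it is split, already under Barry's 10,000-per-process yardstick; on two processes that is roughly 4,000 unknowns each.)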
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's the performance statistic on 1 and 2 processor runs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                          Max       Max/Min        Avg      Total
>>>>>>>>>>>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>>>>>>>>>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>>>>>>>>>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>>>>>>>>>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>>>>>>>>>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>>> MPI Reductions:       4.440e+02      1.00000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                          Max       Max/Min        Avg      Total
>>>>>>>>>>>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>>>>>>>>>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>>>>>>>>>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>>>>>>>>>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>>>>>>>>>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>>>>>>>>>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>>>>>>>>>>>>> MPI Reductions:       1.046e+03      1.00000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am not entirely sure if I can make sense out of that statistic,
>>>>>>>>>>>>>>> but if there is something more you need, please feel free to let
>>>>>>>>>>>>>>> me know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Vijay
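(Plain arithmetic on the two headers above:)

  time  : 8.452 s   -> 7.851 s      speedup = 8.452/7.851 ~ 1.08
  flops : 5.045e+09 -> 9.313e+09    ~ 1.85x total work on 2 processes

Two processes do nearly twice the flops to finish only 8 percent sooner, consistent with the redundant coarse-grid LU Barry identifies above, where each process factors the entire coarse problem.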
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Matt,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>>>>>>>>>>> performance studies, but I didn't expect it to yield the same cpu
>>>>>>>>>>>>>>>>> time as a single processor for snes/ex20; i.e., my runs with 1 and
>>>>>>>>>>>>>>>>> 2 processors take approximately the same amount of time for the
>>>>>>>>>>>>>>>>> computation of the solution. But I am currently configuring without
>>>>>>>>>>>>>>>>> debugging symbols and shall let you know what that yields.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On a similar note, is there something extra that needs to be done
>>>>>>>>>>>>>>>>> to make use of multi-core machines while using MPI? I am not sure
>>>>>>>>>>>>>>>>> if this is even related to PETSc, but it could be an MPI
>>>>>>>>>>>>>>>>> configuration option that maybe either I or the configure process
>>>>>>>>>>>>>>>>> is missing. All ideas are much appreciated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On
>>>>>>>>>>>>>>>> most cheap multicore machines there is a single memory bus, and thus
>>>>>>>>>>>>>>>> using more cores gains you very little extra performance. I still
>>>>>>>>>>>>>>>> suspect you are not actually running in parallel, because you usually
>>>>>>>>>>>>>>>> see a small speedup. That is why I suggested looking at -log_summary,
>>>>>>>>>>>>>>>> since it tells you how many processes were run and breaks down the
>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am trying to configure my petsc install with an MPI
>>>>>>>>>>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>>>>>>>>>>> running Ubuntu. But even though the configure/make process went
>>>>>>>>>>>>>>>>>>> through without problems, the scalability of the programs doesn't
>>>>>>>>>>>>>>>>>>> seem to reflect what I expected. My configure options are
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>>>>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>>>>>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>>>>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>>>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>>>>>>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>>>>>>>>>>>>> programs with mpiexec (-n 1) and (-n 2), but they seem to be
>>>>>>>>>>>>>>>>>>> taking approximately the same time, as noted from -log_summary.
>>>>>>>>>>>>>>>>>>> If it helps, I've been testing with snes/examples/tutorials/ex20.c
>>>>>>>>>>>>>>>>>>> for all purposes, with a custom -grid parameter from the command
>>>>>>>>>>>>>>>>>>> line to control the number of unknowns.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>>>>>>>>>>> configuration, or if you need anything else to analyze the
>>>>>>>>>>>>>>>>>>> problem, do let me know.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>>>>>>>>> which their experiments lead.
>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>
>>>>>>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>> which their experiments lead.
>>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
>>>
>>> <basicversion_np1.out><basicversion_np2.out>
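(For reference, the performance build Matt asks for in the original exchange would drop the debugging settings from the configure line quoted above. This is a sketch: the optimization flags and the retained packages are illustrative choices, not prescribed anywhere in the thread.)

  ./configure --with-debugging=0 --COPTFLAGS="-O2" --CXXOPTFLAGS="-O2" \
      --download-f-blas-lapack=1 --download-mpich=1 --with-clanguage=C++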
