On Feb 3, 2011, at 5:31 PM, Vijay S. Mahadevan wrote:

> Barry,
> 
> That sucks. I am sure that it is not a single processor machine,
> although I've not yet opened it up to check ;)

  I didn't mean that it was literally a single processor machine, just 
that it effectively is one for iterative linear solvers.
   Barry

> It is
> dual-booted with Windows, and I am going to use the Intel performance
> counters to find the bandwidth limit under Windows/Linux. Also, I did
> find a benchmark for Ubuntu after a bit of searching around and will
> try to see if it can provide more details. Here are the links for the
> benchmarks.
> 
> http://software.intel.com/en-us/articles/intel-performance-counter-monitor/
> http://manpages.ubuntu.com/manpages/maverick/lmbench.8.html
> 
> Hopefully the numbers from Windows and Ubuntu will match, and if not,
> maybe my Ubuntu configuration needs a bit of tweaking to get this
> correct. I will keep you updated if I find something interesting.
> Thanks for all the helpful comments!
> 
> Vijay
> 
> On Thu, Feb 3, 2011 at 4:46 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>> 
>>   Based on these numbers (that is, assuming these numbers are a correct 
>> accounting of how much memory bandwidth you can get from the system*), 
>> for sparse matrix computation you essentially have a one processor 
>> machine that they sold to you as an 8 processor machine. The one-core 
>> run is already using almost all the memory bandwidth (the triad rate 
>> goes from about 10.8 GB/s on one process to only about 12 GB/s 
>> aggregate on two), so adding more cores to the computation helps very 
>> little because each one is starved for memory bandwidth.
>> 
>>   Barry
>> 
>> * perhaps something in the OS is not configured correctly and thus not 
>> allowing access to all the memory bandwidth, but this seems unlikely.
>> 
>> On Feb 3, 2011, at 4:29 PM, Vijay S. Mahadevan wrote:
>> 
>>> Barry,
>>> 
>>> The outputs are attached. I do not see a big difference from the
>>> earlier results, as you mentioned.
>>> 
>>> Let me know if there exists a similar benchmark that might help.
>>> 
>>> Vijay
>>> 
>>> On Thu, Feb 3, 2011 at 4:00 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>> 
>>>>   Hmm, just running the basic version with mpiexec -n 2 isn't that 
>>>> useful, because there is nothing to make sure both processes are 
>>>> running at exactly the same time.
>>>> 
>>>>   I've attached a new version of BasicVersion.c that attempts to 
>>>> synchronize the operations in the two processes using MPI_Barrier(); 
>>>> it is probably not a great way to do it, but better than nothing. 
>>>> Please try that one.
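>>>> 
>>>>   The attachment isn't reproduced here, but the idea is roughly the 
>>>> following minimal sketch (illustrative only; the array size, trial 
>>>> count, and reporting in the actual BasicVersion.c may differ):
>>>> 
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> 
>>>> #define N 2000000
>>>> 
>>>> int main(int argc, char **argv)
>>>> {
>>>>   double *a, *b, *c, t, best = 1.0e30;
>>>>   int    i, trial;
>>>> 
>>>>   MPI_Init(&argc, &argv);
>>>>   a = (double *)malloc(N*sizeof(double));
>>>>   b = (double *)malloc(N*sizeof(double));
>>>>   c = (double *)malloc(N*sizeof(double));
>>>>   for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
>>>> 
>>>>   for (trial = 0; trial < 50; trial++) {
>>>>     /* line the processes up so they stream memory simultaneously */
>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>     t = MPI_Wtime();
>>>>     for (i = 0; i < N; i++) a[i] = b[i] + 3.0*c[i]; /* STREAM triad */
>>>>     t = MPI_Wtime() - t;
>>>>     /* charge every process the slowest process's time for this trial */
>>>>     MPI_Allreduce(MPI_IN_PLACE, &t, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
>>>>     if (t < best) best = t;
>>>>   }
>>>>   /* the triad moves three arrays of 8-byte doubles per trial */
>>>>   printf("Triad: %.1f MB/s per process\n", 3.0*8.0*N/best/1.0e6);
>>>>   free(a); free(b); free(c);
>>>>   MPI_Finalize();
>>>>   return 0;
>>>> }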
>>>> 
>>>>    Thanks
>>>> 
>>>> 
>>>>   Barry
>>>> 
>>>> 
>>>> On Feb 3, 2011, at 1:41 PM, Vijay S. Mahadevan wrote:
>>>> 
>>>>> Barry,
>>>>> 
>>>>> Thanks for the quick reply. I ran the benchmark/stream/BasicVersion
>>>>> for one and two processes, and the outputs are as follows:
>>>>> 
>>>>> -n 1
>>>>> -------------------------------------------------------------
>>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>>> -------------------------------------------------------------
>>>>> Array size = 2000000, Offset = 0
>>>>> Total memory required = 45.8 MB.
>>>>> Each test is run 50 times, but only
>>>>> the *best* time for each is used.
>>>>> -------------------------------------------------------------
>>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>>> Each test below will take on the order of 2529 microseconds.
>>>>>   (= 2529 clock ticks)
>>>>> Increase the size of the arrays if this shows that
>>>>> you are not getting at least 20 clock ticks per test.
>>>>> -------------------------------------------------------------
>>>>> WARNING -- The above is only a rough guideline.
>>>>> For best results, please be sure you know the
>>>>> precision of your system timer.
>>>>> -------------------------------------------------------------
>>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>>> Copy:       10161.8510       0.0032       0.0031       0.0037
>>>>> Scale:       9843.6177       0.0034       0.0033       0.0038
>>>>> Add:        10656.7114       0.0046       0.0045       0.0053
>>>>> Triad:      10799.0448       0.0046       0.0044       0.0054
>>>>> 
>>>>> -n 2
>>>>> -------------------------------------------------------------
>>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>>> -------------------------------------------------------------
>>>>> Array size = 2000000, Offset = 0
>>>>> Total memory required = 45.8 MB.
>>>>> Each test is run 50 times, but only
>>>>> the *best* time for each is used.
>>>>> -------------------------------------------------------------
>>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>>> Each test below will take on the order of 4320 microseconds.
>>>>>   (= 4320 clock ticks)
>>>>> Increase the size of the arrays if this shows that
>>>>> you are not getting at least 20 clock ticks per test.
>>>>> -------------------------------------------------------------
>>>>> WARNING -- The above is only a rough guideline.
>>>>> For best results, please be sure you know the
>>>>> precision of your system timer.
>>>>> -------------------------------------------------------------
>>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>>> Copy:        5739.9704       0.0058       0.0056       0.0063
>>>>> Scale:       5839.3617       0.0058       0.0055       0.0062
>>>>> Add:         6116.9323       0.0081       0.0078       0.0085
>>>>> Triad:       6021.0722       0.0084       0.0080       0.0088
>>>>> -------------------------------------------------------------
>>>>> This system uses 8 bytes per DOUBLE PRECISION word.
>>>>> -------------------------------------------------------------
>>>>> Array size = 2000000, Offset = 0
>>>>> Total memory required = 45.8 MB.
>>>>> Each test is run 50 times, but only
>>>>> the *best* time for each is used.
>>>>> -------------------------------------------------------------
>>>>> Your clock granularity/precision appears to be 1 microseconds.
>>>>> Each test below will take on the order of 2954 microseconds.
>>>>>   (= 2954 clock ticks)
>>>>> Increase the size of the arrays if this shows that
>>>>> you are not getting at least 20 clock ticks per test.
>>>>> -------------------------------------------------------------
>>>>> WARNING -- The above is only a rough guideline.
>>>>> For best results, please be sure you know the
>>>>> precision of your system timer.
>>>>> -------------------------------------------------------------
>>>>> Function      Rate (MB/s)   RMS time     Min time     Max time
>>>>> Copy:        6091.9448       0.0056       0.0053       0.0061
>>>>> Scale:       5501.1775       0.0060       0.0058       0.0062
>>>>> Add:         5960.4640       0.0084       0.0081       0.0087
>>>>> Triad:       5936.2109       0.0083       0.0081       0.0089
>>>>> 
>>>>> I do not have OpenMP installed, so I am not sure if that is what you
>>>>> meant by two threads. I also closed most of the applications that
>>>>> were open before running these tests, so the numbers should
>>>>> hopefully be accurate.
>>>>> 
>>>>> Vijay
>>>>> 
>>>>> 
>>>>> On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>> 
>>>>>>  Vijay,
>>>>>> 
>>>>>>   Let's just look at a single embarrassingly parallel computation in the 
>>>>>> run; this computation has NO communication and uses NO MPI and NO 
>>>>>> synchronization between processes:
>>>>>> 
>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>> 
>>>>>>  1 process
>>>>>> VecMAXPY            3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>>>>>> 
>>>>>>  2 processes
>>>>>> VecMAXPY            3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>>>>>> 
>>>>>>   The speedup is 1.7074e+01/1.3861e+01 = 2443/1983 = 1.23, which is 
>>>>>> terrible! Now why would it be so bad? (Remember, you cannot blame MPI.)
>>>>>> 
>>>>>> 1) Other processes are running on the machine, sucking up memory 
>>>>>> bandwidth. Make sure no other compute tasks are running during this time.
>>>>>> 
>>>>>> 2) The single process run is able to use almost all of the hardware 
>>>>>> memory bandwidth, so introducing the second process cannot increase the 
>>>>>> performance much (see the sketch below for why these kernels are 
>>>>>> bandwidth bound). This means this machine is terrible for 
>>>>>> parallelization of sparse iterative solvers.
>>>>>> 
>>>>>> 3) The machine is somehow misconfigured (beats me how) so that while 
>>>>>> the one process job doesn't use more than half of the memory bandwidth, 
>>>>>> when two processes are run the second process cannot utilize all that 
>>>>>> additional memory bandwidth.
>>>>>> 
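>>>>>>   To see why a kernel like VecMAXPY is bandwidth bound, here is a 
>>>>>> minimal sketch of the computation it performs (schematic; the names are 
>>>>>> illustrative, and PETSc's actual implementation is blocked and 
>>>>>> unrolled, but the memory traffic is the same). Per vector entry it 
>>>>>> moves 8*(nv+2) bytes to do 2*nv flops, i.e. at least 4 bytes per flop, 
>>>>>> so at your measured ~10.8 GB/s triad rate roughly 2.7 Gflop/s is the 
>>>>>> ceiling no matter how many cores you use; the 1983 Mflop/s above is 
>>>>>> already most of that.
>>>>>> 
>>>>>> /* y <- y + sum_j alpha[j] * x_j : the kernel behind VecMAXPY.
>>>>>>    Each entry reads y and nv x-vectors and writes y back, i.e.
>>>>>>    8*(nv+2) bytes of traffic for 2*nv flops. */
>>>>>> void maxpy(int n, int nv, double *y, const double *alpha,
>>>>>>            const double **x)
>>>>>> {
>>>>>>   int i, j;
>>>>>>   for (i = 0; i < n; i++) {
>>>>>>     double sum = y[i];
>>>>>>     for (j = 0; j < nv; j++) sum += alpha[j]*x[j][i];
>>>>>>     y[i] = sum;
>>>>>>   }
>>>>>> }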
>>>>>>  In src/benchmarks/streams you can run make test and have it generate a 
>>>>>> report of how the streams benchmark is able to utilize the memory 
>>>>>> bandwidth. Run that and send us the output (run with just 2 threads).
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>> 
>>>>>> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>>>>>> 
>>>>>>> Matt,
>>>>>>> 
>>>>>>> I apologize for the incomplete information. Find attached the
>>>>>>> log_summary for all the cases.
>>>>>>> 
>>>>>>> The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a
>>>>>>> 2x2GB/2x4GB configuration. I do not know how to deduce the memory
>>>>>>> bandwidth from this information, but if you need anything more, do
>>>>>>> let me know.
>>>>>>> 
>>>>>>> Vijay
>>>>>>> 
>>>>>>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> 
>>>>>>> wrote:
>>>>>>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Barry,
>>>>>>>>> 
>>>>>>>>> Sorry about the delay in the reply. I did not have access to the
>>>>>>>>> system to test out what you said, until now.
>>>>>>>>> 
>>>>>>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>>>>>>>>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>>>>>> 
>>>>>>>>> processors    time (sec)
>>>>>>>>> 1             114.2
>>>>>>>>> 2              89.45
>>>>>>>>> 4              81.01
>>>>>>>> 
>>>>>>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything 
>>>>>>>> from
>>>>>>>> this data.
>>>>>>>> 2) Do you know the memory bandwidth characteristics of this machine? 
>>>>>>>> That is
>>>>>>>> crucial and
>>>>>>>>     you cannot begin to understand speedup on it until you do. Please 
>>>>>>>> look
>>>>>>>> this up.
>>>>>>>> 3) Worrying about specifics of the MPI implementation makes no sense 
>>>>>>>> until
>>>>>>>> the basics are nailed down.
>>>>>>>>    Matt
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The speedup (about 1.28 on two processes, 1.41 on four) doesn't seem
>>>>>>>>> optimal. I am wondering if the fault is in the MPI configuration
>>>>>>>>> itself. Are these results what you would expect? I can also send you
>>>>>>>>> the log_summary for all cases if that will help.
>>>>>>>>> 
>>>>>>>>> Vijay
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>> 
>>>>>>>>>>> Barry,
>>>>>>>>>>> 
>>>>>>>>>>> I understand what you are saying, but which example/options, then,
>>>>>>>>>>> are best for measuring scalability on a multi-core machine? I chose
>>>>>>>>>>> the nonlinear diffusion problem specifically because of its
>>>>>>>>>>> inherent stiffness, which could probably provide noticeable
>>>>>>>>>>> scalability on a multi-core system. From your experience, do you
>>>>>>>>>>> think there is another example program that will demonstrate this
>>>>>>>>>>> much more rigorously or clearly? Btw, I don't get good speedup even
>>>>>>>>>>> for 2 processes with ex20.c, and that was the original motivation
>>>>>>>>>>> for this thread.
>>>>>>>>>> 
>>>>>>>>>>   Did you follow my instructions?
>>>>>>>>>> 
>>>>>>>>>>   Barry
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Satish, I configured with --download-mpich now, without the
>>>>>>>>>>> mpich-device. The results are given above. I will try the options
>>>>>>>>>>> you provided, although I don't entirely understand what they mean,
>>>>>>>>>>> which kinda bugs me. Also, is OpenMPI the preferred implementation
>>>>>>>>>>> on Ubuntu?
>>>>>>>>>>> 
>>>>>>>>>>> Vijay
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>   Ok, everything makes sense. It looks like you are using two-level
>>>>>>>>>>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type 
>>>>>>>>>>>> redundant -mg_coarse_redundant_pc_type lu. This means it is solving 
>>>>>>>>>>>> the coarse grid problem redundantly on each process (each process 
>>>>>>>>>>>> is doing the entire coarse grid solve using LU factorization). The 
>>>>>>>>>>>> time for the factorization (in the two process case) is
>>>>>>>>>>>> 
>>>>>>>>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>>>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>>>>>> 
>>>>>>>>>>>> which is 74 percent of the total solve time (and 82 percent of the 
>>>>>>>>>>>> flops). When three quarters of the entire run is not parallel at 
>>>>>>>>>>>> all, you cannot expect much speedup. If you run with -snes_view it 
>>>>>>>>>>>> will display exactly the solver being used; you cannot expect to 
>>>>>>>>>>>> understand the performance if you don't understand what the solver 
>>>>>>>>>>>> is actually doing. Using a 20 by 20 by 20 coarse grid is generally 
>>>>>>>>>>>> a bad idea since the code spends most of the time there; stick 
>>>>>>>>>>>> with something like 5 by 5 by 5.
>>>>>>>>>>>> 
>>>>>>>>>>>>  Suggest running with the default grid and -dmmg_nlevels 5; now 
>>>>>>>>>>>> the coarse solve will be a trivial percentage of the run time.
>>>>>>>>>>>> 
>>>>>>>>>>>>  You should get pretty good speedup for 2 processes but not much 
>>>>>>>>>>>> better speedup for four processes because, as Matt noted, the 
>>>>>>>>>>>> computation is memory bandwidth limited;
>>>>>>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>>>>>>>>>>> Note also that this is running multigrid, which is a fast solver 
>>>>>>>>>>>> but doesn't scale in parallel as well as many slow algorithms. For 
>>>>>>>>>>>> example, if you run -dmmg_nlevels 5 -pc_type jacobi you will get 
>>>>>>>>>>>> great speedup with 2 processes but crummy overall speed.
>>>>>>>>>>>> 
>>>>>>>>>>>>  Barry
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Barry,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Please find attached the patch with the minor change to control
>>>>>>>>>>>>> the number of elements from the command line for snes/ex20.c. I
>>>>>>>>>>>>> know that this can be achieved with -grid_x etc. from the command
>>>>>>>>>>>>> line, but I thought this just made the typing for the refinement
>>>>>>>>>>>>> process a little easier. I apologize if there was any confusion.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2.
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  We need all the information from -log_summary to see what is 
>>>>>>>>>>>>>> going
>>>>>>>>>>>>>> on.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  Not sure what -grid 20 means, but don't expect any good parallel
>>>>>>>>>>>>>> performance with fewer than about 10,000 unknowns per process.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Here's the performance statistic on 1 and 2 processor runs.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 
>>>>>>>>>>>>>>> 20
>>>>>>>>>>>>>>> -log_summary
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>>>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>>>>>>>>>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>>>>>>>>>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>>>>>>>>>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>>>>>>>>>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>>>>>>> MPI Reductions:       4.440e+02      1.00000
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 
>>>>>>>>>>>>>>> 20
>>>>>>>>>>>>>>> -log_summary
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>>>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>>>>>>>>>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>>>>>>>>>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>>>>>>>>>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>>>>>>>>>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>>>>>>>>>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>>>>>>>>>>>>> MPI Reductions:       1.046e+03      1.00000
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am not entirely sure I can make sense of those statistics,
>>>>>>>>>>>>>>> but if there is something more you need, please feel free to
>>>>>>>>>>>>>>> let me know.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan
>>>>>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Matt,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>>>>>>>>>>> performance studies, but I didn't expect it to yield the same
>>>>>>>>>>>>>>>>> CPU time as a single processor for snes/ex20; i.e., my runs
>>>>>>>>>>>>>>>>> with 1 and 2 processors take approximately the same amount of
>>>>>>>>>>>>>>>>> time to compute the solution. But I am currently configuring
>>>>>>>>>>>>>>>>> without debugging symbols and shall let you know what that
>>>>>>>>>>>>>>>>> yields.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On a similar note, is there something extra that needs to be
>>>>>>>>>>>>>>>>> done to make use of multi-core machines while using MPI? I am
>>>>>>>>>>>>>>>>> not sure if this is even related to PETSc, but it could be an
>>>>>>>>>>>>>>>>> MPI configuration option that either I or the configure
>>>>>>>>>>>>>>>>> process is missing. All ideas are much appreciated.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited 
>>>>>>>>>>>>>>>> operation. On most cheap multicore machines, there is a single 
>>>>>>>>>>>>>>>> memory bus, and thus using more cores gains you very little 
>>>>>>>>>>>>>>>> extra performance. I still suspect you are not actually 
>>>>>>>>>>>>>>>> running in parallel, because you would usually still see a 
>>>>>>>>>>>>>>>> small speedup. That is why I suggested looking at 
>>>>>>>>>>>>>>>> -log_summary, since it tells you how many processes were run 
>>>>>>>>>>>>>>>> and breaks down the time.
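>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> To make the bandwidth argument concrete, here is a minimal 
>>>>>>>>>>>>>>>> sketch of a CSR matrix-vector product (the function name is 
>>>>>>>>>>>>>>>> illustrative; PETSc's AIJ MatMult is more elaborate, but the 
>>>>>>>>>>>>>>>> traffic pattern is the same): each nonzero costs an 8-byte 
>>>>>>>>>>>>>>>> value load plus a 4-byte column-index load for only 2 flops, 
>>>>>>>>>>>>>>>> so roughly 6 bytes per flop. The memory bus, not the core 
>>>>>>>>>>>>>>>> count, sets the rate.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> /* y = A*x for a matrix stored in compressed sparse row form */
>>>>>>>>>>>>>>>> void csr_matvec(int nrows, const int *rowptr, const int *colind,
>>>>>>>>>>>>>>>>                 const double *val, const double *x, double *y)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>   int i, k;
>>>>>>>>>>>>>>>>   for (i = 0; i < nrows; i++) {
>>>>>>>>>>>>>>>>     double sum = 0.0;
>>>>>>>>>>>>>>>>     /* one val[] load and one colind[] load per 2 flops */
>>>>>>>>>>>>>>>>     for (k = rowptr[i]; k < rowptr[i+1]; k++)
>>>>>>>>>>>>>>>>       sum += val[k]*x[colind[k]];
>>>>>>>>>>>>>>>>     y[i] = sum;
>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>> }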
>>>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley
>>>>>>>>>>>>>>>>> <knepley at gmail.com> wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan
>>>>>>>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I am trying to configure my PETSc install with an MPI
>>>>>>>>>>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>>>>>>>>>>> running Ubuntu. But even though the configure/make process
>>>>>>>>>>>>>>>>>>> went through without problems, the scalability of the
>>>>>>>>>>>>>>>>>>> programs doesn't seem to reflect what I expected. My
>>>>>>>>>>>>>>>>>>> configure options are
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/
>>>>>>>>>>>>>>>>>>> --download-mpich=1
>>>>>>>>>>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1
>>>>>>>>>>>>>>>>>>> --download-hypre=1
>>>>>>>>>>>>>>>>>>> --download-blacs=1 --download-scalapack=1 
>>>>>>>>>>>>>>>>>>> --with-clanguage=C++
>>>>>>>>>>>>>>>>>>> --download-plapack=1 --download-mumps=1 
>>>>>>>>>>>>>>>>>>> --download-umfpack=yes
>>>>>>>>>>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 1) For performance studies, make a build using
>>>>>>>>>>>>>>>>>> --with-debugging=0
>>>>>>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Is there something else that needs to be done as part of
>>>>>>>>>>>>>>>>>>> the configure process to enable decent scaling? I am only
>>>>>>>>>>>>>>>>>>> comparing runs with mpiexec (-n 1) and (-n 2), but they
>>>>>>>>>>>>>>>>>>> seem to be taking approximately the same time as noted from
>>>>>>>>>>>>>>>>>>> -log_summary. If it helps, I've been testing with
>>>>>>>>>>>>>>>>>>> snes/examples/tutorials/ex20.c for all purposes, with a
>>>>>>>>>>>>>>>>>>> custom -grid parameter from the command line to control the
>>>>>>>>>>>>>>>>>>> number of unknowns.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>>>>>>>>>>> configuration or
>>>>>>>>>>>>>>>>>>> if you need anything else to analyze the problem, do let me
>>>>>>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin 
>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>>>>> is infinitely more interesting than any results to which 
>>>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>>>>> lead.
>>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin 
>>>>>>>>>>>>>>>> their
>>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>>> is infinitely more interesting than any results to which their
>>>>>>>>>>>>>>>> experiments
>>>>>>>>>>>>>>>> lead.
>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> What most experimenters take for granted before they begin their 
>>>>>>>> experiments
>>>>>>>> is infinitely more interesting than any results to which their 
>>>>>>>> experiments
>>>>>>>> lead.
>>>>>>>> -- Norbert Wiener
>>>>>>>> 
>>>>>>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>> <basicversion_np1.out><basicversion_np2.out>
>> 
>> 
