Barry,

Thanks for the quick reply. I ran the benchmark/stream/BasicVersion
for one and two processes, and the output is as follows:

-n 1
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 50 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2529 microseconds.
   (= 2529 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:       10161.8510       0.0032       0.0031       0.0037
Scale:       9843.6177       0.0034       0.0033       0.0038
Add:        10656.7114       0.0046       0.0045       0.0053
Triad:      10799.0448       0.0046       0.0044       0.0054

-n 2
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 50 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 4320 microseconds.
   (= 4320 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        5739.9704       0.0058       0.0056       0.0063
Scale:       5839.3617       0.0058       0.0055       0.0062
Add:         6116.9323       0.0081       0.0078       0.0085
Triad:       6021.0722       0.0084       0.0080       0.0088
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 50 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2954 microseconds.
   (= 2954 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:        6091.9448       0.0056       0.0053       0.0061
Scale:       5501.1775       0.0060       0.0058       0.0062
Add:         5960.4640       0.0084       0.0081       0.0087
Triad:       5936.2109       0.0083       0.0081       0.0089

I do not have OpenMP installed, so I am not sure if that is what you wanted
when you said two threads. I also closed most of the open applications before
running these tests, so the results should hopefully be accurate.
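
If I am reading the numbers right, the two-process Triad rates add up to
6021 + 5936 ~= 11957 MB/s, versus 10799 MB/s for one process, so the second
process only gains about 10% in aggregate bandwidth.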

Vijay


On Thu, Feb 3, 2011 at 1:17 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>   Vijay
>
>   Let's just look at a single embarrassingly parallel computation in the run;
> this computation has NO communication and uses NO MPI and NO synchronization
> between processes:
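>
> (VecMAXPY computes y <- y + sum_i alpha_i x_i, using only locally owned
> vector entries, so there is nothing to communicate.)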
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
>   1 process:
> VecMAXPY            3898 1.0 1.7074e+01 1.0 3.39e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  29 40  0  0  0  1983
>
>   2 processes:
> VecMAXPY            3898 1.0 1.3861e+01 1.0 1.72e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 20  0  0  0  31 40  0  0  0  2443
>
>   The speedup is 1.7074e+01/1.3861e+01 = 2443/1983 = 1.23, which is terrible!
> Now why would it be so bad? (Remember, you cannot blame MPI.)
>
> 1) other processes are running on the machine sucking up memory bandwidth. 
> Make sure no other compute tasks are running during this time.
>
> 2) the single process run is able to use almost all of the hardware memory 
> bandwidth, so introducing the second process cannot increase the performance 
> much. This means this machine is terrible for parallelization of sparse 
> iterative solvers.
>
> 3) the machine is somehow misconfigured (beats me how) so that while the one 
> process job doesn't use more than half of the memory bandwidth, when two 
> processes are run the second process cannot utilize all that additional 
> memory bandwidth.
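>
> (For rough reference: DDR3-1333 transfers about 1333e6 x 8 bytes ~= 10.7 GB/s
> per memory channel, so a dual-channel system would peak around 21 GB/s in
> theory; STREAM typically achieves well below the theoretical peak.)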
>
>   In src/benchmarks/streams you can run make test and have it generate a
> report of how the streams benchmark is able to utilize the memory bandwidth.
> Run that and send us the output (run with just 2 threads).
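>
>   For example (a sketch; exact paths and targets may differ across PETSc
> versions):
>
>     cd src/benchmarks/streams
>     make test                   # builds and runs the streams report
>     # or run the benchmark binary directly:
>     mpiexec -n 2 ./BasicVersion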
>
>   Barry
>
>
> On Feb 3, 2011, at 12:05 PM, Vijay S. Mahadevan wrote:
>
>> Matt,
>>
>> I apologize for the incomplete information. Find attached the
>> log_summary for all the cases.
>>
>> The dual quad-core system has 12 GB of DDR3 SDRAM at 1333 MHz in a
>> 2x2GB/2x4GB configuration. I do not know how to deduce the memory
>> bandwidth from this information, but if you need anything more, do let
>> me know.
>>
>> Vijay
>>
>> On Thu, Feb 3, 2011 at 11:42 AM, Matthew Knepley <knepley at gmail.com> 
>> wrote:
>>> On Thu, Feb 3, 2011 at 11:37 AM, Vijay S. Mahadevan <vijay.m at gmail.com>
>>> wrote:
>>>>
>>>> Barry,
>>>>
>>>> Sorry about the delay in the reply. I did not have access to the
>>>> system to test out what you said, until now.
>>>>
>>>> I tried with -dmmg_nlevels 5, along with the default setup: ./ex20
>>>> -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5
>>>>
>>>> processors      time (sec)
>>>> 1               114.2
>>>> 2                89.45
>>>> 4                81.01
>>>
>>> 1) ALWAYS ALWAYS send the full -log_summary. I cannot tell anything from
>>> this data.
>>> 2) Do you know the memory bandwidth characteristics of this machine? That is
>>> crucial, and you cannot begin to understand speedup on it until you do.
>>> Please look this up.
>>> 3) Worrying about specifics of the MPI implementation makes no sense until
>>> the basics are nailed down.
>>>
>>>    Matt
>>>
>>>>
>>>> The speedup doesn't seem to be optimal, even with two processors. I am
>>>> wondering if the fault is in the MPI configuration itself. Are these
>>>> results what you would expect? I can also send you the log_summary for
>>>> all cases if that will help.
>>>>
>>>> Vijay
>>>>
>>>> On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>>>>>
>>>>>> Barry,
>>>>>>
>>>>>> I understand what you are saying, but which example/options would then
>>>>>> be best for measuring scalability on a multi-core machine? I chose the
>>>>>> nonlinear diffusion problem specifically because of its inherent
>>>>>> stiffness, which could probably provide noticeable scalability on a
>>>>>> multi-core system. From your experience, do you think there is another
>>>>>> example program that will demonstrate this much more rigorously or
>>>>>> clearly? Btw, I don't get good speedup even for 2 processes with
>>>>>> ex20.c, and that was the original motivation for this thread.
>>>>>
>>>>>   Did you follow my instructions?
>>>>>
>>>>>   Barry
>>>>>
>>>>>>
>>>>>> Satish, I configured with --download-mpich now, without the
>>>>>> mpich-device. The results are given above. I will try the options you
>>>>>> provided, although I don't entirely understand what they mean, which
>>>>>> kinda bugs me. Also, is OpenMPI the preferred implementation on Ubuntu?
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> 
>>>>>> wrote:
>>>>>>>
>>>>>>>   Ok, everything makes sense. Looks like you are using two-level
>>>>>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>>>>>>> -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
>>>>>>> problem redundantly on each process (each process performs the entire
>>>>>>> coarse grid solve using LU factorization). The time for the factorization
>>>>>>> (in the two-process case) is
>>>>>>>
>>>>>>> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>>>>>> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>>>>>
>>>>>>> which is 74 percent of the total solve time (and 82 percent of the
>>>>>>> flops). When three-quarters of the entire run is not parallel at all, you
>>>>>>> cannot expect much speedup. If you run with -snes_view it will display
>>>>>>> exactly the solver being used. You cannot expect to understand the
>>>>>>> performance if you don't understand what the solver is actually doing.
>>>>>>> Using a 20 by 20 by 20 coarse grid is generally a bad idea since the code
>>>>>>> spends most of the time there; stick with something like 5 by 5 by 5.
>>>>>>>
>>>>>>>   Suggest running with the default grid and -dmmg_nlevels 5; then the
>>>>>>> coarse solve will be a trivial percentage of the run time.
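>>>>>>>
>>>>>>>   For example (a sketch; option names as used elsewhere in this thread):
>>>>>>>
>>>>>>>     mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary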
>>>>>>>
>>>>>>>   You should get pretty good speedup for 2 processes but not much better
>>>>>>> speedup for four processes because, as Matt noted, the computation is
>>>>>>> memory bandwidth limited; see
>>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>>>>>> Note also that this is running multigrid, which is a fast solver but
>>>>>>> doesn't scale in parallel as well as many slow algorithms. For example,
>>>>>>> if you run -dmmg_nlevels 5 -pc_type jacobi you will get great speedup
>>>>>>> with 2 processors but crummy speed.
>>>>>>>
>>>>>>>   Barry
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>>>>>
>>>>>>>> Barry,
>>>>>>>>
>>>>>>>> Please find attached the patch for the minor change to control the
>>>>>>>> number of elements from the command line for snes/ex20.c. I know that
>>>>>>>> this can be achieved with -grid_x etc. from the command line, but I
>>>>>>>> thought this made the typing for the refinement process a little
>>>>>>>> easier. I apologize if there was any confusion.
>>>>>>>>
>>>>>>>> Also, find attached the full log summaries for -np=1 and -np=2.
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>   We need all the information from -log_summary to see what is going
>>>>>>>>> on.
>>>>>>>>>
>>>>>>>>>   Not sure what -grid 20 means, but don't expect any good parallel
>>>>>>>>> performance with fewer than at least 10,000 unknowns per process.
>>>>>>>>>
>>>>>>>>>   Barry
>>>>>>>>>
>>>>>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>>>>>
>>>>>>>>>> Here's the performance statistic on 1 and 2 processor runs.
>>>>>>>>>>
>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20
>>>>>>>>>> -log_summary
>>>>>>>>>>
>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>>>>>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>>>>>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>>>>>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>>>>>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>>>>>>>> MPI Reductions:       4.440e+02      1.00000
>>>>>>>>>>
>>>>>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20
>>>>>>>>>> -log_summary
>>>>>>>>>>
>>>>>>>>>>                         Max       Max/Min        Avg      Total
>>>>>>>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>>>>>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>>>>>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>>>>>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>>>>>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>>>>>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>>>>>>>> MPI Reductions:       1.046e+03      1.00000
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure I can make sense of those statistics, but
>>>>>>>>>> if there is something more you need, please feel free to let me
>>>>>>>>>> know.
>>>>>>>>>>
>>>>>>>>>> Vijay
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at 
>>>>>>>>>> gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan
>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Matt,
>>>>>>>>>>>>
>>>>>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>>>>>> performance studies, but I didn't expect it to yield the same
>>>>>>>>>>>> CPU time as a single processor for snes/ex20; i.e., my runs
>>>>>>>>>>>> with 1 and 2 processors take approximately the same amount of
>>>>>>>>>>>> time to compute the solution. I am currently configuring
>>>>>>>>>>>> without debugging symbols and shall let you know what that
>>>>>>>>>>>> yields.
>>>>>>>>>>>>
>>>>>>>>>>>> On a similar note, is there something extra that needs to be done
>>>>>>>>>>>> to make use of multi-core machines while using MPI? I am not sure
>>>>>>>>>>>> if this is even related to PETSc, but it could be an MPI
>>>>>>>>>>>> configuration option that either I or the configure process is
>>>>>>>>>>>> missing. All ideas are much appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation.
>>>>>>>>>>> On most cheap multicore machines there is a single memory bus, and
>>>>>>>>>>> thus using more cores gains you very little extra performance. I
>>>>>>>>>>> still suspect you are not actually running in parallel, because you
>>>>>>>>>>> usually see at least a small speedup. That is why I suggested
>>>>>>>>>>> looking at -log_summary, since it tells you how many processes were
>>>>>>>>>>> run and breaks down the time.
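>>>>>>>>>>>
>>>>>>>>>>> As a rough back-of-the-envelope: AIJ MatMult does about 2 flops per
>>>>>>>>>>> nonzero while streaming roughly 12 bytes per nonzero (an 8-byte
>>>>>>>>>>> value plus a 4-byte column index), so ~10 GB/s of memory bandwidth
>>>>>>>>>>> caps MatMult near 1.7 GF/s no matter how many cores share the bus.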
>>>>>>>>>>>    Matt
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Vijay
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley
>>>>>>>>>>>> <knepley at gmail.com> wrote:
>>>>>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan
>>>>>>>>>>>>> <vijay.m at gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am trying to configure my petsc install with an MPI
>>>>>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>>>>>> running Ubuntu. But even though the configure/make process went
>>>>>>>>>>>>>> through without problems, the scalability of the programs
>>>>>>>>>>>>>> doesn't seem to reflect what I expected. My configure options
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/
>>>>>>>>>>>>>> --download-mpich=1
>>>>>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1
>>>>>>>>>>>>>> --download-hypre=1
>>>>>>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) For performance studies, make a build using
>>>>>>>>>>>>> --with-debugging=0
>>>>>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>>>>>    Matt
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>>>>>>>> programs with mpiexec (-n 1) and (-n 2), but they seem to take
>>>>>>>>>>>>>> approximately the same time, as noted from -log_summary. If it
>>>>>>>>>>>>>> helps, I've been testing with snes/examples/tutorials/ex20.c
>>>>>>>>>>>>>> throughout, with a custom -grid parameter from the command line
>>>>>>>>>>>>>> to control the number of unknowns.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>>>>>> configuration or
>>>>>>>>>>>>>> if you need anything else to analyze the problem, do let me
>>>>>>>>>>>>>> know.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Vijay
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>>>> experiments
>>>>>>>>>>>>> is infinitely more interesting than any results to which their
>>>>>>>>>>>>> experiments
>>>>>>>>>>>>> lead.
>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>>>> experiments
>>>>>>>>>>> is infinitely more interesting than any results to which their
>>>>>>>>>>> experiments
>>>>>>>>>>> lead.
>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments
>>> is infinitely more interesting than any results to which their experiments
>>> lead.
>>> -- Norbert Wiener
>>>
>> <ex20_np1.out><ex20_np2.out><ex20_np4.out>
>
>
