Barry,

I understand what you are saying, but which example/options would then be
best for measuring scalability on a multi-core machine? I chose the
nonlinear diffusion problem specifically because I expected its inherent
stiffness to make any scalability on a multi-core system clearly visible.
From your experience, is there another example program that would
demonstrate this more rigorously or clearly? By the way, I don't get good
speedup even for 2 processes with ex20.c, and that was the original
motivation for this thread.

Satish, I have now configured with --download-mpich but without the
mpich device option. The results are given above. I will try the options
you provided, although I don't entirely understand what they mean, which
bothers me a little. Also, is OpenMPI the preferred MPI implementation on
Ubuntu?

Vijay

On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>   Ok, everything makes sense. Looks like you are using two-level multigrid
> (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
> -mg_coarse_redundant_pc_type lu. This means it is solving the coarse grid
> problem redundantly on each process (each process does the entire coarse
> grid solve using LU factorization). The time for the factorization (in the
> two-process case) is
>
> MatLUFactorNum        14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
> MatILUFactorSym        7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>
> which is 74 percent of the total solve time (and 82 percent of the flops).
> When three-quarters of the entire run is not parallel at all, you cannot
> expect much speedup. If you run with -snes_view it will display exactly the
> solver being used. You cannot expect to understand the performance if you
> don't understand what the solver is actually doing. Using a 20 by 20 by 20
> coarse grid is generally a bad idea since the code spends most of its time
> there; stick with something like 5 by 5 by 5.
>
>   I suggest running with the default grid and -dmmg_nlevels 5; then the
> coarse solve will take only a trivial percentage of the run time.
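>
> For example, something along these lines (reusing the mpiexec from your
> earlier runs; adjust the path to your install):
>
>     mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -snes_view -log_summary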
>
>   You should get pretty good speedup for 2 processes but not much better
> speedup for four processes because, as Matt noted, the computation is memory
> bandwidth limited; see
> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers. Note
> also that this is running multigrid, which is a fast solver but does not
> scale in parallel as well as many slow algorithms. For example, if you run
> with -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with 2
> processes but crummy overall speed.
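>
> For instance, something like the following (same example, just swapping in
> the Jacobi preconditioner) and comparing the solve times:
>
>     mpiexec -n 1 ./ex20 -dmmg_nlevels 5 -pc_type jacobi -log_summary
>     mpiexec -n 2 ./ex20 -dmmg_nlevels 5 -pc_type jacobi -log_summary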
>
>   Barry
>
>
>
> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>
>> Barry,
>>
>> Please find attached the patch with the minor change to control the
>> number of elements from the command line for snes/ex20.c. I know this
>> can be achieved with -grid_x etc. from the command line, but I thought
>> it just made the typing for the refinement process a little easier. I
>> apologize if there was any confusion.
>>
>> Also find attached the full log summaries for -np=1 and -np=2. Thanks.
>>
>> Vijay
>>
>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>   We need all the information from -log_summary to see what is going on.
>>>
>>>   Not sure what -grid 20 means, but don't expect any good parallel
>>> performance with fewer than about 10,000 unknowns per process.
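>>> (If -grid 20 means a 20 by 20 by 20 grid, that is only 20^3 = 8,000
>>> unknowns in total, i.e. roughly 4,000 per process on two processes.)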
>>>
>>>   Barry
>>>
>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>
>>>> Here's the performance statistic on 1 and 2 processor runs.
>>>>
>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 
>>>> -log_summary
>>>>
>>>>                          Max       Max/Min        Avg      Total
>>>> Time (sec):           8.452e+00      1.00000   8.452e+00
>>>> Objects:              1.470e+02      1.00000   1.470e+02
>>>> Flops:                5.045e+09      1.00000   5.045e+09  5.045e+09
>>>> Flops/sec:            5.969e+08      1.00000   5.969e+08  5.969e+08
>>>> MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
>>>> MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
>>>> MPI Reductions:       4.440e+02      1.00000
>>>>
>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 
>>>> -log_summary
>>>>
>>>>                          Max       Max/Min        Avg      Total
>>>> Time (sec):           7.851e+00      1.00000   7.851e+00
>>>> Objects:              2.000e+02      1.00000   2.000e+02
>>>> Flops:                4.670e+09      1.00580   4.657e+09  9.313e+09
>>>> Flops/sec:            5.948e+08      1.00580   5.931e+08  1.186e+09
>>>> MPI Messages:         7.965e+02      1.00000   7.965e+02  1.593e+03
>>>> MPI Message Lengths:  1.412e+07      1.00000   1.773e+04  2.824e+07
>>>> MPI Reductions:       1.046e+03      1.00000
>>>>
>>>> I am not entirely sure I can make sense of these statistics, but if
>>>> there is anything more you need, please feel free to let me know.
>>>>
>>>> Vijay
>>>>
>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> 
>>>> wrote:
>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Matt,
>>>>>>
>>>>>> The --with-debugging=1 option is certainly not meant for performance
>>>>>> studies, but I didn't expect it to yield the same CPU time as a single
>>>>>> processor run for snes/ex20; i.e., my runs with 1 and 2 processors take
>>>>>> approximately the same amount of time to compute the solution. I am
>>>>>> currently configuring without debugging and shall let you know what
>>>>>> that yields.
>>>>>>
>>>>>> On a similar note, is there anything extra that needs to be done to
>>>>>> make use of multi-core machines with MPI? I am not sure whether this
>>>>>> is even related to PETSc; it could be an MPI configuration option that
>>>>>> either I or the configure process is missing. All ideas are much
>>>>>> appreciated.
>>>>>
>>>>> Sparse MatVec (MatMult) is a memory-bandwidth-limited operation. On most
>>>>> cheap multicore machines there is a single memory bus, so using more
>>>>> cores gains you very little extra performance. I still suspect you are
>>>>> not actually running in parallel, because you would usually see at least
>>>>> a small speedup. That is why I suggested looking at -log_summary, since
>>>>> it tells you how many processes were run and breaks down the time.
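>>>>> For example, the banner near the top of the -log_summary output should
>>>>> say something like "./ex20 on a linux-gnu-cxx-opt named <hostname> with 2
>>>>> processors" if both processes are really being used.
>>>>>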
>>>>>    Matt
>>>>>
>>>>>>
>>>>>> Vijay
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> 
>>>>>> wrote:
>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to configure my petsc install with an MPI installation to
>>>>>>>> make use of a dual quad-core desktop system running Ubuntu. But even
>>>>>>>> though the configure/make process went through without problems, the
>>>>>>>> scalability of the programs doesn't seem to reflect what I expected.
>>>>>>>> My configure options are
>>>>>>>>
>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>
>>>>>>> 1) For performance studies, make a build using --with-debugging=0 (see
>>>>>>> the sketch below)
>>>>>>> 2) Look at -log_summary for a breakdown of performance
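>>>>>>>
>>>>>>> A minimal sketch for 1), assuming you keep the rest of your original
>>>>>>> configure options unchanged (and drop --COPTFLAGS=-g so the default
>>>>>>> optimization flags are used):
>>>>>>>
>>>>>>>     ./configure --with-debugging=0 --download-f-blas-lapack=1 \
>>>>>>>         --download-mpich=1 --with-clanguage=C++ ...
>>>>>>>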
>>>>>>>    Matt
>>>>>>>
>>>>>>>>
>>>>>>>> Is there something else that needs to be done as part of the configure
>>>>>>>> process to enable decent scaling? I am only comparing runs with mpiexec
>>>>>>>> -n 1 and -n 2, but they seem to take approximately the same time
>>>>>>>> according to -log_summary. If it helps, I've been testing with
>>>>>>>> snes/examples/tutorials/ex20.c throughout, with a custom -grid
>>>>>>>> parameter from the command line to control the number of unknowns.
>>>>>>>>
>>>>>>>> If this is something you've seen before with this configuration, or if
>>>>>>>> you need anything else to analyze the problem, do let me know.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vijay
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> What most experimenters take for granted before they begin their
>>>>> experiments is infinitely more interesting than any results to which
>>>>> their experiments lead.
>>>>> -- Norbert Wiener
>>>>>
>>>
>>>
>> <ex20.patch><ex20_np1.out><ex20_np2.out>
>
>
