Barry,

Sorry about the delay in replying; I did not have access to the system to
test your suggestion until now.

I tried -dmmg_nlevels 5 along with the default setup:

    ./ex20 -log_summary -dmmg_view -pc_type jacobi -dmmg_nlevels 5

    processors   time (s)
    1            114.2
    2             89.45
    4             81.01

The speedup does not seem optimal, even with two processors, and I am
wondering whether the fault is in the MPI configuration itself. Are these
results what you would expect? I can also send you the -log_summary output
for all cases if that will help.

Vijay
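For reference, those timings work out to a speedup of about 1.28x on 2
processes (64 percent parallel efficiency) and 1.41x on 4 processes (35
percent). A minimal stand-alone C sketch (hypothetical, not part of ex20 or
PETSc) that turns such wall-clock times into speedup and efficiency figures:

    #include <stdio.h>

    /* Hypothetical helper: computes speedup and parallel efficiency
       from the wall-clock times reported above. */
    int main(void)
    {
      const int    procs[]  = {1, 2, 4};
      const double time_s[] = {114.2, 89.45, 81.01};
      const int    n        = 3;

      for (int i = 0; i < n; i++) {
        double speedup    = time_s[0] / time_s[i];
        double efficiency = speedup / procs[i];   /* ideal value is 1.0 */
        printf("np=%d  time=%7.2f s  speedup=%.2fx  efficiency=%.0f%%\n",
               procs[i], time_s[i], speedup, 100.0 * efficiency);
      }
      return 0;
    }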
On Thu, Feb 3, 2011 at 11:10 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
> On Feb 2, 2011, at 11:13 PM, Vijay S. Mahadevan wrote:
>
>> Barry,
>>
>> I understand what you are saying, but which example and options are then
>> best for measuring scalability on a multi-core machine? I chose the
>> nonlinear diffusion problem specifically because its inherent stiffness
>> could probably provide noticeable scalability on a multi-core system.
>> From your experience, is there another example program that demonstrates
>> this more rigorously or clearly? By the way, I don't get good speedup
>> even for 2 processes with ex20.c, and that was the original motivation
>> for this thread.
>
>   Did you follow my instructions?
>
>   Barry
>
>>
>> Satish, I configured with --download-mpich now, without the mpich-device.
>> The results are given above. I will try the options you provided,
>> although I don't entirely understand what they mean, which kinda bugs
>> me. Also, is OpenMPI the preferred implementation on Ubuntu?
>>
>> Vijay
>>
>> On Wed, Feb 2, 2011 at 6:35 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>
>>>   Ok, everything makes sense. It looks like you are using two-level
>>> multigrid (coarse grid 20 by 20 by 20) with -mg_coarse_pc_type redundant
>>> -mg_coarse_redundant_pc_type lu. This means it is solving the
>>> coarse-grid problem redundantly on each process (each process solves
>>> the entire coarse grid using LU factorization). The time for the
>>> factorization in the two-process case is
>>>
>>> MatLUFactorNum    14 1.0 2.9096e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00 0.0e+00 37 41  0  0  0  74 82  0  0  0  1307
>>> MatILUFactorSym    7 1.0 7.2970e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  0  0  0  0  1   0  0  0  0  2     0
>>>
>>> which is 74 percent of the total solve time (and 82 percent of the
>>> flops). When 3/4 of the entire run is not parallel at all, you cannot
>>> expect much speedup. If you run with -snes_view it will display exactly
>>> the solver being used; you cannot expect to understand the performance
>>> if you don't understand what the solver is actually doing. Using a
>>> 20 by 20 by 20 coarse grid is generally a bad idea since the code
>>> spends most of its time there; stick with something like 5 by 5 by 5.
>>>
>>>   I suggest running with the default grid and -dmmg_nlevels 5; then the
>>> coarse solve will be a trivial percentage of the run time.
>>>
>>>   You should get pretty good speedup for 2 processes but not much
>>> better speedup for four, because, as Matt noted, the computation is
>>> memory bandwidth limited; see
>>> http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers.
>>> Note also that this is running multigrid, which is a fast solver but
>>> doesn't parallel scale as well as many slow algorithms. For example, if
>>> you run -dmmg_nlevels 5 -pc_type jacobi you will get great speedup with
>>> 2 processors but crummy speed.
>>>
>>>   Barry
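The bandwidth limit Barry refers to can be measured directly on the
machine. Below is a minimal STREAM-style triad sketch in plain C
(independent of PETSc; the array size is an arbitrary choice, merely large
enough to defeat the caches). The sustained GB/s it reports roughly bounds
sparse MatMult performance no matter how many cores share the memory bus:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* STREAM-style triad: a[i] = b[i] + s*c[i]. The achievable rate is
       set by the memory bus, which is what limits sparse MatMult on a
       multicore box. N = 20M doubles per array (~480 MB total) is an
       arbitrary choice that comfortably exceeds any cache. */
    #define N 20000000L

    int main(void)
    {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      long    i;

      if (!a || !b || !c) return 1;
      for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      clock_t t0 = clock();
      for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];
      double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

      /* 24 bytes move per iteration: two 8-byte loads, one 8-byte store.
         Reading a[N/2] keeps the compiler from discarding the loop. */
      printf("triad: %.2f GB/s (check value %.1f)\n",
             24.0 * N / sec / 1e9, a[N / 2]);
      free(a); free(b); free(c);
      return 0;
    }

If a single process already saturates this number, adding processes on the
same socket cannot make MatMult faster, which is consistent with the jacobi
timings at the top of the thread.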
>>> On Feb 2, 2011, at 6:17 PM, Vijay S. Mahadevan wrote:
>>>
>>>> Barry,
>>>>
>>>> Please find attached the patch with the minor change that controls the
>>>> number of elements from the command line for snes/ex20.c. I know this
>>>> can be achieved with -grid_x etc. from the command line, but I thought
>>>> this made the typing for the refinement process a little easier. I
>>>> apologize if there was any confusion.
>>>>
>>>> Also, find attached the full log summaries for -np=1 and -np=2. Thanks.
>>>>
>>>> Vijay
>>>>
>>>> On Wed, Feb 2, 2011 at 6:06 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>>>>
>>>>>   We need all the information from -log_summary to see what is going on.
>>>>>
>>>>>   Not sure what -grid 20 means, but don't expect any good parallel
>>>>> performance with fewer than at least 10,000 unknowns per process.
>>>>>
>>>>>   Barry
>>>>>
>>>>> On Feb 2, 2011, at 5:38 PM, Vijay S. Mahadevan wrote:
>>>>>
>>>>>> Here are the performance statistics for the 1- and 2-processor runs.
>>>>>>
>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 1 ./ex20 -grid 20 -log_summary
>>>>>>
>>>>>>                          Max       Max/Min     Avg        Total
>>>>>> Time (sec):           8.452e+00   1.00000   8.452e+00
>>>>>> Objects:              1.470e+02   1.00000   1.470e+02
>>>>>> Flops:                5.045e+09   1.00000   5.045e+09  5.045e+09
>>>>>> Flops/sec:            5.969e+08   1.00000   5.969e+08  5.969e+08
>>>>>> MPI Messages:         0.000e+00   0.00000   0.000e+00  0.000e+00
>>>>>> MPI Message Lengths:  0.000e+00   0.00000   0.000e+00  0.000e+00
>>>>>> MPI Reductions:       4.440e+02   1.00000
>>>>>>
>>>>>> /usr/lib/petsc/linux-gnu-cxx-opt/bin/mpiexec -n 2 ./ex20 -grid 20 -log_summary
>>>>>>
>>>>>>                          Max       Max/Min     Avg        Total
>>>>>> Time (sec):           7.851e+00   1.00000   7.851e+00
>>>>>> Objects:              2.000e+02   1.00000   2.000e+02
>>>>>> Flops:                4.670e+09   1.00580   4.657e+09  9.313e+09
>>>>>> Flops/sec:            5.948e+08   1.00580   5.931e+08  1.186e+09
>>>>>> MPI Messages:         7.965e+02   1.00000   7.965e+02  1.593e+03
>>>>>> MPI Message Lengths:  1.412e+07   1.00000   1.773e+04  2.824e+07
>>>>>> MPI Reductions:       1.046e+03   1.00000
>>>>>>
>>>>>> I am not entirely sure I can make sense of those statistics, but if
>>>>>> there is something more you need, please feel free to let me know.
>>>>>>
>>>>>> Vijay
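The -grid option that the patch adds could be read along these lines; a
minimal sketch assuming the petsc-3.1-era PetscOptionsGetInt signature
(PetscTruth flag) — the actual attached patch may differ:

    /* grid_opt.c: hypothetical sketch of reading a -grid option, in the
       style of the ex20 patch mentioned above. */
    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscInt       grid = 20;  /* default matches the -grid 20 runs above */
      PetscTruth     flg;

      ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
      ierr = PetscOptionsGetInt(PETSC_NULL, "-grid", &grid, &flg);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "grid = %D\n", grid);CHKERRQ(ierr);
      /* ex20 would pass grid to DACreate3d() as the coarse mesh size in
         each dimension, instead of setting -grid_x/-grid_y/-grid_z. */
      ierr = PetscFinalize();CHKERRQ(ierr);
      return 0;
    }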
>>>>>>
>>>>>> On Wed, Feb 2, 2011 at 5:15 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>> On Wed, Feb 2, 2011 at 5:04 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Matt,
>>>>>>>>
>>>>>>>> The --with-debugging=1 option is certainly not meant for
>>>>>>>> performance studies, but I didn't expect it to yield the same cpu
>>>>>>>> time as a single processor for snes/ex20; i.e., my runs with 1 and
>>>>>>>> 2 processors take approximately the same amount of time to compute
>>>>>>>> the solution. I am currently configuring without debugging symbols
>>>>>>>> and shall let you know what that yields.
>>>>>>>>
>>>>>>>> On a similar note, is there something extra that needs to be done
>>>>>>>> to make use of multi-core machines while using MPI? I am not sure
>>>>>>>> whether this is even related to PETSc; it could be an MPI
>>>>>>>> configuration option that either I or the configure process is
>>>>>>>> missing. All ideas are much appreciated.
>>>>>>>
>>>>>>> Sparse MatVec (MatMult) is a memory bandwidth limited operation. On
>>>>>>> most cheap multicore machines there is a single memory bus, so using
>>>>>>> more cores gains you very little extra performance. I still suspect
>>>>>>> you are not actually running in parallel, because you usually see at
>>>>>>> least a small speedup. That is why I suggested looking at
>>>>>>> -log_summary: it tells you how many processes were run and breaks
>>>>>>> down the time.
>>>>>>>
>>>>>>>    Matt
>>>>>>>
>>>>>>>>
>>>>>>>> Vijay
>>>>>>>>
>>>>>>>> On Wed, Feb 2, 2011 at 4:53 PM, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>>>> On Wed, Feb 2, 2011 at 4:46 PM, Vijay S. Mahadevan <vijay.m at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to configure my PETSc install with an MPI
>>>>>>>>>> installation to make use of a dual quad-core desktop system
>>>>>>>>>> running Ubuntu. But even though the configure/make process went
>>>>>>>>>> through without problems, the scalability of the programs
>>>>>>>>>> doesn't seem to reflect what I expected. My configure options are
>>>>>>>>>>
>>>>>>>>>> --download-f-blas-lapack=1 --with-mpi-dir=/usr/lib/ --download-mpich=1
>>>>>>>>>> --with-mpi-shared=0 --with-shared=0 --COPTFLAGS=-g
>>>>>>>>>> --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1
>>>>>>>>>> --download-blacs=1 --download-scalapack=1 --with-clanguage=C++
>>>>>>>>>> --download-plapack=1 --download-mumps=1 --download-umfpack=yes
>>>>>>>>>> --with-debugging=1 --with-errorchecking=yes
>>>>>>>>>
>>>>>>>>> 1) For performance studies, make a build using --with-debugging=0
>>>>>>>>> 2) Look at -log_summary for a breakdown of performance
>>>>>>>>>
>>>>>>>>>    Matt
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Is there something else that needs to be done as part of the
>>>>>>>>>> configure process to enable decent scaling? I am only comparing
>>>>>>>>>> programs run with mpiexec (-n 1) and (-n 2), but they seem to
>>>>>>>>>> take approximately the same time as noted from -log_summary. If
>>>>>>>>>> it helps, I've been testing with snes/examples/tutorials/ex20.c
>>>>>>>>>> for all purposes, with a custom -grid parameter from the command
>>>>>>>>>> line to control the number of unknowns.
>>>>>>>>>>
>>>>>>>>>> If there is something you've witnessed before in this
>>>>>>>>>> configuration, or if you need anything else to analyze the
>>>>>>>>>> problem, do let me know.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Vijay
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> What most experimenters take for granted before they begin their
>>>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>>>> which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to
>>>>>>> which their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>
>>>> <ex20.patch><ex20_np1.out><ex20_np2.out>
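Matt's suspicion that the runs may not actually be parallel is easy to rule
out before digging into -log_summary. A trivial MPI check, independent of
PETSc, compiled with the same mpicc that built it:

    #include <stdio.h>
    #include <mpi.h>

    /* Each rank reports itself. If "mpiexec -n 2 ./rank_check" prints
       only one line, the MPI installation (not PETSc) is the problem. */
    int main(int argc, char **argv)
    {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }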
