Bill,

    It is great that you ran with -info to confirm there are no excessive 
mallocs in the vector and matrix assemblies, and with -ksp_view to show the 
solver being used, but I recommend doing that in a separate run from the 
-log_summary run, because we make no attempt to optimize the -info and 
-xxx_view options for performance.
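
For example, something like the following (just a sketch; the launcher and 
executable names are placeholders for whatever you actually use):

    # diagnostic run: check the assembly and the solver configuration
    mpiexec -n 4 ./your_solver -info -ksp_view

    # separate timing run: only the profiling option
    mpiexec -n 4 ./your_solver -log_summary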

    To begin the analysis I find it is best not to compare 1 to 2 processes, 
nor to compare at the highest process count, but somewhere in the middle. 
Hence I look at 2 and 4 processes.

  1)   Looking at embarrassingly parallel operations

4procs

VecMAXPY            8677 1.0 6.9120e+00 1.0 8.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 35  0  0  0   6 35  0  0  0  4717
MatSolve            8677 1.0 6.9232e+00 1.1 3.41e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 15  0  0  0   6 15  0  0  0  1971
MatLUFactorNum         1 1.0 2.5489e-03 1.2 6.53e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1024
VecScale            8677 1.0 2.1447e+01 1.1 2.71e+08 1.0 0.0e+00 0.0e+00 0.0e+00 16  1  0  0  0  19  1  0  0  0    51
VecAXPY              508 1.0 8.9473e-01 1.4 3.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   142

2procs

VecMAXPY            8341 1.0 9.4324e+00 1.0 1.54e+10 1.0 0.0e+00 0.0e+00 0.0e+00 15 34  0  0  0  23 35  0  0  0  3261
MatSolve            8341 1.0 1.0210e+01 1.0 6.61e+09 1.0 0.0e+00 0.0e+00 0.0e+00 16 15  0  0  0  25 15  0  0  0  1294
MatLUFactorNum         1 1.0 4.0622e-03 1.1 1.32e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   650
VecScale            8341 1.0 1.0367e+00 1.3 5.21e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  1006
VecAXPY              502 1.0 3.5317e-02 1.7 6.28e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3553

These are routines with no communication between the MPI processes and no 
synchronization, so in an ideal situation one could hope for them to run TWICE 
as fast on 4 processes as on 2.  For VecMAXPY, MatSolve, and MatLUFactorNum 
the ratios of flop rates (4 processes over 2) are 1.44, 1.52, and 1.57.  Thus 
I conclude that the 4 MPI processes are sharing memory bandwidth, so you 
cannot expect to get a 2 times speedup.
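
For reference, those ratios come straight from the Mflop/s column above 
(4-process rate divided by 2-process rate):

    VecMAXPY:       4717 / 3261 ≈ 1.44
    MatSolve:       1971 / 1294 ≈ 1.52
    MatLUFactorNum: 1024 / 650  ≈ 1.57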

But what is going on with VecScale and VecAXPY? Why is their performance 
falling through the floor? I noticed that you are using OpenBLAS, so I did 
some poking around in Google and found the following at 
https://github.com/xianyi/OpenBLAS/wiki/faq#what

If your application is already multi-threaded, it will conflict with OpenBLAS 
multi-threading. Thus, you must set OpenBLAS to use single thread as following.

        • export OPENBLAS_NUM_THREADS=1 in the environment variables. Or
        • Call openblas_set_num_threads(1) in the application on runtime. Or
        • Build OpenBLAS single thread version, e.g. make USE_THREAD=0

Of course your application is not multi-threaded, it is MPI parallel, but you 
have the exact same problem: the cores are oversubscribed with too many 
threads, which kills the performance of some routines.

So please FORCE OpenBLAS to use only a single thread and rerun the 1, 2, 4, 
and 8 process cases with -log_summary and without the -info and -xxx_view 
options.
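
For example (just a sketch; mpiexec and the executable name are placeholders, 
and how environment variables get forwarded to the ranks depends on your MPI 
launcher):

    export OPENBLAS_NUM_THREADS=1
    mpiexec -n 4 ./your_solver -log_summary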

2)  I now compare the 4 and 8 process cases for VecMAXPY and MatSolve

8procs
VecMAXPY            9336 1.0 3.0977e+00 1.0 4.59e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 35  0  0  0   5 35  0  0  0 11835
MatSolve            9336 1.0 3.0873e+00 1.1 1.82e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 14  0  0  0   4 14  0  0  0  4716

4procs
VecMAXPY            8677 1.0 6.9120e+00 1.0 8.15e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 35  0  0  0   6 35  0  0  0  4717
MatSolve            8677 1.0 6.9232e+00 1.1 3.41e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 15  0  0  0   6 15  0  0  0  1971

What the hey is going on here? The performance more than doubles!  From this 
I conclude that going from 4 to 8 processes moves the computation onto twice 
as many physical CPUs, which DO NOT share memory bandwidth.

  A general observation: since the p cores on the same physical CPU generally 
share memory bandwidth, when you go from p/2 to p MPI processes on that CPU 
you will never see a doubling in performance (perfect speedup); you are 
actually lucky to see the 1.5 times speedup that you are seeing. Thus as you 
increase the number of MPI processes, spreading onto more and more physical 
CPUs, you will see “funny jumps” in your speedup depending on when the run 
spills onto additional physical CPUs (and hence more memory bandwidth). It is 
therefore important to understand “where” the program is actually running.
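
If you want to check where the ranks actually land, one possibility (assuming 
a Linux machine; lstopo comes from the hwloc package and --report-bindings is 
an Open MPI option, so adjust for your own MPI launcher) is:

    lstopo                              # show the sockets, cores, and caches of a node
    mpiexec --report-bindings -n 8 ./your_solver -log_summary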

  So make the changes I recommend and send us the new set of -log_summary and 
we may be able to make more observations based on less “cluttered” data.

   Barry



On Mar 14, 2014, at 4:45 PM, William Coirier 
<[email protected]> wrote:

> I've written a parallel, finite-volume, transient thermal conduction solver 
> using PETSc primitives, and so far things have been going great. Comparisons 
> to theory for a simple problem (transient conduction in a semi-infinite slab) 
> look good, but I'm not getting very good parallel scaling behavior with the 
> KSP solver. Whether I use the default KSP/PC or other sensible combinations, 
> the time spent in KSPSolve seems to not scale well at all.
> 
> I seem to have loaded up the problem well enough. The PETSc logging/profiling 
> has been really useful for reworking various code segments, and right now, 
> the bottleneck is KSPSolve, and I can't seem to figure out how to get it to 
> scale properly.
> 
> I'm attaching output produced with -log_summary, -info, -ksp_view and 
> -pc_view all specified on the command line for 1, 2, 4 and 8 processes.
> 
> If you guys have any suggestions, I'd definitely like to hear them! And I 
> apologize in advance if I've done something stupid. All the documentation has 
> been really helpful.
> 
> Thanks in advance...
> 
> Bill Coirier
> 
