The streams numbers 

1   8291.4887   Rate (MB/s)
2   8739.3219   Rate (MB/s) 1.05401
3  24769.5868   Rate (MB/s) 2.98735
4  31962.0242   Rate (MB/s) 3.8548
5  39603.8828   Rate (MB/s) 4.77645
6  47777.7385   Rate (MB/s) 5.76226
7  54557.5363   Rate (MB/s) 6.57994
8  62769.3910   Rate (MB/s) 7.57034
9  38649.9160   Rate (MB/s) 4.6614

indicate that the MPI launcher is doing a poor job of binding MPI ranks to
cores; you should read up on the binding options for your particular mpiexec
and select good ones. Unfortunately, there is no standard for setting the
bindings, and each MPI implementation changes its options frequently, so you
need to determine them for your exact machine and MPI implementation.
Basically, you want to place each MPI rank on a node as far away as possible,
in terms of memory domains, from the other ranks. Note that going from 1 to 2
ranks there is essentially no speedup, which suggests that the first two ranks
are placed very close together (and thus share all their memory resources).
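
For example (treat these as starting points only, since the exact option names
vary by MPI implementation and version; check the man page for your mpiexec,
and the application name below is just a placeholder), with Open MPI you can
spread ranks across NUMA domains and check where they land with

    mpiexec --map-by numa --bind-to core --report-bindings -n 8 ./your_app

while MPICH's Hydra launcher has analogous options:

    mpiexec -map-by numa -bind-to core -n 8 ./your_app

The reported bindings let you verify that the first two ranks really end up in
different memory domains.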

A side note is that the raw numbers are very good (you get a speedup of 7.57
on 8 ranks, and the speedup climbs to about 10 at 32 cores). This means that
with proper binding you should get really good speedup with PETSc code on at
least 8 cores per node.

  Barry



> On Jul 12, 2022, at 11:32 AM, Ce Qin <[email protected]> wrote:
> 
> For your reference, I also calculated the speedups for other procedures:
> 
> NProcessors  NNodes  CoresPerNode    VecAXPY    MatMult   SetupAMS    PCApply   Assembly    Solving
> 1            1       1                   1.0        1.0        1.0        1.0        1.0        1.0
> 2            1       2              1.640502   1.945753   1.418709   1.898884   1.995246   1.898756
>              2       1              2.297125   2.614508   1.600718   2.419798   2.121401   2.436149
> 4            1       4              4.456256   6.821532   3.614451   5.991256   4.658187   6.004539
>              2       2              4.539748   6.779151   3.619661   5.926112   4.666667   5.942085
>              4       1              4.480902   7.210629   3.471541   6.082946    4.65272   6.101214
> 8            2       4             10.584189  17.519901    8.59046  16.615395   9.380985  16.581135
>              4       2             10.980687  18.674113   8.612347  17.273229   9.308575  17.258891
>              8       1             11.096298  18.210245   8.456557  17.430586   9.314449  17.380612
> 16           2       8             21.929795   37.04392  18.135278    34.5448  18.575953  34.483058
>              4       4              22.00331  39.581504  18.011148  34.793732  18.745129  34.854409
>              8       2             22.692779   41.38289  18.354949  36.388144  18.828393   36.45509
> 32           4       8             43.935774  80.003087  34.963997  70.085728  37.140626  70.175879
>              8       4             44.387091  80.807608   35.62153  71.471289  37.166421  71.533865
> 
> and the streams results on the compute node:
> 
> 1   8291.4887   Rate (MB/s)
> 2   8739.3219   Rate (MB/s) 1.05401
> 3  24769.5868   Rate (MB/s) 2.98735
> 4  31962.0242   Rate (MB/s) 3.8548
> 5  39603.8828   Rate (MB/s) 4.77645
> 6  47777.7385   Rate (MB/s) 5.76226
> 7  54557.5363   Rate (MB/s) 6.57994
> 8  62769.3910   Rate (MB/s) 7.57034
> 9  38649.9160   Rate (MB/s) 4.6614
> 10  58976.9536   Rate (MB/s) 7.11295
> 11  48108.7801   Rate (MB/s) 5.80219
> 12  49506.8213   Rate (MB/s) 5.9708
> 13  54810.5266   Rate (MB/s) 6.61046
> 14  62471.5234   Rate (MB/s) 7.53441
> 15  63968.0218   Rate (MB/s) 7.7149
> 16  69644.8615   Rate (MB/s) 8.39956
> 17  60791.9544   Rate (MB/s) 7.33185
> 18  65476.5162   Rate (MB/s) 7.89683
> 19  60127.0683   Rate (MB/s) 7.25166
> 20  72052.5175   Rate (MB/s) 8.68994
> 21  62045.7745   Rate (MB/s) 7.48307
> 22  64517.7771   Rate (MB/s) 7.7812
> 23  69570.2935   Rate (MB/s) 8.39057
> 24  69673.8328   Rate (MB/s) 8.40305
> 25  75196.7514   Rate (MB/s) 9.06915
> 26  72304.2685   Rate (MB/s) 8.7203
> 27  73234.1616   Rate (MB/s) 8.83245
> 28  74041.3842   Rate (MB/s) 8.9298
> 29  77117.3751   Rate (MB/s) 9.30079
> 30  78293.8496   Rate (MB/s) 9.44268
> 31  81377.0870   Rate (MB/s) 9.81453
> 32  84097.0813   Rate (MB/s) 10.1426
> 
> 
> Best,
> Ce
> 
> Mark Adams <[email protected]> wrote on Tue, Jul 12, 2022 at 22:11:
> You may get more memory bandwidth with 32 processors vs 1, as Ce mentioned.
> Depends on the architecture.
> Do you get the whole memory bandwidth on one processor on this machine?
> 
> On Tue, Jul 12, 2022 at 8:53 AM Matthew Knepley <[email protected]> wrote:
> On Tue, Jul 12, 2022 at 7:32 AM Ce Qin <[email protected]> wrote:
> 
> 
> The linear system is complex-valued. We rewrite it into its real form
> and solve it using FGMRES and an optimal block-diagonal preconditioner.
> We use CG and the AMS preconditioner implemented in HYPRE to solve the
> smaller real linear system arising from applying the block preconditioner.
> The iteration counts of FGMRES and CG stay almost constant across all the runs.
> 
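
A minimal sketch, not the original code, of how an inner (Kr+Ki) solve with
CG and hypre's AMS can be set up through the PETSc C API; the names
SetupInnerSolver and KrpKi are illustrative, and the AMS-specific setup
(discrete gradient, coordinates) is omitted:

    #include <petscksp.h>

    /* Illustrative sketch: inner solve for (Kr+Ki) y = c with CG + hypre AMS. */
    static PetscErrorCode SetupInnerSolver(Mat KrpKi, KSP *inner)
    {
      PC pc;

      PetscFunctionBeginUser;
      PetscCall(KSPCreate(PetscObjectComm((PetscObject)KrpKi), inner));
      PetscCall(KSPSetOperators(*inner, KrpKi, KrpKi));
      PetscCall(KSPSetType(*inner, KSPCG));
      PetscCall(KSPGetPC(*inner, &pc));
      PetscCall(PCSetType(pc, PCHYPRE));
      PetscCall(PCHYPRESetType(pc, "ams"));
      /* AMS also needs the discrete gradient (PCHYPRESetDiscreteGradient)
         and coordinate or constant-edge vectors; not shown here. */
      PetscCall(KSPSetFromOptions(*inner));
      PetscFunctionReturn(0);
    }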
> So those blocks decrease in size as you add more processes?
>  
> 
> I am sorry for the unclear description of the block-diagonal preconditioner.
> Let K be the original complex system matrix; A = [Kr, -Ki; -Ki, -Kr] is the
> equivalent real form of K. Let P = [Kr+Ki, 0; 0, Kr+Ki]; it can be proved
> that P is an optimal preconditioner for A. In our implementation, only Kr,
> Ki and Kr+Ki are explicitly stored as MATMPIAIJ. We use MATSHELL to
> represent A and P. We use FGMRES + P to solve Ax = b, and CG + AMS to
> solve (Kr+Ki)y = c. So the block size never changes.
> 
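
A minimal sketch, not the original code, of what such a MATSHELL for
A = [Kr, -Ki; -Ki, -Kr] can look like, assuming each process stores its local
part of the real-form vector as [local xr ; local xi]; the names BlockCtx,
MatMult_RealForm, and CreateRealFormShell are illustrative:

    #include <petscmat.h>

    typedef struct {
      Mat Kr, Ki;            /* real and imaginary parts, stored as MATMPIAIJ */
      Vec xr, xi, yr, yi, t; /* work vectors with the layout of Kr            */
    } BlockCtx;

    /* y = A x with A = [Kr, -Ki; -Ki, -Kr]; the local part of x is assumed
       to be [local xr ; local xi], so the halves can be aliased in place. */
    static PetscErrorCode MatMult_RealForm(Mat A, Vec x, Vec y)
    {
      BlockCtx          *ctx;
      const PetscScalar *xa;
      PetscScalar       *ya;
      PetscInt           nloc;

      PetscFunctionBeginUser;
      PetscCall(MatShellGetContext(A, &ctx));
      PetscCall(VecGetLocalSize(ctx->xr, &nloc));
      PetscCall(VecGetArrayRead(x, &xa));
      PetscCall(VecGetArray(y, &ya));
      PetscCall(VecPlaceArray(ctx->xr, xa));
      PetscCall(VecPlaceArray(ctx->xi, xa + nloc));
      PetscCall(VecPlaceArray(ctx->yr, ya));
      PetscCall(VecPlaceArray(ctx->yi, ya + nloc));
      /* yr =  Kr*xr - Ki*xi */
      PetscCall(MatMult(ctx->Kr, ctx->xr, ctx->yr));
      PetscCall(MatMult(ctx->Ki, ctx->xi, ctx->t));
      PetscCall(VecAXPY(ctx->yr, -1.0, ctx->t));
      /* yi = -Ki*xr - Kr*xi */
      PetscCall(MatMult(ctx->Ki, ctx->xr, ctx->yi));
      PetscCall(MatMult(ctx->Kr, ctx->xi, ctx->t));
      PetscCall(VecAXPY(ctx->yi, 1.0, ctx->t));
      PetscCall(VecScale(ctx->yi, -1.0));
      PetscCall(VecResetArray(ctx->xr));
      PetscCall(VecResetArray(ctx->xi));
      PetscCall(VecResetArray(ctx->yr));
      PetscCall(VecResetArray(ctx->yi));
      PetscCall(VecRestoreArrayRead(x, &xa));
      PetscCall(VecRestoreArray(y, &ya));
      PetscFunctionReturn(0);
    }

    /* Wrap Kr and Ki into the shell operator A. */
    static PetscErrorCode CreateRealFormShell(Mat Kr, Mat Ki, BlockCtx *ctx, Mat *A)
    {
      PetscInt m, n;

      PetscFunctionBeginUser;
      ctx->Kr = Kr;
      ctx->Ki = Ki;
      PetscCall(MatCreateVecs(Kr, &ctx->xr, &ctx->yr));
      PetscCall(VecDuplicate(ctx->xr, &ctx->xi));
      PetscCall(VecDuplicate(ctx->yr, &ctx->yi));
      PetscCall(VecDuplicate(ctx->yr, &ctx->t));
      PetscCall(MatGetLocalSize(Kr, &m, &n));
      PetscCall(MatCreateShell(PetscObjectComm((PetscObject)Kr), 2*m, 2*n,
                               PETSC_DETERMINE, PETSC_DETERMINE, ctx, A));
      PetscCall(MatShellSetOperation(*A, MATOP_MULT, (void (*)(void))MatMult_RealForm));
      PetscFunctionReturn(0);
    }

The shell for P would be analogous, with its apply routine running the inner
CG + AMS solve of (Kr+Ki) on each half of the vector.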
> Then we have to break down the timings further. I suspect AMS is not taking 
> as long, since
> all other operations scale like N.
> 
>   Thanks,
> 
>      Matt
> 
>  
> Best,
> Ce
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
