Sorry, I realised that I didn't actually use gamg, which is why those events were missing. When I do use gamg, the 8-core case works, but the 64-core case shows that p diverged.

Why is this so? Btw, I have also added the null space in my code.

Thank you.

Yours sincerely,

TAY wee-beng

On 5/11/2015 12:03 PM, Barry Smith wrote:
   There is a problem here. The -log_summary output doesn't show all the events 
associated with the -pc_type gamg preconditioner; it should have rows like

VecDot                 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0  1613
VecMDot              134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  0  3  0  0  0   0  3  0  0  0  3025
VecNorm              154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   0  1  0  0  0  1578
VecScale             148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0  1039
VecCopy              106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0  1742
VecAYPX              384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   0  1  0  0  0   860
VecAXPBYCZ           192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   0  2  0  0  0  2085
VecWAXPY               2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   636
VecMAXPY             148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  3  0  0  0   1  3  0  0  0  2399
VecPointwiseMult      66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   604
VecScatterBegin       45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSetRandom           6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecReduceArith         4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0  1525
VecReduceComm          2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize         148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   0  1  0  0  0  1177
MatMult              424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  7 37  0  0  0   7 37  0  0  0  2343
MatMultAdd            48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  0  2  0  0  0   0  2  0  0  0  2069
MatMultTranspose      48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  2  0  0  0   1  2  0  0  0  1069
MatSolve              16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   460
MatSOR               354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  9 31  0  0  0   9 31  0  0  0  1631
MatLUFactorSym         2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatLUFactorNum         2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   307
MatScale              18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   874
MatResidual           48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  4  0  0  0   1  4  0  0  0  2212
MatAssemblyBegin      57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd        57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatGetRow          21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatGetRowIJ            2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCoarsen             6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAXPY                6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatFDColorCreate       1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorSetUp        1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatFDColorApply        2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  4  0  0  0   1  4  0  0  0  1826
MatFDColorFunc        42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  4  0  0  0   1  4  0  0  0  2956
MatMatMult             6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  4  2  0  0  0   4  2  0  0  0   241
MatMatMultSym          6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3  0  0  0  0     0
MatMatMultNum          6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  2  2  0  0  0   2  2  0  0  0   679
MatPtAP                6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 
0.0e+00 18 11  0  0  0  18 11  0  0  0   283
MatPtAPSymbolic        6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  8  0  0  0  0   8  0  0  0  0     0
MatPtAPNumeric         6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  9 11  0  0  0   9 11  0  0  0   537
MatTrnMatMult          2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0  0  0    75
MatTrnMatMultSym       2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatTrnMatMultNum       2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   352
MatGetSymTrans         8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog       134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  6  0  0  0   1  6  0  0  0  2491
KSPSetUp              24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 
0.0e+00 94 95  0  0  0  94 95  0  0  0   471
PCGAMGGraph_AGG        6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 
0.0e+00 10  0  0  0  0  10  0  0  0  0     2
PCGAMGCoarse_AGG       6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0  0  0    49
PCGAMGProl_AGG         6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 34  0  0  0  0  34  0  0  0  0     0
PCGAMGPOpt_AGG         6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  9 11  0  0  0   9 11  0  0  0   534
GAMG: createProl       6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 
0.0e+00 55 11  0  0  0  55 11  0  0  0    92
   Graph               12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 
0.0e+00 10  0  0  0  0  10  0  0  0  0     2
   MIS/Agg              6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
   SA: col data         6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0     0
   SA: frmProl0         6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 34  0  0  0  0  34  0  0  0  0     0
   SA: smooth           6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  9 11  0  0  0   9 11  0  0  0   534
GAMG: partLevel        6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 
0.0e+00 18 11  0  0  0  18 11  0  0  0   283
PCSetUp                4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 
0.0e+00 74 22  0  0  0  74 22  0  0  0   137
PCSetUpOnBlocks       16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0    42
PCApply               16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 
0.0e+00 20 70  0  0  0  20 70  0  0  0  1637


Are you sure you ran with -pc_type gamg? What about running with -info: does it 
print anything about gamg? What about -ksp_view: does it indicate it is using 
the gamg preconditioner?
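
(For reference, one way to check this from the command line, reusing the 
executable name and the "poisson_" option prefix that appear elsewhere in this 
thread -- adjust to the actual run script -- would be something like:

   mpiexec -n 8 ./a.out -poisson_pc_type gamg -poisson_ksp_view -info | grep -i gamg

If gamg is really active, the -poisson_ksp_view output should list a PC of type 
gamg together with its multigrid levels.)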


On Nov 4, 2015, at 9:30 PM, TAY wee-beng <[email protected]> wrote:

Hi,

I have attached the 2 logs.

Thank you

Yours sincerely,

TAY wee-beng

On 4/11/2015 1:11 AM, Barry Smith wrote:
    Ok, the convergence looks good. Now run on 8 and 64 processes as before 
with -log_summary and not -ksp_monitor to see how it scales.

   Barry

On Nov 3, 2015, at 6:49 AM, TAY wee-beng <[email protected]> wrote:

Hi,

I tried and have attached the log.

Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify some 
null space stuff, like KSPSetNullSpace or MatNullSpaceCreate?
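
(For a Poisson operator with pure Neumann boundary conditions, the constant 
vector lies in the null space, and attaching it typically looks like the 
following C sketch. This is only an illustration: it assumes the assembled 
Poisson matrix is called A and a recent PETSc, where MatSetNullSpace has 
replaced the older KSPSetNullSpace:

   MatNullSpace nullsp;
   /* PETSC_TRUE: the null space contains the constant vector */
   MatNullSpaceCreate(PETSC_COMM_WORLD, PETSC_TRUE, 0, NULL, &nullsp);
   /* the KSP picks the null space up from the operator */
   MatSetNullSpace(A, nullsp);
   MatNullSpaceDestroy(&nullsp);
)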

Thank you

Yours sincerely,

TAY wee-beng

On 3/11/2015 12:45 PM, Barry Smith wrote:
On Nov 2, 2015, at 10:37 PM, TAY wee-beng<[email protected]>  wrote:

Hi,

I tried :

1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg

2. -poisson_pc_type gamg
    Run with -poisson_ksp_monitor_true_residual 
-poisson_ksp_monitor_converged_reason.
Does your Poisson solve have Neumann boundary conditions? Do you have any zeros 
on the diagonal of the matrix (you shouldn't).

   There may be something wrong with your Poisson discretization that was also 
messing up hypre.



Both options give:

    1      0.00150000      0.00000000      0.00000000 1.00000000             
NaN             NaN             NaN
M Diverged but why?, time =            2
reason =           -9

How can I check what's wrong?
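
(For reference, a converged reason of -9 corresponds to KSP_DIVERGED_NANORINF, 
i.e. a NaN or Inf was produced during the solve, which is consistent with the 
NaN values in the monitor output above. A minimal way to query and print it in 
C, assuming the Poisson KSP handle is called ksp:

   KSPConvergedReason reason;
   KSPGetConvergedReason(ksp, &reason);
   /* negative values mean divergence; -9 is KSP_DIVERGED_NANORINF */
   PetscPrintf(PETSC_COMM_WORLD, "reason = %d\n", (int)reason);
)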

Thank you

Yours sincerely,

TAY wee-beng

On 3/11/2015 3:18 AM, Barry Smith wrote:
    hypre is just not scaling well here. I do not know why. Since hypre is a 
black box for us, there is no way to determine why the scaling is poor.

    If you make the same two runs with -pc_type gamg there will be a lot more 
information in the log summary about which routines are scaling well or poorly.

   Barry



On Nov 2, 2015, at 3:17 AM, TAY wee-beng<[email protected]>  wrote:

Hi,

I have attached the 2 files.

Thank you

Yours sincerely,

TAY wee-beng

On 2/11/2015 2:55 PM, Barry Smith wrote:
   Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then the 
(158)x(266)x(150) grid on 64 processes, and send the two -log_summary results.

   Barry

On Nov 2, 2015, at 12:19 AM, TAY wee-beng<[email protected]>  wrote:

Hi,

I have attached the new results.

Thank you

Yours sincerely,

TAY wee-beng

On 2/11/2015 12:27 PM, Barry Smith wrote:
   Run without the -momentum_ksp_view -poisson_ksp_view and send the new results


   You can see from the log summary that PCSetUp is taking a much smaller 
percentage of the time, meaning that it is reusing the preconditioner and not 
rebuilding it each time.

Barry

   Something makes no sense with the output: it gives

KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 
5.0e+02 90100 66100 24  90100 66100 24   165

90% of the time is in the solve, but there is no significant amount of time in 
the other events of the code, which is just not possible. I hope it is due to 
your IO.



On Nov 1, 2015, at 10:02 PM, TAY wee-beng<[email protected]>  wrote:

Hi,

I have attached the new run with 100 time steps for 48 and 96 cores.

Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the 
preconditioner, what must I do? Or what must I not do?
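
(A minimal C sketch of the usual pattern when only the RHS changes, assuming 
the Poisson matrix A, the KSP handle ksp, and placeholder names b, x, step, 
nsteps; KSPSetReusePreconditioner is available in recent PETSc versions:

   /* set the operator once, before the time loop */
   KSPSetOperators(ksp, A, A);
   /* keep the already-built preconditioner across solves */
   KSPSetReusePreconditioner(ksp, PETSC_TRUE);
   for (step = 0; step < nsteps; step++) {
     /* only the right-hand side b changes each time step */
     KSPSolve(ksp, b, x);
   }

The main thing NOT to do is to call KSPSetOperators with a rebuilt matrix, or 
destroy and recreate the KSP, every time step.)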

Why does the time increase so much with the number of processes? Is there 
something wrong with my coding? It seems to be so for my new run too.

Thank you

Yours sincerely,

TAY wee-beng

On 2/11/2015 9:49 AM, Barry Smith wrote:
   If you are doing many time steps with the same linear solver then you MUST 
do your weak scaling studies with MANY time steps, since the setup time of AMG 
only takes place in the first timestep. So run both 48 and 96 processes with 
the same large number of time steps.

   Barry



On Nov 1, 2015, at 7:35 PM, TAY wee-beng<[email protected]>  wrote:

Hi,

Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores 
(log48), together with the 96-core log (log96).

Why does the time increase so much with the number of processes? Is there 
something wrong with my coding?

Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the 
preconditioner, what must I do? Or what must I not do?

Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps 
(log48_10). Is it building the preconditioner at every timestep?

Also, what about momentum eqn? Is it working well?

I will try the gamg later too.

Thank you

Yours sincerely,

TAY wee-beng

On 2/11/2015 12:30 AM, Barry Smith wrote:
   You used gmres with 48 processes but richardson with 96. You need to be 
careful and make sure you don't change the solvers when you change the number 
of processors, since you can get very different, inconsistent results.

    Anyway, all the time is being spent in the BoomerAMG algebraic multigrid 
setup and it is scaling badly. When you double the problem size and number 
of processes it went from 3.2445e+01 to 4.3599e+02 seconds.

PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 
4.0e+00 62  8  0  0  4  62  8  0  0  5    11

PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 
4.0e+00 85 18  0  0  6  85 18  0  0  6     2

   Now, is the Poisson problem changing at each timestep, or can you use the 
same preconditioner built with BoomerAMG for all the time steps? Algebraic 
multigrid has a large setup time that often doesn't matter if you have many 
time steps, but if you have to rebuild it each timestep it is too large.

   You might also try -pc_type gamg and see how PETSc's algebraic multigrid 
scales for your problem/machine.

   Barry



On Nov 1, 2015, at 7:30 AM, TAY wee-beng<[email protected]>  wrote:


On 1/11/2015 10:00 AM, Barry Smith wrote:
On Oct 31, 2015, at 8:43 PM, TAY wee-beng<[email protected]>  wrote:


On 1/11/2015 12:47 AM, Matthew Knepley wrote:
On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng<[email protected]>  wrote:
Hi,

I understand that, as mentioned in the FAQ, the scaling is not linear due to 
memory limitations. So I am trying to write a proposal to use a supercomputer.
Its specs are:
Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)

8 cores / processor
Interconnect: Tofu (6-dimensional mesh/torus)
Each cabinet contains 96 computing nodes.
One of the requirements is to give the performance of my current code with my 
current set of data, and there is a formula to calculate the estimated parallel 
efficiency when using the new, larger set of data.
There are 2 ways to give performance:
1. Strong scaling, which is defined as how the elapsed time varies with the 
number of processors for a fixed problem size.
2. Weak scaling, which is defined as how the elapsed time varies with the 
number of processors for a fixed problem size per processor.
I ran my cases with 48 and 96 cores on my current cluster, taking 140 and 90 
mins respectively. This is classified as strong scaling.
Cluster specs:
CPU: AMD 6234 2.4 GHz
8 cores / processor (CPU)
6 CPUs / node
So 48 cores / node
Not sure about the memory / node

The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates 
how efficiently the program is accelerated by parallel processing. ‘En’ is 
given by the following formulae. Although their derivation processes differ 
between strong and weak scaling, the derived formulae are the same.
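
(The proposal's formulae themselves are not quoted in this thread; for 
reference, the usual definitions they presumably reduce to, written in LaTeX 
relative to a reference run on p_0 cores with elapsed time T_{p_0}, are:

   % strong scaling: fixed total problem size
   E^{\mathrm{strong}}_{p} = \frac{p_0 \, T_{p_0}}{p \, T_{p}}
   % weak scaling: fixed problem size per process
   E^{\mathrm{weak}}_{p} = \frac{T_{p_0}}{T_{p}}

so for strong scaling a doubling of cores should roughly halve the time, and 
for weak scaling the time should stay roughly constant as the problem size and 
core count grow together.)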
From the estimated time, my parallel efficiency using Amdahl's law on the 
current old cluster was 52.7%. Are my results acceptable?
For the large data set, using 2205 nodes (2205x8 cores), my expected parallel 
efficiency is only 0.5%. The proposal recommends a value of > 50%.
The problem with this analysis is that the estimated serial fraction from 
Amdahl's Law  changes as a function
of problem size, so you cannot take the strong scaling from one problem and 
apply it to another without a
model of this dependence.

Weak scaling does model changes with problem size, so I would measure weak 
scaling on your current
cluster, and extrapolate to the big machine. I realize that this does not make 
sense for many scientific
applications, but neither does requiring a certain parallel efficiency.
Ok, I checked the results for my weak scaling and the expected parallel 
efficiency is even worse. From the formula used, it's obvious that it applies 
some sort of exponential decrease in the extrapolation. So unless I can achieve 
nearly > 90% speed-up when I double the cores and problem size for my current 
48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a 
much lower expected parallel efficiency for the new case.

However, the FAQ mentions that, due to memory limitations, it's impossible to 
get > 90% speed-up when I double the cores and problem size (i.e. a linear 
increase in performance), which means that I can't get > 90% speed-up for my 
current 48/96-core setup. Is that so?
   What is the output of -ksp_view -log_summary on the problem and then on the 
problem doubled in size and number of processors?

   Barry
Hi,

I have attached the output

48 cores: log48
96 cores: log96

There are 2 solvers - The momentum linear eqn uses bcgs, while the Poisson eqn 
uses hypre BoomerAMG.

Problem size doubled from 158x266x150 to 158x266x300.
So is it fair to say that the main problem does not lie in my programming 
skills, but rather in the way the linear equations are solved?

Thanks.
   Thanks,

      Matt
Is it possible to achieve this type of scaling (> 50%) in PETSc when using 
17640 (2205x8) cores?
Btw, I do not have access to the system.






--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener
<log48.txt><log96.txt>
<log48_10.txt><log48.txt><log96.txt>
<log96_100.txt><log48_100.txt>
<log96_100_2.txt><log48_100_2.txt>
<log64_100.txt><log8_100.txt>
<log.txt>
<log64_100_2.txt><log8_100_2.txt>
