> On Nov 5, 2015, at 9:58 AM, TAY wee-beng <[email protected]> wrote:
> 
> Sorry, I realised that I didn't use gamg, and that's why. If I use gamg, 
> the 8-core case worked, but the 64-core case showed that p diverged.
> 
> Why is this so? Btw, I have also added a null space in my code.

   You don't need the null space and should not add it.
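
   For reference, the attachment in question is usually just the calls below 
(a minimal sketch; the function name, the matrix A, and the use of 
MatSetNullSpace() are assumptions about your code), and these are what should 
come out:

   #include <petscksp.h>

   /* Sketch: constant-null-space attachment for a singular Neumann Poisson
      matrix -- per the advice above, do not do this here. */
   static PetscErrorCode AttachConstantNullSpace(Mat A)
   {
     MatNullSpace   nullsp;
     PetscErrorCode ierr;

     PetscFunctionBeginUser;
     ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A),PETSC_TRUE,0,NULL,&nullsp);CHKERRQ(ierr);
     ierr = MatSetNullSpace(A,nullsp);CHKERRQ(ierr);   /* KSPSetNullSpace() in older PETSc */
     ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
     PetscFunctionReturn(0);
   }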

> 
> Thank you.
> 
> Yours sincerely,
> 
> TAY wee-beng
> 
> On 5/11/2015 12:03 PM, Barry Smith wrote:
>>   There is a problem here. The -log_summary doesn't show all the events 
>> associated with the -pc_type gamg preconditioner; it should have rows like
>> 
>> VecDot                 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  1613
>> VecMDot              134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  3  0  0  0   0  3  0  0  0  3025
>> VecNorm              154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   0  1  0  0  0  1578
>> VecScale             148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  1039
>> VecCopy              106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet               474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAXPY               54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  1742
>> VecAYPX              384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   0  1  0  0  0   860
>> VecAXPBYCZ           192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  2  0  0  0   0  2  0  0  0  2085
>> VecWAXPY               2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   636
>> VecMAXPY             148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  3  0  0  0   1  3  0  0  0  2399
>> VecPointwiseMult      66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   604
>> VecScatterBegin       45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSetRandom           6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecReduceArith         4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  1525
>> VecReduceComm          2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecNormalize         148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  1  0  0  0   0  1  0  0  0  1177
>> MatMult              424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  7 37  0  0  0   7 37  0  0  0  2343
>> MatMultAdd            48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  2  0  0  0   0  2  0  0  0  2069
>> MatMultTranspose      48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  2  0  0  0   1  2  0  0  0  1069
>> MatSolve              16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   460
>> MatSOR               354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  9 31  0  0  0   9 31  0  0  0  1631
>> MatLUFactorSym         2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatLUFactorNum         2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   307
>> MatScale              18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   874
>> MatResidual           48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  4  0  0  0   1  4  0  0  0  2212
>> MatAssemblyBegin      57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd        57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> MatGetRow          21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> MatGetRowIJ            2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatGetOrdering         2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatCoarsen             6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatZeroEntries         2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAXPY                6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> MatFDColorCreate       1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatFDColorSetUp        1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatFDColorApply        2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  4  0  0  0   1  4  0  0  0  1826
>> MatFDColorFunc        42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  4  0  0  0   1  4  0  0  0  2956
>> MatMatMult             6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  2  0  0  0   4  2  0  0  0   241
>> MatMatMultSym          6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
>> MatMatMultNum          6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  2  0  0  0   2  2  0  0  0   679
>> MatPtAP                6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 11  0  0  0  18 11  0  0  0   283
>> MatPtAPSymbolic        6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  8  0  0  0  0   8  0  0  0  0     0
>> MatPtAPNumeric         6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  9 11  0  0  0   9 11  0  0  0   537
>> MatTrnMatMult          2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0    75
>> MatTrnMatMultSym       2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatTrnMatMultNum       2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0   352
>> MatGetSymTrans         8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPGMRESOrthog       134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  6  0  0  0   1  6  0  0  0  2491
>> KSPSetUp              24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve               2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 94 95  0  0  0  94 95  0  0  0   471
>> PCGAMGGraph_AGG        6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 10  0  0  0  0  10  0  0  0  0     2
>> PCGAMGCoarse_AGG       6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0    49
>> PCGAMGProl_AGG         6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00 34  0  0  0  0  34  0  0  0  0     0
>> PCGAMGPOpt_AGG         6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  9 11  0  0  0   9 11  0  0  0   534
>> GAMG: createProl       6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 55 11  0  0  0  55 11  0  0  0    92
>>   Graph               12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 10  0  0  0  0  10  0  0  0  0     2
>>   MIS/Agg              6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>   SA: col data         6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>   SA: frmProl0         6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00 34  0  0  0  0  34  0  0  0  0     0
>>   SA: smooth           6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  9 11  0  0  0   9 11  0  0  0   534
>> GAMG: partLevel        6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 11  0  0  0  18 11  0  0  0   283
>> PCSetUp                4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 74 22  0  0  0  74 22  0  0  0   137
>> PCSetUpOnBlocks       16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0    42
>> PCApply               16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 20 70  0  0  0  20 70  0  0  0  1637
>> 
>> 
>> Are you sure you ran with -pc_type gamg? What about running with -info: 
>> does it print anything about gamg? What about -ksp_view: does it indicate 
>> that it is using the gamg preconditioner?
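>>
>>    If it is easier than reading the -ksp_view output, you can also check 
>> programmatically which preconditioner was actually selected (a minimal 
>> sketch; ksp here is assumed to be your Poisson KSP, queried after 
>> KSPSetFromOptions()/KSPSetUp()):
>>
>>   PC     pc;
>>   PCType pctype;
>>   ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
>>   ierr = PCGetType(pc,&pctype);CHKERRQ(ierr);
>>   ierr = PetscPrintf(PETSC_COMM_WORLD,"Poisson PC type: %s\n",pctype);CHKERRQ(ierr);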
>> 
>> 
>>> On Nov 4, 2015, at 9:30 PM, TAY wee-beng <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I have attached the 2 logs.
>>> 
>>> Thank you
>>> 
>>> Yours sincerely,
>>> 
>>> TAY wee-beng
>>> 
>>> On 4/11/2015 1:11 AM, Barry Smith wrote:
>>>>    Ok, the convergence looks good. Now run on 8 and 64 processes as before 
>>>> with -log_summary and not -ksp_monitor to see how it scales.
>>>> 
>>>>   Barry
>>>> 
>>>>> On Nov 3, 2015, at 6:49 AM, TAY wee-beng <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I tried and have attached the log.
>>>>> 
>>>>> Yes, my Poisson eqn has Neumann boundary conditions. Do I need to specify 
>>>>> some null space stuff, like KSPSetNullSpace or MatNullSpaceCreate?
>>>>> 
>>>>> Thank you
>>>>> 
>>>>> Yours sincerely,
>>>>> 
>>>>> TAY wee-beng
>>>>> 
>>>>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>>>>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng<[email protected]>  wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I tried :
>>>>>>> 
>>>>>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>>>>> 
>>>>>>> 2. -poisson_pc_type gamg
>>>>>>    Run with -poisson_ksp_monitor_true_residual and 
>>>>>> -poisson_ksp_converged_reason
>>>>>> Does your Poisson problem have Neumann boundary conditions? Do you have 
>>>>>> any zeros on the diagonal of the matrix (you shouldn't)?
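>>>>>>
>>>>>>    A quick way to check the diagonal (a minimal sketch; A is assumed to 
>>>>>> be your assembled Poisson matrix):
>>>>>>
>>>>>>   Vec       d;
>>>>>>   PetscReal dmin;
>>>>>>   ierr = MatCreateVecs(A,NULL,&d);CHKERRQ(ierr);
>>>>>>   ierr = MatGetDiagonal(A,d);CHKERRQ(ierr);
>>>>>>   ierr = VecAbs(d);CHKERRQ(ierr);
>>>>>>   ierr = VecMin(d,NULL,&dmin);CHKERRQ(ierr);   /* smallest |a_ii| */
>>>>>>   ierr = PetscPrintf(PETSC_COMM_WORLD,"min |diagonal| = %g\n",(double)dmin);CHKERRQ(ierr);
>>>>>>   ierr = VecDestroy(&d);CHKERRQ(ierr);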
>>>>>> 
>>>>>>   There may be something wrong with your Poisson discretization that was 
>>>>>> also messing up hypre.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> Both options give:
>>>>>>> 
>>>>>>>    1      0.00150000      0.00000000      0.00000000 1.00000000         
>>>>>>>     NaN             NaN             NaN
>>>>>>> M Diverged but why?, time =            2
>>>>>>> reason =           -9
>>>>>>> 
>>>>>>> How can I check what's wrong?
>>>>>>> 
>>>>>>> Thank you
>>>>>>> 
>>>>>>> Yours sincerely,
>>>>>>> 
>>>>>>> TAY wee-beng
>>>>>>> 
>>>>>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>>>>>>    hypre is just not scaling well here. I do not know why. Since hypre 
>>>>>>>> is a black box for us, there is no way to determine why the scaling is 
>>>>>>>> poor.
>>>>>>>> 
>>>>>>>>    If you make the same two runs with -pc_type gamg, there will be a 
>>>>>>>> lot more information in the log summary about which routines are 
>>>>>>>> scaling well or poorly.
>>>>>>>> 
>>>>>>>>   Barry
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I have attached the 2 files.
>>>>>>>>> 
>>>>>>>>> Thank you
>>>>>>>>> 
>>>>>>>>> Yours sincerely,
>>>>>>>>> 
>>>>>>>>> TAY wee-beng
>>>>>>>>> 
>>>>>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>>>>>>   Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then 
>>>>>>>>>> (158)x(266)x(150) on 64 processes, and send the two -log_summary 
>>>>>>>>>> results.
>>>>>>>>>> 
>>>>>>>>>>   Barry
>>>>>>>>>> 
>>>>>>>>>>  
>>>>>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have attached the new results.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you
>>>>>>>>>>> 
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>> 
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>> 
>>>>>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>>>>>>   Run without the -momentum_ksp_view and -poisson_ksp_view options 
>>>>>>>>>>>> and send the new results.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>   You can see from the log summary that the PCSetUp is taking a 
>>>>>>>>>>>> much smaller percentage of the time meaning that it is reusing the 
>>>>>>>>>>>> preconditioner and not rebuilding it each time.
>>>>>>>>>>>> 
>>>>>>>>>>>> Barry
>>>>>>>>>>>> 
>>>>>>>>>>>>   Something makes no sense with the output: it gives
>>>>>>>>>>>> 
>>>>>>>>>>>> KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 
>>>>>>>>>>>> 9.9e+05 5.0e+02 90100 66100 24  90100 66100 24   165
>>>>>>>>>>>> 
>>>>>>>>>>>> 90% of the time is in the solve, but there is no significant amount 
>>>>>>>>>>>> of time in the other events of the code, which is just not possible. 
>>>>>>>>>>>> I hope it is due to your I/O.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96 
>>>>>>>>>>>>> cores.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I 
>>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I 
>>>>>>>>>>>>> not do?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Why does the time increase so much with the number of processes? 
>>>>>>>>>>>>> Is there something wrong with my coding? It seems to be so for my 
>>>>>>>>>>>>> new run too.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>>>>>>   If you are doing many time steps with the same linear solver, 
>>>>>>>>>>>>>> then you MUST do your weak scaling studies with MANY time steps, 
>>>>>>>>>>>>>> since the setup time of AMG only takes place in the first 
>>>>>>>>>>>>>> time step. So run both 48 and 96 processes with the same large 
>>>>>>>>>>>>>> number of time steps.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new 
>>>>>>>>>>>>>>> log for 48 cores (log48), together with the 96-core log (log96).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Why does the time increase so much with the number of processes? 
>>>>>>>>>>>>>>> Is there something wrong with my coding?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I 
>>>>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must 
>>>>>>>>>>>>>>> I not do?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I have run 
>>>>>>>>>>>>>>> for 10 time steps (log48_10). Is it building the preconditioner 
>>>>>>>>>>>>>>> at every time step?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Also, what about momentum eqn? Is it working well?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>>   You used gmres with 48 processes but richardson with 96. You 
>>>>>>>>>>>>>>>> need to be careful and make sure you don't change the solvers 
>>>>>>>>>>>>>>>> when you change the number of processors, since you can get 
>>>>>>>>>>>>>>>> very different, inconsistent results.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>    Anyway, all the time is being spent in the BoomerAMG 
>>>>>>>>>>>>>>>> algebraic multigrid setup, and it is scaling badly. When you 
>>>>>>>>>>>>>>>> double the problem size and number of processes, it goes from 
>>>>>>>>>>>>>>>> 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0 
>>>>>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0  5    11
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0 
>>>>>>>>>>>>>>>> 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0  6     2
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>   Now, is the Poisson problem changing at each time step, or can 
>>>>>>>>>>>>>>>> you use the same preconditioner built with BoomerAMG for all 
>>>>>>>>>>>>>>>> the time steps? Algebraic multigrid has a large setup time 
>>>>>>>>>>>>>>>> that often doesn't matter if you have many time steps, but 
>>>>>>>>>>>>>>>> if you have to rebuild it at each time step it may be too large.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>   You might also try -pc_type gamg and see how PETSc's 
>>>>>>>>>>>>>>>> algebraic multigrid scales for your problem/machine.
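>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   If the matrix really is identical at every time step, the 
>>>>>>>>>>>>>>>> AMG setup only needs to be done once; a minimal sketch (ksp 
>>>>>>>>>>>>>>>> and A stand for your Poisson KSP and matrix) is
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   /* once, outside the time loop */
>>>>>>>>>>>>>>>>   ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
>>>>>>>>>>>>>>>>   ierr = KSPSetReusePreconditioner(ksp,PETSC_TRUE);CHKERRQ(ierr);
>>>>>>>>>>>>>>>>   /* in the time loop only the RHS b changes: KSPSolve(ksp,b,x) */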
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng<[email protected]> 
>>>>>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY 
>>>>>>>>>>>>>>>>>>>> wee-beng<[email protected]>  wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the 
>>>>>>>>>>>>>>>>>>>> limitations in memory, the scaling is not linear. So I am 
>>>>>>>>>>>>>>>>>>>> trying to write a proposal to use a supercomputer.
>>>>>>>>>>>>>>>>>>>> Its specs are:
>>>>>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of 
>>>>>>>>>>>>>>>>>>>> memory per node)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my 
>>>>>>>>>>>>>>>>>>>> current code with my current set of data, and there is a 
>>>>>>>>>>>>>>>>>>>> formula to calculate the estimated parallel efficiency 
>>>>>>>>>>>>>>>>>>>> when using the new large set of data.
>>>>>>>>>>>>>>>>>>>> There are 2 ways to report performance:
>>>>>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed 
>>>>>>>>>>>>>>>>>>>> time varies with the number of processors for a fixed 
>>>>>>>>>>>>>>>>>>>> total problem size.
>>>>>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time 
>>>>>>>>>>>>>>>>>>>> varies with the number of processors for a fixed problem 
>>>>>>>>>>>>>>>>>>>> size per processor.
>>>>>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores on my current 
>>>>>>>>>>>>>>>>>>>> cluster, giving 140 and 90 mins respectively. This is 
>>>>>>>>>>>>>>>>>>>> classified as strong scaling.
>>>>>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>>>>>> 6 CPUs / node
>>>>>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of 
>>>>>>>>>>>>>>>>>>>> parallelism ‘n’ indicates how efficiently the program is 
>>>>>>>>>>>>>>>>>>>> accelerated by parallel processing. ‘En’ is given by the 
>>>>>>>>>>>>>>>>>>>> following formulae. Although their derivation processes 
>>>>>>>>>>>>>>>>>>>> differ between strong and weak scaling, the derived 
>>>>>>>>>>>>>>>>>>>> formulae are the same.
>>>>>>>>>>>>>>>>>>>> From the estimated time, my parallel efficiency using 
>>>>>>>>>>>>>>>>>>>> Amdahl's law on the current old cluster was 52.7%.
>>>>>>>>>>>>>>>>>>>> So are my results acceptable?
>>>>>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 
>>>>>>>>>>>>>>>>>>>> cores), my expected parallel efficiency is only 0.5%. The 
>>>>>>>>>>>>>>>>>>>> proposal recommends a value of > 50%.
>>>>>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated 
>>>>>>>>>>>>>>>>>>>> serial fraction from Amdahl's Law  changes as a function
>>>>>>>>>>>>>>>>>>>> of problem size, so you cannot take the strong scaling 
>>>>>>>>>>>>>>>>>>>> from one problem and apply it to another without a
>>>>>>>>>>>>>>>>>>>> model of this dependence.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I 
>>>>>>>>>>>>>>>>>>>> would measure weak scaling on your current
>>>>>>>>>>>>>>>>>>>> cluster, and extrapolate to the big machine. I realize 
>>>>>>>>>>>>>>>>>>>> that this does not make sense for many scientific
>>>>>>>>>>>>>>>>>>>> applications, but neither does requiring a certain 
>>>>>>>>>>>>>>>>>>>> parallel efficiency.
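>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For concreteness, the textbook forms (only a sketch; the 
>>>>>>>>>>>>>>>>>>>> exact formula in the proposal template may differ) are
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>   E_{strong}(n) = \frac{T_1}{n \, T_n}, \qquad E_{weak}(n) = \frac{T_1}{T_n},
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> where T_n is the run time on n cores, and Amdahl's law with 
>>>>>>>>>>>>>>>>>>>> serial fraction s gives
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>   S(n) = \frac{1}{s + (1-s)/n}, \qquad E(n) = \frac{S(n)}{n}.
>>>>>>>>>>>>>>>>>>>>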
>>>>>>>>>>>>>>>>>>> OK, I checked the results for my weak scaling; the expected 
>>>>>>>>>>>>>>>>>>> parallel efficiency is even worse. From the formula used, it's 
>>>>>>>>>>>>>>>>>>> obvious that it does some sort of exponentially decreasing 
>>>>>>>>>>>>>>>>>>> extrapolation. So unless I can achieve nearly a > 90% speed-up 
>>>>>>>>>>>>>>>>>>> when I double the cores and problem size for my current 48/96 
>>>>>>>>>>>>>>>>>>> cores setup, extrapolating from about 96 nodes to 10,000 nodes 
>>>>>>>>>>>>>>>>>>> will give a much lower expected parallel efficiency for the 
>>>>>>>>>>>>>>>>>>> new case.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that, due to memory 
>>>>>>>>>>>>>>>>>>> requirements, it's impossible to get a >90% speed-up when I 
>>>>>>>>>>>>>>>>>>> double the cores and problem size (i.e. a linear increase in 
>>>>>>>>>>>>>>>>>>> performance), which means that I can't get a >90% speed-up 
>>>>>>>>>>>>>>>>>>> when I double the cores and problem size for my current 
>>>>>>>>>>>>>>>>>>> 48/96 cores setup. Is that so?
>>>>>>>>>>>>>>>>>>   What is the output of -ksp_view -log_summary on the 
>>>>>>>>>>>>>>>>>> problem and then on the problem doubled in size and number 
>>>>>>>>>>>>>>>>>> of processors?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I have attached the output
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> There are 2 solvers: the momentum linear eqn uses bcgs, 
>>>>>>>>>>>>>>>>> while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in 
>>>>>>>>>>>>>>>>>>> my programming skills, but rather in the way the linear 
>>>>>>>>>>>>>>>>>>> equations are solved?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>   Thanks,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>      Matt
>>>>>>>>>>>>>>>>>>>> Is this type of scaling (> 50%) possible in PETSc when 
>>>>>>>>>>>>>>>>>>>> using 17640 (2205 x 8) cores?
>>>>>>>>>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Sent using CloudMagic Email
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin 
>>>>>>>>>>>>>>>>>>>> their experiments is infinitely more interesting than any 
>>>>>>>>>>>>>>>>>>>> results to which their experiments lead.
>>>>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>>>>>> <log64_100.txt><log8_100.txt>
>>>>> <log.txt>
>>> <log64_100_2.txt><log8_100_2.txt>
> 
