Ok, the 64-core case not converging makes no sense.

   Run it with -ksp_monitor and -ksp_converged_reason turned on for the pressure solve, together with -info.

   You need to figure out why it is not converging.
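
   For example, after the pressure KSPSolve you can also query the result directly in code. A minimal sketch (ksp_poisson, b_p and x_p are placeholders for whatever your code actually uses):

     KSPConvergedReason reason;
     PetscInt           its;
     PetscErrorCode     ierr;

     ierr = KSPSolve(ksp_poisson,b_p,x_p);CHKERRQ(ierr);
     ierr = KSPGetConvergedReason(ksp_poisson,&reason);CHKERRQ(ierr);
     ierr = KSPGetIterationNumber(ksp_poisson,&its);CHKERRQ(ierr);
     ierr = PetscPrintf(PETSC_COMM_WORLD,"Pressure solve: reason %d, iterations %D\n",(int)reason,its);CHKERRQ(ierr);

   A negative reason pinpoints how the solve failed; together with the monitor output it should show where things go wrong.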

  Barry

> On Nov 5, 2015, at 8:47 PM, TAY wee-beng <[email protected]> wrote:
> 
> Hi,
> 
> I have removed the nullspace and attached the new logs.
> 
> Thank you
> 
> Yours sincerely,
> 
> TAY wee-beng
> 
> On 6/11/2015 12:07 AM, Barry Smith wrote:
>>> On Nov 5, 2015, at 9:58 AM, TAY wee-beng <[email protected]> wrote:
>>> 
>>> Sorry, I realised that I didn't use gamg and that's why. When I do use gamg, 
>>> the 8-core case works, but the 64-core case shows p diverged.
>>    Where is the log file for the 8 core case? And where is all the output 
>> from where it fails with 64 cores? Include -ksp_monitor_true_residual and 
>> -ksp_converged_reason
>> 
>>   Barry
>> 
>>> Why is this so? Btw, I have also added nullspace in my code.
>>> 
>>> Thank you.
>>> 
>>> Yours sincerely,
>>> 
>>> TAY wee-beng
>>> 
>>> On 5/11/2015 12:03 PM, Barry Smith wrote:
>>>>   There is a problem here. The -log_summary doesn't show all the events 
>>>> associated with the -pc_type gamg preconditioner; it should have rows like:
>>>> 
>>>> VecDot                 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1613
>>>> VecMDot              134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  3025
>>>> VecNorm              154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1578
>>>> VecScale             148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1039
>>>> VecCopy              106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecSet               474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecAXPY               54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1742
>>>> VecAYPX              384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   860
>>>> VecAXPBYCZ           192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  2085
>>>> VecWAXPY               2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   636
>>>> VecMAXPY             148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  3  0  0  0   1  3  0  0  0  2399
>>>> VecPointwiseMult      66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   604
>>>> VecScatterBegin       45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecSetRandom           6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecReduceArith         4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1525
>>>> VecReduceComm          2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecNormalize         148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1177
>>>> MatMult              424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00  7 37  0  0  0   7 37  0  0  0  2343
>>>> MatMultAdd            48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  2069
>>>> MatMultTranspose      48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0  1069
>>>> MatSolve              16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   460
>>>> MatSOR               354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00  9 31  0  0  0   9 31  0  0  0  1631
>>>> MatLUFactorSym         2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatLUFactorNum         2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   307
>>>> MatScale              18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   874
>>>> MatResidual           48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0  2212
>>>> MatAssemblyBegin      57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAssemblyEnd        57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>>> MatGetRow          21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>>> MatGetRowIJ            2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatGetOrdering         2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatCoarsen             6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatZeroEntries         2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAXPY                6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>>> MatFDColorCreate       1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatFDColorSetUp        1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> MatFDColorApply        2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0  1826
>>>> MatFDColorFunc        42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0  2956
>>>> MatMatMult             6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  4  2  0  0  0   4  2  0  0  0   241
>>>> MatMatMultSym          6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
>>>> MatMatMultNum          6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0   2  2  0  0  0   679
>>>> MatPtAP                6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   283
>>>> MatPtAPSymbolic        6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  8  0  0  0  0   8  0  0  0  0     0
>>>> MatPtAPNumeric         6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00  9 11  0  0  0   9 11  0  0  0   537
>>>> MatTrnMatMult          2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0    75
>>>> MatTrnMatMultSym       2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatTrnMatMultNum       2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   352
>>>> MatGetSymTrans         8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPGMRESOrthog       134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  6  0  0  0   1  6  0  0  0  2491
>>>> KSPSetUp              24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSolve               2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95  0  0  0  94 95  0  0  0   471
>>>> PCGAMGGraph_AGG        6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     2
>>>> PCGAMGCoarse_AGG       6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0    49
>>>> PCGAMGProl_AGG         6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34  0  0  0  0  34  0  0  0  0     0
>>>> PCGAMGPOpt_AGG         6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  9 11  0  0  0   9 11  0  0  0   534
>>>> GAMG: createProl       6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11  0  0  0  55 11  0  0  0    92
>>>>   Graph               12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     2
>>>>   MIS/Agg              6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>   SA: col data         6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>   SA: frmProl0         6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34  0  0  0  0  34  0  0  0  0     0
>>>>   SA: smooth           6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  9 11  0  0  0   9 11  0  0  0   534
>>>> GAMG: partLevel        6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   283
>>>> PCSetUp                4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22  0  0  0  74 22  0  0  0   137
>>>> PCSetUpOnBlocks       16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    42
>>>> PCApply               16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70  0  0  0  20 70  0  0  0  1637
>>>> 
>>>> 
>>>> Are you sure you ran with -pc_type gamg? When you run with -info, does it 
>>>> print anything about gamg? Does -ksp_view indicate it is using the gamg 
>>>> preconditioner?
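>>>> 
>>>> A quick programmatic check is also possible. A minimal sketch (ksp_poisson is a 
>>>> placeholder for whatever KSP object your pressure solve uses):
>>>> 
>>>>   PC     pc;
>>>>   PCType pctype;
>>>>   ierr = KSPGetPC(ksp_poisson,&pc);CHKERRQ(ierr);
>>>>   ierr = PCGetType(pc,&pctype);CHKERRQ(ierr);
>>>>   ierr = PetscPrintf(PETSC_COMM_WORLD,"Poisson PC type: %s\n",pctype);CHKERRQ(ierr);
>>>> 
>>>> After KSPSetFromOptions has processed the options, this should print gamg.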
>>>> 
>>>> 
>>>>> On Nov 4, 2015, at 9:30 PM, TAY wee-beng <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have attached the 2 logs.
>>>>> 
>>>>> Thank you
>>>>> 
>>>>> Yours sincerely,
>>>>> 
>>>>> TAY wee-beng
>>>>> 
>>>>> On 4/11/2015 1:11 AM, Barry Smith wrote:
>>>>>>    Ok, the convergence looks good. Now run on 8 and 64 processes as 
>>>>>> before with -log_summary and not -ksp_monitor to see how it scales.
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>>> On Nov 3, 2015, at 6:49 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I tried and have attached the log.
>>>>>>> 
>>>>>>> Ya, my Poisson eqn has Neumann boundary condition. Do I need to specify 
>>>>>>> some null space stuff?  Like KSPSetNullSpace or MatNullSpaceCreate?
>>>>>>> 
>>>>>>> Thank you
>>>>>>> 
>>>>>>> Yours sincerely,
>>>>>>> 
>>>>>>> TAY wee-beng
>>>>>>> 
>>>>>>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>>>>>>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I tried :
>>>>>>>>> 
>>>>>>>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>>>>>>> 
>>>>>>>>> 2. -poisson_pc_type gamg
>>>>>>>>    Run with -poisson_ksp_monitor_true_residual 
>>>>>>>> -poisson_ksp_monitor_converged_reason
>>>>>>>> Does your poisson have Neumann boundary conditions? Do you have any 
>>>>>>>> zeros on the diagonal for the matrix (you shouldn't).
>>>>>>>> 
>>>>>>>>   There may be something wrong with your poisson discretization that 
>>>>>>>> was also messing up hypre
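>>>>>>>> 
>>>>>>>>   If the matrix really is singular because of pure Neumann conditions, the 
>>>>>>>> usual fix is to attach the constant null space. A minimal sketch (assuming 
>>>>>>>> the assembled Poisson matrix is called A; adjust to your own names):
>>>>>>>> 
>>>>>>>>     MatNullSpace nullsp;
>>>>>>>>     ierr = MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_TRUE,0,NULL,&nullsp);CHKERRQ(ierr);
>>>>>>>>     ierr = MatSetNullSpace(A,nullsp);CHKERRQ(ierr);
>>>>>>>>     ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
>>>>>>>> 
>>>>>>>>   (On older PETSc releases the call is KSPSetNullSpace(ksp,nullsp) instead of 
>>>>>>>> MatSetNullSpace.)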
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Both options give:
>>>>>>>>> 
>>>>>>>>>    1      0.00150000      0.00000000      0.00000000 1.00000000        NaN             NaN             NaN
>>>>>>>>> M Diverged but why?, time =            2
>>>>>>>>> reason =           -9
>>>>>>>>> 
>>>>>>>>> How can I check what's wrong?
>>>>>>>>> 
>>>>>>>>> Thank you
>>>>>>>>> 
>>>>>>>>> Yours sincerely,
>>>>>>>>> 
>>>>>>>>> TAY wee-beng
>>>>>>>>> 
>>>>>>>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>>>>>>>>    hypre is just not scaling well here. I do not know why. Since 
>>>>>>>>>> hypre is a black box for us, there is no way to determine the cause 
>>>>>>>>>> of the poor scaling.
>>>>>>>>>> 
>>>>>>>>>>    If you make the same two runs with -pc_type gamg there will be a 
>>>>>>>>>> lot more information in the log summary about in what routines it is 
>>>>>>>>>> scaling well or poorly.
>>>>>>>>>> 
>>>>>>>>>>   Barry
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have attached the 2 files.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you
>>>>>>>>>>> 
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>> 
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>> 
>>>>>>>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>>>>>>>>   Run (158/2)x(266/2)x(150/2) grid on 8 processes  and then 
>>>>>>>>>>>> (158)x(266)x(150) on 64 processors  and send the two -log_summary 
>>>>>>>>>>>> results
>>>>>>>>>>>> 
>>>>>>>>>>>>   Barry
>>>>>>>>>>>> 
>>>>>>>>>>>>  
>>>>>>>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have attached the new results.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>>>>>>>>   Run without the -momentum_ksp_view -poisson_ksp_view and send 
>>>>>>>>>>>>>> the new results
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   You can see from the log summary that the PCSetUp is taking a 
>>>>>>>>>>>>>> much smaller percentage of the time meaning that it is reusing 
>>>>>>>>>>>>>> the preconditioner and not rebuilding it each time.
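>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   In code that usually amounts to building the Poisson KSP once and only 
>>>>>>>>>>>>>> changing the right-hand side inside the time loop. A sketch with 
>>>>>>>>>>>>>> placeholder names (FormPoissonRHS stands for your own routine):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>     /* once, before the time loop */
>>>>>>>>>>>>>>     ierr = KSPCreate(PETSC_COMM_WORLD,&ksp_poisson);CHKERRQ(ierr);
>>>>>>>>>>>>>>     ierr = KSPSetOperators(ksp_poisson,A,A);CHKERRQ(ierr);
>>>>>>>>>>>>>>     ierr = KSPSetFromOptions(ksp_poisson);CHKERRQ(ierr);
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>     /* every time step: only the RHS changes */
>>>>>>>>>>>>>>     ierr = FormPoissonRHS(b,step);CHKERRQ(ierr);
>>>>>>>>>>>>>>     ierr = KSPSolve(ksp_poisson,b,x);CHKERRQ(ierr);
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   As long as the matrix A is not modified, the expensive PCSetUp is not 
>>>>>>>>>>>>>> repeated.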
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   Something makes no sense with the output: it gives
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24  90100 66100 24   165
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 90% of the time is in the solve but there is no significant 
>>>>>>>>>>>>>> amount of time in other events of the code which is just not 
>>>>>>>>>>>>>> possible. I hope it is due to your IO.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96 
>>>>>>>>>>>>>>> cores.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I 
>>>>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must 
>>>>>>>>>>>>>>> I not do?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Why does the number of processes increase so much? Is there 
>>>>>>>>>>>>>>> something wrong with my coding? Seems to be so too for my new 
>>>>>>>>>>>>>>> run.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>>   If you are doing many time steps with the same linear solver 
>>>>>>>>>>>>>>>> then you MUST do your weak scaling studies with MANY time 
>>>>>>>>>>>>>>>> steps since the setup time of AMG only takes place in the 
>>>>>>>>>>>>>>>> first time step. So run both 48 and 96 processes with the same 
>>>>>>>>>>>>>>>> large number of time steps.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Sorry I forgot and use the old a.out. I have attached the new 
>>>>>>>>>>>>>>>>> log for 48cores (log48), together with the 96cores log 
>>>>>>>>>>>>>>>>> (log96).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Why does the number of processes increase so much? Is there 
>>>>>>>>>>>>>>>>> something wrong with my coding?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I 
>>>>>>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what 
>>>>>>>>>>>>>>>>> must I not do?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run 
>>>>>>>>>>>>>>>>> for 10 timesteps (log48_10). Is it building the 
>>>>>>>>>>>>>>>>> preconditioner at every timestep?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Also, what about momentum eqn? Is it working well?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>>>>   You used gmres with 48 processes but richardson with 96. 
>>>>>>>>>>>>>>>>>> You need to be careful and make sure you don't change the 
>>>>>>>>>>>>>>>>>> solvers when you change the number of processors since you 
>>>>>>>>>>>>>>>>>> can get very different, inconsistent results.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>    Anyways all the time is being spent in the BoomerAMG 
>>>>>>>>>>>>>>>>>> algebraic multigrid setup and it is scaling badly. When 
>>>>>>>>>>>>>>>>>> you double the problem size and number of processes it went 
>>>>>>>>>>>>>>>>>> from 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0  5    11
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0  6     2
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   Now is the Poisson problem changing at each timestep or 
>>>>>>>>>>>>>>>>>> can you use the same preconditioner built with BoomerAMG for 
>>>>>>>>>>>>>>>>>> all the time steps? Algebraic multigrid has a large setup 
>>>>>>>>>>>>>>>>>> time that often doesn't matter if you have many time 
>>>>>>>>>>>>>>>>>> steps, but if you have to rebuild it each timestep it may be 
>>>>>>>>>>>>>>>>>> too large.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   You might also try -pc_type gamg and see how PETSc's 
>>>>>>>>>>>>>>>>>> algebraic multigrid scales for your problem/machine.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY 
>>>>>>>>>>>>>>>>>>>>> wee-beng<[email protected]>  wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY 
>>>>>>>>>>>>>>>>>>>>>> wee-beng<[email protected]>  wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I understand that as mentioned in the faq, due to the 
>>>>>>>>>>>>>>>>>>>>>> limitations in memory, the scaling is not linear. So, I 
>>>>>>>>>>>>>>>>>>>>>> am trying to write a proposal to use a supercomputer.
>>>>>>>>>>>>>>>>>>>>>> Its specs are:
>>>>>>>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of 
>>>>>>>>>>>>>>>>>>>>>> memory per node)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes,
>>>>>>>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my 
>>>>>>>>>>>>>>>>>>>>>> current code with my current set of data, and there is a 
>>>>>>>>>>>>>>>>>>>>>> formula to calculate the estimated parallel efficiency 
>>>>>>>>>>>>>>>>>>>>>> when using the new large set of data
>>>>>>>>>>>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed 
>>>>>>>>>>>>>>>>>>>>>> time varies with the number of processors for a fixed
>>>>>>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed 
>>>>>>>>>>>>>>>>>>>>>> time varies with the number of processors for a
>>>>>>>>>>>>>>>>>>>>>> fixed problem size per processor.
>>>>>>>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores with my current 
>>>>>>>>>>>>>>>>>>>>>> cluster, giving 140 and 90 mins respectively. This is 
>>>>>>>>>>>>>>>>>>>>>> classified as strong scaling.
>>>>>>>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>>>>>>>> Not sure abt the memory / node
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of 
>>>>>>>>>>>>>>>>>>>>>> parallelism ‘n’ indicates how much the program is
>>>>>>>>>>>>>>>>>>>>>> efficiently accelerated by parallel processing. ‘En’ is 
>>>>>>>>>>>>>>>>>>>>>> given by the following formulae. Although their
>>>>>>>>>>>>>>>>>>>>>> derivation processes are different depending on strong 
>>>>>>>>>>>>>>>>>>>>>> and weak scaling, derived formulae are the
>>>>>>>>>>>>>>>>>>>>>> same.
>>>>>>>>>>>>>>>>>>>>>> From the estimated time, my parallel efficiency using  
>>>>>>>>>>>>>>>>>>>>>> Amdahl's law on the current old cluster was 52.7%.
>>>>>>>>>>>>>>>>>>>>>> So are my results acceptable?
>>>>>>>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes 
>>>>>>>>>>>>>>>>>>>>>> (2205X8cores), my expected parallel efficiency is only 
>>>>>>>>>>>>>>>>>>>>>> 0.5%. The proposal recommends value of > 50%.
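>>>>>>>>>>>>>>>>>>>>>> (For illustration only, since the proposal's exact formula may 
>>>>>>>>>>>>>>>>>>>>>> differ: one common way to get such an estimate from the two 
>>>>>>>>>>>>>>>>>>>>>> measured timings is an Amdahl fit, e.g. in C
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>   double T48 = 140.0, T96 = 90.0;       /* measured minutes       */
>>>>>>>>>>>>>>>>>>>>>>   double f   = 2.0*T96/T48 - 1.0;       /* serial fraction, ~0.29 */
>>>>>>>>>>>>>>>>>>>>>>   double k   = 17640.0/48.0;            /* scale-up factor        */
>>>>>>>>>>>>>>>>>>>>>>   double Tk  = T48*(f + (1.0 - f)/k);   /* projected time         */
>>>>>>>>>>>>>>>>>>>>>>   double En  = (T48/Tk)/k;              /* efficiency, below 1%   */
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> which is why the extrapolated number collapses once the serial 
>>>>>>>>>>>>>>>>>>>>>> fraction is non-negligible.)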
>>>>>>>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated 
>>>>>>>>>>>>>>>>>>>>>> serial fraction from Amdahl's Law  changes as a function
>>>>>>>>>>>>>>>>>>>>>> of problem size, so you cannot take the strong scaling 
>>>>>>>>>>>>>>>>>>>>>> from one problem and apply it to another without a
>>>>>>>>>>>>>>>>>>>>>> model of this dependence.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I 
>>>>>>>>>>>>>>>>>>>>>> would measure weak scaling on your current
>>>>>>>>>>>>>>>>>>>>>> cluster, and extrapolate to the big machine. I realize 
>>>>>>>>>>>>>>>>>>>>>> that this does not make sense for many scientific
>>>>>>>>>>>>>>>>>>>>>> applications, but neither does requiring a certain 
>>>>>>>>>>>>>>>>>>>>>> parallel efficiency.
>>>>>>>>>>>>>>>>>>>>> Ok, I checked the results for my weak scaling; it is even 
>>>>>>>>>>>>>>>>>>>>> worse for the expected parallel efficiency. From the 
>>>>>>>>>>>>>>>>>>>>> formula used, it's obvious it's doing some sort of 
>>>>>>>>>>>>>>>>>>>>> exponential extrapolation decrease. So unless I can 
>>>>>>>>>>>>>>>>>>>>> achieve a near > 90% speed up when I double the cores and 
>>>>>>>>>>>>>>>>>>>>> problem size for my current 48/96 cores setup,     
>>>>>>>>>>>>>>>>>>>>> extrapolating from about 96 nodes to 10,000 nodes will 
>>>>>>>>>>>>>>>>>>>>> give a much lower expected parallel efficiency for the 
>>>>>>>>>>>>>>>>>>>>> new case.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that due to memory 
>>>>>>>>>>>>>>>>>>>>> requirement, it's impossible to get >90% speed when I 
>>>>>>>>>>>>>>>>>>>>> double the cores and problem size (ie linear increase in 
>>>>>>>>>>>>>>>>>>>>> performance), which means that I can't get >90% speed up 
>>>>>>>>>>>>>>>>>>>>> when I double the cores and problem size for my current 
>>>>>>>>>>>>>>>>>>>>> 48/96 cores setup. Is that so?
>>>>>>>>>>>>>>>>>>>>   What is the output of -ksp_view -log_summary on the 
>>>>>>>>>>>>>>>>>>>> problem and then on the problem doubled in size and number 
>>>>>>>>>>>>>>>>>>>> of processors?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I have attached the output
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> There are 2 solvers - The momentum linear eqn uses bcgs, 
>>>>>>>>>>>>>>>>>>> while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie 
>>>>>>>>>>>>>>>>>>>>> in my programming skills, but rather the way the linear 
>>>>>>>>>>>>>>>>>>>>> equations are solved?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>   Thanks,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>      Matt
>>>>>>>>>>>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (>50%), 
>>>>>>>>>>>>>>>>>>>>>> when using 17640 (2205X8) cores?
>>>>>>>>>>>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Sent using CloudMagic Email
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>> What most experimenters take for granted before they 
>>>>>>>>>>>>>>>>>>>>>> begin their experiments is infinitely more interesting 
>>>>>>>>>>>>>>>>>>>>>> than any results to which their experiments lead.
>>>>>>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>>>>>>>> <log64_100.txt><log8_100.txt>
>>>>>>> <log.txt>
>>>>> <log64_100_2.txt><log8_100_2.txt>
> 
> <log8_100_3.txt><log64_100_3.txt>
