There is a problem here. The -log_summary output doesn't show all the events 
associated with the -pc_type gamg preconditioner; it should have rows like 

VecDot                 2 1.0 6.1989e-06 1.0 1.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1613
VecMDot              134 1.0 5.4145e-04 1.0 1.64e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  3025
VecNorm              154 1.0 2.4176e-04 1.0 3.82e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1578
VecScale             148 1.0 1.6928e-04 1.0 1.76e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1039
VecCopy              106 1.0 1.2255e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               474 1.0 5.1236e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               54 1.0 1.3471e-04 1.0 2.35e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1742
VecAYPX              384 1.0 5.7459e-04 1.0 4.94e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   860
VecAXPBYCZ           192 1.0 4.7398e-04 1.0 9.88e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  2085
VecWAXPY               2 1.0 7.8678e-06 1.0 5.00e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   636
VecMAXPY             148 1.0 8.1539e-04 1.0 1.96e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  3  0  0  0   1  3  0  0  0  2399
VecPointwiseMult      66 1.0 1.1253e-04 1.0 6.79e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   604
VecScatterBegin       45 1.0 6.3419e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSetRandom           6 1.0 3.0994e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecReduceArith         4 1.0 1.3113e-05 1.0 2.00e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1525
VecReduceComm          2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize         148 1.0 4.4799e-04 1.0 5.27e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1177
MatMult              424 1.0 8.9276e-03 1.0 2.09e+07 1.0 0.0e+00 0.0e+00 0.0e+00  7 37  0  0  0   7 37  0  0  0  2343
MatMultAdd            48 1.0 5.0926e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  2069
MatMultTranspose      48 1.0 9.8586e-04 1.0 1.05e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  2  0  0  0   1  2  0  0  0  1069
MatSolve              16 1.0 2.2173e-05 1.0 1.02e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   460
MatSOR               354 1.0 1.0547e-02 1.0 1.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00  9 31  0  0  0   9 31  0  0  0  1631
MatLUFactorSym         2 1.0 4.7922e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatLUFactorNum         2 1.0 2.5272e-05 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   307
MatScale              18 1.0 1.7142e-04 1.0 1.50e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   874
MatResidual           48 1.0 1.0548e-03 1.0 2.33e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0  2212
MatAssemblyBegin      57 1.0 4.7684e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd        57 1.0 1.9786e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatGetRow          21616 1.0 1.8497e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatGetRowIJ            2 1.0 6.9141e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         2 1.0 6.0797e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatCoarsen             6 1.0 9.3222e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         2 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAXPY                6 1.0 1.7998e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatFDColorCreate       1 1.0 3.2902e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatFDColorSetUp        1 1.0 1.6739e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatFDColorApply        2 1.0 1.3199e-03 1.0 2.41e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0  1826
MatFDColorFunc        42 1.0 7.4601e-04 1.0 2.20e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0  2956
MatMatMult             6 1.0 5.1048e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  4  2  0  0  0   4  2  0  0  0   241
MatMatMultSym          6 1.0 3.2601e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0
MatMatMultNum          6 1.0 1.8158e-03 1.0 1.23e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0   2  2  0  0  0   679
MatPtAP                6 1.0 2.1328e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   283
MatPtAPSymbolic        6 1.0 1.0073e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  8  0  0  0  0   8  0  0  0  0     0
MatPtAPNumeric         6 1.0 1.1230e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00  9 11  0  0  0   9 11  0  0  0   537
MatTrnMatMult          2 1.0 7.2789e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0    75
MatTrnMatMultSym       2 1.0 5.7006e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatTrnMatMultNum       2 1.0 1.5473e-04 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   352
MatGetSymTrans         8 1.0 3.1638e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog       134 1.0 1.3156e-03 1.0 3.28e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  6  0  0  0   1  6  0  0  0  2491
KSPSetUp              24 1.0 4.6754e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               2 1.0 1.1291e-01 1.0 5.32e+07 1.0 0.0e+00 0.0e+00 0.0e+00 94 95  0  0  0  94 95  0  0  0   471
PCGAMGGraph_AGG        6 1.0 1.2108e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     2
PCGAMGCoarse_AGG       6 1.0 1.1127e-03 1.0 5.44e+04 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0    49
PCGAMGProl_AGG         6 1.0 4.1062e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34  0  0  0  0  34  0  0  0  0     0
PCGAMGPOpt_AGG         6 1.0 1.1200e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  9 11  0  0  0   9 11  0  0  0   534
GAMG: createProl       6 1.0 6.5530e-02 1.0 6.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00 55 11  0  0  0  55 11  0  0  0    92
  Graph               12 1.0 1.1692e-02 1.0 1.82e+04 1.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     2
  MIS/Agg              6 1.0 1.4496e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
  SA: col data         6 1.0 7.1526e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
  SA: frmProl0         6 1.0 4.0917e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 34  0  0  0  0  34  0  0  0  0     0
  SA: smooth           6 1.0 1.1198e-02 1.0 5.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  9 11  0  0  0   9 11  0  0  0   534
GAMG: partLevel        6 1.0 2.1341e-02 1.0 6.03e+06 1.0 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   283
PCSetUp                4 1.0 8.8020e-02 1.0 1.21e+07 1.0 0.0e+00 0.0e+00 0.0e+00 74 22  0  0  0  74 22  0  0  0   137
PCSetUpOnBlocks       16 1.0 1.8382e-04 1.0 7.75e+03 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    42
PCApply               16 1.0 2.3858e-02 1.0 3.91e+07 1.0 0.0e+00 0.0e+00 0.0e+00 20 70  0  0  0  20 70  0  0  0  1637


Are you sure you ran with -pc_type gamg? What about running with -info, does it 
print anything about gamg? And with -ksp_view, does it indicate it is using the 
gamg preconditioner?
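
As a concrete example (just a sketch; it assumes the "poisson_" options prefix 
and the a.out executable mentioned earlier in this thread, with mpiexec standing 
in for however you normally launch the job), a run like

   mpiexec -n 8 ./a.out -poisson_pc_type gamg -poisson_ksp_view -info -log_summary

should print a KSP/PC description naming gamg and produce the PCGAMG* and GAMG 
events shown above; if it does not, the gamg option is not reaching the Poisson 
solve.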


> On Nov 4, 2015, at 9:30 PM, TAY wee-beng <[email protected]> wrote:
> 
> Hi,
> 
> I have attached the 2 logs.
> 
> Thank you
> 
> Yours sincerely,
> 
> TAY wee-beng
> 
> On 4/11/2015 1:11 AM, Barry Smith wrote:
>>    Ok, the convergence looks good. Now run on 8 and 64 processes as before 
>> with -log_summary and not -ksp_monitor to see how it scales.
>> 
>>   Barry
>> 
>>> On Nov 3, 2015, at 6:49 AM, TAY wee-beng <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I tried and have attached the log.
>>> 
>>> Ya, my Poisson eqn has Neumann boundary conditions. Do I need to specify 
>>> some null space stuff, like KSPSetNullSpace or MatNullSpaceCreate?
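
For the Neumann case the constant vector is in the null space of the Poisson 
matrix and should be attached to it. A minimal C sketch (the function name and 
the assumption that the assembled Poisson matrix is passed in as A are mine, 
not taken from your code):

   #include <petscmat.h>

   /* Attach the constant null space to the assembled Poisson matrix A */
   PetscErrorCode AttachConstantNullSpace(Mat A)
   {
     MatNullSpace   nullsp;
     PetscErrorCode ierr;

     /* PETSC_TRUE says the null space is spanned by the constant vector */
     ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
     ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);  /* KSP projects this out during the solve */
     ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
     return 0;
   }

KSPSetNullSpace is the older interface that accomplishes the same thing.
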
>>> 
>>> Thank you
>>> 
>>> Yours sincerely,
>>> 
>>> TAY wee-beng
>>> 
>>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng<[email protected]>  wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I tried :
>>>>> 
>>>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>>> 
>>>>> 2. -poisson_pc_type gamg
>>>>    Run with -poisson_ksp_monitor_true_residual 
>>>> -poisson_ksp_monitor_converged_reason
>>>> Does your Poisson problem have Neumann boundary conditions? Do you have any 
>>>> zeros on the diagonal of the matrix (you shouldn't)?
>>>> 
>>>>   There may be something wrong with your Poisson discretization that was 
>>>> also messing up hypre.
>>>> 
>>>> 
>>>> 
>>>>> Both options give:
>>>>> 
>>>>>    1      0.00150000      0.00000000      0.00000000 1.00000000           
>>>>>   NaN             NaN             NaN
>>>>> M Diverged but why?, time =            2
>>>>> reason =           -9
>>>>> 
>>>>> How can I check what's wrong?
>>>>> 
>>>>> Thank you
>>>>> 
>>>>> Yours sincerely,
>>>>> 
>>>>> TAY wee-beng
>>>>> 
>>>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>>>>    hypre is just not scaling well here. I do not know why. Since hypre 
>>>>>> is a black box for us, there is no way to determine why the scaling is poor.
>>>>>> 
>>>>>>    If you make the same two runs with -pc_type gamg there will be a lot 
>>>>>> more information in the log summary about which routines are scaling 
>>>>>> well or poorly.
>>>>>> 
>>>>>>   Barry
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng<[email protected]>  wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I have attached the 2 files.
>>>>>>> 
>>>>>>> Thank you
>>>>>>> 
>>>>>>> Yours sincerely,
>>>>>>> 
>>>>>>> TAY wee-beng
>>>>>>> 
>>>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>>>>   Run (158/2)x(266/2)x(150/2) grid on 8 processes  and then 
>>>>>>>> (158)x(266)x(150) on 64 processors  and send the two -log_summary 
>>>>>>>> results
>>>>>>>> 
>>>>>>>>   Barry
>>>>>>>> 
>>>>>>>>  
>>>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I have attached the new results.
>>>>>>>>> 
>>>>>>>>> Thank you
>>>>>>>>> 
>>>>>>>>> Yours sincerely,
>>>>>>>>> 
>>>>>>>>> TAY wee-beng
>>>>>>>>> 
>>>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>>>>   Run without the -momentum_ksp_view -poisson_ksp_view and send the 
>>>>>>>>>> new results
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>   You can see from the log summary that the PCSetUp is taking a much 
>>>>>>>>>> smaller percentage of the time, meaning that it is reusing the 
>>>>>>>>>> preconditioner and not rebuilding it each time.
>>>>>>>>>> 
>>>>>>>>>> Barry
>>>>>>>>>> 
>>>>>>>>>>   Something makes no sense with the output: it gives
>>>>>>>>>> 
>>>>>>>>>> KSPSolve             199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24  90100 66100 24   165
>>>>>>>>>> 
>>>>>>>>>> 90% of the time is in the solve, but there is no significant amount 
>>>>>>>>>> of time in the other events of the code, which is just not possible. 
>>>>>>>>>> I hope it is due to your IO.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>>>>>>>>>> 
>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want 
>>>>>>>>>>> to reuse the preconditioner, what must I do? Or what must I not do?
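
If the matrix passed to KSPSetOperators is not changed between time steps, PETSc 
does not rebuild the preconditioner; it is only rebuilt when the matrix values 
change. A minimal C sketch of making the reuse explicit (the names ksp, A, b and 
x are assumptions, not taken from your code):

   ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
   /* keep the current preconditioner even if the operator is flagged as changed */
   ierr = KSPSetReusePreconditioner(ksp, PETSC_TRUE);CHKERRQ(ierr);
   /* each time step: update only the right-hand side b, then solve again */
   ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

The main thing to avoid is re-assembling or re-setting a modified matrix every 
time step when only the right-hand side changes.
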
>>>>>>>>>>> 
>>>>>>>>>>> Why does the number of processes increase so much? Is there 
>>>>>>>>>>> something wrong with my coding? Seems to be so too for my new run.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you
>>>>>>>>>>> 
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>> 
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>> 
>>>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>>>>   If you are doing many time steps with the same linear solver 
>>>>>>>>>>>> then you MUST do your weak scaling studies with MANY time steps 
>>>>>>>>>>>> since the setup time of AMG only takes place in the first 
>>>>>>>>>>>> timestep. So run both 48 and 96 processes with the same large 
>>>>>>>>>>>> number of time steps.
>>>>>>>>>>>> 
>>>>>>>>>>>>   Barry
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng<[email protected]>  wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new log 
>>>>>>>>>>>>> for 48 cores (log48), together with the 96 cores log (log96).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Why does the number of processes increase so much? Is there 
>>>>>>>>>>>>> something wrong with my coding?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I 
>>>>>>>>>>>>> want to reuse the preconditioner, what must I do? Or what must I 
>>>>>>>>>>>>> not do?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for 
>>>>>>>>>>>>> 10 timesteps (log48_10). Is it building the preconditioner at 
>>>>>>>>>>>>> every timestep?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Also, what about momentum eqn? Is it working well?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>>>>   You used gmres with 48 processes but richardson with 96. You 
>>>>>>>>>>>>>> need to be careful and make sure you don't change the solvers 
>>>>>>>>>>>>>> when you change the number of processors, since you can get very 
>>>>>>>>>>>>>> different, inconsistent results.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>    Anyways, all the time is being spent in the BoomerAMG 
>>>>>>>>>>>>>> algebraic multigrid setup and it is scaling badly. When you 
>>>>>>>>>>>>>> double the problem size and number of processes it went from 
>>>>>>>>>>>>>> 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> PCSetUp                3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62  8  0  0  4  62  8  0  0  5    11
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> PCSetUp                3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18  0  0  6  85 18  0  0  6     2
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   Now, is the Poisson problem changing at each timestep, or can 
>>>>>>>>>>>>>> you use the same preconditioner built with BoomerAMG for all the 
>>>>>>>>>>>>>> time steps? Algebraic multigrid has a large setup time that 
>>>>>>>>>>>>>> often doesn't matter if you have many time steps, but if you 
>>>>>>>>>>>>>> have to rebuild it each timestep it is too large.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   You might also try -pc_type gamg and see how PETSc's algebraic 
>>>>>>>>>>>>>> multigrid scales for your problem/machine.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng<[email protected]>  
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY 
>>>>>>>>>>>>>>>>>> wee-beng<[email protected]>  wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the 
>>>>>>>>>>>>>>>>>> limitations in memory the scaling is not linear. So I am 
>>>>>>>>>>>>>>>>>> trying to write a proposal to use a supercomputer.
>>>>>>>>>>>>>>>>>> Its specs are:
>>>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory 
>>>>>>>>>>>>>>>>>> per node)
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my 
>>>>>>>>>>>>>>>>>> current code with my current set of data, and there is a 
>>>>>>>>>>>>>>>>>> formula to calculate the estimated parallel efficiency when 
>>>>>>>>>>>>>>>>>> using the new large set of data.
>>>>>>>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed time 
>>>>>>>>>>>>>>>>>> varies with the number of processors for a fixed
>>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time 
>>>>>>>>>>>>>>>>>> varies with the number of processors for a
>>>>>>>>>>>>>>>>>> fixed problem size per processor.
>>>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores with my current cluster, 
>>>>>>>>>>>>>>>>>> giving 140 and 90 mins respectively. This is classified as 
>>>>>>>>>>>>>>>>>> strong scaling.
>>>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of 
>>>>>>>>>>>>>>>>>> parallelism ‘n’ indicates how much the program is
>>>>>>>>>>>>>>>>>> efficiently accelerated by parallel processing. ‘En’ is 
>>>>>>>>>>>>>>>>>> given by the following formulae. Although their
>>>>>>>>>>>>>>>>>> derivation processes are different depending on strong and 
>>>>>>>>>>>>>>>>>> weak scaling, derived formulae are the
>>>>>>>>>>>>>>>>>> same.
>>>>>>>>>>>>>>>>>> From the estimated time, my parallel efficiency using 
>>>>>>>>>>>>>>>>>> Amdahl's law on the current old cluster was 52.7%. So are my 
>>>>>>>>>>>>>>>>>> results acceptable?
>>>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205X8cores), 
>>>>>>>>>>>>>>>>>> my expected parallel efficiency is only 0.5%. The proposal 
>>>>>>>>>>>>>>>>>> recommends value of > 50%.
>>>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated serial 
>>>>>>>>>>>>>>>>>> fraction from Amdahl's Law  changes as a function
>>>>>>>>>>>>>>>>>> of problem size, so you cannot take the strong scaling from 
>>>>>>>>>>>>>>>>>> one problem and apply it to another without a
>>>>>>>>>>>>>>>>>> model of this dependence.
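
As a rough illustration of how sensitive this extrapolation is (this is my own 
Amdahl-type fit to the 140 min / 48 cores and 90 min / 96 cores timings reported 
above, not the proposal's exact formula), the short C program below fits 
T(n) = s + p/n to the two points and extrapolates to 2205 x 8 = 17640 cores:

   #include <stdio.h>

   int main(void)
   {
     /* measured strong-scaling points taken from this thread */
     double n1 = 48.0, t1 = 140.0;               /* cores, minutes */
     double n2 = 96.0, t2 = 90.0;
     /* fit T(n) = s + p/n through the two points */
     double p  = (t1 - t2) / (1.0/n1 - 1.0/n2);  /* parallelizable work */
     double s  = t1 - p/n1;                      /* serial part */
     double T1 = s + p;                          /* modelled single-core time */
     double n  = 2205.0*8.0;                     /* target machine, 17640 cores */
     double tn = s + p/n;

     printf("serial fraction               = %.2f %%\n", 100.0*s/T1);
     printf("predicted time on 17640 cores = %.1f min\n", tn);
     printf("strong-scaling efficiency     = %.2f %%\n", 100.0*T1/(n*tn));
     return 0;
   }

Even a fitted serial fraction below 1% pushes the strong-scaling efficiency under 
1% at 17640 cores, the same ballpark as the 0.5% figure above; but as just noted, 
the serial fraction itself changes with problem size, so this kind of 
extrapolation should not be taken literally.
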
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I 
>>>>>>>>>>>>>>>>>> would measure weak scaling on your current
>>>>>>>>>>>>>>>>>> cluster, and extrapolate to the big machine. I realize that 
>>>>>>>>>>>>>>>>>> this does not make sense for many scientific
>>>>>>>>>>>>>>>>>> applications, but neither does requiring a certain parallel 
>>>>>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>>>> Ok, I checked the results for my weak scaling and the expected 
>>>>>>>>>>>>>>>>> parallel efficiency is even worse. From the formula used, it's 
>>>>>>>>>>>>>>>>> obvious it's doing some sort of exponential extrapolation 
>>>>>>>>>>>>>>>>> decrease. So unless I can achieve a near >90% speedup when I 
>>>>>>>>>>>>>>>>> double the cores and problem size for my current 48/96 cores 
>>>>>>>>>>>>>>>>> setup, extrapolating from about 96 nodes to 10,000 nodes will 
>>>>>>>>>>>>>>>>> give a much lower expected parallel efficiency for the new case.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that, due to memory 
>>>>>>>>>>>>>>>>> requirements, it's impossible to get a >90% speedup when I 
>>>>>>>>>>>>>>>>> double the cores and problem size (i.e. a linear increase in 
>>>>>>>>>>>>>>>>> performance), which means that I can't get a >90% speedup when 
>>>>>>>>>>>>>>>>> I double the cores and problem size for my current 48/96 
>>>>>>>>>>>>>>>>> cores setup. Is that so?
>>>>>>>>>>>>>>>>   What is the output of -ksp_view -log_summary on the problem 
>>>>>>>>>>>>>>>> and then on the problem doubled in size and number of 
>>>>>>>>>>>>>>>> processors?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>   Barry
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I have attached the output
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There are 2 solvers: the momentum linear eqn uses bcgs, while 
>>>>>>>>>>>>>>> the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in my 
>>>>>>>>>>>>>>>>> programming skills, but rather in the way the linear equations 
>>>>>>>>>>>>>>>>> are solved?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>   Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>      Matt
>>>>>>>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (>50%), 
>>>>>>>>>>>>>>>>>> when using 17640 (2205X8) cores?
>>>>>>>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin 
>>>>>>>>>>>>>>>>>> their experiments is infinitely more interesting than any 
>>>>>>>>>>>>>>>>>> results to which their experiments lead.
>>>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>>>> <log64_100.txt><log8_100.txt>
>>> <log.txt>
> 
> <log64_100_2.txt><log8_100_2.txt>
