OK, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary (and without -ksp_monitor) to see how it scales.
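
(For reference, since the null-space question comes up below: a minimal sketch, not taken from the attached code, of attaching the constant null space that a pure-Neumann Poisson matrix has. The matrix name A is illustrative; older PETSc versions pass the null space to the solver with KSPSetNullSpace instead of relying on MatSetNullSpace.)

  #include <petscksp.h>

  /* Sketch: attach the constant null space of a pure-Neumann Poisson matrix.
     "A" is an illustrative name for the assembled Poisson matrix. */
  PetscErrorCode AttachConstantNullSpace(Mat A)
  {
    MatNullSpace   nullsp;
    PetscErrorCode ierr;

    /* has_cnst = PETSC_TRUE: the null space is spanned by the constant vector */
    ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);
    ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
    /* The RHS must also be consistent (orthogonal to the constants);
       MatNullSpaceRemove() can project it before KSPSolve if needed. */
    return 0;
  }

Whether the attached code already does something equivalent can only be checked against the code itself.
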
Barry

> On Nov 3, 2015, at 6:49 AM, TAY wee-beng <[email protected]> wrote:
>
> Hi,
>
> I tried and have attached the log.
>
> Ya, my Poisson eqn has Neumann boundary conditions. Do I need to specify the null space, e.g. with KSPSetNullSpace or MatNullSpaceCreate?
>
> Thank you
>
> Yours sincerely,
>
> TAY wee-beng
>
> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I tried:
>>>
>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>
>>> 2. -poisson_pc_type gamg
>>
>> Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
>>
>> Does your Poisson problem have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix? (You shouldn't.)
>>
>> There may be something wrong with your Poisson discretization that was also messing up hypre.
>>
>>> Both options give:
>>>
>>> 1 0.00150000 0.00000000 0.00000000 1.00000000
>>> NaN NaN NaN
>>> M Diverged but why?, time = 2
>>> reason = -9
>>>
>>> How can I check what's wrong?
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>> hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine the cause of the poor scaling.
>>>>
>>>> If you make the same two runs with -pc_type gamg, there will be a lot more information in the log summary about which routines scale well or poorly.
>>>>
>>>> Barry
>>>>
>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have attached the 2 files.
>>>>>
>>>>> Thank you
>>>>>
>>>>> Yours sincerely,
>>>>>
>>>>> TAY wee-beng
>>>>>
>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>> Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processes, and send the two -log_summary results.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have attached the new results.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>>
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>> Run without the -momentum_ksp_view -poisson_ksp_view and send the new results.
>>>>>>>>
>>>>>>>> You can see from the log summary that PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> Something makes no sense with the output: it gives
>>>>>>>>
>>>>>>>> KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
>>>>>>>>
>>>>>>>> 90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.
>>>>>>>>
>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>>>>>>>>
>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>
>>>>>>>>> Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>>
>>>>>>>>> Yours sincerely,
>>>>>>>>>
>>>>>>>>> TAY wee-beng
>>>>>>>>>
>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>> If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
>>>>>>>>>>>
>>>>>>>>>>> Why does the number of processes increase so much? Is there something wrong with my coding?
>>>>>>>>>>>
>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>>>
>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
>>>>>>>>>>>
>>>>>>>>>>> Also, what about the momentum eqn? Is it working well?
>>>>>>>>>>>
>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>
>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>> You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, all the time is being spent in the BoomerAMG algebraic multigrid setup, and it is scaling badly. When you double the problem size and the number of processes, it went from 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>>>>>
>>>>>>>>>>>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>>>>>>>>>>>
>>>>>>>>>>>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>>>>>>>>>>>
>>>>>>>>>>>> Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it at each timestep it is too large.
>>>>>>>>>>>>
>>>>>>>>>>>> You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
>>>>>>>>>>>>
>>>>>>>>>>>> Barry
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the limitations in memory the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data. There are 2 ways to give performance:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size.
>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>> 6 CPUs / node
>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The parallel efficiency 'En' for a given degree of parallelism 'n' indicates how efficiently the program is accelerated by parallel processing. 'En' is given by the following formulae. Although their derivations differ for strong and weak scaling, the derived formulae are the same.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation of the decrease. So unless I can achieve nearly >90% speedup when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that, due to memory requirements, it's impossible to get >90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get >90% speedup when I double the cores and problem size for my current 48/96-core setup. Is that so?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What is the output of -ksp_view -log_summary on the problem, and then on the problem doubled in size and number of processors?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have attached the output:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are 2 solvers: the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matt
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (>50%) when using 17640 (2205 x 8) cores? Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>
>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>> <log64_100.txt><log8_100.txt>
>
> <log.txt>
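
(Editorial aside on the recurring question above about reusing the preconditioner when only the Poisson RHS changes: a minimal sketch under the assumption that the matrix really is unchanged. The names ksp_poisson, A, b, x and nsteps are illustrative, not from the attached code; if the matrix does change but the old preconditioner is still wanted, KSPSetReusePreconditioner() is the relevant call instead.)

  #include <petscksp.h>

  /* Sketch: build the AMG preconditioner once and reuse it for every time step. */
  PetscErrorCode SolvePoissonEachStep(KSP ksp_poisson, Mat A, Vec b, Vec x, PetscInt nsteps)
  {
    PetscErrorCode ierr;
    PetscInt       step;

    /* Set the operator once, before the time loop; the expensive PCSetUp
       (BoomerAMG / GAMG setup) then happens only at the first KSPSolve. */
    ierr = KSPSetOperators(ksp_poisson, A, A);CHKERRQ(ierr);
    for (step = 0; step < nsteps; step++) {
      /* ... update only the entries of b for this time step ... */
      /* A is unchanged and KSPSetOperators is not called again, so the
         preconditioner is NOT rebuilt here. */
      ierr = KSPSolve(ksp_poisson, b, x);CHKERRQ(ierr);
    }
    return 0;
  }

In -log_summary terms, a setup that is reused once per run shows up as a small PCSetUp count and percentage, as Barry notes above.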

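(A small editorial illustration of the serial-fraction point discussed above, using only the 140 min / 90 min timings reported for 48 and 96 cores; the proposal's own 'En' formula is not reproduced here.) Treating the 48-core run as the baseline, Amdahl's law for doubling the core count gives

\[
S_2 = \frac{t_{48}}{t_{96}} = \frac{140}{90} \approx 1.56,
\qquad
S_2 = \frac{1}{s + (1-s)/2}
\;\Rightarrow\;
s = \frac{2\,t_{96}}{t_{48}} - 1 = \frac{180}{140} - 1 \approx 0.29,
\]

i.e. roughly 29% of the 48-core runtime does not speed up further at this problem size. A serial fraction fitted from one fixed-size run says little about a much larger run, which is exactly the problem-size dependence Matt points out above.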