hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine why the scaling is poor.

If you make the same two runs with -pc_type gamg, there will be a lot more information in the log summary about which routines are scaling well or poorly.
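For example, the two comparison runs could look something like the following (illustrative command lines only; adjust the MPI launcher, executable name, and grid settings to your setup, and use the prefixed form such as -poisson_pc_type gamg if the Poisson KSP has the poisson_ options prefix):

    mpiexec -n 8  ./a.out -pc_type gamg -log_summary     # (158/2) x (266/2) x (150/2) grid
    mpiexec -n 64 ./a.out -pc_type gamg -log_summary     # full 158 x 266 x 150 grid

Then send both -log_summary outputs.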
   Barry

> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:
>
> Hi,
>
> I have attached the 2 files.
>
> Thank you
>
> Yours sincerely,
> TAY wee-beng
>
> On 2/11/2015 2:55 PM, Barry Smith wrote:
>> Run the (158/2) x (266/2) x (150/2) grid on 8 processes and then the 158 x 266 x 150 grid on 64 processors and send the two -log_summary results.
>>
>> Barry
>>
>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have attached the new results.
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>> TAY wee-beng
>>>
>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>> Run without the -momentum_ksp_view -poisson_ksp_view and send the new results.
>>>>
>>>> You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
>>>>
>>>> Barry
>>>>
>>>> Something makes no sense with the output: it gives
>>>>
>>>> KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
>>>>
>>>> 90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.
>>>>
>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>>>>
>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>
>>>>> Why does the number of processes increase so much? Is there something wrong with my coding? It seems to be so for my new run too.
>>>>>
>>>>> Thank you
>>>>>
>>>>> Yours sincerely,
>>>>> TAY wee-beng
>>>>>
>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>> If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
>>>>>>>
>>>>>>> Why does the number of processes increase so much? Is there something wrong with my coding?
>>>>>>>
>>>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>
>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
>>>>>>>
>>>>>>> Also, what about the momentum eqn? Is it working well?
>>>>>>>
>>>>>>> I will try the gamg later too.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>> You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
>>>>>>>>
>>>>>>>> Anyways, all the time is being spent in the BoomerAMG algebraic multigrid setup and it is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>
>>>>>>>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>>>>>>>
>>>>>>>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>>>>>>>
>>>>>>>> Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it each timestep it is too large.
>>>>>>>>
>>>>>>>> You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the limitations in memory the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:
>>>>>>>>>>>>
>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data.
>>>>>>>>>>>>
>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem.
>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
>>>>>>>>>>>>
>>>>>>>>>>>> I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
>>>>>>>>>>>>
>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>
>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how efficiently the program is accelerated by parallel processing. ‘En’ is given by the following formulae. Although their derivations differ for strong and weak scaling, the derived formulae are the same.
>>>>>>>>>>>>
>>>>>>>>>>>> From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?
>>>>>>>>>>>>
>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.
>>>>>>>>>>>>
>>>>>>>>>>>> Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.
>>>>>>>>>>>
>>>>>>>>>>> Ok, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speedup when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
>>>>>>>>>>>
>>>>>>>>>>> However, it's mentioned in the FAQ that due to memory requirements it's impossible to get > 90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get > 90% speedup when I double the cores and problem size for my current 48/96 cores setup. Is that so?
>>>>>>>>>>
>>>>>>>>>> What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have attached the output:
>>>>>>>>>
>>>>>>>>> 48 cores: log48
>>>>>>>>> 96 cores: log96
>>>>>>>>>
>>>>>>>>> There are 2 solvers: the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>
>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>
>>>>>>>>>>> So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Matt
>>>>>>>>>>>>
>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (> 50%) when using 17640 (2205 x 8) cores?
>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>
>>>>>>>>>>>> Sent using CloudMagic Email
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>> <log48.txt><log96.txt>
>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>> <log96_100.txt><log48_100.txt>
>>> <log96_100_2.txt><log48_100_2.txt>
> <log64_100.txt><log8_100.txt>
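P.S. On the question that comes up several times above about reusing the preconditioner when only the Poisson RHS changes: a minimal sketch, assuming PETSc 3.5 or later and that the matrix passed to KSPSetOperators() for the Poisson solve is never modified between time steps. Set the operator once before the time loop and do not call KSPSetOperators() again; the PCSetUp will then not be redone at later solves. Reuse can also be forced explicitly with KSPSetReusePreconditioner() in the code, or on the command line with the poisson_ prefix seen above:

    -poisson_ksp_reuse_preconditioner

The call count and time for PCSetUp in -log_summary are the quickest way to check that the setup really happens only once instead of at every time step.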
