> On Nov 3, 2015, at 9:04 AM, TAY wee-beng <[email protected]> wrote:
>
>
> On 3/11/2015 9:01 PM, Matthew Knepley wrote:
>> On Tue, Nov 3, 2015 at 6:58 AM, TAY wee-beng <[email protected]> wrote:
>>
>> On 3/11/2015 8:52 PM, Matthew Knepley wrote:
>>> On Tue, Nov 3, 2015 at 6:49 AM, TAY wee-beng <[email protected]> wrote:
>>> Hi,
>>>
>>> I tried and have attached the log.
>>>
>>> Yes, my Poisson eqn has a Neumann boundary condition. Do I need to specify
>>> some null space, e.g. with KSPSetNullSpace or MatNullSpaceCreate?
>>>
>>> Yes, you need to attach the constant null space to the matrix.
>>>
>>> Thanks,
>>>
>>> Matt
>> Ok, so can you point me to a suitable example so that I know which one to
>> use specifically?
>>
>> https://bitbucket.org/petsc/petsc/src/9ae8fd060698c4d6fc0d13188aca8a1828c138ab/src/snes/examples/tutorials/ex12.c?at=master&fileviewer=file-view-default#ex12.c-761
>>
>> Matt
> Hi,
>
> Actually, I realised that for my Poisson eqn I have both Neumann and
> Dirichlet BCs. The Dirichlet BC is at the output grids, where I specify
> pressure = 0. So do I still need the null space?
No,

> My Poisson eqn LHS is fixed but the RHS changes with every timestep.
>
> If I need to use a null space, how do I know if the null space contains the
> constant vector and what the number of vectors is? I followed the example
> given and added:
>
> call MatNullSpaceCreate(MPI_COMM_WORLD,PETSC_TRUE,0,NULL,nullsp,ierr)
>
> call MatSetNullSpace(A,nullsp,ierr)
>
> call MatNullSpaceDestroy(nullsp,ierr)
>
> Is that all?
>
> Before this, I was using the HYPRE geometric solver, and the matrix / vector
> in that subroutine were written based on HYPRE. It worked pretty well and
> fast.
>
> However, it's a black box and it's hard to diagnose problems.
>
> I always had the PETSc subroutine to solve my Poisson eqn, but I used KSPBCGS
> or KSPGMRES with HYPRE's BoomerAMG as the PC. It worked but was slow.
>
> Matt: Thanks, I will see how it goes using the null space and may try
> "-mg_coarse_pc_type svd" later.
>>
>> Thanks.
>>>
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I tried:
>>>
>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>
>>> 2. -poisson_pc_type gamg
>>> Run with -poisson_ksp_monitor_true_residual
>>> -poisson_ksp_monitor_converged_reason
>>> Does your Poisson problem have Neumann boundary conditions? Do you have any
>>> zeros on the diagonal of the matrix (you shouldn't)?
>>>
>>> There may be something wrong with your Poisson discretization that was
>>> also messing up hypre.
>>>
>>>
>>>
>>> Both options give:
>>>
>>> 1 0.00150000 0.00000000 0.00000000 1.00000000 NaN NaN NaN
>>> M Diverged but why?, time = 2
>>> reason = -9
>>>
>>> How can I check what's wrong?
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>> hypre is just not scaling well here. I do not know why. Since hypre is
>>> a black box for us, there is no way to determine why the scaling is poor.
>>>
>>> If you make the same two runs with -pc_type gamg there will be a lot
>>> more information in the log summary about in which routines it is scaling
>>> well or poorly.
>>>
>>> Barry
>>>
>>>
>>>
>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have attached the 2 files.
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>> Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then
>>> (158)x(266)x(150) on 64 processes and send the two -log_summary results.
>>>
>>> Barry
>>>
>>>
>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have attached the new results.
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>> Run without the -momentum_ksp_view -poisson_ksp_view and send the new
>>> results.
>>>
>>>
>>> You can see from the log summary that PCSetUp is taking a much
>>> smaller percentage of the time, meaning that it is reusing the
>>> preconditioner and not rebuilding it each time.
>>>
>>> Barry
>>>
>>> Something makes no sense with the output: it gives
>>>
>>> KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05
>>> 5.0e+02 90100 66100 24 90100 66100 24 165
>>>
>>> 90% of the time is in the solve, but there is no significant amount of time
>>> in the other events of the code, which is just not possible. I hope it is
>>> due to your IO.
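For reference, a minimal Fortran sketch of attaching the constant null space, in the spirit of the snippet quoted above and the example Matt linked; the names A and nullsp are assumed, and the Fortran placeholder for the (unused) vector array depends on the PETSc version. With a Dirichlet pressure point the matrix is already nonsingular and none of this is needed, which is the point of the "No" above.

    ! Sketch only: attach the constant null space to the Poisson matrix A
    ! for a pure-Neumann problem.  "A" and "nullsp" are assumed names.
    MatNullSpace nullsp
    PetscErrorCode ierr

    ! PETSC_NULL_OBJECT is the 3.6-era Fortran placeholder for the unused
    ! array of null-space vectors; newer releases use PETSC_NULL_VEC.
    call MatNullSpaceCreate(PETSC_COMM_WORLD,PETSC_TRUE,0,PETSC_NULL_OBJECT,nullsp,ierr)
    call MatSetNullSpace(A,nullsp,ierr)
    call MatNullSpaceDestroy(nullsp,ierr)

After this, KSPSolve with A projects the constant mode out of the solution.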
>>>
>>>
>>>
>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>>
>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse
>>> the preconditioner, what must I do? Or what must I not do?
>>>
>>> Why does the number of processes increase so much? Is there something wrong
>>> with my coding? It seems to be so for my new run too.
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>> If you are doing many time steps with the same linear solver then you
>>> MUST do your weak scaling studies with MANY time steps, since the setup time
>>> of AMG only takes place in the first timestep. So run both 48 and 96
>>> processes with the same large number of time steps.
>>>
>>> Barry
>>>
>>>
>>>
>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Sorry, I forgot and used the old a.out. I have attached the new log for
>>> 48 cores (log48), together with the 96-core log (log96).
>>>
>>> Why does the number of processes increase so much? Is there something wrong
>>> with my coding?
>>>
>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse
>>> the preconditioner, what must I do? Or what must I not do?
>>>
>>> Lastly, I only simulated 2 time steps previously. Now I run for 10
>>> timesteps (log48_10). Is it building the preconditioner at every timestep?
>>>
>>> Also, what about the momentum eqn? Is it working well?
>>>
>>> I will try the gamg later too.
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>> You used gmres with 48 processes but richardson with 96. You need to be
>>> careful and make sure you don't change the solvers when you change the
>>> number of processors, since you can get very different, inconsistent results.
>>>
>>> Anyway, all the time is being spent in the BoomerAMG algebraic
>>> multigrid setup, and it is scaling badly. When you doubled the problem
>>> size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
>>>
>>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00
>>> 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>>
>>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00
>>> 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>>
>>> Now, is the Poisson problem changing at each timestep, or can you use the
>>> same preconditioner built with BoomerAMG for all the time steps? Algebraic
>>> multigrid has a large setup time that often doesn't matter if you have many
>>> time steps, but if you have to rebuild it each timestep it may be too large.
>>>
>>> You might also try -pc_type gamg and see how PETSc's algebraic multigrid
>>> scales for your problem/machine.
>>>
>>> Barry
>>>
>>>
>>>
>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:
>>>
>>>
>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:
>>>
>>>
>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
>>> Hi,
>>>
>>> I understand that, as mentioned in the FAQ, due to the limitations in
>>> memory, the scaling is not linear. So, I am trying to write a proposal to
>>> use a supercomputer.
>>> Its specs are:
>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16 GB of memory per node)
>>> 8 cores / processor
>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>> Each cabinet contains 96 computing nodes.
>>> One of the requirements is to give the performance of my current code with
>>> my current set of data, and there is a formula to calculate the estimated
>>> parallel efficiency when using the new large set of data.
>>> There are 2 ways to give performance:
>>> 1. Strong scaling, which is defined as how the elapsed time varies with the
>>> number of processors for a fixed problem size.
>>> 2. Weak scaling, which is defined as how the elapsed time varies with the
>>> number of processors for a fixed problem size per processor.
>>> I ran my cases with 48 and 96 cores on my current cluster, giving 140 and
>>> 90 mins respectively. This is classified as strong scaling.
>>> Cluster specs:
>>> CPU: AMD 6234 2.4 GHz
>>> 8 cores / processor (CPU)
>>> 6 CPUs / node
>>> So 48 cores / node
>>> Not sure about the memory / node
>>>
>>> The parallel efficiency 'En' for a given degree of parallelism 'n'
>>> indicates how efficiently the program is accelerated by parallel
>>> processing. 'En' is given by the following formulae. Although their
>>> derivations differ between strong and weak scaling, the derived formulae
>>> are the same.
>>> From the estimated time, my parallel efficiency using Amdahl's law on the
>>> current old cluster was 52.7%.
>>> So are my results acceptable?
>>> For the large data set, if using 2205 nodes (2205 x 8 cores), my expected
>>> parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
>>> The problem with this analysis is that the estimated serial fraction from
>>> Amdahl's Law changes as a function of problem size, so you cannot take the
>>> strong scaling from one problem and apply it to another without a model of
>>> this dependence.
>>>
>>> Weak scaling does model changes with problem size, so I would measure weak
>>> scaling on your current cluster, and extrapolate to the big machine. I
>>> realize that this does not make sense for many scientific applications,
>>> but neither does requiring a certain parallel efficiency.
>>> Ok, I checked the results for my weak scaling and the expected parallel
>>> efficiency is even worse. From the formula used, it's obvious that it's
>>> doing some sort of exponentially decreasing extrapolation. So unless I can
>>> achieve a near-ideal (> 90%) speed-up when I double the cores and problem
>>> size for my current 48/96-core setup, extrapolating from about 96 nodes to
>>> 10,000 nodes will give a much lower expected parallel efficiency for the
>>> new case.
>>>
>>> However, it's mentioned in the FAQ that, due to memory requirements, it's
>>> impossible to get > 90% speed-up when I double the cores and problem size
>>> (i.e. a linear increase in performance), which means that I can't get > 90%
>>> speed-up when I double the cores and problem size for my current 48/96-core
>>> setup. Is that so?
>>> What is the output of -ksp_view -log_summary on the problem and then on
>>> the problem doubled in size and number of processors?
>>>
>>> Barry
>>> Hi,
>>>
>>> I have attached the output:
>>>
>>> 48 cores: log48
>>> 96 cores: log96
>>>
>>> There are 2 solvers - the momentum linear eqn uses bcgs, while the Poisson
>>> eqn uses hypre BoomerAMG.
>>>
>>> Problem size doubled from 158x266x150 to 158x266x300.
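As a rough illustration of why the extrapolated efficiency collapses, here is a standard Amdahl's-law estimate based only on the two timings quoted earlier in this message (140 min on 48 cores, 90 min on 96 cores). This is the textbook form, not necessarily the formula in the proposal, and the 48-core run is taken as the baseline.

    % Standard Amdahl's-law estimate (illustration only, not the proposal's formula).
    S_2 = \frac{T_{48}}{T_{96}} = \frac{140}{90} \approx 1.56
    % Fitting the non-scaling ("serial") fraction f from S_k = 1/(f + (1-f)/k) with k = 2:
    1.56 = \frac{1}{f + (1-f)/2} \quad\Rightarrow\quad f \approx 0.29
    % Extrapolating to 2205 \times 8 = 17640 cores, i.e. k = 17640/48 \approx 368:
    S_{368} \approx \frac{1}{0.29 + 0.71/368} \approx 3.4, \qquad
    E_{368} = \frac{S_{368}}{368} \approx 0.9\%

That lands in the same sub-1% range as the 0.5% figure quoted above, which is exactly Matt's point: a serial fraction fitted at one problem size dominates any large-core extrapolation unless that fraction itself shrinks as the problem grows.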
>>> So is it fair to say that the main problem does not lie in my programming
>>> skills, but rather in the way the linear equations are solved?
>>>
>>> Thanks.
>>> Thanks,
>>>
>>> Matt
>>> Is it possible for this type of scaling in PETSc (> 50%) when using 17640
>>> (2205 x 8) cores?
>>> Btw, I do not have access to the system.
>>>
>>>
>>>
>>> Sent using CloudMagic Email
>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>> <log48.txt><log96.txt>
>>> <log48_10.txt><log48.txt><log96.txt>
>>> <log96_100.txt><log48_100.txt>
>>> <log96_100_2.txt><log48_100_2.txt>
>>> <log64_100.txt><log8_100.txt>
>>>
>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
>
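Since the question "if I want to reuse the preconditioner, what must I do?" comes up repeatedly above, here is a minimal sketch of the usual PETSc idiom when only the RHS changes between timesteps. The names (ksp, A, b, x, nsteps) and the loop structure are assumptions for illustration, not taken from the poster's code.

    ! Sketch only: assemble the fixed Poisson matrix A once, set up the KSP
    ! once, and solve with a new RHS b each timestep.  A, b, x are assumed
    ! to be an already-assembled Mat and Vecs.
    KSP ksp
    PetscErrorCode ierr
    PetscInt step

    call KSPCreate(PETSC_COMM_WORLD,ksp,ierr)
    call KSPSetOperators(ksp,A,A,ierr)
    call KSPSetFromOptions(ksp,ierr)

    do step = 1, nsteps
       ! Rebuild only the RHS b here.  Because A is not modified, PETSc skips
       ! PCSetUp after the first solve, so the BoomerAMG / gamg setup cost is
       ! paid once rather than at every timestep.
       call KSPSolve(ksp,b,x,ierr)
    end do

    ! If A did have to be re-assembled but the old preconditioner is still
    ! acceptable, KSPSetReusePreconditioner(ksp,PETSC_TRUE,ierr) keeps it.

This matches the PCSetUp behaviour noted in the log discussion above: once the operator stops changing, the setup cost should appear only in the first timestep.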
