hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine why the scaling is poor.

If you make the same two runs with -pc_type gamg, there will be a lot more information in the log summary about which routines are scaling well or poorly.
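For example, the two comparison runs could look something like the following (illustrative command lines only; adjust the MPI launcher, executable name, and grid settings to your setup, and use the prefixed form such as -poisson_pc_type gamg if the Poisson KSP has the poisson_ options prefix):

    mpiexec -n 8  ./a.out -pc_type gamg -log_summary     # (158/2) x (266/2) x (150/2) grid
    mpiexec -n 64 ./a.out -pc_type gamg -log_summary     # full 158 x 266 x 150 grid

Then send both -log_summary outputs.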
   Barry

> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:
>
> Hi,
>
> I have attached the 2 files.
>
> Thank you
>
> Yours sincerely,
> TAY wee-beng
>
> On 2/11/2015 2:55 PM, Barry Smith wrote:
>> Run the (158/2) x (266/2) x (150/2) grid on 8 processes and then the 158 x 266 x 150 grid on 64 processors and send the two -log_summary results.
>>
>> Barry
>>
>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have attached the new results.
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>> TAY wee-beng
>>>
>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>> Run without the -momentum_ksp_view -poisson_ksp_view and send the new results.
>>>>
>>>> You can see from the log summary that the PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
>>>>
>>>> Barry
>>>>
>>>> Something makes no sense with the output: it gives
>>>>
>>>> KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
>>>>
>>>> 90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.
>>>>
>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>>>>
>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>
>>>>> Why does the number of processes increase so much? Is there something wrong with my coding? It seems to be so for my new run too.
>>>>>
>>>>> Thank you
>>>>>
>>>>> Yours sincerely,
>>>>> TAY wee-beng
>>>>>
>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>> If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
>>>>>>>
>>>>>>> Why does the number of processes increase so much? Is there something wrong with my coding?
>>>>>>>
>>>>>>> Only the Poisson eqn's RHS changes, the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>
>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
>>>>>>>
>>>>>>> Also, what about the momentum eqn? Is it working well?
>>>>>>>
>>>>>>> I will try the gamg later too.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>> You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
>>>>>>>>
>>>>>>>> Anyways, all the time is being spent in the BoomerAMG algebraic multigrid setup and it is scaling badly. When you double the problem size and number of processes it went from 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>
>>>>>>>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>>>>>>>
>>>>>>>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>>>>>>>
>>>>>>>> Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it each timestep it is too large.
>>>>>>>>
>>>>>>>> You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the limitations in memory the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:
>>>>>>>>>>>>
>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data.
>>>>>>>>>>>>
>>>>>>>>>>>> There are 2 ways to give performance:
>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem.
>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
>>>>>>>>>>>>
>>>>>>>>>>>> I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
>>>>>>>>>>>>
>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>> 6 CPU / node
>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>
>>>>>>>>>>>> The parallel efficiency ‘En’ for a given degree of parallelism ‘n’ indicates how efficiently the program is accelerated by parallel processing. ‘En’ is given by the following formulae. Although their derivations differ for strong and weak scaling, the derived formulae are the same.
>>>>>>>>>>>>
>>>>>>>>>>>> From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?
>>>>>>>>>>>>
>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.
>>>>>>>>>>>>
>>>>>>>>>>>> Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster, and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.
>>>>>>>>>>>
>>>>>>>>>>> Ok, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation decrease. So unless I can achieve a near > 90% speedup when I double the cores and problem size for my current 48/96 cores setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
>>>>>>>>>>>
>>>>>>>>>>> However, it's mentioned in the FAQ that due to memory requirements it's impossible to get > 90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get > 90% speedup when I double the cores and problem size for my current 48/96 cores setup. Is that so?
>>>>>>>>>>
>>>>>>>>>> What is the output of -ksp_view -log_summary on the problem and then on the problem doubled in size and number of processors?
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have attached the output:
>>>>>>>>>
>>>>>>>>> 48 cores: log48
>>>>>>>>> 96 cores: log96
>>>>>>>>>
>>>>>>>>> There are 2 solvers: the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>
>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>
>>>>>>>>>>> So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Matt
>>>>>>>>>>>>
>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (> 50%) when using 17640 (2205 x 8) cores?
>>>>>>>>>>>> Btw, I do not have access to the system.
>>>>>>>>>>>>
>>>>>>>>>>>> Sent using CloudMagic Email
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>> <log48.txt><log96.txt>
>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>> <log96_100.txt><log48_100.txt>
>>> <log96_100_2.txt><log48_100_2.txt>
> <log64_100.txt><log8_100.txt>
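P.S. On the question that comes up several times above about reusing the preconditioner when only the Poisson RHS changes: a minimal sketch, assuming PETSc 3.5 or later and that the matrix passed to KSPSetOperators() for the Poisson solve is never modified between time steps. Set the operator once before the time loop and do not call KSPSetOperators() again; the PCSetUp will then not be redone at later solves. Reuse can also be forced explicitly with KSPSetReusePreconditioner() in the code, or on the command line with the poisson_ prefix seen above:

    -poisson_ksp_reuse_preconditioner

The call count and time for PCSetUp in -log_summary are the quickest way to check that the setup really happens only once instead of at every time step.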
