OK, the convergence looks good. Now run on 8 and 64 processes as before with -log_summary (and without -ksp_monitor) to see how it scales.
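
(For reference, since the null-space question comes up below: a minimal sketch, not taken from the attached code, of attaching the constant null space that a pure-Neumann Poisson matrix has. The matrix name A is illustrative; older PETSc versions pass the null space to the solver with KSPSetNullSpace instead of relying on MatSetNullSpace.)

  #include <petscksp.h>

  /* Sketch: attach the constant null space of a pure-Neumann Poisson matrix.
     "A" is an illustrative name for the assembled Poisson matrix. */
  PetscErrorCode AttachConstantNullSpace(Mat A)
  {
    MatNullSpace   nullsp;
    PetscErrorCode ierr;

    /* has_cnst = PETSC_TRUE: the null space is spanned by the constant vector */
    ierr = MatNullSpaceCreate(PetscObjectComm((PetscObject)A), PETSC_TRUE, 0, NULL, &nullsp);CHKERRQ(ierr);
    ierr = MatSetNullSpace(A, nullsp);CHKERRQ(ierr);
    ierr = MatNullSpaceDestroy(&nullsp);CHKERRQ(ierr);
    /* The RHS must also be consistent (orthogonal to the constants);
       MatNullSpaceRemove() can project it before KSPSolve if needed. */
    return 0;
  }

Whether the attached code already does something equivalent can only be checked against the code itself.
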
Barry

> On Nov 3, 2015, at 6:49 AM, TAY wee-beng <[email protected]> wrote:
>
> Hi,
>
> I tried and have attached the log.
>
> Ya, my Poisson eqn has Neumann boundary conditions. Do I need to specify the null space, e.g. with KSPSetNullSpace or MatNullSpaceCreate?
>
> Thank you
>
> Yours sincerely,
>
> TAY wee-beng
>
> On 3/11/2015 12:45 PM, Barry Smith wrote:
>>> On Nov 2, 2015, at 10:37 PM, TAY wee-beng <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I tried:
>>>
>>> 1. -poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg
>>>
>>> 2. -poisson_pc_type gamg
>>
>> Run with -poisson_ksp_monitor_true_residual -poisson_ksp_monitor_converged_reason
>>
>> Does your Poisson problem have Neumann boundary conditions? Do you have any zeros on the diagonal of the matrix? (You shouldn't.)
>>
>> There may be something wrong with your Poisson discretization that was also messing up hypre.
>>
>>> Both options give:
>>>
>>> 1 0.00150000 0.00000000 0.00000000 1.00000000
>>> NaN NaN NaN
>>> M Diverged but why?, time = 2
>>> reason = -9
>>>
>>> How can I check what's wrong?
>>>
>>> Thank you
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>> On 3/11/2015 3:18 AM, Barry Smith wrote:
>>>> hypre is just not scaling well here. I do not know why. Since hypre is a black box for us, there is no way to determine the cause of the poor scaling.
>>>>
>>>> If you make the same two runs with -pc_type gamg, there will be a lot more information in the log summary about which routines scale well or poorly.
>>>>
>>>> Barry
>>>>
>>>>> On Nov 2, 2015, at 3:17 AM, TAY wee-beng <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have attached the 2 files.
>>>>>
>>>>> Thank you
>>>>>
>>>>> Yours sincerely,
>>>>>
>>>>> TAY wee-beng
>>>>>
>>>>> On 2/11/2015 2:55 PM, Barry Smith wrote:
>>>>>> Run the (158/2)x(266/2)x(150/2) grid on 8 processes and then (158)x(266)x(150) on 64 processes, and send the two -log_summary results.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>> On Nov 2, 2015, at 12:19 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have attached the new results.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Yours sincerely,
>>>>>>>
>>>>>>> TAY wee-beng
>>>>>>>
>>>>>>> On 2/11/2015 12:27 PM, Barry Smith wrote:
>>>>>>>> Run without the -momentum_ksp_view -poisson_ksp_view and send the new results.
>>>>>>>>
>>>>>>>> You can see from the log summary that PCSetUp is taking a much smaller percentage of the time, meaning that it is reusing the preconditioner and not rebuilding it each time.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>> Something makes no sense with the output: it gives
>>>>>>>>
>>>>>>>> KSPSolve 199 1.0 2.3298e+03 1.0 5.20e+09 1.8 3.8e+04 9.9e+05 5.0e+02 90100 66100 24 90100 66100 24 165
>>>>>>>>
>>>>>>>> 90% of the time is in the solve, but there is no significant amount of time in other events of the code, which is just not possible. I hope it is due to your IO.
>>>>>>>>
>>>>>>>>> On Nov 1, 2015, at 10:02 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have attached the new run with 100 time steps for 48 and 96 cores.
>>>>>>>>>
>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>
>>>>>>>>> Why does the number of processes increase so much? Is there something wrong with my coding? Seems to be so too for my new run.
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>>
>>>>>>>>> Yours sincerely,
>>>>>>>>>
>>>>>>>>> TAY wee-beng
>>>>>>>>>
>>>>>>>>> On 2/11/2015 9:49 AM, Barry Smith wrote:
>>>>>>>>>> If you are doing many time steps with the same linear solver then you MUST do your weak scaling studies with MANY time steps, since the setup time of AMG only takes place in the first timestep. So run both 48 and 96 processes with the same large number of time steps.
>>>>>>>>>>
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>>> On Nov 1, 2015, at 7:35 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I forgot and used the old a.out. I have attached the new log for 48 cores (log48), together with the 96-core log (log96).
>>>>>>>>>>>
>>>>>>>>>>> Why does the number of processes increase so much? Is there something wrong with my coding?
>>>>>>>>>>>
>>>>>>>>>>> Only the Poisson eqn's RHS changes; the LHS doesn't. So if I want to reuse the preconditioner, what must I do? Or what must I not do?
>>>>>>>>>>>
>>>>>>>>>>> Lastly, I only simulated 2 time steps previously. Now I run for 10 timesteps (log48_10). Is it building the preconditioner at every timestep?
>>>>>>>>>>>
>>>>>>>>>>> Also, what about the momentum eqn? Is it working well?
>>>>>>>>>>>
>>>>>>>>>>> I will try the gamg later too.
>>>>>>>>>>>
>>>>>>>>>>> Thank you
>>>>>>>>>>>
>>>>>>>>>>> Yours sincerely,
>>>>>>>>>>>
>>>>>>>>>>> TAY wee-beng
>>>>>>>>>>>
>>>>>>>>>>> On 2/11/2015 12:30 AM, Barry Smith wrote:
>>>>>>>>>>>> You used gmres with 48 processes but richardson with 96. You need to be careful and make sure you don't change the solvers when you change the number of processors, since you can get very different, inconsistent results.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, all the time is being spent in the BoomerAMG algebraic multigrid setup, and it is scaling badly. When you double the problem size and the number of processes, it went from 3.2445e+01 to 4.3599e+02 seconds.
>>>>>>>>>>>>
>>>>>>>>>>>> PCSetUp 3 1.0 3.2445e+01 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 62 8 0 0 4 62 8 0 0 5 11
>>>>>>>>>>>>
>>>>>>>>>>>> PCSetUp 3 1.0 4.3599e+02 1.0 9.58e+06 2.0 0.0e+00 0.0e+00 4.0e+00 85 18 0 0 6 85 18 0 0 6 2
>>>>>>>>>>>>
>>>>>>>>>>>> Now, is the Poisson problem changing at each timestep, or can you use the same preconditioner built with BoomerAMG for all the time steps? Algebraic multigrid has a large setup time that often doesn't matter if you have many time steps, but if you have to rebuild it at each timestep it is too large.
>>>>>>>>>>>>
>>>>>>>>>>>> You might also try -pc_type gamg and see how PETSc's algebraic multigrid scales for your problem/machine.
>>>>>>>>>>>>
>>>>>>>>>>>> Barry
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 1, 2015, at 7:30 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 1/11/2015 10:00 AM, Barry Smith wrote:
>>>>>>>>>>>>>>> On Oct 31, 2015, at 8:43 PM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 1/11/2015 12:47 AM, Matthew Knepley wrote:
>>>>>>>>>>>>>>>> On Sat, Oct 31, 2015 at 11:34 AM, TAY wee-beng <[email protected]> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I understand that, as mentioned in the FAQ, due to the limitations in memory the scaling is not linear. So I am trying to write a proposal to use a supercomputer. Its specs are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Compute nodes: 82,944 nodes (SPARC64 VIIIfx; 16GB of memory per node)
>>>>>>>>>>>>>>>> 8 cores / processor
>>>>>>>>>>>>>>>> Interconnect: Tofu (6-dimensional mesh/torus)
>>>>>>>>>>>>>>>> Each cabinet contains 96 computing nodes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One of the requirements is to give the performance of my current code with my current set of data, and there is a formula to calculate the estimated parallel efficiency when using the new large set of data. There are 2 ways to give performance:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Strong scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size.
>>>>>>>>>>>>>>>> 2. Weak scaling, which is defined as how the elapsed time varies with the number of processors for a fixed problem size per processor.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I ran my cases with 48 and 96 cores on my current cluster, giving 140 and 90 mins respectively. This is classified as strong scaling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cluster specs:
>>>>>>>>>>>>>>>> CPU: AMD 6234 2.4GHz
>>>>>>>>>>>>>>>> 8 cores / processor (CPU)
>>>>>>>>>>>>>>>> 6 CPUs / node
>>>>>>>>>>>>>>>> So 48 cores / node
>>>>>>>>>>>>>>>> Not sure about the memory / node
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The parallel efficiency 'En' for a given degree of parallelism 'n' indicates how efficiently the program is accelerated by parallel processing. 'En' is given by the following formulae. Although their derivations differ for strong and weak scaling, the derived formulae are the same.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the estimated time, my parallel efficiency using Amdahl's law on the current old cluster was 52.7%. So are my results acceptable?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For the large data set, if using 2205 nodes (2205 x 8 cores), my expected parallel efficiency is only 0.5%. The proposal recommends a value of > 50%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The problem with this analysis is that the estimated serial fraction from Amdahl's Law changes as a function of problem size, so you cannot take the strong scaling from one problem and apply it to another without a model of this dependence.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Weak scaling does model changes with problem size, so I would measure weak scaling on your current cluster and extrapolate to the big machine. I realize that this does not make sense for many scientific applications, but neither does requiring a certain parallel efficiency.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, I checked the results for my weak scaling; the expected parallel efficiency is even worse. From the formula used, it's obvious it's doing some sort of exponential extrapolation of the decrease. So unless I can achieve nearly >90% speedup when I double the cores and problem size for my current 48/96-core setup, extrapolating from about 96 nodes to 10,000 nodes will give a much lower expected parallel efficiency for the new case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, it's mentioned in the FAQ that, due to memory requirements, it's impossible to get >90% speedup when I double the cores and problem size (i.e. a linear increase in performance), which means that I can't get >90% speedup when I double the cores and problem size for my current 48/96-core setup. Is that so?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What is the output of -ksp_view -log_summary on the problem, and then on the problem doubled in size and number of processors?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have attached the output:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 48 cores: log48
>>>>>>>>>>>>> 96 cores: log96
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are 2 solvers: the momentum linear eqn uses bcgs, while the Poisson eqn uses hypre BoomerAMG.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Problem size doubled from 158x266x150 to 158x266x300.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So is it fair to say that the main problem does not lie in my programming skills, but rather in the way the linear equations are solved?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matt
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it possible for this type of scaling in PETSc (>50%) when using 17640 (2205 x 8) cores? Btw, I do not have access to the system.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>>>>>>> -- Norbert Wiener
>>>>>>>>>>>>>
>>>>>>>>>>>>> <log48.txt><log96.txt>
>>>>>>>>>>> <log48_10.txt><log48.txt><log96.txt>
>>>>>>>>> <log96_100.txt><log48_100.txt>
>>>>>>> <log96_100_2.txt><log48_100_2.txt>
>>>>> <log64_100.txt><log8_100.txt>
>
> <log.txt>
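
(Editorial aside on the recurring question above about reusing the preconditioner when only the Poisson RHS changes: a minimal sketch under the assumption that the matrix really is unchanged. The names ksp_poisson, A, b, x and nsteps are illustrative, not from the attached code; if the matrix does change but the old preconditioner is still wanted, KSPSetReusePreconditioner() is the relevant call instead.)

  #include <petscksp.h>

  /* Sketch: build the AMG preconditioner once and reuse it for every time step. */
  PetscErrorCode SolvePoissonEachStep(KSP ksp_poisson, Mat A, Vec b, Vec x, PetscInt nsteps)
  {
    PetscErrorCode ierr;
    PetscInt       step;

    /* Set the operator once, before the time loop; the expensive PCSetUp
       (BoomerAMG / GAMG setup) then happens only at the first KSPSolve. */
    ierr = KSPSetOperators(ksp_poisson, A, A);CHKERRQ(ierr);
    for (step = 0; step < nsteps; step++) {
      /* ... update only the entries of b for this time step ... */
      /* A is unchanged and KSPSetOperators is not called again, so the
         preconditioner is NOT rebuilt here. */
      ierr = KSPSolve(ksp_poisson, b, x);CHKERRQ(ierr);
    }
    return 0;
  }

In -log_summary terms, a setup that is reused once per run shows up as a small PCSetUp count and percentage, as Barry notes above.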

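(A small editorial illustration of the serial-fraction point discussed above, using only the 140 min / 90 min timings reported for 48 and 96 cores; the proposal's own 'En' formula is not reproduced here.) Treating the 48-core run as the baseline, Amdahl's law for doubling the core count gives

\[
S_2 = \frac{t_{48}}{t_{96}} = \frac{140}{90} \approx 1.56,
\qquad
S_2 = \frac{1}{s + (1-s)/2}
\;\Rightarrow\;
s = \frac{2\,t_{96}}{t_{48}} - 1 = \frac{180}{140} - 1 \approx 0.29,
\]

i.e. roughly 29% of the 48-core runtime does not speed up further at this problem size. A serial fraction fitted from one fixed-size run says little about a much larger run, which is exactly the problem-size dependence Matt points out above.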