> On Aug 25, 2015, at 9:19 PM, Timothée Nicolas <[email protected]> wrote:
>
> Hi,
>
> Problem solved!
>
> Several points to be noticed:
>
> 1. First, the discrepancy between creations and destructions showed I had
> forgotten some VecDestroy calls, and also a VecRestoreArrayReadF90.
> Repairing this seemed to increase the speed a bit, but did not solve the
> more serious problem seen in the log_summary.
>
> 2. The actual problem was a very stupid one on my side. At every time step
> I print small diagnostics to a text file with a standard Fortran write
> statement rather than with a viewer to a binary file. I had simply
> forgotten to guard the statement with an if statement on the rank:
>
>   if (rank.eq.0) then
>     write(50) ....
>   end if
>
> So all the processors were trying to write to the file together, which, I
> suppose, somehow caused all the scatters. After adding the if statement, I
> recover a fast speed (about 10 ms per function evaluation). Thank you so
> much for your help, I would never have made it so far without it!
>
> 3. Last minor point: a discrepancy of 1 remains between creations and
> destructions (see the extract of log_summary below) for the category
> "Viewer". I have checked that it is also the case for the examples
> ex5f90.F and ex5.c on which my code is based. I can't track it down, but
> it's probably a minor point anyway.
>
> --- Event Stage 0: Main Stage
>
> Object Type                   Creations   Destructions     Memory   Descendants' Mem.
> SNES                                  1              1       1332       0
> SNESLineSearch                        1              1        864       0
> DMSNES                                2              2       1328       0
> Vector                               20             20   34973120       0
> Vector Scatter                        3              3     503488       0
> MatMFFD                               1              1        768       0
> Matrix                                1              1       2304       0
> Distributed Mesh                      3              3      14416       0
> Star Forest Bipartite Graph           6              6       5024       0
> Discrete System                       3              3       2544       0
> Index Set                             6              6     187248       0
> IS L to G Mapping                     2              2     184524       0
> Krylov Solver                         1              1       1304       0
> DMKSP interface                       1              1        648       0
> Preconditioner                        1              1        880       0
> Viewer                                3              2       1536       0
> PetscRandom                           1              1        624       0
> ========================================================================================================================

   Because the -log_summary output is done with a viewer, there has to be a
viewer not yet destroyed when the output is made. Hence it will indicate
that one viewer still exists. This does not mean that it does not get
destroyed eventually.
   Barry

> Best
>
> Timothee
>
>
> 2015-08-26 2:39 GMT+09:00 Barry Smith <[email protected]>:
>
> > On Aug 25, 2015, at 2:06 AM, Timothée Nicolas <[email protected]> wrote:
> >
> > > OK, I see,
> > >
> > > Might it be that I do something a bit funky to obtain a good initial
> > > guess for the solve? I had the following idea, which I used with great
> > > success on a very different problem (much simpler, maybe that's why it
> > > worked): obtain the initial guess as a quadratic extrapolation of the
> > > preceding solutions. The idea is that I expect my solution to be
> > > reasonably smooth over time, so the increments of the fields should
> > > also be continuous (I solve for the increments, not the fields
> > > themselves). Therefore, I store in my user context the current vector
> > > Xk as well as the two previous solutions Xkm1 and Xkm2.
> > >
> > > I define
> > >
> > >   dxm2 = Xkm1 - Xkm2
> > >   dxm1 = Xk - Xkm1
> > >
> > > and I use the result of the last SNESSolve as
> > >
> > >   dx = Xkp1 - Xk
> > >
> > > Then I set the new dx initial guess as the pointwise extrapolation
> > > from (dxm2, dxm1, dx).
> > >
> > > However it seems pretty local and I don't see why scatters would be
> > > required for this.
> >
> >    Yes, no scatters here.
> >
> > > I printed the routine I use to do this below. In any case I will clean
> > > up a bit and remove the extra stuff (not much there however). If that
> > > is not sufficient, I will replace my form function with a dummy which
> > > does not require computations and see what happens.
> > > Timothee
> > >
> > >   PetscErrorCode :: ierr
> > >
> > >   PetscScalar :: M(3,3)
> > >   Vec :: xkm2,xkm1
> > >   Vec :: coef1,coef2,coef3
> > >   PetscScalar :: a,b,c,t,det
> > >
> > >   a = user%tkm1
> > >   b = user%tk
> > >   c = user%t
> > >   t = user%t+user%dt
> > >
> > >   det = b*a**2 + c*b**2 + a*c**2 - (c*a**2 + a*b**2 + b*c**2)
> > >
> > >   M(1,1) = (b-c)/det
> > >   M(2,1) = (c**2-b**2)/det
> > >   M(3,1) = (c*b**2-b*c**2)/det
> > >
> > >   M(1,2) = (c-a)/det
> > >   M(2,2) = (a**2-c**2)/det
> > >   M(3,2) = (a*c**2-c*a**2)/det
> > >
> > >   M(1,3) = (a-b)/det
> > >   M(2,3) = (b**2-a**2)/det
> > >   M(3,3) = (b*a**2-a*b**2)/det
> > >
> > >   call VecDuplicate(x,xkm1,ierr)
> > >   call VecDuplicate(x,xkm2,ierr)
> > >
> > >   call VecDuplicate(x,coef1,ierr)
> > >   call VecDuplicate(x,coef2,ierr)
> > >   call VecDuplicate(x,coef3,ierr)
> > >
> > >   call VecWAXPY(xkm2,-one,user%Xkm2,user%Xkm1,ierr)
> > >   call VecWAXPY(xkm1,-one,user%Xkm1,user%Xk,ierr)
> > >
> > >   ! The following lines correspond to the simple operation
> > >   !   coef1 = M(1,1)*alpha + M(1,2)*beta + M(1,3)*gamma
> > >   !   coef2 = M(2,1)*alpha + M(2,2)*beta + M(2,3)*gamma
> > >   !   coef3 = M(3,1)*alpha + M(3,2)*beta + M(3,3)*gamma
> > >   call VecCopy(xkm2,coef1,ierr)
> > >   call VecScale(coef1,M(1,1),ierr)
> > >   call VecAXPY(coef1,M(1,2),xkm1,ierr)
> > >   call VecAXPY(coef1,M(1,3),x,ierr)
> > >
> > >   call VecCopy(xkm2,coef2,ierr)
> > >   call VecScale(coef2,M(2,1),ierr)
> > >   call VecAXPY(coef2,M(2,2),xkm1,ierr)
> > >   call VecAXPY(coef2,M(2,3),x,ierr)
> > >
> > >   call VecCopy(xkm2,coef3,ierr)
> > >   call VecScale(coef3,M(3,1),ierr)
> > >   call VecAXPY(coef3,M(3,2),xkm1,ierr)
> > >   call VecAXPY(coef3,M(3,3),x,ierr)
> > >
> > >   call VecCopy(coef3,x,ierr)
> > >   call VecAXPY(x,t,coef2,ierr)
> > >   call VecAXPY(x,t**2,coef1,ierr)
> > >
> > >   call VecDestroy(xkm2,ierr)
> > >   call VecDestroy(xkm1,ierr)
> > >
> > >   call VecDestroy(coef1,ierr)
> > >   call VecDestroy(coef2,ierr)
> > >   call VecDestroy(coef3,ierr)
> > >
> > >
> > > 2015-08-25 15:47 GMT+09:00 Barry Smith <[email protected]>:
> > >
> > > >    The results are kind of funky,
> > > >
> > > > ------------------------------------------------------------------------------------------------------------------------
> > > > Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> > > >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > > > ------------------------------------------------------------------------------------------------------------------------
> > > > SNESSolve             40 1.0 4.9745e+02 3.3 4.25e+09 1.0 1.7e+06 3.8e+04 2.7e+03 46 93 99 95 80  46 93 99 95 80  2187
> > > > SNESFunctionEval     666 1.0 4.8990e+02 3.4 5.73e+08 1.0 1.7e+06 3.8e+04 1.3e+03 45 13 99 95 40  45 13 99 95 40   299
> > > > SNESLineSearch        79 1.0 3.8578e+00 1.0 4.98e+08 1.0 4.0e+05 3.8e+04 6.3e+02  1 11 23 23 19   1 11 23 23 19 33068
> > > > VecScatterEnd       1335 1.0 3.4761e+02 5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 31  0  0  0  0  31  0  0  0  0     0
> > > > MatMult MF           547 1.0 1.2570e+01 1.1 1.27e+09 1.0 1.4e+06 3.8e+04 1.1e+03  2 28 81 78 34   2 28 81 78 34 25962
> > > > MatMult              547 1.0 1.2571e+01 1.1 1.27e+09 1.0 1.4e+06 3.8e+04 1.1e+03  2 28 81 78 34   2 28 81 78 34 25960
> > > >    Look at the %T column: the global SNESSolve is 46% of the total
> > > > time and the function evaluations are 45%, but MatMult is only 2%
> > > > (and yet MatMult should contain most of the function evaluations). I
> > > > cannot explain this. Also, the VecScatterEnd time is HUGE and has a
> > > > bad load imbalance of 5.8. Why are there so many more scatters than
> > > > function evaluations? What other operations are you doing that
> > > > require scatters?
> > > >
> > > >    It's almost like you have some mysterious "extra" function calls
> > > > outside of the SNESSolve that are killing the performance. To
> > > > understand the performance, it might help to strip out all
> > > > extraneous computations that are not needed (like custom monitors
> > > > etc.).
> > > >
> > > >    Barry
> > > >
> > > > On Aug 25, 2015, at 1:21 AM, Timothée Nicolas <[email protected]> wrote:
> > > >
> > > > > Here is the log summary (attached). At the beginning are personal
> > > > > prints; you can skip them. In the present state I seem to hit a
> > > > > memory crash after typically 45 iterations (that's why I used 40
> > > > > here). The log summary indicates some creations without
> > > > > destructions of PETSc objects (I will fix this immediately); that
> > > > > may cause the memory crash, but I don't think it is the cause of
> > > > > the slow function evaluations.
> > > > >
> > > > > The log_summary is consistent with 0.7 s per function evaluation
> > > > > (4.8990e+02/666 = 0.736). In addition, SNESSolve itself takes
> > > > > approximately the same amount of time (is that normal?). The other
> > > > > long operation is VecScatterEnd. I assume that is the time spent
> > > > > in process communications? In which case I suppose it is normal
> > > > > that it takes a significant amount of time.
> > > > >
> > > > > So this ~10 times increase does not look normal, right?
> > > > > Best
> > > > >
> > > > > Timothee NICOLAS
> > > > >
> > > > >
> > > > > 2015-08-25 14:56 GMT+09:00 Barry Smith <[email protected]>:
> > > > >
> > > > > > On Aug 25, 2015, at 12:45 AM, Timothée Nicolas <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am testing PETSc on the supercomputer where I used to run my
> > > > > > > explicit MHD code. For my tests I use 256 processes on a
> > > > > > > problem of size 128*128*640 = 10485760, that is, 40960 grid
> > > > > > > points per process, with 8 degrees of freedom (or physical
> > > > > > > fields). The explicit code used Runge-Kutta 4 for the time
> > > > > > > scheme, which means 4 function evaluations per time step (plus
> > > > > > > one operation to put everything together, but let's forget
> > > > > > > that one).
> > > > > > >
> > > > > > > I could thus easily determine that the typical time required
> > > > > > > for a function evaluation was of the order of 50 ms.
> > > > > > >
> > > > > > > Now with the implicit Newton-Krylov solver written in PETSc,
> > > > > > > in the present state where I have not yet implemented any
> > > > > > > Jacobian or preconditioner whatsoever (so I run with
> > > > > > > -snes_mf), I measure a typical time between two time steps of
> > > > > > > between 5 and 20 seconds, and the number of function
> > > > > > > evaluations per time step obtained with
> > > > > > > SNESGetNumberFunctionEvals is 17 (I am speaking of a
> > > > > > > particular case of course).
> > > > > > >
> > > > > > > This means a time per function evaluation of about 0.5 to 1
> > > > > > > second, that is, 10 to 20 times slower.
> > > > > > >
> > > > > > > So I have some questions about this.
> > > > > > >
> > > > > > > 1. First, does SNESGetNumberFunctionEvals take into account
> > > > > > > the function evaluations required to evaluate the Jacobian
> > > > > > > when -snes_mf is used, as well as the operations required by
> > > > > > > the GMRES (Krylov) method? If it did, I would intuitively
> > > > > > > expect a number larger than 17, which could explain the
> > > > > > > increase in time.
> > > > > > PetscErrorCode SNESGetNumberFunctionEvals(SNES snes, PetscInt *nfuncs)
> > > > > > {
> > > > > >   *nfuncs = snes->nfuncs;
> > > > > > }
> > > > > >
> > > > > > PetscErrorCode SNESComputeFunction(SNES snes,Vec x,Vec y)
> > > > > > {
> > > > > >   ...
> > > > > >   snes->nfuncs++;
> > > > > > }
> > > > > >
> > > > > > PetscErrorCode MatCreateSNESMF(SNES snes,Mat *J)
> > > > > > {
> > > > > >   .....
> > > > > >   if (snes->pc && snes->pcside == PC_LEFT) {
> > > > > >     ierr = MatMFFDSetFunction(*J,(PetscErrorCode (*)(void*,Vec,Vec))SNESComputeFunctionDefaultNPC,snes);CHKERRQ(ierr);
> > > > > >   } else {
> > > > > >     ierr = MatMFFDSetFunction(*J,(PetscErrorCode (*)(void*,Vec,Vec))SNESComputeFunction,snes);CHKERRQ(ierr);
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > >    So, yes, I would expect all the function evaluations needed
> > > > > > for the matrix-free Jacobian matrix-vector products to be
> > > > > > counted. You can also look at the number of GMRES Krylov
> > > > > > iterations it took (there should be one multiply per iteration)
> > > > > > to double-check that the numbers make sense.
> > > > > >
> > > > > >    What does your -log_summary output look like? One thing GMRES
> > > > > > does is introduce a global reduction with each multiply (hence a
> > > > > > barrier across all your processes); on some systems this can be
> > > > > > deadly.
> > > > > >
> > > > > >    Barry
> > > > > >
> > > > > > > 2. In any case, I thought that, all things considered, the
> > > > > > > function evaluation would be the most time-consuming part of a
> > > > > > > Newton-Krylov solver. Am I completely wrong about that? Is the
> > > > > > > 10-20 factor legit?
> > > > > > >
> > > > > > > I realize of course that preconditioning should make all this
> > > > > > > smoother, in particular allowing larger time steps, but here I
> > > > > > > am just concerned about the sheer function evaluation time.
> > > > > > >
> > > > > > > Best regards
> > > > > > >
> > > > > > > Timothee NICOLAS
> > > > >
> > > > > <log_summary.txt>
