> On Aug 25, 2015, at 9:19 PM, Timothée Nicolas <[email protected]> 
> wrote:
> 
> Hi,
> 
> Problem solved!
> 
> Several points to note:
> 
> 1. First, the discrepancy between creations and destructions showed that I had 
> forgotten some VecDestroy calls, as well as a VecRestoreArrayReadF90. Repairing this 
> seemed to increase the speed a bit but did not solve the more serious 
> problem seen in the log_summary.
> 
> 2. The actual problem was a very stupid one on my side. At every time step I print 
> small diagnostics to a text file with a standard Fortran write statement rather 
> than with a viewer to a binary file. I had simply forgotten to wrap the 
> statement in an if statement on the rank:
> 
> if (rank.eq.0) then
> 
>     write(50) ....
> 
> end if
> 
> So all the processors were trying to write to the file together, which, I 
> suppose, somehow caused all the Scatters. After adding the if statement, I 
> recover a fast speed (about 10 ms per function evaluation). Thank you so much 
> for your help, I would never have made it this far without it!
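> 
> For completeness, a minimal sketch of the viewer-based alternative mentioned 
> above (the names diag, diagnostics.bin and viewer are hypothetical). VecView 
> is collective, so no rank test is needed; the viewer is opened once, used at 
> every step, and destroyed at the end:
> 
>   PetscViewer :: viewer
>   ! once, before the time loop
>   call PetscViewerBinaryOpen(PETSC_COMM_WORLD,'diagnostics.bin',FILE_MODE_WRITE,viewer,ierr)
>   ! at every time step, write the Vec of diagnostics
>   call VecView(diag,viewer,ierr)
>   ! once, after the time loop
>   call PetscViewerDestroy(viewer,ierr)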
> 
> 3. Last minor point: a discrepancy of 1 remains between creations and 
> destructions (see the extract of log_summary below) for the category "Viewer". I 
> have checked that this is also the case for the examples ex5f90.F and ex5.c on 
> which my code is based. I can't track it down, but it's probably a minor 
> point anyway.
> 
> --- Event Stage 0: Main Stage
> 
>                 SNES     1              1         1332     0
>       SNESLineSearch     1              1          864     0
>               DMSNES     2              2         1328     0
>               Vector    20             20     34973120     0
>       Vector Scatter     3              3       503488     0
>              MatMFFD     1              1          768     0
>               Matrix     1              1         2304     0
>     Distributed Mesh     3              3        14416     0
> Star Forest Bipartite Graph     6              6         5024     0
>      Discrete System     3              3         2544     0
>            Index Set     6              6       187248     0
>    IS L to G Mapping     2              2       184524     0
>        Krylov Solver     1              1         1304     0
>      DMKSP interface     1              1          648     0
>       Preconditioner     1              1          880     0
>               Viewer     3              2         1536     0
>          PetscRandom     1              1          624     0
> ========================================================================================================================
> 
  Because the -log_summary output is done with a viewer, there has to be a 
viewer not yet destroyed when the output is made. Hence it will indicate that one 
viewer still exists. This does not mean that it does not get destroyed 
eventually.

  Barry

> 
> Best 
> 
> Timothee
> 
> 
> 
> 2015-08-26 2:39 GMT+09:00 Barry Smith <[email protected]>:
> 
> > On Aug 25, 2015, at 2:06 AM, Timothée Nicolas <[email protected]> 
> > wrote:
> >
> > OK, I see,
> >
> > Might it be that I do something a bit funky to obtain a good initial guess for 
> > the solve? I had the following idea, which I used with great success on a very 
> > different problem (much simpler, maybe that's why it worked): obtain the 
> > initial guess as a quadratic extrapolation of the preceding solutions. The idea 
> > is that I expect my solution to be reasonably smooth over time, so, 
> > considering this, the increment of the fields should also be continuous (I 
> > solve for the increments, not the fields themselves). Therefore, I store in 
> > my user context the current vector Xk as well as the two previous solutions 
> > Xkm1 and Xkm2.
> >
> > I define
> >
> > dxm2 = Xkm1 - Xkm2
> > dxm1 = Xk - Xkm1
> >
> > And I use the result of the last SNESSolve as
> >
> > dx = Xkp1 - Xk
> >
> > Then I set the new dx initial guess as the pointwise quadratic extrapolation of 
> > (dxm2,dxm1,dx) to the new time.
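> > 
> > In other words (this just restates what the routine below does, pointwise): with 
> > a, b, c the times attached to the three increments (user%tkm1, user%tk, user%t 
> > in the code), fit
> > 
> >     p(tau) = coef1*tau**2 + coef2*tau + coef3
> > 
> > through p(a) = dxm2, p(b) = dxm1, p(c) = dx, and take the new initial guess as 
> > p evaluated at the new time user%t + user%dt. The 3x3 matrix M in the routine 
> > is the inverse of the corresponding Vandermonde matrix.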
> >
> > However it seems pretty local and I don't see why scatters would be 
> > required for this.
> 
>    Yes, no scatters here.
> 
> >
> > I printed the routine I use to do this below. In any case I will clean it up a 
> > bit and remove the extra stuff (not much there, however). If that is not 
> > sufficient, I will turn my form function into a dummy which does not 
> > require computations and see what happens.
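> > 
> > (A minimal sketch of such a dummy, with hypothetical names and assuming the 
> > standard PETSc Fortran includes/declarations; it does no physics and simply 
> > sets F = X, so any remaining cost is solver overhead and communication:
> > 
> >   subroutine FormFunctionDummy(snes,x,f,user,ierr)
> >     SNES           :: snes
> >     Vec            :: x,f
> >     type(userctx)  :: user          ! assumed user-defined context type
> >     PetscErrorCode :: ierr
> >     ! no computation: just copy x into f
> >     call VecCopy(x,f,ierr)
> >   end subroutine FormFunctionDummy
> > 
> > registered with SNESSetFunction in place of the real form function.)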
> >
> > Timothee
> >
> >   PetscErrorCode :: ierr
> >
> >   PetscScalar :: M(3,3)
> >   Vec         :: xkm2,xkm1
> >   Vec         :: coef1,coef2,coef3
> >   PetscScalar :: a,b,c,t,det
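> >   ! note: x (which on entry holds the latest increment dx and on exit the new
> >   ! initial guess), the user context, and the constant one are presumably
> >   ! declared in the enclosing routine, which is not shown here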
> >
> >   a = user%tkm1
> >   b = user%tk
> >   c = user%t
> >   t = user%t+user%dt
> >
> >   det = b*a**2 + c*b**2 + a*c**2 - (c*a**2 + a*b**2 + b*c**2)
> >
> >   M(1,1) = (b-c)/det
> >   M(2,1) = (c**2-b**2)/det
> >   M(3,1) = (c*b**2-b*c**2)/det
> >
> >   M(1,2) = (c-a)/det
> >   M(2,2) = (a**2-c**2)/det
> >   M(3,2) = (a*c**2-c*a**2)/det
> >
> >   M(1,3) = (a-b)/det
> >   M(2,3) = (b**2-a**2)/det
> >   M(3,3) = (b*a**2-a*b**2)/det
> >
> >   call VecDuplicate(x,xkm1,ierr)
> >   call VecDuplicate(x,xkm2,ierr)
> >
> >   call VecDuplicate(x,coef1,ierr)
> >   call VecDuplicate(x,coef2,ierr)
> >   call VecDuplicate(x,coef3,ierr)
> >
> >   ! form the increments dxm2 = Xkm1 - Xkm2 and dxm1 = Xk - Xkm1
> >   call VecWAXPY(xkm2,-one,user%Xkm2,user%Xkm1,ierr)
> >   call VecWAXPY(xkm1,-one,user%Xkm1,user%Xk,ierr)
> >
> >   ! The following lines correspond to the simple operation
> >   ! coef1 = M(1,1)*xkm2 + M(1,2)*xkm1 + M(1,3)*x
> >   ! coef2 = M(2,1)*xkm2 + M(2,2)*xkm1 + M(2,3)*x
> >   ! coef3 = M(3,1)*xkm2 + M(3,2)*xkm1 + M(3,3)*x
> >   ! (xkm2, xkm1, x hold the increments dxm2, dxm1, dx defined above)
> >   call VecCopy(xkm2,coef1,ierr)
> >   call VecScale(coef1,M(1,1),ierr)
> >   call VecAXPY(coef1,M(1,2),xkm1,ierr)
> >   call VecAXPY(coef1,M(1,3),x,ierr)
> >
> >   call VecCopy(xkm2,coef2,ierr)
> >   call VecScale(coef2,M(2,1),ierr)
> >   call VecAXPY(coef2,M(2,2),xkm1,ierr)
> >   call VecAXPY(coef2,M(2,3),x,ierr)
> >
> >   call VecCopy(xkm2,coef3,ierr)
> >   call VecScale(coef3,M(3,1),ierr)
> >   call VecAXPY(coef3,M(3,2),xkm1,ierr)
> >   call VecAXPY(coef3,M(3,3),x,ierr)
> >
> >   ! evaluate the extrapolation at the new time t: x = coef1*t**2 + coef2*t + coef3
> >   call VecCopy(coef3,x,ierr)
> >   call VecAXPY(x,t,coef2,ierr)
> >   call VecAXPY(x,t**2,coef1,ierr)
> >
> >   call VecDestroy(xkm2,ierr)
> >   call VecDestroy(xkm1,ierr)
> >
> >   call VecDestroy(coef1,ierr)
> >   call VecDestroy(coef2,ierr)
> >   call VecDestroy(coef3,ierr)
> >
> >
> >
> > 2015-08-25 15:47 GMT+09:00 Barry Smith <[email protected]>:
> >
> >   The results are kind of funky,
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > ------------------------------------------------------------------------------------------------------------------------
> > SNESSolve             40 1.0 4.9745e+02 3.3 4.25e+09 1.0 1.7e+06 3.8e+04 2.7e+03 46 93 99 95 80  46 93 99 95 80  2187
> > SNESFunctionEval     666 1.0 4.8990e+02 3.4 5.73e+08 1.0 1.7e+06 3.8e+04 1.3e+03 45 13 99 95 40  45 13 99 95 40   299
> > SNESLineSearch        79 1.0 3.8578e+00 1.0 4.98e+08 1.0 4.0e+05 3.8e+04 6.3e+02  1 11 23 23 19   1 11 23 23 19 33068
> > VecScatterEnd       1335 1.0 3.4761e+02 5.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 31  0  0  0  0  31  0  0  0  0     0
> > MatMult MF           547 1.0 1.2570e+01 1.1 1.27e+09 1.0 1.4e+06 3.8e+04 1.1e+03  2 28 81 78 34   2 28 81 78 34 25962
> > MatMult              547 1.0 1.2571e+01 1.1 1.27e+09 1.0 1.4e+06 3.8e+04 1.1e+03  2 28 81 78 34   2 28 81 78 34 25960
> >
> > Look at the %T column: the global SNESSolve is 46% of the total time and 
> > function evaluations are 45%, but MatMult is only 2% (and yet MatMult 
> > should contain most of the function evaluations). I cannot explain this. 
> > Also the VecScatterEnd is HUGE and has a bad load imbalance of 5.8. Why are 
> > there so many more scatters than function evaluations? What other 
> > operations are you doing that require scatters?
> >
> > It's almost as if you have some mysterious "extra" function calls outside of 
> > the SNESSolve that are killing the performance. To understand the performance, it 
> > might help to strip out all extraneous computations that are not needed (like 
> > in custom monitors etc.).
> >
> >  Barry
> >
> >
> >
> >
> >
> >
> > > On Aug 25, 2015, at 1:21 AM, Timothée Nicolas 
> > > <[email protected]> wrote:
> > >
> > > Here is the log summary (attached). At the beginning are personal prints, 
> > > which you can skip. In the present state I seem to hit a memory crash after 
> > > typically 45 iterations (that's why I used 40 here). The log summary 
> > > indicates some creations without destructions of PETSc objects (I will fix 
> > > this immediately); that may cause the memory crash, but I don't think 
> > > it's the cause of the slow function evaluations.
> > >
> > > The log_summary is consistent with 0.7 s per function evaluation 
> > > (4.8990e+02/666 = 0.736). In addition, SNESSolve itself takes 
> > > approximately the same amount of time (is that normal?). The other 
> > > long operation is VecScatterEnd. I assume that is the time spent in process 
> > > communications? In that case I suppose it is normal that it takes a 
> > > significant amount of time.
> > >
> > > So this ~10x increase does not look normal, right?
> > >
> > > Best
> > >
> > > Timothee NICOLAS
> > >
> > >
> > > 2015-08-25 14:56 GMT+09:00 Barry Smith <[email protected]>:
> > >
> > > > On Aug 25, 2015, at 12:45 AM, Timothée Nicolas 
> > > > <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I am testing PETSc on the supercomputer where I used to run my explicit 
> > > > MHD code. For my tests I use 256 processes on a problem of size 
> > > > 128*128*640 = 10485760 grid points, that is, 40960 grid points per process, 
> > > > with 8 degrees of freedom (or physical fields). The explicit code was using 
> > > > Runge-Kutta 4 for the time scheme, which means 4 function evaluations 
> > > > per time step (plus one operation to put everything together, but let's 
> > > > forget about that one).
> > > >
> > > > I could thus easily determine that the typical time required for a 
> > > > function evaluation was of the order of 50 ms.
> > > >
> > > > Now, with the implicit Newton-Krylov solver written with PETSc, in the 
> > > > present state where I have not yet implemented any Jacobian or 
> > > > preconditioner whatsoever (so I run with -snes_mf), I measure a typical 
> > > > time between two time steps of between 5 and 20 seconds, and the number 
> > > > of function evaluations per time step obtained with 
> > > > SNESGetNumberFunctionEvals is 17 (I am speaking of a particular case, of 
> > > > course).
> > > >
> > > > This means a time per function evaluation of about 0.5 to 1 second, 
> > > > that is, 10 to 20 times slower.
> > > >
> > > > So I have some questions about this.
> > > >
> > > > 1. First, does SNESGetNumberFunctionEvals take into account the function 
> > > > evaluations required by the matrix-free Jacobian when -snes_mf is used, as 
> > > > well as the operations required by the GMRES (Krylov) method? If that 
> > > > were the case, I would intuitively expect a number larger than 
> > > > 17, which could explain the increase in time.
> > >
> > > PetscErrorCode  SNESGetNumberFunctionEvals(SNES snes, PetscInt *nfuncs)
> > > {
> > >   *nfuncs = snes->nfuncs;
> > > }
> > >
> > > PetscErrorCode  SNESComputeFunction(SNES snes,Vec x,Vec y)
> > > {
> > > ...
> > >   snes->nfuncs++;
> > > }
> > >
> > > PetscErrorCode  MatCreateSNESMF(SNES snes,Mat *J)
> > > {
> > > .....
> > >   if (snes->pc && snes->pcside == PC_LEFT) {
> > >     ierr = MatMFFDSetFunction(*J,(PetscErrorCode 
> > > (*)(void*,Vec,Vec))SNESComputeFunctionDefaultNPC,snes);CHKERRQ(ierr);
> > >   } else {
> > >     ierr = MatMFFDSetFunction(*J,(PetscErrorCode 
> > > (*)(void*,Vec,Vec))SNESComputeFunction,snes);CHKERRQ(ierr);
> > >   }
> > > }
> > >
> > >   So, yes, I would expect all the function evaluations needed for the 
> > > matrix-free Jacobian matrix-vector products to be counted. You can also 
> > > look at the number of GMRES (Krylov) iterations it took (there should be 
> > > one multiply per iteration) to double-check that the numbers make sense.
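> > > 
> > >   (A minimal sketch of that cross-check, with assumed variable names; with 
> > > -snes_mf each Krylov iteration costs one extra function evaluation on top of 
> > > the one per Newton step and any line-search evaluations, so the two counters 
> > > should roughly track each other:
> > > 
> > >   PetscInt :: nfuncs,lits
> > >   call SNESGetNumberFunctionEvals(snes,nfuncs,ierr)
> > >   call SNESGetLinearSolveIterations(snes,lits,ierr)
> > >   print *, 'function evals:',nfuncs,'  linear (GMRES) iterations:',lits
> > > 
> > >   called after SNESSolve.)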
> > >
> > >   What does your -log_summary output look like? One thing GMRES does is 
> > > introduce a global reduction with each multiply (hence a barrier 
> > > across all your processes); on some systems this can be deadly.
> > >
> > >   Barry
> > >
> > >
> > > >
> > > > 2. In any case, I thought that, all things considered, the function 
> > > > evaluation would be the most time-consuming part of a Newton-Krylov 
> > > > solver; am I completely wrong about that? Is the 10-20 factor legit?
> > > >
> > > > I realize of course that preconditioning should make all this smoother, 
> > > > in particular by allowing larger time steps, but here I am just concerned 
> > > > about the sheer function evaluation time.
> > > >
> > > > Best regards
> > > >
> > > > Timothee NICOLAS
> > >
> > >
> > > <log_summary.txt>
> >
> >
> 
> 
