I'll take a look at it this weekend. The computed preconditioned residual norm is a finite real number, so I am not sure where PETSc would be able to detect the problem before it is too late.
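In the meantime, a caller-side guard may help. What follows is a minimal sketch in plain C, not PETSc code: `sum_of_squares` and `solve_is_usable` are hypothetical names, and `reason` is only assumed to follow the KSPConvergedReason sign convention (positive means converged, negative means diverged). In a PETSc application the same idea would presumably be wired up with KSPGetConvergedReason() and VecNorm() on the solution vector; that wiring is an assumption and is not shown here.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not a PETSc routine): sum of squares of x[0..n-1].
   Any NaN or Inf entry propagates into the result. */
double sum_of_squares(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += x[i] * x[i];
    }
    return s;
}

/* Returns 1 if the iterative solve can be accepted, 0 if the caller should
   fall back (for example, to a direct solve of the same system).
   `reason` is assumed to follow the KSPConvergedReason sign convention:
   positive = converged, negative = diverged, zero = still iterating.
   Note: very large finite entries can overflow the sum and are rejected too. */
int solve_is_usable(int reason, const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    if (reason <= 0) {
        return 0; /* the solver itself reported no convergence */
    }
    /* The solver claims convergence: still reject a NaN/Inf solution. */
    return isfinite(sum_of_squares(x, n)) ? 1 : 0;
}
```

With a guard of this kind, a step such as 3318 in the log below (CONVERGED_RTOL, but INF position and pressure norms) would be sent to the fallback solver instead of being accepted.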
> On Apr 1, 2022, at 6:14 PM, Giovane Avancini <[email protected]> wrote:
>
> Hi Barry, it's me again.
>
> Sorry to bother you with this issue, but the problem is still happening, now
> when using KSPIBCGS. As you can see below, even when a NaN pops up in the
> residual, the solver still converges to an INF solution.
>
> ----------------------- TIME STEP = 3318, time = 0.663600 -----------------------
>
> Mesh Regenerated. Elapsed time: 0.018536
> Isolated nodes: 14
> Assemble Linear System. Elapsed time: 0.030077
>   0 KSP preconditioned resid norm 4.087133454416e+04 true resid norm -nan ||r(i)||/||b|| -nan
>   1 KSP preconditioned resid norm 8.670288259109e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   2 KSP preconditioned resid norm 4.875596419197e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   3 KSP preconditioned resid norm 1.226640070761e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   4 KSP preconditioned resid norm 7.121904546851e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   5 KSP preconditioned resid norm 5.990560906831e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   6 KSP preconditioned resid norm 4.256157374933e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   7 KSP preconditioned resid norm 3.274351035311e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   8 KSP preconditioned resid norm 2.436138522439e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   9 KSP preconditioned resid norm 1.268089193578e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  10 KSP preconditioned resid norm 1.093950736015e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  11 KSP preconditioned resid norm 9.950531836062e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  12 KSP preconditioned resid norm 1.066841140901e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  13 KSP preconditioned resid norm 1.003475554456e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  14 KSP preconditioned resid norm 1.073513486989e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  15 KSP preconditioned resid norm 8.724609972930e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  16 KSP preconditioned resid norm 1.445166180332e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  17 KSP preconditioned resid norm 3.767376396291e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  18 KSP preconditioned resid norm 7.597770355737e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  19 KSP preconditioned resid norm 3.208030402538e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  20 KSP preconditioned resid norm 3.477715841173e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  21 KSP preconditioned resid norm 2.880337856055e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  22 KSP preconditioned resid norm 2.730108581171e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  23 KSP preconditioned resid norm 2.111131168298e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  24 KSP preconditioned resid norm 1.635560497545e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  25 KSP preconditioned resid norm 1.550914551701e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  26 KSP preconditioned resid norm 1.409066040669e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  27 KSP preconditioned resid norm 1.032086999081e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  28 KSP preconditioned resid norm 1.111168488798e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  29 KSP preconditioned resid norm 9.898696915473e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  30 KSP preconditioned resid norm 1.234283818664e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  31 KSP preconditioned resid norm 2.735222111838e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  32 KSP preconditioned resid norm 6.431272223321e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  33 KSP preconditioned resid norm 6.320133000091e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  34 KSP preconditioned resid norm 6.568217058049e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  35 KSP preconditioned resid norm 6.483075335206e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  36 KSP preconditioned resid norm 6.419074566626e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  37 KSP preconditioned resid norm 6.372749647101e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  38 KSP preconditioned resid norm 5.920214853455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  39 KSP preconditioned resid norm 5.953698988377e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  40 KSP preconditioned resid norm 4.009279521077e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  41 KSP preconditioned resid norm 8.407438130288e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  42 KSP preconditioned resid norm 1.924008529878e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  43 KSP preconditioned resid norm 9.126618449455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  44 KSP preconditioned resid norm 2.747853629308e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  45 KSP preconditioned resid norm 2.556706051040e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  46 KSP preconditioned resid norm 2.427212844835e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  47 KSP preconditioned resid norm 7.630151877379e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  48 KSP preconditioned resid norm 5.895961768741e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  49 KSP preconditioned resid norm 2.271378954392e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  50 KSP preconditioned resid norm 1.779670755839e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  51 KSP preconditioned resid norm 1.488459722777e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  52 KSP preconditioned resid norm 1.479802491212e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  53 KSP preconditioned resid norm 1.316523287251e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  54 KSP preconditioned resid norm 1.347849424457e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  55 KSP preconditioned resid norm 6.739405576032e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  56 KSP preconditioned resid norm 6.699633313335e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  57 KSP preconditioned resid norm 8.064741830609e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  58 KSP preconditioned resid norm 6.744985187452e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  59 KSP preconditioned resid norm 6.981071339163e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  60 KSP preconditioned resid norm 4.410819986572e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  61 KSP preconditioned resid norm 4.062281042354e-02 true resid norm -nan ||r(i)||/||b|| -nan
> Linear solve converged due to CONVERGED_RTOL iterations 61
> Solver converged within 61 iterations. Elapsed time: 0.117009
> Newton iteration: 0 - L2 Position Norm: INF - L2 Pressure Norm: INF
> Memory used by each processor: 47.843750 Mb
>
> Could you please check if the issue can be fixed the same way as you did for
> the GMRES family solvers? Thanks in advance,
>
> Kind regards,
>
> Giovane
>
> On Tue, Mar 8, 2022 at 01:05, Barry Smith <[email protected]> wrote:
>
> I ran with -info and get repeated
>
> MatPivotCheck_none(): Detected zero pivot in factorization in row 2547 value 0. tolerance 2.22045e-14
>
> after the first linear solve failure. The values are always slightly
> different. My conclusion is that from this point on the default factorization
> is truly failing each time, which is why it is always switching the linear
> solver.
>
> Barry
>
>
>> On Mar 7, 2022, at 7:01 PM, Giovane Avancini <[email protected]> wrote:
>>
>> Sorry, I forgot to attach the file.
>>
>> On Mon, Mar 7, 2022 at 21:01, Giovane Avancini <[email protected]> wrote:
>> Thanks Barry! I included the piece of code you sent and now it seems to be
>> working pretty well. It has completed all the 5000 time steps and the solver
>> is indeed triggering the failure when a NaN/Inf is found.
>>
>> I just noticed a strange behaviour in my code after the patch that was not
>> happening before, so I was wondering if it could be related to the way you
>> fixed the bug or if it is a coincidence. Please find attached the log file.
>>
>> At time step 913, the first failure occurs, and it doesn't print the norms of
>> iteration 0, for instance (before, even when the PC ended up failing during
>> the first KSP iteration, the norms were plotted, indicating the NaN). OK,
>> maybe now it verifies that a NaN appeared before the norms are actually
>> computed.
>>
>> What is strange to me is that, after the first failure, all the remaining
>> calls to FGMRES have failed as well, which is unlikely to be the case in my
>> view. Would it be possible that some error flags of FGMRES are not being
>> reset from one call to another? So after the first iteration of step 913,
>> FGMRES is being called with an error flag already set to true?
>>
>> Anyway, I really appreciate your efforts in finding the bug and trying to
>> help me, thank you very much!
>>
>> On Mon, Mar 7, 2022 at 18:08, Barry Smith <[email protected]> wrote:
>>
>> The fix for the problem Giovane encountered is in
>> https://gitlab.com/petsc/petsc/-/merge_requests/4934
>>
>>
>>> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <[email protected]> wrote:
>>>
>>> Sorry for my late reply Barry,
>>>
>>> Sure I can share the code with you, but unfortunately I don't know how to
>>> make docker images. If you don't mind, you can clone the code from github
>>> through this link: [email protected]:giavancini/runPFEM.git
>>> It can be easily compiled with cmake, and you can see the dependencies in
>>> README.md. Please let me know if you need any other information.
>>>
>>> Kind regards,
>>>
>>> Giovane
>>>
>>> On Fri, Feb 25, 2022 at 18:22, Barry Smith <[email protected]> wrote:
>>>
>>> Hmm, it is going to be tricky to debug why the Inf/NaN is not
>>> found when it should be.
>>>
>>> In a debugger you can catch/trap floating point exceptions (how to do
>>> this depends on your debugger) and then step through the code after that to
>>> see why PETSc KSP is not properly noting the Inf/NaN and returning. This
>>> may be cumbersome to do if you don't know PETSc well. Is your code easy to
>>> build, and would you be willing to share it with me so I can run it and
>>> debug directly? If you know how to make docker images or something you
>>> might be able to give it to me easily.
>>>
>>> Barry
>>>
>>>
>>>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <[email protected]> wrote:
>>>>
>>>> Mark, Matthew and Barry,
>>>>
>>>> Thank you all for the quick responses.
>>>>
>>>> Others might have a better idea, but you could run with '-info :ksp' and
>>>> see if you see any messages like "Linear solver has created a not a number
>>>> (NaN) as the residual norm, declaring divergence \n"
>>>> You could also run with -log_trace and see if it is using
>>>> KSPConvergedDefault. I'm not sure if this is the method used given your
>>>> parameters, but I think it is.
>>>> Mark, I ran with both options. I didn't get any messages like "linear
>>>> solver has created a not a number..." when using -info :ksp. When turning
>>>> on -log_trace, I could verify that it is using KSPConvergedDefault but
>>>> what does it mean exactly? When FGMRES converges with the true residual
>>>> being NaN, I get the following message: [0] KSPConvergedDefault(): Linear
>>>> solver has converged. Residual norm 8.897908325511e-05 is less than
>>>> relative tolerance 1.000000000000e-08 times initial right hand side norm
>>>> 1.466597558465e+04 at iteration 53. No information about NaN whatsoever.
>>>>
>>>> We check for NaN or Inf, for example, in KSPCheckDot().
>>>> If you have the KSP set to error
>>>> (https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html)
>>>> then we throw an error, but the return codes do not seem to be checked in
>>>> your implementation. If not, then we set the flag for divergence.
>>>> Matthew, I do not check the return code in this case because I don't want
>>>> PETSc to stop if an error occurs during the solving step. I just want to
>>>> know that it didn't converge and treat this error inside my code. The
>>>> problem is that the flag for divergence is not always being set when
>>>> FGMRES is not converging. I was just wondering why it was set during time
>>>> step 921 and why not for time step 922 as well.
>>>>
>>>> Thanks for the complete report. It looks like we may be missing a check in
>>>> our FGMRES implementation, which allows the iteration to continue after a
>>>> NaN/Inf.
>>>>
>>>> I will explain how we handle the checking and then attach a patch that
>>>> you can apply to see if it resolves the problem. Whenever our KSP solvers
>>>> compute a norm we check after that calculation to verify that the norm is
>>>> not an Inf or NaN. This is an inexpensive global check across all MPI
>>>> ranks because immediately after the norm computation all ranks that share
>>>> the KSP have the same value. If the norm is an Inf or NaN we
>>>> "short-circuit" the KSP solve and return immediately with an appropriate
>>>> not converged code. A quick eye-ball inspection of the FGMRES code found a
>>>> missing check.
>>>>
>>>> You can apply the attached patch file in the PETSC_DIR with
>>>>
>>>>   patch -p1 < fgmres.patch
>>>>   make libs
>>>>
>>>> then rerun your code and see if it now handles the Inf/NaN correctly. If
>>>> so we'll patch our release branch with the fix.
>>>> Thank you for checking this, Barry. I applied the patch exactly the way
>>>> you instructed; however, the problem is still happening. Is there a way to
>>>> check if the patch was in fact applied? You can see in the attached
>>>> screenshot the terminal information.
>>>>
>>>> Kind regards,
>>>>
>>>> Giovane
>>>>
>>>> On Fri, Feb 25, 2022 at 13:48, Barry Smith <[email protected]> wrote:
>>>>
>>>> Giovane,
>>>>
>>>> Thanks for the complete report. It looks like we may be missing a
>>>> check in our FGMRES implementation, which allows the iteration to continue
>>>> after a NaN/Inf.
>>>>
>>>> I will explain how we handle the checking and then attach a patch that
>>>> you can apply to see if it resolves the problem. Whenever our KSP solvers
>>>> compute a norm we check after that calculation to verify that the norm is
>>>> not an Inf or NaN. This is an inexpensive global check across all MPI
>>>> ranks because immediately after the norm computation all ranks that share
>>>> the KSP have the same value. If the norm is an Inf or NaN we
>>>> "short-circuit" the KSP solve and return immediately with an appropriate
>>>> not converged code. A quick eye-ball inspection of the FGMRES code found a
>>>> missing check.
>>>>
>>>> You can apply the attached patch file in the PETSC_DIR with
>>>>
>>>>   patch -p1 < fgmres.patch
>>>>   make libs
>>>>
>>>> then rerun your code and see if it now handles the Inf/NaN correctly. If
>>>> so we'll patch our release branch with the fix.
>>>>
>>>> Barry
>>>>
>>>>
>>>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users <[email protected]> wrote:
>>>>>
>>>>> Dear PETSc users,
>>>>>
>>>>> I'm working on an in-house code that solves the Navier-Stokes equation in
>>>>> a Lagrangian fashion for free surface flows.
>>>>> Because of the large
>>>>> distortions and pressure gradients, it is quite common to encounter some
>>>>> issues with iterative solvers for some time steps, and because of that, I
>>>>> implemented a function that changes the solver type based on the flag
>>>>> KSPConvergedReason. If this flag is negative after a call to KSPSolve, I
>>>>> solve the same linear system again using a direct method.
>>>>>
>>>>> The problem is that, sometimes, KSP keeps converging even though the
>>>>> residual is NaN, and because of that, I'm not able to identify the
>>>>> problem and change the solver, which leads to a solution vector equal to
>>>>> INF, and obviously the code ends up crashing. Is it normal to observe this
>>>>> kind of behaviour?
>>>>>
>>>>> Please find attached the log produced with the options
>>>>> -ksp_monitor_lg_residualnorm -ksp_log -ksp_view
>>>>> -ksp_monitor_true_residual -ksp_converged_reason and the function that
>>>>> changes the solver. I'm currently using FGMRES and a BJACOBI preconditioner
>>>>> with LU for each block. The problem still happens with ILU, for example.
>>>>> We can see in the log file that for time step 921, the true residual
>>>>> is NaN and within just one iteration, the solver fails and it gives the
>>>>> reason DIVERGED_PC_FAILED. I simply changed the solver to MUMPS and it
>>>>> converged for that time step. However, when solving time step 922 we can
>>>>> see that FGMRES converges while the true residual is NaN. Why is that
>>>>> possible? I would appreciate it if someone could clarify this issue to me.
>>>>>
>>>>> Kind regards,
>>>>> Giovane
>>>>>
>>>>> --
>>>>> Giovane Avancini
>>>>> PhD researcher in Structural Engineering - School of Engineering of São
>>>>> Carlos, USP
>>>>> <function.txt><log.txt>
>>>>
>>>> <log.txt><patch.png>
>>
>> <log.txt>
