The fix for the problem Giovane encountered is in https://gitlab.com/petsc/petsc/-/merge_requests/4934
> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <[email protected]> wrote:
>
> Sorry for my late reply, Barry.
>
> Sure, I can share the code with you, but unfortunately I don't know how to
> make Docker images. If you don't mind, you can clone the code from GitHub
> at [email protected]:giavancini/runPFEM.git. It can be easily compiled with
> CMake, and you can see the dependencies in README.md. Please let me know
> if you need any other information.
>
> Kind regards,
>
> Giovane
>
> On Fri, Feb 25, 2022, at 6:22 PM, Barry Smith <[email protected]> wrote:
>
>   Hmm, this is going to be tricky to debug why the Inf/NaN is not found
> when it should be.
>
>   In a debugger you can catch/trap floating-point exceptions (how to do
> this depends on your debugger) and then step through the code after that
> to see why PETSc KSP is not properly noting the Inf/NaN and returning.
> This may be cumbersome to do if you don't know PETSc well. Is your code
> easy to build? Would you be willing to share it with me so I can run it
> and debug it directly? If you know how to make Docker images or something
> similar, you might be able to give it to me easily.
>
>   Barry
>
>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <[email protected]> wrote:
>>
>> Mark, Matthew and Barry,
>>
>> Thank you all for the quick responses.
>>
>> Others might have a better idea, but you could run with '-info :ksp' and
>> see if you see any messages like "Linear solver has created a not a
>> number (NaN) as the residual norm, declaring divergence \n". You could
>> also run with -log_trace and see if it is using KSPConvergedDefault. I'm
>> not sure if this is the method used given your parameters, but I think
>> it is.
>>
>> Mark, I ran with both options. I didn't get any messages like "linear
>> solver has created a not a number..." when using -info :ksp.
>> When turning on -log_trace, I could verify that it is using
>> KSPConvergedDefault, but what does it mean exactly? When FGMRES converges
>> with the true residual being NaN, I get the following message:
>>
>>   [0] KSPConvergedDefault(): Linear solver has converged. Residual norm
>>   8.897908325511e-05 is less than relative tolerance 1.000000000000e-08
>>   times initial right hand side norm 1.466597558465e+04 at iteration 53.
>>
>> No information about NaN whatsoever.
>>
>> We check for NaN or Inf, for example, in KSPCheckDot(). If you have the
>> KSP set to error
>> (https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html)
>> then we throw an error, but the return codes do not seem to be checked in
>> your implementation. If not, then we set the flag for divergence.
>>
>> Matthew, I do not check the return code in this case because I don't want
>> PETSc to stop if an error occurs during the solving step. I just want to
>> know that it didn't converge and treat this error inside my code. The
>> problem is that the flag for divergence is not always being set when
>> FGMRES is not converging. I was just wondering why it was set during time
>> step 921 and not for time step 922 as well.
>>
>> Thanks for the complete report. It looks like we may be missing a check
>> in our FGMRES implementation that allows the iteration to continue after
>> a NaN/Inf.
>>
>> I will explain how we handle the checking and then attach a patch that
>> you can apply to see if it resolves the problem. Whenever our KSP solvers
>> compute a norm, we check after that calculation to verify that the norm
>> is not an Inf or NaN. This is an inexpensive global check across all MPI
>> ranks because immediately after the norm computation all ranks that share
>> the KSP have the same value. If the norm is an Inf or NaN, we
>> "short-circuit" the KSP solve and return immediately with an appropriate
>> not-converged code. A quick eyeball inspection of the FGMRES code found a
>> missing check.
>>
>> You can apply the attached patch file in the PETSC_DIR with
>>
>>   patch -p1 < fgmres.patch
>>   make libs
>>
>> then rerun your code and see if it now handles the Inf/NaN correctly. If
>> so, we'll patch our release branch with the fix.
>>
>> Thank you for checking this, Barry. I applied the patch exactly the way
>> you instructed; however, the problem is still happening. Is there a way
>> to check if the patch was in fact applied? You can see the terminal
>> information in the attached screenshot.
>>
>> Kind regards,
>>
>> Giovane
>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users
>>> <[email protected]> wrote:
>>>
>>> Dear PETSc users,
>>>
>>> I'm working on an in-house code that solves the Navier-Stokes equations
>>> in a Lagrangian fashion for free-surface flows. Because of the large
>>> distortions and pressure gradients, it is quite common to encounter
>>> some issues with iterative solvers for some time steps, and because of
>>> that I implemented a function that changes the solver type based on the
>>> flag KSPConvergedReason. If this flag is negative after a call to
>>> KSPSolve, I solve the same linear system again using a direct method.
>>>
>>> The problem is that, sometimes, KSP keeps converging even though the
>>> residual is NaN, and because of that I'm not able to identify the
>>> problem and change the solver, which leads to a solution vector equal
>>> to Inf, and obviously the code ends up crashing. Is it normal to
>>> observe this kind of behaviour?
>>>
>>> Please find attached the log produced with the options
>>> -ksp_monitor_lg_residualnorm -ksp_log -ksp_view
>>> -ksp_monitor_true_residual -ksp_converged_reason, and the function that
>>> changes the solver. I'm currently using FGMRES and a BJACOBI
>>> preconditioner with LU for each block. The problem still happens with
>>> ILU, for example. We can see in the log file that for time step 921 the
>>> true residual is NaN, and within just one iteration the solver fails
>>> and gives the reason DIVERGED_PC_FAILED. I simply changed the solver to
>>> MUMPS and it converged for that time step. However, when solving time
>>> step 922 we can see that FGMRES converges while the true residual is
>>> NaN. Why is that possible? I would appreciate it if someone could
>>> clarify this issue for me.
>>>
>>> Kind regards,
>>> Giovane
>>>
>>> --
>>> Giovane Avancini
>>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São Carlos, USP
>>>
>>> PhD researcher in Structural Engineering - School of Engineering of São Carlos, USP
>>>
>>> <function.txt><log.txt>
>>
>> <log.txt><patch.png>
