Re: [petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Barry Smith Mon, 07 Mar 2022 13:08:27 -0800

   The fix for the problem Geiovane encountered is in 
https://gitlab.com/petsc/petsc/-/merge_requests/4934 
<https://gitlab.com/petsc/petsc/-/merge_requests/4934>



> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <[email protected]> wrote:
> 
> Sorry for my late reply Barry,
> 
> Sure I can share the code with you, but unfortunately I don't know how to 
> make docker images. If you don't mind, you can clone the code from github 
> through this link: [email protected]:giavancini/runPFEM.git
> It can be easily compiled with cmake, and you can see the dependencies in 
> README.md. Please let me know if you need any other information.
> 
> Kind regards,
> 
> Giovane
> 
> Em sex., 25 de fev. de 2022 às 18:22, Barry Smith <[email protected] 
> <mailto:[email protected]>> escreveu:
> 
>      Hmm, this is going to be tricky to debug why it the Inf/Nan is not found 
> when it should be. 
> 
>      In a debugger you can catch/trap floating point exceptions (how to do 
> this depends on your debugger) and then step through the code after that to 
> see why PETSc KSP is not properly noting the Inf/Nan and returning. This may 
> be cumbersome to do if you don't know PETSc well. Is your code easy to build, 
> would be willing to share it to me so I can run it and debug directly? If you 
> know how to make docker images or something you might be able to give it to 
> me easily.
> 
>   Barry
> 
> 
>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Mark, Matthew and Barry,
>> 
>> Thank you all for the quick responses.
>> 
>> Others might have a better idea, but you could run with '-info :ksp' and see 
>> if you see any messages like "Linear solver has created a not a number (NaN) 
>> as the residual norm, declaring divergence \n"
>> You could also run with -log_trace and see if it is using 
>> KSPConvergedDefault. I'm not sure if this is the method used given your 
>> parameters, but I think it is.
>> Mark, I ran with both options. I didn't get any messages like "linear solver 
>> has created a not a number..." when using -info: ksp. When turning on 
>> -log_trace, I could verify that it is using KSPConvergedDefault but what 
>> does it mean exactly? When FGMRES converges with the true residual being 
>> NaN, I get the following message: [0] KSPConvergedDefault(): Linear solver 
>> has converged. Residual norm 8.897908325511e-05 is less than relative 
>> tolerance 1.000000000000e-08 times initial right hand side norm 
>> 1.466597558465e+04 at iteration 53. No information about NaN whatsoever.
>> 
>> We check for NaN or Inf, for example, in KSPCheckDot(). if you have the KSP 
>> set to error 
>> (https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html 
>> <https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html>)
>> then we throw an error, but the return codes do not seem to be checked in 
>> your implementation. If not, then we set the flag for divergence.
>> Matthew, I do not check the return code in this case because I don't want 
>> PETSc to stop if an error occurs during the solving step. I just want to 
>> know that it didn't converge and treat this error inside my code. The 
>> problem is that the flag for divergence is not always being set when FGMRES 
>> is not converging. I was just wondering why it was set during time step 921 
>> and why not for time step 922 as well.
>> 
>> Thanks for the complete report. It looks like we may be missing a check in 
>> our FGMRES implementation that allows the iteration to continue after a 
>> NaN/Inf. 
>> 
>>     I will explain how we handle the checking and then attach a patch that 
>> you can apply to see if it resolves the problem.  Whenever our KSP solvers 
>> compute a norm we
>> check after that calculation to verify that the norm is not an Inf or Nan. 
>> This is an inexpensive global check across all MPI ranks because immediately 
>> after the norm computation all ranks that share the KSP have the same value. 
>> If the norm is a Inf or Nan we "short-circuit" the KSP solve and return 
>> immediately with an appropriate not converged code. A quick eye-ball 
>> inspection of the FGMRES code found a missing check. 
>> 
>>    You can apply the attached patch file in the PETSC_DIR with 
>> 
>> patch -p1 < fgmres.patch
>> make libs
>> 
>> then rerun your code and see if it now handles the Inf/NaN correctly. If so 
>> we'll patch our release branch with the fix.
>> Thank you for checking this, Barry. I applied the patch exactly the way you 
>> instructed, however, the problem is still happening. Is there a way to check 
>> if the patch was in fact applied? You can see in the attached screenshot the 
>> terminal information.
>> 
>> Kind regards,
>> 
>> Giovane
>> 
>> Em sex., 25 de fev. de 2022 às 13:48, Barry Smith <[email protected] 
>> <mailto:[email protected]>> escreveu:
>> 
>>   Giovane,
>> 
>>     Thanks for the complete report. It looks like we may be missing a check 
>> in our FGMRES implementation that allows the iteration to continue after a 
>> NaN/Inf. 
>> 
>>     I will explain how we handle the checking and then attach a patch that 
>> you can apply to see if it resolves the problem.  Whenever our KSP solvers 
>> compute a norm we
>> check after that calculation to verify that the norm is not an Inf or Nan. 
>> This is an inexpensive global check across all MPI ranks because immediately 
>> after the norm computation all ranks that share the KSP have the same value. 
>> If the norm is a Inf or Nan we "short-circuit" the KSP solve and return 
>> immediately with an appropriate not converged code. A quick eye-ball 
>> inspection of the FGMRES code found a missing check. 
>> 
>>    You can apply the attached patch file in the PETSC_DIR with 
>> 
>> patch -p1 < fgmres.patch
>> make libs
>> 
>> then rerun your code and see if it now handles the Inf/NaN correctly. If so 
>> we'll patch our release branch with the fix.
>> 
>>   Barry
>> 
>> 
>> 
>>> Giovane
>>   
>> 
>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users 
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> 
>>> Dear PETSc users,
>>> 
>>> I'm working on an inhouse code that solves the Navier-Stokes equation in a 
>>> Lagrangian fashion for free surface flows. Because of the large distortions 
>>> and pressure gradients, it is quite common to encounter some issues with 
>>> iterative solvers for some time steps, and because of that, I implemented a 
>>> function that changes the solver type based on the flag KSPConvergedReason. 
>>> If this flag is negative after a call to KSPSolve, I solve the same linear 
>>> system again using a direct method.
>>> 
>>> The problem is that, sometimes, KSP keeps converging even though the 
>>> residual is NaN, and because of that, I'm not able to identify the problem 
>>> and change the solver, which leads to a solution vector equals to INF and 
>>> obviously the code ends up crashing. Is it normal to observe this kind of 
>>> behaviour?
>>> 
>>> Please find attached the log produced with the options 
>>> -ksp_monitor_lg_residualnorm -ksp_log -ksp_view -ksp_monitor_true_residual 
>>> -ksp_converged_reason and the function that changes the solver. I'm 
>>> currently using FGMRES and BJACOBI preconditioner with LU for each block. 
>>> The problem still happens with ILU for example. We can see in the log file 
>>> that for the time step 921, the true residual is NaN and within just one 
>>> iteration, the solver fails and it gives the reason DIVERGED_PC_FAILED. I 
>>> simply changed the solver to MUMPS and it converged for that time step. 
>>> However, when solving time step 922 we can see that FGMRES converges while 
>>> the true residual is NaN. Why is that possible? I would appreciate it if 
>>> someone could clarify this issue to me.
>>> 
>>> Kind regards,
>>> Giovane
>>> 
>>> 
>>> 
>>> -- 
>>> Giovane Avancini
>>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São 
>>> Carlos, USP
>>> 
>>> PhD researcher in Structural Engineering - School of Engineering of São 
>>> Carlos. USP
>>> <function.txt><log.txt>
>> 
>> 
>> 
>> -- 
>> Giovane Avancini
>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São Carlos, 
>> USP
>> 
>> PhD researcher in Structural Engineering - School of Engineering of São 
>> Carlos. USP
>> <log.txt><patch.png>
> 
> 
> 
> -- 
> Giovane Avancini
> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São Carlos, 
> USP
> 
> PhD researcher in Structural Engineering - School of Engineering of São 
> Carlos. USP

Re: [petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Reply via email to