Re: [petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Barry Smith Fri, 01 Apr 2022 15:27:37 -0700

  I'll take a look at it this weekend. The computed preconditioned residual 
norm is a real number so I am not sure where PETSc will be able to detect the 
problem appropriately before it is too late.
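
A minimal sketch in plain Python (not PETSc code) of why this case is hard to catch from the monitored norm alone. The two preconditioned norms are copied from the monitor output quoted below; the tolerance of 1e-6 and the use of the iteration-0 norm as the reference are assumptions, chosen only because they are consistent with that output. The helper name converged_rtol is hypothetical.

```python
import math


def converged_rtol(rnorm, rnorm0, rtol):
    """Relative-tolerance test applied to a single monitored norm."""
    return rnorm <= rtol * rnorm0


# Values copied from the monitor output quoted in this thread.
precond_norm0 = 4.087133454416e+04   # preconditioned norm, iteration 0
precond_norm61 = 4.062281042354e-02  # preconditioned norm, iteration 61
true_norm61 = float("nan")           # true residual norm, printed as -nan

RTOL = 1e-6  # assumption: not stated anywhere in the thread

# A NaN/Inf check on the preconditioned norm sees an ordinary number ...
print("preconditioned norm finite:", math.isfinite(precond_norm61))
# ... and the tolerance test on that same norm is satisfied.
print("rtol test passes:", converged_rtol(precond_norm61, precond_norm0, RTOL))
# Only the true residual norm carries the NaN.
print("true norm finite:", math.isfinite(true_norm61))
```

Both the finiteness check and the tolerance test pass on the preconditioned norm, so only a check on the true residual or on the solution itself would flag this solve.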



> On Apr 1, 2022, at 6:14 PM, Giovane Avancini <[email protected]> wrote:
> 
> Hi Barry, it's me again.
> 
> Sorry to bother you with this issue, but the problem is still happening, now 
> when using KSPIBCGS. As you can see below, even when a NaN pops up in the 
> residual, the solver still converges to an INF solution.
> 
> ----------------------- TIME STEP = 3318, time = 0.663600 -----------------------
> 
> Mesh Regenerated. Elapsed time: 0.018536
> Isolated nodes: 14
> Assemble Linear System. Elapsed time: 0.030077
>   0 KSP preconditioned resid norm 4.087133454416e+04 true resid norm -nan ||r(i)||/||b|| -nan
>   1 KSP preconditioned resid norm 8.670288259109e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   2 KSP preconditioned resid norm 4.875596419197e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   3 KSP preconditioned resid norm 1.226640070761e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   4 KSP preconditioned resid norm 7.121904546851e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   5 KSP preconditioned resid norm 5.990560906831e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   6 KSP preconditioned resid norm 4.256157374933e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   7 KSP preconditioned resid norm 3.274351035311e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   8 KSP preconditioned resid norm 2.436138522439e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   9 KSP preconditioned resid norm 1.268089193578e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  10 KSP preconditioned resid norm 1.093950736015e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  11 KSP preconditioned resid norm 9.950531836062e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  12 KSP preconditioned resid norm 1.066841140901e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  13 KSP preconditioned resid norm 1.003475554456e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  14 KSP preconditioned resid norm 1.073513486989e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  15 KSP preconditioned resid norm 8.724609972930e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  16 KSP preconditioned resid norm 1.445166180332e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  17 KSP preconditioned resid norm 3.767376396291e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  18 KSP preconditioned resid norm 7.597770355737e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  19 KSP preconditioned resid norm 3.208030402538e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  20 KSP preconditioned resid norm 3.477715841173e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  21 KSP preconditioned resid norm 2.880337856055e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  22 KSP preconditioned resid norm 2.730108581171e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  23 KSP preconditioned resid norm 2.111131168298e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  24 KSP preconditioned resid norm 1.635560497545e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  25 KSP preconditioned resid norm 1.550914551701e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  26 KSP preconditioned resid norm 1.409066040669e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  27 KSP preconditioned resid norm 1.032086999081e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  28 KSP preconditioned resid norm 1.111168488798e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  29 KSP preconditioned resid norm 9.898696915473e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  30 KSP preconditioned resid norm 1.234283818664e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  31 KSP preconditioned resid norm 2.735222111838e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  32 KSP preconditioned resid norm 6.431272223321e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  33 KSP preconditioned resid norm 6.320133000091e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  34 KSP preconditioned resid norm 6.568217058049e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  35 KSP preconditioned resid norm 6.483075335206e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  36 KSP preconditioned resid norm 6.419074566626e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  37 KSP preconditioned resid norm 6.372749647101e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  38 KSP preconditioned resid norm 5.920214853455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  39 KSP preconditioned resid norm 5.953698988377e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  40 KSP preconditioned resid norm 4.009279521077e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  41 KSP preconditioned resid norm 8.407438130288e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  42 KSP preconditioned resid norm 1.924008529878e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  43 KSP preconditioned resid norm 9.126618449455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  44 KSP preconditioned resid norm 2.747853629308e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  45 KSP preconditioned resid norm 2.556706051040e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  46 KSP preconditioned resid norm 2.427212844835e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  47 KSP preconditioned resid norm 7.630151877379e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  48 KSP preconditioned resid norm 5.895961768741e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  49 KSP preconditioned resid norm 2.271378954392e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  50 KSP preconditioned resid norm 1.779670755839e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  51 KSP preconditioned resid norm 1.488459722777e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  52 KSP preconditioned resid norm 1.479802491212e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  53 KSP preconditioned resid norm 1.316523287251e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  54 KSP preconditioned resid norm 1.347849424457e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  55 KSP preconditioned resid norm 6.739405576032e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  56 KSP preconditioned resid norm 6.699633313335e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  57 KSP preconditioned resid norm 8.064741830609e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  58 KSP preconditioned resid norm 6.744985187452e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  59 KSP preconditioned resid norm 6.981071339163e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  60 KSP preconditioned resid norm 4.410819986572e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  61 KSP preconditioned resid norm 4.062281042354e-02 true resid norm -nan ||r(i)||/||b|| -nan
> Linear solve converged due to CONVERGED_RTOL iterations 61
> Solver converged within 61 iterations. Elapsed time: 0.117009
> Newton iteration: 0 - L2 Position Norm: INF - L2 Pressure Norm: INF
> Memory used by each processor: 47.843750 Mb
> 
> Could you please check if the issue can be fixed the same way as you did for 
> the GMRES family solvers? Thanks in advance,
> 
> Kind regards,
> 
> Giovane
> 
> On Tue, Mar 8, 2022 at 01:05, Barry Smith <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>   I ran with -info and get repeated 
> 
> MatPivotCheck_none(): Detected zero pivot in factorization in row 2547 value 
> 0. tolerance 2.22045e-14
> 
> after the first linear solve failure. The values are always slightly 
> different. My conclusion is that from this point on the default factorization 
> is truly failing each time which is why it is always switching the linear 
> solver. 
> 
>   Barry
> 
> 
>> On Mar 7, 2022, at 7:01 PM, Giovane Avancini <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Sorry, I forgot to attach the file.
>> 
>> On Mon, Mar 7, 2022 at 21:01, Giovane Avancini <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Thanks Barry! I included the piece of code you sent and now it seems to be 
>> working pretty well. It has completed all the 5000 time steps and the solver 
>> is indeed triggering the failure when a NaN/Inf is found.
>> 
>> I just noticed a strange behaviour in my code after the patch that was not 
>> happening before, so I was wondering whether it is related to the way you 
>> fixed the bug or is a coincidence. Please find the log file attached.
>> 
>> At time step 913 the first failure occurs, and the norms of iteration 0 are 
>> not printed (before, even when the PC ended up failing during the first KSP 
>> iteration, the norms were printed, showing the NaN). Maybe it now detects 
>> the NaN before the norms are actually computed.
>> 
>> What is strange to me is that, after the first failure, all the remaining 
>> calls to FGMRES fail as well, which seems unlikely to me. Could some error 
>> flag of FGMRES not be reset from one call to the next, so that after the 
>> first iteration of step 913 FGMRES is called with an error flag already set?
>> 
>> Anyway, I really appreciate your efforts in finding the bug and trying to 
>> help me, thank you very much!
>> 
>> On Mon, Mar 7, 2022 at 18:08, Barry Smith <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>>    The fix for the problem Giovane encountered is in 
>> https://gitlab.com/petsc/petsc/-/merge_requests/4934 
>> <https://gitlab.com/petsc/petsc/-/merge_requests/4934>
>> 
>> 
>>> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Sorry for my late reply Barry,
>>> 
>>> Sure I can share the code with you, but unfortunately I don't know how to 
>>> make docker images. If you don't mind, you can clone the code from github 
>>> through this link: [email protected] 
>>> <mailto:[email protected]>:giavancini/runPFEM.git
>>> It can be easily compiled with cmake, and you can see the dependencies in 
>>> README.md. Please let me know if you need any other information.
>>> 
>>> Kind regards,
>>> 
>>> Giovane
>>> 
>>> On Fri, Feb 25, 2022 at 18:22, Barry Smith <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>>      Hmm, it is going to be tricky to debug why the Inf/NaN is not found 
>>> when it should be. 
>>> 
>>>      In a debugger you can catch/trap floating point exceptions (how to do 
>>> this depends on your debugger) and then step through the code after that to 
>>> see why PETSc KSP is not properly noting the Inf/NaN and returning. This 
>>> may be cumbersome to do if you don't know PETSc well. Is your code easy to 
>>> build? Would you be willing to share it with me so I can run it and debug 
>>> directly? If you know how to make docker images or something you might be 
>>> able to give it to me easily.
>>> 
>>>   Barry
>>> 
>>> 
>>>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Mark, Matthew and Barry,
>>>> 
>>>> Thank you all for the quick responses.
>>>> 
>>>>> Others might have a better idea, but you could run with '-info :ksp' and 
>>>>> see if you see any messages like "Linear solver has created a not a number 
>>>>> (NaN) as the residual norm, declaring divergence \n"
>>>>> You could also run with -log_trace and see if it is using 
>>>>> KSPConvergedDefault. I'm not sure if this is the method used given your 
>>>>> parameters, but I think it is.
>>>> Mark, I ran with both options. I didn't get any messages like "linear 
>>>> solver has created a not a number..." when using -info :ksp. When turning 
>>>> on -log_trace, I could verify that it is using KSPConvergedDefault but 
>>>> what does it mean exactly? When FGMRES converges with the true residual 
>>>> being NaN, I get the following message: [0] KSPConvergedDefault(): Linear 
>>>> solver has converged. Residual norm 8.897908325511e-05 is less than 
>>>> relative tolerance 1.000000000000e-08 times initial right hand side norm 
>>>> 1.466597558465e+04 at iteration 53. No information about NaN whatsoever.
>>>> 
>>>>> We check for NaN or Inf, for example, in KSPCheckDot(). If you have the 
>>>>> KSP set to error 
>>>>> (https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html)
>>>>> then we throw an error, but the return codes do not seem to be checked in 
>>>>> your implementation. If not, then we set the flag for divergence.
>>>> Matthew, I do not check the return code in this case because I don't want 
>>>> PETSc to stop if an error occurs during the solving step. I just want to 
>>>> know that it didn't converge and treat this error inside my code. The 
>>>> problem is that the flag for divergence is not always being set when 
>>>> FGMRES is not converging. I was just wondering why it was set during time 
>>>> step 921 and why not for time step 922 as well.
>>>> 
>>>>> Thanks for the complete report. It looks like we may be missing a check in 
>>>>> our FGMRES implementation that allows the iteration to continue after a 
>>>>> NaN/Inf. 
>>>>> 
>>>>>     I will explain how we handle the checking and then attach a patch that 
>>>>> you can apply to see if it resolves the problem.  Whenever our KSP solvers 
>>>>> compute a norm we check after that calculation to verify that the norm is 
>>>>> not an Inf or NaN. This is an inexpensive global check across all MPI 
>>>>> ranks because immediately after the norm computation all ranks that share 
>>>>> the KSP have the same value. If the norm is an Inf or NaN we 
>>>>> "short-circuit" the KSP solve and return immediately with an appropriate 
>>>>> not converged code. A quick eye-ball inspection of the FGMRES code found a 
>>>>> missing check. 
>>>>> 
>>>>>    You can apply the attached patch file in the PETSC_DIR with 
>>>>> 
>>>>> patch -p1 < fgmres.patch
>>>>> make libs
>>>>> 
>>>>> then rerun your code and see if it now handles the Inf/NaN correctly. If 
>>>>> so we'll patch our release branch with the fix.
>>>> Thank you for checking this, Barry. I applied the patch exactly the way 
>>>> you instructed, however, the problem is still happening. Is there a way to 
>>>> check if the patch was in fact applied? You can see in the attached 
>>>> screenshot the terminal information.
>>>> 
>>>> Kind regards,
>>>> 
>>>> Giovane
>>>> 
>>>> On Fri, Feb 25, 2022 at 13:48, Barry Smith <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>>   Giovane,
>>>> 
>>>>     Thanks for the complete report. It looks like we may be missing a 
>>>> check in our FGMRES implementation that allows the iteration to continue 
>>>> after a NaN/Inf. 
>>>> 
>>>>     I will explain how we handle the checking and then attach a patch that 
>>>> you can apply to see if it resolves the problem.  Whenever our KSP solvers 
>>>> compute a norm we check after that calculation to verify that the norm is 
>>>> not an Inf or NaN. This is an inexpensive global check across all MPI 
>>>> ranks because immediately after the norm computation all ranks that share 
>>>> the KSP have the same value. If the norm is an Inf or NaN we 
>>>> "short-circuit" the KSP solve and return immediately with an appropriate 
>>>> not converged code. A quick eye-ball inspection of the FGMRES code found a 
>>>> missing check. 
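
The check-after-every-norm pattern described above can be sketched as follows. This is an illustration in plain Python on a small dense system, not the PETSc implementation: the damped Richardson iteration, the function name and the reason strings are all assumptions made for the example.

```python
import math


def richardson(A, b, omega, rtol=1e-8, max_it=100):
    """Damped Richardson iteration x <- x + omega * (b - A x).

    Right after every residual-norm computation the norm is tested for
    NaN/Inf; if it is not finite the solve returns at once with a
    "diverged" reason instead of continuing to iterate.
    """
    n = len(b)
    x = [0.0] * n
    bnorm = math.sqrt(sum(v * v for v in b))
    for it in range(max_it):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        rnorm = math.sqrt(sum(v * v for v in r))
        if not math.isfinite(rnorm):      # the short-circuit check
            return x, it, "diverged_nan_or_inf"
        if rnorm <= rtol * bnorm:
            return x, it, "converged_rtol"
        x = [x[i] + omega * r[i] for i in range(n)]
    return x, max_it, "diverged_max_it"


# Healthy system: converges normally.
x_ok, its_ok, reason_ok = richardson([[2.0, 0.0], [0.0, 2.0]], [2.0, 4.0], 0.5)
# One NaN entry in the operator: caught at the first norm computation.
x_bad, its_bad, reason_bad = richardson(
    [[float("nan"), 0.0], [0.0, 2.0]], [2.0, 4.0], 0.5
)
print(reason_ok, its_ok, x_ok)
print(reason_bad, its_bad)
```

The check costs one comparison per iteration because the norm is already available; a norm that is computed somewhere without such a check is exactly the kind of gap the paragraph above refers to.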
>>>> 
>>>>    You can apply the attached patch file in the PETSC_DIR with 
>>>> 
>>>> patch -p1 < fgmres.patch
>>>> make libs
>>>> 
>>>> then rerun your code and see if it now handles the Inf/NaN correctly. If 
>>>> so we'll patch our release branch with the fix.
>>>> 
>>>>   Barry
>>>> 
>>>> 
>>>> 
>>>>> Giovane
>>>>   
>>>> 
>>>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users 
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Dear PETSc users,
>>>>> 
>>>>> I'm working on an in-house code that solves the Navier-Stokes equations 
>>>>> in a Lagrangian fashion for free surface flows. Because of the large 
>>>>> distortions and pressure gradients, it is quite common to encounter some 
>>>>> issues with iterative solvers for some time steps, and because of that, I 
>>>>> implemented a function that changes the solver type based on the flag 
>>>>> KSPConvergedReason. If this flag is negative after a call to KSPSolve, I 
>>>>> solve the same linear system again using a direct method.
>>>>> 
>>>>> The problem is that, sometimes, KSP keeps converging even though the 
>>>>> residual is NaN, and because of that, I'm not able to identify the 
>>>>> problem and change the solver, which leads to a solution vector equal to 
>>>>> INF, and the code ends up crashing. Is it normal to observe this kind of 
>>>>> behaviour?
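
The fallback strategy described above, with one extra guard for the failure mode reported here, might look roughly like this. It is a sketch in plain Python, not the attached function: both solver callables are hypothetical stand-ins that return (x, reason), with a negative reason meaning "did not converge" as for KSPConvergedReason, and the reason values used in the toy solvers are arbitrary.

```python
import math


def solve_with_fallback(iterative_solve, direct_solve, A, b):
    """Try the iterative solver first and fall back to the direct one.

    The fallback is triggered by a negative reason, and additionally by any
    non-finite entry in the returned solution, so that a solve reported as
    converged with an Inf/NaN result is still retried.
    """
    x, reason = iterative_solve(A, b)
    failed = reason < 0 or not all(math.isfinite(v) for v in x)
    if failed:
        x, reason = direct_solve(A, b)
        return x, reason, "direct"
    return x, reason, "iterative"


# Toy stand-ins (hypothetical): the "iterative" solver reports success
# but returns an all-Inf solution, as in the reported time step.
def fake_iterative(A, b):
    return [float("inf")] * len(b), 2


def fake_direct(A, b):
    # Handles diagonal systems only, which is enough for the illustration.
    return [b[i] / A[i][i] for i in range(len(b))], 4


x, reason, used = solve_with_fallback(
    fake_iterative, fake_direct, [[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0]
)
print(used, x)
```

Scanning the solution adds one pass over the vector per solve, which is small compared with the solve itself, and it does not depend on the solver having detected the NaN internally.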
>>>>> 
>>>>> Please find attached the log produced with the options 
>>>>> -ksp_monitor_lg_residualnorm -ksp_log -ksp_view 
>>>>> -ksp_monitor_true_residual -ksp_converged_reason and the function that 
>>>>> changes the solver. I'm currently using FGMRES and BJACOBI preconditioner 
>>>>> with LU for each block. The problem still happens with ILU for example. 
>>>>> We can see in the log file that for the time step 921, the true residual 
>>>>> is NaN and within just one iteration, the solver fails and it gives the 
>>>>> reason DIVERGED_PC_FAILED. I simply changed the solver to MUMPS and it 
>>>>> converged for that time step. However, when solving time step 922 we can 
>>>>> see that FGMRES converges while the true residual is NaN. Why is that 
>>>>> possible? I would appreciate it if someone could clarify this issue to me.
>>>>> 
>>>>> Kind regards,
>>>>> Giovane
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Giovane Avancini
>>>>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São 
>>>>> Carlos, USP
>>>>> 
>>>>> PhD researcher in Structural Engineering - School of Engineering of São 
>>>>> Carlos. USP
>>>>> <function.txt><log.txt>
>>>> 
>>>> 
>>>> 
>>>> <log.txt><patch.png>
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
>> <log.txt>
> 
> 
> 
