Re: [petsc-users] [KSP] PETSc not reporting a KSP fail when true residual is NaN

Giovane Avancini via petsc-users Fri, 01 Apr 2022 15:36:21 -0700

Thanks a lot. If you need more information and/or my code again, please let
me know.


On Fri, Apr 1, 2022 at 19:27, Barry Smith <[email protected]> wrote:

>
>   I'll take a look at it this weekend. The computed preconditioned
> residual norm is a real number so I am not sure where PETSc will be able to
> detect the problem appropriately before it is too late.
>
>
> On Apr 1, 2022, at 6:14 PM, Giovane Avancini <[email protected]> wrote:
>
> Hi Barry, it's me again.
>
> Sorry to bother you with this issue, but the problem is still happening,
> now when using KSPIBCGS. As you can see below, even when a NaN pops up in
> the residual, the solver still converges to an INF solution.
>
> ----------------------- TIME STEP = 3318, time = 0.663600 -----------------------
>
> Mesh Regenerated. Elapsed time: 0.018536
> Isolated nodes: 14
> Assemble Linear System. Elapsed time: 0.030077
>   0 KSP preconditioned resid norm 4.087133454416e+04 true resid norm -nan ||r(i)||/||b|| -nan
>   1 KSP preconditioned resid norm 8.670288259109e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   2 KSP preconditioned resid norm 4.875596419197e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   3 KSP preconditioned resid norm 1.226640070761e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   4 KSP preconditioned resid norm 7.121904546851e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   5 KSP preconditioned resid norm 5.990560906831e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   6 KSP preconditioned resid norm 4.256157374933e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   7 KSP preconditioned resid norm 3.274351035311e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   8 KSP preconditioned resid norm 2.436138522439e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   9 KSP preconditioned resid norm 1.268089193578e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  10 KSP preconditioned resid norm 1.093950736015e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  11 KSP preconditioned resid norm 9.950531836062e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  12 KSP preconditioned resid norm 1.066841140901e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  13 KSP preconditioned resid norm 1.003475554456e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  14 KSP preconditioned resid norm 1.073513486989e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  15 KSP preconditioned resid norm 8.724609972930e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  16 KSP preconditioned resid norm 1.445166180332e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  17 KSP preconditioned resid norm 3.767376396291e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  18 KSP preconditioned resid norm 7.597770355737e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  19 KSP preconditioned resid norm 3.208030402538e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  20 KSP preconditioned resid norm 3.477715841173e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  21 KSP preconditioned resid norm 2.880337856055e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  22 KSP preconditioned resid norm 2.730108581171e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  23 KSP preconditioned resid norm 2.111131168298e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  24 KSP preconditioned resid norm 1.635560497545e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  25 KSP preconditioned resid norm 1.550914551701e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  26 KSP preconditioned resid norm 1.409066040669e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  27 KSP preconditioned resid norm 1.032086999081e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  28 KSP preconditioned resid norm 1.111168488798e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  29 KSP preconditioned resid norm 9.898696915473e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  30 KSP preconditioned resid norm 1.234283818664e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  31 KSP preconditioned resid norm 2.735222111838e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  32 KSP preconditioned resid norm 6.431272223321e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  33 KSP preconditioned resid norm 6.320133000091e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  34 KSP preconditioned resid norm 6.568217058049e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  35 KSP preconditioned resid norm 6.483075335206e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  36 KSP preconditioned resid norm 6.419074566626e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  37 KSP preconditioned resid norm 6.372749647101e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  38 KSP preconditioned resid norm 5.920214853455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  39 KSP preconditioned resid norm 5.953698988377e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  40 KSP preconditioned resid norm 4.009279521077e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  41 KSP preconditioned resid norm 8.407438130288e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  42 KSP preconditioned resid norm 1.924008529878e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  43 KSP preconditioned resid norm 9.126618449455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  44 KSP preconditioned resid norm 2.747853629308e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  45 KSP preconditioned resid norm 2.556706051040e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  46 KSP preconditioned resid norm 2.427212844835e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  47 KSP preconditioned resid norm 7.630151877379e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  48 KSP preconditioned resid norm 5.895961768741e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  49 KSP preconditioned resid norm 2.271378954392e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  50 KSP preconditioned resid norm 1.779670755839e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  51 KSP preconditioned resid norm 1.488459722777e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  52 KSP preconditioned resid norm 1.479802491212e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  53 KSP preconditioned resid norm 1.316523287251e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  54 KSP preconditioned resid norm 1.347849424457e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  55 KSP preconditioned resid norm 6.739405576032e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  56 KSP preconditioned resid norm 6.699633313335e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  57 KSP preconditioned resid norm 8.064741830609e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  58 KSP preconditioned resid norm 6.744985187452e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  59 KSP preconditioned resid norm 6.981071339163e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  60 KSP preconditioned resid norm 4.410819986572e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  61 KSP preconditioned resid norm 4.062281042354e-02 true resid norm -nan ||r(i)||/||b|| -nan
> Linear solve converged due to CONVERGED_RTOL iterations 61
> Solver converged within 61 iterations. Elapsed time: 0.117009
> Newton iteration: 0 - L2 Position Norm: INF - L2 Pressure Norm: INF
> Memory used by each processor: 47.843750 Mb
>
> Could you please check if the issue can be fixed the same way as you did
> for the GMRES family solvers? Thanks in advance,
>
> Kind regards,
>
> Giovane
>
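In the log above the convergence decision appears to follow the preconditioned residual norm, which stays finite, so a caller can add its own guard before accepting the result. The following is a minimal sketch in plain C with no PETSc calls. `all_finite` and `accept_solution` are hypothetical helper names, and the sign convention (a positive reason means converged) is taken from the KSPConvergedReason usage described in this thread.

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if every entry of x is finite (neither NaN nor +/-Inf), else 0. */
int all_finite(const double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (!isfinite(x[i])) return 0;
    }
    return 1;
}

/* Hypothetical caller-side guard: accept a solve only if the solver reported
   convergence (reason > 0) AND the returned solution contains no NaN/Inf. */
int accept_solution(int reason, const double *x, size_t n)
{
    return reason > 0 && all_finite(x, n);
}
```

With such a guard, a "converged" status combined with an INF solution, as in the time step shown above, would be treated as a failure by the calling code.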
> On Tue, Mar 8, 2022 at 01:05, Barry Smith <[email protected]> wrote:
>
>>
>>   I ran with -info and get repeated
>>
>> MatPivotCheck_none(): Detected zero pivot in factorization in row 2547
>> value 0. tolerance 2.22045e-14
>>
>> after the first linear solve failure. The values are always slightly
>> different. My conclusion is that from this point on the default
>> factorization is truly failing each time which is why it is always
>> switching the linear solver.
>>
>>   Barry
>>
>>
>> On Mar 7, 2022, at 7:01 PM, Giovane Avancini <[email protected]> wrote:
>>
>> Sorry, I forgot to attach the file.
>>
>> On Mon, Mar 7, 2022 at 21:01, Giovane Avancini <[email protected]> wrote:
>>
>>> Thanks Barry! I included the piece of code you sent and now it seems to
>>> be working pretty well. It has completed all the 5000 time steps and the
>>> solver is indeed triggering the failure when a NaN/Inf is found.
>>>
>>> I just noticed a strange behaviour in my code after the patch that was
>>> not happening before, so I was wondering whether it is related to the way
>>> you fixed the bug or is just a coincidence. Please find attached the log
>>> file.
>>>
>>> At time step 913, the first failure occurs, and it doesn't print the
>>> norms of iteration 0, for instance (before, even when the PC ended up
>>> failing during the first KSP iteration, the norms were printed, showing
>>> the NaN). Maybe it now detects the NaN before the norms are actually
>>> computed.
>>>
>>> What is strange to me is that, after the first failure, all the
>>> remaining calls to FGMRES have failed as well, which seems unlikely
>>> in my view. Could it be that some FGMRES error flags are not being
>>> reset from one call to the next, so that after the first iteration of
>>> step 913, FGMRES is called with an error flag already set to true?
>>>
>>> Anyway, I really appreciate your efforts in finding the bug and trying
>>> to help me, thank you very much!
>>>
>>> On Mon, Mar 7, 2022 at 18:08, Barry Smith <[email protected]> wrote:
>>>
>>>>
>>>>    The fix for the problem Giovane encountered is in
>>>> https://gitlab.com/petsc/petsc/-/merge_requests/4934
>>>>
>>>>
>>>> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <[email protected]>
>>>> wrote:
>>>>
>>>> Sorry for my late reply Barry,
>>>>
>>>> Sure I can share the code with you, but unfortunately I don't know how
>>>> to make docker images. If you don't mind, you can clone the code from
>>>> github through this link: [email protected]:giavancini/runPFEM.git
>>>> It can be easily compiled with cmake, and you can see the dependencies
>>>> in README.md. Please let me know if you need any other information.
>>>>
>>>> Kind regards,
>>>>
>>>> Giovane
>>>>
>>>> On Fri, Feb 25, 2022 at 18:22, Barry Smith <[email protected]> wrote:
>>>>
>>>>>
>>>>>      Hmm, it is going to be tricky to debug why the Inf/NaN is
>>>>> not found when it should be.
>>>>>
>>>>>      In a debugger you can catch/trap floating point exceptions (how
>>>>> to do this depends on your debugger) and then step through the code after
>>>>> that to see why PETSc KSP is not properly noting the Inf/NaN and
>>>>> returning. This may be cumbersome to do if you don't know PETSc well. Is
>>>>> your code easy to build, and would you be willing to share it with me so I
>>>>> can run it and debug directly? If you know how to make Docker images or
>>>>> something similar, you might be able to give it to me easily.
>>>>>
>>>>>   Barry
>>>>>
>>>>>
>>>>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Mark, Matthew and Barry,
>>>>>
>>>>> Thank you all for the quick responses.
>>>>>
>>>>> Others might have a better idea, but you could run with '-info :ksp'
>>>>> and see if you see any messages like "Linear solver has created a not a
>>>>> number (NaN) as the residual norm, declaring divergence \n"
>>>>> You could also run with -log_trace and see if it is
>>>>> using KSPConvergedDefault. I'm not sure if this is the method used given
>>>>> your parameters, but I think it is.
>>>>>
>>>>> Mark, I ran with both options. I didn't get any messages like "linear
>>>>> solver has created a not a number..." when using -info :ksp. When turning
>>>>> on -log_trace, I could verify that it is using KSPConvergedDefault, but
>>>>> what does that mean exactly? When FGMRES converges with the true residual
>>>>> being NaN, I get the following message: [0] KSPConvergedDefault(): Linear
>>>>> solver has converged. Residual norm 8.897908325511e-05 is less than relative
>>>>> tolerance 1.000000000000e-08 times initial right hand side norm
>>>>> 1.466597558465e+04 at iteration 53. No information about NaN whatsoever.
>>>>>
>>>>> We check for NaN or Inf, for example, in KSPCheckDot(). If you have
>>>>> the KSP set to error (
>>>>> https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html
>>>>> )
>>>>> then we throw an error, but the return codes do not seem to be checked
>>>>> in your implementation. If not, then we set the flag for divergence.
>>>>>
>>>>> Matthew, I do not check the return code in this case because I don't
>>>>> want PETSc to stop if an error occurs during the solving step. I just want
>>>>> to know that it didn't converge and handle this error inside my code. The
>>>>> problem is that the flag for divergence is not always being set when
>>>>> FGMRES is not converging. I was just wondering why it was set during time
>>>>> step 921 but not for time step 922 as well.
>>>>>
>>>>> Thanks for the complete report. It looks like we may be missing a
>>>>> check in our FGMRES implementation that allows the iteration to continue
>>>>> after a NaN/Inf.
>>>>>
>>>>>     I will explain how we handle the checking and then attach a patch
>>>>> that you can apply to see if it resolves the problem.  Whenever our KSP
>>>>> solvers compute a norm, we check after that calculation to verify that
>>>>> the norm is not an Inf or NaN. This is an inexpensive global check across
>>>>> all MPI ranks because immediately after the norm computation all ranks
>>>>> that share the KSP have the same value. If the norm is an Inf or NaN we
>>>>> "short-circuit" the KSP solve and return immediately with an appropriate
>>>>> not-converged code. A quick eyeball inspection of the FGMRES code found a
>>>>> missing check.
>>>>>
>>>>>    You can apply the attached patch file in the PETSC_DIR with
>>>>>
>>>>> patch -p1 < fgmres.patch
>>>>> make libs
>>>>>
>>>>> then rerun your code and see if it now handles the Inf/NaN correctly.
>>>>> If so we'll patch our release branch with the fix.
>>>>>
>>>>> Thank you for checking this, Barry. I applied the patch exactly the
>>>>> way you instructed; however, the problem is still happening. Is there a
>>>>> way to check whether the patch was in fact applied? You can see the
>>>>> terminal output in the attached screenshot.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Giovane
>>>>>
>>>>> On Fri, Feb 25, 2022 at 13:48, Barry Smith <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>   Giovane,
>>>>>>
>>>>>>     Thanks for the complete report. It looks like we may be missing a
>>>>>> check in our FGMRES implementation that allows the iteration to continue
>>>>>> after a NaN/Inf.
>>>>>>
>>>>>>     I will explain how we handle the checking and then attach a patch
>>>>>> that you can apply to see if it resolves the problem.  Whenever our KSP
>>>>>> solvers compute a norm, we check after that calculation to verify that
>>>>>> the norm is not an Inf or NaN. This is an inexpensive global check across
>>>>>> all MPI ranks because immediately after the norm computation all ranks
>>>>>> that share the KSP have the same value. If the norm is an Inf or NaN we
>>>>>> "short-circuit" the KSP solve and return immediately with an appropriate
>>>>>> not-converged code. A quick eyeball inspection of the FGMRES code found a
>>>>>> missing check.
>>>>>>
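The check Barry describes can be sketched in plain C, independent of PETSc. This is only an illustration under stated assumptions: the reason codes, `norm_is_bad`, and `run_iterations` are hypothetical names, and the array of norms stands in for the norm a real solver would compute at each iteration. It is not the PETSc implementation and not the attached patch.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical reason codes; only the sign matters here
   (negative = not converged, positive = converged). */
enum {
    SKETCH_CONVERGED_RTOL    =  1,
    SKETCH_DIVERGED_ITS      = -1,
    SKETCH_DIVERGED_NANORINF = -2
};

/* A norm is unusable if it is NaN or +/-Inf. */
int norm_is_bad(double norm)
{
    return isnan(norm) || isinf(norm);
}

/* norms[i] plays the role of the residual norm computed at iteration i.
   The NaN/Inf test runs right after each norm is "computed", before the
   convergence test, so the loop short-circuits on the first bad value.
   *its receives the index of the last iteration that was examined. */
int run_iterations(const double *norms, size_t n, double rtol, double bnorm,
                   size_t *its)
{
    *its = 0;
    for (size_t i = 0; i < n; ++i) {
        *its = i;
        if (norm_is_bad(norms[i])) return SKETCH_DIVERGED_NANORINF;
        if (norms[i] <= rtol * bnorm) return SKETCH_CONVERGED_RTOL;
    }
    return SKETCH_DIVERGED_ITS;
}
```

The explicit test is needed because a comparison such as `NaN <= tol` is always false: without it, a NaN norm would never satisfy the convergence test, but it would not stop the iteration early either.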
>>>>>>    You can apply the attached patch file in the PETSC_DIR with
>>>>>>
>>>>>> patch -p1 < fgmres.patch
>>>>>> make libs
>>>>>>
>>>>>> then rerun your code and see if it now handles the Inf/NaN correctly.
>>>>>> If so we'll patch our release branch with the fix.
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> Dear PETSc users,
>>>>>>
>>>>>> I'm working on an in-house code that solves the Navier-Stokes equations
>>>>>> in a Lagrangian fashion for free surface flows. Because of the large
>>>>>> distortions and pressure gradients, it is quite common to encounter
>>>>>> issues with iterative solvers at some time steps, so I implemented a
>>>>>> function that changes the solver type based on the KSPConvergedReason
>>>>>> flag. If this flag is negative after a call to KSPSolve, I solve the same
>>>>>> linear system again using a direct method.
>>>>>>
>>>>>> The problem is that, sometimes, KSP keeps converging even though the
>>>>>> residual is NaN, so I'm not able to identify the problem and change the
>>>>>> solver. This leads to a solution vector equal to INF, and the code
>>>>>> ends up crashing. Is it normal to observe this kind of behaviour?
>>>>>>
>>>>>> Please find attached the log produced with the options
>>>>>> -ksp_monitor_lg_residualnorm -ksp_log -ksp_view
>>>>>> -ksp_monitor_true_residual -ksp_converged_reason and the function that
>>>>>> changes the solver. I'm currently using FGMRES and the BJACOBI
>>>>>> preconditioner with LU for each block. The problem still happens with
>>>>>> ILU, for example. We can see in the log file that for time step 921, the
>>>>>> true residual is NaN and, within just one iteration, the solver fails and
>>>>>> gives the reason DIVERGED_PC_FAILED. I simply changed the solver to MUMPS
>>>>>> and it converged for that time step. However, when solving time step 922
>>>>>> we can see that FGMRES converges while the true residual is NaN. Why is
>>>>>> that possible? I would appreciate it if someone could clarify this issue
>>>>>> for me.
>>>>>>
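The fallback strategy described in this message could be organised roughly as follows. This is a PETSc-free sketch in plain C, not the code from the attached function.txt: `solve_fn`, `solve_with_fallback`, and the two stub solvers are hypothetical, and a negative return value plays the role of a negative KSPConvergedReason.

```c
#include <stddef.h>

/* Hypothetical solver callback: returns a reason code,
   > 0 for converged and < 0 for any kind of failure. */
typedef int (*solve_fn)(void *ctx);

/* Try the iterative solver first; if it reports failure, solve the same
   system again with the direct solver. *used_fallback records which path
   produced the returned reason. */
int solve_with_fallback(solve_fn iterative, solve_fn direct, void *ctx,
                        int *used_fallback)
{
    int reason = iterative(ctx);
    *used_fallback = 0;
    if (reason < 0) {
        *used_fallback = 1;
        reason = direct(ctx);
    }
    return reason;
}

/* Stand-in solvers, for illustration only. */
int stub_iterative_fails(void *ctx) { (void)ctx; return -1; }
int stub_direct_ok(void *ctx)       { (void)ctx; return 1; }
```

This scheme only works if the iterative solver reliably returns a negative reason on failure, which is exactly what goes wrong when a solve is reported as converged despite a NaN true residual.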
>>>>>> Kind regards,
>>>>>> Giovane
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Giovane Avancini
>>>>>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São
>>>>>> Carlos, USP
>>>>>>
>>>>>> PhD researcher in Structural Engineering - School of Engineering of
>>>>>> São Carlos. USP
>>>>>> <function.txt><log.txt>
>>>>>>
>>>>> <log.txt><patch.png>
>> <log.txt>

-- 
Giovane Avancini
Doutorando em Engenharia de Estruturas - Escola de Engenharia de São
Carlos, USP

PhD researcher in Structural Engineering - School of Engineering of São
Carlos. USP
