Thanks a lot. If you need more information and/or my code again, please let me know.
On Fri, Apr 1, 2022 at 19:27, Barry Smith <[email protected]> wrote:

>
> I'll take a look at it this weekend. The computed preconditioned
> residual norm is a real number so I am not sure where PETSc will be able
> to detect the problem appropriately before it is too late.
>
>
> On Apr 1, 2022, at 6:14 PM, Giovane Avancini <[email protected]> wrote:
>
> Hi Barry, it's me again.
>
> Sorry to bother you with this issue, but the problem is still happening,
> now when using KSPIBCGS. As you can see below, even when a NaN pops up in
> the residual, the solver still converges to an INF solution.
>
> ----------------------- TIME STEP = 3318, time = 0.663600 -----------------------
>
> Mesh Regenerated. Elapsed time: 0.018536
> Isolated nodes: 14
> Assemble Linear System. Elapsed time: 0.030077
>   0 KSP preconditioned resid norm 4.087133454416e+04 true resid norm -nan ||r(i)||/||b|| -nan
>   1 KSP preconditioned resid norm 8.670288259109e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   2 KSP preconditioned resid norm 4.875596419197e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   3 KSP preconditioned resid norm 1.226640070761e+03 true resid norm -nan ||r(i)||/||b|| -nan
>   4 KSP preconditioned resid norm 7.121904546851e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   5 KSP preconditioned resid norm 5.990560906831e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   6 KSP preconditioned resid norm 4.256157374933e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   7 KSP preconditioned resid norm 3.274351035311e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   8 KSP preconditioned resid norm 2.436138522439e+02 true resid norm -nan ||r(i)||/||b|| -nan
>   9 KSP preconditioned resid norm 1.268089193578e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  10 KSP preconditioned resid norm 1.093950736015e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  11 KSP preconditioned resid norm 9.950531836062e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  12 KSP preconditioned resid norm 1.066841140901e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  13 KSP preconditioned resid norm 1.003475554456e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  14 KSP preconditioned resid norm 1.073513486989e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  15 KSP preconditioned resid norm 8.724609972930e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  16 KSP preconditioned resid norm 1.445166180332e+02 true resid norm -nan ||r(i)||/||b|| -nan
>  17 KSP preconditioned resid norm 3.767376396291e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  18 KSP preconditioned resid norm 7.597770355737e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  19 KSP preconditioned resid norm 3.208030402538e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  20 KSP preconditioned resid norm 3.477715841173e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  21 KSP preconditioned resid norm 2.880337856055e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  22 KSP preconditioned resid norm 2.730108581171e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  23 KSP preconditioned resid norm 2.111131168298e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  24 KSP preconditioned resid norm 1.635560497545e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  25 KSP preconditioned resid norm 1.550914551701e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  26 KSP preconditioned resid norm 1.409066040669e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  27 KSP preconditioned resid norm 1.032086999081e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  28 KSP preconditioned resid norm 1.111168488798e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  29 KSP preconditioned resid norm 9.898696915473e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  30 KSP preconditioned resid norm 1.234283818664e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  31 KSP preconditioned resid norm 2.735222111838e+01 true resid norm -nan ||r(i)||/||b|| -nan
>  32 KSP preconditioned resid norm 6.431272223321e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  33 KSP preconditioned resid norm 6.320133000091e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  34 KSP preconditioned resid norm 6.568217058049e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  35 KSP preconditioned resid norm 6.483075335206e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  36 KSP preconditioned resid norm 6.419074566626e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  37 KSP preconditioned resid norm 6.372749647101e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  38 KSP preconditioned resid norm 5.920214853455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  39 KSP preconditioned resid norm 5.953698988377e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  40 KSP preconditioned resid norm 4.009279521077e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  41 KSP preconditioned resid norm 8.407438130288e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  42 KSP preconditioned resid norm 1.924008529878e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  43 KSP preconditioned resid norm 9.126618449455e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  44 KSP preconditioned resid norm 2.747853629308e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  45 KSP preconditioned resid norm 2.556706051040e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  46 KSP preconditioned resid norm 2.427212844835e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  47 KSP preconditioned resid norm 7.630151877379e+00 true resid norm -nan ||r(i)||/||b|| -nan
>  48 KSP preconditioned resid norm 5.895961768741e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  49 KSP preconditioned resid norm 2.271378954392e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  50 KSP preconditioned resid norm 1.779670755839e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  51 KSP preconditioned resid norm 1.488459722777e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  52 KSP preconditioned resid norm 1.479802491212e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  53 KSP preconditioned resid norm 1.316523287251e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  54 KSP preconditioned resid norm 1.347849424457e-01 true resid norm -nan ||r(i)||/||b|| -nan
>  55 KSP preconditioned resid norm 6.739405576032e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  56 KSP preconditioned resid norm 6.699633313335e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  57 KSP preconditioned resid norm 8.064741830609e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  58 KSP preconditioned resid norm 6.744985187452e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  59 KSP preconditioned resid norm 6.981071339163e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  60 KSP preconditioned resid norm 4.410819986572e-02 true resid norm -nan ||r(i)||/||b|| -nan
>  61 KSP preconditioned resid norm 4.062281042354e-02 true resid norm -nan ||r(i)||/||b|| -nan
> Linear solve converged due to CONVERGED_RTOL iterations 61
> Solver converged within 61 iterations. Elapsed time: 0.117009
> Newton iteration: 0 - L2 Position Norm: INF - L2 Pressure Norm: INF
> Memory used by each processor: 47.843750 Mb
>
> Could you please check if the issue can be fixed the same way as you did
> for the GMRES family solvers? Thanks in advance,
>
> Kind regards,
>
> Giovane
>
> On Tue, Mar 8, 2022 at 01:05, Barry Smith <[email protected]> wrote:
>
>>
>> I ran with -info and get repeated
>>
>> MatPivotCheck_none(): Detected zero pivot in factorization in row 2547
>> value 0. tolerance 2.22045e-14
>>
>> after the first linear solve failure. The values are always slightly
>> different. My conclusion is that from this point on the default
>> factorization is truly failing each time, which is why it is always
>> switching the linear solver.
>>
>> Barry
>>
>>
>> On Mar 7, 2022, at 7:01 PM, Giovane Avancini <[email protected]> wrote:
>>
>> Sorry, I forgot to attach the file.
>>
>> On Mon, Mar 7, 2022 at 21:01, Giovane Avancini <[email protected]> wrote:
>>
>>> Thanks Barry!
>>> I included the piece of code you sent and now it seems to be working
>>> pretty well. It has completed all the 5000 time steps and the solver is
>>> indeed triggering the failure when a NaN/Inf is found.
>>>
>>> I just noticed a strange behaviour in my code after the patch that was
>>> not happening before, so I was wondering if it could be related to the
>>> way you fixed the bug or if it is a coincidence. Please find attached
>>> the log file.
>>>
>>> At time step 913, the first failure occurs, and it doesn't print the
>>> norms of iteration 0, for instance (before, even when the PC ended up
>>> failing during the first KSP iteration, the norms were plotted,
>>> indicating the NaN). OK, maybe now it verifies that a NaN appeared
>>> before the norms are actually computed.
>>>
>>> What is strange to me is that, after the first failure, all the
>>> remaining calls to FGMRES have failed as well, which is unlikely to be
>>> the case in my view. Would it be possible that some error flags of
>>> FGMRES are not being reset from one call to another? So after the first
>>> iteration of step 913, FGMRES is being called with an error flag
>>> already set to true?
>>>
>>> Anyway, I really appreciate your efforts in finding the bug and trying
>>> to help me, thank you very much!
>>>
>>> On Mon, Mar 7, 2022 at 18:08, Barry Smith <[email protected]> wrote:
>>>
>>>>
>>>> The fix for the problem Giovane encountered is in
>>>> https://gitlab.com/petsc/petsc/-/merge_requests/4934
>>>>
>>>>
>>>> On Mar 3, 2022, at 11:24 AM, Giovane Avancini <[email protected]>
>>>> wrote:
>>>>
>>>> Sorry for my late reply Barry,
>>>>
>>>> Sure I can share the code with you, but unfortunately I don't know how
>>>> to make Docker images. If you don't mind, you can clone the code from
>>>> GitHub through this link: [email protected]:giavancini/runPFEM.git
>>>> It can be easily compiled with CMake, and you can see the dependencies
>>>> in README.md.
>>>> Please let me know if you need any other information.
>>>>
>>>> Kind regards,
>>>>
>>>> Giovane
>>>>
>>>> On Fri, Feb 25, 2022 at 18:22, Barry Smith <[email protected]> wrote:
>>>>
>>>>>
>>>>> Hmm, this is going to be tricky: we need to debug why the Inf/NaN is
>>>>> not found when it should be.
>>>>>
>>>>> In a debugger you can catch/trap floating point exceptions (how to do
>>>>> this depends on your debugger) and then step through the code after
>>>>> that to see why PETSc KSP is not properly noting the Inf/NaN and
>>>>> returning. This may be cumbersome to do if you don't know PETSc well.
>>>>> Is your code easy to build? Would you be willing to share it with me
>>>>> so I can run it and debug directly? If you know how to make Docker
>>>>> images or something you might be able to give it to me easily.
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>> On Feb 25, 2022, at 3:59 PM, Giovane Avancini <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Mark, Matthew and Barry,
>>>>>
>>>>> Thank you all for the quick responses.
>>>>>
>>>>> Others might have a better idea, but you could run with '-info :ksp'
>>>>> and see if you see any messages like "Linear solver has created a not
>>>>> a number (NaN) as the residual norm, declaring divergence \n"
>>>>> You could also run with -log_trace and see if it is using
>>>>> KSPConvergedDefault. I'm not sure if this is the method used given
>>>>> your parameters, but I think it is.
>>>>>
>>>>> Mark, I ran with both options. I didn't get any messages like "linear
>>>>> solver has created a not a number..." when using -info :ksp. When
>>>>> turning on -log_trace, I could verify that it is using
>>>>> KSPConvergedDefault, but what does it mean exactly? When FGMRES
>>>>> converges with the true residual being NaN, I get the following
>>>>> message: [0] KSPConvergedDefault(): Linear solver has converged.
>>>>> Residual norm 8.897908325511e-05 is less than relative tolerance
>>>>> 1.000000000000e-08 times initial right hand side norm
>>>>> 1.466597558465e+04 at iteration 53. No information about NaN
>>>>> whatsoever.
>>>>>
>>>>> We check for NaN or Inf, for example, in KSPCheckDot(). If you have
>>>>> the KSP set to error
>>>>> (https://petsc.org/main/docs/manualpages/KSP/KSPSetErrorIfNotConverged.html)
>>>>> then we throw an error, but the return codes do not seem to be checked
>>>>> in your implementation. If not, then we set the flag for divergence.
>>>>>
>>>>> Matthew, I do not check the return code in this case because I don't
>>>>> want PETSc to stop if an error occurs during the solving step. I just
>>>>> want to know that it didn't converge and treat this error inside my
>>>>> code. The problem is that the flag for divergence is not always being
>>>>> set when FGMRES is not converging. I was just wondering why it was set
>>>>> during time step 921 and why not for time step 922 as well.
>>>>>
>>>>> Thanks for the complete report. It looks like we may be missing a
>>>>> check in our FGMRES implementation that allows the iteration to
>>>>> continue after a NaN/Inf.
>>>>>
>>>>> I will explain how we handle the checking and then attach a patch
>>>>> that you can apply to see if it resolves the problem. Whenever our KSP
>>>>> solvers compute a norm we check after that calculation to verify that
>>>>> the norm is not an Inf or NaN. This is an inexpensive global check
>>>>> across all MPI ranks because immediately after the norm computation
>>>>> all ranks that share the KSP have the same value. If the norm is an
>>>>> Inf or NaN we "short-circuit" the KSP solve and return immediately
>>>>> with an appropriate not-converged code. A quick eye-ball inspection of
>>>>> the FGMRES code found a missing check.
>>>>>
>>>>> You can apply the attached patch file in the PETSC_DIR with
>>>>>
>>>>> patch -p1 < fgmres.patch
>>>>> make libs
>>>>>
>>>>> then rerun your code and see if it now handles the Inf/NaN correctly.
>>>>> If so we'll patch our release branch with the fix.
>>>>>
>>>>> Thank you for checking this, Barry. I applied the patch exactly the
>>>>> way you instructed; however, the problem is still happening. Is there
>>>>> a way to check if the patch was in fact applied? You can see the
>>>>> terminal information in the attached screenshot.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Giovane
>>>>>
>>>>> On Fri, Feb 25, 2022 at 13:48, Barry Smith <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> Giovane,
>>>>>>
>>>>>> Thanks for the complete report. It looks like we may be missing a
>>>>>> check in our FGMRES implementation that allows the iteration to
>>>>>> continue after a NaN/Inf.
>>>>>>
>>>>>> I will explain how we handle the checking and then attach a patch
>>>>>> that you can apply to see if it resolves the problem. Whenever our
>>>>>> KSP solvers compute a norm we check after that calculation to verify
>>>>>> that the norm is not an Inf or NaN. This is an inexpensive global
>>>>>> check across all MPI ranks because immediately after the norm
>>>>>> computation all ranks that share the KSP have the same value. If the
>>>>>> norm is an Inf or NaN we "short-circuit" the KSP solve and return
>>>>>> immediately with an appropriate not-converged code. A quick eye-ball
>>>>>> inspection of the FGMRES code found a missing check.
>>>>>>
>>>>>> You can apply the attached patch file in the PETSC_DIR with
>>>>>>
>>>>>> patch -p1 < fgmres.patch
>>>>>> make libs
>>>>>>
>>>>>> then rerun your code and see if it now handles the Inf/NaN correctly.
>>>>>> If so we'll patch our release branch with the fix.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>> Giovane
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Feb 25, 2022, at 11:06 AM, Giovane Avancini via petsc-users
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Dear PETSc users,
>>>>>>
>>>>>> I'm working on an in-house code that solves the Navier-Stokes
>>>>>> equations in a Lagrangian fashion for free surface flows. Because of
>>>>>> the large distortions and pressure gradients, it is quite common to
>>>>>> encounter some issues with iterative solvers for some time steps, and
>>>>>> because of that, I implemented a function that changes the solver
>>>>>> type based on the flag KSPConvergedReason. If this flag is negative
>>>>>> after a call to KSPSolve, I solve the same linear system again using
>>>>>> a direct method.
>>>>>>
>>>>>> The problem is that, sometimes, KSP keeps converging even though the
>>>>>> residual is NaN, and because of that, I'm not able to identify the
>>>>>> problem and change the solver, which leads to a solution vector equal
>>>>>> to INF, and obviously the code ends up crashing. Is it normal to
>>>>>> observe this kind of behaviour?
>>>>>>
>>>>>> Please find attached the log produced with the options
>>>>>> -ksp_monitor_lg_residualnorm -ksp_log -ksp_view
>>>>>> -ksp_monitor_true_residual -ksp_converged_reason and the function
>>>>>> that changes the solver. I'm currently using FGMRES and the BJACOBI
>>>>>> preconditioner with LU for each block. The problem still happens with
>>>>>> ILU, for example. We can see in the log file that for time step 921
>>>>>> the true residual is NaN and within just one iteration the solver
>>>>>> fails and gives the reason DIVERGED_PC_FAILED. I simply changed the
>>>>>> solver to MUMPS and it converged for that time step. However, when
>>>>>> solving time step 922 we can see that FGMRES converges while the true
>>>>>> residual is NaN. Why is that possible? I would appreciate it if
>>>>>> someone could clarify this issue to me.
>>>>>>
>>>>>> Kind regards,
>>>>>> Giovane
>>>>>>
>>>>>> --
>>>>>> Giovane Avancini
>>>>>> Doutorando em Engenharia de Estruturas - Escola de Engenharia de São
>>>>>> Carlos, USP
>>>>>>
>>>>>> PhD researcher in Structural Engineering - School of Engineering of
>>>>>> São Carlos, USP
>>>>>> <function.txt><log.txt>
>>>>>
>>>>> <log.txt><patch.png>
>>
>> <log.txt>

--
Giovane Avancini
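
The behaviour discussed in this thread (a solve reported as converged while the true residual is NaN) can be illustrated with a small sketch. This is plain Python, not PETSc code, and it makes no claim about how PETSc implements its checks. The function names and reason labels are hypothetical and chosen only for illustration; the numeric values are the ones quoted from the -info output above. It shows that a relative-tolerance test on one finite norm reports convergence, while a test that first checks every available norm for Inf/NaN reports divergence.

```python
import math


def naive_reason(rnorm, bnorm, rtol=1e-8):
    """Relative-tolerance test on one monitored norm, with no NaN/Inf guard."""
    return "CONVERGED_RTOL" if rnorm <= rtol * bnorm else "ITERATING"


def guarded_reason(rnorm, true_rnorm, bnorm, rtol=1e-8):
    """Same test, but any non-finite norm is reported as divergence first."""
    if not all(math.isfinite(v) for v in (rnorm, true_rnorm, bnorm)):
        return "DIVERGED_NANORINF"
    return naive_reason(rnorm, bnorm, rtol)


nan = float("nan")
# Norms quoted from the KSPConvergedDefault() message in this thread.
rnorm, bnorm = 8.897908325511e-05, 1.466597558465e+04

print(naive_reason(rnorm, bnorm))         # prints CONVERGED_RTOL
print(guarded_reason(rnorm, nan, bnorm))  # prints DIVERGED_NANORINF
# Every comparison with NaN is False, so a NaN norm never satisfies the
# tolerance test either; the naive test simply keeps iterating.
print(naive_reason(nan, bnorm))           # prints ITERATING
```

The guard has to look at the true residual as well: in the logs above the monitored (preconditioned) norm stays finite throughout, so a check on that norm alone would not trigger.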

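The solver-switching strategy described in the original post (retry with a direct method when the converged reason is negative) could be extended with a user-side check on the solution itself, so that an Inf/NaN result triggers the fallback even if the iterative solver reports convergence. The following is a minimal sketch under stated assumptions: `iterative_solve` and `direct_solve` are hypothetical wrappers around the application's own solve calls, not PETSc functions, and the toy solvers exist only to exercise the control flow.

```python
import math


def all_finite(values):
    """True only if every entry is a finite number (no NaN, no +/-Inf)."""
    return all(math.isfinite(v) for v in values)


def solve_with_fallback(iterative_solve, direct_solve, rhs):
    """Try the iterative solver first; fall back to the direct one if needed.

    iterative_solve(rhs) -> (solution, reason): hypothetical wrapper; a
        negative reason means "diverged", as with KSPConvergedReason.
    direct_solve(rhs) -> solution: hypothetical wrapper around a direct solve.
    """
    solution, reason = iterative_solve(rhs)
    # Fall back on a reported failure OR on a non-finite solution vector.
    if reason < 0 or not all_finite(solution):
        return direct_solve(rhs), "direct"
    return solution, "iterative"


# Toy stand-ins for the two solvers, only to exercise the control flow.
def fake_iterative(rhs):
    # Reports convergence (positive reason) but returns an Inf solution.
    return [float("inf")] * len(rhs), 2


def fake_direct(rhs):
    return [0.5 * v for v in rhs]


print(solve_with_fallback(fake_iterative, fake_direct, [2.0, 4.0]))
# prints ([1.0, 2.0], 'direct')
```

Scanning the solution costs one pass over the vector per solve, which is small next to the solve itself. In a parallel code the result would also need to be agreed on by all ranks, which this sketch does not attempt.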