In fact, on my machine the code is compiled with GNU compilers, and on the cluster it is compiled with Intel (2015) compilers. I just ran the program with "-fp_trap" and got:
===============================================================
|> Assembling interface problem. Unk # 56
|> Solving interface problem
Residual norms for interp_ solve.
0 KSP Residual norm 3.642615470862e+03
[0]PETSC ERROR: *** unknown floating point error occurred ***
[0]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[0]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3f)
[0]PETSC ERROR: where the result is a bitwise OR of the following flags:
[0]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
[0]PETSC ERROR: Try option -start_in_debugger
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[1]PETSC ERROR: [2]PETSC ERROR: *** unknown floating point error occurred ***
[3]PETSC ERROR: *** unknown floating point error occurred ***
[3]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[4]PETSC ERROR: *** unknown floating point error occurred ***
[4]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[4]PETSC ERROR: [5]PETSC ERROR: *** unknown floating point error occurred ***
[5]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[5]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3f)
[5]PETSC ERROR: where the result is a bitwise OR of the following flags:
[6]PETSC ERROR: *** unknown floating point error occurred ***
[6]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[6]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3f)
[6]PETSC ERROR: where the result is a bitwise OR of the following flags:
[6]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
[7]PETSC ERROR: *** unknown floating point error occurred ***
[7]PETSC ERROR: The specific exception can be determined by running in a debugger. When the
[7]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3f)
[7]PETSC ERROR: where the result is a bitwise OR of the following flags:
[7]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
[7]PETSC ERROR: Try option -start_in_debugger
[7]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: INSTEAD the line number of the start of the function
[0]PETSC ERROR: is given.
[0]PETSC ERROR: [0] PetscDefaultFPTrap line 355 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/sys/error/fp.c
[0]PETSC ERROR: [0] VecMDot line 1154 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: [0] KSPGMRESClassicalGramSchmidtOrthogonalization line 44 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/borthog2.c
[0]PETSC ERROR: [0] KSPGMRESCycle line 122 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
[0]PETSC ERROR: [0] KSPSolve_GMRES line 225 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/impls/gmres/gmres.c
[0]PETSC ERROR: [0] KSPSolve_Private line 590 /mnt/lustre/home/ajaramillo/petsc-3.13.0/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: *** unknown floating point error occurred ***
===============================================================
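For reference, the mask that fetestexcept() reports is a bitwise OR of the FE_* flags listed in the message. A minimal standalone C99 sketch, independent of PETSc and of the code in this thread, that raises a division by zero and then decodes the resulting flags (the helper name report_fp_exceptions is made up for illustration; strictly conforming code would also add "#pragma STDC FENV_ACCESS ON"):

#include <fenv.h>
#include <stdio.h>

/* Print which IEEE exception flags are set in a mask returned by
   fetestexcept(); the flag macros come from <fenv.h>. */
static void report_fp_exceptions(int mask)
{
  if (mask & FE_INVALID)   printf("FE_INVALID (e.g. 0.0/0.0 or sqrt(-1.0))\n");
  if (mask & FE_DIVBYZERO) printf("FE_DIVBYZERO (finite value divided by zero)\n");
  if (mask & FE_OVERFLOW)  printf("FE_OVERFLOW\n");
  if (mask & FE_UNDERFLOW) printf("FE_UNDERFLOW\n");
  if (mask & FE_INEXACT)   printf("FE_INEXACT\n");
}

int main(void)
{
  volatile double zero = 0.0;
  feclearexcept(FE_ALL_EXCEPT);
  volatile double x = 1.0 / zero;  /* sets the divide-by-zero flag */
  (void)x;
  report_fp_exceptions(fetestexcept(FE_ALL_EXCEPT));
  return 0;
}

In this sketch only FE_DIVBYZERO ends up set; FE_INEXACT in particular is raised by almost any rounding operation and is rarely the interesting one.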
So it seems that in fact a division by 0 is taking place. I will try to run this in debug mode.

thanks
Alfredo

On Tue, Aug 25, 2020 at 10:23 AM Barry Smith <bsm...@petsc.dev> wrote:
>
> Sounds like it might be a compiler problem generating bad code.
>
> On the machine where it fails you can run with -fp_trap to have it error
> out as soon as a NaN or Inf appears. If you can use the debugger on that
> machine, you can tell the debugger to catch floating point exceptions and
> see the exact line, and the values of the variables, where a NaN or Inf appears.
>
> As Matt conjectured, it is likely there is a divide by zero before PETSc
> detects it, and it may be helpful to find out exactly where that happens.
>
> Barry
>
>
> On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <ajaramillopa...@gmail.com>
> wrote:
>
> Yes, Barry, that is correct.
>
>
> On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <bsm...@petsc.dev> wrote:
>
>>
>> On one system you get this error, and on another system with the identical
>> code and test case you do not get the error?
>>
>> You get it with three iterative methods but not with MUMPS?
>>
>> Barry
>>
>>
>> On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <ajaramillopa...@gmail.com>
>> wrote:
>>
>> Hello Barry, Matthew, thanks for the replies!
>>
>> Yes, it is our custom code, and it also happens when setting -pc_type
>> bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type
>> preonly -ksp_pc_type lu -pc_factor_mat_solver_type mumps) without issues.
>>
>> Running ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any
>> problem.
>>
>> Trying to reproduce the situation on my computer, I set up a small case
>> with -pc_type bjacobi. For that particular case, when running on the
>> cluster the error appears at the very last iteration:
>>
>> =====
>> 27 KSP Residual norm 8.230378644666e-06
>> [0]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> [0]PETSC ERROR: Invalid argument
>> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>> ====
>>
>> whereas running on my computer the error is not raised and convergence
>> is reached instead:
>>
>> ====
>> Linear interp_ solve converged due to CONVERGED_RTOL iterations 27
>> ====
>>
>> I will run valgrind to check for possible memory corruption.
>>
>> thank you
>> Alfredo
>>
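Besides valgrind, one way to narrow down where a bad value first enters the iteration is to attach a residual monitor that fails loudly at the first NaN or Inf norm. A minimal sketch in the PETSc 3.13 style (the monitor name MonitorDetectNaN is made up for illustration, not code from this thread; GMRES already runs a related internal check via KSPCheckDot, as Barry explains below):

#include <petscksp.h>

/* Custom KSP monitor: stop with a clear message at the first iteration
   whose residual norm is NaN or Inf. */
static PetscErrorCode MonitorDetectNaN(KSP ksp, PetscInt it, PetscReal rnorm, void *ctx)
{
  PetscFunctionBeginUser;
  if (PetscIsInfOrNanReal(rnorm)) {
    SETERRQ1(PetscObjectComm((PetscObject)ksp), PETSC_ERR_FP,
             "Residual norm is NaN or Inf at iteration %D", it);
  }
  PetscFunctionReturn(0);
}

/* Attach it before KSPSolve():
     ierr = KSPMonitorSet(ksp, MonitorDetectNaN, NULL, NULL);CHKERRQ(ierr);  */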
>> On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>
>>> Oh yes, it could happen with NaN.
>>>
>>> KSPGMRESClassicalGramSchmidtOrthogonalization() calls
>>> KSPCheckDot(ksp,lhh[j]); so it should detect any NaN that appears and
>>> set ksp->convergedreason, but the call to VecMAXPY() is still made before
>>> returning, hence producing the error message.
>>>
>>> We should short-circuit the orthogonalization as soon as it sees a NaN/Inf
>>> and return immediately, so that GMRES can clean up and produce a very
>>> useful error message.
>>>
>>> Alfredo,
>>>
>>> It is also possible that the hypre preconditioners are producing a
>>> NaN because your matrix is too difficult for them to handle, but it would
>>> be odd for that to happen after many iterations.
>>>
>>> As I suggested before, run with -pc_type bjacobi to see if you get the
>>> same problem.
>>>
>>> Barry
>>>
>>>
>>> On Aug 24, 2020, at 6:38 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>> On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>>
>>>> Alfredo,
>>>>
>>>> This should never happen. The input to the VecMAXPY in GMRES is
>>>> computed via VecMDot, which produces the same result on all processes.
>>>>
>>>> If you run with -pc_type bjacobi, does it also happen?
>>>>
>>>> Is this your custom code, or does it happen in PETSc examples
>>>> also? Like src/snes/tutorials/ex19 -da_refine 5
>>>>
>>>> Could be memory corruption; can you run under valgrind?
>>>>
>>>
>>> Couldn't it happen if something generates a NaN? That also should not
>>> happen, but I was allowing that pilut might do it.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>>
>>>> Barry
>>>>
>>>>
>>>> > On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <
>>>> ajaramillopa...@gmail.com> wrote:
>>>> >
>>>> > Dear PETSc developers,
>>>> >
>>>> > I'm trying to solve a linear problem with GMRES preconditioned with
>>>> pilut from HYPRE. For this I'm using the options:
>>>> >
>>>> > -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
>>>> >
>>>> > If I use a single core, GMRES (+ pilut or euclid) converges. However,
>>>> when using multiple cores the following error appears after some number of
>>>> iterations:
>>>> >
>>>> > [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>>>> >
>>>> > relative to the function VecMAXPY. I attached a screenshot with more
>>>> detailed output. The same happens when using euclid. Can you please give me
>>>> some insight on this?
>>>> >
>>>> > best regards
>>>> > Alfredo
>>>> > <Screenshot from 2020-08-24 17-57-52.png>
>>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> <http://www.cse.buffalo.edu/~knepley/>
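All of the solver configurations compared in this thread (GMRES with hypre's pilut or euclid, block Jacobi, or a direct solve through MUMPS) can be selected at run time when the KSP is configured from the options database, which is presumably how the custom code switches between them. A minimal sketch of that pattern, assuming a matrix A and vectors b and x assembled elsewhere; the wrapper name SolveWithOptions is hypothetical:

#include <petscksp.h>

/* Solve A x = b with whatever KSP/PC the command line selects, e.g.
     -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
   or -pc_type bjacobi. */
PetscErrorCode SolveWithOptions(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PetscObjectComm((PetscObject)A), &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* reads -ksp_type, -pc_type, ... */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With this pattern, the -pc_type bjacobi and MUMPS comparisons mentioned above require no code changes, only different command-line options.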