In fact, on my machine the code is compiled with the GNU compilers, and on the
cluster it is compiled with the Intel (2015) compilers. I just ran the program
with "-fp_trap" and got:

   |> Assembling interface problem. Unk # 56
   |> Solving interface problem
  Residual norms for interp_ solve.
  0 KSP Residual norm 3.642615470862e+03
[0]PETSC ERROR: *** unknown floating point error occurred ***
[0]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[0]PETSC ERROR: debugger traps the signal, the exception can be found with
[0]PETSC ERROR: where the result is a bitwise OR of the following flags:
[0]PETSC ERROR: Try option -start_in_debugger
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames
[1]PETSC ERROR: [2]PETSC ERROR: *** unknown floating point error occurred
[3]PETSC ERROR: *** unknown floating point error occurred ***
[3]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[4]PETSC ERROR: *** unknown floating point error occurred ***
[4]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[4]PETSC ERROR: [5]PETSC ERROR: *** unknown floating point error occurred
[5]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[5]PETSC ERROR: debugger traps the signal, the exception can be found with
[5]PETSC ERROR: where the result is a bitwise OR of the following flags:
[6]PETSC ERROR: *** unknown floating point error occurred ***
[6]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[6]PETSC ERROR: debugger traps the signal, the exception can be found with
[6]PETSC ERROR: where the result is a bitwise OR of the following flags:
[7]PETSC ERROR: *** unknown floating point error occurred ***
[7]PETSC ERROR: The specific exception can be determined by running in a
debugger.  When the
[7]PETSC ERROR: debugger traps the signal, the exception can be found with
[7]PETSC ERROR: where the result is a bitwise OR of the following flags:
[7]PETSC ERROR: Try option -start_in_debugger
[7]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] PetscDefaultFPTrap line 355
[0]PETSC ERROR: [0] VecMDot line 1154
[0]PETSC ERROR: [0] KSPGMRESClassicalGramSchmidtOrthogonalization line 44
[0]PETSC ERROR: [0] KSPGMRESCycle line 122
[0]PETSC ERROR: [0] KSPSolve_GMRES line 225
[0]PETSC ERROR: [0] KSPSolve_Private line 590
[0]PETSC ERROR: *** unknown floating point error occurred ***

So it seems that in fact a division by 0 is taking place. I will try to run
this in debug mode.
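
A minimal sketch of why a division by zero could end up as the "Scalar value
must be same on all processes" message from earlier in this thread. The helper
same_on_all_ranks is hypothetical and is only a simplified stand-in for a
cross-rank consistency check, not PETSc's actual implementation; the list of
values plays the role of one scalar per MPI rank. It assumes the chain
division by zero -> Inf -> NaN, and relies on the fact that NaN never compares
equal to itself.

```python
import math

def same_on_all_ranks(values):
    """Hypothetical check: max-reduce (-v, v) over all ranks, then compare."""
    neg_max = max(-v for v in values)  # equals -min(values) for finite input
    pos_max = max(values)
    return -neg_max == pos_max         # true only if min == max

# Python raises ZeroDivisionError for 1.0 / 0.0, so math.inf stands in for the
# Inf that an IEEE 754 division by zero yields in compiled C or Fortran code.
inf_from_division = math.inf
nan_value = inf_from_division - inf_from_division  # Inf - Inf gives NaN

finite_ok = same_on_all_ranks([2.5, 2.5, 2.5, 2.5])  # identical finite values
nan_ok = same_on_all_ranks([nan_value] * 4)          # every rank holds NaN

print(finite_ok, nan_ok)
```

Even when every rank holds the "same" NaN, the comparison fails, which would
be consistent with the error showing up only after a NaN has been produced.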


On Tue, Aug 25, 2020 at 10:23 AM Barry Smith <> wrote:

>   Sounds like it might be a compiler problem generating bad code.
>   On the machine where it fails you can run with -fp_trap to have it error
> out as soon as a Nan or Inf appears. If you can use the debugger on that
> machine you can tell the debugger to catch floating point exceptions and
> see the exact line and the values of variables where a Nan or Inf appears.
>    As Matt conjectured, it is likely there is a divide by zero before PETSc
> detects it, and it may be helpful to find out exactly where that happens.
>   Barry
> On Aug 25, 2020, at 8:03 AM, Alfredo Jaramillo <>
> wrote:
> Yes, Barry, that is correct.
> On Tue, Aug 25, 2020 at 1:02 AM Barry Smith <> wrote:
>>   On one system you get this error, on another system with the identical
>> code and test case you do not get the error?
>>   You get it with three iterative methods but not with MUMPS?
>> Barry
>> On Aug 24, 2020, at 8:35 PM, Alfredo Jaramillo <>
>> wrote:
>> Hello Barry, Matthew, thanks for the replies !
>> Yes, it is our custom code, and it also happens when setting -pc_type
>> bjacobi. Before testing an iterative solver, we were using MUMPS (-ksp_type
>> preonly -pc_type lu -pc_factor_mat_solver_type mumps) without issues.
>> Running ex19 (as "mpirun -n 4 ex19 -da_refine 5") did not produce any
>> problem.
>> To compare with my computer, I was able to reproduce the error with a
>> small case and -pc_type bjacobi. For that particular case, when running
>> on the cluster the error appears at the very last iteration:
>> =====
>> 27 KSP Residual norm 8.230378644666e-06
>> [0]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> [0]PETSC ERROR: Invalid argument
>> [0]PETSC ERROR: Scalar value must be same on all processes, argument # 3
>> ====
>> whereas on my computer no error is raised and the solve converges
>> instead:
>> ====
>> Linear interp_ solve converged due to CONVERGED_RTOL iterations 27
>> ====
>> I will run valgrind to look for possible memory corruption.
>> thank you
>> Alfredo
>> On Mon, Aug 24, 2020 at 9:00 PM Barry Smith <> wrote:
>>>    Oh yes, it could happen with Nan.
>>>    KSPGMRESClassicalGramSchmidtOrthogonalization()
>>> calls KSPCheckDot(ksp,lhh[j]); so it should detect any Nan that appears
>>> and set ksp->convergedreason, but the call to MAXPY() is still made before
>>> returning, hence producing the error message.
>>>    We should short-circuit the orthogonalization as soon as it sees a
>>> Nan/Inf and return immediately for GMRES to clean up and produce a very
>>> useful error message.
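
The short-circuit idea above can be sketched in a few lines. This is only an
illustration under stated assumptions, not PETSc code: checked_mdot is a
hypothetical name, plain Python lists stand in for distributed vectors, and
the function simply stops at the first non-finite dot product instead of
passing it on to a later update step.

```python
import math

def checked_mdot(x, basis):
    """Dot x against each basis vector; stop at the first non-finite result.

    Returns (dots, bad_index), where bad_index is None if all dots are finite.
    """
    dots = []
    for j, v in enumerate(basis):
        d = sum(a * b for a, b in zip(x, v))
        if not math.isfinite(d):
            # Return immediately so the caller can report the failure
            # before the bad coefficient is used anywhere else.
            return dots, j
        dots.append(d)
    return dots, None

# Second basis vector carries a NaN, e.g. from a failed preconditioner apply.
basis = [[1.0, 0.0], [0.0, math.nan], [1.0, 1.0]]
dots, bad = checked_mdot([2.0, 3.0], basis)
print(dots, bad)
```

The caller can then turn a non-None index into a clear "NaN/Inf in
orthogonalization" report rather than a failure in a later consistency check.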
>>>   Alfredo,
>>>     It is also possible that the hypre preconditioners are producing a
>>> Nan because your matrix is too difficult for them to handle, but it would
>>> be odd to happen after many iterations.
>>>    As I suggested before run with -pc_type bjacobi to see if you get the
>>> same problem.
>>>   Barry
>>> On Aug 24, 2020, at 6:38 PM, Matthew Knepley <> wrote:
>>> On Mon, Aug 24, 2020 at 6:27 PM Barry Smith <> wrote:
>>>>    Alfredo,
>>>>       This should never happen. The input to the VecMAXPY in gmres is
>>>> computed via VecMDot which produces the same result on all processes.
>>>>        If you run with -pc_type bjacobi does it also happen?
>>>>        Is this your custom code or does it happen in PETSc examples
>>>> also? Like src/snes/tutorials/ex19 -da_refine 5
>>>>       Could be memory corruption, can you run under valgrind?
>>> Couldn't it happen if something generates a NaN? That also should not
>>> happen, but I was allowing that pilut might do it.
>>>   Thanks,
>>>     Matt
>>>>     Barry
>>>> > On Aug 24, 2020, at 4:05 PM, Alfredo Jaramillo <> wrote:
>>>> >
>>>> > Dear PETSc developers,
>>>> >
>>>> > I'm trying to solve a linear problem with GMRES preconditioned with
>>>> pilut from HYPRE. For this I'm using the options:
>>>> >
>>>> > -ksp_type gmres -pc_type hypre -pc_hypre_type pilut -ksp_monitor
>>>> >
>>>> > If I use a single core, GMRES (+ pilut or euclid) converges. However,
>>>> when using multiple cores the following error appears after some number
>>>> of iterations:
>>>> >
>>>> > [0]PETSC ERROR: Scalar value must be same on all processes, argument
>>>> # 3
>>>> >
>>>> > relative to the function VecMAXPY. I attached a screenshot with more
>>>> detailed output. The same happens when using euclid. Can you please give me
>>>> some insight on this?
>>>> >
>>>> > best regards
>>>> > Alfredo
>>>> > <Screenshot from 2020-08-24 17-57-52.png>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener