Bishesh Khanal <[email protected]> writes:

> I tried running on the cluster with one core per node on 4 nodes, and I
> got the following errors (note: using valgrind, and the cluster's openmpi)
> at the very end, after the many usual "conditional jump ..." errors, which
> might be interesting:
>
> mpiexec: killing job...
>
> mpiexec: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [0]PETSC ERROR:
> ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
> batch system) has told this process to end

Memory corruption generally results in SIGSEGV, whereas signal 15 here is
SIGTERM, which is usually sent by the batch system, so I suspect this is
still either a memory issue or some other resource limit.  How much
memory is available on these compute nodes?  Do turn off Valgrind for
this run; it takes a lot of memory.
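One quick way to check is to print each rank's resident memory just before
the big allocation or assembly.  A minimal sketch (assuming your PETSc has
PetscMemoryGetCurrentUsage; the number you see should stay well under the
node's RAM):

  #include <petscsys.h>

  int main(int argc, char **argv)
  {
    PetscLogDouble mem;
    PetscMPIInt    rank;

    PetscInitialize(&argc, &argv, NULL, NULL);
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    /* Resident set size of this process, in bytes */
    PetscMemoryGetCurrentUsage(&mem);
    PetscPrintf(PETSC_COMM_SELF, "[%d] %g MB resident\n",
                rank, mem / 1048576.0);

    PetscFinalize();
    return 0;
  }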

> Does that mean it is crashing near MatSetValues_MPIAIJ?

Possibly, but the batch system could be killing the program for other reasons.
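If it does turn out to be in assembly, the usual way MatSetValues_MPIAIJ
eats memory is insufficient preallocation, which forces repeated mallocs
and copies as rows grow.  A minimal sketch of preallocating an MPIAIJ
matrix before any MatSetValues calls (the local row count and per-row
nonzero bounds below are placeholders, not taken from your code):

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat      A;
    PetscInt nlocal = 1000;   /* hypothetical local row count */

    PetscInitialize(&argc, &argv, NULL, NULL);

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);
    MatSetType(A, MATMPIAIJ);
    /* Upper bounds per row: 7 nonzeros in the diagonal block, 3 in the
       off-diagonal block -- adjust these to your stencil. */
    MatMPIAIJSetPreallocation(A, 7, NULL, 3, NULL);

    /* ... MatSetValues loop, then MatAssemblyBegin/End ... */

    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }

Running with -info and grepping the output for "malloc" around assembly
should tell you whether the preallocation was sufficient.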
