Bishesh Khanal <[email protected]> writes: >> I tried running on the cluster with one core per node with 4 nodes and I > got the following errors (note: using valgrind, and openmpi of the cluster) > at the very end after the many usual "unconditional jump ... errors" which > might be interesting > > mpiexec: killing job... > > mpiexec: abort is already in progress...hit ctrl-c again to forcibly > terminate > > -------------------------------------------------------------------------- > MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD > with errorcode 59. > > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. > You may or may not see output from other processes, depending on > exactly when Open MPI kills them. > -------------------------------------------------------------------------- > [0]PETSC ERROR: > ------------------------------------------------------------------------ > [0]PETSC ERROR: Caught signal number 15 Terminate: Somet process (or the > batch system) has told this process to end
Memory corruption generally results in SIGSEGV, so I suspect this is still either a memory issue or some other resource issue. How much memory is available on these compute nodes? Do turn off Valgrind for this run; it takes a lot of memory.

> Does it mean it is crashing near MatSetValues_MPIAIJ ?

Possibly, but it could be killing the program for other reasons.
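
If the extra memory is indeed coming from matrix assembly, missing preallocation is the usual culprit: when the nonzero pattern was not preallocated, MatSetValues on an MPIAIJ matrix falls back to repeated mallocs during assembly, and that can exhaust a node. A minimal sketch of preallocated MPIAIJ assembly (the global size, the 5/2 per-row nonzero estimates, and the diagonal-only insertion loop below are illustrative assumptions, not taken from the original program; error checking omitted for brevity):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A;
  PetscInt rstart, rend, i;

  PetscInitialize(&argc, &argv, NULL, NULL);

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 1000000, 1000000); /* assumed global size */
  MatSetType(A, MATMPIAIJ);
  /* Preallocate: at most 5 nonzeros per row in the diagonal block and 2 in
     the off-diagonal block (assumed stencil).  Without this call, each
     MatSetValues that outgrows the current storage triggers a malloc+copy. */
  MatMPIAIJSetPreallocation(A, 5, NULL, 2, NULL);

  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    PetscScalar v = 1.0;
    MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES); /* stand-in for the real stencil entries */
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

Running with -info makes MatAssemblyEnd() report the number of mallocs incurred during MatSetValues(); a nonzero count there is a sign the preallocation is wrong.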
