Dear Roy,
Still, let me share the stack trace with you.
It works for me now with MPICH, but if I can help improve libMesh's
portability, I would be glad to do so.
Here is the trace from the first MPI process; it looks the same on the second.
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
mlx5_stall_poll_cq (ibcq=0x2476cb0, ne=8, wc=0x7ffd610c0370) at src/cq.c:341
341 src/cq.c: No such file or directory.
in src/cq.c
(gdb) up
#1 mlx5_poll_cq (ibcq=0x2476cb0, ne=8, wc=0x7ffd610c0370) at src/cq.c:515
515 in src/cq.c
(gdb)
#2 0x00002af7b5a2ffc4 in ibv_poll_cq (ep=<value optimized out>)
at /usr/include/infiniband/verbs.h:1245
1245 return cq->context->ops.poll_cq(cq, num_entries, wc);
(gdb)
#3 fi_ibv_rdm_tagged_poll_recv (ep=<value optimized out>)
at prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c:608
608 prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c: No such file or directory.
in prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c
(gdb)
#4 0x00002af7b5a29d65 in fi_ibv_rdm_tagged_cq_readfrom (
cq=<value optimized out>, buf=<value optimized out>,
count=<value optimized out>, src_addr=<value optimized out>)
at prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c:82
82 prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c: No such file or directory.
in prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c
(gdb)
#5 0x00002af7b5a29e19 in fi_ibv_rdm_tagged_cq_read (cq=<value optimized out>,
buf=<value optimized out>, count=<value optimized out>)
at prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c:98
98 in prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c
(gdb)
#6 0x00002af7b57c83c7 in ompi_mtl_ofi_progress_no_inline ()
from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/openmpi/mca_mtl_ofi.so
(gdb)
#7 0x00002af7afab5f2a in opal_progress ()
from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/libopen-pal.so.13
(gdb)
#8 0x00002af7b4b5f055 in mca_pml_cm_recv ()
from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/openmpi/mca_pml_cm.so
(gdb)
#9 0x00002af7a962022c in PMPI_Recv ()
from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/libmpi.so.12
(gdb)
#10 0x00002af7aab4b297 in libMesh::Parallel::Status
libMesh::Parallel::Communicator::receive<unsigned int>(unsigned int,
std::vector<unsigned int, std::allocator<unsigned int> >&,
libMesh::Parallel::DataType const&, libMesh::Parallel::MessageTag const&)
const () at ./include/libmesh/parallel_implementation.h:2649
2649 libmesh_call_mpi
(gdb)
#11 0x00002af7aab45a19 in void
libMesh::Parallel::Communicator::send_receive<unsigned int, unsigned
int>(unsigned int, std::vector<unsigned int, std::allocator<unsigned
int> > const&, libMesh::Parallel::DataType const&, unsigned int,
std::vector<unsigned int, std::allocator<unsigned int> >&,
libMesh::Parallel::DataType const&, libMesh::Parallel::MessageTag
const&, libMesh::Parallel::MessageTag const&) const () at
./include/libmesh/parallel_implementation.h:2788
2788 this->receive (source_processor_id, recv, type2, recv_tag);
(gdb)
#12 0x00002af7aab40643 in void
libMesh::Parallel::Communicator::send_receive<unsigned int>(unsigned
int, std::vector<unsigned int, std::allocator<unsigned int> > const&,
unsigned int, std::vector<unsigned int, std::allocator<unsigned int> >&,
libMesh::Parallel::MessageTag const&, libMesh::
2861 this->send_receive (dest_processor_id, sendvec,
(gdb)
#13 0x00002af7aaf28774 in unsigned int
libMesh::DistributedMesh::renumber_dof_ob
at src/mesh/distributed_mesh.C:1090
1090 this->comm().send_receive(procup, requested_ids[procup],
(gdb)
#14 0x00002af7aaf2337b in
libMesh::DistributedMesh::renumber_nodes_and_elements(
1267 _n_elem = this->renumber_dof_objects (this->_elements);
(gdb)
#15 0x00002af7aaf84a22 in libMesh::MeshBase::prepare_for_use(bool, bool) ()
at src/mesh/mesh_base.C:209
209 this->renumber_nodes_and_elements();
(gdb)
#16 0x00002af7aafe36b8 in
libMesh::MeshTools::Generation::build_cube(libMesh::Un) ()
at src/mesh/mesh_generation.C:1426
1426 mesh.prepare_for_use (/*skip_renumber =*/ false);
(gdb)
#17 0x0000000000417a99 in main () at create_mesh.C:19
19 MeshTools::Generation::build_cube (mesh, 5, 5, 5);
(gdb)
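For reference, a minimal program along these lines reproduces it. This is
only a sketch matching the trace above: the DistributedMesh and the
build_cube(mesh, 5, 5, 5) call at create_mesh.C:19 are taken from the
backtrace, the rest is standard libMesh boilerplate.

// Assumed minimal reproducer -- a sketch, not the actual create_mesh.C.
#include "libmesh/libmesh.h"
#include "libmesh/distributed_mesh.h"
#include "libmesh/mesh_generation.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);        // initializes MPI (and PETSc, if enabled)

  DistributedMesh mesh (init.comm());   // fully distributed mesh

  // Hangs in prepare_for_use() -> renumber_nodes_and_elements()
  // under the OpenMPI/OFI stack shown above; runs fine with MPICH.
  MeshTools::Generation::build_cube (mesh, 5, 5, 5);

  return 0;
}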
If you want me to print something in the debugger, I'll be happy to do it.
Michael.
On 09/06/2017 11:32 AM, Roy Stogner wrote:
On Wed, 6 Sep 2017, Michael Povolotskyi wrote:
I found that if I rebuild everything with MPICH, instead of using the
installed OpenMPI, then everything works perfectly. Is libMesh supposed
to work with OpenMPI? If yes, I can recompile it again and produce the
stack trace.
libMesh does work with openmpi; I'm using MPICH2 right now but I was
using OpenMPI up until a month or two ago, and the problem that made
me switch was troublesome MPI_Abort behavior, not any bugs in regular
operation.
However, libMesh does *not* work with mixed MPI installs. Running an
openmpi-compiled libMesh with an mpich-based mpiexec, for instance,
will typically run N 1-processor jobs rather than 1 N-processor job.
If you somehow managed to *link* to multiple MPI versions (seems very
unlikely, but perhaps via another dependency which was built against a
different version?) then all bets are off.
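A quick way to check for that kind of mismatch, independent of libMesh, is
a bare MPI hello-world (just an illustrative sketch, not code from this
thread): with a consistent install, "mpiexec -np 4" reports a world size
of 4 on every rank, while mismatched installs typically give four separate
runs that each report rank 0 of 1.

// mpi_check.cpp -- illustrative only: prints each rank's view of the world.
// Build and run with the *same* MPI install, e.g.
//   mpicxx mpi_check.cpp -o mpi_check && mpiexec -np 4 ./mpi_check
#include <mpi.h>
#include <cstdio>

int main (int argc, char ** argv)
{
  MPI_Init (&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  // Consistent install: "rank 0..3 of 4".  Mismatched mpiexec vs. the MPI
  // the binary was linked against: four copies of "rank 0 of 1".
  std::printf ("rank %d of %d\n", rank, size);

  MPI_Finalize ();
  return 0;
}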
Either way, no need for a stack trace if you've found a workaround.
---
Roy