Dear Roy,
Still, let me share the stack trace with you.
Everything works for me now with MPICH, but if I can help improve libMesh's portability I would be glad to do so.
Here is the trace from the first MPI process; it looks the same on the second.
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
mlx5_stall_poll_cq (ibcq=0x2476cb0, ne=8, wc=0x7ffd610c0370) at src/cq.c:341
341     src/cq.c: No such file or directory.
        in src/cq.c
(gdb) up
#1  mlx5_poll_cq (ibcq=0x2476cb0, ne=8, wc=0x7ffd610c0370) at src/cq.c:515
515     in src/cq.c
(gdb)
#2  0x00002af7b5a2ffc4 in ibv_poll_cq (ep=<value optimized out>)
    at /usr/include/infiniband/verbs.h:1245
1245            return cq->context->ops.poll_cq(cq, num_entries, wc);
(gdb)
#3  fi_ibv_rdm_tagged_poll_recv (ep=<value optimized out>)
    at prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c:608
608 prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c: No such file or directory.
        in prov/verbs/src/ep_rdm/verbs_tagged_ep_rdm.c
(gdb)
#4  0x00002af7b5a29d65 in fi_ibv_rdm_tagged_cq_readfrom (
    cq=<value optimized out>, buf=<value optimized out>,
    count=<value optimized out>, src_addr=<value optimized out>)
    at prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c:82
82      prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c: No such file or directory.
        in prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c
(gdb)
#5  0x00002af7b5a29e19 in fi_ibv_rdm_tagged_cq_read (cq=<value optimized out>,
    buf=<value optimized out>, count=<value optimized out>)
    at prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c:98
98      in prov/verbs/src/ep_rdm/verbs_cq_ep_rdm.c
(gdb)
#6  0x00002af7b57c83c7 in ompi_mtl_ofi_progress_no_inline ()
   from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/openmpi/mca_mtl_ofi.so
(gdb)
#7  0x00002af7afab5f2a in opal_progress ()
   from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/libopen-pal.so.13
(gdb)
#8  0x00002af7b4b5f055 in mca_pml_cm_recv ()
   from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/openmpi/mca_pml_cm.so
(gdb)
#9  0x00002af7a962022c in PMPI_Recv ()
   from /apps/conte/openmpi/1.10.1/gcc-5.2.0/lib/libmpi.so.12
(gdb)
#10 0x00002af7aab4b297 in libMesh::Parallel::Status libMesh::Parallel::Communicator::receive<unsigned int>(unsigned int, std::vector<unsigned int, std::allocator<unsigned int> >&, libMesh::Parallel::DataType const&, libMesh::Parallel::MessageTag const&) const () at ./include/libmesh/parallel_implementation.h:2649
2649      libmesh_call_mpi
(gdb)
#11 0x00002af7aab45a19 in void libMesh::Parallel::Communicator::send_receive<unsigned int, unsigned int>(unsigned int, std::vector<unsigned int, std::allocator<unsigned int> > const&, libMesh::Parallel::DataType const&, unsigned int, std::vector<unsigned int, std::allocator<unsigned int> >&, libMesh::Parallel::DataType const&, libMesh::Parallel::MessageTag const&, libMesh::Parallel::MessageTag const&) const () at ./include/libmesh/parallel_implementation.h:2788
2788      this->receive (source_processor_id, recv, type2, recv_tag);
(gdb)
#12 0x00002af7aab40643 in void libMesh::Parallel::Communicator::send_receive<unsigned int>(unsigned int, std::vector<unsigned int, std::allocator<unsigned int> > const&, unsigned int, std::vector<unsigned int, std::allocator<unsigned int> >&, libMesh::Parallel::MessageTag const&, libMesh::Parallel::MessageTag const&) const () at ./include/libmesh/parallel_implementation.h:2861
2861      this->send_receive (dest_processor_id, sendvec,
(gdb)
#13 0x00002af7aaf28774 in unsigned int libMesh::DistributedMesh::renumber_dof_objects(...) ()
    at src/mesh/distributed_mesh.C:1090
1090              this->comm().send_receive(procup, requested_ids[procup],
(gdb)
#14 0x00002af7aaf2337b in libMesh::DistributedMesh::renumber_nodes_and_elements() ()
    at src/mesh/distributed_mesh.C:1267
1267      _n_elem = this->renumber_dof_objects (this->_elements);
(gdb)
#15 0x00002af7aaf84a22 in libMesh::MeshBase::prepare_for_use(bool, bool) ()
    at src/mesh/mesh_base.C:209
209         this->renumber_nodes_and_elements();
(gdb)
#16 0x00002af7aafe36b8 in libMesh::MeshTools::Generation::build_cube(libMesh::UnstructuredMesh&, ...) ()
    at src/mesh/mesh_generation.C:1426
1426      mesh.prepare_for_use (/*skip_renumber =*/ false);
(gdb)
#17 0x0000000000417a99 in main () at create_mesh.C:19
19          MeshTools::Generation::build_cube (mesh, 5, 5, 5);
(gdb)

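For context, here is a minimal sketch of what create_mesh.C does, reconstructed from frames #13-#17 of the trace above (the DistributedMesh type is inferred from frame #13 and the build_cube call is shown verbatim in frame #17; the rest is my paraphrase, not the actual file):

// Hypothetical reconstruction of create_mesh.C from the trace above.
#include "libmesh/libmesh.h"
#include "libmesh/distributed_mesh.h"
#include "libmesh/mesh_generation.h"

using namespace libMesh;

int main (int argc, char ** argv)
{
  LibMeshInit init (argc, argv);       // initializes MPI (and PETSc/SLEPc if enabled)

  DistributedMesh mesh (init.comm());  // frame #13 shows DistributedMesh::renumber_dof_objects

  // Frame #17: build_cube internally calls mesh.prepare_for_use(),
  // whose renumbering send_receive is where the hang occurs.
  MeshTools::Generation::build_cube (mesh, 5, 5, 5);

  return 0;
}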

If you want me to print something in the debugger, I'll be happy to do it.
Michael.


On 09/06/2017 11:32 AM, Roy Stogner wrote:

On Wed, 6 Sep 2017, Michael Povolotskyi wrote:

I found that if I rebuild everything with MPICH, instead of using the installed OpenMPI, everything works perfectly. Is libMesh supposed to work with OpenMPI? If yes, I can recompile it again and produce the stack trace.

libMesh does work with OpenMPI; I'm using MPICH2 right now, but I was
using OpenMPI up until a month or two ago, and the problem that made
me switch was troublesome MPI_Abort behavior, not any bugs in regular
operation.

However, libMesh does *not* work with mixed MPI installs.  Running an
OpenMPI-compiled libMesh with an MPICH-based mpiexec, for instance,
will typically run N 1-processor jobs rather than 1 N-processor job.
If you somehow managed to *link* to multiple MPI versions (seems very
unlikely, but perhaps via another dependency which was built against a
different version?) then all bets are off.
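One quick way to tell these situations apart is a tiny rank/size check, something like the sketch below (check_mpi.C is a hypothetical name; compile it with the same mpicxx that built libMesh and run it under your mpiexec):

// check_mpi.C - sanity check for the "N 1-processor jobs" symptom.
#include <mpi.h>
#include <cstdio>

int main (int argc, char ** argv)
{
  MPI_Init (&argc, &argv);

  int rank, size;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  // Under a mismatched mpiexec every process prints "rank 0 of 1";
  // under a matched one you see ranks 0..N-1 of N.
  std::printf ("rank %d of %d\n", rank, size);

  MPI_Finalize ();
  return 0;
}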

Either way, no need for a stack trace if you've found a workaround.
---
Roy


