Hi Gilles,

> On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> I noted the stack traces refers opal_cuda_memcpy(). Is this issue specific to
> CUDA environments ?

No, this is just on normal CPU-only nodes. But memcpy always goes through opal_cuda_memcpy when CUDA support is enabled, even if there are no GPUs in use (or indeed, even installed).

> The coll/tuned default collective module is known not to work when tasks use
> matching but different signatures.
> For example, one task sends one vector of N elements, and the other task
> receives N elements.

This is the call that triggers it:

    ierror = MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, S, recvcounts, displs, mpitype_vec_nobs, node_comm);

(and changing the source datatype to MPI_BYTE to avoid the NULL handle doesn’t help).

> A workaround worth trying is to
> mpirun --mca coll basic ...

Thanks, using --mca coll basic,libnbc fixes it (basic on its own fails because it can’t work out what to use for Iallgather).

> Last but not least, could you please post a minimal example (and the number
> of MPI tasks used) that can evidence the issue ?

I’m just waiting for the user to get back to me with the okay to share the code. Otherwise, I’ll see what I can put together myself.

It works on 42 cores (at 14 per node = 3 nodes) but fails for 43 cores (so 1 rank on the 4th node). The communicator includes 1 rank per node, so it’s going from a three-rank communicator to a four-rank communicator. Perhaps the tuned algorithm changes at that point?

Cheers,
Ben