Hi Gilles,

> On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> I noted the stack traces refers opal_cuda_memcpy(). Is this issue specific to 
> CUDA environments ?

No, this is just on normal CPU-only nodes. But memcpy always goes through 
opal_cuda_memcpy when CUDA support is enabled, even if there are no GPUs in use 
(or indeed, even installed).

> The coll/tuned default collective module is known not to work when tasks use 
> matching but different signatures.
> For example, one task sends one vector of N elements, and the other task 
> receives N elements.

This is the call that triggers it:

        ierror = MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, S[0],
                                recvcounts, displs, mpitype_vec_nobs, node_comm);

(and changing the source datatype to MPI_BYTE to avoid the NULL handle doesn’t 

> A workaround worth trying is to
> mpirun --mca coll basic ...

Thanks — using --mca coll basic,libnbc fixes it (basic on its own fails because 
it can’t work out what to use for Iallgather).

> Last but not least, could you please post a minimal example (and the number 
> of MPI tasks used) that can evidence the issue ?

I’m just waiting for the user to get back to me with the okay to share the 
code. Otherwise, I’ll see what I can put together myself. It works on 42 cores 
(at 14 per node = 3 nodes) but fails for 43 cores (so 1 rank on the 4th node). 
The communicator includes 1 rank per node, so it’s going from a three-rank 
communicator to a four-rank communicator — perhaps the tuned algorithm changes 
at that point?
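
For reference, the node arithmetic behind those numbers can be sketched as below. This assumes ranks are packed onto nodes in order, 14 to a node; the function names are illustrative only and do not come from the user's code.

```c
#include <assert.h>

/* Number of nodes touched when `cores` ranks are packed `per_node` to a node.
 * Assumption: ranks fill one node completely before spilling onto the next. */
int nodes_needed(int cores, int per_node)
{
    assert(cores > 0 && per_node > 0);
    return (cores + per_node - 1) / per_node;   /* ceiling division */
}

/* Number of ranks that land on the last (possibly partly filled) node. */
int ranks_on_last_node(int cores, int per_node)
{
    assert(cores > 0 && per_node > 0);
    int rem = cores % per_node;
    return rem == 0 ? per_node : rem;
}
```

With 14 ranks per node, nodes_needed(42, 14) gives 3 and nodes_needed(43, 14) gives 4, which matches the three-rank and four-rank communicator sizes above.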

