OK, I was wrong: the fix works.
I had rebuilt with the latest trunk, but openib support was somehow
dropped, so I was actually running on tcp.
Which brings us to the next issue: tcp is actually not working (I don't
know why I was convinced that it was). The fix solved the problem for
openib, but if I'm not mistaken (again!), tcp still hangs.
Sylvain
On Fri, 4 Sep 2009, Sylvain Jeaugey wrote:
Hi Rolf,
I was indeed running a trunk more than 4 weeks old, but after pulling the
latest version (and checking that the patch was in the code), it seems to
make no difference.
However, I know where to look now, thanks!
Sylvain
On Fri, 4 Sep 2009, Rolf Vandevaart wrote:
I think you are running into a bug that we also saw and recently fixed.
We would see a hang when sending from a contiguous type to a
non-contiguous type using a single port over openib. The problem was that
the state of the request on the sending side was not being properly updated
in that case. We see it with one port but not with two because different
protocols are used depending on the number of ports.
Don Kerr found and fixed the problem in both the trunk and the branch.
Trunk: https://svn.open-mpi.org/trac/ompi/changeset/21775
1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833
If you are running the latest bits and still seeing the problem, then I
guess it is something else.
Rolf
On 09/04/09 04:40, Sylvain Jeaugey wrote:
Hi all,
We're currently working with romio and we hit a problem when exchanging
data with hindexed types over the openib btl.
The attached reproducer (adapted from romio) works fine on tcp, blocks on
openib when using 1 port, but works if we use 2 ports (!). I tested it
against the trunk and the 1.3.3 release, with the same results.
The basic idea is: processes 0..3 send contiguous data to process 0.
Process 0 receives these buffers with an hindexed datatype, which scatters
the data at different offsets.
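
To make the receive side of this pattern concrete, here is a minimal sketch
in plain C of what receiving a contiguous stream into an hindexed layout
amounts to. It is not the attached reproducer and it makes no MPI calls:
hindexed_scatter is a hypothetical helper invented for illustration, and it
assumes the base type is a single byte, so that block lengths and
displacements are both expressed in bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an MPI or Open MPI function): models how a
 * receive into an hindexed-style layout places an incoming contiguous
 * stream. Block i takes up to blocklens[i] bytes from the stream and
 * stores them at base + displs[i]. Returns the number of bytes consumed. */
size_t hindexed_scatter(char *base, const char *stream, size_t stream_len,
                        const size_t *blocklens, const ptrdiff_t *displs,
                        size_t nblocks)
{
    size_t consumed = 0;

    for (size_t i = 0; i < nblocks && consumed < stream_len; i++) {
        size_t n = blocklens[i];

        /* A stream shorter than the layout fills the last block partially. */
        if (n > stream_len - consumed)
            n = stream_len - consumed;

        memcpy(base + displs[i], stream + consumed, n);
        consumed += n;
    }
    return consumed;
}
```

For example, scattering the 6-byte stream "AABBCC" with block lengths
{2, 2, 2} and displacements {0, 6, 12} places the three pairs at offsets 0,
6 and 12 of the destination and leaves the bytes in between untouched. In
the real reproducer this placement is done by the MPI library while
unpacking the receive, which is why the sender only ever sees a contiguous
buffer.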
Receiving in a contiguous manner works, but receiving with an hindexed
datatype makes the remote sends block. Yes, the remote send, not the
receive. The receive is working fine and data is correctly scattered on
the buffer, but the senders on the other node are stuck in the Wait().
I tried not using MPI_BOTTOM, which changed nothing. It seems that the
problem only occurs when STRIPE*NB (the size of the send) is larger than
12k (the RDMA threshold), but I didn't manage to remove the deadlock
by increasing the RDMA threshold.
I've tried to do some debugging, but I'm a bit lost as to where the
non-contiguous types are handled and how they affect btl communication.
So, if anyone has a clue about where I should look, I'm interested!
Thanks,
Sylvain
------------------------------------------------------------------------
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================