Sorry about the premature send... The basic mechanics of this are similar to the problem with the portals BTL that I fixed. In my case, however, the problem manifested itself with the Intel tests MPI_Send_Fairness_c and MPI_Isend_Fairness_c at 60 processes (the limit that MTT imposes on the Intel tests).
The original code followed the portals design document for MPI pretty well. When the receiver is overwhelmed, a "reject" entry is used to handle the excess messages. One of the features of this "reject" entry is that the receiver (at the BTL level) never touches the actual message. The problem was that the sender did not recognize the returned ACK from portals [in mca_btl_portals_component_progress()] as a failure, so the sender never resent a message that the receiver was still expecting. I fixed this in the trunk, but I had to disable mca_btl_portals_sendi() because that function can be used with a 0-byte portals message payload.

For this particular test, https://svn.open-mpi.org/trac/ompi/ticket/1791, we would not have seen a failure: the root process would not know that it had missed a message, and the non-root processes would not have diagnosed a need to resend. As corrected, the root process is still fat, dumb, and happy (FD&H), but the non-root processes will keep retransmitting until they succeed. Sorry for boring you about portals.

In the sm case, the non-root processes are continually appending to FIFOs, and these blasters can append to the FIFOs much more quickly than the receiving processes can remove entries:

   S7 --> S0
   S6 --> S1
   S5 --> S2
   S4 --> S3

In the first cycle, everyone is busy. In the second cycle, S7, S6, S5, and S4 are ready for the next reduction, but S3, S2, S1, and S0 are still on the hook, which means the latter FIFOs are going to grow at a faster rate:

   S3 --> S0
   S2 --> S1

Now S3 and S2 are ready for the next reduction, but S0 and S1 still have work left in the current one:

   S1 --> S0

Since S0 (the root process) takes a little extra time to finish processing the reduction, it ends up a little behind S1. So we end up with the following timings (a small simulation that reproduces these numbers is sketched at the end of this message):

   S0: (3+Δ)T
   S1: 3T
   S2: 2T
   S3: 2T
   S4: 1T
   S5: 1T
   S6: 1T
   S7: 1T

If sm used a system of ACKs as portals does, we would know when we are overloading the root process. Since it does not, and the reduction itself is non-blocking, we have the potential to exhaust memory. I guess the real question is whether the reduction should block or whether we expect the user to protect himself. (A second sketch at the end of this message shows the kind of bounded-FIFO back-pressure I have in mind.)

--
-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Eugene Loh
Sent: Friday, February 13, 2009 11:42 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: Eliminate ompi/class/ompi_[circular_buffer_]fifo.h

George Bosilca wrote:

> I can't confirm or deny. The only thing I can tell is that the same
> test works fine over other BTLs, so this tends to pinpoint either a
> problem in the sm BTL or a particular path in the PML (the one used
> by the sm BTL). I'll have to dig a little bit more into it, but I was
> hoping to do it in the context of the new sm BTL (just to avoid
> having to do it twice).

Okay. I'll try to get "single queue" put back soon and might look at 1791 along the way. But here is what I wonder. Let's say you have one-way traffic -- either rank A sending rank B messages without ever any traffic in the other direction, or repeated MPI_Reduce operations always with the same root -- and the senders somehow get well ahead of the receiver. Say A wants to pump 1,000,000 messages over and B is busy doing something else. What should happen? What should the PML and BTL do? The conditions could range from B not being in MPI at all to B listening to the BTL without yet having posted the matching receives.
Should the connection become congested and force the sender to wait -- and if so, is this in the BTL or the PML? Or should B keep queueing up the unexpected messages? After some basic "single queue" putbacks, I'll try to look at the code and understand what the PML is doing in cases like this.
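
P.S. Since I referred to it above: here is a rough, throwaway model (not Open MPI code; the cost model -- one unit T per send/combine step, plus a fudge factor delta for the root -- is just my assumption) that walks the same 8-process fan-in schedule and reproduces the timings I listed:

/* NOT Open MPI code: a throwaway model of the 8-rank fan-in above.
 * Assumption: each send/combine step costs one unit "T", and the
 * root pays an extra "delta" to finish up the result. */
#include <stdio.h>

#define NRANKS 8

int main(void)
{
    double T = 1.0, delta = 0.25;   /* arbitrary illustrative values */
    double done[NRANKS] = {0};      /* time each rank finishes its part */
    int m, i;

    /* With m active ranks, ranks m/2..m-1 send to rank (m-1-i); m is
     * halved each round.  That reproduces S7-->S0 ... S4-->S3, then
     * S3-->S0 and S2-->S1, then S1-->S0. */
    for (m = NRANKS; m > 1; m /= 2) {
        for (i = m / 2; i < m; i++) {
            int dst = (m - 1) - i;
            double t = (done[i] > done[dst] ? done[i] : done[dst]) + T;
            done[i]   = t;          /* sender is finished after this round */
            done[dst] = t;          /* receiver is still on the hook */
        }
    }
    done[0] += delta * T;           /* the root's extra processing */

    for (i = 0; i < NRANKS; i++)
        printf("S%d: %.2f T\n", i, done[i] / T);
    return 0;
}

With those values it prints 3.25, 3.00, 2.00, 2.00, 1.00, 1.00, 1.00, 1.00 for S0 through S7 -- the (3+Δ)T / 3T / 2T / 1T pattern above.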
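
P.P.S. And here is the kind of bounded-FIFO back-pressure I was alluding to. This is only an illustrative sketch, not the sm BTL's actual FIFO code -- the depth, the names, and the "drain to make room" stand-in for real progress are all made up -- but it shows the trade-off: when the queue is full, the writer either waits (so the operation effectively blocks) or queues the fragment locally and lets memory grow.

/* Again, NOT the sm BTL's FIFO -- just an illustration of sender-side
 * back-pressure with a fixed-depth circular queue. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define FIFO_DEPTH 4                /* tiny bound so the demo below backs off */

struct bounded_fifo {
    void  *slot[FIFO_DEPTH];
    size_t head;                    /* next slot the writer fills  */
    size_t tail;                    /* next slot the reader drains */
};

/* Returns false when the FIFO is full: the writer must make progress
 * and retry instead of letting a pending list grow without bound. */
static bool fifo_write(struct bounded_fifo *f, void *frag)
{
    size_t next = (f->head + 1) % FIFO_DEPTH;
    if (next == f->tail)
        return false;               /* receiver is overloaded: back off */
    f->slot[f->head] = frag;
    f->head = next;
    return true;
}

static void *fifo_read(struct bounded_fifo *f)
{
    void *frag;
    if (f->tail == f->head)
        return NULL;                /* empty */
    frag = f->slot[f->tail];
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    return frag;
}

int main(void)
{
    struct bounded_fifo f = { {0}, 0, 0 };
    int payload[8];
    int i;

    for (i = 0; i < 8; i++) {
        /* Sender side: when the write fails we wait (poll and retry)
         * rather than queueing payload[i] in ever-growing local memory.
         * Reading from the FIFO here stands in for the root draining it. */
        while (!fifo_write(&f, &payload[i])) {
            void *got = fifo_read(&f);
            printf("root drained entry %d\n", (int)((int *)got - payload));
        }
    }
    while (fifo_read(&f) != NULL)
        ;                           /* drain whatever is left */
    return 0;
}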