I guess too much optimization always bites back :) In a few words, here is a description of the problem. The PML is event based: each action is triggered either by a function call from the upper level or by a callback from the lower one. The last set of optimizations on the PML/BTL removed this callback in some cases, and therefore left the PML in a state where it is unable to make any progress. In this particular test (the problem is not necessarily related to SM; it's just that we didn't find the right number of pending sends to trigger it over other BTLs), the test executes a set of isends, followed by a blocking send. The isends go over SM, and as there is no progress in the isend path, we fill up the SM queue. When the blocking send gets posted, it is delayed (as there is no more room in the SM file) and added by the PML to its pending send queue. So far, so good. Except that at this point we return from the PML function and go into the condition wait. The condition wait calls the BML progress functions, but as there are no callbacks into the PML, the PML is unable to reschedule the send.

This didn't happen until recently, but that was pure luck. Before, there was a pending queue in the SM BTL, and the message eventually got sent at some point without involving the PML. Anyway, as I said before, the problem could happen with any other BTL if we post the right number of non-blocking sends.

Here is the solution I propose. If you think there is any problem with it, please let me know ASAP.

Move the progress function from the BML layer back into the PML. The PML will then have a way to check on its pending requests and progress them accordingly. This solution involves the same number of function calls as what we have today, and should only minimally affect performance (one more if in the progress function).

  george.

On Jun 25, 2008, at 4:06 AM, Lenny Verkhovsky wrote:

Hi,
I downloaded the new version from trunk and got the following:
1. opal_output for no reason (probably something was forgotten)
2. it got stuck.


/home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 2 -hostfile hostfile_w4_8 ./osu_bw
[witch4:20920] Using eager rdma: 1
[witch4:20921] Using eager rdma: 1
# OSU MPI Bandwidth Test (Version 2.1)
# Size          Bandwidth (MB/s)

(got stuck)


Lenny.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
