On Apr 13, 2009, at 10:05, Rolf Vandevaart wrote:

We are also looking at mapping out a BTL when we get an error. We are going down the path of registering a PML OB1 callback function that gets invoked when we get an error in the BTL. This PML OB1 callback function can then map out the BTL via a call to mca_bml.bml_del_btl(btl), which seems to be doing the right thing.

There is already a PML function (mca_pml_ob1_error_handler) that gets called when an error [not related to any message] is detected by the BTL. However, the only thing this function does is call abort.
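
To make the idea concrete, here is a minimal, self-contained sketch of an error callback that maps out a failed BTL instead of aborting. It is not Open MPI code: every toy_* type and function below is a hypothetical stand-in invented for illustration, and the real mca_bml.bml_del_btl and mca_pml_ob1_error_handler interfaces are not reproduced here.

```c
#include <stddef.h>

#define TOY_MAX_BTLS 8

/* Hypothetical stand-in for a BTL module; not the Open MPI structure. */
typedef struct toy_btl {
    const char *name;
} toy_btl_t;

/* Hypothetical stand-in for the set of BTLs usable to reach one peer. */
typedef struct toy_endpoint {
    toy_btl_t *btls[TOY_MAX_BTLS];
    size_t     count;
} toy_endpoint_t;

/* Remove btl from the endpoint, keeping the remaining entries in order.
 * Returns 0 on success, -1 if the BTL was not in the list. */
static int toy_del_btl(toy_endpoint_t *ep, const toy_btl_t *btl)
{
    for (size_t i = 0; i < ep->count; i++) {
        if (ep->btls[i] == btl) {
            for (size_t j = i + 1; j < ep->count; j++) {
                ep->btls[j - 1] = ep->btls[j];
            }
            ep->count--;
            ep->btls[ep->count] = NULL;
            return 0;
        }
    }
    return -1;
}

/* Error callback sketch: map out the failed BTL rather than aborting.
 * Returns the number of BTLs still usable, or -1 when none is left,
 * in which case the caller has no choice but to abort. */
static int toy_error_cb(toy_endpoint_t *ep, toy_btl_t *failed)
{
    (void)toy_del_btl(ep, failed);
    if (ep->count == 0) {
        return -1;
    }
    return (int)ep->count;
}
```

With two BTLs registered, a failure on the first leaves one usable path; a second failure returns -1, which corresponds to the point where aborting is the only option left.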


But making this all work requires changes to the PML OB1 layer.

I have another version of the PML that is way more resilient than the one in the trunk. It is part of the fault tolerance work we're doing here at UTK, but it wasn't expected to go in the trunk anytime soon ...

We are also figuring out what to do about retransmission when we get an error.

There is some code for this. If the descriptor is for an RMA operation, we simply transform it into a send over another BTL. Right now, OB1 does not deal with transmission failures for the match and rendezvous fragments.
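
As a rough illustration of that policy, the sketch below turns a failed RMA descriptor into a plain send on a different BTL and reports match and rendezvous fragments as unrecoverable. It is not the OB1 code: the toy_* names, the fragment kinds, and the round-robin choice of the next BTL are all assumptions made for the example.

```c
/* Hypothetical fragment kinds; not the Open MPI header types. */
typedef enum {
    TOY_FRAG_MATCH,
    TOY_FRAG_RNDV,
    TOY_FRAG_PUT,
    TOY_FRAG_GET,
    TOY_FRAG_SEND
} toy_frag_kind_t;

/* Hypothetical descriptor: what is being sent and on which BTL index. */
typedef struct toy_desc {
    toy_frag_kind_t kind;
    int             btl_index;
} toy_desc_t;

/* Decide how to retry a descriptor whose transmission failed.
 * RMA operations (put/get) become a send over another BTL.
 * Match and rendezvous fragments are not handled, mirroring the
 * limitation described above.
 * Returns 0 if the descriptor was rescheduled, -1 otherwise. */
static int toy_retry(toy_desc_t *d, int num_btls)
{
    if (d->kind != TOY_FRAG_PUT && d->kind != TOY_FRAG_GET) {
        return -1;                      /* no recovery path for this kind */
    }
    if (num_btls < 2) {
        return -1;                      /* no other BTL to fall back to */
    }
    d->kind = TOY_FRAG_SEND;            /* RMA is downgraded to a send */
    d->btl_index = (d->btl_index + 1) % num_btls;  /* assumed: next BTL */
    return 0;
}
```

A failed put on BTL 0 with two BTLs available comes back as a send on BTL 1, while a failed match fragment returns -1 and is left to the caller.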

  george.


Rolf
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
