WHAT: New PML called "bfo" (BTL Fail Over) that supports failover between two or more openib BTLs. New configurable code in the openib BTL that works with the bfo PML to do failover. Note that this only works when there are two or more openib BTLs. It does not fail over to a different type of BTL, such as tcp.

TO CONFIGURE: --enable-openib-failover

TO RUN: --mca pml bfo

TIMEOUT: June 16, 2010

ADDITIONAL DETAILS:
The design relies on the BTL to call back into the PML with each fragment that fails so the PML can decide what needs to be done. No additional message tracking or software acknowledgements were added, which keeps the impact on latency minimal. Testing has shown no measurable effect.

When errors are detected on a BTL, that BTL is no longer used. No effort is made to bring it back if the problems get corrected. If it gets fixed before the next job starts, then it will be used by the next job.

Under normal conditions, these changes have no effect whatsoever on the trunk, as the bfo PML is never selected and the failover support is not configured into the openib BTL.

Every effort was made to minimize the changes in the openib BTL. The main changes are contained in two new files that only get compiled when the --enable-openib-failover flag is set. The other changes consist of about 75 new lines in various openib BTL files.

The bitbucket version is at:
http://bitbucket.org/rolfv/rfc-failover

Here are the files that would be added/changed.

BTL LAYER
M ompi/mca/btl/btl.h
M ompi/mca/btl/base/btl_base_mca.c
M ompi/mca/btl/openib/btl_openib_component.c
M ompi/mca/btl/openib/btl_openib.c
M ompi/mca/btl/openib/btl_openib.h
M ompi/mca/btl/openib/btl_openib_endpoint.h
M ompi/mca/btl/openib/btl_openib_mca.c
A ompi/mca/btl/openib/btl_openib_failover.c
A ompi/mca/btl/openib/btl_openib_failover.h
M ompi/mca/btl/openib/btl_openib_frag.h
M ompi/mca/btl/openib/Makefile.am
M ompi/config/ompi_check_openib.m4

PML LAYER
A ompi/mca/pml/bfo
A ompi/mca/pml/bfo/pml_bfo_comm.h
A ompi/mca/pml/bfo/pml_bfo_sendreq.c
A ompi/mca/pml/bfo/pml_bfo_isend.c
A ompi/mca/pml/bfo/pml_bfo_component.c
A ompi/mca/pml/bfo/Makefile.in
A ompi/mca/pml/bfo/help-mpi-pml-bfo.txt
A ompi/mca/pml/bfo/pml_bfo_recvfrag.h
A ompi/mca/pml/bfo/pml_bfo_progress.c
A ompi/mca/pml/bfo/pml_bfo_sendreq.h
A ompi/mca/pml/bfo/pml_bfo_component.h
A ompi/mca/pml/bfo/pml_bfo_failover.c
A ompi/mca/pml/bfo/pml_bfo_recvreq.c
A ompi/mca/pml/bfo/pml_bfo_irecv.c
A ompi/mca/pml/bfo/pml_bfo_failover.h
A ompi/mca/pml/bfo/pml_bfo_recvreq.h
A ompi/mca/pml/bfo/pml_bfo_iprobe.c
A ompi/mca/pml/bfo/pml_bfo.c
A ompi/mca/pml/bfo/post_configure.sh
A ompi/mca/pml/bfo/pml_bfo_hdr.h
A ompi/mca/pml/bfo/pml_bfo_rdmafrag.c
A ompi/mca/pml/bfo/pml_bfo_rdma.c
A ompi/mca/pml/bfo/configure.params
A ompi/mca/pml/bfo/pml_bfo.h
A ompi/mca/pml/bfo/pml_bfo_rdmafrag.h
A ompi/mca/pml/bfo/pml_bfo_rdma.h
A ompi/mca/pml/bfo/.windows
A ompi/mca/pml/bfo/Makefile.am
A ompi/mca/pml/bfo/pml_bfo_comm.c
A ompi/mca/pml/bfo/pml_bfo_start.c
A ompi/mca/pml/bfo/pml_bfo_recvfrag.c
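
ILLUSTRATIVE SKETCH
For readers unfamiliar with the callback design described under ADDITIONAL DETAILS, here is a rough, self-contained C sketch of the idea. It is not taken from the bfo code. Every name in it (sketch_btl_t, sketch_frag_t, sketch_pml_t, sketch_pick_btl, sketch_pml_error_cb) is hypothetical, and the choice to resend the failed fragment on the first remaining usable BTL is an assumption made for illustration; the real PML may decide differently per fragment type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BTLS 4

/* Hypothetical stand-in for a BTL module; not the Open MPI structure. */
typedef struct {
    const char *name;
    bool usable;   /* cleared on the first error, never set again in this job */
} sketch_btl_t;

/* Hypothetical stand-in for a fragment handed back by a failing BTL. */
typedef struct {
    int id;
    int btl_index; /* BTL the fragment was last sent on */
    int resends;   /* how many times the PML has rescheduled it */
} sketch_frag_t;

/* Hypothetical PML state: just the list of BTLs to a peer. */
typedef struct {
    sketch_btl_t btls[SKETCH_MAX_BTLS];
    size_t num_btls;
} sketch_pml_t;

/* Return the index of the first BTL still marked usable, or -1 if none. */
static int sketch_pick_btl(const sketch_pml_t *pml)
{
    for (size_t i = 0; i < pml->num_btls; i++) {
        if (pml->btls[i].usable) {
            return (int)i;
        }
    }
    return -1;
}

/*
 * Error callback: the BTL calls this once for each fragment that failed.
 * The PML retires the failing BTL for the rest of the job and (assumption)
 * reschedules the fragment on another BTL. No extra acknowledgements or
 * message tracking are involved on the normal send path.
 * Returns the index of the BTL chosen for the resend, or -1 if no usable
 * BTL remains.
 */
static int sketch_pml_error_cb(sketch_pml_t *pml, sketch_frag_t *frag)
{
    assert(frag->btl_index >= 0 && (size_t)frag->btl_index < pml->num_btls);

    /* Stop using the BTL that reported the error. */
    pml->btls[frag->btl_index].usable = false;

    int next = sketch_pick_btl(pml);
    if (next < 0) {
        return -1; /* nothing left to fail over to */
    }

    frag->btl_index = next;
    frag->resends++;
    return next;
}
```

With two BTLs, a first error on BTL 0 moves the fragment to BTL 1; a later error on BTL 1 leaves nothing to fail over to, which mirrors the requirement that two or more openib BTLs be available.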