WHAT: New PML called "bfo" (BTL Fail Over) that supports failover between
two or more openib BTLs, plus new configurable code in the openib BTL that
works with the bfo PML to do the failover.  Note that this only works when
there are two or more openib BTLs.  It does not fail over to a different
type of BTL, such as tcp.

TO CONFIGURE:
--enable-openib-failover

TO RUN:
--mca pml bfo

TIMEOUT:
June 16, 2010

ADDITIONAL DETAILS:
The design relies on the BTL to call back into the PML with each
fragment that fails so the PML can decide what needs to be done.
No additional message tracking or software acknowledgements are
added, so the impact on latency is minimal.  Testing has shown no
measurable effect.
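
The callback design described above can be illustrated with a minimal
sketch.  This is not the bfo code itself: every name below (sketch_pml,
sketch_frag, sketch_frag_failed, and so on) is hypothetical and chosen
only for illustration.  It assumes the PML keeps a per-BTL "usable" flag
and that the BTL hands each failed fragment to a single PML entry point,
which retires the failing BTL and reschedules the fragment on another one
if any is left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not the Open MPI API. */
#define SKETCH_MAX_BTLS 4

struct sketch_frag {
    int id;         /* fragment identifier */
    int btl_index;  /* index of the BTL the fragment was last sent on */
    int resends;    /* number of times the PML has rescheduled it */
};

struct sketch_pml {
    bool btl_usable[SKETCH_MAX_BTLS];  /* false once a BTL has failed */
    int  num_btls;                     /* number of entries in use */
};

/* Return the index of the first usable BTL, or -1 if none is left. */
int sketch_pick_btl(const struct sketch_pml *pml)
{
    for (int i = 0; i < pml->num_btls; i++) {
        if (pml->btl_usable[i]) {
            return i;
        }
    }
    return -1;
}

/*
 * Entry point the BTL would call for each fragment that fails.
 * The PML retires the failing BTL and, if another BTL is still
 * usable, moves the fragment to it.  Returns the new BTL index,
 * or -1 when no BTL remains.
 */
int sketch_frag_failed(struct sketch_pml *pml, struct sketch_frag *frag)
{
    assert(pml != NULL && frag != NULL);
    assert(frag->btl_index >= 0 && frag->btl_index < pml->num_btls);

    pml->btl_usable[frag->btl_index] = false;

    int next = sketch_pick_btl(pml);
    if (next >= 0) {
        frag->btl_index = next;
        frag->resends++;
    }
    return next;
}
```

Because the decision is made only when a fragment actually fails, the
error-free path carries no extra bookkeeping in this sketch, which is the
property the design aims for.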

When errors are detected on a BTL, that BTL is no longer used.  No
effort is made to bring it back if the problems get corrected.  If the
problems are fixed before the next job starts, then the BTL will be used
by the next job.
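
The "failed for the rest of the job" behavior can be sketched as a
one-way state change.  Again, all names here (job_btl, job_btl_on_error,
and so on) are hypothetical and are not taken from the openib BTL; the
sketch only assumes that each job starts from a fresh check of the
hardware and never re-enables a BTL after an error.

```c
#include <stdbool.h>

/* Hypothetical per-job BTL state; not an Open MPI type. */
enum job_btl_state { JOB_BTL_ACTIVE, JOB_BTL_FAILED };

struct job_btl {
    enum job_btl_state state;
};

/* State at job start: depends only on whether the link is healthy now. */
struct job_btl job_btl_init(bool link_is_healthy)
{
    struct job_btl btl;
    btl.state = link_is_healthy ? JOB_BTL_ACTIVE : JOB_BTL_FAILED;
    return btl;
}

/* An error retires the BTL for the remainder of the job. */
void job_btl_on_error(struct job_btl *btl)
{
    btl->state = JOB_BTL_FAILED;
}

/* A "link recovered" event is deliberately ignored within a job. */
void job_btl_on_link_up(struct job_btl *btl)
{
    (void)btl;  /* no transition back to JOB_BTL_ACTIVE */
}

bool job_btl_is_usable(const struct job_btl *btl)
{
    return btl->state == JOB_BTL_ACTIVE;
}
```

A BTL that was repaired in the meantime becomes usable again only
through job_btl_init() at the start of the next job.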

Under normal conditions, these changes have no effect whatsoever on the
trunk, as the bfo PML is never selected and the failover support is
not configured into the openib BTL.  Every effort was made to minimize
the changes in the openib BTL.  The main changes are contained in two
new files that only get compiled when the --enable-openib-failover flag
is set.  The other changes consist of about 75 new lines in various
openib BTL files.

The bitbucket version is at:
http://bitbucket.org/rolfv/rfc-failover

Here are the files that would be added/changed.

BTL LAYER
M       ompi/mca/btl/btl.h
M       ompi/mca/btl/base/btl_base_mca.c
M       ompi/mca/btl/openib/btl_openib_component.c
M       ompi/mca/btl/openib/btl_openib.c
M       ompi/mca/btl/openib/btl_openib.h
M       ompi/mca/btl/openib/btl_openib_endpoint.h
M       ompi/mca/btl/openib/btl_openib_mca.c
A       ompi/mca/btl/openib/btl_openib_failover.c
A       ompi/mca/btl/openib/btl_openib_failover.h
M       ompi/mca/btl/openib/btl_openib_frag.h
M       ompi/mca/btl/openib/Makefile.am
M       ompi/config/ompi_check_openib.m4

PML LAYER
A       ompi/mca/pml/bfo
A       ompi/mca/pml/bfo/pml_bfo_comm.h
A       ompi/mca/pml/bfo/pml_bfo_sendreq.c
A       ompi/mca/pml/bfo/pml_bfo_isend.c
A       ompi/mca/pml/bfo/pml_bfo_component.c
A       ompi/mca/pml/bfo/Makefile.in
A       ompi/mca/pml/bfo/help-mpi-pml-bfo.txt
A       ompi/mca/pml/bfo/pml_bfo_recvfrag.h
A       ompi/mca/pml/bfo/pml_bfo_progress.c
A       ompi/mca/pml/bfo/pml_bfo_sendreq.h
A       ompi/mca/pml/bfo/pml_bfo_component.h
A       ompi/mca/pml/bfo/pml_bfo_failover.c
A       ompi/mca/pml/bfo/pml_bfo_recvreq.c
A       ompi/mca/pml/bfo/pml_bfo_irecv.c
A       ompi/mca/pml/bfo/pml_bfo_failover.h
A       ompi/mca/pml/bfo/pml_bfo_recvreq.h
A       ompi/mca/pml/bfo/pml_bfo_iprobe.c
A       ompi/mca/pml/bfo/pml_bfo.c
A       ompi/mca/pml/bfo/post_configure.sh
A       ompi/mca/pml/bfo/pml_bfo_hdr.h
A       ompi/mca/pml/bfo/pml_bfo_rdmafrag.c
A       ompi/mca/pml/bfo/pml_bfo_rdma.c
A       ompi/mca/pml/bfo/configure.params
A       ompi/mca/pml/bfo/pml_bfo.h
A       ompi/mca/pml/bfo/pml_bfo_rdmafrag.h
A       ompi/mca/pml/bfo/pml_bfo_rdma.h
A       ompi/mca/pml/bfo/.windows
A       ompi/mca/pml/bfo/Makefile.am
A       ompi/mca/pml/bfo/pml_bfo_comm.c
A       ompi/mca/pml/bfo/pml_bfo_start.c
A       ompi/mca/pml/bfo/pml_bfo_recvfrag.c

