Short version:
--------------
The modular wireup code in /tmp/jms-modular-wireup seems to be
working. Can people give it a whirl before I bring it back to the
trunk? The more esoteric your hardware setup, the better.
Longer version:
---------------
I think that I have completed round 1 of the modular wireup work in
/tmp/jms-modular-wireup, meaning that all the wireup code has been
moved out of btl_openib_endpoint.* and into connect/*. The
endpoint.c file now simply calls the connect interface through a
function pointer (allowing the choice of the current RML-based wireup
or the RDMA CM). The selected connect "module" will call back to the
openib endpoint for two things:
1. to post receive buffers on a locally-created-but-not-yet-connected qp
2. to signal that the qp is fully connected and ready to be used
This cleaned up the endpoint.* code a *lot*. I also simplified the
RML connection code a bit -- I removed some useless sub-functions, etc.
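For anyone who wants a picture of the shape before looking at the
branch, here is a rough sketch of the function-pointer arrangement and
the two callbacks described above. Every name in it (the sketch_*
types, functions, and error codes) is made up for illustration and is
not what is in connect/*; the real RML wireup is also asynchronous,
which this sketch collapses into straight-line calls.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; they are NOT the actual Open MPI symbols. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_IMPLEMENTED = -1 };

typedef struct sketch_endpoint {
    int recvs_posted;   /* set by callback 1 */
    int connected;      /* set by callback 2 */
} sketch_endpoint_t;

/* Callback 1: post receive buffers on a created-but-not-yet-connected qp. */
static int sketch_endpoint_post_recvs(sketch_endpoint_t *ep)
{
    ep->recvs_posted = 1;
    return SKETCH_SUCCESS;
}

/* Callback 2: the qp is fully connected and ready to be used. */
static void sketch_endpoint_connected(sketch_endpoint_t *ep)
{
    /* Assumption of this sketch: buffers are posted before completion. */
    assert(ep->recvs_posted);
    ep->connected = 1;
}

/* A connect "module" is just a name plus a function pointer. */
typedef struct sketch_connect_module {
    const char *name;
    int (*start_connect)(sketch_endpoint_t *ep);
} sketch_connect_module_t;

/* Stand-in for the RML-based wireup (synchronous here for brevity). */
static int sketch_rml_start_connect(sketch_endpoint_t *ep)
{
    /* Real code would create the qp here, then ask for receive buffers. */
    int rc = sketch_endpoint_post_recvs(ep);
    if (SKETCH_SUCCESS != rc) {
        return rc;
    }
    /* Real code would exchange qp info with the peer before this point. */
    sketch_endpoint_connected(ep);
    return SKETCH_SUCCESS;
}

/* Stand-in for the RDMA CM module, which is not implemented yet. */
static int sketch_rdma_cm_start_connect(sketch_endpoint_t *ep)
{
    (void) ep;
    return SKETCH_ERR_NOT_IMPLEMENTED;
}

static const sketch_connect_module_t sketch_connect_rml = {
    "rml", sketch_rml_start_connect
};
static const sketch_connect_module_t sketch_connect_rdma_cm = {
    "rdma_cm", sketch_rdma_cm_start_connect
};

/* What the endpoint side reduces to: one call through the pointer. */
static int sketch_endpoint_start_connect(const sketch_connect_module_t *m,
                                         sketch_endpoint_t *ep)
{
    return m->start_connect(ep);
}
```

In this sketch, calling sketch_endpoint_start_connect() with
&sketch_connect_rml leaves the endpoint with both flags set, while
passing &sketch_connect_rdma_cm returns SKETCH_ERR_NOT_IMPLEMENTED and
leaves the endpoint untouched. The point is only that the endpoint
code never needs to know which module it is talking to.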
I *think* that this new connection code is all working, but per
http://www.open-mpi.org/community/lists/devel/2007/07/2058.php, I'm
seeing other weird failures so I'm a little reluctant to put this
back on the trunk until I know that everything is working properly.
Granted, the failures in the other post sound like pml errors and
this should be a wholly separate issue (we would get different
warnings/errors if the btl failed to connect), but still -- it seems
a little safer to be prudent.
Still to do:
- exchange the static rate and set it properly during the RML
wireup
- RDMA CM support (it returns ERR_NOT_IMPLEMENTED right now)
--
Jeff Squyres
Cisco Systems