I have placed a tarball of this branch for those willing to MTT it: http://www.open-mpi.org/~rhc/openmpi-1.9.tar.bz2
I will update it if/when major changes are made.

On May 13, 2013, at 9:00 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Hi folks
>
> As most of you know, I have been working for quite some time on rewriting
> the OOB. It is now getting close to being ready to be committed.
>
> Unfortunately, I am changing jobs on May 20th (starting a position at
> Intel), which will cause a hopefully short "service interruption" in my
> ability to contribute code to OMPI. I have started the legal paperwork to
> resolve that situation and have the backing of my new management, but
> these things always take time.
>
> Ordinarily, I would simply hold off the commit until the paperwork was
> completed. However, after talking with a few people in the community, the
> changes are important and desirable enough to get this into the trunk
> without the indefinite delay. I can continue to help debug even after my
> status changes - I just cannot directly contribute code. So I have
> committed the code to the OMPI repository in a public temporary branch.
> Once the community believes the code is ready, Jeff can merge it back to
> the trunk if I'm not able to do so.
>
> WHAT: Rewrite of ORTE OOB
>
> WHY: Support asynchronous progress and a host of other features
>
> WHEN: TBD (will discuss at weekly telecon and/or on mailing list)
>
> SYNOPSIS:
> The current OOB has served us well, but a number of limitations have been
> identified over the years. Specifically:
>
> * it is only progressed when called via opal_progress, which can lead to
>   hangs or recursive calls into libevent (which is not supported by that
>   code)
>
> * we've had issues when multiple NICs are available, as the code doesn't
>   "shift" messages between transports - thus, all nodes had to be
>   reachable via the same TCP interface
>
> * the OOB "unloads" incoming opal_buffer_t objects during transmission,
>   thus preventing use of OBJ_RETAIN in the code when repeatedly sending
>   the same message to multiple recipients (see the sketch just below this
>   list)
>
> * there is no failover mechanism across NICs - if the selected NIC (or
>   its attached switch) fails, we are forced to abort
>
> * only one transport (i.e., component) can be "active"
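
To make the opal_buffer_t point above concrete: the rewrite is meant to
let us pack a message once and send it to many peers by reference
counting. This is only a sketch - send_nb() is a hypothetical stand-in,
not the actual RML call or its signature:

    #include "opal/class/opal_object.h"  /* OBJ_NEW/OBJ_RETAIN/OBJ_RELEASE */
    #include "opal/dss/dss.h"            /* opal_buffer_t, opal_dss */

    /* Hypothetical non-blocking send; stands in for the RML API. */
    extern int send_nb(int peer, opal_buffer_t *buf);

    void multicast(const int *peers, int num_peers, int32_t value)
    {
        opal_buffer_t *buf = OBJ_NEW(opal_buffer_t);
        opal_dss.pack(buf, &value, 1, OPAL_INT32);   /* pack once */

        for (int i = 0; i < num_peers; i++) {
            OBJ_RETAIN(buf);         /* one reference per pending send */
            send_nb(peers[i], buf);  /* completion does an OBJ_RELEASE */
        }
        OBJ_RELEASE(buf);            /* drop our own reference */
    }
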
> The revised OOB resolves these problems:
>
> * async progress is used for all application processes, with the progress
>   thread blocking in the event library
>
> * each available NIC is supported by its own TCP module. The ability to
>   asynchronously progress each module independently is provided, but not
>   enabled by default (a runtime MCA parameter turns it "on")
>
> * multi-address NICs (e.g., a NIC with both an IPv4 and an IPv6 address,
>   or with virtual interfaces) are supported - reachability is determined
>   by comparing the contact info for a peer against all addresses within
>   the range covered by the address/mask pairs for the NIC (a subnet-match
>   sketch follows the quoted text)
>
> * a message that arrives on one NIC is automatically shifted to whatever
>   NIC is connected to the next "hop" if that peer cannot be reached via
>   the incoming NIC. If no TCP module can reach the peer, the OOB attempts
>   to send the message via all other available components - if none can
>   reach the peer, an "error" is reported back to the RML, which then
>   calls the errmgr for instructions
>
> * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN, as
>   we no longer "unload" the incoming object
>
> * NIC failure is reported to the TCP component, which then tries to
>   resend the message across any other available TCP NIC. If that doesn't
>   work, the message is given back to the OOB base to try other
>   components. If all of that fails, the error is reported to the RML,
>   which reports to the errmgr for instructions
>
> * obviously from the above, multiple OOB components (e.g., TCP and UD)
>   can be active in parallel
>
> * the matching code has been moved to the RML (and out of the OOB/TCP
>   component) so it is independent of transport
>
> * routing is done by the individual OOB modules (as opposed to the RML).
>   Thus, both routed and non-routed transports can be active
>   simultaneously
>
> * all blocking send/recv APIs have been removed. Everything operates
>   asynchronously.
>
> KNOWN LIMITATIONS:
>
> * although provision is made for component failover as described above,
>   the code for doing so has not been implemented yet. At the moment, if
>   all connections for a given component fail, the errmgr is notified of a
>   "lost connection", which by default results in termination of the job
>   if the connection was a lifeline
>
> * the IPv6 code is present and compiles, but has not been tested, as I
>   don't have access to an IPv6-enabled cluster
>
> * routing is performed at the individual module level, yet the active
>   routed component is selected on a global basis. We should probably
>   update that to reflect that different transports may need/choose to
>   route in different ways
>
> * obviously, not every error path has been tested, nor is every one
>   necessarily covered
>
> * determining abnormal termination is more challenging than in the old
>   code, as we now potentially have multiple ways of connecting to a
>   process. Ideally, we would declare "connection failed" only when *all*
>   transports can no longer reach the process, but that requires some
>   additional (possibly complex) code. For now, the code replicates the
>   old behavior, only somewhat modified - i.e., if a module sees its
>   connection fail, it checks whether that connection is a lifeline. If
>   so, it notifies the errmgr that the lifeline is lost - otherwise, it
>   notifies the errmgr that a non-lifeline connection was lost
>
> * reachability is determined solely on the basis of a shared subnet
>   address/mask - more sophisticated algorithms (e.g., the one used in the
>   TCP BTL) are required to handle routing via gateways
>
> * the RML needs to assign sequence numbers to each message on a per-peer
>   basis. The receiving RML will then deliver messages in order, thus
>   preventing out-of-order delivery when messages travel across different
>   transports, or when a message is redirected/resent due to failure of a
>   NIC (an ordering sketch follows the quoted text)
>
> The code is in https://svn.open-mpi.org/svn/ompi/tmp-public/oob2. It
> isn't fully done yet (I'm still working on the above "limitations"), but
> I wanted to provide as much time as possible for the RFC and begin the
> review process as soon as possible.
>
> I will be providing a "theory of operation" on the wiki. I'm somewhat
> hampered by an injury to one arm, so it will take a bit for me to
> complete it. In brief, the primary design point is that all operations
> are executed within events (a small libevent sketch follows below). This
> avoids the need to turn "on" OPAL thread support, thus allowing ORTE to
> provide async progress and thread safety without impacting the
> performance of the MPI layer itself. However, it means you have to be
> aware of which event base you are in and only access the data owned by
> that base.
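
A few sketches to make the points above concrete. First, the reachability
test: the address/mask comparison amounts to a subnet match. The sketch
below is illustrative (IPv4 only, names are mine) and deliberately shows
the stated limitation - no gateway handling:

    #include <stdbool.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    /* A peer is deemed reachable from a local interface when both
     * addresses fall within the same subnet, per the interface's
     * address/mask pair. Addresses are in network byte order. */
    static bool same_subnet(uint32_t local_addr, uint32_t peer_addr,
                            unsigned prefix_len)
    {
        uint32_t mask = (0 == prefix_len) ? 0
                      : htonl(~0u << (32 - prefix_len));
        return (local_addr & mask) == (peer_addr & mask);
    }

    /* Example: 192.168.1.10/24 reaches 192.168.1.42, but not 10.0.0.5. */
    static bool reachable(const char *local, const char *peer,
                          unsigned prefix_len)
    {
        struct in_addr l, p;
        inet_pton(AF_INET, local, &l);
        inet_pton(AF_INET, peer, &p);
        return same_subnet(l.s_addr, p.s_addr, prefix_len);
    }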
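
Second, the per-peer sequence numbering planned for the RML. A minimal
reorder buffer looks roughly like this - WINDOW, deliver(), and the struct
are assumptions of the sketch, not the RML's actual interface:

    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW 64               /* assumed max out-of-order distance */

    typedef struct {
        uint32_t next_expected;     /* next seqno to hand upward */
        void    *held[WINDOW];      /* early arrivals, parked by seqno */
    } peer_order_t;                 /* assumed zero-initialized per peer */

    extern void deliver(void *msg); /* assumed upcall to the application */

    /* The sender stamps each message with the next per-peer seqno; the
     * receiver parks anything that arrives ahead of the one it expects,
     * then drains consecutively numbered messages once the gap fills. */
    void recv_in_order(peer_order_t *p, uint32_t seqno, void *msg)
    {
        if (seqno != p->next_expected) {
            p->held[seqno % WINDOW] = msg;   /* arrived early - park it */
            return;
        }
        deliver(msg);
        p->next_expected++;
        while (NULL != p->held[p->next_expected % WINDOW]) {
            void *m = p->held[p->next_expected % WINDOW];
            p->held[p->next_expected % WINDOW] = NULL;
            deliver(m);
            p->next_expected++;
        }
    }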
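
Finally, the "everything runs in an event" rule. ORTE wraps libevent, so
the names below differ from the actual code; this raw-libevent sketch just
shows the pattern - never touch another base's data directly, but activate
an event in the owning base so the work runs in its progress thread:

    #include <event2/event.h>
    #include <stdlib.h>

    struct request {
        struct event *ev;   /* so the callback can free its own event */
        int peer;           /* illustrative payload */
    };

    /* Runs inside the owning base's progress thread, so that module's
     * data can be touched without locks. */
    static void do_send(evutil_socket_t fd, short what, void *arg)
    {
        struct request *req = arg;
        /* ... queue the message for req->peer here ... */
        event_free(req->ev);
        free(req);
    }

    /* Called from any thread: package the request and hand it to the
     * OOB's event base rather than touching OOB state directly. An fd
     * of -1 means the event fires only when activated by hand. */
    void post_send(struct event_base *oob_base, int peer)
    {
        struct request *req = malloc(sizeof(*req));
        if (NULL == req) return;
        req->peer = peer;
        req->ev = event_new(oob_base, -1, 0, do_send, req);
        event_active(req->ev, EV_WRITE, 1);
    }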