I have placed a tarball of this branch for those willing to MTT it: http://www.open-mpi.org/~rhc/openmpi-1.9.tar.bz2
I will update it if/when major changes are made.

On May 13, 2013, at 9:00 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Hi folks
>
> As most of you know, I have been working for quite some time on rewriting
> the OOB. It is now getting close to being ready to be committed.
>
> Unfortunately, I am changing jobs on May 20th (starting a position at
> Intel), which will cause a hopefully short "service interruption" in my
> ability to contribute code to OMPI. I have started the legal paperwork to
> resolve that situation and have the backing of my new management, but
> these things always take time.
>
> Ordinarily, I would simply hold off the commit until the paperwork was
> completed. However, after talking with a few people in the community, the
> changes are important and desirable enough to get this into the trunk
> without the indefinite delay. I can continue to help debug even after my
> status changes - I just cannot directly contribute code. So I have
> committed the code to the OMPI repository in a public temporary branch.
> Once the community believes the code is ready, Jeff can merge it back to
> the trunk if I'm not able to do so.
>
> WHAT: Rewrite of ORTE OOB
>
> WHY: Support asynchronous progress and a host of other features
>
> WHEN: TBD (will discuss at weekly telecon and/or on mailing list)
>
> SYNOPSIS:
> The current OOB has served us well, but a number of limitations have been
> identified over the years. Specifically:
>
> * it is only progressed when called via opal_progress, which can lead to
>   hangs or recursive calls into libevent (which is not supported by that
>   code)
>
> * we've had issues when multiple NICs are available, as the code doesn't
>   "shift" messages between transports - thus, all nodes had to be
>   reachable via the same TCP interface
>
> * the OOB "unloads" incoming opal_buffer_t objects during transmission,
>   thus preventing use of OBJ_RETAIN in the code when repeatedly sending
>   the same message to multiple recipients (see the sketch just below this
>   list)
>
> * there is no failover mechanism across NICs - if the selected NIC (or
>   its attached switch) fails, we are forced to abort
>
> * only one transport (i.e., component) can be "active"
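
To make the opal_buffer_t point above concrete: the rewrite is meant to
let us pack a message once and send it to many peers by reference
counting. This is only a sketch - send_nb() is a hypothetical stand-in,
not the actual RML call or its signature:

    #include "opal/class/opal_object.h"  /* OBJ_NEW/OBJ_RETAIN/OBJ_RELEASE */
    #include "opal/dss/dss.h"            /* opal_buffer_t, opal_dss */

    /* Hypothetical non-blocking send; stands in for the RML API. */
    extern int send_nb(int peer, opal_buffer_t *buf);

    void multicast(const int *peers, int num_peers, int32_t value)
    {
        opal_buffer_t *buf = OBJ_NEW(opal_buffer_t);
        opal_dss.pack(buf, &value, 1, OPAL_INT32);   /* pack once */

        for (int i = 0; i < num_peers; i++) {
            OBJ_RETAIN(buf);         /* one reference per pending send */
            send_nb(peers[i], buf);  /* completion does an OBJ_RELEASE */
        }
        OBJ_RELEASE(buf);            /* drop our own reference */
    }
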
> The revised OOB resolves these problems:
>
> * async progress is used for all application processes, with the progress
>   thread blocking in the event library
>
> * each available NIC is supported by its own TCP module. The ability to
>   asynchronously progress each module independently is provided, but not
>   enabled by default (a runtime MCA parameter turns it "on")
>
> * multi-address NICs (e.g., a NIC with both an IPv4 and an IPv6 address,
>   or with virtual interfaces) are supported - reachability is determined
>   by comparing the contact info for a peer against all addresses within
>   the range covered by the address/mask pairs for the NIC (a subnet-match
>   sketch follows the quoted text)
>
> * a message that arrives on one NIC is automatically shifted to whatever
>   NIC is connected to the next "hop" if that peer cannot be reached via
>   the incoming NIC. If no TCP module can reach the peer, the OOB attempts
>   to send the message via all other available components - if none can
>   reach the peer, an "error" is reported back to the RML, which then
>   calls the errmgr for instructions
>
> * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN, as
>   we no longer "unload" the incoming object
>
> * NIC failure is reported to the TCP component, which then tries to
>   resend the message across any other available TCP NIC. If that doesn't
>   work, the message is given back to the OOB base to try other
>   components. If all of that fails, the error is reported to the RML,
>   which reports to the errmgr for instructions
>
> * obviously from the above, multiple OOB components (e.g., TCP and UD)
>   can be active in parallel
>
> * the matching code has been moved to the RML (and out of the OOB/TCP
>   component) so it is independent of transport
>
> * routing is done by the individual OOB modules (as opposed to the RML).
>   Thus, both routed and non-routed transports can be active
>   simultaneously
>
> * all blocking send/recv APIs have been removed. Everything operates
>   asynchronously.
>
> KNOWN LIMITATIONS:
>
> * although provision is made for component failover as described above,
>   the code for doing so has not been implemented yet. At the moment, if
>   all connections for a given component fail, the errmgr is notified of a
>   "lost connection", which by default results in termination of the job
>   if the connection was a lifeline
>
> * the IPv6 code is present and compiles, but has not been tested, as I
>   don't have access to an IPv6-enabled cluster
>
> * routing is performed at the individual module level, yet the active
>   routed component is selected on a global basis. We should probably
>   update that to reflect that different transports may need/choose to
>   route in different ways
>
> * obviously, not every error path has been tested, nor is every one
>   necessarily covered
>
> * determining abnormal termination is more challenging than in the old
>   code, as we now potentially have multiple ways of connecting to a
>   process. Ideally, we would declare "connection failed" only when *all*
>   transports can no longer reach the process, but that requires some
>   additional (possibly complex) code. For now, the code replicates the
>   old behavior, only somewhat modified - i.e., if a module sees its
>   connection fail, it checks whether that connection is a lifeline. If
>   so, it notifies the errmgr that the lifeline is lost - otherwise, it
>   notifies the errmgr that a non-lifeline connection was lost
>
> * reachability is determined solely on the basis of a shared subnet
>   address/mask - more sophisticated algorithms (e.g., the one used in the
>   TCP BTL) are required to handle routing via gateways
>
> * the RML needs to assign sequence numbers to each message on a per-peer
>   basis. The receiving RML will then deliver messages in order, thus
>   preventing out-of-order delivery when messages travel across different
>   transports, or when a message is redirected/resent due to failure of a
>   NIC (an ordering sketch follows the quoted text)
>
> The code is in https://svn.open-mpi.org/svn/ompi/tmp-public/oob2. It
> isn't fully done yet (I'm still working on the above "limitations"), but
> I wanted to provide as much time as possible for the RFC and begin the
> review process as soon as possible.
>
> I will be providing a "theory of operation" on the wiki. I'm somewhat
> hampered by an injury to one arm, so it will take a bit for me to
> complete it. In brief, the primary design point is that all operations
> are executed within events (a small libevent sketch follows below). This
> avoids the need to turn "on" OPAL thread support, thus allowing ORTE to
> provide async progress and thread safety without impacting the
> performance of the MPI layer itself. However, it means you have to be
> aware of which event base you are in and only access the data owned by
> that base.
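
A few sketches to make the points above concrete. First, the reachability
test: the address/mask comparison amounts to a subnet match. The sketch
below is illustrative (IPv4 only, names are mine) and deliberately shows
the stated limitation - no gateway handling:

    #include <stdbool.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    /* A peer is deemed reachable from a local interface when both
     * addresses fall within the same subnet, per the interface's
     * address/mask pair. Addresses are in network byte order. */
    static bool same_subnet(uint32_t local_addr, uint32_t peer_addr,
                            unsigned prefix_len)
    {
        uint32_t mask = (0 == prefix_len) ? 0
                      : htonl(~0u << (32 - prefix_len));
        return (local_addr & mask) == (peer_addr & mask);
    }

    /* Example: 192.168.1.10/24 reaches 192.168.1.42, but not 10.0.0.5. */
    static bool reachable(const char *local, const char *peer,
                          unsigned prefix_len)
    {
        struct in_addr l, p;
        inet_pton(AF_INET, local, &l);
        inet_pton(AF_INET, peer, &p);
        return same_subnet(l.s_addr, p.s_addr, prefix_len);
    }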
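
Second, the per-peer sequence numbering planned for the RML. A minimal
reorder buffer looks roughly like this - WINDOW, deliver(), and the struct
are assumptions of the sketch, not the RML's actual interface:

    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW 64               /* assumed max out-of-order distance */

    typedef struct {
        uint32_t next_expected;     /* next seqno to hand upward */
        void    *held[WINDOW];      /* early arrivals, parked by seqno */
    } peer_order_t;                 /* assumed zero-initialized per peer */

    extern void deliver(void *msg); /* assumed upcall to the application */

    /* The sender stamps each message with the next per-peer seqno; the
     * receiver parks anything that arrives ahead of the one it expects,
     * then drains consecutively numbered messages once the gap fills. */
    void recv_in_order(peer_order_t *p, uint32_t seqno, void *msg)
    {
        if (seqno != p->next_expected) {
            p->held[seqno % WINDOW] = msg;   /* arrived early - park it */
            return;
        }
        deliver(msg);
        p->next_expected++;
        while (NULL != p->held[p->next_expected % WINDOW]) {
            void *m = p->held[p->next_expected % WINDOW];
            p->held[p->next_expected % WINDOW] = NULL;
            deliver(m);
            p->next_expected++;
        }
    }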
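
Finally, the "everything runs in an event" rule. ORTE wraps libevent, so
the names below differ from the actual code; this raw-libevent sketch just
shows the pattern - never touch another base's data directly, but activate
an event in the owning base so the work runs in its progress thread:

    #include <event2/event.h>
    #include <stdlib.h>

    struct request {
        struct event *ev;   /* so the callback can free its own event */
        int peer;           /* illustrative payload */
    };

    /* Runs inside the owning base's progress thread, so that module's
     * data can be touched without locks. */
    static void do_send(evutil_socket_t fd, short what, void *arg)
    {
        struct request *req = arg;
        /* ... queue the message for req->peer here ... */
        event_free(req->ev);
        free(req);
    }

    /* Called from any thread: package the request and hand it to the
     * OOB's event base rather than touching OOB state directly. An fd
     * of -1 means the event fires only when activated by hand. */
    void post_send(struct event_base *oob_base, int peer)
    {
        struct request *req = malloc(sizeof(*req));
        if (NULL == req) return;
        req->peer = peer;
        req->ev = event_new(oob_base, -1, 0, do_send, req);
        event_active(req->ev, EV_WRITE, 1);
    }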