Hello all

I understand that several people are interested in the OpenRTE scalability issues - this is great! However, it appears we haven't done a very good job of circulating information about the identified causes of the current issues. In the hope of helping people be productive in their contributions, I thought it might be useful to circulate both the data and diagnoses to date, along with the remediation plans developed by those of us working on the issues so far and the status of those efforts.
First, a quick recap so everyone starts from a common knowledge base. We have performed roughly four scalability tests on OpenRTE/Open MPI over the last two years. In each case, we had exclusive use of a large cluster so that we could run large-scale jobs - typically consisting of 500+ nodes and up to 8K processes, operating under either SLURM or TM environments. We have also received some scaling data from efforts at Sun involving Solaris-based systems running under N1GE. The tests showed we could reliably launch to about the 1K process level, but we encountered difficulties when extending significantly beyond that point. The scalability issues generally break down into two categories:

1. Memory consumption. We see a "spike" in memory usage by the HNP early in the launch process that can be quite large. In the earliest tests, we saw GBytes consumed during the launch of ~2K processes, compared to a steady-state usage of ~20MBytes. The spike was caused by the copying of buffers during transmission of OOB messages, combined with the large size of the STG1 xcast message. This was corrected at the time (courtesy of Tim W) by having the OOB *not* copy message buffers, and follow-on tests showed that the memory "spike" had essentially disappeared. However, recent tests indicate that this "fix" may have been lost, or we may now be using a code path that bypasses it (we used to send the xcast messages via blocking sends, but now use non-blocking sends, which follow a slightly different code path). Regardless of the reason, it appears that the copying of buffers has returned, and OpenRTE once again exhibits the GByte memory "spike" on large jobs.

Steady-state memory usage is driven by two things. First, we made a design decision at the beginning of the project to provide maximum system flexibility: there is no overarching control over the data being stored within the system (specifically, within the GPR framework), and each component/framework is free to store whatever its author wants. Second, given the freelance nature of the development of these components, there is some non-trivial duplication of information on the GPR. However, even if you add all of that up, it doesn't amount to a very large number (on the order of a few megabytes at large scale). On the machines of interest to those of us working on the code at the time, the steady-state memory footprint was not a priority issue - hence, little has been done to reduce it.

2. OOB communications. There are two primary issues in this category, both of which lead back to the same core problem. First, the number of TCP socket connections back to the HNP grows linearly with the number of processes. In the most recent tests, Galen reported ~20K sockets being opened on the HNP for an 8K process job running on 4K nodes. If you look at the code, you will find that (a) 8K of those sockets are due to each MPI process connecting directly back to the HNP, and (b) 4K of those sockets are due to the orteds on each node connecting back to the HNP. The remaining 8K sockets appear to be due to a "bug" in the code: from what I can tell so far, either the MPI layer's BTL/TCP component is opening a socket to the HNP, even though the HNP isn't actually part of the MPI job itself, or the processes are opening duplicate OOB sockets back to the HNP. I am not certain which (if either) of these is the root cause, however - further investigation is needed to identify the source of the extra sockets.
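To give a sense of scale on that spike, here is a small illustrative sketch. It is not the actual OOB code, and the 512KByte STG1 payload size is purely an assumption chosen to make the arithmetic easy - but it shows why a copy-per-recipient approach reaches GByte territory at ~2K processes, while a non-copying path stays near the size of a single buffer.

/* Illustrative sketch only - not the actual OOB code, and the STG1 payload
 * size here is an assumption chosen for the arithmetic. It shows why
 * copying the buffer once per recipient produces a launch-time spike that
 * scales with the number of processes, while the steady state stays small. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_PROCS  2048              /* ~2K processes, as in the early tests */
#define STG1_BYTES (512 * 1024)      /* assumed per-process STG1 payload */

int main(void)
{
    char *stg1 = calloc(1, STG1_BYTES);
    size_t peak = 0;

    /* copy-per-send: each queued message holds its own copy until the
     * event loop actually transmits it, so peak usage grows with NUM_PROCS */
    for (int i = 0; i < NUM_PROCS; i++) {
        char *queued = malloc(STG1_BYTES);
        memcpy(queued, stg1, STG1_BYTES);
        peak += STG1_BYTES;
        free(queued);   /* in the real queue this free happens much later */
    }
    printf("peak extra memory if copies pile up: %zu MBytes\n",
           peak / (1024 * 1024));

    /* non-copying path: queue a pointer (with a refcount or completion
     * callback) to the single buffer - peak stays ~0.5 MBytes regardless
     * of NUM_PROCS */
    free(stg1);
    return 0;
}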
The large number of sockets on the HNP causes timeout problems during the initial "connection storm" as processes start up, which subsequently sends the MPI job into "retry hell". To help relieve that problem, a "listener thread" was introduced on the HNP (courtesy of Brian) that can absorb the connections at a much higher rate. This has now been debugged in the current OMPI trunk (and, I believe, was just moved to 1.2.1 for release).

The second issue is the time it takes to transmit the various stage gate messages from the HNP to each MPI process. The only stage gate of concern here is STG1, since that is where substantial data is involved (we send the info required to inform each process of its peers for interconnect purposes). The current OMPI trunk uses a "direct" method - i.e., the HNP sends the stage gate messages directly to each MPI process. We have already implemented two measures to help reduce this part of the problem. First, late last year we revised the GPR notification message system to allow subscribers to eliminate (or at least drastically reduce) the descriptive information sent with the message, and we changed the buffer packing system to likewise eliminate data type descriptions. This significantly reduced the *size* of the message itself. Second, we implemented a routed xcast messaging system that sends the stage gate messages through the local orted, so the *number* of messages being sent dropped by a factor equal to the number of processes per node. Together, these two measures (reducing the message size and routing it through the orted) cut the stage gate time by more than half on our 2ppn test machines. The message size changes are in the OMPI trunk and in 1.2 - however, the xcast routing code remains solely in the ORTE trunk. In addition, the ORTE trunk contains code for each MPI process to create a connection to its local orted - this does not currently exist in the OMPI trunk or the 1.2 release.

The remediation plans currently underway focus primarily on the OOB, as this appears to be the central factor in both observed issues. The primary effort is aimed at creating a general message routing capability for the OOB. Several of us have discussed various design options - as things stand, I owe that group a draft design document (which I'm late in delivering). A routable OOB would yield several immediate benefits. First, it would significantly reduce the connection storm problem and provide a more scalable connection plan for the HNP. Second, it would reduce the memory "spike", since the HNP would be generating far fewer messages. And finally, it would reduce the xcast transmission time (assuming the routing algorithm includes some type of tree-like structure).

The secondary effort is aimed at removing the copying of message buffers within the OOB. There are two issues in this area. First, Tim W originally copied the buffers for protective purposes - e.g., since the OOB queues messages, a non-blocking caller could release the buffer before the OOB actually sends it. The simplest method of protection was to have the OOB retain its own copy, thus ensuring the data was still there when transmission actually occurred. Of course, at the time we were only worrying about launching jobs of a few hundred processes - not thousands. ;-) This practice probably needs to be reviewed against future requirements, although we have to be careful since portions of the code may now *rely* on this behavior.
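To make that protection issue concrete, here is a minimal sketch of the lifetime hazard the defensive copy guards against. The names and structures are made up for illustration - this is not the real OOB/RML interface.

/* Minimal sketch (hypothetical names - not the real OOB interface) of the
 * lifetime hazard behind the defensive copy: with a non-blocking send the
 * message is only queued, so the caller may free or reuse its buffer
 * before the event loop ever transmits it. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct queued_msg {
    char  *data;
    size_t len;
};

/* Queue a private copy of the caller's buffer. Safe against the caller
 * releasing its buffer early - but this is exactly the copy that multiplies
 * memory when one xcast payload fans out to thousands of peers. */
static int queue_with_copy(struct queued_msg *q, const char *buf, size_t len)
{
    q->data = malloc(len);
    if (NULL == q->data) return -1;
    memcpy(q->data, buf, len);
    q->len = len;
    return 0;
}

int main(void)
{
    struct queued_msg q;
    char payload[] = "STG1 payload";

    /* "non-blocking send": the message is merely queued... */
    queue_with_copy(&q, payload, sizeof(payload));

    /* ...and the caller considers itself done with its buffer. If the
     * queue had stored the caller's pointer instead of a copy, whatever
     * the caller does next (free, reuse) would corrupt the pending message. */
    memset(payload, 0, sizeof(payload));

    /* much later, the event loop finally transmits the queued data -
     * still intact only because of the copy */
    printf("transmitting: %s\n", q.data);
    free(q.data);
    return 0;
}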
One possible alternative that has been discussed is to create a new "send_multiple" API that would allow us to pass a single buffer along with an array of recipients to the OOB. In this case, even though the buffer is still copied, it would only be copied one time, since the OOB would control that single copy while cycling through all the recipients. I'm not sure this is the correct approach, but it may merit some thought (a rough sketch is in the P.S. below).

My expectation is that the current ORTE trunk's xcast routing and orted-to-local-proc connection code will move over to the OMPI trunk sometime in the near future (awaiting the lifting of the current OMPI trunk "freeze", which was called to last until 1.2.1 gets out). The OOB revisions have no real schedule at this time - however, both code changes were targeted for the 1.3 release (and specifically were *not* to be released in the 1.2 series, per the Dec decision).

I hope that helps provide some food for thought. Feel free to ask questions - so far, the discussions have involved several people, so you are welcome to just hit the mailing lists.

Ralph
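P.S. For concreteness, here is a rough sketch of what a "send_multiple" interface might look like. The name comes from the discussion above, but the signature and the toy implementation are purely my own illustration - nothing like this exists in the code today.

/* Hypothetical sketch only - "send_multiple" is just the idea under
 * discussion; this signature and toy implementation are illustrative.
 * The point: one internal copy is made and reused for every recipient,
 * instead of one copy per queued send. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int peer_t;   /* stand-in for a real process name/rank */

static int oob_send_multiple(const peer_t *peers, size_t npeers,
                             const void *buf, size_t len)
{
    /* single protective copy, owned by the "OOB" for the duration */
    void *copy = malloc(len);
    if (NULL == copy) return -1;
    memcpy(copy, buf, len);

    for (size_t i = 0; i < npeers; i++) {
        /* real code would queue a non-blocking send of 'copy' to peers[i];
         * here we just pretend */
        printf("queued %zu bytes for peer %d\n", len, peers[i]);
    }

    free(copy);   /* real code would free only after all sends complete */
    return 0;
}

int main(void)
{
    peer_t peers[] = {0, 1, 2, 3};
    const char msg[] = "STG1 payload";

    /* caller can release or reuse its own buffer as soon as this returns */
    return oob_send_multiple(peers, 4, msg, sizeof(msg));
}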