Hello all,

If you are interested in the ongoing scalability work, or in the RML/OOB in ORTE, please read on - otherwise, feel free to hit "delete".

As many of you know, we have been working towards solving several problems that affect our ability to operate at large scale. Some of the required modifications to the code base have recently been applied to the trunk.

We have known since it was originally written over two years ago that the OOB contains some inherent scalability limits. For example, immediately upon opening, the system obtains contact info for all daemons in the universe, opens sockets to them, and sends an initial message to them. It then does the same with all the application processes in its job. As a result, for a 2000 process job running on 500 nodes, each application process will immediately open and communicate across 2501 sockets (2000 procs + 500 daemons [one per node] + the HNP) during the startup phase.

If you really want to imagine some fun, now have that job comm_spawn 500 processes across the 500 nodes, and *don't* reuse daemons. As each new daemon is spawned, every process in the original job (including the original daemons) is notified, loads the new contact info for that daemon, opens a socket to it, and performs an "ack" communication. After all 500 new daemons are running, they launch the 500 new procs, each of which gets the info on 1000 daemons plus the info for 2000 parents and 500 peers, and immediately opens 1000 daemons + 2000 parents + 500 peers + 1 HNP = 3501 sockets!

This was acceptable for small jobs, but causes considerable delay during startup for large jobs. A few other OOB operational characteristics further exacerbate the problem - I will detail those in a document on the wiki to help foster greater understanding.

Jeff Squyres and I are about to begin a major revision of the RML/OOB code to resolve these problems. We will be using a staged approach to the effort:

1. Separate the OOB's actions for loading contact info from actually opening a socket to a process. Currently, the OOB immediately opens a socket and performs an "ack" communication whenever contact info for another process is loaded into it. In addition, the OOB immediately subscribes to the job segment of the provided process, requesting that this process be alerted to *any* change in OOB contact info for any process in that job. These actions need to be separated out.

2. Revise the RML/OOB init/open procedure. These are currently interwoven in a manner that causes the OOB to execute registry operations that are not needed (and actually cause headaches) during orte_init. The procedure will be revised so that connections to the HNP and to the process' local orted are opened, while all other contact info (e.g., for the other procs in the job) is simply loaded into the OOB's contact tables, with no sockets opened until first communication.

3. Revise the xcast procedure so that it relays via the daemons and not the application processes. For systems that do not use our daemons, alternative mechanisms will be developed.

At some point in the future, a fully routable OOB will be developed to remove the need for so many sockets on each application process. For now, these steps should improve our startup time considerably.

With some luck and (hopefully) not too many conflicting priorities, Jeff and I may complete this work by Christmas - more likely, though, is sometime early in January. We will be working on a tmp branch, but you may see some transfer of code to the trunk as we progress.

As always, feel free to comment and/or make suggestions!

Ralph
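
For anyone who wants to check the socket counts quoted above, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration and are not part of ORTE. The counts follow the same convention as the text, which counts every process in the job (the process itself included) plus one socket for the HNP.

```python
def startup_sockets(procs: int, daemons: int, hnp: int = 1) -> int:
    """Sockets one application process opens at startup under the current OOB.

    Hypothetical helper: one socket per process in the job, one per daemon,
    and one for the HNP, as counted in the text.
    """
    return procs + daemons + hnp


def spawned_sockets(parents: int, old_daemons: int, new_daemons: int,
                    peers: int, hnp: int = 1) -> int:
    """Sockets one comm_spawn'd process opens when daemons are not reused."""
    return (old_daemons + new_daemons) + parents + peers + hnp


# 2000 process job on 500 nodes, one daemon per node
initial = startup_sockets(procs=2000, daemons=500)

# comm_spawn of 500 procs across the same 500 nodes, with 500 fresh daemons
spawned = spawned_sockets(parents=2000, old_daemons=500,
                          new_daemons=500, peers=500)

print(initial)  # 2501
print(spawned)  # 3501
```

The point of writing it out is that both totals grow linearly with job size for every single process, so the total number of sockets across the job grows roughly quadratically.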
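
To make stages 1 and 2 a little more concrete, here is a rough Python model of a contact table that separates loading contact info from opening a connection. This is only a sketch of the idea, not the ORTE code (which is C): the class, the method names, the process names, and the URIs are all hypothetical, and the connect function is injected so that nothing touches the network.

```python
from typing import Callable, Dict, List


class ContactTable:
    """Hypothetical model: stores contact info, opens connections lazily."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        self._connect = connect               # injected; no real sockets here
        self._uris: Dict[str, str] = {}       # process name -> contact URI
        self._conns: Dict[str, object] = {}   # process name -> open connection

    def load_contact(self, name: str, uri: str) -> None:
        # Loading contact info only records it; no connection is opened.
        self._uris[name] = uri

    def connection(self, name: str) -> object:
        # Open on first use, then reuse the same connection afterwards.
        if name not in self._conns:
            self._conns[name] = self._connect(self._uris[name])
        return self._conns[name]

    @property
    def open_count(self) -> int:
        return len(self._conns)


opened: List[str] = []  # records every URI the fake connect was asked to open


def fake_connect(uri: str) -> str:
    opened.append(uri)
    return uri


table = ContactTable(fake_connect)
table.load_contact("hnp", "tcp://hnp-host:5000")         # placeholder URI
table.load_contact("local-orted", "tcp://localhost:5001")  # placeholder URI
for rank in range(2000):
    table.load_contact(f"rank-{rank}", f"tcp://node{rank % 500}:{6000 + rank}")

# At init, connect only to the HNP and the local daemon.
table.connection("hnp")
table.connection("local-orted")
after_init = table.open_count

# A peer connection is opened on first communication and reused afterwards.
table.connection("rank-7")
table.connection("rank-7")
after_first_message = table.open_count

print(after_init, after_first_message)  # 2 3
```

In this model the 2000 peer entries cost only a table slot each until something is actually sent to them, which is the behavior the revised init procedure is aiming for.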