Yo all,

Over the last 12-18 months, several of us (both inside and outside the Open MPI community) have discussed a variety of methods for making OpenRTE considerably faster - i.e., changes that would decrease launch times by at least an order of magnitude. While we documented the results, our general feeling has been to hold off on any implementation, as the required changes would have compromised some features that users outside the Open MPI community might have wished to exploit.
In recent months, however, the non-Open MPI users have largely decided to pursue other options. There are a couple of reasons for this, but they are irrelevant to this discussion. What is relevant is that with the departure of those interests, there is no longer a valid reason not to streamline the system. I have discussed this situation with several members of the Open MPI community, and the strong consensus was to go ahead with the necessary implementation. The changes will cost us a slight decrease in flexibility and programmer friendliness, but preliminary estimates show a potential decrease in launch time of roughly 20x at scale. The cost, therefore, seems worth the gain.

The changes primarily revolve around the use of the GPR. Let me make something clear right away - it is *not* the GPR itself that causes the slowdown, but rather the way we utilize it and the secondary impacts that result from those choices. Yes, the GPR *will* also see a major increase in the speed with which it processes requests, but the primary benefits will come from other areas of the code.

The primary change involves replacing the character string keys used to label data with uint8_t values. The immediate impact of this change is to reduce the size of the STG1 stage gate message - the primary rate-limiting factor in today's launch procedure - by a factor of approximately 15-20x. It means, however, that keys will now have to be defined in a central location (you won't just be able to declare a new string in your component and use it). We will retain some flexibility by extending the name service to support dynamic key definition, a la the current RML tag service. We expect, though, that all ORTE standard keys will be defined in a new orte_schema.h file to avoid the speed impact of registering dynamic keys (especially on remote nodes).
This change also allows us to eliminate all dictionary functions from the GPR, replacing them by simply using the key as a direct index into the GPR storage arrays. This has the immediate benefit of greatly simplifying the GPR internal code (e.g., the search code becomes a simple array index) and provides a corresponding increase in speed.

Similarly, GPR segments will also become simple numeric indices. Tokens that were used to identify containers on a given segment will be replaced by numeric indices as well - for job segments, the index will simply be the vpid of each process. On the job-master segment, the container index will be the jobid. On the node segment, the container index will be the nodeid - a numeric equivalent of each node's character string name. We will assign a numeric id to every node as we allocate it, and use that id in place of the current string nodenames. For those of you who want the string nodename in the proc structures for debugging and user-friendly error messages, we will provide that info based on an MCA param (either the current one or a slight variant - remains TBD). This will allow you to assess the performance impact of retaining those nodenames. In the meantime, ORTE itself will be converted to use the node id for reduced communications and a more efficient interface to the GPR.

Finally, we will further reduce the size of the STG1 message (and any other stage gate messages) by compressing the data stream. First, we will remove the current system of indexing data using process names - the GPR will ensure that data is returned in a container-ordered array. Thus, we can know for certain that the data from each container is being provided to us in sequential order, without having to include an index such as a process or node name. Second, we will remove duplication of data across subscriptions going to a specific process.
The current system simply sends the data requested by each subscription without worrying about any duplication between subscriptions. Hence, we send multiple copies of node names, process names, and other information across the wire as part of the STG1 message. As part of a later stage of this planned change, we will compress that information by dealing with duplication at the local level - i.e., the GPR proxy will maintain a record of duplicate data requests, a single copy of each data element will be sent, and the GPR proxy will handle the duplication at its end.

These changes will be implemented in several phases on a tmp branch. Each phase will be tested across several environments and then brought over to the trunk. The first phase will be the most intrusive, as it will involve the conversion from string to numeric keys, along with the corresponding changes to the GPR. I hope to complete this phase sometime in early July.

Please feel free to offer comments, suggestions, or - if so inclined - assistance. I'll keep the community updated on progress as we go.

Ralph