>>
>> Do we really need a complete node map? As far as I can tell, it looks
>> like the MPI layer only needs a list of local processes. So maybe it
>> would be better to forget about the node ids at the mpi layer and just
>> return the local procs.
>
> I agree, though I don't think we want a parallel list of procs. We just
> need to set the "local" flag in the existing ompi_proc_t structures.
>
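Just so we're talking about the same thing, I read the flag idea as
something like the rough sketch below. The type and field names are made
up for illustration only - not the real ompi_proc_t or any existing ORTE
call:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative stand-ins only - not the real ompi_proc_t/orte_proc_t. */
    typedef struct {
        uint32_t vpid;           /* rank of the proc within the job */
        bool     proc_is_local;  /* the "local" flag being discussed */
    } toy_proc_t;

    /* Mark every proc whose vpid appears in the list of local ranks the
     * RTE hands us.  How that list reaches us (modex, env var, daemon,
     * ALPS, ...) is exactly the open question. */
    static void mark_local_procs(toy_proc_t *procs, size_t nprocs,
                                 const uint32_t *local_vpids, size_t nlocal)
    {
        for (size_t i = 0; i < nprocs; ++i) {
            procs[i].proc_is_local = false;
            for (size_t j = 0; j < nlocal; ++j) {
                if (procs[i].vpid == local_vpids[j]) {
                    procs[i].proc_is_local = true;
                    break;
                }
            }
        }
    }
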
Having a parallel list of procs makes perfect sense. That way ORTE can
store ORTE information in the orte_proc_t and OMPI can store OMPI
information in the ompi_proc_t. The ompi_proc_t could either "inherit"
the orte_proc_t or have a pointer to it, so that we have no duplication
of data.

Having a global map also makes sense, particularly for numerous
communication scenarios: if I know which processes are on the same node,
I can send a message to the lowest "vpid" on that node and it can then
forward to everyone else there.

> One option is for the RTE to just pass in an enviro variable with a
> comma-separated list of your local ranks, but that creates a problem
> down the road when trying to integrate tighter with systems like SLURM
> where the procs would get mass-launched (so the environment has to be
> the same for all of them).
>

Having an enviro variable with a comma-separated list of local ranks
seems like a bit of a hack to me.

>>
>> So my vote would be to leave the modex alone, but remove the node id,
>> and add a function to get the list of local procs. It doesn't matter to
>> me how the RTE implements that.
>
> I think we would need to be careful here that we don't create a need
> for more communication. We have two functions currently in the modex:
>
> 1. how to exchange the info required to populate the ompi_proc_t
> structures; and
>
> 2. how to identify which of those procs are "local"
>
> The problem with leaving the modex as it currently sits is that some
> environments require a different mechanism for exchanging the
> ompi_proc_t info. While most can use the RML, some can't. The same
> division of capabilities applies to getting the "local" info, so it
> makes sense to me to put the modex in a framework.
>
> Otherwise, we wind up with a bunch of #if's in the code to support
> environments like the Cray. I believe the mca system was put in place
> precisely to avoid those kinds of practices, so it makes sense to me to
> take advantage of it.
>
>
>>
>> Alternatively, if we did a process attribute system we could just use
>> predefined attributes, and the runtime can get each process's node id
>> however it wants.
>
> Same problem as above, isn't it? Probably ignorance on my part, but it
> seems to me that we simply exchange a modex framework for an attribute
> framework (since each environment would have to get the attribute
> values in a different manner) - don't we?
>
> I have no problem with using attributes instead of the modex, but the
> issue appears to be the same either way - you still need a framework to
> handle the different methods.
>
>
> Ralph
>
>>
>> Tim
>>
>> Ralph H Castain wrote:
>>> IV. RTE/MPI relative modex responsibilities
>>> The modex operation conducted during MPI_Init currently involves the
>>> exchange of two critical pieces of information:
>>>
>>> 1. the location (i.e., node) of each process in my job so I can
>>> determine who shares a node with me. This is subsequently used by the
>>> shared memory subsystem for initialization and message routing; and
>>>
>>> 2. BTL contact info for each process in my job.
>>>
>>> During our recent efforts to further abstract the RTE from the MPI
>>> layer, we pushed responsibility for both pieces of information into
>>> the MPI layer. This wasn't done capriciously - the modex has always
>>> included the exchange of both pieces of information, and we chose not
>>> to disturb that situation.
>>>
>>> However, the mixing of these two functional requirements does cause
>>> problems when dealing with an environment such as the Cray where BTL
>>> information is "exchanged" via an entirely different mechanism. In
>>> addition, it has been noted that the RTE (and not the MPI layer)
>>> actually "knows" the node location for each process.
>>>
>>> Hence, questions have been raised as to whether:
>>>
>>> (a) the modex should be built into a framework to allow multiple BTL
>>> exchange mechanisms to be supported, or some alternative mechanism be
>>> used - one suggestion made was to implement an MPICH-like attribute
>>> exchange; and
>>>
>>> (b) the RTE should absorb responsibility for providing a "node map"
>>> of the processes in a job (note: the modex may -use- that info, but
>>> would no longer be required to exchange it). This has a number of
>>> implications that need to be carefully considered - e.g., the memory
>>> required to store the node map in every process is non-zero. On the
>>> other hand:
>>>
>>> (i) every proc already -does- store the node for every proc - it is
>>> simply stored in the ompi_proc_t structures as opposed to somewhere
>>> in the RTE. We would want to avoid duplicating that storage, but
>>> there would be no change in memory footprint if done carefully.
>>>
>>> (ii) every daemon already knows the node map for the job, so
>>> communicating that info to its local procs may not prove a major
>>> burden. However, the very environments where this subject may be an
>>> issue (e.g., the Cray) do not use our daemons, so some alternative
>>> mechanism for obtaining the info would be required.
>>>
>>>
>>> So the questions to be considered here are:
>>>
>>> (a) do we leave the current modex "as-is", to include exchange of the
>>> node map, perhaps including "#if" statements to support different
>>> exchange mechanisms?
>>>
>>> (b) do we separate the two functions currently in the modex and push
>>> the requirement to obtain a node map into the RTE? If so, how do we
>>> want the MPI layer to retrieve that info so we avoid increasing our
>>> memory footprint?
>>>
>>> (c) do we create a separate modex framework for handling the
>>> different exchange mechanisms for BTL info, do we incorporate it into
>>> an existing one (if so, which one - perhaps the new publish-subscribe
>>> framework), implement an alternative approach, or...?
>>>
>>> (d) other suggestions?
>>>
>>> Ralph
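
To make (b) and (c) a little more concrete, the kind of thing I am
picturing is a component interface along these lines, with each
environment (RML-based, Cray, etc.) supplying its own implementation
behind the usual MCA selection. All the names below are hypothetical -
this is only a sketch of the shape, not a proposal for the actual API:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical modex component interface - one implementation per
     * environment (RML-based, Cray, SLURM, ...), selected through the
     * normal MCA machinery. */
    typedef struct {
        /* Exchange whatever is needed to populate the proc structures,
         * including BTL contact info, using this environment's mechanism. */
        int (*exchange_proc_info)(void);

        /* Return the vpids of the procs that share my node.  The RTE can
         * obtain this however it likes (daemon, resource manager, env var);
         * the MPI layer never has to see a full node map.  The caller is
         * responsible for freeing *vpids. */
        int (*get_local_vpids)(uint32_t **vpids, size_t *nvpids);

        /* Tear down anything the component set up. */
        int (*finalize)(void);
    } toy_modex_module_t;

With something like that, the #if special cases for the Cray would
collapse into just another component.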