Restarting this discussion. A new update of Grid Engine 6.2 will come
out early next year [1], and I really hope that we can at least get
the interface defined.

At the minimum, is it enough for the batch system to tell Open MPI via
an environment variable which core (or virtual core, in the SMT case)
to bind the first MPI task to? An added bonus would be information
about the number of processors to skip (the stride) between sibling
tasks. A stride of one is the usual case, but something larger than
one would let the batch system control how much cache and memory
bandwidth the MPI tasks share.
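To make that concrete, here is a rough sketch of what the consuming
side could look like. The variable names SGE_BINDING_FIRST_CORE and
SGE_BINDING_STRIDE are made up purely for illustration -- they are not
a proposed interface:

    #include <stdlib.h>

    /* Compute the core a task should bind to, given its local rank on
       the host.  Returns -1 if the batch system provided nothing.
       Both variable names below are placeholders, not real SGE vars. */
    static int core_for_local_rank(int local_rank)
    {
        const char *first  = getenv("SGE_BINDING_FIRST_CORE");
        const char *stride = getenv("SGE_BINDING_STRIDE");

        if (NULL == first) {
            return -1;              /* no affinity info from the RM */
        }
        /* A stride of 1 (adjacent cores) is the common case; a larger
           stride spreads tasks across caches/sockets. */
        int step = (NULL != stride) ? atoi(stride) : 1;
        return atoi(first) + local_rank * step;
    }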
Rayson

[1]: http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26002

On 1/11/08, Jeff Squyres <jsquy...@cisco.com> wrote:
> carto is intended more as a discovery mechanism and provider of
> topology information. How various parts of the OMPI code base use
> that information is a different issue.
>
> With regard to processor affinity, there are two general ways of
> doing it:
>
> 1. The resource manager tells us what processors have been allocated
> to us, e.g., via environment variables saying which
> processors/cores/whatever have been allocated to us on a per-host
> basis (set in the environment of the launched applications, and
> therefore possibly different on every host). Open MPI then decides
> how to split up the allocated host processors among all the Open MPI
> processes on that host.
>
> It would be great if SGE could provide some environment variables to us.
>
> 2. The resource manager does all the processor affinity itself.
> SLURM, for example, has a nice command line syntax for all kinds of
> processor affinity stuff in its "srun" command. A traditional
> roadblock here has been that OMPI currently uses the resource
> manager to launch a single "orted" process on each node, and that
> orted, in turn, launches all the MPI processes locally. However,
> there is work in progress to remove this roadblock. If I try to
> describe it, I'm sure I'll get it wrong :-) -- Ralph / IU?
>
> -----
>
> Open MPI will need to be able to tell the difference between #1 and
> #2. So it might be good if the RM always provides the environment
> variables, but uses them to tell us whether the RM did the affinity
> pinning or not. I.e., in #1 you'll get information about all the
> processors that are available -- all the processes on a single host
> will get the same information. In #2, each process will get
> individualized information about where it has been pinned.
>
> Make sense?
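> A sketch of how a process might interpret such variables -- the names
> OMPI_RM_CPU_SET and OMPI_RM_BOUND are invented here purely to show
> the shape of the handshake, not actual variables:
>
>     #include <stdlib.h>
>
>     /* Returns 0: no RM info; 1: per-host allocation provided (case
>        #1); 2: the RM already pinned this process (case #2).
>        Both variable names are hypothetical. */
>     static int rm_affinity_mode(void)
>     {
>         const char *cpus  = getenv("OMPI_RM_CPU_SET"); /* e.g. "0,1,2,3" */
>         const char *bound = getenv("OMPI_RM_BOUND");   /* "1" if RM pinned us */
>
>         if (NULL == cpus) {
>             return 0;   /* fall back to OMPI's existing behavior */
>         }
>         if (NULL != bound && 0 != atoi(bound)) {
>             return 2;   /* case #2: info is per-process; do not re-bind */
>         }
>         return 1;       /* case #1: per-host allocation; OMPI splits it */
>     }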
> On Jan 11, 2008, at 6:22 AM, Pak Lui wrote:
>
> > Hi Rayson,
> >
> > I guess this is an issue only for SGE. I believe there is a
> > framework called 'carto' being developed to represent the
> > node-socket relationship in order to address the multicore issue.
> > Other folks on the team are actively working on it, so they can
> > probably address it better than I can. Here are some descriptions
> > of it on the wiki:
> >
> > https://svn.open-mpi.org/trac/ompi/wiki/OnHostTopologyDescription
> >
> > Rayson Ho wrote:
> >> Hello,
> >>
> >> I'm from the Sun Grid Engine (SGE) project (
> >> http://gridengine.sunsource.net ). I am working on processor
> >> affinity support for SGE.
> >>
> >> In 2005, we had some discussions with Jeff on the SGE mailing list
> >> on this topic. Now that quad-core processors are available from
> >> AMD and Intel, and higher core counts per socket are coming soon,
> >> I would like to see whether we can come up with a simple interface
> >> for the SGE 6.2 release, which will be available in Q2 this year
> >> (or at least get it into an "update" release of SGE 6.2 if we
> >> can't get the changes in on time).
> >>
> >> The discussions we had before:
> >> http://gridengine.sunsource.net/servlets/BrowseList?list=dev&by=thread&from=7081
> >> http://gridengine.sunsource.net/servlets/BrowseList?list=dev&by=thread&from=4803
> >>
> >> I looked at the SGE code; the simplest thing we can do is set an
> >> environment variable telling each task group the processor mask of
> >> the node before we start it. Is that good enough for Open MPI?
> >>
> >> After reading the Open MPI code, I believe what we need to do is
> >> add an else case in ompi/runtime/ompi_mpi_init.c:
> >>
> >>   if (ompi_mpi_paffinity_alone) {
> >>       /* existing behavior: OMPI computes and sets affinity itself */
> >>       ...
> >>   } else {
> >>       /* new: get processor affinity information from the batch
> >>          system via the env var, and bind accordingly */
> >>       ...
> >>   }
> >>
> >> Thanks,
> >> Rayson
> >
> > --
> > - Pak Lui
> > pak....@sun.com
>
> --
> Jeff Squyres
> Cisco Systems
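Coming back to the question at the top: if we go the env-variable
route of Jeff's case #1, the else case above would reduce to roughly
the following. This is a sketch only -- SGE_BINDING_FIRST_CORE and
SGE_BINDING_STRIDE are still placeholder names, and I call Linux's
sched_setaffinity(2) directly just for illustration; inside Open MPI
the binding would presumably go through the paffinity framework
instead:

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_setaffinity, CPU_* -- Linux-only */
    #include <stdlib.h>

    /* Bind the calling process to first_core + local_rank * stride.
       Variable names are placeholders, not an agreed interface. */
    static int bind_from_env(int local_rank)
    {
        const char *first  = getenv("SGE_BINDING_FIRST_CORE");
        const char *stride = getenv("SGE_BINDING_STRIDE");
        cpu_set_t mask;

        if (NULL == first) {
            return -1;              /* nothing from the batch system */
        }
        int step = (NULL != stride) ? atoi(stride) : 1;
        CPU_ZERO(&mask);
        CPU_SET(atoi(first) + local_rank * step, &mask);
        /* pid 0 means "this process" */
        return sched_setaffinity(0, sizeof(mask), &mask);
    }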