I have the branch complete for executing this - please see https://bitbucket.org/rhc/ompi-scale
Timeout set to Feb 4th after that week's telecon

On Jan 17, 2014, at 9:57 AM, Ralph Castain <r...@open-mpi.org> wrote:

> After discussion on the telecon, we decided to:
>
> 1. let the modex be non-blocking so we can fall thru - only when the
> corresponding MCA param is set!
>
> 2. do not modify the modex_recv to add the callback as the MPI layer really
> doesn't know how to handle this in an async fashion. Modifying that behavior
> would be difficult and could wind up impacting the critical path - something
> we may decide to look into more at a later time
>
> So we will block in a call to modex_recv until the info for that particular
> process can be obtained. I'll add a timeout feature (via yet another MCA
> param) so we can gracefully recover if the remote proc never answers for some
> reason.
>
> Will provide an update when this is ready
>
>
> On Jan 13, 2014, at 3:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> What I want to do is make the current "modex" become a no-op, which means we
>> have a lazy modex. As I noted in my commit message, this scales horribly if
>> we don't, hence the MCA param requirement so people don't do this unless
>> their BTL/MTLs will support it.
>>
>> What I found when playing with that arrangement is that a BTL/MTL is going
>> to need or want data at first message, but that data may not be available
>> yet because that particular remote proc hasn't registered all of its modex
>> data yet. A beautiful race condition. So I was forced to block everyone at
>> "modex" just to ensure the remote data was available at time of request.
>>
>> If I remove the global "barrier" requirement, then I didn't want to "block"
>> on modex_recv as this is done on a per-proc basis. Even though one proc
>> isn't ready to return the data, another might be - and so I'd let you queue
>> up as many modex_recv calls as you like, resolving each of them as data
>> becomes available. This leaves the MPI layer free to send a message as soon
>> as the target remote proc is ready, without waiting for some other proc to
>> register its modex info.
>>
>> Make sense?
>>
>>
>> On Mon, Jan 13, 2014 at 12:05 PM, Barrett, Brian W <bwba...@sandia.gov> wrote:
>> Is there any place that this can actually be used? It's a fairly large
>> change to the RTE interface (which we should try to keep stable), and I
>> can't convince myself that it's useful; in general, if a BTL or MTL is
>> asking for a piece of data, the MPI library is stuck until that data's
>> available. I can see doing some lazy approach, but I can't see making the
>> modex_recv call non-blocking.
>>
>> Brian
>>
>> On 1/11/14 9:28 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>
>> >NOTE: This will involve a change to the MPI-RTE interface
>> >
>> >WHAT: Modify modex_recv to add a callback function that will return the
>> >      requested data when it is available
>> >
>> >WHY:  Enable faster startup on large scale systems by eliminating the
>> >      current mandatory modex barrier during MPI_Init
>> >
>> >HOW:  The ompi_modex_recv functions will have callback function and
>> >      (void*)cbdata arguments added to them. An ompi_modex_recv_t struct
>> >      will be defined that includes a pointer to the returned data plus a
>> >      "bool active" that can be used to detect when the data has been
>> >      returned if blocking is required.
>> >
>> >      When a modex_recv is issued, ORTE will check for the presence of
>> >      the requested data and immediately issue a callback if the data is
>> >      available. If the data is not available, then ORTE will request the
>> >      data from the remote process, and execute the callback when the
>> >      remote process returns it.
>> >
>> >      The current behavior of a blocking modex barrier will remain the
>> >      default - the new behavior will only take effect if specifically
>> >      requested by the user via MCA param. With this new behavior, the
>> >      current call to "modex" in MPI_Init will become a "no-op" when the
>> >      processes are launched via mpirun - this will be executed in ORTE
>> >      so that other RTEs that do not wish to support async modex behavior
>> >      are not impacted.
>> >
>> >WHEN: No hurry on this as it is intended for 1.9, so let's say mid Feb.
>> >      Info on a branch will be made available in the near future.
>> >
>> >
>> >_______________________________________________
>> >devel mailing list
>> >de...@open-mpi.org
>> >http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Brian W. Barrett
>> Scalable System Software Group
>> Sandia National Laboratories
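
For readers following the thread, the two designs discussed above (the callback-style modex_recv from the original RFC, and the blocking call with a timeout agreed on after the telecon) can be sketched roughly as below. This is a toy, single-process illustration only. Every name in it (demo_modex_recv_t, demo_modex_recv_nb, demo_modex_post, demo_modex_recv, the DEMO_* codes) is hypothetical and is not the OMPI/ORTE interface; the in-memory table merely stands in for data arriving from a remote proc, and the real timeout would presumably come from an MCA param rather than a function argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical names throughout: a sketch, NOT the Open MPI API. */
#define DEMO_MAX_PROCS 8
enum { DEMO_SUCCESS = 0, DEMO_ERROR = -1, DEMO_TIMEOUT = 1 };

typedef struct demo_modex_recv demo_modex_recv_t;
typedef void (*demo_modex_cbfunc_t)(demo_modex_recv_t *req, void *cbdata);

struct demo_modex_recv {
    const char *data;           /* returned data, NULL until resolved */
    volatile bool active;       /* true while the request is outstanding */
    demo_modex_cbfunc_t cbfunc; /* optional completion callback */
    void *cbdata;               /* opaque pointer handed back to cbfunc */
};

/* Toy stand-in for the runtime: published data and one pending request per peer. */
static const char *demo_store[DEMO_MAX_PROCS];
static demo_modex_recv_t *demo_pending[DEMO_MAX_PROCS];

static void demo_complete(demo_modex_recv_t *req, const char *data)
{
    req->data = data;
    req->active = false;
    if (req->cbfunc != NULL) {
        req->cbfunc(req, req->cbdata);
    }
}

/* Non-blocking request: call back at once if the data is present, else queue. */
int demo_modex_recv_nb(int peer, demo_modex_recv_t *req,
                       demo_modex_cbfunc_t cbfunc, void *cbdata)
{
    assert(req != NULL);
    if (peer < 0 || peer >= DEMO_MAX_PROCS || demo_pending[peer] != NULL) {
        return DEMO_ERROR;
    }
    req->data = NULL;
    req->active = true;
    req->cbfunc = cbfunc;
    req->cbdata = cbdata;
    if (demo_store[peer] != NULL) {
        demo_complete(req, demo_store[peer]);
    } else {
        demo_pending[peer] = req;
    }
    return DEMO_SUCCESS;
}

/* Simulates the remote proc's data arriving; resolves a queued request. */
void demo_modex_post(int peer, const char *data)
{
    if (peer < 0 || peer >= DEMO_MAX_PROCS) {
        return;
    }
    demo_store[peer] = data;
    if (demo_pending[peer] != NULL) {
        demo_modex_recv_t *req = demo_pending[peer];
        demo_pending[peer] = NULL;
        demo_complete(req, data);
    }
}

/* Blocking variant: spin on "active" until resolved or the timeout expires. */
int demo_modex_recv(int peer, const char **data, double timeout_sec,
                    void (*progress)(void))
{
    demo_modex_recv_t req;
    time_t start = time(NULL);

    if (demo_modex_recv_nb(peer, &req, NULL, NULL) != DEMO_SUCCESS) {
        return DEMO_ERROR;
    }
    while (req.active) {
        if (difftime(time(NULL), start) >= timeout_sec) {
            /* Withdraw the request: req lives on this stack frame. */
            if (demo_pending[peer] == &req) {
                demo_pending[peer] = NULL;
            }
            return DEMO_TIMEOUT;
        }
        if (progress != NULL) {
            progress(); /* hypothetical hook that lets posts arrive */
        }
    }
    *data = req.data;
    return DEMO_SUCCESS;
}

/* Example callback: counts completions through cbdata. */
void demo_count_cb(demo_modex_recv_t *req, void *cbdata)
{
    (void)req;
    ++*(int *)cbdata;
}
```

In this sketch a caller could queue demo_modex_recv_nb requests for several peers and have each one resolved independently as demo_modex_post delivers that peer's data, which is the per-proc behavior described in the Jan 13 reply. demo_modex_recv shows the alternative that was adopted: block on a single peer, but return DEMO_TIMEOUT instead of hanging if that peer never answers.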