Crud - sorry about that. There was no intent to exclude - we just sometimes forget that others who don't monitor the lists with our degree of obsession might also want to participate.
There isn't a whole lot to report beyond what is in the thread, really. George provided a little more explanation of what he did, which helped clarify things. Basically, he defines opal-level accessors to obtain the jobid/vpid fields when required. The RTE "overlays" those (just like we do for show_help) with its own functions. Thus, we stick with the plain vanilla 64-bit unsigned int for the opal_identifier_t, but can still retrieve a uint32_t jobid and uint32_t vpid when required. Meanwhile, the RTE is free to define its process_name_t as anything it likes.

We also agreed to review and try to eliminate any place in the BTLs that actually accesses the jobid and/or vpid, as we probably don't need to do so. So the need for the opal accessors will hopefully go away some day.

We reserve the right to someday reduce the size of the opal_identifier_t, jobid, and vpid as we continue to press the memory footprint issue, but that is for the future.

HTH
Ralph

On May 1, 2014, at 12:55 PM, Vallee, Geoffroy R. <valle...@ornl.gov> wrote:

> Too bad all this happened so fast; otherwise ORNL would have at least
> participated in the call to understand what is going to happen (since we have
> an RTE module that we maintain). Any chance we could have a summary?
>
> Thanks,
>
>
> On May 1, 2014, at 2:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Just to report back to the list: the three of us discussed this at some
>> length, and decided we like George's proposed solution. Looks like a good
>> clean approach that provides flexibility for the future. So we will
>> introduce it when the BTLs move down to OPAL as (a) George already has it
>> implemented there, and (b) we don't really need it before then.
>>
>> Thanks George!
>> Ralph
>>
>>
>> On May 1, 2014, at 9:40 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> wrote:
>>
>>> Done!
>>>
>>> On May 1, 2014, at 11:22 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>>> Apparently we are good today at 2PM EST.
>>>> Fire-up the webex ;)
>>>>
>>>> George.
>>>>
>>>> On May 1, 2014, at 10:35 , Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>>>> wrote:
>>>>
>>>>> http://doodle.com/hhm4yyr76ipcxgk2
>>>>>
>>>>>
>>>>> On May 1, 2014, at 10:25 AM, Ralph Castain <r...@open-mpi.org>
>>>>> wrote:
>>>>>
>>>>>> sure - might be faster that way :-)
>>>>>>
>>>>>> On May 1, 2014, at 6:59 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Want to have a phone call/webex to discuss?
>>>>>>>
>>>>>>>
>>>>>>> On May 1, 2014, at 9:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> The problem we'll have with BTLs in opal is going to revolve around
>>>>>>>> that ompi_process_name_t and will occur in a number of places. I've
>>>>>>>> been trying to grok George's statement about accessors and can't
>>>>>>>> figure out a clean way to make that work IF every RTE gets to define
>>>>>>>> the process name a different way.
>>>>>>>>
>>>>>>>> For example, suppose I define ompi_process_name_t to be a string. I
>>>>>>>> can hash the string down to an opal_identifier_t, but that is a
>>>>>>>> structureless 64-bit value - there is no concept of a jobid or vpid in
>>>>>>>> it. So if you now want to extract a jobid for that identifier, the
>>>>>>>> only way you can do it is to "up-call" back to the RTE to parse it.
>>>>>>>>
>>>>>>>> This means that every RTE would have to initialize OPAL with a
>>>>>>>> registration of its opal_identifier parser function(s), which seems
>>>>>>>> like a really ugly solution.
>>>>>>>>
>>>>>>>> Maybe it is time to shift the process identifier down to the opal
>>>>>>>> layer? If we define opal_identifier_t to include the required
>>>>>>>> jobid/vpid, perhaps adding a void* so someone can put whatever they
>>>>>>>> want in it?
>>>>>>>>
>>>>>>>> Note that I'm not wild about extending the identifier size beyond
>>>>>>>> 64 bits as the memory footprint issue is growing in concern, and I
>>>>>>>> still haven't seen any real use-case proposed for extending it.
>>>>>>>>
>>>>>>>>
>>>>>>>> On May 1, 2014, at 3:41 AM, Jeff Squyres (jsquyres)
>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>>> On Apr 30, 2014, at 10:01 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Why do you need the ompi_process_name_t? Isn’t the opal_identifier_t
>>>>>>>>>> enough to dig for the info of the peer into the opal_db?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> At the moment, I use the ompi_process_name_t for RML sends/receives
>>>>>>>>> in the usnic BTL. I know this will have to change when the BTLs move
>>>>>>>>> down to OPAL (when is that going to happen, BTW?). So my future use
>>>>>>>>> case may be somewhat moot.
>>>>>>>>>
>>>>>>>>> More detail
>>>>>>>>> ===========
>>>>>>>>>
>>>>>>>>> "Why does the usnic BTL use RML sends/receives?", you ask.
>>>>>>>>>
>>>>>>>>> The reason is rooted in the fact that the usnic BTL uses an
>>>>>>>>> unreliable, connectionless transport under the covers. Some
>>>>>>>>> customers had network misconfigurations that resulted in usnic
>>>>>>>>> traffic not flowing properly (e.g., MTU mismatches in the network).
>>>>>>>>> But since we don't have a connection-oriented underlying API that
>>>>>>>>> will eventually timeout/fail to connect/etc. when there's a problem
>>>>>>>>> with the network configuration, we added a "connection validation"
>>>>>>>>> service in the usnic BTL that fires up in a thread in the local rank
>>>>>>>>> 0 on each server. This thread provides service to all the MPI
>>>>>>>>> processes on its server.
>>>>>>>>>
>>>>>>>>> In short: the service thread sends UDP pings and ACKs to peer service
>>>>>>>>> threads on other servers (upon demand/upon first send between
>>>>>>>>> servers) to verify network connectivity.
>>>>>>>>> If the pings eventually
>>>>>>>>> fail/timeout (i.e., don't get ACKs back), the service thread does a
>>>>>>>>> show_help and kills the job.
>>>>>>>>>
>>>>>>>>> There are more details, but that's the gist of it.
>>>>>>>>>
>>>>>>>>> This basically gives us the ability to highlight problems in the
>>>>>>>>> network and kill the MPI job rather than spin infinitely while trying
>>>>>>>>> to deliver MPI/BTL messages to a peer that will never get there.
>>>>>>>>>
>>>>>>>>> Since this is really a server-to-server network connectivity issue
>>>>>>>>> (vs. an MPI peer-to-peer connectivity issue), we only need to have
>>>>>>>>> one service thread for a whole server. The other MPI procs on the
>>>>>>>>> server use RML to talk to it. E.g., "Please ping the server where
>>>>>>>>> MPI proc X lives," and so on. This seemed better than having a
>>>>>>>>> service thread in each MPI process.
>>>>>>>>>
>>>>>>>>> We've thought a bit about what to do when the BTLs move down to OPAL
>>>>>>>>> (since they won't be able to use RML any more), but don't have a
>>>>>>>>> final solution yet... We do still want to be able to utilize this
>>>>>>>>> capability even after the BTL move.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Squyres
>>>>>>>>> jsquy...@cisco.com
>>>>>>>>> For corporate legal information go to:
>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/05/14673.php
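
Editorial sketch of the accessor idea summarized at the top of this message: OPAL keeps a plain 64-bit opal_identifier_t and exposes jobid/vpid accessors that an RTE can "overlay" with its own functions. The thread does not show George's code, so everything below is an assumption made for illustration: the sketch_* names, the use of function pointers as the overlay mechanism, and both bit layouts (jobid in the upper 32 bits by default, the reverse for the example RTE). It is not the actual Open MPI API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Structureless 64-bit identifier, as described in the thread. */
typedef uint64_t opal_identifier_t;

/* Hypothetical accessor signature (not taken from Open MPI). */
typedef uint32_t (*sketch_id_accessor_fn_t)(opal_identifier_t id);

/* Default OPAL-level accessors.
 * Assumed layout: jobid in the upper 32 bits, vpid in the lower 32 bits. */
uint32_t sketch_default_jobid(opal_identifier_t id)
{
    return (uint32_t)(id >> 32);
}

uint32_t sketch_default_vpid(opal_identifier_t id)
{
    return (uint32_t)(id & 0xffffffffu);
}

/* The accessors callers such as BTLs would use. They start out pointing
 * at the defaults and can be replaced by the RTE. */
sketch_id_accessor_fn_t sketch_get_jobid = sketch_default_jobid;
sketch_id_accessor_fn_t sketch_get_vpid = sketch_default_vpid;

/* Called by an RTE during initialization to overlay its own functions. */
void sketch_overlay_accessors(sketch_id_accessor_fn_t jobid_fn,
                              sketch_id_accessor_fn_t vpid_fn)
{
    assert(NULL != jobid_fn && NULL != vpid_fn);
    sketch_get_jobid = jobid_fn;
    sketch_get_vpid = vpid_fn;
}

/* Example RTE with a different (assumed) packing: vpid high, jobid low.
 * Only the RTE knows this layout; OPAL just calls through the pointers. */
opal_identifier_t sketch_rte_pack(uint32_t jobid, uint32_t vpid)
{
    return ((opal_identifier_t)vpid << 32) | (opal_identifier_t)jobid;
}

uint32_t sketch_rte_jobid(opal_identifier_t id)
{
    return (uint32_t)(id & 0xffffffffu);
}

uint32_t sketch_rte_vpid(opal_identifier_t id)
{
    return (uint32_t)(id >> 32);
}
```

In this sketch a caller only ever uses sketch_get_jobid(id) and sketch_get_vpid(id); an RTE that encodes names differently calls sketch_overlay_accessors() once at startup, which is the "registration" step discussed earlier in the thread.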