Apparently we are good today at 2 PM EST. Fire up the webex ;)

George.
On May 1, 2014, at 10:35, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> http://doodle.com/hhm4yyr76ipcxgk2
>
> On May 1, 2014, at 10:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> sure - might be faster that way :-)
>>
>> On May 1, 2014, at 6:59 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>>> Want to have a phone call/webex to discuss?
>>>
>>> On May 1, 2014, at 9:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> The problem we'll have with BTLs in opal is going to revolve around that ompi_process_name_t, and it will occur in a number of places. I've been trying to grok George's statement about accessors and can't figure out a clean way to make that work IF every RTE gets to define the process name a different way.
>>>>
>>>> For example, suppose I define ompi_process_name_t to be a string. I can hash the string down to an opal_identifier_t, but that is a structureless 64-bit value - there is no concept of a jobid or vpid in it. So if you now want to extract a jobid from that identifier, the only way you can do it is to "up-call" back to the RTE to parse it.
>>>>
>>>> This means that every RTE would have to initialize OPAL with a registration of its opal_identifier parser function(s), which seems like a really ugly solution.
>>>>
>>>> Maybe it is time to shift the process identifier down to the opal layer? If we define opal_identifier_t to include the required jobid/vpid, perhaps adding a void* so someone can put whatever they want in it?
>>>>
>>>> Note that I'm not wild about extending the identifier size beyond 64 bits, as memory footprint is a growing concern, and I still haven't seen any real use case proposed for extending it.
>>>>
>>>> On May 1, 2014, at 3:41 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>
>>>>> On Apr 30, 2014, at 10:01 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>>> Why do you need the ompi_process_name_t? Isn't the opal_identifier_t enough to dig for the info of the peer in the opal_db?
>>>>>
>>>>> At the moment, I use the ompi_process_name_t for RML sends/receives in the usnic BTL. I know this will have to change when the BTLs move down to OPAL (when is that going to happen, BTW?). So my future use case may be somewhat moot.
>>>>>
>>>>> More detail
>>>>> ===========
>>>>>
>>>>> "Why does the usnic BTL use RML sends/receives?", you ask.
>>>>>
>>>>> The reason is rooted in the fact that the usnic BTL uses an unreliable, connectionless transport under the covers. Some customers had network misconfigurations that resulted in usnic traffic not flowing properly (e.g., MTU mismatches in the network). But since we don't have a connection-oriented underlying API that will eventually time out / fail to connect / etc. when there's a problem with the network configuration, we added a "connection validation" service in the usnic BTL that fires up in a thread in the local rank 0 on each server. This thread provides service to all the MPI processes on its server.
>>>>>
>>>>> In short: the service thread sends UDP pings and ACKs to peer service threads on other servers (on demand, upon the first send between servers) to verify network connectivity. If the pings eventually fail/time out (i.e., don't get ACKs back), the service thread does a show_help and kills the job.
>>>>> There are more details, but that's the gist of it.
>>>>>
>>>>> This basically gives us the ability to highlight problems in the network and kill the MPI job, rather than spinning forever trying to deliver MPI/BTL messages that will never reach their peer.
>>>>>
>>>>> Since this is really a server-to-server network connectivity issue (vs. an MPI peer-to-peer connectivity issue), we only need one service thread for the whole server. The other MPI procs on the server use RML to talk to it, e.g., "Please ping the server where MPI proc X lives," and so on. This seemed better than having a service thread in each MPI process.
>>>>>
>>>>> We've thought a bit about what to do when the BTLs move down to OPAL (since they won't be able to use RML any more), but we don't have a final solution yet... We do still want to be able to use this capability even after the BTL move.
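The connection-validation scheme Jeff describes above boils down to two message types: an on-node request from an MPI proc to the per-server service thread (carried over RML today), and the UDP ping/ACK that the service threads exchange between servers. Here is a minimal sketch of those two messages in C, purely for illustration; every name and field below is hypothetical and is not taken from the actual usnic BTL code.

/* Hypothetical sketch only: message layouts invented for illustration;
 * the real usnic BTL connectivity checker differs.  It shows the shape of
 * what Jeff describes: local MPI procs ask the single per-server service
 * thread (running in local rank 0) to verify reachability of a peer
 * server, and the service threads exchange UDP pings/ACKs. */

#include <stdint.h>

/* Request sent (today via RML) from an MPI proc to the local service
 * thread: "please ping the server where MPI proc X lives". */
typedef struct {
    uint64_t peer_proc_id;   /* identifier of the remote MPI proc */
    uint32_t peer_ipv4;      /* address of the server hosting that proc */
    uint16_t peer_udp_port;  /* UDP port of that server's service thread */
} connectivity_check_req_t;

/* Wire message the service thread sends over UDP; the peer's service
 * thread echoes it back as an ACK.  If no ACK arrives before a timeout,
 * the checker does a show_help and kills the job. */
typedef struct {
    uint32_t magic;          /* sanity check against stray traffic */
    uint32_t seq;            /* matches ACKs to outstanding pings */
    uint8_t  is_ack;         /* 0 = ping, 1 = ack */
    uint16_t payload_len;    /* padding size, useful for exercising the MTU */
} connectivity_ping_t;

Keeping the on-node request separate from the UDP wire format is what lets a single service thread in local rank 0 answer requests for every proc on the server, which is the design choice Jeff argues for above.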
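Ralph's proposal - moving the process identifier down to OPAL with an explicit jobid/vpid - can be contrasted with the "every RTE registers a parser" alternative he calls ugly. The sketch below shows both under hypothetical names (opal_process_name_t with these accessors, opal_identifier_register_parser(), etc. are illustrative, not the real code); it stays at 64 bits and leaves out the optional void* extension he mentions, in line with his memory-footprint concern.

/* Hypothetical sketch only: none of these names are from the Open MPI
 * code base.  It contrasts the two options in Ralph's message: a
 * structured 64-bit identifier owned by OPAL vs. an opaque 64-bit value
 * that OPAL can only decode through an RTE-registered parser. */

#include <stddef.h>
#include <stdint.h>

typedef uint32_t opal_jobid_t;
typedef uint32_t opal_vpid_t;

/* Option 1: OPAL owns a structured identifier (still exactly 64 bits,
 * so the memory footprint does not grow). */
typedef struct {
    opal_jobid_t jobid;
    opal_vpid_t  vpid;
} opal_process_name_t;

/* Accessors any layer (including a BTL living in OPAL) can call without
 * up-calling into the RTE. */
static inline opal_jobid_t opal_process_name_jobid(opal_process_name_t n)
{
    return n.jobid;
}

static inline opal_vpid_t opal_process_name_vpid(opal_process_name_t n)
{
    return n.vpid;
}

/* Option 2: the identifier stays opaque, and every RTE must register a
 * parser with OPAL during init; this is the part Ralph calls "really ugly". */
typedef uint64_t opal_identifier_t;
typedef int (*opal_identifier_parse_fn_t)(opal_identifier_t id,
                                          opal_jobid_t *jobid,
                                          opal_vpid_t *vpid);

static opal_identifier_parse_fn_t opal_identifier_parser = NULL;

static inline void opal_identifier_register_parser(opal_identifier_parse_fn_t fn)
{
    opal_identifier_parser = fn;
}

static inline int opal_identifier_to_jobid(opal_identifier_t id, opal_jobid_t *jobid)
{
    opal_vpid_t vpid;
    /* Without a registered parser, OPAL cannot interpret the value at all. */
    return (NULL == opal_identifier_parser) ? -1
                                            : opal_identifier_parser(id, jobid, &vpid);
}

With a structured identifier, a BTL that has moved down to OPAL can read the jobid/vpid directly; with the opaque 64-bit value, every lookup has to go through whatever parser the RTE registered at init time, which is exactly the up-call Ralph wants to avoid.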