On Sep 23, 2013, at 01:43, Ralph Castain <r...@open-mpi.org> wrote:
> On Sep 22, 2013, at 2:15 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> In fact there are only two types of information: one that is added by the OMPI layer, which is exchanged during the modex exchange stage, and whatever else is built on top of this information by different pieces of the software stack (including the RTE). If we mark these two types of data independently, we will be able to exchange only what was registered in the beginning, which is basically what is needed to connect processes together. Everything else should be built on top of this.
>
> Agreed - we just have to figure out how to mark the data. However, there is RTE data sometimes required as well, as indicated below, though this depends on the RTE.

If the data is required, then it was part of the original modex. I might have failed to express myself clearly, but what I had in mind is to keep the original modex info intact and then, upon local reception, decorate it with more info (info that will never be exchanged; it remains entirely local to the process that generates it).

>> This brings me to the second issue: a software layer setting up information for another one. I think we mixed things together by a lack of clear separation between the layers. The RTE should be in charge of setting things up and allowing processes to exchange information, not of babysitting the MPI processes and annotating their modex with additional info.
>
> I don't think we do, last I checked. The only RTE-related data in the ORTE-supported modex is that required by the RTE to support the MPI layer - e.g., URI info to support openib connection handshakes, or daemon vpid for locality computations. Otherwise, I'm not aware of anything added just for RTE purposes.

I was under the impression that the binding and topology info was set up somewhere in the ORTE layer. A quick search of the source code for where OPAL_PROC_ON_NODE is set returns mostly things in the ORTE layer (grpcomm and PMI). Am I mistaken?

>> In particular, regarding the topology stuff, I don't see any reason not to be able to build the info at the MPI layer. Once we have a daemon name or vpid, it is trivial to figure out whether two processes are on the same node. If they are, we can extract their topo information to figure out more precise details (NUMA hierarchy or whatever).
>
> There's the rub - daemon names/vpids are an ORTE concept that isn't shared by all RTEs. How we determine that two processes are on different nodes vs. the same node is something done at the RTE level. In addition, some BTLs require RTE support for connection formation, and others don't - it depends on the RTE as well as the BTL.

One way or the other, we must have access to the hostname where the MPI process is running, no? That can be the key for defining locality. In case an RTE does not provide such information (this might happen), then either it provides a naming scheme that allows us to know that two processes reside on the same host, or we can consider that no processes are located together.
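Something along these lines is all I have in mind (a rough, untested sketch; the peer's hostname is assumed to arrive as a plain string from whatever the modex provides, and none of the names here are the real OPAL/OMPI interface):

-----
/* Rough sketch: decide node-level locality purely from hostnames.
 * The peer's hostname is assumed to come from the modex as a plain
 * string; peer_is_local() is a placeholder, not the real OPAL API. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define HOSTNAME_MAX 256

/* true if the peer is on the same node as us, false when we cannot tell */
static bool peer_is_local(const char *peer_hostname)
{
    char local[HOSTNAME_MAX];

    if (NULL == peer_hostname) {
        /* the RTE exposes no hostname at all: treat the peer as remote */
        return false;
    }
    if (0 != gethostname(local, sizeof(local))) {
        return false;
    }
    local[sizeof(local) - 1] = '\0';
    return 0 == strcmp(local, peer_hostname);
}

int main(void)
{
    char me[HOSTNAME_MAX] = "unknown";
    gethostname(me, sizeof(me));
    me[sizeof(me) - 1] = '\0';
    /* comparing against ourselves obviously reports "local" */
    printf("same node: %s\n", peer_is_local(me) ? "yes" : "no");
    return 0;
}
-----

Anything more precise (NUMA level, shared caches) can then be layered on top once we know the peer is on the same node.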
>> This is something that is also hurting my effort toward moving the BTLs into OPAL. I had to build a complex infrastructure to duplicate the connection information in order to be able to hide it from the RTE. Maybe it's time we address this problem in a more consistent way.
>
> I don't understand why that would be necessary - eventually, the RTE is going to want to know that info anyway, as it intends to use the BTLs as well. Why not just put it in the opal db?

The info is indeed in the opal_db (together with everything else), but it must be easy to access. The problem today is that the layer that wants to use the BTLs doesn't know which keys were registered by them, so it can't figure out what should be exchanged in order to allow two processes to talk to each other. As a result, one must wait until the entire modex is built and then exchange all of it. This maps to the way we used it before, so there is nothing shocking here. It is just that now we need a little more flexibility…
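For illustration, here is a rough and untested sketch of the "ompi." prefix idea from the proposals quoted below (a tiny in-memory table stands in for the opal db, and the key names are made up; nothing here is the real interface):

-----
/* Sketch of the "ompi." prefix convention: everything the MPI layer
 * publishes goes under a known namespace, so a consumer can later
 * collect "every key starting with ompi." without knowing the
 * individual BTL key names.  The table below stands in for the db. */
#define _GNU_SOURCE            /* for asprintf() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 64

static struct { char *key; char *value; } db[MAX_ENTRIES];
static int db_count = 0;

/* mirrors the asprintf-based option: prepend "ompi." at store time */
static int ompi_modex_store(const char *key, const char *value)
{
    char *full = NULL;

    if (db_count >= MAX_ENTRIES || asprintf(&full, "ompi.%s", key) < 0) {
        return -1;
    }
    db[db_count].key = full;
    db[db_count].value = strdup(value);
    db_count++;
    return 0;
}

/* walk only the MPI-level entries, i.e. the ones that actually need to
 * be exchanged for two processes to reach each other */
static void ompi_modex_pack(void)
{
    for (int i = 0; i < db_count; i++) {
        if (0 == strncmp(db[i].key, "ompi.", 5)) {
            printf("would exchange: %s = %s\n", db[i].key, db[i].value);
        }
    }
}

int main(void)
{
    ompi_modex_store("btl.tcp.addr", "10.0.0.12:7001");  /* made-up keys */
    ompi_modex_store("btl.sm.seg",   "/tmp/sm_seg_4711");
    ompi_modex_pack();
    return 0;
}
-----

The same filtering could obviously be done with a flag stored next to each key instead of a string prefix, which is the other option on the table below.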
  George.

>
>> George.
>>
>> On Sep 19, 2013, at 11:08, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Been wracking my brain on this, and I can't find any way to do this cleanly without invoking some kind of extension/modification to the MPI-RTE interface.
>>>
>>> The problem is that we are now executing an "in-band" modex operation. This is fine, but the modex operation (no matter how it is executed) is an RTE-dependent operation. Our current ompi_rte_modex function automatically performs it out-of-band, so we don't want to use it here. However, we currently lack any interface for directly obtaining endpoint info and/or for defining/setting locality.
>>>
>>> There are several ways we could resolve the endpoint problem:
>>>
>>> * define flags as I mentioned previously and modify the opal_db APIs to indicate "we want only non-RTE data"
>>>
>>> * set a convention that all OMPI-level data begins with a known substring like "ompi." - we could then simply call "fetch" with an "ompi.*" wildcard to retrieve all MPI-related data
>>>
>>> * modify the ompi_modex_* routines to insert "ompi." at the beginning of all keys - this would require an asprintf call, which means a malloc
>>>
>>> * add new functions "ompi_rte_get_endpoint_info" and "ompi_rte_set_endpoint_info", and let the RTEs figure out how to get/set the right data
>>>
>>> The locality issue is a little tougher. I can't think of any RTE-agnostic method for setting locality. Unless someone else can, the only option I can propose is to add a new MPI-RTE interface "ompi_rte_set_locality(proc)".
>>>
>>> Thoughts?
>>> Ralph
>>>
>>> On Sep 18, 2013, at 10:18 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Actually, we wouldn't have to modify the interface - just define a DB_RTE flag and OR it into the DB_INTERNAL/DB_EXTERNAL one. We'd need to modify the "fetch" routines to pass the flag into them so we fetch the right things, but that's a simple change.
>>>>
>>>> On Sep 18, 2013, at 10:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> I struggled with that myself when doing my earlier patch - part of the reason why I added the dpm API.
>>>>>
>>>>> I don't know how to update the locality without referencing RTE-specific keys, so maybe the best thing would be to provide some kind of hook into the db that says we want all the non-RTE keys? It would be simple to add that capability, though we'd have to modify the interface so we specify "RTE key" when doing the initial store.
>>>>>
>>>>> The "internal" flag is used to avoid re-sending data to the system under PMI. We "store" our data as "external" in the PMI components so the data gets pushed out, then fetch using PMI and store it "internal" to put it in our internal hash. So "internal" doesn't mean "non-RTE".
>>>>>
>>>>> On Sep 18, 2013, at 10:02 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>>> I hit send too early.
>>>>>>
>>>>>> Now that we move the entire "local" modex, is there any way to trim it down or to replace the entries that are not correct anymore? Like the locality?
>>>>>>
>>>>>> George.
>>>>>>
>>>>>> On Sep 18, 2013, at 18:53, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>>> Regarding your comment on the bug trac, I noticed there is a DB_INTERNAL flag. While I see how to set it, I could not figure out any way to get it back.
>>>>>>>
>>>>>>> With the required modification of the DB API, can't we take advantage of it?
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>> On Sep 18, 2013, at 18:52, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> Thanks George - much appreciated
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>
>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>
>>>>>>>>> George.
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 16:49, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hangs with any np > 1.
>>>>>>>>>>
>>>>>>>>>> However, I'm not sure whether that's an issue with the test vs. the underlying implementation.
>>>>>>>>>>
>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>
>>>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Don't use the test case attached to my email; I left in an xterm-based spawn and the debugging, so it can't work without xterm support. Instead, try the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.
>>>>>>>>>>>>> :-)  I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>> -----
>>>>>>>>>>>>>
>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>> -----
>>>>>>>>>>>>>
>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64-bit Linux, your attached test case hangs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of r29166.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I also added the corrected test case, stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
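For reference, a minimal standalone illustration of that barrier-plus-clean-disconnect pattern (this is not the actual orte/test/mpi/intercomm_create.c, just a stripped-down sketch that assumes at least two ranks):

-----
/* Sketch only: build an inter-communicator between two halves of
 * MPI_COMM_WORLD, barrier across both groups, then disconnect cleanly. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm half, inter;
    int rank, size, color, remote_leader;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* split the world into two halves; each half is one local group */
    color = (rank < size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half);

    /* leaders are world ranks 0 and size/2; tag 201 as in the output above */
    remote_leader = (0 == color) ? size / 2 : 0;
    MPI_Intercomm_create(half, 0, MPI_COMM_WORLD, remote_leader, 201, &inter);

    /* barrier over both groups right after creation, then a clean
     * disconnect instead of a plain free */
    MPI_Barrier(inter);
    MPI_Comm_disconnect(&inter);
    MPI_Comm_free(&half);

    if (0 == rank) {
        printf("inter-communicator created and disconnected cleanly\n");
    }
    MPI_Finalize();
    return 0;
}
-----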
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel