In fact there are only two types of information: the data added by the OMPI layer, which is exchanged during the modex stage, and whatever else is built on top of that information by the different pieces of the software stack (including the RTE). If we mark these two types of data independently, we can exchange only what was registered in the first place, which is basically what is needed to connect processes together. Everything else should be built on top of this.
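To make the marking idea concrete, here is a minimal stand-alone sketch of the key-prefix variant (the "ompi." + wildcard-fetch option Ralph lists below). The ompi_modex_store / rte_store / fetch_ompi_only helpers, the flat table, and the example keys are hypothetical stand-ins, not the real opal_db or ompi_modex API; the point is only that a single convention applied at store time lets the fetch side select exactly the data that was registered for the exchange:

/* Sketch of "mark OMPI-level data": every key stored by the OMPI layer
 * gets a known prefix ("ompi."), so a fetch can return only the data
 * registered for the modex and skip RTE-internal entries.  All helpers
 * below are made-up stand-ins, not the real opal_db API. */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 32

struct kv { char key[64]; char val[64]; };
static struct kv table[MAX_ENTRIES];
static int nentries = 0;

/* Hypothetical OMPI-level store: tag the key with the "ompi." prefix. */
static int ompi_modex_store(const char *key, const char *val)
{
    if (nentries >= MAX_ENTRIES) return -1;
    snprintf(table[nentries].key, sizeof(table[nentries].key), "ompi.%s", key);
    snprintf(table[nentries].val, sizeof(table[nentries].val), "%s", val);
    nentries++;
    return 0;
}

/* Hypothetical RTE-level store: no prefix, so it is never exchanged. */
static int rte_store(const char *key, const char *val)
{
    if (nentries >= MAX_ENTRIES) return -1;
    snprintf(table[nentries].key, sizeof(table[nentries].key), "%s", key);
    snprintf(table[nentries].val, sizeof(table[nentries].val), "%s", val);
    nentries++;
    return 0;
}

/* Fetch only what the OMPI layer registered: the "ompi.*" wildcard idea. */
static void fetch_ompi_only(void)
{
    for (int i = 0; i < nentries; i++) {
        if (0 == strncmp(table[i].key, "ompi.", 5)) {
            printf("exchange: %s = %s\n", table[i].key, table[i].val);
        }
    }
}

int main(void)
{
    ompi_modex_store("btl.tcp.addr", "192.168.1.10:5000"); /* endpoint info   */
    rte_store("daemon.vpid", "3");                         /* RTE bookkeeping */
    fetch_ompi_only();  /* only the "ompi." entry is selected for the modex   */
    return 0;
}

The same selection could obviously be done with a DB_RTE-style flag instead of a string prefix; the important part is that the tag is applied by the layer that owns the data.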
This brings me to the second issue: a software layer setting up information for another one. I think we have mixed things together for lack of a clear separation between the layers. The RTE should be in charge of setting things up and allowing processes to exchange information, not of babysitting the MPI processes and annotating their modex with additional info. In particular, regarding the topology stuff, I don't see any reason why that info could not be built at the MPI layer. Once we have a daemon name or vpid, it is trivial to figure out whether two processes are on the same node. If they are, we can extract their topology information to work out more precise details (NUMA hierarchy or whatever). (A rough sketch of that computation is appended at the very end of this mail, after the quoted thread.) This is also hurting my effort toward moving the BTLs into OPAL: I had to build a complex infrastructure to duplicate the connection information in order to hide it from the RTE. Maybe it's time we addressed this problem in a more consistent way.

George.

On Sep 19, 2013, at 11:08 , Ralph Castain <r...@open-mpi.org> wrote:

> Been wracking my brain on this, and I can't find any way to do this cleanly without invoking some kind of extension/modification to the MPI-RTE interface.
>
> The problem is that we are now executing an "in-band" modex operation. This is fine, but the modex operation (no matter how it is executed) is an RTE-dependent operation. Our current ompi_rte_modex function automatically performs it out-of-band, so we don't want to use it here. However, we currently lack any interface for directly obtaining endpoint info and/or for defining/setting locality.
>
> There are several ways we could resolve the endpoint problem:
>
> * define flags as I mentioned previously and modify the opal_db APIs to indicate "we want only non-RTE data"
>
> * set a convention that all OMPI-level data begin with a known substring like "ompi." - we could then simply call "fetch" with an "ompi.*" wildcard to retrieve all MPI-related data
>
> * modify the ompi_modex_* routines to insert "ompi." at the beginning of all keys - this would require an asprintf call, which means a malloc
>
> * add new functions "ompi_rte_get_endpoint_info" and "ompi_rte_set_endpoint_info", and let the RTEs figure out how to get/set the right data
>
> The locality issue is a little tougher. I can't think of any RTE-agnostic method for setting locality. Unless someone else can, the only option I can propose is to add a new MPI-RTE interface "ompi_rte_set_locality(proc)".
>
> Thoughts?
> Ralph
>
> On Sep 18, 2013, at 10:18 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Actually, we wouldn't have to modify the interface - just have to define a DB_RTE flag and OR it to the DB_INTERNAL/DB_EXTERNAL one. We'd need to modify the "fetch" routines to pass the flag into them so we fetched the right things, but that's a simple change.
>>
>> On Sep 18, 2013, at 10:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> I struggled with that myself when doing my earlier patch - part of the reason why I added the dpm API.
>>>
>>> I don't know how to update the locality without referencing RTE-specific keys, so maybe the best thing would be to provide some kind of hook into the db that says we want all the non-RTE keys? It would be simple to add that capability, though we'd have to modify the interface so we can specify "RTE key" when doing the initial store.
>>>
>>> The "internal" flag is used to avoid re-sending data to the system under PMI.
>>> We "store" our data as "external" in the PMI components so the data gets pushed out, then fetch using PMI and store "internal" to put it in our internal hash. So "internal" doesn't mean "non-RTE".
>>>
>>> On Sep 18, 2013, at 10:02 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>>> I hit send too early.
>>>>
>>>> Now that we move the entire "local" modex, is there any way to trim it down or to replace the entries that are no longer correct? Like the locality?
>>>>
>>>> George.
>>>>
>>>> On Sep 18, 2013, at 18:53 , George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>>> Regarding your comment on the bug trac, I noticed there is a DB_INTERNAL flag. While I see how to set it, I could not figure out any way to get it back.
>>>>>
>>>>> With the required modification of the DB API, can't we take advantage of it?
>>>>>
>>>>> George.
>>>>>
>>>>> On Sep 18, 2013, at 18:52 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Thanks George - much appreciated
>>>>>>
>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> Hangs with any np > 1.
>>>>>>>>
>>>>>>>> However, I'm not sure if that's an issue with the test vs. the underlying implementation.
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>
>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1.
>>>>>>>>>>
>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>
>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>
>>>>>>>>>>>> 2. Don't use the test case attached to my email; I left an xterm-based spawn and the debugging in it. It can't work without xterm support. Instead try using the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>
>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.
>>>>>>>>>>> :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>
>>>>>>>>>>> -----
>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> [hang]
>>>>>>>>>>> -----
>>>>>>>>>>>
>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>
>>>>>>>>>>> -----
>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>> [hang]
>>>>>>>>>>> -----
>>>>>>>>>>>
>>>>>>>>>>>> George.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> George --
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached test case hangs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>> -----
>>>>>>>>>>>>>
>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>> -----
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of 29166.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also added the corrected test case, which stresses the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
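P.S. Here is the rough locality sketch mentioned at the top of this mail. The proc_t layout, the daemon_vpid field, and the PROC_* flag values are made-up placeholders (not the structures in the tree), and the single numa_node integer stands in for whatever hwloc query we would really do once the two processes are known to share a node:

/* Locality decided at the MPI layer from information already in the
 * modex: same daemon vpid means same node, and only then do we look at
 * the (stand-in) topology detail.  All names here are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define PROC_NON_LOCAL 0x00
#define PROC_ON_NODE   0x01   /* same daemon -> same node                */
#define PROC_ON_NUMA   0x02   /* would really be refined via hwloc       */

typedef struct {
    uint32_t daemon_vpid;     /* vpid of the daemon hosting the process  */
    int      numa_node;       /* illustrative stand-in for hwloc info    */
} proc_t;

/* Compute the locality of a peer relative to ourselves. */
static int proc_locality(const proc_t *me, const proc_t *peer)
{
    if (me->daemon_vpid != peer->daemon_vpid) {
        return PROC_NON_LOCAL;        /* different node: nothing more to do */
    }
    /* Same node: here the MPI layer could load the node topology and
     * compare bindings; a single integer comparison stands in for that. */
    if (me->numa_node == peer->numa_node) {
        return PROC_ON_NODE | PROC_ON_NUMA;
    }
    return PROC_ON_NODE;
}

int main(void)
{
    proc_t a = { .daemon_vpid = 2, .numa_node = 0 };
    proc_t b = { .daemon_vpid = 2, .numa_node = 1 };
    proc_t c = { .daemon_vpid = 5, .numa_node = 0 };

    printf("a-b locality: 0x%x\n", proc_locality(&a, &b)); /* on node, different NUMA */
    printf("a-c locality: 0x%x\n", proc_locality(&a, &c)); /* different daemons       */
    return 0;
}

Run stand-alone, this reports 0x1 for the pair that shares a daemon and 0x0 for the remote pair; the point is that nothing RTE-specific is needed beyond the daemon vpid carried in the modex.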