The text looks correct to me. I don't have any better suggestion for now. I am thinking about adding an adopt() flag to say "adopt it, or give me a pointer to the already adopted one", but it's not clear to me how to implement this safely. I opened an hwloc issue to discuss the details of making sure both adopt() calls point to the very same shmem topology file: https://github.com/open-mpi/hwloc/issues/449
Brice


On 04/02/2021 at 01:28, Ralph Castain via devel wrote:
> I have updated the site to reflect this discussion to-date. I'm still trying
> to figure out what to do about low-level libs. For now, I've removed the
> envars and modified suggestions.
>
> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>
> Further comment/input is welcome.
>
>
>> On Feb 3, 2021, at 8:09 AM, Ralph Castain via devel
>> <devel@lists.open-mpi.org> wrote:
>>
>> What if we do this:
>>
>> - if you are using PMIx v4.1 or above, then there is no problem. Call
>> PMIx_Load_topology and we will always return a valid pointer to the
>> topology, subject to the caveat that all members of the process (as well as
>> the server) must use the same hwloc version.
>>
>> - if you are using PMIx v4.0 or below, then first do a PMIx_Get for
>> PMIX_TOPOLOGY. If "not found", then try to get the shmem info and adopt it.
>> If the shmem info isn't found, then do a topology_load to discover the
>> topology. Either way, when done, do a PMIx_Store_internal of the
>> hwloc_topology_t using the PMIX_TOPOLOGY key.
>>
>> This still leaves open the question of what to do with low-level libraries
>> that really don't want to link against PMIx. I'm not sure what to do there.
>> I agree it is "ugly" to pass an addr in the environment, but there really
>> isn't any cleaner option that I can see short of asking every library to
>> provide us with the ability to pass hwloc_topology_t down to them. Outside
>> of that obvious answer, I suppose we could put the hwloc_topology_t address
>> into the environment and have them connect that way?
>>
>>
>>> On Feb 3, 2021, at 7:36 AM, Ralph Castain via devel
>>> <devel@lists.open-mpi.org> wrote:
>>>
>>> I guess this begs the question: how does a library detect that the shmem
>>> region has already been mapped? If we attempt to map it and fail, does that
>>> mean it has already been mapped or that it doesn't exist?
>>>
>>> It isn't reasonable to expect that all the libraries in a process will
>>> coordinate such that they "know" hwloc has been initialized by the main
>>> program, for example. So how do they determine that the topology is
>>> present, and how do they gain access to it?
>>>
>>>
>>>> On Feb 3, 2021, at 6:07 AM, Brice Goglin via devel
>>>> <devel@lists.open-mpi.org> wrote:
>>>>
>>>> Hello Ralph
>>>>
>>>> One thing that isn't clear in this document: the hwloc shmem region may
>>>> only be mapped *once* per process (because the mmap address is always
>>>> the same). Hence, if a library calls adopt() in the process, others will
>>>> fail. This applies to the 2nd and 3rd case in "Accessing the HWLOC
>>>> topology tree from clients".
>>>>
>>>> For the 3rd case where low-level libraries don't want to depend on PMIx,
>>>> storing the pointer to the topology in an environment variable might be
>>>> an (ugly) solution.
>>>>
>>>> By the way, you may want to specify somewhere that all these libraries
>>>> using the topology pointer in the process must use the same hwloc
>>>> version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exporter
>>>> and importer are compatible. But passing the topology pointer doesn't
>>>> provide any way to verify that the caller doesn't use its own
>>>> incompatible embedded hwloc.
>>>>
>>>> Brice
>>>>
>>>>
>>>> On 02/02/2021 at 18:32, Ralph Castain via devel wrote:
>>>>> Hi folks
>>>>>
>>>>> Per today's telecon, here is a link to a description of the HWLOC
>>>>> duplication issue for many-core environments and methods by which you
>>>>> can mitigate the impact.
>>>>>
>>>>> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>>>>>
>>>>> George: for lower-level libs like treematch or HAN, you might want to
>>>>> look at the envar method (described about half-way down the page) to
>>>>> avoid directly linking those libraries against PMIx. That wouldn't be
>>>>> a problem while inside OMPI, but could be an issue if people want to
>>>>> use them in a non-PMIx environment.
>>>>>
>>>>> Ralph
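
For reference, the lookup order Ralph proposes above for PMIx v4.0 and below could be sketched roughly as follows. This is only an illustration of the control flow, not working PMIx or hwloc code: topo_t and every stub_* function are hypothetical stand-ins (for hwloc_topology_t, PMIx_Get with PMIX_TOPOLOGY, the shmem adopt step, topology_load and PMIx_Store_internal), so that the sketch compiles without either library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for hwloc_topology_t (the real type is in <hwloc.h>). */
typedef struct { const char *origin; } topo_t;

/* Models the process-local slot that PMIx_Store_internal would fill
 * under the PMIX_TOPOLOGY key. */
static topo_t *stored_topology = NULL;

/* Knob for the sketch: set to 1 to pretend shmem info was published. */
static int shmem_info_available = 0;

static topo_t shmem_topology = { "shmem adopt" };
static topo_t discovered_topology = { "topology_load" };

/* Stub for PMIx_Get(PMIX_TOPOLOGY): 0 on success, -1 for "not found". */
static int stub_pmix_get(topo_t **out)
{
    if (stored_topology == NULL) return -1;
    *out = stored_topology;
    return 0;
}

/* Stub for fetching the shmem info and adopting it. */
static int stub_shmem_adopt(topo_t **out)
{
    if (!shmem_info_available) return -1;
    *out = &shmem_topology;
    return 0;
}

/* Stub for local discovery; assumed to always succeed in this sketch. */
static int stub_topology_load(topo_t **out)
{
    *out = &discovered_topology;
    return 0;
}

/* Stub for PMIx_Store_internal with the PMIX_TOPOLOGY key. */
static void stub_store_internal(topo_t *t)
{
    stored_topology = t;
}

/* Get -> shmem adopt -> topology_load, then store whatever was obtained. */
topo_t *get_topology(void)
{
    topo_t *t = NULL;

    /* Step 1: somebody in this process already stored the pointer. */
    if (stub_pmix_get(&t) == 0) return t;

    /* Steps 2 and 3: adopt the shmem copy, else discover locally. */
    if (stub_shmem_adopt(&t) != 0 && stub_topology_load(&t) != 0) return NULL;

    /* "Either way, when done": publish the pointer for later callers. */
    assert(t != NULL);
    stub_store_internal(t);
    return t;
}
```

With this ordering, a second caller in the same process gets the stored pointer back from the first step and never attempts a second adopt, which matters because the shmem region can only be mapped once per process. It does not help libraries that avoid PMIx altogether, and it does nothing to check that all callers use the same hwloc version.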