Perhaps another way to think of this is in object-oriented terms. What I’m advocating is that we treat the topology tree like an object that is shared across the local procs. Thus, instead of traversing the tree by the current “next = current->next” method, you would use an accessor function “next = hwloc_get_next(current, my_constraints)”.
You still need to modify the upper-level APIs like get_obj_by_type to pass in the constraints so they can apply them when traversing the tree - I don’t really see an easy way to avoid that. I suppose you could set the constraints in the environment, but that would mean doing a getenv at the beginning of every upper-level function and then converting that string into the constraints, which seems like a lot of overhead.

Anyway, hope that helps tickle the imagination
Ralph

> On Oct 21, 2016, at 2:06 PM, r...@open-mpi.org wrote:
>
>> On Oct 21, 2016, at 10:09 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>> On 21/10/2016 17:21, r...@open-mpi.org wrote:
>>> I should add: this does beg the question of how a proc “discovers” its resource constraints without having access to the hwloc tree. One possible solution - the RM already knows the restrictions, and so it could pass those down at proc startup (e.g., as part of the PMIx info). We could pass whatever info hwloc would like passed into its calls - it doesn’t have to be something “understandable” by the proc itself.
>>
>> Retrieving cgroup info from Linux isn't expensive, so my feeling was to still have compute processes do it. But, indeed, we could also avoid that step by having the caller pass a hwloc_bitmap_t for allowed PUs and another one for allowed NUMA nodes. More below.
>
> I think that minimizing the number of procs hitting the file system for info is probably a good thing. In fact, the RM doesn’t have to ask Linux at all for the cgroup or other binding (and remember, cgroups are only one way of constraining resources), as it is the one assigning that constraint. So it can just pass down what it is assigning with no need to do a query.
>
>>
>>>
>>>> On Oct 21, 2016, at 8:15 AM, r...@open-mpi.org wrote:
>>>>
>>>> Hmmm...I think maybe we are only seeing a small portion of the picture here. There are two pieces of the problem when looking at large SMPs:
>>>>
>>>> * time required for discovery - your proposal is attempting to address that, assuming that the RM daemon collects the topology and then communicates it to each process (which is today’s method)
>>
>> There's actually an easy way to do that. Export to XML during boot, export HWLOC_XMLFILE=/path/to/xml to the processes.
>
> Yeah, but that still leaves everyone parsing an XML file, and having several hundred procs hitting the same file is not good.
>
>>>> * memory footprint. We are seeing over 40MBytes being consumed by hwloc topologies on fully loaded KNL machines, which is a disturbing number.
>>
>> I'd be interested in knowing whether these 40MB are hwloc_obj structures, or bitmaps, or info strings, etc. Do you have this information already?
>
> Nope - just overall consumption.
>
>>>> Where we are headed is to having only one copy of the hwloc topology tree on a node, stored in a shared memory segment hosted by the local RM daemon.
>>
>> And would you need some sort of relative pointers so that all processes can traverse parent/child/sibling/... pointers from their own mapping at a random address in their address space?
>
> Not sure that is necessary - might be a nice optimization.
>
>>>> Procs will then access that tree to obtain any required info. Thus, we are less interested in each process creating its own tree based on an XML representation passed to it by the RM, and more interested in having the hwloc search algorithms correctly handle any resource restrictions when searching the RM’s tree.
>>>>
>>>> In other words, rather than (or perhaps, in addition to?) filtering the XML, we’d prefer to see some modification of the search APIs to allow a proc to pass in its resource constraints, and have the search algorithm properly consider them when returning the result. This eliminates all the XML conversion overhead, and resolves the memory footprint issue.
>>
>> What do you call the "search algorithm"? We have many functions to walk the tree, as levels, from top to bottom, etc. Passing such resource constraints to all of them isn't easy. And we have explicit pointers between objects too.
>
> Maybe “search” is the wrong term, but you have functions like hwloc_get_obj_by_type, hwloc_get_nbobjs_by_type, and hwloc_bitmap_weight that might be impacted by constraints. As you say, maybe just creating a wrapper that applies the constraints for the caller (like we do in OMPI) would be sufficient.
>
>> Maybe define a good basic set of interesting functions for your search algorithm and duplicate these in a new hwloc/allowed.h with a new allowed_cpuset attribute? Whenever they find an object, they check whether hwloc_bitmap_intersects(obj->cpuset, allowed_cpuset). If FALSE, ignore that object.
>
> Sure - I’m open on implementation.
>
>> There's also a related change that I wasn't ready/sure to try yet: obj->allowed_cpuset is currently just a duplicate of obj->cpuset in the default case. When the WHOLE_SYSTEM topology flag is set, it's a binary AND between obj->cpuset and root->allowed_cpuset. Quite a lot of duplication. We could remove all these allowed_{cpuset,nodeset} from objects and have a topology->allowed_cpuset instead. Most users don't care and wouldn't see the difference. Others would pass the WHOLE_SYSTEM flag and use hwloc/allowed.h or do things manually:
>> * ignore an object if !hwloc_bitmap_intersects(obj->cpuset, allowed_cpuset), like what hwloc/allowed.h would do.
>> * bind using:
>>     set = hwloc_bitmap_dup(obj->cpuset);
>>     hwloc_bitmap_and(set, set, allowed_cpuset);
>>     set_cpubind(set);
>>     hwloc_bitmap_free(set);
>>
>> allowed_cpuset can be either a new topology->allowed_cpuset retrieved by the current process using the OS, or their own provided allowed_cpuset that came from the RM.
>>
>> I only talked about allowed_cpuset above, but there's also an allowed_nodeset. What happens if a NUMA node is disallowed but its local cores are allowed? We want to ignore that NUMA node when looking up NUMA nodes for manipulating memory. But we don't want to ignore it when looking up NUMA nodes and their children for placing tasks. It's not clear to me how to handle these cases. Maybe have all new functions receive both allowed_cpuset and allowed_nodeset, but allow either of them to be NULL?
>
>> By the way, obj->complete_{cpuset,nodeset} is also something we could drop and just have a topology->complete_{cpuset,nodeset} saying "by the way, there are other resources that we don't know much about, are offline, ...".
>
> Again, I view these as optimizations. My main concern is to get the topology map (not the XML) into shared memory, and have functions that can traverse that tree, applying local constraints.
>
>> Brice
>
>>>> HTH
>>>> Ralph
>>>>
>>>>> On Oct 21, 2016, at 5:16 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>>>
>>>>> Hello
>>>>>
>>>>> Based on recent discussion about hwloc_topology_load() being slow on some "large" platforms (almost 1 second on KNL), here's a new feature proposal:
>>>>>
>>>>> We've been recommending the use of XML to avoid multiple expensive discoveries: export to XML once at boot, and reload from XML for each actual process using hwloc. The main limitation is cgroups: resource managers use cgroups to restrict the processors and memory that are actually available to each job.
>>>>> So the topology of different jobs on the same machine is actually slightly different from the main XML that contained everything when it was created outside of cgroups during boot.
>>>>>
>>>>> So we're looking at adding a new topology flag that loads the entire machine from XML (or synthetic) and applies restrictions from the local/native operating system.
>>>>>
>>>>> Details at https://github.com/open-mpi/hwloc/pull/212
>>>>> Comments welcome here or there.
>>>>>
>>>>> Brice
_______________________________________________
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel