> On Oct 21, 2016, at 10:09 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
> 
> On 21/10/2016 17:21, r...@open-mpi.org wrote:
>> I should add: this does beg the question of how a proc “discovers” its 
>> resource constraints without having access to the hwloc tree. One possible 
>> solution - the RM already knows the restrictions, and so it could pass those 
>> down at proc startup (e.g., as part of the PMIx info). We could pass 
>> whatever info hwloc would like passed into its calls - doesn’t have to be 
>> something “understandable” by the proc itself.
> 
> Retrieving cgroups info from Linux isn't expensive so my feeling was to
> still have compute processes do it. But, indeed, we could also avoid
> that step by having the caller pass a hwloc_bitmap_t for allowed PUs and
> another one for allowed NUMA nodes. More below.

I think that minimizing the number of procs hitting the file system for info is 
probably a good thing. In fact, the RM doesn’t have to ask Linux at all for the 
cgroup or other binding (and remember, cgroup is only one way of constraining 
resources) as it is the one assigning that constraint. So it can just pass down 
what it is assigning with no need to do a query.
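
For what it's worth, a minimal sketch of what that hand-off could look like. The
hwloc bitmap string routines are existing API; the function names here and the
idea of carrying the string in a PMIx info key are purely illustrative:

    #include <hwloc.h>

    /* RM side: serialize the allowed cpuset it already assigned to the job */
    static char *rm_serialize_allowed(hwloc_const_bitmap_t allowed_cpuset)
    {
        char *str;
        hwloc_bitmap_asprintf(&str, allowed_cpuset);
        return str;  /* hand this to the proc at startup, e.g. as a PMIx info value */
    }

    /* proc side: rebuild the bitmap without ever touching the filesystem */
    static hwloc_bitmap_t proc_parse_allowed(const char *str)
    {
        hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
        hwloc_bitmap_sscanf(allowed, str);
        return allowed;
    }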

> 
>> 
>>> On Oct 21, 2016, at 8:15 AM, r...@open-mpi.org wrote:
>>> 
>>> Hmmm...I think maybe we are only seeing a small portion of the picture 
>>> here. There are two pieces of the problem when looking at large SMPs:
>>> 
>>> * time required for discovery - your proposal is attempting to address 
>>> that, assuming that the RM daemon collects the topology and then 
>>> communicates it to each process (which is today’s method)
> 
> There's actually an easy way to do that. Export to XML during boot,
> export HWLOC_XMLFILE=/path/to/xml to the processes.

Yeah, but that still leaves everyone parsing an XML file, and having several 
hundred procs hitting the same file is not good.
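
For reference, the boot-time export side of what Brice describes is cheap and
simple (hwloc 1.x API; the XML path is illustrative). The scaling concern is
the per-process reload, not the export:

    #include <hwloc.h>

    /* run once at boot, outside any cgroup, so the XML contains everything */
    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);   /* the one full (expensive) native discovery */
        hwloc_topology_export_xml(topo, "/etc/hwloc/boot.xml");
        hwloc_topology_destroy(topo);
        return 0;
    }

    /* then point every job at it:  export HWLOC_XMLFILE=/etc/hwloc/boot.xml */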

> 
>>> 
>>> * memory footprint. We are seeing over 40MBytes being consumed by hwloc 
>>> topologies on fully loaded KNL machines, which is a disturbing number
> 
> I'd be interested in knowing whether these 40MB are hwloc_obj structures,
> or bitmaps, or info strings, etc. Do you have this information already?

Nope - just overall consumption.

> 
>>> Where we are headed is to having only one copy of the hwloc topology tree 
>>> on a node, stored in a shared memory segment hosted by the local RM daemon.
> 
> And would you need some sort of relative pointers so that all processes
> can traverse parent/child/sibling/... pointers from their own mapping at
> a random address in their address space?

Not sure that is necessary - might be a nice optimization.
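
If we ever do it, the usual trick is storing offsets instead of raw pointers,
so the segment can be mapped at a different address in every process. Purely a
schematic sketch, not hwloc's real object layout:

    #include <stddef.h>

    /* links stored as byte offsets from the segment base instead of
     * absolute pointers (schematic only) */
    struct shm_obj {
        ptrdiff_t parent_off;        /* -1 when there is no parent */
        ptrdiff_t first_child_off;   /* -1 when there are no children */
    };

    /* each process converts offsets using its own mapping address */
    static inline struct shm_obj *shm_deref(void *base, ptrdiff_t off)
    {
        return off < 0 ? NULL : (struct shm_obj *)((char *)base + off);
    }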

> 
>>> Procs will then access that tree to obtain any required info. Thus, we are 
>>> less interested in each process creating its own tree based on an XML 
>>> representation passed to it by the RM, and more interested in having the 
>>> hwloc search algorithms correctly handle any resource restrictions when 
>>> searching the RM’s tree.
>>> 
>>> In other words, rather than (or perhaps, in addition to?) filtering the 
>>> XML, we’d prefer to see some modification of the search APIs to allow a 
>>> proc to pass in its resource constraints, and have the search algorithm 
>>> properly consider them when returning the result. This eliminates all the 
>>> XML conversion overhead, and resolves the memory footprint issue.
> 
> What do you call "search algorithm"? We have many functions to walk the
> tree, as levels, from top to bottom, etc. Passing such resource
> constraints to all of them isn't easy. And we have explicit pointers
> between objects too.

Maybe “search” is the wrong term, but you have functions like 
hwloc_get_obj_by_type, hwloc_get_nbobjs_by_type, and hwloc_bitmap_weight that 
might be impacted by constraints. As you say, maybe just creating a wrapper 
that applies the constraints for the caller (like we do in OMPI) would be 
sufficient.

> 
> Maybe define a good basic set of interesting functions for your search
> algorithm and duplicate these in a new hwloc/allowed.h with a new
> allowed_cpuset attribute? Whenever they find an object, they check
> whether hwloc_bitmap_intersects(obj->cpuset, allowed_cpuset). If FALSE,
> ignore that object.

Sure - I’m open on implementation.
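
For concreteness, a sketch of what one such wrapper could look like.
hwloc_get_next_obj_by_type and hwloc_bitmap_intersects are existing hwloc API;
the wrapper name and its "allowed" argument are the hypothetical
hwloc/allowed.h part:

    #include <hwloc.h>

    /* hypothetical hwloc/allowed.h-style helper: return the idx-th object
     * of the given type whose cpuset intersects the caller's allowed set */
    hwloc_obj_t
    hwloc_get_obj_by_type_allowed(hwloc_topology_t topo, hwloc_obj_type_t type,
                                  unsigned idx, hwloc_const_bitmap_t allowed)
    {
        hwloc_obj_t obj = NULL;
        while ((obj = hwloc_get_next_obj_by_type(topo, type, obj)) != NULL) {
            if (!obj->cpuset || !hwloc_bitmap_intersects(obj->cpuset, allowed))
                continue; /* object entirely outside the constraint: skip it */
            if (idx-- == 0)
                return obj;
        }
        return NULL;
    }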

> 
> There's also a related change that I wasn't ready/sure to try yet:
> obj->allowed_cpuset is currently just a duplicate of obj->cpuset in the
> default case. When the WHOLE_SYSTEM topology flag is set, it's a binary
> AND between obj->cpuset and root->allowed_cpuset. Quite a lot of
> duplication. We could remove all these allowed_{cpuset,nodeset} from
> objects and have a topology->allowed_cpuset instead. Most users don't
> care and wouldn't see the difference. Others would pass the WHOLE_SYSTEM
> flag and use hwloc/allowed.h or do things manually:
> * ignore an object if !hwloc_bitmap_intersects(obj->cpuset,
> allowed_cpuset) like what hwloc/allowed.h would do.
> * bind using:
>   hwloc_cpuset_t set = hwloc_bitmap_dup(obj->cpuset);
>   hwloc_bitmap_and(set, set, allowed_cpuset);
>   hwloc_set_cpubind(topology, set, 0);
>   hwloc_bitmap_free(set);
> 
> allowed_cpuset can be either a new topology->allowed_cpuset retrieved by
> the current process using the OS, or a caller-provided allowed_cpuset
> that came from the RM.
> 
> I only talked about allowed_cpuset above, but there's also an
> allowed_nodeset. What happens if a NUMA node is disallowed but its local
> cores are allowed? We want to ignore that NUMA node when looking up NUMA
> nodes for manipulating memory. But we don't want to ignore it when
> looking up NUMA nodes and children for placing tasks. It's not clear to
> me how to handle these cases. Maybe have all new functions receive both
> allowed_cpuset and allowed_nodeset, but let either one be NULL?
> 
> By the way, obj->complete_{cpuset,nodeset} is also something we could
> drop and just have a topology->complete_{cpuset,nodeset} saying "by the
> way, there are other resources that we don't know much about, are
> offline, ...".

Again, I view these as optimizations. My main concern is to get the topology 
map (not the XML) into shared memory, and have functions that can traverse that 
tree, applying local constraints.
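
On the cpuset-vs-nodeset question above, one possible shape is a predicate that
takes both constraints and treats NULL as "no constraint on that dimension".
Purely a sketch:

    #include <hwloc.h>

    /* does obj survive both constraints? NULL disables that dimension */
    static int obj_allowed(hwloc_obj_t obj,
                           hwloc_const_bitmap_t allowed_cpuset,
                           hwloc_const_bitmap_t allowed_nodeset)
    {
        if (allowed_cpuset && obj->cpuset &&
            !hwloc_bitmap_intersects(obj->cpuset, allowed_cpuset))
            return 0;
        if (allowed_nodeset && obj->nodeset &&
            !hwloc_bitmap_intersects(obj->nodeset, allowed_nodeset))
            return 0;
        return 1;
    }

Memory-related lookups would pass a real allowed_nodeset; task-placement
lookups could pass NULL for it and so keep the disallowed NUMA node whose local
cores are still allowed.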

> 

> Brice
>>> 
>>> HTH
>>> Ralph
>>> 
>>>> On Oct 21, 2016, at 5:16 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>> 
>>>> Hello
>>>> 
>>>> Based on recent discussion about hwloc_topology_load() being slow on
>>>> some "large" platforms (almost 1 second on KNL), here's a new feature
>>>> proposal:
>>>> 
>>>> We've been recommending the use of XML to avoid multiple expensive
>>>> discoveries: export to XML once at boot, and reload from XML for each
>>>> actual process using hwloc. The main limitation is cgroups: resource
>>>> managers use cgroups to restrict the processors and memory that are
>>>> actually available to each job. So the topology seen by different jobs
>>>> on the same machine is slightly different from the main XML, which
>>>> contained everything because it was created outside of cgroups during boot.
>>>> 
>>>> So we're looking at adding a new topology flag that loads the entire
>>>> machine from XML (or synthetic) and applies restrictions from the
>>>> local/native operating system.
>>>> 
>>>> Details at https://github.com/open-mpi/hwloc/pull/212
>>>> Comments welcome here or there.
>>>> 
>>>> Brice
>>>> 
_______________________________________________
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel
