On 21/10/2016 17:21, r...@open-mpi.org wrote:
> I should add: this does beg the question of how a proc “discovers” its 
> resource constraints without having access to the hwloc tree. One possible 
> solution - the RM already knows the restrictions, and so it could pass those 
> down at proc startup (e.g., as part of the PMIx info). We could pass whatever 
> info hwloc would like passed into its calls - doesn’t have to be something 
> “understandable” by the proc itself.

Retrieving cgroups info from Linux isn't expensive, so my feeling was to
still have compute processes do it. But, indeed, we could also avoid
that step by having the caller pass a hwloc_bitmap_t for allowed PUs and
another one for allowed NUMA nodes. More below.

>
>> On Oct 21, 2016, at 8:15 AM, r...@open-mpi.org wrote:
>>
>> Hmmm...I think maybe we are only seeing a small portion of the picture here. 
>> There are two pieces of the problem when looking at large SMPs:
>>
>> * time required for discovery - your proposal is attempting to address that, 
>> assuming that the RM daemon collects the topology and then communicates it 
>> to each process (which is today’s method)

There's actually an easy way to do that: export the topology to XML
during boot, then set HWLOC_XMLFILE=/path/to/xml in the environment of
the processes.

>>
>> * memory footprint. We are seeing over 40MBytes being consumed by hwloc 
>> topologies on fully loaded KNL machines, which is a disturbing number

I'd be interested in knowing whether these 40MB are hwloc_obj
structures, bitmaps, info strings, etc. Do you already have that
information?

>> Where we are headed is to having only one copy of the hwloc topology tree on 
>> a node, stored in a shared memory segment hosted by the local RM daemon.

And would you need some sort of relative pointers so that all processes
can traverse parent/child/sibling/... pointers from their own mapping at
a random address in their address space?

>>  Procs will then access that tree to obtain any required info. Thus, we are 
>> less interested in each process creating its own tree based on an XML 
>> representation passed to it by the RM, and more interested in having the 
>> hwloc search algorithms correctly handle any resource restrictions when 
>> searching the RM’s tree.
>>
>> In other words, rather than (or perhaps, in addition to?) filtering the XML, 
>> we’d prefer to see some modification of the search APIs to allow a proc to 
>> pass in its resource constraints, and have the search algorithm properly 
>> consider them when returning the result. This eliminates all the XML 
>> conversion overhead, and resolves the memory footprint issue.

What do you mean by "search algorithm"? We have many functions to walk
the tree: by levels, from top to bottom, etc. Passing such resource
constraints to all of them isn't easy. And we also have explicit
pointers between objects.

Maybe define a good basic set of interesting functions for your search
algorithm and duplicate these in a new hwloc/allowed.h, with a new
allowed_cpuset parameter? Whenever they find an object, they check
whether hwloc_bitmap_intersects(obj->cpuset, allowed_cpuset). If it
returns false, that object is ignored.

There's also a related change that I wasn't ready/sure enough to try
yet: obj->allowed_cpuset is currently just a duplicate of obj->cpuset
in the default case. When the WHOLE_SYSTEM topology flag is set, it's a
binary AND between obj->cpuset and root->allowed_cpuset. That's quite a
lot of duplication. We could remove all these allowed_{cpuset,nodeset}
from objects and have a single topology->allowed_cpuset instead. Most
users don't care and wouldn't see the difference. Others would pass the
WHOLE_SYSTEM flag and either use hwloc/allowed.h or do things manually:
* ignore an object if !hwloc_bitmap_intersects(obj->cpuset,
allowed_cpuset), like what hwloc/allowed.h would do.
* bind using:
  hwloc_cpuset_t set = hwloc_bitmap_dup(obj->cpuset);
  hwloc_bitmap_and(set, set, allowed_cpuset);
  hwloc_set_cpubind(topology, set, 0);
  hwloc_bitmap_free(set);

allowed_cpuset can be either a new topology->allowed_cpuset retrieved
by the current process from the OS, or a caller-provided allowed_cpuset
that came from the RM.

I only talked about allowed_cpuset above, but there's also an
allowed_nodeset. What happens if a NUMA node is disallowed but its
local cores are allowed? We want to ignore that NUMA node when looking
up NUMA nodes for manipulating memory, but we don't want to ignore it
when looking up NUMA nodes and their children for placing tasks. It's
not clear to me how to handle these cases. Should all the new functions
receive both allowed_cpuset and allowed_nodeset, with either one
allowed to be NULL?



By the way, obj->complete_{cpuset,nodeset} is also something we could
drop, keeping just a topology->complete_{cpuset,nodeset} that says "by
the way, there are other resources that we don't know much about, or
that are offline, etc.".

Brice



>>
>> HTH
>> Ralph
>>
>>> On Oct 21, 2016, at 5:16 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>
>>> Hello
>>>
>>> Based on recent discussion about hwloc_topology_load() being slow on
>>> some "large" platforms (almost 1 second on KNL), here's a new feature
>>> proposal:
>>>
>>> We've been recommending the use of XML to avoid multiple expensive
>>> discoveries: export to XML once at boot, and reload from XML for each
>>> actual process using hwloc. The main limitation is cgroups: resource
>>> managers use cgroups to restrict the processors and memory that are
>>> actually available to each job. So the topology of different jobs on the
>>> same machine is actually slightly different from the main XML that
>>> contained everything when it was created outside of cgroups during boot.
>>>
>>> So we're looking at adding a new topology flag that loads the entire
>>> machine from XML (or synthetic) and applies restrictions from the
>>> local/native operating system.
>>>
>>> Details at https://github.com/open-mpi/hwloc/pull/212
>>> Comments welcome here or there.
>>>
>>> Brice
>>>
>>> _______________________________________________
>>> hwloc-devel mailing list
>>> hwloc-devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel

