Been mulling this for a few days; here are my thoughts...

On Jan 7, 2010, at 1:35 PM, Samuel Thibault wrote:

> Considering future network topology support, I believe we probably need
> to fix a couple of things before releasing 1.0.  Just to sum up a
> bunch of points that have been raised in the past months:
> 
> - there should be a way to have the complete topology in just one tree,
>   so you can browse in it and assign tasks/process/whatever in it,
>   according to architectural details provided by hwloc, but also network
>   details like bandwidth etc.

Are you thinking of adding bandwidth attributes?  Or are you thinking of adding 
weighting between objects in the hierarchy?  Or ...?

> - the core of hwloc mustn't force any kind of tools, it must be easy
>   to either build something around hwloc detection and binding
>   functions, or load detection & binding plugins.
> 
> The way I see it is to provide a hwloc_topology_combine() function that
> takes a series of several hwloc_topology_t trees and an object type,
> and builds a tree that contains a new object of that type, under which
> the trees appear.  That combination can actually already be done by
> hand by catenating xml files. For instance, on a simple cluster you'd
> run lstopo on each machine and save xml files, load them together,
> combine them under a "network" object (being able to register dynamic
> object types should be easy), and save the result as an xml file, which
> thus contains the complete topology of the cluster. A task dispatcher
> can thus browse it at will etc. When it comes to binding, it'd be
> the task dispatcher's role to first run the application to the target
> machine, and there run hwloc to perform the actual binding, according to
> the cpuset in the tree.

All sounds good.
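
To check that I'm reading the proposal correctly, here is a rough sketch in C of what the combining step might look like. Everything in it is made up for illustration: struct toy_obj and toy_combine() are hypothetical stand-ins, not hwloc's hwloc_obj or the proposed hwloc_topology_combine(). It only shows the shape of the idea: N per-machine trees are hung under one new root, and that root's cpuset is left unset.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a topology object; NOT the real hwloc_obj. */
struct toy_obj {
    const char *type;           /* e.g. "Machine", "Network" */
    int has_cpuset;             /* 0 models "cpuset left NULL" */
    unsigned long cpuset;       /* toy bitmask, meaningful only if has_cpuset */
    struct toy_obj **children;
    size_t nchildren;
};

/* Hang n existing trees under a new root of the given type.
 * The new root gets no cpuset: cpusets from different machines would
 * conflict, so fixing them up is left to the caller.
 * Returns NULL on allocation failure. */
struct toy_obj *toy_combine(const char *type, struct toy_obj **trees, size_t n)
{
    assert(type != NULL);
    struct toy_obj *root = calloc(1, sizeof(*root));
    if (root == NULL)
        return NULL;
    root->type = type;
    root->has_cpuset = 0;
    if (n > 0) {
        root->children = malloc(n * sizeof(*root->children));
        if (root->children == NULL) {
            free(root);
            return NULL;
        }
        for (size_t i = 0; i < n; i++)
            root->children[i] = trees[i];
    }
    root->nchildren = n;
    return root;
}
```

In this sketch the per-machine trees are referenced, not copied, and each child keeps its own cpuset, so a dispatcher could still hand that cpuset to hwloc on the target machine.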

> Now, coming to semantic changes:
> - The top node of the tree wouldn't necessarily be a system object.
>   Actually, having always the top object having the system type is not
>   providing any useful information :), and it makes us duplicate fields
>   between system and machine. On usual (non-Kerrighed) machines, the top
>   node would just be machine. On Kerrighed systems, the top node would
>   be system. On networked systems, the top node would be a switch or the
>   Internet :)
>   As a consequence, hwloc_get_system_obj would have to be renamed to
>   hwloc_get_root_obj.
> - Objects of network trees may not have cpusets defined  (Trees obtained
>   directly from hwloc with default parameters would still have cpusets
>   on every node however).  It does not make sense to merge cpusets of
>   different machines (they will conflict), and things like shifting
>   cpusets to be able to merge them would probably only bring issues.
>   That being said, that does not prevent from writing a transparency
>   plugin that automatically discovers the network topology, shifts
>   cpusets, and when requested for binding, automatically migrates to
>   the machine according to the shift, and uses the underlying OS hooks
>   to perform the binding.  My point is that the hwloc combining operation
>   wouldn't fix cpusets itself and leave them NULL. The caller of the
>   combining operation will be responsible for that.

More generally -- some objects can be bound to, some cannot.  I assume (per 
Brice's reply) that we can't bind to PCI objects, so I think making this a full 
generalization is probably a good thing (especially as hwloc can understand/map 
more and more kinds of objects).  

How does this kind of thing extend to, say, co-processors (such as 
accelerators, FPGAs, GPGPUs, etc.)?

> - This also means there can't be "global" cpusets like the recently
>   added hwloc_topology_get_{topology,complete,online,allowed}_cpuset
>   functions (not released yet). These can just be moved to the hwloc_obj
>   structure, thus being available for each object, which could actually be
>   helpful btw.

I'm not sure I follow -- you say that we can't have "global" cpusets anymore 
(which I grok), but then you say that we can move them to the hwloc_obj struct. 
Isn't that the top-level struct?  I probably misunderstand here.

> - Helpers that take cpuset parameters of course don't make sense any more
>   when applied to networked topologies.  But it probably doesn't make
>   sense for the caller to call them in the first place, and the caller
>   knows it since it's the caller that has first called the combining
>   operation or loaded an XML file resulting from it.

Agreed.  Perhaps we should have a general query function that can return 
whether a given object can be bound to or not (e.g., for generic tree-traversal 
kinds of functionality)...?
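
Something along these lines is what I have in mind. This is a sketch only: struct qobj, qobj_is_bindable() and qobj_count_bindable() are hypothetical names, not existing hwloc API, and I'm assuming that "can be bound to" simply means "has a non-empty cpuset".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object record for illustration; NOT hwloc's real struct. */
struct qobj {
    const char *type;
    const unsigned long *cpuset;   /* NULL when the object has no cpuset */
    struct qobj **children;
    size_t nchildren;
};

/* Assumed rule: an object is bindable iff it carries a non-empty cpuset. */
int qobj_is_bindable(const struct qobj *obj)
{
    return obj != NULL && obj->cpuset != NULL && *obj->cpuset != 0;
}

/* Generic traversal: count the bindable objects at or below obj. */
size_t qobj_count_bindable(const struct qobj *obj)
{
    if (obj == NULL)
        return 0;
    assert(obj->nchildren == 0 || obj->children != NULL);
    size_t n = qobj_is_bindable(obj) ? 1 : 0;
    for (size_t i = 0; i < obj->nchildren; i++)
        n += qobj_count_bindable(obj->children[i]);
    return n;
}
```

A tree walker could then skip network or PCI objects without knowing anything about their types.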

>   If, however, at some point (after having distributed tasks between
>   machines for instance), operations with cpusets are desired, we could
>   provide a duplication function that takes a topology object parameter
>   A and builds a new topology tree containing all the objects under
>   A, A thus being its root, and then (if A indeed has a cpuset, but
>   the caller should know that) helpers taking cpuset parameters can be
>   called.
> 
> So, to sum it up:
> - hwloc_get_obj_by_depth(topo, 0, 0) may not be a system object any
>   more (actually it'd only be one on kerrighed systems).
> - no global cpuset field, only in objects.

Some generic points...

1. How about defining a small set of generic operations based on what you 
described above?  E.g.:

- copy: take a tree with root R; copy it to a new tree (note that R may not be 
the root of the original tree)
- remove: take a tree with root R; find object X within that tree; remove X and 
all of its children
- insert: take two trees with roots R and S; find object X within R; copy tree 
S to become a new child of X
- ...?
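
To make those three operations concrete, here is a minimal sketch in C on a toy first-child/next-sibling tree. struct tnode and the tnode_*() functions are hypothetical and exist only for this example; a real version would operate on hwloc's own object type, and objects are identified here by a plain integer id for simplicity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type for illustration; not hwloc's object type. */
struct tnode {
    int id;
    struct tnode *first_child;
    struct tnode *next_sibling;
};

struct tnode *tnode_new(int id)
{
    struct tnode *n = calloc(1, sizeof(*n));
    if (n != NULL)
        n->id = id;
    return n;
}

/* Free n and everything below it (siblings of n are left alone). */
void tnode_free(struct tnode *n)
{
    if (n == NULL)
        return;
    struct tnode *c = n->first_child;
    while (c != NULL) {
        struct tnode *next = c->next_sibling;
        tnode_free(c);
        c = next;
    }
    free(n);
}

/* copy: deep-copy the subtree rooted at r; r need not be a tree root. */
struct tnode *tnode_copy(const struct tnode *r)
{
    if (r == NULL)
        return NULL;
    struct tnode *copy = tnode_new(r->id);
    if (copy == NULL)
        return NULL;
    struct tnode **tail = &copy->first_child;
    for (const struct tnode *c = r->first_child; c != NULL; c = c->next_sibling) {
        *tail = tnode_copy(c);
        if (*tail == NULL) {        /* allocation failed: undo partial copy */
            tnode_free(copy);
            return NULL;
        }
        tail = &(*tail)->next_sibling;
    }
    return copy;
}

/* Depth-first search for the node with the given id at or below r. */
struct tnode *tnode_find(struct tnode *r, int id)
{
    if (r == NULL)
        return NULL;
    if (r->id == id)
        return r;
    for (struct tnode *c = r->first_child; c != NULL; c = c->next_sibling) {
        struct tnode *hit = tnode_find(c, id);
        if (hit != NULL)
            return hit;
    }
    return NULL;
}

/* remove: unlink and free the first strict descendant of r whose id
 * matches, along with its children. Returns 1 if removed, 0 if not found. */
int tnode_remove(struct tnode *r, int id)
{
    if (r == NULL)
        return 0;
    for (struct tnode **link = &r->first_child; *link != NULL;
         link = &(*link)->next_sibling) {
        if ((*link)->id == id) {
            struct tnode *victim = *link;
            *link = victim->next_sibling;
            tnode_free(victim);
            return 1;
        }
        if (tnode_remove(*link, id))
            return 1;
    }
    return 0;
}

/* insert: copy tree s and append the copy as the last child of x.
 * Returns 0 on success, -1 if the copy could not be allocated. */
int tnode_insert(struct tnode *x, const struct tnode *s)
{
    assert(x != NULL && s != NULL);
    struct tnode *copy = tnode_copy(s);
    if (copy == NULL)
        return -1;
    struct tnode **tail = &x->first_child;
    while (*tail != NULL)
        tail = &(*tail)->next_sibling;
    *tail = copy;
    return 0;
}
```

Note that insert copies S instead of stealing it, so the caller still owns (and must free) the original; whether the real thing should copy or take ownership is an open design question.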

> The second point shouldn't harm, it's just a matter of fixing the (not
> yet released) API.  The first point clearly contradicts the current
> documentation (“HWLOC_OBJ_SYSTEM will always be the highest”),
> but I believe not breaking it as soon as now will tie us from further
> extensions anyway, and I don't think much code relies on it anyway.

Agreed.

> The plan I see is that for 1.0 we only check that catenating .XML files
> by hand to build misc levels representing network layers does indeed
> work, which should mean that actual combining functions etc. should be
> possible to implement later.

FWIW, I'd prefer to see the combining/etc. functions ASAP -- we could 
definitely use such things in Open MPI...

-- 
Jeff Squyres
jsquy...@cisco.com

