Been mulling this over for a few days; here are my thoughts...
On Jan 7, 2010, at 1:35 PM, Samuel Thibault wrote:

> Considering future network topology support, I believe we probably need
> to fix a couple of things before releasing 1.0.  Just to sum up a bunch
> of points that have been raised in the past months:
>
> - there should be a way to have the complete topology in just one tree,
> so you can browse in it and assign tasks/process/whatever in it,
> according to architectural details provided by hwloc, but also network
> details like bandwidth etc.

Are you thinking of adding bandwidth attributes?  Or are you thinking of
adding weighting between objects in the hierarchy?  Or ...?

> - the core of hwloc mustn't force any kind of tools, it must be easy
> to either build something around hwloc detection and binding
> functions, or load detection & binding plugins.
>
> The way I see it is to provide a hwloc_topology_combine() function that
> takes a series of several hwloc_topology_t trees and an object type,
> and builds a tree that contains a new object of that type, under which
> the trees appear.  That combination can actually already be done by
> hand by catenating xml files.  For instance, on a simple cluster you'd
> run lstopo on each machine and save xml files, load them together,
> combine them under a "network" object (being able to register dynamic
> object types should be easy), and save the result as an xml file, which
> thus contains the complete topology of the cluster.  A task dispatcher
> can thus browse it at will etc.  When it comes to binding, it'd be
> the task dispatcher's role to first run the application on the target
> machine, and there run hwloc to perform the actual binding, according
> to the cpuset in the tree.

All sounds good.  (To check my understanding, I've sketched below -- after
my replies to this chunk -- what such a combine sequence might look like.)

> Now, coming to semantic changes:
>
> - The top node of the tree wouldn't necessarily be a system object.
> Actually, always having the top object be the system type does not
> provide any useful information :), and it makes us duplicate fields
> between system and machine.  On usual (non-Kerrighed) machines, the top
> node would just be machine.  On Kerrighed systems, the top node would
> be system.  On networked systems, the top node would be a switch or the
> Internet :)
>
> As a consequence, hwloc_get_system_obj would have to be renamed to
> hwloc_get_root_obj.
>
> - Objects of network trees may not have cpusets defined (trees obtained
> directly from hwloc with default parameters would still have cpusets
> on every node, however).  It does not make sense to merge cpusets of
> different machines (they will conflict), and things like shifting
> cpusets to be able to merge them would probably only bring issues.
> That being said, that does not prevent writing a transparency plugin
> that automatically discovers the network topology, shifts cpusets,
> and, when requested for binding, automatically migrates to the machine
> according to the shift and uses the underlying OS hooks to perform the
> binding.  My point is that the hwloc combining operation wouldn't fix
> cpusets itself and would leave them NULL.  The caller of the combining
> operation will be responsible for that.

More generally -- some objects can be bound to, some cannot.  I assume
(per Brice's reply) that we can't bind to PCI objects, so I think making
this a full generalization is probably a good thing (especially as hwloc
can understand/map more and more kinds of objects).

How does this kind of thing extend to, say, co-processors (such as
accelerators, FPGAs, GPGPUs, etc.)?
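Just to make sure I'm reading the combine proposal correctly, here's
roughly the calling sequence I imagine (error checking omitted;
hwloc_topology_combine() and HWLOC_OBJ_NETWORK are made up -- they're
just my interpretation of what you described, not anything that exists
today):

    #include <stdio.h>
    #include <hwloc.h>

    static void build_cluster_topology(void)
    {
        hwloc_topology_t node[2], cluster;
        const char *xml[2] = { "machine0.xml", "machine1.xml" };
        int i;

        /* Load the per-machine topologies exported as XML by lstopo */
        for (i = 0; i < 2; i++) {
            hwloc_topology_init(&node[i]);
            hwloc_topology_set_xml(node[i], xml[i]);
            hwloc_topology_load(node[i]);
        }

        /* Hypothetical: create a new "network" object and hang the two
           machine trees underneath it; cpusets of the combined objects
           are left NULL, per the proposal */
        hwloc_topology_combine(&cluster, HWLOC_OBJ_NETWORK, node, 2);

        /* With the proposed rename, the root of the combined tree is
           the network object, not a system object */
        printf("root type: %s\n",
               hwloc_obj_type_string(hwloc_get_root_obj(cluster)->type));
    }

Is that about what you had in mind -- i.e., the caller loads the
per-machine trees itself and the combine call only creates the new
parent object (leaving cpusets alone)?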
> - This also means there can't be "global" cpusets like the recently
> added hwloc_topology_get_{topology,complete,online,allowed}_cpuset
> functions (not released yet).  These can just be moved to the hwloc_obj
> structure, thus being available for each object, which could actually
> be helpful, btw.

I'm not sure I follow -- you say that we can't have "global" cpusets
anymore (which I grok), but then you say that we can move them to the
hwloc_obj struct.  Isn't that the top-level struct?  I probably
misunderstand here.

> - Helpers that take cpuset parameters of course don't make sense any
> more when applied to networked topologies.  But it probably doesn't
> make sense for the caller to call them in the first place, and the
> caller knows it, since it's the caller that first called the combining
> operation or loaded an XML file resulting from it.

Agreed.  Perhaps we should have a general query function that can return
whether a given object can be bound to or not (e.g., for generic
tree-traversal kinds of functionality)...?

> If, however, at some point (after having distributed tasks between
> machines, for instance), operations with cpusets are desired, we could
> provide a duplication function that takes a topology object parameter
> A and builds a new topology tree containing all the objects under A,
> A thus being its root; then (if A indeed has a cpuset, but the caller
> should know that) helpers taking cpuset parameters can be called.
>
> So, to sum it up:
> - hwloc_get_obj_by_depth(topo, 0, 0) may not be a system object any
> more (actually it'd only be one on Kerrighed systems).
> - no global cpuset field, only in objects.

Some generic points...

1. How about defining a small set of generic operations based on what you
   described above?  E.g.:

   - copy: take a tree with root R; copy it to a new tree (note that R
     may not be the root of the original tree)
   - remove: take a tree with root R; find object X within that tree;
     remove X and all of its children
   - insert: take two trees with roots R and S; find object X within R;
     copy tree S to become a new child of X
   - ...?

   (Rough straw-man prototypes for these -- plus the "can I bind to this
   object?" query above -- are sketched after my sig, below.)

> The second point shouldn't do any harm; it's just a matter of fixing
> the (not yet released) API.  The first point clearly contradicts the
> current documentation (“HWLOC_OBJ_SYSTEM will always be the highest”),
> but I believe that not breaking it now would tie our hands for further
> extensions anyway, and I don't think much code relies on it anyway.

Agreed.

> The plan I see for 1.0 is that we only check that catenating XML files
> by hand to build misc levels representing network layers does indeed
> work, which should mean that actual combining functions etc. should be
> possible to implement later.

FWIW, I'd prefer to see the combining/etc. functions ASAP -- we could
definitely use such things in Open MPI...

-- 
Jeff Squyres
jsquy...@cisco.com
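P.S.  To make the "generic operations" point a bit more concrete, here
are the kinds of straw-man prototypes I had in mind (names and signatures
are entirely made up -- none of this exists in hwloc today):

    #include <hwloc.h>

    /* copy: duplicate the subtree rooted at obj into a new topology
       (obj need not be the root of its original topology) */
    int hwloc_topology_dup_subtree(hwloc_topology_t *newtopo,
                                   hwloc_topology_t topo, hwloc_obj_t obj);

    /* remove: delete obj and all of its children from topo */
    int hwloc_topology_remove_subtree(hwloc_topology_t topo, hwloc_obj_t obj);

    /* insert: copy the whole tree src to become a new child of parent
       inside dst */
    int hwloc_topology_insert_subtree(hwloc_topology_t dst,
                                      hwloc_obj_t parent,
                                      hwloc_topology_t src);

    /* query: can this object be bound to?  (e.g., 0 for PCI or network
       objects) */
    int hwloc_obj_is_bindable(hwloc_topology_t topo, hwloc_obj_t obj);

The last one is the "can I bind to this?" query I mentioned above; a
trivial first cut could probably just check whether obj->cpuset is
non-NULL.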