On 01/01/2014 08:25, Brice Goglin wrote:
> Hello,
>
> Overall, the code looks good to me.
> Maybe the core/socket/node discovery code could be factorized to reduce
> duplication.
I will see what I can do.  At the time of writing I was more concerned
with getting it functional.

> There's a little fix attached to improve error management in instantiate().

I considered that, but hwloc_debug() disappears in a non-debug build from
what I can tell, so a regular user would just see that they have
explicitly asked for Xen and it has been skipped over.  What would be
useful would be an hwloc_warning()/hwloc_error() which is independent of
--{en,dis}able-debug.

>> First of all, I have hacked at the m4, by copying surrounding code, and
>> it now appears to work sanely for me, including --{en,dis}able-xen
>> configure options.  I have no idea whether what I have done is appropriate.
>
> The attached patch should extend your code to also support building the
> xen component as a plugin so we don't enforce the libxen hard dependency
> inside the main hwloc library.

I was under the impression that the m4 so far would silently drop support
for Xen if the library was not available and the user didn't explicitly
pass --enable-xen, or have I misunderstood something?

> However, we need to export 2 additional functions (setup_pu_level and
> alloc_obj_cpusets) to plugins first (Xen will be the first plugin-able
> component to do the "core" discovery, that's why these functions were
> not exported earlier).  I'll work at this and then we'll remove the
> "HACK" #include from my patch.
>
>> Xen support itself is only usable if explicitly requested, via the
>> presence of the HWLOC_XEN environment variable.  The Xen backend has a
>> higher priority than native, and excludes all other CPU topology
>> gathering modules, as the native OS will see the fake topology, and the
>> x86 module has no idea which pcpus its vcpus are currently executing on,
>> so can't map cpuid results back to reality.
>
> If that environment variable is the way you want to enable Xen discovery
> in the end, then priority 52 is a good solution.
> Maybe use 55 instead in case we ever have to put something between
> native and Xen.

It was the easiest way I could find to select between the full system
topology and the dom0 fake topology.  I am open to any better
suggestions.  I shall bump the priority to 55.

> Regarding component flags, the usual way to exclude ALL other components
> is to put ~0 instead of
> HWLOC_DISC_COMPONENT_TYPE_CPU | HWLOC_DISC_COMPONENT_TYPE_GLOBAL

I didn't want to exclude PCI devices, as dom0 does have an accurate view
of them.  (Although at the moment I can't associate them with NUMA nodes.)

>> As for developing the backend, the documentation was a little lacking.
>> It would have vastly helped had there been a sentence describing how one
>> is expected to build the topology.  What I realised, after far too long
>> staring at the spaghetti code in other backends, was that
>> hwloc_insert_object_by_cpuset() appears to be my friend, and it
>> subsequently made the process very simple.
>
> Yes sorry, there are very very few people that develop new backends so
> the documentation doesn't really cover these internal details.  The only
> doc about these is basically the comments in hwloc/plugins.h
>
> We should also point to a trivial backend as an example that is easy to
> read (certainly not Linux, which is horrible because it can use info from
> sysfs, /proc/cpuinfo, device-tree, etc.).
>
>> At the moment, topology-xen appears capable of constructing the PUs,
>> cores, sockets and NUMA nodes.  I have a stack of queries and bugfixes
>> against Xen which I shall bring up on xen-devel in due course.
>
> Crazy question: what if you put hwloc in the hypervisor and just pass
> the XML output from the hypervisor to the guest instead of passing all
> this socket/node/core information?

Xen, by design, is tiny (170 KLoC in total, including x86_64 and
arm32/arm64), leaving as much work as possible to dom0.  Xen has no
device drivers, no AML VM and certainly no XML library.
Most of its API (of which this socket/node/core information is a typical
example) is designed for consumption by guest kernels as much as by guest
userspace.  So sadly, the idea is quite crazy, and would certainly be a
no-go as far as suggesting it on xen-devel goes.

>> Chief among them is the way Xen currently signals an offline PU, which
>> is to trash its location information in the system topology.  This
>> means that I can identify a specific PU as being offline, but can only
>> infer its position in the topology because I happen to know Xen writes
>> the records sequentially.
>
> I think this goes in the "complete_cpuset".  See below.
>
>> You might notice that the xml is a little thin on details.  One problem
>> I have is how to integrate things like the DMI information?  I know for
>> certain that the linux component will get the correct DMI information
>> (as dom0 gets the real SMBIOS tables), but it is not in an easy form to
>> consume from outside the linux module.  Then comes the question of how
>> to use the *BSD DMI information on BSD systems which can use the xen
>> component as-is.  One idea would be to have a "native dmi information"
>> function pointer which can be optionally implemented, but that would be
>> somewhat of an architecture overhaul.
>
> DMI info isn't widely used anyway.  Some people may use it to define
> some quirks for some machines but I don't think it's critical to have it
> in Xen too.
>
>> Also I suspect it would require access to the native component's
>> private data, which doesn't appear to exist for the duration of Xen's
>> topology gathering.
>
> It exists.  All backends are instantiated at once, then all the
> discovery callbacks are called, then all backends' info is destroyed.
> You're not really supposed to access other backends' private data, but
> we already had to do it in the linuxpci component anyway.  So we'd have
> to define a proper solution if we go there.
>
>> One thing I noticed was about allowed_{cpu,node}set.
>> From what I can gather, this is to do with process binding, which is
>> meaningless in the context of the Xen system topology.  What is the
>> approved way of setting these all to 0?
>
> This is related to PUs/nodes that are restricted by Linux control groups.
> If you just ignore these allowed fields, they will be set equal to the
> main cpuset/nodeset.
>
>> A problem I encountered was the difference between cpuset, online_cpuset
>> and complete_cpuset.  I can see that online_cpuset is likely to be a
>> subset of the others, but I can't find a definitive difference between
>> the cpuset and complete_cpuset.  Can anyone enlighten me?
>
> These fields are also explained in hwloc.h.
>
> cpuset   => contains PUs that exist, we know where they are, they are
>             online, they are allowed (allowed with respect to cgroups
>             on Linux)
> online   => exist, we know where they are, they are online, they may
>             be DISALLOWED
> allowed  => exist, we know where they are, may be OFFLINE
> complete => exist, may be DISALLOWED/OFFLINE, we don't know where all
>             of them are

So they are - I clearly missed them when reading the doxygen docs.

>> What is the canonical way of stating that a certain PU is offline?  Xen
>> doesn't really do offline cpus of its own accord at the moment, but
>> does have hypercalls to explicitly online and offline cpus.  In the
>> case that one is offline, I suspect my current code will cause the PU
>> to fall out of the rest of the topology, rather than stay within it
>> and be marked differently.
>
> What we do in topology-linux.c is we clear the corresponding bit from
> the top-level online_cpuset (hwloc_get_root_obj(topology)->online_cpuset).
> Then the core will propagate that everywhere at the end of the discovery.
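For reference, my current per-PU discovery is roughly the following sketch
(against the 1.x-era hwloc/plugins.h API; "topology", "pu_index" and
"online" are stand-ins for whatever the Xen sysctl data provides, and
error handling is omitted):

```c
/* Sketch: discover one PU, assuming the hwloc 1.x plugin API.
 * "pu_index" and "online" are placeholders for values obtained
 * from Xen's sysctls; error handling omitted. */
hwloc_obj_t obj = hwloc_alloc_setup_object(HWLOC_OBJ_PU, pu_index);
obj->cpuset = hwloc_bitmap_alloc();
hwloc_bitmap_set(obj->cpuset, pu_index);
hwloc_insert_object_by_cpuset(topology, obj);

if (!online)
  /* Clear the bit from the top-level online_cpuset, as
   * topology-linux.c does, and let the core propagate it. */
  hwloc_bitmap_clr(hwloc_get_root_obj(topology)->online_cpuset, pu_index);
```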
Does that mean that hwloc_insert_object_by_cpuset() will work correctly
when inserting an object with a bit set in its cpuset which is clear in
the root online_cpuset, or will I have to explicitly construct each
socket/node/core object with the correct cpuset and online_cpuset
information?

> Thanks for all your work and happy new year !
>
> Brice

Thanks for all the help, and a happy new year to you too.  I shall try
some more playing around, especially when trying to get offline cpus
working correctly.

~Andrew
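P.S.  For concreteness, the registration I have in mind after the
priority bump looks roughly like this; a sketch following the 1.x
struct hwloc_disc_component layout from hwloc/plugins.h, where
hwloc_xen_component_instantiate is just a placeholder name:

```c
/* Sketch of the component registration, following the 1.x
 * struct hwloc_disc_component layout in hwloc/plugins.h.
 * hwloc_xen_component_instantiate is a placeholder name. */
static struct hwloc_disc_component hwloc_xen_disc_component = {
  HWLOC_DISC_COMPONENT_TYPE_CPU,
  "xen",
  /* Exclude the other CPU and global discovery components, but
   * deliberately not PCI: dom0 has an accurate view of PCI devices. */
  HWLOC_DISC_COMPONENT_TYPE_CPU | HWLOC_DISC_COMPONENT_TYPE_GLOBAL,
  hwloc_xen_component_instantiate,
  55, /* above the native OS backends */
  NULL
};
```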