On 17/09/10 00:57, Samuel Thibault wrote:
Alexey Kardashevskiy, le Thu 16 Sep 2010 15:57:47 +1000, a écrit :
- where do I put IBM-specific code?
Is the device tree linux-specific ? If so, it can stay in linux file as
long as it's not 30k lines :) We already have both sysfs and
/proc/cpuinfo code there anyway.
It is powerpc-specific. It is mapped from the system firmware (aka bios)
by the powerpc kernel. However it is just a folder within /proc so it is
usual linux folder. But PowerPC might be not the only architecture which
uses the same pathname for the same thing.
AIUI from google searches, it is an Open Firmware standard, also
used on the OLPC, and thus would be not only for PowerPC. Since
that's not really Linux-specific, it would take place somewhere
else than topology-linux.c, but the standard doesn't seem to say
how/where this tree is exposed. Apparently there's even some sort
of text format for it, see arch/powerpc/boot/dts/*.dts. I'd thus
say that all OS-independant code should go to a topology-of.c file,
and topology-linux.c would provide generic tree functions to browse
/proc/device-tree, i.e. of_get_str, of_get_int_arr, and generic tree
browsing functions, which topology-of.c can then use to browse the
tree and fetch information without knowing where it is taken from. A
topology-dts.c file would provide the same functions, but this time
providing data from a .dts file, and that combination could be a backend
replacing the /proc and /sys parsing completely, similarly to loading a
.xml file. Note that I'm not asking you to do all this, I'm just asking
to rework the function interfaces a little bit to have things already
cleanly separated for anybody who would feel like adding another OS
support or parsing .dts files some day, I believe that shouldn't be too
much work for you.
Regarding topology walking - there is actually nothing device-tree
special in reading strings and numbers from a device-tree, it is just
common functions which (I think) should be placed in utils/misc.c. I
named functions like hwloc_read_str.
Regarding open firmware and device trees - yes, it is IEEE1275 which can
be implemented anywhere. However, in our case, there are IBM-specific
properties (ibm,phandle, ibm,ppc-interrupt-server#s) which make all this
very IBM specific. I do not know how to deal with it better.
I reworked the patch today, it is attached. It is made against today's
SVN tree.
> From what I can read in drivers/of/fdt.c, the binary values are passed
directly from the table in memory. I wonder which endian order this is
supposed to be, but since I can see be32_to_cpu all around in the code
there, I guess it is assumed to be all big endian. There are thus also a
few fixes to be done in your code:
- Make sure you have proper sizes: code like
unsigned int reg = -1;
if (0 != of_get_int_arr(cpu, "reg",®, sizeof(reg), root_fd))
is not really safe, you should use uint32_t to make sure of the size of
the integer.
- Once the read is done, values need to be swapped on little-endian
machines. You could use ntohl for that.
The hwlow code is compiled on powerpc and reads device-tree on a powerpc
machine. If endianess does not match in this case, we have got a big
problem, this should not happen. Or I am missing something.
The point regarding uint32_t vs. (unsigned int) is taken.
- Constant-sized buffers are never a good idea. I know there are already
some in topology-linux, but that's not a reason when it should be easy
to make things properly :)
Fixed :)
It's thus say that you should rather provide the following functions:
static inline int
of_get_int(const char *p, const char *p1, uint32_t *value, int root_fd);
Gets one integer into `value'.
done :)
static inline uint32_t *
of_get_int_arr(const char *p, const char *p1, int root_fd);
Returns the array of integers. Yes, this makes a lot of
allocation/deallocation, but that should be neglectible compared to the
system calls and will save us nasty length-bugs later.
How many integers were read by this function? What allocator was used
for the returned value? Sure I can presume it was malloc, but still it
is not a good style :)
Regarding the latest patch about numa nodes numbers - do I understand
correctly that hwloc_cpuset_t is just a bit array which can be used
everywhere where we need a bit array, not just for CPUs? :-)
Index: src/topology-linux.c
===================================================================
--- src/topology-linux.c (revision 2443)
+++ src/topology-linux.c (working copy)
@@ -1391,6 +1391,238 @@
*found = nbnodes;
}
+static int
+hwloc_read_str(const char *p, const char *p1, char *buf, size_t nbuf, int root_fd)
+{
+ char fname[256];
+ int ret = -1;
+ sprintf(fname, "%s/%s", p, p1);
+ FILE *file = hwloc_fopen(fname, "r", root_fd);
+ if (NULL == file)
+ return ret;
+ if (NULL != fgets(buf, nbuf, file))
+ ret = 0;
+ fclose(file);
+ buf[nbuf-1] = 0;
+ return ret;
+}
+
+static size_t
+hwloc_read_raw(const char *p, const char *p1, void *buf, size_t nbuf,
+ int root_fd)
+{
+ char fname[256];
+ size_t cb = 0;
+ sprintf(fname, "%s/%s", p, p1);
+ FILE *file = hwloc_fopen(fname, "r", root_fd);
+ if (NULL == file)
+ return cb;
+ cb = fread(buf, 1, nbuf, file);
+ fclose(file);
+ return cb;
+}
+
+static size_t
+hwloc_read_uint32(const char *p, const char *p1, uint32_t *buf, int root_fd)
+{
+ return hwloc_read_raw(p, p1, buf, sizeof(*buf), root_fd);
+}
+
+typedef struct {
+ unsigned int n, allocated;
+ struct {
+ hwloc_cpuset_t cpuset;
+ uint32_t ibm_phandle;
+ uint32_t l2_cache;
+ char name[64];
+ } *p;
+} device_tree_cpus_t;
+
+static void
+add_device_tree_cpus_node(device_tree_cpus_t *cpus, hwloc_cpuset_t cpuset,
+ uint32_t l2_cache, uint32_t ibm_phandle, const char *name)
+{
+ if (cpus->n == cpus->allocated) {
+ cpus->allocated += 64;
+ cpus->p = realloc(cpus->p, cpus->allocated * sizeof(cpus->p[0]));
+ }
+ cpus->p[cpus->n].ibm_phandle = ibm_phandle;
+ cpus->p[cpus->n].cpuset = (NULL == cpuset)?NULL:hwloc_cpuset_dup(cpuset);
+ cpus->p[cpus->n].l2_cache = l2_cache;
+ strncpy(cpus->p[cpus->n].name, name, sizeof(cpus->p[0].name)-1);
+ ++cpus->n;
+}
+
+static int
+look_powerpc_device_tree_discover_cache(device_tree_cpus_t *cpus,
+ uint32_t ibm_phandle, unsigned int *level, hwloc_cpuset_t cpuset)
+{
+ int ret = -1;
+ if ((NULL == level) || (NULL == cpuset))
+ return ret;
+ for (unsigned int i = 0; i < cpus->n; ++i) {
+ if (ibm_phandle != cpus->p[i].l2_cache)
+ continue;
+ if (NULL != cpus->p[i].cpuset) {
+ hwloc_cpuset_or(cpuset, cpuset, cpus->p[i].cpuset);
+ ret = 0;
+ } else {
+ ++(*level);
+ if (0 == look_powerpc_device_tree_discover_cache(cpus,
+ cpus->p[i].ibm_phandle, level, cpuset))
+ ret = 0;
+ }
+ }
+ return ret;
+}
+
+static void
+look_powerpc_device_tree(struct hwloc_topology *topology)
+{
+ device_tree_cpus_t cpus = { 0 };
+ const char ofroot[] = "/proc/device-tree/cpus";
+
+ int root_fd = topology->backend_params.sysfs.root_fd;
+ DIR *dt = hwloc_opendir(ofroot, root_fd);
+ if (NULL == dt)
+ return;
+
+ struct dirent *dirent;
+ while (NULL != (dirent = readdir(dt))) {
+
+ char cpu[256];
+ sprintf(cpu, "%s/%s", ofroot, dirent->d_name);
+
+ char device_type[64];
+ if (0 > hwloc_read_str(cpu, "device_type", device_type, sizeof(device_type), root_fd))
+ continue;
+
+ uint32_t reg = -1, l2_cache = -1, ibm_phandle = -1;
+ hwloc_read_uint32(cpu, "reg", ®, root_fd);
+ hwloc_read_uint32(cpu, "l2-cache", &l2_cache, root_fd);
+ hwloc_read_uint32(cpu, "ibm,phandle", &ibm_phandle, root_fd);
+
+ if (0 == strcmp(device_type, "cache")) {
+ add_device_tree_cpus_node(&cpus, NULL, l2_cache, ibm_phandle, dirent->d_name);
+ }
+ else if (0 == strcmp(device_type, "cpu")) {
+ /* Found CPU */
+ hwloc_cpuset_t cpuset = NULL;
+ uint32_t threads[128], nthreads = 0;
+ nthreads = hwloc_read_raw(cpu, "ibm,ppc-interrupt-server#s",
+ threads, sizeof(threads), root_fd) / sizeof(threads[0]);
+ if (0 != nthreads) {
+ cpuset = hwloc_cpuset_alloc();
+ for (unsigned int i = 0; i < nthreads; ++i) {
+ hwloc_cpuset_set(cpuset, threads[i]);
+ }
+ } else if ((unsigned int)-1 != reg) {
+ cpuset = hwloc_cpuset_alloc();
+ hwloc_cpuset_set(cpuset, reg);
+ }
+
+ if (NULL == cpuset) {
+ hwloc_debug("%s has no \"reg\" property, skipping\n", cpu);
+ continue;
+ }
+ add_device_tree_cpus_node(&cpus, cpuset, l2_cache, ibm_phandle, dirent->d_name);
+
+ /* Add core */
+ struct hwloc_obj *core = hwloc_alloc_setup_object(HWLOC_OBJ_CORE, reg);
+ core->cpuset = hwloc_cpuset_dup(cpuset);
+ hwloc_insert_object_by_cpuset(topology, core);
+
+ /* Add socket */
+ /* -1 - to discuss */
+ struct hwloc_obj *socket = hwloc_alloc_setup_object(HWLOC_OBJ_SOCKET, -1);
+ socket->cpuset = hwloc_cpuset_dup(cpuset);
+ hwloc_insert_object_by_cpuset(topology, socket);
+
+ /* Add L1 cache */
+ /* Ignore Instruction caches */
+
+ /* d-cache-block-size - ignore */
+ /* d-cache-line-size - to read, in bytes */
+ /* d-cache-sets - ignore */
+ /* d-cache-size - to read, in bytes */
+ /* d-tlb-sets - ignore */
+ /* d-tlb-size - ignore, always 0 on power6 */
+ /* i-cache-* and i-tlb-* represent instruction cache, ignore */
+ uint32_t d_cache_line_size = 0, d_cache_size = 0;
+ if ( (0 != hwloc_read_uint32(cpu, "d-cache-line-size", &d_cache_line_size, root_fd)) &&
+ (0 != hwloc_read_uint32(cpu, "d-cache-size", &d_cache_size, root_fd)) ) {
+ struct hwloc_obj *cache =
+ hwloc_alloc_setup_object(HWLOC_OBJ_CACHE, -1);
+ cache->attr->cache.size = d_cache_size;
+ cache->attr->cache.depth = 1;
+ cache->attr->cache.linesize = d_cache_line_size;
+ cache->cpuset = hwloc_cpuset_dup(cpuset);
+ hwloc_debug_cpuset("cache depth 1 has cpuset %s\n", cache->cpuset);
+ hwloc_insert_object_by_cpuset(topology, cache);
+ }
+ hwloc_cpuset_free(cpuset);
+ }
+ }
+ closedir(dt);
+
+ /* No cores and L2 cache were found, exiting */
+ if (0 == cpus.n) {
+ hwloc_debug("No cores and L2 cache were found in %s, exiting\n", ofroot);
+ return;
+ }
+
+#ifdef HWLOC_DEBUG
+ for (unsigned int i = 0; i < cpus.n; ++i) {
+ hwloc_debug("%i: %s ibm,phandle=%08X l2_cache=%08X ",
+ i, cpus.p[i].name, cpus.p[i].ibm_phandle, cpus.p[i].l2_cache);
+ if (NULL == cpus.p[i].cpuset) {
+ hwloc_debug("%s\n", "no cpuset");
+ } else {
+ hwloc_debug_cpuset("cpuset %s\n", cpus.p[i].cpuset);
+ }
+ }
+#endif
+
+ /* Scan L2/L3/... caches */
+ for (unsigned int i = 0; i < cpus.n; ++i) {
+ /* Skip real CPUs */
+ if (NULL != cpus.p[i].cpuset)
+ continue;
+
+ /* Calculate cache level and CPU mask */
+ unsigned int level = 2;
+ hwloc_cpuset_t cpuset = hwloc_cpuset_alloc();
+ if (0 == look_powerpc_device_tree_discover_cache(&cpus,
+ cpus.p[i].ibm_phandle, &level, cpuset)) {
+ /* Check for "cache-unified" - if it is present, CPU has unified L1 cache */
+ /* d-cache-line-size - to read, in bytes */
+ /* d-cache-sets - ignore */
+ /* d-cache-size - to read, in bytes */
+ /* i-cache-* represent instruction cache, ignore */
+ uint32_t d_cache_line_size = 0, d_cache_size = 0;
+ char cpu[256];
+ sprintf(cpu, "%s/%s", ofroot, cpus.p[i].name);
+
+ if ( (0 != hwloc_read_uint32(cpu, "d-cache-line-size", &d_cache_line_size, root_fd)) &&
+ (0 != hwloc_read_uint32(cpu, "d-cache-size", &d_cache_size, root_fd)) ) {
+ struct hwloc_obj *c = hwloc_alloc_setup_object(HWLOC_OBJ_CACHE, -1);
+ c->attr->cache.size = d_cache_size;
+ c->attr->cache.depth = level;
+ c->attr->cache.linesize = d_cache_line_size;
+ c->cpuset = hwloc_cpuset_dup(cpuset);
+ hwloc_debug_1arg_cpuset("cache depth %d has cpuset %s\n", level, c->cpuset);
+ hwloc_insert_object_by_cpuset(topology, c);
+ }
+ }
+ hwloc_cpuset_free(cpuset);
+ }
+
+ /* Do cleanup */
+ for (unsigned int i = 0; i < cpus.n; ++i)
+ hwloc_cpuset_free(cpus.p[i].cpuset);
+ free(cpus.p);
+}
+
/* Look at Linux' /sys/devices/system/cpu/cpu%d/topology/ */
static void
look_sysfscpu(struct hwloc_topology *topology, const char *path)
@@ -1453,6 +1685,7 @@
hwloc_debug_1arg_cpuset("found %d cpu topologies, cpuset %s\n",
hwloc_cpuset_weight(cpuset), cpuset);
+ unsigned caches_added = 0;
hwloc_cpuset_foreach_begin(i, cpuset)
{
struct hwloc_obj *socket, *core, *thread;
@@ -1580,6 +1813,7 @@
hwloc_debug_1arg_cpuset("cache depth %d has cpuset %s\n",
depth, cacheset);
hwloc_insert_object_by_cpuset(topology, cache);
+ ++caches_added;
} else
hwloc_cpuset_free(cacheset);
}
@@ -1587,6 +1821,9 @@
}
hwloc_cpuset_foreach_end();
+ if (0 == caches_added)
+ look_powerpc_device_tree(topology);
+
hwloc_cpuset_free(cpuset);
}