[ add Keith and Dave for their thoughts ]

On Wed, Apr 17, 2019 at 2:46 PM Brice Goglin <[email protected]> wrote:
>
>
> On 17/04/2019 at 23:35, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 8:31 AM Brice Goglin <[email protected]> wrote:
> >>
> >> On 08/04/2019 at 21:55, Brice Goglin wrote:
> >>
> >> On 08/04/2019 at 16:56, Dan Williams wrote:
> >>
> >> Yes, I agree with all of the above, but I think we need a way to fix
> >> this independent of the HMAT data being present. The SLIT already
> >> tells the kernel enough to let tooling figure out equidistant "local"
> >> nodes. While the numa_node attribute will remain a singleton, the
> >> tooling needs to handle this case and can't assume the HMAT data will
> >> be present.
> >>
> >> So you want to export the part of the SLIT that is currently hidden from
> >> userspace because the corresponding nodes aren't registered?
> >>
> >> With the patch below, I get 17 17 28 28 in dax0.0/node_distance, which
> >> means it's close to node0 and node1.
> >>
> >> The code is pretty much a duplicate of read_node_distance() in
> >> drivers/base/node.c. Not sure it's worth factorizing such small functions?
> >>
> >> The name "node_distance" (instead of "distance" for NUMA nodes) is also
> >> subject to discussion.
> >>
> >> Here's a better patch that exports the existing routine for showing
> >> node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:
> >>
> >> # cat /sys/class/block/pmem1/device/node_distance
> >> 28 28 17 17
> >> # cat /sys/bus/dax/devices/dax0.0/node_distance
> >> 17 17 28 28
> >>
> >> By the way, it also handles the case where the nd_region has no
> >> valid target_node (idea stolen from kmem.c).
> >>
> >> Are there other places where it'd be useful to export that attribute?
> >>
> >> Ideally we could just export it in the region sysfs directory,
> >> but I can't find backlinks going from daxX.Y or pmemZ to that
> >> region directory :/
> > I understand where you're trying to go, but this is too dax-device
> > specific. What about a storage controller in the topology that is
> > equidistant from multiple cpu nodes? I'd rather solve this from the
> > tooling perspective: look up the cpu nodes that are equidistant to the
> > device's "numa_node".
>
>
> I don't see how you're going to look up those equidistant nodes. In the
> above case, pmem1's numa_node is 2. Where do you want tools to find the
> information that pmem1 is actually close to node2 AND node3?
Yeah, I was indeed confusing proximity-domain and numa-node in my
thought process about what information userspace tools have readily
available, but I think a generic solution is still salvageable.

> That information is hidden in SLIT node5<->node2 and node5<->node3 but
> these are not exposed to userspace tools since node5 isn't registered.

I think the root problem is that the kernel allocates numa-nodes in
arch-specific code at the beginning of time, and the proximity-domain
information is not kept readily available, on the expectation that the
Linux numa node is sufficient. Your node_distance attribute proposal
solves this, but I find SLIT data to be a bit magical and poorly
specified, especially across architectures.

What about just exporting the proximity-domain information via an
opaque, firmware-implementation-specific 'node_handle' attribute? The
node_handle could then be used to answer questions like: which
numa-nodes is this handle local to, beyond what the 'numa_node'
attribute indicates? What is the effective target-node for this
node-handle? It would also allow interrogating the next level of
detail beyond what CONFIG_HMEM_REPORTING exposes.
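Roughly, the sysfs side could look like the sketch below. To be clear,
this is a hypothetical sketch and not a patch: it assumes struct
dax_region keeps the firmware proximity domain around in a
"prox_domain" member, which nothing records today -- that plumbing is
exactly the missing piece.

/*
 * Hypothetical sketch only (not a real patch): assumes struct
 * dax_region gained a "prox_domain" member holding the firmware
 * proximity domain at region registration time.
 */
#include <linux/device.h>
#include <linux/kernel.h>

static ssize_t node_handle_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct dax_region *dax_region = dev_get_drvdata(dev);

	/* opaque handle, only meaningful back to platform firmware */
	return sprintf(buf, "%u\n", dax_region->prox_domain);
}
static DEVICE_ATTR_RO(node_handle);

Tooling would then have a stable key to correlate the device with the
firmware tables (HMAT/SLIT proximity domains) itself, without the
kernel having to interpret them.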

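For the tooling side of the argument above, a minimal userspace sketch
(again hypothetical: it assumes Brice's proposed node_distance
attribute exists, with one SLIT distance per online node in node-id
order). The "local" nodes are simply the ones tied for the minimum
distance:

#include <stdio.h>
#include <limits.h>

int main(void)
{
	/* hypothetical attribute from Brice's node_distance patch */
	FILE *f = fopen("/sys/bus/dax/devices/dax0.0/node_distance", "r");
	int dist[64], n = 0, min = INT_MAX;

	if (!f)
		return 1;
	/* one distance per online node, in node-id order (assumption) */
	while (n < 64 && fscanf(f, "%d", &dist[n]) == 1)
		if (dist[n++] < min)
			min = dist[n - 1];
	fclose(f);

	/* "local" nodes are the ones tied for the minimum distance */
	for (int i = 0; i < n; i++)
		if (dist[i] == min)
			printf("node%d is local\n", i);
	return 0;
}

With the "17 17 28 28" example above, this would report node0 and
node1 as local.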