On Mon, Apr 8, 2019 at 1:13 AM Brice Goglin <[email protected]> wrote:
>
> On 08/04/2019 at 06:26, Dan Williams wrote:
> > On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin <[email protected]> wrote:
> >> Hello
> >>
> >> I am trying to understand the locality of the DAX devices with
> >> respect to processors with SubNUMA clustering enabled. The machine
> >> I am using has 6 proximity domains: #0-3 are the SNCs of both
> >> processors, #4-5 are prox domains for each socket's set of NVDIMMs.
> >>
> >> SLIT says the topology looks like this, which seems OK to me:
> >>
> >>     Package 0  ----------  Package 1
> >>    NVregion0               NVregion1
> >>     |      |                |      |
> >>   SNC 0  SNC 1            SNC 2  SNC 3
> >>   node0  node1            node2  node3
> >>
> >> However each DAX "numa_node" attribute contains a single node ID,
> >> which leads to this topology instead:
> >>
> >>     Package 0  ----------  Package 1
> >>     |      |                |      |
> >>   SNC 0  SNC 1            SNC 2  SNC 3
> >>   node0  node1            node2  node3
> >>     |                        |
> >>   dax0.0                   dax1.0
> >>
> >> It looks like this is caused by acpi_map_pxm_to_online_node()
> >> only returning the first closest node found in the SLIT.
> >> However, even if we change it to return multiple local nodes,
> >> the DAX "numa_node" attribute cannot expose multiple nodes.
> >> Should we rather expose Keith's HMAT attributes for DAX devices?
> > If I understand the suggestion correctly you're referring to the
> > "target_node", or the unique node number that gets assigned when the
> > memory is transitioned online. I struggle to see the incremental
> > benefit relative to what we lose with compatibility of the
> > "traditional" numa node interpretation for a device that indicates
> > which cpus are close to the given device. I think the bulk of the
> > problem is solved with the next suggestion below.
>
>
> Hello Dan,
>
> Not sure why you're talking about "target_node" here. That attribute is
> correct:
>
> $ cat /sys/bus/dax/devices/dax0.0/target_node
> 4
>
> My issue is with "numa_node", which fails to return enough information here:
>
> $ cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/dax0.0/numa_node
> 0
>
> (instead of 0+1, but I don't want to change the semantics of that file,
> see below)
>
>
> > >> Maybe there's even a way to share them between DAX devices
> > >> and Dave's KMEM hotplugged NUMA nodes?
> > In this instance, where the expectation is that the NVDIMM range is
> > equidistant from both SNC nodes on a package, I would teach the numactl
> > tool and other tooling to return a list of local nodes rather than the
> > single attribute. Effectively an operation like "numactl --preferred
> > block:pmem0" would return a node-mask that includes nodes 0 and 1.
>
>
> Teaching these tools is exactly what I want to solve here (I was rather
> talking about dax0.0 than pmem0, but it doesn't matter much). There are
> usually two ways to find the locality of a device from userspace:
>
> * Reading a "local_cpus" sysfs attribute. Works well for finding local
> CPUs. Doesn't always work for finding local memory when some CPUs are
> offline: if all CPUs of the local node are offline, you lose the
> information about the local memory being close to your device (Intel
> people from "mOS" heavily rely on this).
>
> * Reading a "numa_node" sysfs attribute, but it points to a single node.
>
> Keith's HMAT patches are somehow a 3rd way that doesn't have any of these
> issues: you just read "access0/initiators/node*":
>
> * If you want local CPUs, you read the "cpumap" of the initiator nodes.
>
> * If you want the list of "close" memory nodes, you have the list of
> initiator "nodes", or their targets.
>
> It would work very well for describing the topology of my machine once I
> hotplug node4 and node5 using Dave's "kmem" driver: I get node0 and
> node1 in node4/access0/initiators/
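
For reference, here is a minimal userspace sketch of that lookup, assuming the
"access0/initiators" sysfs layout proposed in Keith's HMAT patches; the node
number, paths, and error handling are illustrative only, not taken from the
thread:

/*
 * Sketch: list the initiator nodes of one memory target node by walking
 * /sys/devices/system/node/nodeN/access0/initiators (layout assumed from
 * Keith's HMAT patches; node4 is just this machine's example target).
 */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *dir = "/sys/devices/system/node/node4/access0/initiators";
	DIR *d = opendir(dir);
	struct dirent *e;

	if (!d) {
		perror(dir);	/* no HMAT data exported, or node4 not registered */
		return 1;
	}
	while ((e = readdir(d)) != NULL) {
		unsigned int n;

		/* initiators appear as "nodeN" entries linking back to node dirs */
		if (sscanf(e->d_name, "node%u", &n) == 1)
			printf("node4 is local to initiator node%u (see node%u/cpumap)\n",
			       n, n);
	}
	closedir(d);
	return 0;
}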
Yes, I agree with all of the above, but I think we need a way to fix this
independent of the HMAT data being present. The SLIT already tells the
kernel enough to let tooling figure out equidistant "local" nodes. While
the numa_node attribute will remain a singleton, the tooling needs to
handle this case and can't assume the HMAT data will be present.

> I know HMAT attributes don't appear in hotplugged node sysfs directories
> yet, but it would also be nice to have a way to get that information for
> dax devices before hotplug, since the dax device and hotplugged nodes
> are the same thing.
>
>
> In a crazy world, maybe we could have something like this:
>
> * before hotplug with the kmem driver, unregistered nodes appear in a
> special directory such as
> /sys/devices/system/node/unregistered_hmat/nodeX together with their
> HMAT attributes. If I want to find the locality of a DAX device, I read
> its target_node, and go to the corresponding unregistered_hmat/nodeX and
> read cpumap, initiators, etc.
>
> * at hotplug, the node is moved out of unregistered_hmat/

Some sort of offline target_node data makes sense, but seems secondary to
teaching tools to supplement the 'numa_node' attribute.
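
To illustrate that last point, one way tooling could supplement the singleton
"numa_node" with a node-mask is to reuse the SLIT distances the kernel already
exports under /sys/devices/system/node/nodeX/distance. The distance threshold
and the assumption of contiguous online node numbering below are mine, not
something settled in this thread:

/*
 * Sketch: given a device's single "numa_node" value (e.g. dax0.0's), collect
 * every online node whose SLIT distance to it is small enough to count as
 * "local". The threshold (20 here) is arbitrary and firmware-dependent:
 * SLIT encodes the local node as 10 and scales other distances up from there.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	int dev_node = argc > 1 ? atoi(argv[1]) : 0;	/* device's numa_node */
	char path[64], buf[256];
	FILE *f;
	int node = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/distance", dev_node);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* one distance per online node, in node order (contiguous ids assumed) */
	for (char *tok = strtok(buf, " \n"); tok; tok = strtok(NULL, " \n"), node++)
		if (atoi(tok) <= 20)
			printf("node%d belongs in the local node-mask of node%d\n",
			       node, dev_node);
	return 0;
}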
