On Mon, Apr 8, 2019 at 1:13 AM Brice Goglin <[email protected]> wrote:
>
> On 08/04/2019 at 06:26, Dan Williams wrote:
> > On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin <[email protected]> wrote:
> >> Hello
> >>
> >> I am trying to understand the locality of the DAX devices with
> >> respect to processors with SubNUMA clustering enabled. The machine
> >> I am using has 6 proximity domains: #0-3 are the SNCs of both
> >> processors, #4-5 are prox domains for each socket's set of NVDIMMs.
> >>
> >> SLIT says the topology looks like this, which seems OK to me:
> >>
> >>     Package 0  ----------  Package 1
> >>    NVregion0               NVregion1
> >>     |      |                |      |
> >>   SNC 0  SNC 1            SNC 2  SNC 3
> >>   node0  node1            node2  node3
> >>
> >> However each DAX "numa_node" attribute contains a single node ID,
> >> which leads to this topology instead:
> >>
> >>     Package 0  ----------  Package 1
> >>     |      |                |      |
> >>   SNC 0  SNC 1            SNC 2  SNC 3
> >>   node0  node1            node2  node3
> >>     |                        |
> >>   dax0.0                   dax1.0
> >>
> >> It looks like this is caused by acpi_map_pxm_to_online_node()
> >> only returning the first closest node found in the SLIT.
> >> However, even if we change it to return multiple local nodes,
> >> the DAX "numa_node" attribute cannot expose multiple nodes.
> >> Should we rather expose Keith's HMAT attributes for DAX devices?
> > If I understand the suggestion correctly you're referring to the
> > "target_node", or the unique node number that gets assigned when the
> > memory is transitioned online. I struggle to see the incremental
> > benefit relative to what we lose with compatibility of the
> > "traditional" numa node interpretation for a device that indicates
> > which cpus are close to the given device. I think the bulk of the
> > problem is solved with the next suggestion below.
>
>
> Hello Dan,
>
> Not sure why you're talking about "target_node" here. That attribute is
> correct:
>
> $ cat /sys/bus/dax/devices/dax0.0/target_node
> 4
>
> My issue is with "numa_node", which fails to return enough information here:
>
> $ cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/dax0.0/numa_node
> 0
>
> (instead of 0+1, but I don't want to change the semantics of that file,
> see below)
>
>
> > >> Maybe there's even a way to share them between DAX devices
> > >> and Dave's KMEM hotplugged NUMA nodes?
> > In this instance, where the expectation is that the NVDIMM range is
> > equidistant from both SNC nodes on a package, I would teach the numactl
> > tool and other tooling to return a list of local nodes rather than the
> > single attribute. Effectively an operation like "numactl --preferred
> > block:pmem0" would return a node-mask that includes nodes 0 and 1.
>
>
> Teaching these tools is exactly what I want to solve here (I was rather
> talking about dax0.0 than pmem0, but it doesn't matter much). There are
> usually two ways to find the locality of a device from userspace:
>
> * Reading a "local_cpus" sysfs attribute. Works well for finding local
> CPUs. Doesn't always work for finding local memory when some CPUs are
> offline: if all CPUs of the local node are offline, you lose the
> information about the local memory being close to your device (Intel
> people from "mOS" heavily rely on this).
>
> * Reading a "numa_node" sysfs attribute, but it points to a single node.
>
> Keith's HMAT patches are somehow a 3rd way that doesn't have any of these
> issues: you just read "access0/initiators/node*":
>
> * If you want local CPUs, you read the "cpumap" of the initiator nodes.
>
> * If you want the list of "close" memory nodes, you have the list of
> initiator "nodes", or their targets.
>
> It would work very well for describing the topology of my machine once I
> hotplug node4 and node5 using Dave's "kmem" driver: I get node0 and
> node1 in node4/access0/initiators/
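
For reference, here is a minimal userspace sketch of that lookup, assuming the
"access0/initiators" sysfs layout proposed in Keith's HMAT patches; the node
number, paths, and error handling are illustrative only, not taken from the
thread:

/*
 * Sketch: list the initiator nodes of one memory target node by walking
 * /sys/devices/system/node/nodeN/access0/initiators (layout assumed from
 * Keith's HMAT patches; node4 is just this machine's example target).
 */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *dir = "/sys/devices/system/node/node4/access0/initiators";
	DIR *d = opendir(dir);
	struct dirent *e;

	if (!d) {
		perror(dir);	/* no HMAT data exported, or node4 not registered */
		return 1;
	}
	while ((e = readdir(d)) != NULL) {
		unsigned int n;

		/* initiators appear as "nodeN" entries linking back to node dirs */
		if (sscanf(e->d_name, "node%u", &n) == 1)
			printf("node4 is local to initiator node%u (see node%u/cpumap)\n",
			       n, n);
	}
	closedir(d);
	return 0;
}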
Yes, I agree with all of the above, but I think we need a way to fix this
independent of the HMAT data being present. The SLIT already tells the
kernel enough to let tooling figure out equidistant "local" nodes. While
the numa_node attribute will remain a singleton, the tooling needs to
handle this case and can't assume the HMAT data will be present.

> I know HMAT attributes don't appear in hotplugged node sysfs directories
> yet, but it would also be nice to have a way to get that information for
> dax devices before hotplug, since the dax device and hotplugged nodes
> are the same thing.
>
>
> In a crazy world, maybe we could have something like this:
>
> * before hotplug with the kmem driver, unregistered nodes appear in a
> special directory such as
> /sys/devices/system/node/unregistered_hmat/nodeX together with their
> HMAT attributes. If I want to find the locality of a DAX device, I read
> its target_node, and go to the corresponding unregistered_hmat/nodeX and
> read cpumap, initiators, etc.
>
> * at hotplug, the node is moved out of unregistered_hmat/

Some sort of offline target_node data makes sense, but seems secondary to
teaching tools to supplement the 'numa_node' attribute.
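
To illustrate that last point, one way tooling could supplement the singleton
"numa_node" with a node-mask is to reuse the SLIT distances the kernel already
exports under /sys/devices/system/node/nodeX/distance. The distance threshold
and the assumption of contiguous online node numbering below are mine, not
something settled in this thread:

/*
 * Sketch: given a device's single "numa_node" value (e.g. dax0.0's), collect
 * every online node whose SLIT distance to it is small enough to count as
 * "local". The threshold (20 here) is arbitrary and firmware-dependent:
 * SLIT encodes the local node as 10 and scales other distances up from there.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	int dev_node = argc > 1 ? atoi(argv[1]) : 0;	/* device's numa_node */
	char path[64], buf[256];
	FILE *f;
	int node = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/distance", dev_node);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* one distance per online node, in node order (contiguous ids assumed) */
	for (char *tok = strtok(buf, " \n"); tok; tok = strtok(NULL, " \n"), node++)
		if (atoi(tok) <= 20)
			printf("node%d belongs in the local node-mask of node%d\n",
			       node, dev_node);
	return 0;
}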
