[ add Keith and Dave for their thoughts ]

On Wed, Apr 17, 2019 at 2:46 PM Brice Goglin <[email protected]> wrote:
>
>
> Le 17/04/2019 à 23:35, Dan Williams a écrit :
> > On Tue, Apr 16, 2019 at 8:31 AM Brice Goglin <[email protected]> wrote:
> >>
> >> Le 08/04/2019 à 21:55, Brice Goglin a écrit :
> >>
> >> Le 08/04/2019 à 16:56, Dan Williams a écrit :
> >>
> >> Yes, I agree with all of the above, but I think we need a way to fix
> >> this independent of the HMAT data being present. The SLIT already
> >> tells the kernel enough to let tooling figure out equidistant "local"
> >> nodes. While the numa_node attribute will remain a singleton the
> >> tooling needs to handle this case and can't assume the HMAT data will
> >> be present.
> >>
> >> So you want to export the part of SLIT that is currently hidden to
> >> userspace because the corresponding nodes aren't registered?
> >>
> >> With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
> >> means it's close to node0 and node1.
> >>
> >> The code is pretty much a duplicate of read_node_distance() in
> >> drivers/base/node.c. Not sure it's worth factoring out such small functions?
> >>
> >> The name "node_distance" (instead of "distance" for NUMA nodes) is also
> >> subject to discussion.
> >>
> >> Here's a better patch that exports the existing routine for showing
> >> node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:
> >>
> >> # cat /sys/class/block/pmem1/device/node_distance
> >> 28 28 17 17
> >> # cat /sys/bus/dax/devices/dax0.0/node_distance
> >> 17 17 28 28
> >>
> >> By the way, it also handles the case where the nd_region has no
> >> valid target_node (idea stolen from kmem.c).
> >>
> >> Are there other places where it'd be useful to export that attribute?
> >>
> >> Ideally we could just export it in the region sysfs directory,
> >> but I can't find backlinks going from daxX.Y or pmemZ to that
> >> region directory :/
> > I understand where you're trying to go, but this is too dax-device
> > specific. What about a storage-controller in the topology that is
> > equidistant from multiple cpu nodes? I'd rather solve this from the
> > tooling perspective to lookup cpu nodes that are equidistant to the
> > device's "numa_node".
>
>
> I don't see how you're going to lookup those equidistant nodes. In the
> above case, pmem1 numa_node is 2. Where do you want tools to find the
> information that pmem1 is actually close to node2 AND node3?

Yeah, I was indeed confusing proximity-domain and numa-node in my
thought process of what information userspace tools have readily
available, but I think a generic solution is still salvageable.

> That information is hidden in SLIT node5<->node2 and node5<->node3 but
> these are not exposed to userspace tools since node5 isn't registered.

I think the root problem is that the kernel allocates numa-nodes in
arch-specific code at the beginning of time and the proximity-domain
information is not readily available with the expectation that the
Linux numa node is sufficient.

Your node_distance attribute proposal solves this, but I find SLIT
data to be a bit magical and poorly specified, especially across
architectures.

What about just exporting the proximity domain information via an
opaque firmware-implementation-specific 'node_handle' attribute? The
node_handle could then be used to answer questions beyond what the
'numa_node' attribute indicates: which numa-nodes is this handle
local to, and what is the effective target-node for this handle? It
would also allow for interrogating the next level of detail beyond
what CONFIG_HMEM_REPORTING exposes.
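To illustrate the idea, here is a purely hypothetical sketch; no such node_handle attribute exists in the kernel, and the topology values are made up to mirror the thread's example (proximity domain 5 backing dax0.0, local to nodes 0 and 1):

```python
# Hypothetical sketch only: if devices exported an opaque firmware
# 'node_handle' (e.g. an ACPI proximity domain), and each NUMA node
# exported the set of handles it is local to, tooling could join the
# two without ever interpreting SLIT data itself.

def nodes_local_to(handle, node_handles):
    """node_handles: mapping of numa node id -> set of local handles."""
    return sorted(n for n, hs in node_handles.items() if handle in hs)

# Made-up topology: handle 5 (the unregistered pmem proximity domain)
# is local to nodes 0 and 1; handle 6 is local to nodes 2 and 3.
topology = {0: {0, 5}, 1: {1, 5}, 2: {2, 6}, 3: {3, 6}}
print(nodes_local_to(5, topology))  # -> [0, 1]
```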
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm