On Tue, Oct 15, 2019 at 06:40:29PM +0800, Yunsheng Lin wrote:
> On 2019/10/14 17:25, Greg KH wrote:
> > On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
> >> On 2019/10/12 18:47, Greg KH wrote:
> >>> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
> >>>> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> >>>>> On 2019/10/12 15:40, Greg KH wrote:
> >>>>>> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> >>>>>>> add pci and acpi maintainer
> >>>>>>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >>>>>>>
> >>>>>>> On 2019/10/11 19:15, Peter Zijlstra wrote:
> >>>>>>>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >>>>>>>>> But I failed to see why the above is related to making
> >>>>>>>>> node_to_cpumask_map() NUMA_NO_NODE aware?
> >>>>>>>>
> >>>>>>>> Your initial bug is for hns3, which is a PCI device, which
> >>>>>>>> really _MUST_ have a node assigned.
> >>>>>>>>
> >>>>>>>> It not having one, is a straight up bug. We must not silently accept
> >>>>>>>> NO_NODE there, ever.
> >>>>>>>>
> >>>>>>>
> >>>>>>> By "not silently accept NO_NODE", I suppose you mean reporting a
> >>>>>>> lack of affinity when the node of a PCIe device is not set.
> >>>>>>
> >>>>>> If the firmware of a pci device does not provide the node information,
> >>>>>> then yes, warn about that.
> >>>>>>
> >>>>>>> As Greg has asked about in [1]:
> >>>>>>> what is a user to do when the user sees the kernel reporting that?
> >>>>>>>
> >>>>>>> We may tell the user to contact their vendor for info or updates
> >>>>>>> when they do not know their system well enough, but the vendor may
> >>>>>>> get away with this by quoting the ACPI spec, since the spec
> >>>>>>> considers this optional. Should the user believe this is indeed a
> >>>>>>> fw bug or a misreport from the kernel?
> >>>>>>
> >>>>>> Say it is a firmware bug; if it is a firmware bug, that's simple.
> >>>>>>
> >>>>>>> If this kind of reporting is common practice and will not cause
> >>>>>>> any misunderstanding, then maybe we can report that.
> >>>>>>
> >>>>>> Yes, please do so, that's the only way those boxes are ever going
> >>>>>> to get fixed.  And go add the test to the "firmware testing" tool
> >>>>>> that is based on Linux that Intel has somewhere, to give vendors a
> >>>>>> chance to fix this before they ship hardware.
> >>>>>>
> >>>>>> This shouldn't be a big deal, we warn of other hardware bugs all the
> >>>>>> time.
> >>>>>
> >>>>> Ok, thanks for clarifying.
> >>>>>
> >>>>> Will send a patch to catch the case where a PCIe device does not
> >>>>> have a NUMA node set, and warn about it.
> >>>>>
> >>>>> Maybe use dev->bus to verify that it is a PCI device?
> >>>>
> >>>> No, do that in the pci bus core code itself, when creating the devices
> >>>> as that is when you know, or do not know, the numa node, right?
> >>>>
> >>>> This can't be in the driver core only, as each bus type will have a
> >>>> different way of determining what the node the device is on.  For some
> >>>> reason, I thought the PCI core code already does this, right?
> >>>
> >>> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> >>> thing...
> >>>
> >>> Anyway, it looks like the pci core code does call set_dev_node() based
> >>> on the PCI bridge, so if that is set up properly, all should be fine.
> >>>
> >>> If not, well, you have buggy firmware and you need to warn about that at
> >>> the time you are creating the bridge.  Look at the call to
> >>> pcibus_to_node() in pci_register_host_bridge().
> >>
> >> Thanks for pointing out the specific function.
> >> Maybe we do not need to warn when the device has a parent, because we
> >> must already have warned about the parent if it also has a node of
> >> NO_NODE, so there is no need to warn about the child device anymore?
> >> Like below:
> >>
> >> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> >>         list_add_tail(&bus->node, &pci_root_buses);
> >>         up_write(&pci_bus_sem);
> >>
> >> +       if (nr_node_ids > 1 && !parent &&
> > 
> > Why do you need to check this?  If you have a parent, your node should
> > be set; if not, that's an error, right?
> 
> If the device has parent and the parent device also has a node of
> NUMA_NO_NODE, then maybe we have warned about the parent device, so
> we do not have to warn about the child device?

But it's a PCI bridge, if it is not set properly, that needs to be fixed
otherwise the PCI devices attached to it have no hope of working
properly.

> In pci_register_host_bridge():
> 
>       if (!parent)
>               set_dev_node(bus->bridge, pcibus_to_node(bus));
> 
> The above only sets the node of the bridge device to the node of the bus
> if the bridge device does not have a parent.

Odd, what happens to devices behind another bridge today?  Are their
nodes set properly today?  Is the node supposed to be the same as the
parent bridge?

> >> +           dev_to_node(bus->bridge) == NUMA_NO_NODE)
> >> +               dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
> >> +
> >>         return 0;
> > 
> > Who set that bus->bridge node to NUMA_NO_NODE?
> 
> It seems x86 and arm64 may have different implementations of
> pcibus_to_node():
> 
> For arm64:
> int pcibus_to_node(struct pci_bus *bus)
> {
>       return dev_to_node(&bus->dev);
> }
> 
> And the node of bus is set in:
> int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
> {
>       if (!acpi_disabled) {
>               struct pci_config_window *cfg = bridge->bus->sysdata;
>               struct acpi_device *adev = to_acpi_device(cfg->parent);
>               struct device *bus_dev = &bridge->bus->dev;
> 
>               ACPI_COMPANION_SET(&bridge->dev, adev);
>               set_dev_node(bus_dev, acpi_get_node(acpi_device_handle(adev)));
>       }
> 
>       return 0;
> }
> 
> acpi_get_node() may return NUMA_NO_NODE in pcibios_root_bridge_prepare(),
> which will set the node of bus_dev to NUMA_NO_NODE.
> 
> 
> x86:
> static inline int __pcibus_to_node(const struct pci_bus *bus)
> {
>       const struct pci_sysdata *sd = bus->sysdata;
> 
>       return sd->node;
> }
> 
> And the node of the bus is set in pci_acpi_scan_root(), which uses
> pci_acpi_root_get_node() to get the node of a bus. It also may return
> NUMA_NO_NODE.

Fixing that will be good :)

> > If that is set, the firmware is broken, as you say, but you need to tell
> > the user what firmware is broken.
> 
> Maybe mention the BIOS in the log?
> dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.\n");

That's a good start.  Try running it on your machines (big and small)
and see what happens.

> > Try something like this out and see what happens on your machine that
> > had things "broken".  What does it say?
> 
> I do not have an older BIOS right now, but always returning NUMA_NO_NODE
> with the patch below:
> 
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -484,6 +484,7 @@ int acpi_get_node(acpi_handle handle)
> 
>         pxm = acpi_get_pxm(handle);
> 
> -       return acpi_map_pxm_to_node(pxm);
> +       return -1;
> +       //return acpi_map_pxm_to_node(pxm);
> 
> it gives the below warnings on my machine:
> 
> [   16.126136]  pci0000:00: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   17.733831]  pci0000:7b: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   18.020924]  pci0000:7a: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   18.552832]  pci0000:78: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   19.514948]  pci0000:7c: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   20.652990]  pci0000:74: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   22.573200]  pci0000:80: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   23.225355]  pci0000:bb: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   23.514040]  pci0000:ba: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   24.050107]  pci0000:b8: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   25.017491]  pci0000:bc: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.
> [   25.557974]  pci0000:b4: [Firmware Bug]: No node assigned on NUMA capable HW by BIOS. Please contact your vendor for updates.

And can you fix your bios?  If you can't then why are we going to warn
about this?

:)

thanks,

greg k-h
