On Tue, Nov 3, 2020 at 1:08 PM Dave Stevenson
<[email protected]> wrote:
>
> Hi Ilia
> Thanks again for the reply.
>
> On Wed, 28 Oct 2020 at 14:59, Ilia Mirkin <[email protected]> wrote:
> >
> > On Wed, Oct 28, 2020 at 10:20 AM Dave Stevenson
> > <[email protected]> wrote:
> > >
> > > Hi Ilia
> > >
> > > Thanks for taking the time to reply.
> > >
> > > On Wed, 28 Oct 2020 at 14:10, Ilia Mirkin <[email protected]> wrote:
> > > >
> > > > The most common issue on arm is that the pci memory window is too 
> > > > narrow to allocate all the BARs. Can you see if there are messages in 
> > > > the kernel to that effect?
> > >
> > > All the BAR allocations seem to succeed except for the IO one.
> > > AIUI I/O is deprecated, but is it still used on these cards?
> >
> > I must admit I was ignorant of the fact that the IO ports were treated
> > as a BAR, but it makes a lot of sense.
> >
> > One thing does stand out as odd:
> >
> > >
> > > [    1.060851] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 
> > > ranges:
> > > [    1.060892] brcm-pcie fd500000.pcie:   No bus range found for
> > > /scb/pcie@7d500000, using [bus 00-ff]
> > > [    1.060975] brcm-pcie fd500000.pcie:      MEM
> > > 0x0600000000..0x063fffffff -> 0x00c0000000
> > > [    1.061061] brcm-pcie fd500000.pcie:   IB MEM
> > > 0x0000000000..0x00ffffffff -> 0x0100000000
> > > [    1.109943] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
> > > [    1.110129] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
> > > [    1.110159] pci_bus 0000:00: root bus resource [bus 00-ff]
> > > [    1.110187] pci_bus 0000:00: root bus resource [mem
> > > 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
> > > [    1.110286] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
> > > [    1.110505] pci 0000:00:00.0: PME# supported from D0 D3hot
> > > [    1.114095] pci 0000:00:00.0: bridge configuration invalid ([bus
> > > 00-00]), reconfiguring
> > > [    1.114343] pci 0000:01:00.0: [10de:128b] type 00 class 0x030000
> > > [    1.114404] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00ffffff]
> > > [    1.114456] pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x07ffffff
> > > 64bit pref]
> > > [    1.114510] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x01ffffff
> > > 64bit pref]
> > > [    1.114551] pci 0000:01:00.0: reg 0x24: [io  0x0000-0x007f]
> > > [    1.114590] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0007ffff 
> > > pref]
> > > [    1.114853] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth,
> > > limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 63.008
> > > Gb/s with 8.0 GT/s PCIe x8 link)
> > > [    1.115022] pci 0000:01:00.0: vgaarb: VGA device added:
> > > decodes=io+mem,owns=none,locks=none
> > > [    1.115125] pci 0000:01:00.1: [10de:0e0f] type 00 class 0x040300
> > > [    1.115184] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
> > > [    1.119065] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
> > > [    1.119120] pci 0000:00:00.0: BAR 9: assigned [mem
> > > 0x600000000-0x60bffffff 64bit pref]
> > > [    1.119151] pci 0000:00:00.0: BAR 8: assigned [mem 
> > > 0x60c000000-0x60d7fffff]
> >
> > This is your brcm-pcie device.
> >
> > > [    1.119183] pci 0000:01:00.0: BAR 1: assigned [mem
> > > 0x600000000-0x607ffffff 64bit pref]
> > > [    1.119235] pci 0000:01:00.0: BAR 3: assigned [mem
> > > 0x608000000-0x609ffffff 64bit pref]
> > > [    1.119285] pci 0000:01:00.0: BAR 0: assigned [mem 
> > > 0x60c000000-0x60cffffff]
> >
> > And this is the NVIDIA device. Note that these memory windows are
> > identical (or at least overlapping). I must admit almost complete
> > ignorance of PCIe and whether this is OK, but it seems sketchy at
> > first glance. A quick eyeballing of my x86 system suggests that all
> > PCIe devices get non-overlapping windows. OTOH there are messages
> > further up about some sort of remapping going on, so perhaps it's OK?
> > But two things on the same bus still shouldn't have the same addresses
> > allocated, based on my (limited) understanding.
>
> I've raised this with colleagues and it seems that this is normal.
> The PCI bridge reports the window through which devices can be mapped,
> and all devices have to fit within that. I can't say whether that is
> a quirk of ARM or of this particular bridge.
>
> I do note that on my x86 systems device 0000:00:00.0 is reported by
> lspci as a "Host bridge" instead of a "PCI bridge".
> On an Ubuntu VM I've got running, I do get
> [    0.487249] PCI host bridge to bus 0000:00
> [    0.487252] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
> [    0.487254] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
> [    0.487256] pci_bus 0000:00: root bus resource [mem
> 0x000a0000-0x000bffff window]
> [    0.487258] pci_bus 0000:00: root bus resource [mem
> 0xe0000000-0xfdffffff window]
> [    0.487260] pci_bus 0000:00: root bus resource [bus 00-ff]
> and all device allocations are from within those ranges, so I'm not
> convinced it's that different.
>
> > In case it's an option, could you "unplug" the NIC (not just not load
> > its driver, but make it not appear at all on the PCI bus)?
>
> NIC? The network interface is totally separate. Or is this another
> reuse of a TLA?
>
> Unplugging the GPU means the PCI bus reports as being down and I get
> no output at all from lspci.

Oh duh. I thought brcm-pcie was a Broadcom NIC. Apparently it's the
whole bus - can't unplug that! That also explains the "conflict",
which makes a lot more sense once you (correctly) understand that the
other "device" is the bus itself. Apologies for the misinterpretation :(
[And in hindsight, the RPi runs on a Broadcom SoC, so I should have
remembered that. In my mind they just make network gear; I'll try to
update that assumption.]

> > > >> I've tried it so far with a GT710 board [1] and ARM64. It's blowing up
> > > >> in the memset of nvkm_instobj_new whilst initialising the BAR
> > > >> subdevice [2], having gone through the "No such luck" path in
> > > >> nvkm_mmu_ptc_get [3].
> > > >>
> > > >> Taking the naive approach of simply removing the memset, I get through
> > > >> initialising all the subdevices, but again die in a location I
> > > >> currently haven't pinpointed. The last logging messages are:
> >
> > That's not a winning strategy, I'm afraid. You need to figure out why
> > the memset is blowing up. The simplest explanation is "it's trying to
> > write to an I/O resource but that resource wasn't allocated", hence my
> > question about BARs. But something's not mapped, or mapped in the
> > wrong way, or whatever. If you can't write to it at that point in
> > time, you won't be able to write to it later either. I would focus on
> > that.
>
> I did say it was the naive approach :-)
> I was trying to gauge how much effort was going to be needed to get
> this going. Was it going to blow up in 1, 10, or 100 places? It feels
> like it is only a couple of things that are wrong, so there is hope.
>
> Slightly annoyingly something more urgent has come up and I need to
> shelve my experimentation for now, but thanks for the pointers. At
> least I have some idea of where to start looking when time allows.

When/if you do get back to it, consider posting a more complete log
without removing the memset; the nature of the blow-up may make the
underlying problem, or at least the next investigation step, more
apparent.

Cheers,

  -ilia
_______________________________________________
Nouveau mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/nouveau