Re: ath10k_sdio: Long time for loading firmware and RCU errors
On Tue, May 25, 2021 at 9:40 PM Fabio Estevam wrote:
>
> Hi,
>
> I am using the QCA9377 chip on an imx6dl-pico-pi board running kernel
> 5.10.37 and I noticed that the firmware takes a long time to load
> (more than 3 minutes after boot):

Ok, I replaced eudev with mdev in Buildroot and now it loads the
QCA9377 firmware quickly.

> # wpa_supplicant -iwlan0 -c /etc/wpa.conf &
> # Successfully initialized wpa_supplicant
> [ 234.360447] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!
> [ 234.390478] NOHZ tick-stop error: Non-RCU local softirq work is
> pending, handler #08!!!

These NOHZ messages still pop up.

Cheers

___
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k
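Editorial sketch, not part of the original message: as far as I can tell, the number after "handler #" in the NOHZ warning is the pending-softirq bitmask printed in hexadecimal, so 08 would mean bit 3. The small user-space helper below decodes such a mask, assuming the softirq order of the kernel's enum in include/linux/interrupt.h. The helper name `lowest_pending_softirq` is hypothetical and exists only for illustration.

```c
#include <stddef.h>

/*
 * Softirq names in the order of the kernel's softirq enum
 * (assumed from include/linux/interrupt.h); index == bit position.
 */
static const char *const softirq_names[] = {
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
    "IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
};

#define NR_SOFTIRQ_NAMES (sizeof(softirq_names) / sizeof(softirq_names[0]))

/*
 * Hypothetical helper: return the name of the lowest-numbered softirq
 * set in the mask, or NULL if no known bit is set.
 */
static const char *lowest_pending_softirq(unsigned int mask)
{
    for (unsigned int bit = 0; bit < NR_SOFTIRQ_NAMES; bit++) {
        if (mask & (1u << bit))
            return softirq_names[bit];
    }
    return NULL;
}
```

Under that reading, a mask of 0x08 decodes to NET_RX, which would point at pending network receive work rather than at the firmware loading delay itself.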
Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
On Wednesday 02 June 2021 10:55:59 Bjorn Helgaas wrote:
> On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> > On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > > I wonder if this could be restructured as a generic quirk in quirks.c
> > > that simply set the bridge's TLS to 2.5 GT/s during enumeration. Or
> > > would the retrain fail even in that case?
> >
> > If I understand it correctly then PCIe link is already up when kernel
> > starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> > anything here.
> >
> > Moreover it would have side effect that cards which are already set to
> > 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> > increasing speed would be needed another round of "enumeration" to set a
> > new TLS and retrain link again. As TLS affects link only after link goes
> > into Recovery state.
> >
> > So this would just complicate card enumeration and settings.
>
> The current quirk complicates the ASPM code. I'm hoping that if we
> set the bridge's Target Link Speed during enumeration, the link
> retrain will "just work" without complicating the ASPM code.
>
> An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
> GT/s; the quirk would be attached to specific endpoint devices and
> could set the bridge's TLS to whatever the endpoint supports.

Now I see what you mean. Yes, I agree this is a good idea and can
simplify the code. The quirk is not related to the ASPM code and
basically has nothing to do with it; I put it into aspm.c only because
this is the only place where link retraining was activated.

But with this proposal there is one issue. Some kernel drivers already
overwrite the PCI_EXP_LNKCTL2_TLS value. So if the PCI enumeration code
sets some value into the PCI_EXP_LNKCTL2_TLS bits, drivers can change it
later, and once ASPM tries to retrain the link this may trigger the
issue again.

> > Moreover here we are dealing with specific OTP/EEPROM bug in Atheros
> > chips, which was confirmed that exists. As I wrote in previous email, I
> > was told that semi-official workaround is do Warm Reset or Cold Reset
> > with turning power off from card. Which on most platforms / boards is
> > not possible.
>
> If there's a specific bug with a real root-cause analysis, please cite
> it. The threads mentioned in the current commit log are basically
> informed speculation.

I had a (private) discussion with Adrian Chadd about the ABCD device id
issue. I hope that nobody objects if I put here a summary and the
important parts about secondary bus reset (=hot reset):

The reason for abcd is because:

* the MAC has hardware that upon cold reset, will read EEPROM/OTP
  values for things like PCIe and other register defaults, and squirt
  them into the MAC/PHY/etc registers
* the default values for the PCIe bus pre-AR9300 were 0x168c:0xff,
  where is the normal chip ID
* the default values for the PCIe bus POST-AR9300 were 0x168c:0xabcd,
  where they're always that regardless of the chip family
* so yeah, all you know with 0x168c:0xabcd is there's an atheros
  device there, but not WHICH it is.
* the bug is that the reset line isn't held low for long enough, or
  it's bounced twice in quick succession, before the MAC has time to
  program in the defaults from EEPROM/OTP and it doesn't do it a
  second time.

* the MAC has hardware that upon cold reset, will read EEPROM/OTP
  values for things like PCIe and other register defaults, and squirt
  them into the MAC/PHY/etc registers
* need to use the external reset line OR try using D3, not D3hot

(I assume that "external reset line" means PERST# - PCIe Warm Reset and
"D3, not D3hot" means D3cold)

And now my experiments: Disabling and enabling the link via the root
bridge has exactly the same symptoms as hot reset on all tested cards.
Note that different chips (pre-AR9300 and post-AR9300) have slightly
different behavior and it matches all my experiments (I wrote test
details in the commit message).

And doing a link retrain when the root bridge has a non-2.5GT/s value in
PCI_EXP_LNKCTL2_TLS also has the same effect as hot reset.

So based on the same results from my experiments, all these actions
(disabling link, hot reset and link retrain) have a common issue.
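Editorial sketch, not part of the original message: the two observations above (device ID 0xabcd signals that EEPROM/OTP defaults were not loaded, and retraining misbehaves unless the bridge's Target Link Speed is 2.5 GT/s) can be written as two small predicates. This is plain user-space C for illustration only. The register constants are assumed to mirror include/uapi/linux/pci_regs.h and include/linux/pci_ids.h, while `ATHEROS_UNPROGRAMMED_ID` and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel headers. */
#define PCI_EXP_LNKCTL2_TLS        0x000f /* Target Link Speed field */
#define PCI_EXP_LNKCTL2_TLS_2_5GT  0x0001 /* TLS encoding for 2.5 GT/s */
#define PCI_VENDOR_ID_ATHEROS      0x168c

/* Hypothetical name for the post-AR9300 default device ID from the thread. */
#define ATHEROS_UNPROGRAMMED_ID    0xabcd

/* True if the bridge's Link Control 2 value targets 2.5 GT/s (Gen1). */
static bool bridge_tls_is_gen1(uint16_t lnkctl2)
{
    return (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT;
}

/*
 * True if the IDs match the state described above: an Atheros device
 * whose PCIe defaults were not programmed from EEPROM/OTP.
 */
static bool atheros_id_unprogrammed(uint16_t vendor, uint16_t device)
{
    return vendor == PCI_VENDOR_ID_ATHEROS &&
           device == ATHEROS_UNPROGRAMMED_ID;
}
```

A caller could, for example, skip link retraining when `bridge_tls_is_gen1()` is false for the upstream bridge of one of the affected cards. Whether that check belongs in the ASPM code or in an enumeration quirk is exactly what the thread is debating.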
Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
On Wed, Jun 02, 2021 at 02:08:16PM +0200, Pali Rohár wrote:
> On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> > I wonder if this could be restructured as a generic quirk in quirks.c
> > that simply set the bridge's TLS to 2.5 GT/s during enumeration. Or
> > would the retrain fail even in that case?
>
> If I understand it correctly then PCIe link is already up when kernel
> starts enumeration. So setting Bridge TLS to 2.5 GT/s does not change
> anything here.
>
> Moreover it would have side effect that cards which are already set to
> 5+ GT/s would be downgraded to 2.5 GT/s during enumeration and for
> increasing speed would be needed another round of "enumeration" to set a
> new TLS and retrain link again. As TLS affects link only after link goes
> into Recovery state.
>
> So this would just complicate card enumeration and settings.

The current quirk complicates the ASPM code. I'm hoping that if we
set the bridge's Target Link Speed during enumeration, the link
retrain will "just work" without complicating the ASPM code.

An enumeration quirk wouldn't have to set the bridge's TLS to 2.5
GT/s; the quirk would be attached to specific endpoint devices and
could set the bridge's TLS to whatever the endpoint supports.

> Moreover here we are dealing with specific OTP/EEPROM bug in Atheros
> chips, which was confirmed that exists. As I wrote in previous email, I
> was told that semi-official workaround is do Warm Reset or Cold Reset
> with turning power off from card. Which on most platforms / boards is
> not possible.

If there's a specific bug with a real root-cause analysis, please cite
it. The threads mentioned in the current commit log are basically
informed speculation.

Bjorn
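Editorial sketch, not part of the original message: the enumeration-quirk idea above amounts to rewriting only the Target Link Speed field of the bridge's Link Control 2 register with the fastest speed both link partners support. The user-space function below shows just that bit arithmetic. It assumes the max-speed encoding in Link Capabilities (1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s) matches the TLS encoding, that the constants mirror include/uapi/linux/pci_regs.h, and the function name is hypothetical. It is not the quirk from the patch.

```c
#include <stdint.h>

/* Assumed to mirror include/uapi/linux/pci_regs.h. */
#define PCI_EXP_LNKCAP_SLS   0x0000000f /* max supported link speed */
#define PCI_EXP_LNKCTL2_TLS  0x000f     /* Target Link Speed field */

/*
 * Hypothetical helper: compute the Link Control 2 value to write to the
 * bridge so that its Target Link Speed does not exceed what the endpoint
 * (or the bridge itself) supports. All other bits are preserved.
 */
static uint16_t bridge_lnkctl2_for_endpoint(uint16_t bridge_lnkctl2,
                                            uint32_t bridge_lnkcap,
                                            uint32_t endpoint_lnkcap)
{
    uint16_t bridge_max = (uint16_t)(bridge_lnkcap & PCI_EXP_LNKCAP_SLS);
    uint16_t ep_max = (uint16_t)(endpoint_lnkcap & PCI_EXP_LNKCAP_SLS);
    uint16_t target = ep_max < bridge_max ? ep_max : bridge_max;

    return (uint16_t)((bridge_lnkctl2 & ~PCI_EXP_LNKCTL2_TLS) | target);
}
```

For a 2.5 GT/s-only endpoint behind an 8 GT/s bridge this yields a TLS of 1, so a later retrain would not attempt a speed change. In a real quirk the inputs would come from the capability registers of the two devices.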
Re: [PATCH v3] PCI: Disallow retraining link for Atheros chips on non-Gen1 PCIe bridges
On Tuesday 01 June 2021 19:00:36 Bjorn Helgaas wrote:
> On Tue, Jun 01, 2021 at 11:18:39PM +0200, Pali Rohár wrote:
> > On Tuesday 01 June 2021 15:05:49 Bjorn Helgaas wrote:
> > > On Wed, May 05, 2021 at 06:33:57PM +0200, Pali Rohár wrote:
> > > > Atheros AR9xxx and QCA9xxx chips have behaviour issues not only after a
> > > > bus reset, but also after doing retrain link, if PCIe bridge is not in
> > > > GEN1 mode (at 2.5 GT/s speed):
> > > >
> > > > - QCA9880 and QCA9890 chips throw a Link Down event and completely
> > > > disappear from the bus and their config space is not accessible
> > > > afterwards.
> > > >
> > > > - QCA9377 chip throws a Link Down event followed by Link Up event, the
> > > > config space is accessible and PCI device ID is correct. But trying to
> > > > access chip's I/O space causes Uncorrected (Non-Fatal) AER error,
> > > > followed by Synchronous external abort 96000210 and Segmentation fault
> > > > of insmod while loading ath10k_pci.ko module.
> > > >
> > > > - AR9390 chip throws a Link Down event followed by Link Up event, config
> > > > space is accessible, but contains nonsense values. PCI device ID is
> > > > 0xABCD which indicates HW bug that chip itself was not able to read
> > > > values from internal EEPROM/OTP.
> > > >
> > > > - AR9287 chip throws also Link Down and Link Up events, also has
> > > > accessible config space containing correct values. But ath9k driver
> > > > fails to initialize card from this state as it is unable to access HW
> > > > registers. This also indicates that the chip itself is not able to read
> > > > values from internal EEPROM/OTP.
> > > >
> > > > These issues related to PCI device ID 0xABCD and to reading internal
> > > > EEPROM/OTP were previously discussed at ath9k-devel mailing list in
> > > > following thread:
> > > >
> > > > https://www.mail-archive.com/ath9k-devel@lists.ath9k.org/msg07529.html
> > > >
> > > > After experiments we've come up with a solution: it seems that Retrain
> > > > link can be called only when using GEN1 PCIe bridge or when PCIe bridge
> > > > link speed is forced to 2.5 GT/s. Applying this workaround fixes all
> > > > mentioned cards.
> > >
> > > I *assume* this means the device was running at > 2.5 GT/s in the
> > > first place,
> >
> > No. All these Atheros chips are 2.5 GT/s only. It looks like that if
> > PCIe Bridge has initial value 5 GT/s (or higher) in PCI_EXP_LNKCAP2
> > register and link retraining is activated, something happen which cause
> > these Atheros chips to "crash". Looks like that Root Bridge tries to
> > change link speed from 2.5 GT/s to 5 GT/s (which is not supported by all
> > these Atheros chips).
>
> Oh, perfect. Then I guess all we need is to restrict these devices to
> 2.5 GT/s. And we can just ignore all my rambling about higher speeds
> below, so I'll elide them.

Yes, all these tests show that these Atheros chips are stable only when
the link is operating at 2.5 GT/s.

> > > ...
>
> Except this:
>
> > > This patch implies that the hardware automatically trained to a
> > > higher rate after power-on (which I think is what PCIe hardware is
> > > *supposed* to do) and something prevents that from succeeding when
> > > we retrain, or maybe BIOS did something different than what Linux
> > > is doing, or ... something else?
>
> > Tested platforms was also without BIOS and without any other firmware
> > which touched PCIe.
>
> The fact that the link came up automatically without any firmware or
> software at all is very interesting. The retrain path is actually
> different from a hardware point of view: the power-on path through
> LTSSM would normally be Detect, Polling, Configuration, L0; the
> retrain path would be L0, Recovery, L0. So I guess it isn't *too*
> surprising that the power-on path could work even if the retrain path
> is broken.

Yes, this is true. In my opinion these Atheros chips are trying to do
some kind of init / reset procedure when either entering or leaving the
Recovery state. And because there is a known bug that Hot Reset should
be avoided, it looks like Hot Reset is just one of several ways to
trigger this bug.

> I wonder if setting, then clearing, the bridge's Link Disable bit
> would work, since that would start again with the LTSSM Detect state,
> just like power-on.

Tested and it does not work. Same effect as Hot Reset.

This really looks like the OTP/EEPROM related issue which was already
described: doing something related to reset too fast causes the chip to
not finish reading the OTP/EEPROM data needed to correctly initialize
the PCIe part of the card.

> But I don't think that would help with this
> ASPM/Common Clock issue because I think the link disable would look
> like a hot reset to the endpoint, and it would clear the Common Clock
> Configuration bit.
>
> So backing up a long ways, how much value is there in doing this
> retrain at all? AFAICT the only reason we do it is because we think
> the