On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > On 09/13/2017 04:46 PM, James Cameron wrote:
> > >
> > >I'll give it some more testing and let you know, but it seems as
> > >capable of keeping a connection as 4.13 plus my earlier revert.
> > >
>
> Testing went well; removing the call to enable ASPM was as good as
> changing the DBI read back to 16-bit width.
>
> > The change I sent earlier should be as good as reverting the change
> > to write_byte in your reversion.
>
> Yes, that would be the hope.
>
> But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> so perhaps reading that register has an unexpected side-effect.
>
I've ruled that out after several days of testing different kernels
based on v4.13:

- adding an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
  after the call to enable_aspm does not solve the problem,

- adding an rtl_read_byte of REG_DBI_CTRL+0 at the start of
  _rtl8821ae_check_pcie_dma_hang does not solve the problem.

At the moment the only ways to solve the problem are either:

- reverting 40b368af4b75 ("rtlwifi: Fix alignment issues"), which
  means using rtl_read_word in _rtl8821ae_dbi_read,

or

- removing the two lines that enable ASPM, as you asked me to try.

> Is there any documentation for that register? I see other code writes
> to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

I'll repeat and expand on this.  Is there any documentation for this
register, or the other REG_DBI_* registers?

I see that DBI windowed access in rtl8192de is different and yet very
similar.

In rtl8821ae, rtl8723be, and rtl8192de the method seems straightforward;
there are bits for the address, bits for per-byte write enable, and flag
bits for starting the transfer and signalling its completion.

> Evidence of read from REG_DBI_CTRL was captured with an instrumented
> kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
> these dmesg lines;
>
> [ 6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
> [ 6.010338] rtl_pci: rtl_pci_enable_aspm
> [ 6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
> [ 6.034806] rtlwifi: rtlwifi: wireless switch is on
> [ 6.196958] rtl8821ae 0000:02:00.0 wlp2s0: renamed from wlan0
> [ 7.979186] rtl_pci: rtl_pci_disable_aspm
> [ 7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [ 8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [ 8.295437] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f)
> [ 8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [ 8.295462] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0200 (@034d)
> [ 8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [ 8.295477] rtl_pci: rtl_pci_enable_aspm
> [ 8.469734] rtl_pci: rtl_pci_disable_aspm
> [ 8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [ 8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [ 8.687013] rtl8821ae: _rtl8821ae_dbi_read 070f -> ffff (@034f)
> [ 8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [ 8.687038] rtl8821ae: _rtl8821ae_dbi_read 0719 -> 0218 (@034d)
> [ 8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [ 8.687053] rtl_pci: rtl_pci_enable_aspm
>
> Observe how the windowed read of DBI register 0x70f causes a 16-bit
> read at 0x34f, which includes the first 8 bits of REG_DBI_CTRL at 0x350.
>
> By the way, the cold boot value of DBI register 0x719 is 0x00, and
> the warm boot value is 0x18, so I'm confident there isn't a
> comprehensive register reset. It means that BIOS has relevance; and
> this BIOS is outside my control. BIOS variation may explain
> difficulty reproducing.

Is there a register for device reset that I can try?  It would help
to exclude the BIOS as a factor.

>
> > There has been a report (in Russian unfortunately) at
> > https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> > handling.
>
> Thanks.  I've considered and excluded ARP handling delay, though ARP
> renewal is a typical reason for device sleep to end.
>
> With the call to enable ASPM disabled, instead of changing the DBI
> read to 16-bit width, what happens is that the device stops accepting
> data from the access point, packets are buffered there, and are
> transmitted as soon as the device makes the next transmission.
>
> http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
> to confirm this.
>
> I've a monitor mode tcpdump I can send by private mail if required.
> In that the burst of packets shows ICMP echo requests were buffered by
> the access point.
>
> > According to Google translate is as follows:
> >
> > ============================================================
> > Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> > which causes the Internet to end. Wireshark looks quite interesting:
> > ARP replays can be sent by one large packet a few seconds after
> > receiving the requests, ie. they seem to be buffered somewhere.
>
> Yes, buffering at access point.
>
> > I need to explore that ENOBUFS return code.
>
> I've seen ENOBUFS up at the application level with ping too, when the
> original problem happens with v4.10 plus stable.
>
> > Your case where the device is unresponsive to pings from another NIC
> > until the device transmits may also be an ARP problem.
> >
> > For completeness, are you using the 2.4 or 5 GHz band? What is the
> > make/model of your AP? If possible for you to determine, what firmware
> > is it running?
>
> Both 2.4 GHz and 5 GHz reproduce the problem.
>
> Open or WPA reproduces the problem.
>
> Netgear WNDR3800 OpenWrt 12.09-beta, r33312.
>
> Several other access points reproduce the problem, including a
> customer's TP-Link TL-WR1042ND with unknown firmware version.
>
> So far, no access point has failed to reproduce the problem.
>
> Hope that helps, thanks for your ideas.
>
> --
> James Cameron
> http://quozl.netrek.org/
--
James Cameron
http://quozl.netrek.org/