Hi Holger

On Mon, Dec 18, 2017 at 02:38:53PM +0100, Holger Hoffstätte wrote:
> On 12/18/17 06:49, Jonathan Woithe wrote:
> > Resend to netdev.  LKML CCed in case anyone in the wider kernel community
> > can suggest a way forward.  Please CC responses if replying only to LKML.
> > 
> > It seems that this 4+ year old regression in the r8169 driver (documented in
> > this thread on netdev beginning on 9 March 2013) will never be fixed,
> > despite the identification of the commit which broke it.  Cards using this
> > driver will therefore remain unusable for certain workloads utilising UDP.
> (snip)
> 
> Since I've seen your postings several times now with no comment or resolution
> I've decided to try your reproducer on my own systems. In short, I cannot
> reproduce any packet loss, despite having 2 (cheap) 1Gb switches between the
> two machines. Both are running 4.14.7.

Thanks for trying the test program on your system.  The result suggests
that the problem might be specific to a particular variant of the r8169
chip.  The systems we use are all equipped with a PCI Netgear GA311 card,
which identifies as

  05:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
        Subsystem: Netgear GA311

Respective IDs are

  05:01.0 0200: 10ec:8169 (rev 10)
          Subsystem: 1385:311a

> Both NICs are onboard PCIe

This is a significant difference between your test systems and ours: the
cards we are using are PCI and are not onboard.

> Nevertheless your reproducer runs forever and all I see is 6 bytes
> request, 14 bytes response, with no drops.  Not one.  I tried in both
> directions - no difference.

That's very interesting.  On the system noted above with the GA311 the
packet sequence certainly works most of the time.  However, within an hour
a 14 byte response will fail to be seen by the system which sent the 6 byte
request.  The slave sees the 6 byte request and sends the 14 byte response,
so the problem is in the master (the system sending the 6 byte request).
Neither the NIC in the slave nor the kernel version running on the slave
affects the result.
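
For anyone following along without the tester to hand, the exchange can be
sketched roughly as below.  This is not the actual test program: the
function names, the zero-filled payloads and the 1 second timeout are
assumptions made purely for illustration.  Only the 6 and 14 byte sizes
come from the description above.

```python
import socket

REQUEST_LEN = 6    # request size as described in this thread
RESPONSE_LEN = 14  # response size as described in this thread


def slave_once(sock):
    """Wait for one datagram and answer a 6 byte request with 14 bytes.

    Hypothetical helper: the payload content (all zero bytes) is an
    assumption, not what the real tester sends.
    """
    data, addr = sock.recvfrom(64)
    if len(data) == REQUEST_LEN:
        sock.sendto(bytes(RESPONSE_LEN), addr)


def master_poll(sock, slave_addr, timeout=1.0):
    """Send one request; return True if a 14 byte reply arrives in time.

    A False return corresponds to the failure described above: the
    request went out but the master never saw the response.
    """
    sock.settimeout(timeout)
    sock.sendto(bytes(REQUEST_LEN), slave_addr)
    try:
        data, _ = sock.recvfrom(64)
    except socket.timeout:
        return False
    return len(data) == RESPONSE_LEN
```

Calling master_poll() in a loop on the master and counting the False
returns gives a rough measure of how often the response goes missing.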

> I realize this doesn't actually solve your immediate problem, but it is
> nevertheless an indicator that whatever you have been observing is caused
> by something else.

The inability to trigger the problem on your systems could be due to the
NICs in use.  That is an obvious difference between our system (which
reliably experiences the problem) and yours (which doesn't).  This may
indicate that only certain variants of the r8169 chip are affected, which
obviously complicates things.

In any case, this tester (and the production program with which the problem
was first noticed) worked perfectly up until commit
da78dbff2e05630921c551dbbc70a4b7981a8fff (identified with git bisect).
Furthermore, when the pre-da78dbff...981a8fff driver was ported to 4.3 as a
test the problem was resolved, verified over a week of continuous testing;
the standard 4.3 reliably triggered the problem within minutes.  Of course
the ported driver isn't a viable long term solution since it's essentially
an out-of-tree driver.

It's hard to see how this problem is unrelated to da78dbff...981a8fff. 
Before this commit, everything worked fine.  While keeping everything else on
the system unchanged, applying this single commit to the r8169 driver causes
the problem.

Thank you again for running the tests.

Regards
  jonathan
