First of all, thanks to everyone involved in the NuttX project. We
really appreciate all the work that has gone into keeping this operating
system maintained and functional on a wide variety of hardware.

We have several different NuttX-based projects that are using both
LPC1769 and LPC4078 processors with an Ethernet interface for
communication. These projects are fairly mature, having been developed
and used for several years now. We've seen occasional glitches on
Ethernet before, but we've more or less been able to tolerate them so
far. This is no longer the case in our current application of the
system, and we'd really like to try to eliminate as many issues as we
can from the software side of things. We are primarily using Modbus
TCP, which is a fairly simple request/response protocol.
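
For reference, a Modbus TCP exchange is small enough to sketch in a few
lines. The Python below is illustrative only and is not taken from the
application described here; the function names are made up for this
example. It builds a "Read Holding Registers" (function 0x03) request,
which is a 7-byte MBAP header followed by a 5-byte PDU, and checks that
a response frame echoes the request's transaction ID and unit ID. A
mismatch there is a cheap way to notice a reply that does not belong to
the request just sent.

```python
import struct

# MBAP header: transaction id, protocol id, length, unit id (big-endian)
MBAP = struct.Struct(">HHHB")


def build_read_holding_request(transaction_id, unit_id, start_addr, quantity):
    """Return the 12-byte Modbus TCP frame for function 0x03."""
    pdu = struct.pack(">BHH", 0x03, start_addr, quantity)
    # The length field counts the unit id plus the PDU.
    header = MBAP.pack(transaction_id, 0, len(pdu) + 1, unit_id)
    return header + pdu


def check_response(request, response):
    """Return a list of problems found in a response frame (empty if none)."""
    if len(response) < MBAP.size + 1:
        return ["response shorter than MBAP header plus function code"]
    req_tid, _, _, req_unit = MBAP.unpack_from(request)
    tid, proto, length, unit = MBAP.unpack_from(response)
    problems = []
    if tid != req_tid:
        problems.append("transaction id mismatch")
    if proto != 0:
        problems.append("protocol id is not 0")
    # Everything after the 6 bytes of tid/proto/length is counted by length.
    if length != len(response) - 6:
        problems.append("length field does not match frame size")
    if unit != req_unit:
        problems.append("unit id mismatch")
    return problems
```

In a stress test, the request would be sent over an ordinary TCP socket
to port 502 with the transaction ID incremented on every iteration, so
that a stale or duplicated reply shows up as a mismatch.
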
We have seen the issue manifest itself in several ways:
* Assertion failures in lpc17_40_ethernet.c:
** DEBUGASSERT((priv->lp_inten & ETH_INT_TXDONE) != 0) in lpc17_40_response
** DEBUGASSERT(lpc17_40_txdesc(priv) == OK) in lpc17_40_txdone_work
* Incorrect TCP sequence numbers in messages coming back from the
embedded device.

Typically we will be able to run for many hundreds or thousands of
packets before we hit one of these cases, but it does seem to depend to
an extent on external factors such as which switch the device is
connected to, the amount of broadcast traffic on the network, etc. The
nature of the failures makes me think that we may be hitting a race
condition of some kind, but I don't have much other evidence to support
that.

In an attempt to narrow down the cause of these issues, I pulled out a
few dev boards and tried to run some of the stock NuttX example apps
(TCP echo server, TCP blaster server, uIP web server) on them with
settings as close to defaults as possible, using a freshly-checked-out
copy of NuttX and the NuttX apps.
* On the STM32H743 Nucleo-144 board, all the network examples I tried
appear to work flawlessly. This matches my general experience running
NuttX on these parts; we have used them on several projects and have
been very pleased with their performance overall.
* On the SAM E54 Xplained Pro board, I had mixed results. I am not
using this chip for any current projects, but I had the board handy and
it is supported by NuttX, so I gave it a try in an attempt to collect
more data. The TCP echo server and web server work as expected. Using
the TCP blaster example, only a fraction of the packets seem to make the
round trip to the PC client application. Watching in Wireshark, I see
some runs of clean traffic interspersed with bursts of duplicate TCP
packets and packets with invalid sequence numbers.
* On the LPC4088 Quickstart board, only the TCP echo server works
reliably. The web server will accept the initial connection and return
a status code, but then hangs. Looking at the exchange with Wireshark,
I see the embedded board returns a fragment of the HTML content from the
middle of the page, then a bunch of TCP packets with incorrect sequence
numbers. Using the TCP blaster example, I can see some traffic
generated, again with a lot of invalid sequence numbers, but the PC
client application does not report any successfully received packets. I
tried changing a number of networking- and Ethernet-related settings in
menuconfig and was only ever able to make it less functional than this,
never more.
* On the LPC1769 LPCXpresso board, I see identical results to the
LPC4088 board. This is not surprising as the two chips use the same
Ethernet peripheral, but I figured it was worth checking for completeness.
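
The "invalid sequence number" symptom described above can also be
checked offline rather than by eye. As a rough sketch (it was not used
to produce the results in this message), the Python below takes the
(sequence number, payload length) pairs of the data segments sent by one
side of a connection, in capture order, and flags any segment that does
not start where the previously seen data ended. The function name and
input format are assumptions made for illustration; exporting those two
columns from a capture is left to whatever tool is at hand. SYN and FIN
flags, which each consume one sequence number, are ignored here.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def find_seq_anomalies(segments):
    """Flag segments that do not start at the expected next sequence number.

    segments: list of (seq, payload_len) tuples from one sender, in
    capture order. Returns a list of (index, kind) tuples.
    """
    anomalies = []
    expected = None
    for i, (seq, length) in enumerate(segments):
        if expected is not None and seq != expected:
            # Distance from expected, interpreted modulo 2**32.
            delta = (seq - expected) % SEQ_MOD
            if delta >= SEQ_MOD // 2:
                kind = "retransmit-or-duplicate"  # starts behind expected
            else:
                kind = "gap"  # starts ahead of expected
            anomalies.append((i, kind))
        end = (seq + length) % SEQ_MOD
        # Only move the expectation forward, never backward.
        if expected is None or (end - expected) % SEQ_MOD < SEQ_MOD // 2:
            expected = end
    return anomalies


capture = [(100, 10), (110, 10), (110, 10), (130, 10)]
print(find_seq_anomalies(capture))
# prints [(2, 'retransmit-or-duplicate'), (3, 'gap')]
```

A clean run returns an empty list, so counting anomalies per test run
gives a rough number for comparing boards or driver settings.
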
Since the STM32H743 seems to work correctly, I don't believe there is an
issue with the TCP/IP stack in NuttX itself. I suspect the problem is
instead in the drivers for the Ethernet peripherals on the chips that
are having issues. In my own application, I can't rule out the
possibility of my
code causing problems, but I certainly would expect to be able to use
the provided NuttX apps such as the web server on any platform with a
network interface. The fact that at least one of the problems I'm
seeing in my application matches a problem that I'm seeing with the
example apps (missing/incorrect TCP sequence numbers) leads me to
believe that I'm probably triggering the same issue, but I know that's
not necessarily true.

I've been looking at this for a while now, and I'm more or less out of
ideas on how to proceed. I'll be the first to admit that I don't fully
understand how the network drivers and the OS are supposed to interact.
Unless I'm missing something, the fact that so many network operations
are deferred using worker threads really appears to make this area of
the system difficult to debug. I've done a lot of testing with network
warning/error/info messages turned on, and found the signal/noise ratio
to be pretty poor. If anyone with more experience or familiarity with
the NuttX TCP stack and/or Ethernet drivers could provide any comments,
tips, or insight on this issue or how best to debug this type of
problem, I would really appreciate it.

Thanks,
--Josh