First of all, thanks to everyone involved in the NuttX project. We really appreciate all the work that has gone into keeping this operating system maintained and functional on a wide variety of hardware.

We have several different NuttX-based projects that are using both LPC1769 and LPC4078 processors with an Ethernet interface for communication. These projects are fairly mature, having been developed and used for several years now. We've seen occasional glitches on Ethernet before, but we've more or less been able to tolerate them so far. This is no longer the case in our current application of the system, and we'd really like to try to eliminate as many issues as we can from the software side of things. We are primarily using Modbus TCP, which is a fairly simple request/response protocol.
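For readers less familiar with Modbus TCP, here is a minimal host-side sketch of what one request looks like on the wire. It is an illustration only, not the code from our application: the function name `build_read_holding_registers` and the example values are made up for this post. The frame layout (a 7-byte MBAP header followed by the PDU) follows the published Modbus TCP framing.

```python
import struct


def build_read_holding_registers(transaction_id, unit_id, start_addr, quantity):
    """Build a Modbus TCP 'Read Holding Registers' (function 0x03) request.

    Hypothetical helper for illustration; all fields are big-endian.
    """
    # PDU: function code, starting register address, number of registers.
    pdu = struct.pack(">BHH", 0x03, start_addr, quantity)
    # MBAP header: transaction id, protocol id (always 0 for Modbus),
    # length of everything that follows the length field (unit id + PDU),
    # and the unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


if __name__ == "__main__":
    frame = build_read_holding_registers(
        transaction_id=1, unit_id=1, start_addr=0, quantity=2
    )
    print(frame.hex())
```

Each request is answered by exactly one response carrying the same transaction id, so a single lost or corrupted TCP segment stalls the whole exchange until it is retransmitted.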

We have seen the issue manifest itself in several ways:

* Assertion failures in lpc17_40_ethernet.c:
** DEBUGASSERT((priv->lp_inten & ETH_INT_TXDONE) != 0) in lpc17_40_response
** DEBUGASSERT(lpc17_40_txdesc(priv) == OK) in lpc17_40_txdone_work

* Incorrect TCP sequence numbers in messages coming back from the embedded device.

Typically we are able to run for many hundreds or thousands of packets before we hit one of these cases, but it does seem to depend to an extent on external factors such as which switch the device is connected to, the amount of broadcast traffic on the network, etc. The nature of the failures makes me think that we may be hitting a race condition of some kind, but I don't have much other evidence to base that on.

In an attempt to narrow down the cause of these issues, I pulled out a few dev boards and tried to run some of the stock NuttX example apps (TCP echo server, TCP blaster server, uIP web server) on them with settings as close to defaults as possible, using a freshly-checked-out copy of NuttX and the NuttX apps.

* On the STM32H743 Nucleo-144 board, all the network examples I tried appear to work flawlessly.  This matches my general experience running NuttX on these parts; we have used them on several projects and have been very pleased with their performance overall.

* On the SAM E54 Xplained Pro board, I had mixed results.  I am not using this chip for any current projects, but I had the board handy and it is supported by NuttX, so I gave it a try in an attempt to collect more data. The TCP echo server and web server work as expected.  Using the TCP blaster example, only a fraction of the packets seem to make the round trip to the PC client application.  Watching in Wireshark, I see some runs of clean traffic interspersed with bursts of duplicate TCP packets and packets with invalid sequence numbers.

* On the LPC4088 Quickstart board, only the TCP echo server works reliably.  The web server will accept the initial connection and return a status code, but then hangs.  Looking at the exchange with Wireshark, I see the embedded board returns a fragment of the HTML content from the middle of the page, then a bunch of TCP packets with incorrect sequence numbers.  Using the TCP blaster example, I can see some traffic generated, again with a lot of invalid sequence numbers, but the PC client application does not report any successfully received packets.  I tried changing a number of networking- and Ethernet-related settings in menuconfig and was only ever able to make it less functional than this, never more.

* On the LPC1769 LPCXpresso board, I see identical results to the LPC4088 board.  This is not surprising as the two chips use the same Ethernet peripheral, but I figured it was worth checking for completeness.
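To make "invalid sequence numbers" concrete, here is a rough sketch of the check involved. It assumes the (sequence number, payload length) pairs for one direction of a single TCP connection have already been extracted from a capture by some other means; the function names are hypothetical and this is not how Wireshark itself is implemented. Each data segment should start where the previous one ended, modulo 2^32.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def seq_diff(a, b):
    """Signed distance a - b in 32-bit sequence space."""
    d = (a - b) % MOD
    return d - MOD if d >= MOD // 2 else d


def classify_segments(segments):
    """Label each (seq, payload_len) pair relative to the expected sequence.

    Assumes the segments are data segments from one direction of one
    connection, listed in capture order.
    """
    expected = None
    labels = []
    for seq, length in segments:
        end = (seq + length) % MOD
        if expected is None:
            labels.append("first")
            expected = end
            continue
        delta = seq_diff(seq, expected)
        if delta == 0:
            labels.append("in-order")
        elif delta < 0:
            labels.append("duplicate-or-retransmit")
        else:
            labels.append("gap")
        # Only advance the expectation when the segment carries new data.
        if seq_diff(end, expected) > 0:
            expected = end
    return labels


if __name__ == "__main__":
    capture = [(1000, 100), (1100, 100), (1100, 100), (1400, 50), (1200, 200)]
    for segment, label in zip(capture, classify_segments(capture)):
        print(segment, label)
```

A check like this only shows that the byte stream on the wire is inconsistent. It cannot say whether the driver sent a stale buffer, dropped a frame, or transmitted the same descriptor twice.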

Since the STM32H743 seems to work correctly, I don't believe there is an issue with the TCP/IP stack in NuttX, but possibly an issue with the drivers for the Ethernet peripherals on the chips that are having issues.  In my own application, I can't rule out the possibility of my code causing problems, but I certainly would expect to be able to use the provided NuttX apps such as the web server on any platform with a network interface.  The fact that at least one of the problems I'm seeing in my application matches a problem that I'm seeing with the example apps (missing/incorrect TCP sequence numbers) leads me to believe that I'm probably triggering the same issue, but I know that's not necessarily true.

I've been looking at this for a while now, and I'm more or less out of ideas on how to proceed.  I'll be the first to admit that I don't fully understand how the network drivers and the OS are supposed to interact.  Unless I'm missing something, the fact that so many network operations are deferred using worker threads really appears to make this area of the system difficult to debug.  I've done a lot of testing with network warning/error/info messages turned on, and found the signal/noise ratio to be pretty poor.  If anyone with more experience or familiarity with the NuttX TCP stack and/or Ethernet drivers could provide any comments, tips, or insight on this issue or how best to debug this type of problem, I would really appreciate it.

Thanks,

--Josh
