Re: Ethernet issues on LPC17xx/LPC40xx

2023-08-10 Thread Gregory Nutt



On 8/10/2023 4:23 PM, Josh Lange wrote:

Ethernet issues on LPC17xx/LPC40xx

2023-08-10 Thread Josh Lange
First of all, thanks to everyone involved in the NuttX project. We 
really appreciate all the work that has gone into keeping this operating 
system maintained and functional on a wide variety of hardware.
We have several different NuttX-based projects that are using both 
LPC1769 and LPC4078 processors with an Ethernet interface for 
communication.  These projects are fairly mature, having been developed 
and used for several years now.  We've seen occasional glitches on 
Ethernet before, but we've more or less been able to tolerate them so 
far.  This is no longer the case in our current application of the 
system, and we'd really like to try to eliminate as many issues as we 
can from the software side of things.  We are primarily using Modbus 
TCP, which is a fairly simple request/response protocol.
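
For readers unfamiliar with the protocol, here is a minimal sketch of a Modbus TCP request frame, assuming a Read Holding Registers (function 0x03) query. The function name modbus_tcp_read_request() and its parameters are made up for illustration; they are not taken from the application described above or from NuttX. The server echoes the transaction ID in its response, which can help when matching requests to responses in a packet capture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODBUS_FC_READ_HOLDING_REGS 0x03
#define MODBUS_TCP_READ_REQ_LEN     12   /* 7-byte MBAP header + 5-byte PDU */

/* Store a 16-bit value in big-endian (network) byte order. */

static void put_be16(uint8_t *p, uint16_t v)
{
  p[0] = (uint8_t)(v >> 8);
  p[1] = (uint8_t)(v & 0xff);
}

/* Hypothetical helper: build a Read Holding Registers request.
 * Returns the frame length, or 0 if the buffer is too small.
 */

size_t modbus_tcp_read_request(uint8_t *buf, size_t buflen,
                               uint16_t txn_id, uint8_t unit_id,
                               uint16_t start_addr, uint16_t count)
{
  assert(buf != NULL);

  if (buflen < MODBUS_TCP_READ_REQ_LEN)
    {
      return 0;
    }

  put_be16(&buf[0], txn_id);      /* Transaction ID, echoed by the server */
  put_be16(&buf[2], 0);           /* Protocol ID, always 0 for Modbus */
  put_be16(&buf[4], 6);           /* Length: unit ID + 5-byte PDU */
  buf[6] = unit_id;               /* Unit identifier */
  buf[7] = MODBUS_FC_READ_HOLDING_REGS;
  put_be16(&buf[8], start_addr);  /* First register address */
  put_be16(&buf[10], count);      /* Number of registers to read */

  return MODBUS_TCP_READ_REQ_LEN;
}
```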


We have seen the issue manifest itself in several ways:

* Assertion failures in lpc17_40_ethernet.c:
** DEBUGASSERT((priv->lp_inten & ETH_INT_TXDONE) != 0) in lpc17_40_response
** DEBUGASSERT(lpc17_40_txdesc(priv) == OK) in lpc17_40_txdone_work

* Incorrect TCP sequence numbers in messages coming back from the 
embedded device.


Typically we will be able to run for many hundreds or thousands of 
packets before we hit one of these cases, but it does seem to depend to 
an extent on external factors such as which switch the device is 
connected to, the amount of broadcast traffic on the network, etc.  The 
nature of the failures makes me think that there may be a race condition 
of some kind that we're hitting, but I don't otherwise have a lot of 
other evidence to base that on.


In an attempt to narrow down the cause of these issues, I pulled out a 
few dev boards and tried to run some of the stock NuttX example apps 
(TCP echo server, TCP blaster server, uIP web server) on them with 
settings as close to defaults as possible, using a freshly-checked-out 
copy of NuttX and the NuttX apps.


* On the STM32H743 Nucleo-144 board, all the network examples I tried 
appear to work flawlessly.  This matches my general experience running 
NuttX on these parts; we have used them on several projects and have 
been very pleased with their performance overall.


* On the SAM E54 Xplained Pro board, I had mixed results.  I am not 
using this chip for any current projects, but I had the board handy and 
it is supported by NuttX, so I gave it a try in an attempt to collect 
more data. The TCP echo server and web server work as expected.  Using 
the TCP blaster example, only a fraction of the packets seem to make the 
round trip to the PC client application.  Watching in Wireshark, I see 
some runs of clean traffic interspersed with bursts of duplicate TCP 
packets and packets with invalid sequence numbers.


* On the LPC4088 Quickstart board, only the TCP echo server works 
reliably.  The web server will accept the initial connection and return 
a status code, but then hangs.  Looking at the exchange with Wireshark, 
I see the embedded board returns a fragment of the HTML content from the 
middle of the page, then a bunch of TCP packets with incorrect sequence 
numbers.  Using the TCP blaster example, I can see some traffic 
generated, again with a lot of invalid sequence numbers, but the PC 
client application does not report any successfully received packets.  I 
tried changing a number of networking- and Ethernet-related settings in 
menuconfig and was only ever able to make it less functional than this, 
never more.


* On the LPC1769 LPCXpresso board, I see identical results to the 
LPC4088 board.  This is not surprising as the two chips use the same 
Ethernet peripheral, but I figured it was worth checking for completeness.


Since the STM32H743 seems to work correctly, I don't believe there is an 
issue with the TCP/IP stack in NuttX, but possibly an issue with the 
drivers for the Ethernet peripherals on the chips that are having 
issues.  In my own application, I can't rule out the possibility of my 
code causing problems, but I certainly would expect to be able to use 
the provided NuttX apps such as the web server on any platform with a 
network interface.  The fact that at least one of the problems I'm 
seeing in my application matches a problem that I'm seeing with the 
example apps (missing/incorrect TCP sequence numbers) leads me to 
believe that I'm probably triggering the same issue, but I know that's 
not necessarily true.


I've been looking at this for a while now, and I'm more or less out of 
ideas on how to proceed.  I'll be the first to admit that I don't fully 
understand how the network drivers and the OS are supposed to interact.  
Unless I'm missing something, the fact that so many network operations 
are deferred using worker threads really appears to make this area of 
the system difficult to debug.  I've done a lot of testing with network 
warning/error/info messages turned on, and found the signal/noise ratio 
to be pretty poor.  If anyone with more experience or familiarity with 
the 

Re: CAN TX fail handling

2023-08-10 Thread Alan C. Assis
Hi Nathan,

On 8/10/23, Nathan Hartman wrote:
> On Thu, Aug 10, 2023 at 4:38 AM Tim Hardisty wrote:
>
>> I like your idea of IOCTLs - I will be revisiting this issue in the next
>> few weeks and will look to see what's involved in implementing this as it
>> "feels" right.
>>
>
> snip
>
>> In trying to cover potential board faults, I have found that if
>> there's something that prevents a CAN message reaching an
>> endpoint/destination, the CAN transmitter (of course, as I
>> understand it) is continuously retrying the message send, meaning
>> the test app hangs when you try and close the file once the test has
>> been deemed to fail. That is "by design" in the higher (i.e.
>> non-arch specific) can code as it waits for the TX FIFO/queue to empty
>> until the close is allowed.
>>
>> What is the correct POSIX way to handle this error condition?
>>
>>
> Sounds like in CAN we need the equivalent of tcflush() / tcdrain() as found
> in termios. (Try looking up the man page for these functions on your system
> or at online manpages.) In NuttX, at least for serial ports (i.e., UARTs),
> these functions call IOCTLs which (if I remember correctly) are partly
> implemented in the upper half driver (to clear the software buffer) and
> partly passed to the lower half driver (to flush the hardware FIFO, if
> applicable in the arch in question).
>
> I am not sure whether actual *termios* and its tc family of functions like
> tcflush() / tcdrain() are a good fit for CAN. Maybe they are and you can
> just adopt the same IOCTLs they use. But even if not, you can follow along
> how these are implemented in NuttX and do something very similar.
>

I think Can4Linux could be a good standard to follow; it was used on
Linux before SocketCAN (and is still an option there).

https://en.wikipedia.org/wiki/Can4linux

https://gitlab.com/hjoertel/can4linux

BR,

Alan


Re: CAN TX fail handling

2023-08-10 Thread Nathan Hartman
On Thu, Aug 10, 2023 at 4:38 AM Tim Hardisty wrote:

> I like your idea of IOCTLs - I will be revisiting this issue in the next
> few weeks and will look to see what's involved in implementing this as it
> "feels" right.
>

snip

> In trying to cover potential board faults, I have found that if
> there's something that prevents a CAN message reaching an
> endpoint/destination, the CAN transmitter (of course, as I
> understand it) is continuously retrying the message send, meaning
> the test app hangs when you try and close the file once the test has
> been deemed to fail. That is "by design" in the higher (i.e.
> non-arch specific) can code as it waits for the TX FIFO/queue to empty
> until the close is allowed.
>
> What is the correct POSIX way to handle this error condition?
>
>
Sounds like in CAN we need the equivalent of tcflush() / tcdrain() as found
in termios. (Try looking up the man page for these functions on your system
or at online manpages.) In NuttX, at least for serial ports (i.e., UARTs),
these functions call IOCTLs which (if I remember correctly) are partly
implemented in the upper half driver (to clear the software buffer) and
partly passed to the lower half driver (to flush the hardware FIFO, if
applicable in the arch in question).

I am not sure whether actual *termios* and its tc family of functions like
tcflush() / tcdrain() are a good fit for CAN. Maybe they are and you can
just adopt the same IOCTLs they use. But even if not, you can follow along
how these are implemented in NuttX and do something very similar.
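
A minimal sketch of the serial-port calls mentioned above, assuming a POSIX termios environment. tcdrain() blocks until queued output has been transmitted, while tcflush() with TCOFLUSH discards output that has not yet been sent. The helper name close_discarding_tx() is hypothetical, and nothing here is an existing CAN API; it only shows the pattern that a CAN equivalent might mirror.

```c
#include <errno.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: discard any output that has not been transmitted
 * yet, then close the descriptor.  The descriptor is closed even if the
 * flush fails.  Returns 0 on success, or -1 with errno set on failure.
 */

int close_discarding_tx(int fd)
{
  int flush_ret   = tcflush(fd, TCOFLUSH);  /* Drop pending output */
  int flush_errno = errno;

  if (close(fd) < 0)
    {
      return -1;                            /* errno set by close() */
    }

  if (flush_ret < 0)
    {
      errno = flush_errno;                  /* Report the flush failure */
      return -1;
    }

  return 0;
}
```

A caller that prefers to wait for the data to go out, rather than discard it, would call tcdrain(fd) before close() instead.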

Hope this helps,
Nathan


Re: CAN TX fail handling

2023-08-10 Thread Tim Hardisty
Thanks David - whilst I perhaps could have searched for that, it is why I 
asked here as I was sure someone else was likely to have seen this.


I like your idea of IOCTLs - I will be revisiting this issue in the next 
few weeks and will look to see what's involved in implementing this as 
it "feels" right.


On 10/08/2023 09:04, David Sidrane wrote:

Tim,

See https://github.com/apache/nuttx/issues/3927

David

-----Original Message-----
From: Alan C. Assis
Sent: Wednesday, August 9, 2023 3:47 PM
To: dev@nuttx.apache.org
Cc: Pavel Pisa
Subject: Re: CAN TX fail handling

Hi Tim,

Agree! This behavior could be implemented in the driver, for example using
some elapsed time. But again, it needs to be analyzed carefully to avoid
introducing something too specific to one user's needs.

Currently can_close() tries to wait for TX to complete, which may never
happen because of this issue.

If you implement the idea of resetting the CAN controller in
can_close(), you need to guarantee that it will be reinitialized correctly,
because can_open() expects the CAN controller to be in a working state.

BR,

Alan

On 8/9/23, Tim Hardisty  wrote:

Thanks Alan,

I can see that a timeout/retry in detail would be hardware dependent.
But in the absence of "something," code can send a message, but have
no idea that it hasn't actually been sent, then try and close the "file"
and the thread will hang indefinitely. I think we need something that
reports the fail so some kind of recovery/reset can be attempted?

Perhaps the "close" could be wrapped with something to deal with this?
Or the open mode needs to be different somehow?

POSIX/Linux type programming is new to me, after decades of bare-metal
type software dev where I'm in total control albeit unique to a
given/chosen processor, so any suggestions would be very welcome.

On 09/08/2023 19:56, Alan C. Assis wrote:

Hi Tim,

I think that the default behavior of a CAN controller is to keep trying to
send a message indefinitely; some HW can define a retry limit.

Please take a look:
https://forum.pjrc.com/threads/67435-FlexCAN-Infinite-Endless-TX-Retries

So, I'm not sure if it makes sense to implement a CAN TX timeout
on the NuttX side, since this behavior could be HW dependent.

BR,

Alan

On 8/9/23, Tim Hardisty   wrote:

I am now cracking on with the app for my custom board, and in
parallel writing a production board-test app.

In trying to cover potential board faults, I have found that if
there's something that prevents a CAN message reaching an
endpoint/destination, the CAN transmitter (of course, as I
understand it) is continuously retrying the message send, meaning
the test app hangs when you try and close the file once the test has
been deemed to fail. That is "by design" in the higher (i.e.
non-arch specific) can code as it waits for the TX FIFO/queue to empty
until the close is allowed.

What is the correct POSIX way to handle this error condition?

Might it be better to use Socket CAN, for example, assuming it has
better error handling by design, or is the NuttX CAN "system"
fundamentally missing something to handle this (or, more likely,
I've just missed it )?




--

Regards,

Tim Hardisty


+44 (0) 1305 534535

JTi.uk.com

\JTinnovations

JT Innovations Ltd.

Registered office: 36 East St, Weymouth, Dorset, DT3 4DT, UK.

Company number 7619086

VAT Registration GB 111 7906 35


RE: CAN TX fail handling

2023-08-10 Thread David Sidrane
Tim,

See https://github.com/apache/nuttx/issues/3927

David

-----Original Message-----
From: Alan C. Assis 
Sent: Wednesday, August 9, 2023 3:47 PM
To: dev@nuttx.apache.org
Cc: Pavel Pisa 
Subject: Re: CAN TX fail handling

Hi Tim,

Agree! This behavior could be implemented in the driver, for example using
some elapsed time. But again, it needs to be analyzed carefully to avoid
introducing something too specific to one user's needs.

Currently can_close() tries to wait for TX to complete, which may never
happen because of this issue.

If you implement the idea of resetting the CAN controller in
can_close(), you need to guarantee that it will be reinitialized correctly,
because can_open() expects the CAN controller to be in a working state.
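
The elapsed-time idea could be sketched as a bounded wait, shown here in portable POSIX C rather than as NuttX driver code. wait_tx_empty(), the tx_empty callback, and the two demo predicates are hypothetical names used only for illustration; a real driver would query its own TX queue or hardware status, and would still have to decide how to recover (for example by resetting the controller) once the wait times out. The negated errno return value follows the convention NuttX uses internally.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in milliseconds. */

static int64_t now_ms(void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Poll tx_empty() about once per millisecond until it reports true or
 * timeout_ms has elapsed.  Returns 0 if the queue emptied, or -ETIMEDOUT
 * if the deadline passed first.
 */

int wait_tx_empty(bool (*tx_empty)(void *arg), void *arg,
                  unsigned int timeout_ms)
{
  const struct timespec nap = { 0, 1000000L };  /* 1 ms between polls */
  int64_t deadline;

  assert(tx_empty != NULL);
  deadline = now_ms() + (int64_t)timeout_ms;

  for (; ; )
    {
      if (tx_empty(arg))
        {
          return 0;
        }

      if (now_ms() >= deadline)
        {
          return -ETIMEDOUT;
        }

      nanosleep(&nap, NULL);
    }
}

/* Stand-ins for a real driver's "is the TX queue empty?" check. */

bool demo_always_empty(void *arg)
{
  (void)arg;
  return true;
}

bool demo_never_empty(void *arg)
{
  (void)arg;
  return false;   /* Models a frame that is never acknowledged */
}
```

With demo_never_empty(), the call returns -ETIMEDOUT after roughly timeout_ms instead of blocking forever, which is the case a close path would need to handle.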

BR,

Alan

On 8/9/23, Tim Hardisty  wrote:
> Thanks Alan,
>
> I can see that a timeout/retry in detail would be hardware dependent.
> But in the absence of "something," code can send a message, but have
> no idea that it hasn't actually been sent, then try and close the "file"
> and the thread will hang indefinitely. I think we need something that
> reports the fail so some kind of recovery/reset can be attempted?
>
> Perhaps the "close" could be wrapped with something to deal with this?
> Or the open mode needs to be different somehow?
>
> POSIX/Linux type programming is new to me, after decades of bare-metal
> type software dev where I'm in total control albeit unique to a
> given/chosen processor, so any suggestions would be very welcome.
>
> On 09/08/2023 19:56, Alan C. Assis wrote:
>> Hi Tim,
>>
>> I think that the default behavior of a CAN controller is to keep trying to
>> send a message indefinitely; some HW can define a retry limit.
>>
>> Please take a look:
>> https://forum.pjrc.com/threads/67435-FlexCAN-Infinite-Endless-TX-Retries
>>
>> So, I'm not sure if it makes sense to implement a CAN TX timeout
>> on the NuttX side, since this behavior could be HW dependent.
>>
>> BR,
>>
>> Alan
>>
>> On 8/9/23, Tim Hardisty  wrote:
>>> I am now cracking on with the app for my custom board, and in
>>> parallel writing a production board-test app.
>>>
>>> In trying to cover potential board faults, I have found that if
>>> there's something that prevents a CAN message reaching an
>>> endpoint/destination, the CAN transmitter (of course, as I
>>> understand it) is continuously retrying the message send, meaning
>>> the test app hangs when you try and close the file once the test has
>>> been deemed to fail. That is "by design" in the higher (i.e.
>>> non-arch specific) can code as it waits for the TX FIFO/queue to empty
>>> until the close is allowed.
>>>
>>> What is the correct POSIX way to handle this error condition?
>>>
>>> Might it be better to use Socket CAN, for example, assuming it has
>>> better error handling by design, or is the NuttX CAN "system"
>>> fundamentally missing something to handle this (or, more likely,
>>> I've just missed it )?
>>>
>>>