Hello, I am still trying to make the network work reliably. After fixing another issue in my application, I hit a new problem.
The following sequence causes NuttX to crash:

1. My application creates a TCP socket and communicates with a server.
2. At one point the server stops responding (unrelated to NuttX / network issue).
3. The application detects the timeout and calls close() on the socket.
4. A new socket is created and connected to the server.
5. At this point, the server decides to send a FIN message for the previous connection.
6. I get a failed assertion in devif_callback.c at line 85.

Note that I haven't managed to reproduce this issue manually. No matter
what I do by hand, everything seems to work correctly. I just have to
wait for it to happen.
It seems that it is only triggered if a FIN arrives **after** a SYN.

I am sure that this only happens with CONFIG_NET_TCP_WRITE_BUFFERS enabled.
I have no problems without buffering.

The assertion seems right to fire.
When a FIN is received for a closed connection, the same callback is freed
both by tcp_lost_connection() and, later on, by tcp_close_eventhandler().
All of this happens within the same execution of tcp_input().

Any ideas?

On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <sebast...@lorquet.fr> wrote:

> Hi,
>
> good find but
>
> -I dont think any usual application tinkers with PHY regs during its
> lifetime except the ethernet monitor
>
> -the fix is certainly a lock somewhere but global or fine grained I dont
> know.
>
> Not all calls need to be locked, eg the one that returns the PHY
> address. Probably not needed by default, but a PHY access lock would
> prevent any issue you describe.
>
> I will wait for people with more expertise about this.
>
> Just a note, dont forget that not all PHY have an interrupt, the one on
> the nucleo stm32h743zi[2] board does not have one.
>
> Sebastien
>
> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
> > Hello,
> >
> > I have eventually found 2 issues regarding networking in my application.
> > I would like to discuss the first one.
> >
> >
> > My code contains something like this:
> >
> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >
> > struct ifreq ifr;
> > memset(&ifr, 0, sizeof(struct ifreq));
> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> > ifr.ifr_mii_val_out = 0;
> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >
> > // Do stuff with ifr.ifr_mii_val_out.
> >
> > close(sd);
> >
> > I realized that this type of ioctl will directly access the hardware,
> > without any locking.
> > That is, if any other task needs to use the PHY in any other way, it will
> > eventually corrupt its register data.
> >
> >
> > Two questions on this:
> > 1. Is there any good reason for this?
> > 2. What is the best way to fix it? Shall I add a driver level lock, or
> > should net_lock() be used in any higher layer?
> >
> >
> >
> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>> have all been working reliably for months without stopping, we know it
> >>> because they critically depend on network functionality and we have
> >>> reports if a card becomes unreachable. None has so far outside of
> >>> dedicated tests.
> >>> So I believe that there is no obvious hard bug in these drivers.
> >> Good to hear that!
> >> Although, I may be using a feature or protocol that you are not.
> >> Of course, I don't believe that NuttX is broken per se, but a minor bug
> >> may lurk somewhere...
> >>
> >>
> >>> I have seen that when I enable the network debugging features, it seems to
> >>> hit an assertion failure before getting to nsh prompt at startup. This was
> >>> on a quite recent master. I haven't had a chance to diagnose this further.
> >>> Have you tried enabling these and if so, do they work?
> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
> >> I have some devices under test, waiting to reproduce the issue to see if
> >> this option provides any useful information.
> >>
> >>
> >>> Also, out of curiosity, have you tried running ostest on your board?
> >> I just tried.
> >> It passed all the tests.
> >>
> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <sebast...@lorquet.fr> wrote:
> >>
> >>> Hi,
> >>>
> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>> have all been working reliably for months without stopping, we know it
> >>> because they critically depend on network functionality and we have
> >>> reports if a card becomes unreachable. None has so far outside of
> >>> dedicated tests.
> >>>
> >>> So I believe that there is no obvious hard bug in these drivers.
> >>>
> >>> Most certainly a build option on your particular config. debug is a
> >>> possible issue, thread problems is another possibility.
> >>>
> >>> Sebastien
> >>>
> >>>
> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >>>> Hello!
> >>>>
> >>>> I am using Ethernet on an STM32F427 target, but I am facing some issues.
> >>>>
> >>>> Initially the device works correctly. After some hours of continuous
> >>>> operation I completely lose all network communications.
> >>>> Trying to troubleshoot the issue, I enabled assertions and various other
> >>>> debug features.
> >>>>
> >>>> Again the device works correctly for some hours, and then I get a failed
> >>>> assertion at stm32_eth.c, line 1372:
> >>>>
> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >>>>
> >>>> No other errors are reported (e.g. stack overflows etc).
> >>>>
> >>>>
> >>>> I have observed that this issue usually manifests itself when there is
> >>>> insufficient stack on a task.
> >>>> But in my case, all tasks have oversized stacks. Typically they do not
> >>>> exceed 50% utilization.
> >>>> I have plenty of room available in the heap too (> 100kB).
> >>>>
> >>>> Regarding the rest of the firmware, I cannot see any other misbehaviour or
> >>>> problem.
> >>>> I haven't ever seen any other unexplained problem, assertion fail,
> >>>> hard-fault etc.
> >>>> The application code passes all of our tests.
> >>>> In fact, even when this issue happens, although I lose network
> >>>> connectivity, the rest of the system works perfectly.
> >>>>
> >>>> Please note that I have checked the contents of dev->d_len and dev->d_buf,
> >>>> and they seem to contain valid data.
> >>>> The address lies within the normal address space of the MCU, and the size
> >>>> is sane.
> >>>> So it doesn't look like any kind of memory corruption.
> >>>>
> >>>>
> >>>> At this point I believe that this is an actual bug either on the STM32 MAC
> >>>> driver, or at the TCP/IP stack itself.
> >>>> I had a look at the driver code, but I didn't see anything suspicious.
> >>>>
> >>>>
> >>>> Has anyone observed the same issue before?
> >>>> Can it be affected in any way with my configuration?
> >>>> Or maybe, do you have any recommendations on what to test next?
> >>>>
> >>>>
> >>>> Thank you!
> >>>>
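
P.S. To make the double free from the top of this thread easier to discuss (the same callback being released by both tcp_lost_connection() and tcp_close_eventhandler() within one tcp_input() run), here is a minimal, stand-alone model of the pattern. It is only a sketch under my own assumptions: cb_s, conn_s, g_pool, cb_alloc() and cb_release() are hypothetical names that do not exist in NuttX, and this is not the actual devif_callback.c code. It only shows one common way to make a release idempotent, by clearing the owner's pointer once the entry has been freed. Whether such a guard is the right fix for the real code is something I cannot tell.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model. None of these names exist in NuttX. */

#define CB_POOL_SIZE 4

struct cb_s
{
  struct cb_s *next;    /* Next callback in the owner's list */
  bool in_use;          /* True while the entry is allocated */
};

struct conn_s
{
  struct cb_s *list;    /* Head of this connection's callback list */
  struct cb_s *sndcb;   /* The connection's own reference to its callback */
};

static struct cb_s g_pool[CB_POOL_SIZE];

/* Take a free entry from the pool and link it at the head of conn->list.
 * Returns NULL if the pool is exhausted.
 */

struct cb_s *cb_alloc(struct conn_s *conn)
{
  for (size_t i = 0; i < CB_POOL_SIZE; i++)
    {
      if (!g_pool[i].in_use)
        {
          g_pool[i].in_use = true;
          g_pool[i].next = conn->list;
          conn->list = &g_pool[i];
          return &g_pool[i];
        }
    }

  return NULL;
}

/* Release conn->sndcb at most once. Returns true if this call freed it,
 * false if an earlier teardown path had already done so.
 */

bool cb_release(struct conn_s *conn)
{
  struct cb_s *cb = conn->sndcb;
  struct cb_s **pp;

  if (cb == NULL)
    {
      return false;     /* Already released, nothing to do */
    }

  assert(cb->in_use);   /* Freeing an unallocated entry is a bug */

  /* Unlink the entry from the connection's list, if it is still there */

  for (pp = &conn->list; *pp != NULL; pp = &(*pp)->next)
    {
      if (*pp == cb)
        {
          *pp = cb->next;
          break;
        }
    }

  cb->next = NULL;
  cb->in_use = false;
  conn->sndcb = NULL;   /* Later teardown paths now see nothing to free */
  return true;
}
```

With this model, two teardown paths can both call cb_release() on the same connection: the first call frees the entry and returns true, the second finds a NULL reference and returns false instead of freeing again.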