Hi Alan, I am trying hard to reproduce the issue reliably, but I haven't been able to do so yet.
I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem does not disappear; rather, it changes form.
Now I occasionally get a failed assertion in wdog/wd_cancel.c, line 95.

I should mention that everything else in my system is commented out.
Currently the only thing running is the network thread that opens the TCP connection, nothing else.
I have disabled all of my usage of the workers, all signals etc.
I have verified (using Segger SystemView) that when the fault occurs, this thread is not interrupted by anything.
So a scheduling issue looks unlikely.
I also increased the stacks further, and I added padding to the very few malloc's that I use.

---

At this moment I observe something very interesting.
I am calling netlib_ifdown(), which causes the attached stack trace.

So:
1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set explicitly to NULL.
2. devif_dev_event() eventually calls tcp_close_eventhandler().
3. tcp_close_eventhandler() assumes that conn is NOT NULL, which causes the crash.

This is wrong, but I don't fully understand it yet.
Should there be a check for a NULL conn?
Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the first place?
Or should tcp_close_eventhandler() be tolerant to a NULL conn argument?

On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis <acas...@gmail.com> wrote:

> Hi Fotis,
>
> Are you in sync with mainline?
>
> If you can create a host application to induce the issue will be
> easier for us to test.
>
> BR,
>
> Alan
>
> On 8/9/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> > Hello,
> >
> > still trying to make the network work reliably.
> > After fixing another issue of my application, I hit another problem.
> >
> > The following sequence causes NuttX to crash:
> >
> > 1. My application is creating a TCP socket and communicates with a server.
> > 2. At one point the server stops responding (unrelated to NuttX / network
> > issue).
> > 3. The application detects the timeout, and calls close() on the socket.
> > 4. A new socket is created, and it is connected to the server.
> > 5. At this point, the server decides to send a FIN message for the
> > previous connection.
> > 6. I get a failed assertion in devif_callback.c at line 85.
> >
> > Note that I haven't managed to manually reproduce this issue.
> > No matter what I do manually, everything seems to be working correctly.
> > I just have to wait for it to happen.
> > It seems that it is only triggered if a FIN arrives **after** a SYN.
> >
> > I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> > enabled.
> > I have no problems without buffering.
> >
> > The assertion seems right to fire.
> > When a FIN is received for a closed connection, the same callback is
> > free'd both by tcp_lost_connection() and later on by
> > tcp_close_eventhandler().
> > All these are happening within the same execution of tcp_input().
> >
> > Any ideas?
> >
> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <sebast...@lorquet.fr> wrote:
> >
> >> Hi,
> >>
> >> good find but
> >>
> >> -I dont think any usual application tinkers with PHY regs during its
> >> lifetime except the ethernet monitor
> >>
> >> -the fix is certainly a lock somewhere but global or fine grained I dont
> >> know.
> >>
> >> Not all calls need to be locked, eg the one that returns the PHY
> >> address. Probably not needed by default, but a PHY access lock would
> >> prevent any issue you describe.
> >>
> >> I will wait for people with more expertise about this.
> >>
> >> Just a note, dont forget that not all PHY have an interrupt, the one on
> >> the nucleo stm32h743zi[2] board does not have one.
> >>
> >> Sebastien
> >>
> >> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
> >> > Hello,
> >> >
> >> > I have eventually found 2 issues regarding networking in my application.
> >> > I would like to discuss the first one.
> >> >
> >> > My code contains something like this:
> >> >
> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >> >
> >> > struct ifreq ifr;
> >> > memset(&ifr, 0, sizeof(struct ifreq));
> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >> > ifr.ifr_mii_val_out = 0;
> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >> >
> >> > // Do stuff with ifr.ifr_mii_val_out.
> >> >
> >> > close(sd);
> >> >
> >> > I realized that this type of ioctl will directly access the hardware,
> >> > without any locking.
> >> > That is, if any other task needs to use the PHY in any other way, it
> >> > will eventually corrupt its register data.
> >> >
> >> > Two questions on this:
> >> > 1. Is there any good reason for this?
> >> > 2. What is the best way to fix it? Shall I add a driver level lock, or
> >> > should net_lock() be used in any higher layer?
> >> >
> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >> >>> have all been working reliably for months without stopping, we know
> >> >>> it because they critically depend on network functionality and we
> >> >>> have reports if a card becomes unreachable. None has so far outside
> >> >>> of dedicated tests.
> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >> >> Good to hear that!
> >> >> Although, I may be using a feature or protocol that you are not.
> >> >> Of course, I don't believe that NuttX is broken per se, but a minor
> >> >> bug may lurk somewhere...
> >> >>
> >> >>> I have seen that when I enable the network debugging features, it
> >> >>> seems to hit an assertion failure before getting to nsh prompt at
> >> >>> startup. This was on a quite recent master. I haven't had a chance
> >> >>> to diagnose this further.
> >> >>> Have you tried enabling these and if so, do they work?
> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it
> >> >> works.
> >> >> I have some devices under test, waiting to reproduce the issue to see
> >> >> if this option provides any useful information.
> >> >>
> >> >>> Also, out of curiosity, have you tried running ostest on your board?
> >> >> I just tried.
> >> >> It passed all the tests.
> >> >>
> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <sebast...@lorquet.fr> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >> >>> have all been working reliably for months without stopping, we know
> >> >>> it because they critically depend on network functionality and we
> >> >>> have reports if a card becomes unreachable. None has so far outside
> >> >>> of dedicated tests.
> >> >>>
> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >> >>>
> >> >>> Most certainly a build option on your particular config. debug is a
> >> >>> possible issue, thread problems is another possibility.
> >> >>>
> >> >>> Sebastien
> >> >>>
> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >> >>>> Hello!
> >> >>>>
> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
> >> >>>> issues.
> >> >>>>
> >> >>>> Initially the device works correctly. After some hours of continuous
> >> >>>> operation I completely lose all network communications.
> >> >>>> Trying to troubleshoot the issue, I enabled assertions and various
> >> >>>> other debug features.
> >> >>>>
> >> >>>> Again the device works correctly for some hours, and then I get a
> >> >>>> failed assertion at stm32_eth.c, line 1372:
> >> >>>>
> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >> >>>>
> >> >>>> No other errors are reported (e.g. stack overflows etc).
> >> >>>>
> >> >>>> I have observed that this issue usually manifests itself when there
> >> >>>> is insufficient stack on a task.
> >> >>>> But in my case, all tasks have oversized stacks. Typically they do
> >> >>>> not exceed 50% utilization.
> >> >>>> I have plenty of room available in the heap too (> 100kB).
> >> >>>>
> >> >>>> Regarding the rest of the firmware, I cannot see any other
> >> >>>> misbehaviour or problem.
> >> >>>> I haven't ever seen any other unexplained problem, assertion fail,
> >> >>>> hard-fault etc.
> >> >>>> The application code passes all of our tests.
> >> >>>> In fact, even when this issue happens, although I lose network
> >> >>>> connectivity, the rest of the system works perfectly.
> >> >>>>
> >> >>>> Please note that I have checked the contents of dev->d_len and
> >> >>>> dev->d_buf, and they seem to contain valid data.
> >> >>>> The address lies within the normal address space of the MCU, and the
> >> >>>> size is sane.
> >> >>>> So it doesn't look like any kind of memory corruption.
> >> >>>>
> >> >>>> At this point I believe that this is an actual bug either on the
> >> >>>> STM32 MAC driver, or at the TCP/IP stack itself.
> >> >>>> I had a look at the driver code, but I didn't see anything
> >> >>>> suspicious.
> >> >>>>
> >> >>>> Has anyone observed the same issue before?
> >> >>>> Can it be affected in any way with my configuration?
> >> >>>> Or maybe, do you have any recommendations on what to test next?
> >> >>>>
> >> >>>> Thank you!
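
For reference, on the question at the top of this message of whether tcp_close_eventhandler() should tolerate a NULL conn argument: a minimal sketch of what such a guard could look like is below. Everything in it is hypothetical. demo_conn_s, the DEMO_* flags and demo_close_eventhandler() are stand-ins invented for illustration, they are not the NuttX definitions, and the real callback takes more arguments. Whether a guard like this is the right fix, or whether it would only hide a handler that should not be on the device callback list at all, is still an open question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event flags; NOT the NuttX flag values. */

#define DEMO_NETDOWN  0x01u
#define DEMO_CLOSE    0x02u

/* Hypothetical, heavily reduced connection structure. */

struct demo_conn_s
{
  int closed;  /* Set to 1 once the handler has torn the connection down */
};

/* Hypothetical close handler that tolerates a NULL connection.
 * A device-wide event (for example "interface down") may be delivered
 * with pvconn == NULL. In that case there is nothing to tear down
 * through this pointer, so the handler returns the flags unchanged
 * instead of dereferencing NULL.
 */

static uint16_t demo_close_eventhandler(void *pvconn, uint16_t flags)
{
  struct demo_conn_s *conn = (struct demo_conn_s *)pvconn;

  if (conn == NULL)
    {
      return flags;
    }

  if ((flags & (DEMO_NETDOWN | DEMO_CLOSE)) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}
```

With this guard a call such as demo_close_eventhandler(NULL, DEMO_NETDOWN) simply returns, while a call with a valid connection still marks it closed.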