Hi Fotis,

Yes, I understood the point. Because it needs the right timing, it could be tricky to duplicate.
Did you try to create a simple host server to emulate this connection issue?

BR,

Alan

On 8/12/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> I think I understand the nature of the bug.
>
> When closing a socket, tcp_close_eventhandler() is set as a callback in the
> dev->d_devcb list.
>
> Typically, the server's response (FIN ACK) will have as a result
> tcp_callback() to be executed, and thus the callback to be properly called,
> with proper arguments.
> Then the cb is properly free'd.
>
> If however devif_dev_event() has the chance to execute before
> tcp_callback() (e.g. server's response was lost), then the callbacks take
> NULL as a conn argument.
> This crashes the whole system horribly.
>
> As you see, this requires specific timings with the server communication,
> that's why this is so hard to reproduce.
>
>
> On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com>
> wrote:
>
>> Hi Alan,
>>
>> I am trying hard to reproduce the issue reliably, but I haven't been able
>> to do so yet.
>>
>> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
>> does not disappear, rather it changes form.
>> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
>>
>> I have to mention that everything in my system is commented out.
>> Currently the only thing working is the network thread that opens the TCP
>> connection, nothing else.
>> I have disabled all of my usage of the workers, all signals etc.
>> I verify that when the fault occurs, this thread is not interrupted by
>> anything (using Segger SystemView).
>> It looks like a scheduling issue is unlikely.
>>
>> I also increased the stacks more, and I added padding to the very few
>> malloc's that I use.
>>
>> ---
>>
>> At this moment I observe something very interesting.
>> I am calling netlib_ifdown(), which causes the attached stack trace.
>>
>> So:
>> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
>> explicitly to NULL.
>> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
>> 3. tcp_close_eventhandler() assumes that conn is NOT NULL. Which causes
>> the crash.
>>
>> This is wrong, but I don't have the understanding of it yet.
>> Shall there be a check for a NULL conn?
>> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
>> first place?
>> Or tcp_close_eventhandler() should be tolerant to a NULL conn argument?
>>
>>
>> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis
>> <acas...@gmail.com>
>> wrote:
>>
>>> Hi Fotis,
>>>
>>> Are you in sync with mainline?
>>>
>>> If you can create a host application to induce the issue, it will be
>>> easier for us to test.
>>>
>>> BR,
>>>
>>> Alan
>>>
>>> On 8/9/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > still trying to make the network work reliably.
>>> > After fixing another issue of my application, I hit another problem.
>>> >
>>> > The following sequence causes NuttX to crash:
>>> >
>>> > 1. My application is creating a TCP socket and communicates with a
>>> > server.
>>> > 2. At one point the server stops responding (unrelated to NuttX /
>>> > network issue).
>>> > 3. The application detects the timeout, and calls close() on the
>>> > socket.
>>> > 4. A new socket is created, and it is connected to the server.
>>> > 5. At this point, the server decides to send a FIN message for the
>>> > previous connection.
>>> > 6. I get a failed assertion in devif_callback.c at line 85.
>>> >
>>> > Note that I haven't managed to manually reproduce this issue.
>>> > No matter what I do manually, everything seems to be working
>>> > correctly.
>>> > I just have to wait for it to happen.
>>> > It seems that it is only triggered if a FIN arrives **after** a SYN.
>>> >
>>> > I am sure that this is only happening with
>>> > CONFIG_NET_TCP_WRITE_BUFFERS enabled.
>>> > I have no problems without buffering.
>>> >
>>> > The assertion seems right to fire.
>>> > When a FIN is received for a closed connection, the same callback is
>>> > free'd both by tcp_lost_connection() and later on by
>>> > tcp_close_eventhandler().
>>> > All these are happening within the same execution of tcp_input().
>>> >
>>> > Any ideas?
>>> >
>>> >
>>> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet
>>> > <sebast...@lorquet.fr>
>>> > wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> good find but
>>> >>
>>> >> -I don't think any usual application tinkers with PHY regs during its
>>> >> lifetime except the ethernet monitor
>>> >>
>>> >> -the fix is certainly a lock somewhere but global or fine grained I
>>> >> don't know.
>>> >>
>>> >> Not all calls need to be locked, e.g. the one that returns the PHY
>>> >> address. Probably not needed by default, but a PHY access lock would
>>> >> prevent any issue you describe.
>>> >>
>>> >> I will wait for people with more expertise about this.
>>> >>
>>> >> Just a note, don't forget that not all PHY have an interrupt, the one
>>> >> on the nucleo stm32h743zi[2] board does not have one.
>>> >>
>>> >> Sebastien
>>> >>
>>> >> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
>>> >> > Hello,
>>> >> >
>>> >> > I have eventually found 2 issues regarding networking in my
>>> >> > application.
>>> >> > I would like to discuss the first one.
>>> >> >
>>> >> >
>>> >> > My code contains something like this:
>>> >> >
>>> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
>>> >> >
>>> >> > struct ifreq ifr;
>>> >> > memset(&ifr, 0, sizeof(struct ifreq));
>>> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
>>> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
>>> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
>>> >> > ifr.ifr_mii_val_out = 0;
>>> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
>>> >> >
>>> >> > // Do stuff with ifr.ifr_mii_val_out.
>>> >> >
>>> >> > close(sd);
>>> >> >
>>> >> > I realized that this type of ioctl will directly access the
>>> >> > hardware, without any locking.
>>> >> > That is, if any other task needs to use the PHY in any other way,
>>> >> > it will eventually corrupt its register data.
>>> >> >
>>> >> >
>>> >> > Two questions on this:
>>> >> > 1. Is there any good reason for this?
>>> >> > 2. What is the best way to fix it? Shall I add a driver level lock,
>>> >> > or should net_lock() be used in any higher layer?
>>> >> >
>>> >> >
>>> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos
>>> >> > <f.j.pa...@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> >> Hello,
>>> >> >>
>>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>>> >> >>> they have all been working reliably for months without stopping,
>>> >> >>> we know it because they critically depend on network
>>> >> >>> functionality and we have reports if a card becomes unreachable.
>>> >> >>> None has so far outside of dedicated tests.
>>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>>> >> >> Good to hear that!
>>> >> >> Although, I may be using a feature or protocol that you are not.
>>> >> >> Of course, I don't believe that NuttX is broken per se, but a
>>> >> >> minor bug may lurk somewhere...
>>> >> >>
>>> >> >>
>>> >> >>> I have seen that when I enable the network debugging features, it
>>> >> >>> seems to hit an assertion failure before getting to nsh prompt at
>>> >> >>> startup. This was on a quite recent master. I haven't had a
>>> >> >>> chance to diagnose this further.
>>> >> >>> Have you tried enabling these and if so, do they work?
>>> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and
>>> >> >> it works.
>>> >> >> I have some devices under test, waiting to reproduce the issue to
>>> >> >> see if this option provides any useful information.
>>> >> >>
>>> >> >>
>>> >> >>> Also, out of curiosity, have you tried running ostest on your
>>> >> >>> board?
>>> >> >> I just tried.
>>> >> >> It passed all the tests.
>>> >> >>
>>> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
>>> >> >> <sebast...@lorquet.fr>
>>> >> >> wrote:
>>> >> >>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>>> >> >>> they have all been working reliably for months without stopping,
>>> >> >>> we know it because they critically depend on network
>>> >> >>> functionality and we have reports if a card becomes unreachable.
>>> >> >>> None has so far outside of dedicated tests.
>>> >> >>>
>>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>>> >> >>>
>>> >> >>> Most certainly a build option on your particular config. debug is
>>> >> >>> a possible issue, thread problems is another possibility.
>>> >> >>>
>>> >> >>> Sebastien
>>> >> >>>
>>> >> >>>
>>> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>>> >> >>>> Hello!
>>> >> >>>>
>>> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
>>> >> >>>> issues.
>>> >> >>>>
>>> >> >>>> Initially the device works correctly. After some hours of
>>> >> >>>> continuous operation I completely lose all network
>>> >> >>>> communications.
>>> >> >>>> Trying to troubleshoot the issue, I enabled assertions and
>>> >> >>>> various other debug features.
>>> >> >>>>
>>> >> >>>> Again the device works correctly for some hours, and then I get
>>> >> >>>> a failed assertion at stm32_eth.c, line 1372:
>>> >> >>>>
>>> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>>> >> >>>>
>>> >> >>>> No other errors are reported (e.g. stack overflows etc).
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> I have observed that this issue usually manifests itself when
>>> >> >>>> there is insufficient stack on a task.
>>> >> >>>> But in my case, all tasks have oversized stacks. Typically they
>>> >> >>>> do not exceed 50% utilization.
>>> >> >>>> I have plenty of room available in the heap too (> 100kB).
>>> >> >>>>
>>> >> >>>> Regarding the rest of the firmware, I cannot see any other
>>> >> >>>> misbehaviour or problem.
>>> >> >>>> I haven't ever seen any other unexplained problem, assertion
>>> >> >>>> fail, hard-fault etc.
>>> >> >>>> The application code passes all of our tests.
>>> >> >>>> In fact, even when this issue happens, although I lose network
>>> >> >>>> connectivity, the rest of the system works perfectly.
>>> >> >>>>
>>> >> >>>> Please note that I have checked the contents of dev->d_len and
>>> >> >>>> dev->d_buf, and they seem to contain valid data.
>>> >> >>>> The address lies within the normal address space of the MCU, and
>>> >> >>>> the size is sane.
>>> >> >>>> So it doesn't look like any kind of memory corruption.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> At this point I believe that this is an actual bug either on the
>>> >> >>>> STM32 MAC driver, or at the TCP/IP stack itself.
>>> >> >>>> I had a look at the driver code, but I didn't see anything
>>> >> >>>> suspicious.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Has anyone observed the same issue before?
>>> >> >>>> Can it be affected in any way with my configuration?
>>> >> >>>> Or maybe, do you have any recommendations on what to test next?
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Thank you!
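
P.S. On the open question earlier in this thread (should a close handler tolerate a NULL conn when a device-wide event is dispatched?), here is a minimal, self-contained sketch of that one option. It is not NuttX code: conn_s, callback_s, dispatch(), close_eventhandler() and the EV_* flags are hypothetical stand-ins that only mimic the shape of a device callback list, and whether a guard like this is the right fix for the real stack is an assumption that still needs checking against the actual sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event flags; NOT the real NuttX definitions. */
#define EV_NETDEV_DOWN (1u << 0)
#define EV_TCP_CLOSE   (1u << 1)

/* Hypothetical stand-in for a TCP connection structure. */
struct conn_s
{
  int closed;
};

typedef uint16_t (*event_fn)(void *pvconn, void *pvpriv, uint16_t flags);

/* Hypothetical entry in a device-wide callback list. */
struct callback_s
{
  struct callback_s *next;
  event_fn event;
  void *priv;
};

/* Close handler that tolerates a NULL connection argument.
 * pvpriv points to an int call counter (used only for illustration).
 */
uint16_t close_eventhandler(void *pvconn, void *pvpriv, uint16_t flags)
{
  struct conn_s *conn = pvconn;
  int *ncalls = pvpriv;

  assert(ncalls != NULL);
  (*ncalls)++;

  if (conn == NULL)
    {
      /* Device-wide event (e.g. interface down): there is no
       * connection to dereference, so do nothing and return.
       */
      return flags;
    }

  if ((flags & EV_TCP_CLOSE) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}

/* Walk the list and pass pvconn through unchanged, as a device-wide
 * dispatcher would. The next pointer is saved first because a handler
 * may unlink its own entry.
 */
uint16_t dispatch(struct callback_s *list, void *pvconn, uint16_t flags)
{
  struct callback_s *cb = list;

  while (cb != NULL)
    {
      struct callback_s *next = cb->next;

      if (cb->event != NULL)
        {
          flags = cb->event(pvconn, cb->priv, flags);
        }

      cb = next;
    }

  return flags;
}
```

Calling dispatch(&cb, NULL, EV_NETDEV_DOWN) on a list that contains this handler returns without touching any connection, while dispatch(&cb, &conn, EV_TCP_CLOSE) marks the connection closed. The alternative raised in the thread, not registering the close handler in the device-wide list at all, is not covered by this sketch.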