Hi Alan,

I am trying hard to reproduce the issue reliably, but I haven't been able
to do so yet.

I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
does not disappear; it just changes form.
Now I occasionally get a failed assertion in wdog/wd_cancel.c, line 95.

I have to mention that everything else in my system is commented out.
Currently the only thing running is the network thread that opens the TCP
connection, nothing else.
I have disabled all of my usage of the workers, all signals, etc.
I have verified (using Segger SystemView) that when the fault occurs, this
thread is not interrupted by anything.
So a scheduling issue looks unlikely.

I also increased the stack sizes further, and I added padding to the very few
malloc() calls that I use.

---

At this moment I observe something very interesting.
I am calling netlib_ifdown(), which causes the attached stack trace.

So:
1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
explicitly to NULL.
2. devif_dev_event() eventually calls tcp_close_eventhandler().
3. tcp_close_eventhandler() assumes that conn is not NULL, which causes the
crash.

This is wrong, but I don't fully understand it yet.
Should there be a check for a NULL conn?
Or maybe tcp_close_eventhandler() should not be in the callback list in the
first place?
Or should tcp_close_eventhandler() tolerate a NULL conn argument?
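
To make the last option concrete, here is a minimal, standalone sketch of a
handler that tolerates a NULL conn. It is not NuttX code: sketch_conn,
sketch_close_eventhandler(), sketch_dev_event() and SKETCH_NETDEV_DOWN are
hypothetical stand-ins that I made up for illustration, and the real callback
signature and flags may differ. I am also only assuming that silently ignoring
the event is acceptable; it may well hide the real problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only, NOT the NuttX definitions. */

#define SKETCH_NETDEV_DOWN 0x0001u  /* Made-up "interface went down" flag */

struct sketch_conn
{
  int closed;  /* Set to 1 once the close handler has handled this conn */
};

typedef uint16_t (*sketch_handler_t)(void *pvconn, uint16_t flags);

/* Close handler that tolerates a device-level event delivered with
 * pvconn == NULL: it returns without touching any connection state.
 */

static uint16_t sketch_close_eventhandler(void *pvconn, uint16_t flags)
{
  struct sketch_conn *conn = pvconn;

  if (conn == NULL)
    {
      /* No connection attached to this event, so nothing to close. */

      return flags;
    }

  if ((flags & SKETCH_NETDEV_DOWN) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}

/* Mimics a device-level dispatcher that forwards whatever pvconn it was
 * given, including NULL, to the registered handler.
 */

static uint16_t sketch_dev_event(sketch_handler_t handler, void *pvconn,
                                 uint16_t flags)
{
  return handler(pvconn, flags);
}
```

With such a guard, sketch_dev_event(sketch_close_eventhandler, NULL,
SKETCH_NETDEV_DOWN) returns normally instead of dereferencing NULL, while a
call with a valid connection still marks it as closed.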

On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis <acas...@gmail.com>
wrote:

> Hi Fotis,
>
> Are you in sync with mainline?
>
> If you can create a host application to induce the issue, it will be
> easier for us to test.
>
> BR,
>
> Alan
>
> On 8/9/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> > Hello,
> >
> > I am still trying to make the network work reliably.
> > After fixing another issue in my application, I hit another problem.
> >
> > The following sequence causes NuttX to crash:
> >
> > 1. My application creates a TCP socket and communicates with a server.
> > 2. At one point the server stops responding (unrelated to NuttX / network
> > issue).
> > 3. The application detects the timeout, and calls close() on the socket.
> > 4. A new socket is created, and it is connected to the server.
> > 5. At this point, the server decides to send a FIN message for the previous
> > connection.
> > 6. I get a failed assertion in devif_callback.c at line 85.
> >
> > Note that I haven't managed to manually reproduce this issue.
> > No matter what I do manually, everything seems to be working correctly.
> > I just have to wait for it to happen.
> > It seems that it is only triggered if a FIN arrives **after** a SYN.
> >
> > I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> > enabled.
> > I have no problems without buffering.
> >
> > The assertion seems right to fire.
> > When a FIN is received for a closed connection, the same callback is freed
> > both by tcp_lost_connection() and later on by tcp_close_eventhandler().
> > All these are happening within the same execution of tcp_input().
> >
> > Any ideas?
> >
> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <sebast...@lorquet.fr>
> > wrote:
> >
> >> Hi,
> >>
> >> good find but
> >>
> >> - I don't think any usual application tinkers with PHY regs during its
> >> lifetime, except the ethernet monitor.
> >>
> >> - The fix is certainly a lock somewhere, but global or fine grained I
> >> don't know.
> >>
> >> Not all calls need to be locked, e.g. the one that returns the PHY
> >> address. Probably not needed by default, but a PHY access lock would
> >> prevent any issue you describe.
> >>
> >> I will wait for people with more expertise about this.
> >>
> >> Just a note: don't forget that not all PHYs have an interrupt; the one on
> >> the nucleo stm32h743zi[2] board does not have one.
> >>
> >> Sebastien
> >>
> >> Le 26/07/2022 à 11:05, Fotis Panagiotopoulos a écrit :
> >> > Hello,
> >> >
> >> > I have eventually found 2 issues regarding networking in my
> >> > application.
> >> > I would like to discuss the first one.
> >> >
> >> >
> >> > My code contains something like this:
> >> >
> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >> >
> >> > struct ifreq ifr;
> >> > memset(&ifr, 0, sizeof(struct ifreq));
> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >> > ifr.ifr_mii_val_out = 0;
> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >> >
> >> > // Do stuff with ifr.ifr_mii_val_out.
> >> >
> >> > close(sd);
> >> >
> >> > I realized that this type of ioctl will directly access the hardware,
> >> > without any locking.
> >> > That is, if any other task needs to use the PHY in any other way, it
> >> > will eventually corrupt its register data.
> >> >
> >> >
> >> > Two questions on this:
> >> > 1. Is there any good reason for this?
> >> > 2. What is the best way to fix it? Shall I add a driver level lock, or
> >> > should net_lock() be used in any higher layer?
> >> >
> >> >
> >> >
> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos
> >> > <f.j.pa...@gmail.com> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >> >>> have all been working reliably for months without stopping, we know it
> >> >>> because they critically depend on network functionality and we have
> >> >>> reports if a card becomes unreachable. None has so far outside of
> >> >>> dedicated tests.
> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >> >> Good to hear that!
> >> >> Although, I may be using a feature or protocol that you are not.
> >> >> Of course, I don't believe that NuttX is broken per se, but a minor
> >> >> bug may lurk somewhere...
> >> >>
> >> >>
> >> >>> I have seen that when I enable the network debugging features, it seems
> >> >>> to hit an assertion failure before getting to nsh prompt at startup. This
> >> >>> was on a quite recent master. I haven't had a chance to diagnose this
> >> >>> further.
> >> >>> Have you tried enabling these and if so, do they work?
> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it
> >> >> works.
> >> >> I have some devices under test, waiting to reproduce the issue to see
> >> >> if this option provides any useful information.
> >> >>
> >> >>
> >> >>> Also, out of curiosity, have you tried running ostest on your board?
> >> >> I just tried.
> >> >> It passed all the tests.
> >> >>
> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <sebast...@lorquet.fr>
> >> >> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >> >>> have all been working reliably for months without stopping, we know it
> >> >>> because they critically depend on network functionality and we have
> >> >>> reports if a card becomes unreachable. None has so far outside of
> >> >>> dedicated tests.
> >> >>>
> >> >>> So I believe that there is no obvious hard bug in these drivers.
> >> >>>
> >> >>> Most certainly a build option on your particular config. debug is a
> >> >>> possible issue, thread problems is another possibility.
> >> >>>
> >> >>> Sebastien
> >> >>>
> >> >>>
> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >> >>>> Hello!
> >> >>>>
> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
> >> >>>> issues.
> >> >>>>
> >> >>>> Initially the device works correctly. After some hours of continuous
> >> >>>> operation I completely lose all network communications.
> >> >>>> Trying to troubleshoot the issue, I enabled assertions and various
> >> >>>> other debug features.
> >> >>>>
> >> >>>> Again the device works correctly for some hours, and then I get a
> >> >>>> failed assertion at stm32_eth.c, line 1372:
> >> >>>>
> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >> >>>>
> >> >>>> No other errors are reported (e.g. stack overflows etc).
> >> >>>>
> >> >>>>
> >> >>>> I have observed that this issue usually manifests itself when there
> >> >>>> is insufficient stack on a task.
> >> >>>> But in my case, all tasks have oversized stacks. Typically they do
> >> >>>> not exceed 50% utilization.
> >> >>>> I have plenty of room available in the heap too (> 100kB).
> >> >>>>
> >> >>>> Regarding the rest of the firmware, I cannot see any other
> >> >>>> misbehaviour or problem.
> >> >>>> I haven't ever seen any other unexplained problem, assertion fail,
> >> >>>> hard-fault etc.
> >> >>>> The application code passes all of our tests.
> >> >>>> In fact, even when this issue happens, although I lose network
> >> >>>> connectivity, the rest of the system works perfectly.
> >> >>>>
> >> >>>> Please note that I have checked the contents of dev->d_len and
> >> >>>> dev->d_buf, and they seem to contain valid data.
> >> >>>> The address lies within the normal address space of the MCU, and the
> >> >>>> size is sane.
> >> >>>> So it doesn't look like any kind of memory corruption.
> >> >>>>
> >> >>>>
> >> >>>> At this point I believe that this is an actual bug either on the
> >> >>>> STM32 MAC driver, or at the TCP/IP stack itself.
> >> >>>> I had a look at the driver code, but I didn't see anything
> >> >>>> suspicious.
> >> >>>>
> >> >>>>
> >> >>>> Has anyone observed the same issue before?
> >> >>>> Can it be affected in any way with my configuration?
> >> >>>> Or maybe, do you have any recommendations on what to test next?
> >> >>>>
> >> >>>>
> >> >>>> Thank you!
> >> >>>>
> >>
> >
>
