Hi Fotis,

Yes, I understood the point. Because it needs the right timing, it
could be tricky to reproduce.

Did you try creating a simple host server to emulate this
connection issue?
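
Something along these lines might be a starting point. It is only a rough
sketch for a Linux/macOS host using plain POSIX sockets, and the function
names are made up for illustration. It does not drop any packets (that
would need a packet filter on the host); it only keeps quiet and delays the
server-side close, so the FIN for the old connection can show up after the
board has already reconnected.

```c
/* Host-side test server sketch (POSIX sockets, meant for a PC, not the
 * target). All function names are hypothetical. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a listening TCP socket on the given port (0 lets the OS pick one).
 * Returns the descriptor, or -1 on error. */
int open_listener(uint16_t port)
{
  struct sockaddr_in addr;
  int on = 1;
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  if (fd < 0)
    {
      return -1;
    }

  setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(port);

  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
      listen(fd, 4) < 0)
    {
      close(fd);
      return -1;
    }

  return fd;
}

/* Report the port actually bound (useful when port 0 was requested). */
int listener_port(int fd)
{
  struct sockaddr_in addr;
  socklen_t len = sizeof(addr);

  if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
    {
      return -1;
    }

  return ntohs(addr.sin_port);
}

/* Accept one client, send nothing for delay_ms, then close the connection.
 * While this call is waiting, new connections from the same client are
 * still completed by the host kernel through the listen backlog, so the
 * client can reconnect before the old connection gets its FIN. */
int serve_one_silently(int listen_fd, unsigned int delay_ms)
{
  char buf[256];
  int cfd = accept(listen_fd, NULL, NULL);

  if (cfd < 0)
    {
      return -1;
    }

  /* poll() with no descriptors is used as a millisecond sleep; it may
   * return early if a signal arrives. */
  poll(NULL, 0, (int)delay_ms);

  /* Drain pending data without blocking, since closing a socket with
   * unread data typically results in a RST instead of a FIN. */
  fcntl(cfd, F_SETFL, O_NONBLOCK);
  while (recv(cfd, buf, sizeof(buf), 0) > 0)
    {
    }

  close(cfd);
  return 0;
}
```

On the host you would call open_listener() once with the port the board
connects to, and then serve_one_silently() in a loop with a delay longer
than the application timeout.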

BR,

Alan

On 8/12/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> I think I understand the nature of the bug.
>
> When closing a socket, tcp_close_eventhandler() is set as a callback in the
> dev->d_devcb list.
>
> Typically, the server's response (FIN ACK) causes tcp_callback() to be
> executed, and thus the callback is called with the proper arguments.
> Then the cb is properly free'd.
>
> If however devif_dev_event() has the chance to execute before
> tcp_callback() (e.g. server's response was lost), then the callbacks take
> NULL as a conn argument.
> This crashes the whole system horribly.
>
> As you see, this requires specific timing in the communication with the
> server, which is why it is so hard to reproduce.
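
To check that I read this correctly, here is a small standalone model of
that path. It is not NuttX code: the types, flag values and function names
are invented for illustration. It shows a device-wide event being delivered
with a NULL conn, and one possible way for a close handler to tolerate it by
falling back to the private pointer stored with the callback. Whether that
is the right fix for the real stack is a separate question.

```c
/* Simplified model of a device callback list. Everything here is
 * hypothetical and only mirrors the shape of the problem. */
#include <stddef.h>
#include <stdint.h>

#define EV_CLOSE       0x01u  /* Connection-level close event */
#define EV_NETDEV_DOWN 0x02u  /* Device-wide event, no connection known */

struct model_conn
{
  int closed;
};

typedef uint16_t (*model_event_t)(void *pvconn, void *pvpriv,
                                  uint16_t flags);

struct model_cb
{
  struct model_cb *next;
  model_event_t    event;
  void            *priv;  /* Pointer registered together with the callback */
};

/* Close handler that tolerates a NULL conn: it falls back to the private
 * pointer, and does nothing if both pointers are NULL. */
uint16_t model_close_handler(void *pvconn, void *pvpriv, uint16_t flags)
{
  struct model_conn *conn = (pvconn != NULL) ? pvconn : pvpriv;

  if (conn == NULL)
    {
      return flags;
    }

  if ((flags & (EV_CLOSE | EV_NETDEV_DOWN)) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}

/* Deliver an event to every callback in the list. A device-wide event
 * passes pvconn == NULL, as described for the interface-down case. */
uint16_t model_dev_event(struct model_cb *list, void *pvconn,
                         uint16_t flags)
{
  while (list != NULL)
    {
      /* Save the link first, the handler may release the entry. */
      struct model_cb *next = list->next;

      if (list->event != NULL)
        {
          flags = list->event(pvconn, list->priv, flags);
        }

      list = next;
    }

  return flags;
}
```

With a handler that dereferences pvconn unconditionally, the same call
sequence would hit the NULL pointer.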
>
>
> On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com>
> wrote:
>
>> Hi Alan,
>>
>> I am trying hard to reproduce the issue reliably, but I haven't been able
>> to do so yet.
>>
>> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
>> does not disappear, rather it changes form.
>> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
>>
>> I have to mention that everything in my system is commented out.
>> Currently the only thing working is the network thread that opens the TCP
>> connection, nothing else.
>> I have disabled all of my usage of the workers, all signals etc.
>> I verify that when the fault occurs, this thread is not interrupted by
>> anything (using Segger SystemView).
>> It looks like a scheduling issue is unlikely.
>>
>> I also increased the stacks more, and I added padding to the very few
>> malloc's that I use.
>>
>> ---
>>
>> At this moment I observe something very interesting.
>> I am calling netlib_ifdown(), which causes the attached stack trace.
>>
>> So:
>> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
>> explicitly to NULL.
>> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
>> 3. tcp_close_eventhandler() assumes that conn is NOT NULL, which causes
>> the crash.
>>
>> This is wrong, but I don't fully understand it yet.
>> Should there be a check for a NULL conn?
>> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
>> first place?
>> Or should tcp_close_eventhandler() be tolerant of a NULL conn argument?
>>
>> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis
>> <acas...@gmail.com>
>> wrote:
>>
>>> Hi Fotis,
>>>
>>> Are you in sync with mainline?
>>>
>>> If you can create a host application to induce the issue, it will be
>>> easier for us to test.
>>>
>>> BR,
>>>
>>> Alan
>>>
>>> On 8/9/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > still trying to make the network work reliably.
>>> > After fixing another issue of my application, I hit another problem.
>>> >
>>> > The following sequence causes NuttX to crash:
>>> >
>>> > 1. My application creates a TCP socket and communicates with a
>>> > server.
>>> > 2. At one point the server stops responding (unrelated to NuttX /
>>> > network issue).
>>> > 3. The application detects the timeout, and calls close() on the
>>> > socket.
>>> > 4. A new socket is created, and it is connected to the server.
>>> > 5. At this point, the server decides to send a FIN message for the
>>> > previous connection.
>>> > 6. I get a failed assertion in devif_callback.c at line 85.
>>> >
>>> > Note that I haven't managed to manually reproduce this issue.
>>> > No matter what I do manually, everything seems to be working
>>> > correctly.
>>> > I just have to wait for it to happen.
>>> > It seems that it is only triggered if a FIN arrives **after** a SYN.
>>> >
>>> > I am sure that this is only happening with
>>> > CONFIG_NET_TCP_WRITE_BUFFERS enabled.
>>> > I have no problems without buffering.
>>> >
>>> > The assertion seems right to fire.
>>> > When a FIN is received for a closed connection, the same callback is
>>> > free'd both by tcp_lost_connection() and later on by
>>> > tcp_close_eventhandler().
>>> > All these are happening within the same execution of tcp_input().
>>> >
>>> > Any ideas?
>>> >
>>> > On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet
>>> > <sebast...@lorquet.fr> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> good find but
>>> >>
>>> >> -I don't think any usual application tinkers with PHY regs during its
>>> >> lifetime except the ethernet monitor
>>> >>
>>> >> -the fix is certainly a lock somewhere but global or fine grained I
>>> >> don't know.
>>> >>
>>> >> Not all calls need to be locked, e.g. the one that returns the PHY
>>> >> address. Probably not needed by default, but a PHY access lock would
>>> >> prevent any issue you describe.
>>> >>
>>> >> I will wait for people with more expertise about this.
>>> >>
>>> >> Just a note, don't forget that not all PHYs have an interrupt; the one
>>> >> on the nucleo stm32h743zi[2] board does not have one.
>>> >>
>>> >> Sebastien
>>> >>
>>> >> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
>>> >> > Hello,
>>> >> >
>>> >> > I have eventually found 2 issues regarding networking in my
>>> >> > application.
>>> >> > I would like to discuss the first one.
>>> >> >
>>> >> >
>>> >> > My code contains something like this:
>>> >> >
>>> >> > int sd = socket(AF_INET, SOCK_DGRAM, 0);
>>> >> >
>>> >> > struct ifreq ifr;
>>> >> > memset(&ifr, 0, sizeof(struct ifreq));
>>> >> > strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
>>> >> > ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
>>> >> > ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
>>> >> > ifr.ifr_mii_val_out = 0;
>>> >> > ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
>>> >> >
>>> >> > // Do stuff with ifr.ifr_mii_val_out.
>>> >> >
>>> >> > close(sd);
>>> >> >
>>> >> > I realized that this type of ioctl will directly access the
>>> >> > hardware, without any locking.
>>> >> > That is, if any other task needs to use the PHY in any other way,
>>> >> > it will eventually corrupt its register data.
>>> >> >
>>> >> >
>>> >> > Two questions on this:
>>> >> > 1. Is there any good reason for this?
>>> >> > 2. What is the best way to fix it? Shall I add a driver level lock,
>>> >> > or should net_lock() be used in any higher layer?
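
For illustration, a driver-level lock could look roughly like the sketch
below. It is a host-side model, not the STM32 driver: it uses a pthread
mutex and a fake register file, and all names are hypothetical. The
two-step access (select the register, then move the data) is similar in
spirit to an MII address/data register pair, which is why an unlocked
second caller could end up with the wrong register.

```c
/* Model of serialized PHY register access. The register file and the
 * address latch are fakes standing in for real hardware. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_PHY_NREGS 32

static pthread_mutex_t g_phy_lock = PTHREAD_MUTEX_INITIALIZER;
static uint16_t g_fake_regs[FAKE_PHY_NREGS];
static uint8_t  g_fake_addr;  /* Shared latch: the state that needs the lock */

/* Read one PHY register with the lock held across both steps.
 * Returns 0 on success, -1 on bad arguments. */
int phy_read_locked(uint8_t reg, uint16_t *value)
{
  if (reg >= FAKE_PHY_NREGS || value == NULL)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  g_fake_addr = reg;                  /* Step 1: select the register */
  *value = g_fake_regs[g_fake_addr];  /* Step 2: read the data */
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}

/* Write one PHY register with the lock held across both steps.
 * Returns 0 on success, -1 on bad arguments. */
int phy_write_locked(uint8_t reg, uint16_t value)
{
  if (reg >= FAKE_PHY_NREGS)
    {
      return -1;
    }

  pthread_mutex_lock(&g_phy_lock);
  g_fake_addr = reg;                  /* Step 1: select the register */
  g_fake_regs[g_fake_addr] = value;   /* Step 2: write the data */
  pthread_mutex_unlock(&g_phy_lock);
  return 0;
}
```

A single lock like this is the coarse option; calls that only return
cached values, such as the PHY address, would not need to take it.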
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos
>>> >> > <f.j.pa...@gmail.com> wrote:
>>> >> >
>>> >> >> Hello,
>>> >> >>
>>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>>> >> >>> they have all been working reliably for months without stopping,
>>> >> >>> we know it because they critically depend on network
>>> >> >>> functionality and we have reports if a card becomes unreachable.
>>> >> >>> None has so far outside of dedicated tests.
>>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>>> >> >> Good to hear that!
>>> >> >> Although, I may be using a feature or protocol that you are not.
>>> >> >> Of course, I don't believe that NuttX is broken per se, but a
>>> >> >> minor bug may lurk somewhere...
>>> >> >>
>>> >> >>
>>> >> >>> I have seen that when I enable the network debugging features,
>>> >> >>> it seems to hit an assertion failure before getting to nsh
>>> >> >>> prompt at startup. This was on a quite recent master. I haven't
>>> >> >>> had a chance to diagnose this further.
>>> >> >>> Have you tried enabling these and if so, do they work?
>>> >> >> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and
>>> >> >> it works.
>>> >> >> I have some devices under test, waiting to reproduce the issue to
>>> >> >> see if this option provides any useful information.
>>> >> >>
>>> >> >>
>>> >> >>> Also, out of curiosity, have you tried running ostest on your
>>> >> >>> board?
>>> >> >> I just tried.
>>> >> >> It passed all the tests.
>>> >> >>
>>> >> >> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
>>> >> >> <sebast...@lorquet.fr> wrote:
>>> >> >>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We have deployed hundreds of boards with stm32f427 and ethernet,
>>> >> >>> they have all been working reliably for months without stopping,
>>> >> >>> we know it because they critically depend on network
>>> >> >>> functionality and we have reports if a card becomes unreachable.
>>> >> >>> None has so far outside of dedicated tests.
>>> >> >>>
>>> >> >>> So I believe that there is no obvious hard bug in these drivers.
>>> >> >>>
>>> >> >>> Most certainly a build option on your particular config. Debug is
>>> >> >>> a possible issue, thread problems are another possibility.
>>> >> >>>
>>> >> >>> Sebastien
>>> >> >>>
>>> >> >>>
>>> >> >>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
>>> >> >>>> Hello!
>>> >> >>>>
>>> >> >>>> I am using Ethernet on an STM32F427 target, but I am facing some
>>> >> >>>> issues.
>>> >> >>>>
>>> >> >>>> Initially the device works correctly. After some hours of
>>> >> >>>> continuous operation I completely lose all network
>>> >> >>>> communications.
>>> >> >>>> Trying to troubleshoot the issue, I enabled assertions and
>>> >> >>>> various other debug features.
>>> >> >>>>
>>> >> >>>> Again the device works correctly for some hours, and then I get
>>> >> >>>> a failed assertion at stm32_eth.c, line 1372:
>>> >> >>>>
>>> >> >>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
>>> >> >>>>
>>> >> >>>> No other errors are reported (e.g. stack overflows etc).
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> I have observed that this issue usually manifests itself when
>>> >> >>>> there is insufficient stack on a task.
>>> >> >>>> But in my case, all tasks have oversized stacks. Typically they
>>> >> >>>> do not exceed 50% utilization.
>>> >> >>>> I have plenty of room available in the heap too (> 100kB).
>>> >> >>>>
>>> >> >>>> Regarding the rest of the firmware, I cannot see any other
>>> >> >>>> misbehaviour or problem.
>>> >> >>>> I haven't ever seen any other unexplained problem, assertion
>>> >> >>>> fail, hard-fault etc.
>>> >> >>>> The application code passes all of our tests.
>>> >> >>>> In fact, even when this issue happens, although I lose network
>>> >> >>>> connectivity, the rest of the system works perfectly.
>>> >> >>>>
>>> >> >>>> Please note that I have checked the contents of dev->d_len and
>>> >> >>>> dev->d_buf, and they seem to contain valid data.
>>> >> >>>> The address lies within the normal address space of the MCU, and
>>> >> >>>> the size is sane.
>>> >> >>>> So it doesn't look like any kind of memory corruption.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> At this point I believe that this is an actual bug either in the
>>> >> >>>> STM32 MAC driver, or in the TCP/IP stack itself.
>>> >> >>>> I had a look at the driver code, but I didn't see anything
>>> >> >>>> suspicious.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Has anyone observed the same issue before?
>>> >> >>>> Can it be affected in any way by my configuration?
>>> >> >>>> Or maybe, do you have any recommendations on what to test next?
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Thank you!
>>> >> >>>>
>>> >>
>>> >
>>>
>>
>
