> -----Original Message-----
> From: Reto Gähwiler <gret.hexa...@gmail.com>
> Sent: Tuesday, June 30, 2020 4:58 PM
> To: dev@nuttx.apache.org
> Subject: close() socket called in second thread combined with reconnect kills 
> eth (stm32h743zi)
> 
> Hello Everyone,
> 
> I am facing the following problem working with nuttx and ethernet 
> connections. A TCP socket is set up as blocking and connected to
> the server.
> The connection is handled in one thread, which blocks in the recv call and 
> processes the data when some arrives. In case of an error the
> connection is closed.
> Now, if close() is called on that particular TCP connection from a 
> different thread, it terminates the connection and the recv() fails
> and breaks free.

It is unsafe to call close() while another thread is blocked in recv(). This is 
safe on most POSIX operating systems, but not on NuttX, because the NuttX 
close() always releases all resources associated with the socket immediately, 
regardless of whether another thread is still blocked on it.
Note: this applies not only to sockets; ordinary file handles are unsafe in 
this case too. On the other hand, it is safe to call the other APIs (except 
close()) concurrently from different threads.

> If we now connect to a new IP, it first seems to be fine but shortly after 
> the whole network disappears. No more icmp responses
> (therefore no ping) and all other opened connections in different threads are 
> not reachable anymore. Besides, any of the still opened
> connections starts to consume all cpu time. Looking into it with the debugger 
> attached it can be seen, that in the
> net/devif/devif_callback.c the for-loop looking for the callback in the 
> device event list is cycling without an end.
> 
> Looking at wireshark while data is transmitted from my client to the server 
> it looks as follows around the termination. So basically
> before we reconnect and fail.
> 
> No. Time Source Destination Protocol Length Src.MacAddress Info
> > 43178 0.000451 195.65.177.171 10.62.64.110 TCP 75 Fortinet_09:00:06 29500 → 1026 [PSH, ACK] Seq=30001 Ack=759475 Win=1758 Len=21
> > 43179 0.000102 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 1026 → 29500 [ACK] Seq=759475 Ack=30022 Win=5954 Len=0
> > 43182 0.001144 10.62.64.110 195.65.177.171 TCP 586 xxxx_0c:70:04 1026 → 29500 [PSH, ACK] Seq=759475 Ack=30022 Win=6150 Len=532
> > 43183 0.000437 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 [TCP Out-Of-Order] 1026 → 29500 [FIN, ACK] Seq=759475 Ack=30022 Win=6150 Len=0
> > 43184 0.000049 195.65.177.171 10.62.64.110 TCP 75 Fortinet_09:00:06 29500 → 1026 [PSH, ACK] Seq=30022 Ack=760007 Win=1758 Len=21
> > 43185 0.000090 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 1026 → 29500 [RST, ACK] Seq=760007 Ack=30043 Win=1758 Len=0
> > 43186 0.000012 195.65.177.171 10.62.64.110 TCP 60 Fortinet_09:00:06 [TCP Dup ACK 43184#1] 29500 → 1026 [ACK] Seq=30043 Ack=760007 Win=1758 Len=0
> > 43187 0.000096 10.62.64.110 195.65.177.171 TCP 60 xxxx_0c:70:04 1026 → 29500 [RST, ACK] Seq=760007 Ack=30043 Win=1758 Len=0
> >
> 
> As can be seen, the client's (the device I am working on) sequence number is 
> not synchronised after the last data transmit
> (seq=759475, len=532 -->
> nextseq=760007): the FIN,ACK also sent by the device uses seq=759475 as 
> well. Therefore, it looks like closing a connection this way
> is not thread safe!
> In case of an idle connection the sequence numbers would look just fine but 
> the next connection will trigger the same error.
> 
> I then also tried to use shutdown(), calling it from the thread I 
> used to call close(), but shutdown.c is just a dummy API, as
> already noticed by seanshpark
> <https://nuttx.yahoogroups.narkive.com/YjaUuARV/socket-shutdown> 5 years ago.
> 
> The platform the code is executed on is based on a stm32h743zi. Since things 
> seem to happen in the libraries it could affect other
> platforms as well.
> 
> I was wondering if anyone else ran into the issue of calling close on a 
> socket from a different thread than the one handling recv/send,
> with the following connection then killing the entire ethernet? Please let me 
> know if you know a fix for blocking sockets or whether it would be
> better to go with non-blocking and work with select/poll instead.
> 

There are two ways to fix this problem:
1. Implement a safe close() as other operating systems do:
   a. Increase the reference count at the entry point of each API
   b. Decrease the reference count at the exit point of each API (potentially 
releasing the socket resources here)
   c. Have close() wake up the other blocked threads instead of releasing the 
socket resources directly
2. Close the socket in the receiving thread only. You can either:
   a. Send a signal to break the blocking recv()
   b. Create a pipe and poll/select on both the socket and the pipe, then write 
data to the pipe to break the poll/select
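
A minimal sketch of workaround 2b, assuming pipe() and poll() are available in 
your configuration. The names recv_or_wake(), request_close() and 
WAKE_REQUESTED are made up for this illustration and are not NuttX APIs. The 
receiving thread waits on both the socket and the read end of a pipe; any other 
thread asks for a close by writing one byte to the pipe, and only the receiving 
thread ever calls close() on the socket.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical return code: another thread asked us to close the socket. */
#define WAKE_REQUESTED (-2)

/* Wait for data on sock or for a wake-up byte on wake_rd (read end of a pipe).
 * Returns the recv() result when the socket is ready (>0 data, 0 peer closed,
 * -1 error) or WAKE_REQUESTED when a close was requested.
 */
ssize_t recv_or_wake(int sock, int wake_rd, void *buf, size_t len)
{
  struct pollfd fds[2];

  fds[0].fd = sock;
  fds[0].events = POLLIN;
  fds[1].fd = wake_rd;
  fds[1].events = POLLIN;

  for (;;)
    {
      fds[0].revents = 0;
      fds[1].revents = 0;

      int ret = poll(fds, 2, -1);
      if (ret < 0)
        {
          if (errno == EINTR)
            {
              continue; /* Interrupted by a signal, just wait again */
            }

          return -1;
        }

      /* The close request takes priority over pending socket data */

      if (fds[1].revents & POLLIN)
        {
          uint8_t token;
          ssize_t n = read(wake_rd, &token, sizeof(token));
          return n < 0 ? -1 : WAKE_REQUESTED;
        }

      if (fds[0].revents & (POLLIN | POLLERR | POLLHUP))
        {
          return recv(sock, buf, len, 0);
        }
    }
}

/* Called from any other thread instead of close(sock). */
int request_close(int wake_wr)
{
  uint8_t token = 1;
  return write(wake_wr, &token, sizeof(token)) == 1 ? 0 : -1;
}
```

The receiving thread would call recv_or_wake() in its loop and call 
close(sock) itself as soon as the result is WAKE_REQUESTED or not positive, so 
the socket is never released while a thread is still blocked on it.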
Of course, item 1 is the better choice; item 2 is just a workaround.

> Thanks for your input and help,
> best regards, Reto
