Hi Paul,

On Tue, Mar 12, 2019 at 8:26 AM Paul Thomas <pthomas8...@gmail.com> wrote:
>
> Hi All,
>
> Let me do a quick clean recap of this issue.
>
> On a Debian arm64 system with a 5.0rc8 kernel using the macb driver on
> zynqmp, enabling tx timestamping (1) breaks networking! The first and
> most noticeable way is that you can no longer connect with ssh. This
> is a serious bug somewhere and merits some attention.
>
> Trying to debug ssh is a possibility, but I was trying to debug with
> something easier and thus the netcat testing. The specific issue can
> be seen in the following strace. In this setup nc just connects to a
> server and tries to send two packets (2). The first packet goes
> through fine, but the second doesn't because nc is stuck forever
> trying to read from the socket.
>
> pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0]) <-- waiting on stdin and UDP sock
> read(0, "c1\n", 8192) = 3 <-- read three chars from stdin
> write(3, "c1\n", 3) = 3 <-- write those out on the UDP sock
> pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3]) <-- waiting on stdin and UDP sock
> read(3, <-- waits forever here as there is no data to read
>
> I've been reading more, an old patch and the timestamping.txt doc
> helped me understand a little more of what's going on:
> https://lore.kernel.org/netdev/20130328211925.7644.15781.st...@jekeller-hub.jf.intel.com/
> https://www.kernel.org/doc/Documentation/networking/timestamping.txt
>
> So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then in
> fact the select should return, but it is not set in this case. I can
> see everything that is going on in datagram_poll() in datagram.c. The
> main difference being that in the broken case the mask is 0x30c and in
> the working case it is 0x304. The difference is EPOLLERR, which is
> there clearly in the code if !skb_queue_empty(&sk->sk_error_queue).
>
> Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
> it's behaving as it should (except that things break). My first
> question is should the sk_error_queue be empty if there is a tx
> timestamp available (in datagram_poll() in datagram.c)? If it's not
> empty I don't see what else SO_SELECT_ERR_QUEUE flag is doing for the
> select() and I don't see what would be different about the macb/arm64
> setup?
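
For reference, the two poll masks quoted above can be decoded bit by bit. The following is only a rough sketch: the bit values are copied from the Linux UAPI header linux/eventpoll.h and POLLIN_SET mirrors the definition in fs/select.c, while the helper names decode_mask() and select_reports_readable() are made up for illustration and are not kernel functions.

```python
# Poll event bits as defined in the Linux UAPI header <linux/eventpoll.h>.
EPOLL_BITS = {
    "EPOLLIN": 0x001,
    "EPOLLPRI": 0x002,
    "EPOLLOUT": 0x004,
    "EPOLLERR": 0x008,
    "EPOLLHUP": 0x010,
    "EPOLLRDNORM": 0x040,
    "EPOLLRDBAND": 0x080,
    "EPOLLWRNORM": 0x100,
    "EPOLLWRBAND": 0x200,
}

# Mirrors POLLIN_SET in fs/select.c:
# EPOLLRDNORM | EPOLLRDBAND | EPOLLIN | EPOLLHUP | EPOLLERR
POLLIN_SET = 0x040 | 0x080 | 0x001 | 0x010 | 0x008


def decode_mask(mask):
    """Return the names of the poll bits set in mask (hypothetical helper)."""
    return [name for name, bit in EPOLL_BITS.items() if mask & bit]


def select_reports_readable(mask):
    """True if select() would flag the fd in readfds for this poll mask."""
    return bool(mask & POLLIN_SET)


for label, mask in (("working", 0x304), ("broken", 0x30C)):
    print(label, hex(mask), decode_mask(mask), select_reports_readable(mask))
```

The only bit that differs between the two masks is 0x008 (EPOLLERR), and because POLLIN_SET contains EPOLLERR, that single bit is enough for pselect6() to report the socket as readable even though no datagram is waiting.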

Thanks for the summary. I think sk_error_queue should be empty, because
packets are queued to it via skb_complete_timestamp (sock_queue_err_skb),
and this should not be called in this flow. I'm sorry if I'm missing
something - I'll let others from netdev comment. I'm not sure why
EPOLLERR is being set in this case.

Regards,
Harini

>
> Any insight here would be very much appreciated.
>
> thanks,
> Paul
>
> (1) hwstamp_ctl -i eth0 -t 1
>
> (2) The actual script to be able to run nc and strace from a single
> serial console is slightly clever:
> (sleep 3; echo "c1"; sleep 1; echo "c2") | nc -u 10.1.155.100 9999 &
> strace -p $(ps -A | grep nc | awk '{print $1}')

_______________________________________________
Linuxptp-devel mailing list
Linuxptp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linuxptp-devel
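
One way to check from user space whether anything is actually sitting on a socket's error queue is to read it with MSG_ERRQUEUE, which is how timestamping.txt says tx timestamps are delivered. This is a minimal sketch for Linux only; drain_error_queue() is a hypothetical helper, and the buffer sizes are arbitrary assumptions.

```python
import socket

# Linux value of MSG_ERRQUEUE; the fallback is used only if Python does not export it.
MSG_ERRQUEUE = getattr(socket, "MSG_ERRQUEUE", 0x2000)


def drain_error_queue(sock, bufsize=2048):
    """Read and return every message pending on the socket error queue."""
    messages = []
    while True:
        try:
            data, ancdata, _flags, _addr = sock.recvmsg(
                bufsize, 1024, MSG_ERRQUEUE | socket.MSG_DONTWAIT
            )
        except BlockingIOError:
            # EAGAIN: the error queue is empty.
            return messages
        messages.append((data, ancdata))


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pending = drain_error_queue(sock)
    print(len(pending), "message(s) on the error queue")
    sock.close()
```

Running something like this against the nc socket's equivalent after the first send would show whether an skb really is queued on the error queue in the broken case.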