On Tue, Mar 12, 2019 at 3:22 PM Keller, Jacob E <jacob.e.kel...@intel.com> wrote: > > > > > -----Original Message----- > > From: Paul Thomas [mailto:pthomas8...@gmail.com] > > Sent: Tuesday, March 12, 2019 10:50 AM > > To: Keller, Jacob E <jacob.e.kel...@intel.com> > > Cc: linuxptp-devel@lists.sourceforge.net > > Subject: Re: [Linuxptp-devel] strangeness > > > > On Tue, Mar 12, 2019 at 1:11 PM Keller, Jacob E > > <jacob.e.kel...@intel.com> wrote: > > > > > > > -----Original Message----- > > > > From: Paul Thomas [mailto:pthomas8...@gmail.com] > > > > Sent: Tuesday, March 12, 2019 10:00 AM > > > > To: Keller, Jacob E <jacob.e.kel...@intel.com> > > > > Subject: Re: [Linuxptp-devel] strangeness > > > > > > > > On Tue, Mar 12, 2019 at 12:26 PM Keller, Jacob E > > > > <jacob.e.kel...@intel.com> wrote: > > > > > > > > > > > -----Original Message----- > > > > > > From: netdev-ow...@vger.kernel.org > > > > > > [mailto:netdev-ow...@vger.kernel.org] > > On > > > > > > Behalf Of Harini Katakam > > > > > > Sent: Tuesday, March 12, 2019 3:10 AM > > > > > > To: Paul Thomas <pthomas8...@gmail.com> > > > > > > Cc: linuxptp-devel@lists.sourceforge.net; net...@vger.kernel.org > > > > > > Subject: Re: [Linuxptp-devel] strangeness > > > > > > > > > > > > Hi Paul, > > > > > > On Tue, Mar 12, 2019 at 8:26 AM Paul Thomas <pthomas8...@gmail.com> > > > > wrote: > > > > > > > > > > > > > > Hi All, > > > > > > > > > > > > > > Let me do a quick clean recap of this issue. > > > > > > > > > > > > > > On a Debian arm64 system with a 5.0rc8 kernel using the macb > > > > > > > driver on > > > > > > > zynqmp, enabling tx timestamping (1) breaks networking! The first > > > > > > > and > > > > > > > most noticeable way is that you can no longer connect with ssh. > > > > > > > This > > > > > > > is a serious bug somewhere and merits some attention. > > > > > > > > > > > > > > Trying to debug ssh is a possibility, but I was trying to debug > > > > > > > with > > > > > > > something easier and thus the netcat testing. The specific issue > > > > > > > can > > > > > > > be seen in the following strace. In this setup nc just connects > > > > > > > to a > > > > > > > server and tries to send two packets (2). The first packet goes > > > > > > > through fine, but the second doesn't because nc is stuck forever > > > > > > > trying to read from the socket. > > > > > > > pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0]) <-- > > > > > > > waiting on > > > > > > > stdin and UDP sock > > > > > > > read(0, "c1\n", 8192) = 3 <-- read three chars from stdin > > > > > > > write(3, "c1\n", 3) = 3 <-- write those out on the UDP sock > > > > > > > pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3]) <-- > > > > > > > waiting > > > > > > > on stdin and UDP sock > > > > > > > read(3, <-- waits forever here as there is no data to read > > > > > > > > > > > > > > I've been reading more, an old patch and the timestamping.txt doc > > > > > > > helped me understand a little more of what's going on: > > > > > > > https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller- > > > > > > hub.jf.intel.com/ > > > > > > > https://www.kernel.org/doc/Documentation/networking/timestamping.txt > > > > > > > > > > > > > > So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then in > > > > > > > fact the select should return, but it is not set in this case. I > > > > > > > can > > > > > > > see everything that is going on in datagram_poll() in datagram.c. > > > > > > > The > > > > > > > main difference being that in the broken case the mask is 0x30c > > > > > > > and in > > > > > > > the working case it is 0x304. The difference is EPOLLERR, which is > > > > > > > there clearly in the code if > > > > > > > !skb_queue_empty(&sk->sk_error_queue). > > > > > > > > > > > > > > Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as > > > > > > > if > > > > > > > it's behaving as it should (except that things break). My first > > > > > > > question is should the sk_error_queue be empty if there is a tx > > > > > > > timestamp available (in datagram_poll() in datagram.c)? If it's > > > > > > > not > > > > > > > empty I don't see what else SO_SELECT_ERR_QUEUE flag is doing for > > > > > > > the > > > > > > > select() and I don't see what would be different about the > > > > > > > macb/arm64 > > > > > > > setup? > > > > > > > > > > > > Thanks for the summary. > > > > > > I think sk_error_queue should be empty because packets are queued to > > > > > > that via skb_complete_timestamp (sock_queue_err_skb) and this should > > > > > > not be called in this flow. I'm sorry if I'm missing something - > > > > > > I'll let others > > > > > > from netdev comment. > > > > > > I'm not sure why EPOLLERR in being set in this case. > > > > > > > > > > > > > > > > I believe at least historically that the Tx timestamps were notified > > > > > to the > > > > applications using the error queue. > > > > The documentation (Documentation/networking/timestamping.txt) says > > > > "recvmsg() with flag MSG_ERRQUEUE", so it's returned to userspace > > > > through the error queue, but at datagram_poll() in datagram.c: > > > > /* exceptional events? */ > > > > if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue)) > > > > > > > > Should this already be non-empty and EPOLLERR be returned as part of > > > > mask? Or should _only_ EPOLLPRI be returned when SOCK_SELECT_ERR_QUEUE > > > > is true? Surly someone on this list knows what should be happening? It > > > > will help direct where we are looking. Should we be looking at the > > > > queues? Or should be looking at do_select() in fs/select.c or > > > > somewhere else? > > > > > > > > > > > HI Jake, > > Thanks for looking at this. > > > > > It's been quite some time since I looked at this. Hmm. Can you reproduce > > > this with > > another board? I haven't seen the issue before. > > > > > > I remember adding SO_SELECT_ERR_QUEUE as an option, and it made the socket > > wake up when a Tx timestamp occurred. I believe this was eventually fixed > > in a more > > direct way, and now the socket option doesn't really do much. > > > > > > See Linux commit 7d4c04fc1700 ("net: add option to enable error queue > > > packets > > waking select", 2013-03-31) > > That looks like your netdev patch, and the text makes sense. > > > > > > Basically, the option now just adds POLLPRI along with POLLERR when the Tx > > timestamp is ready, and the socket will wake up when there is a Timestamp. > > > > > > This was eventually fixed so that you could select on just POLLERR, and > > > have it > > actually wake up on the error messages, and the option was no longer > > necessary. > > Ah, so is selecting on POLLERR a seperate flag? > > > > > > > > That's what I recall about SO_SELECT_ERR_QUEUE at least. > > > > > > So.. to clarify the behavior you're seeing is that the socket wakes up on > > > pselect, and > > then you go to read it, but it never returns because there's no data > > available? And the > > socket woke up because there was an error, but the error queue is also > > empty? > > The tx timestams are enabled globally across the system with > > hwstamp_ctl, there is just one option on or off. The issue seems to be > > with applications that are not expecting select to return on POLLERR. > > This certainly the case with netcat, and I suspect it is a similar > > thing that breaks ssh. I haven't checked if there is actually a tx > > timestamp in the error queue for the nc case (probably is). But maybe > > this points us in the right direction for a specific socket if the > > SOF_TIMESTAMPING_TX_HARDWARE option hasn't been requested should the > > timestamps be pushed to the error queue in the first place? > > > > thanks, > > Paul > > Hmm. I haven't been able to reproduce this using other device drivers, so I > wonder if the specific device driver is doing something weird and indicating > that it can Tx timestamp every packet, in its hard_xmit routine... But it > should only be doing that for packets which are requested to timestamp from > the socket.
Yes! I think this could be it. I just submitted this patch so that we can talk about it: https://lore.kernel.org/netdev/20190312195053.21741-1-pthomas8...@gmail.com/ This fixes the issue I'm having with nc and ssh. -Paul > > Thanks, > Jake > > > > > > > > > Thanks, > > > Jake > > > > > > > thanks, > > > > Paul > > > > > > > > > Thanks, > > > > > Jake _______________________________________________ Linuxptp-devel mailing list Linuxptp-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linuxptp-devel