> -----Original Message-----
> From: Paul Thomas [mailto:[email protected]]
> Sent: Tuesday, March 12, 2019 10:50 AM
> To: Keller, Jacob E <[email protected]>
> Cc: [email protected]
> Subject: Re: [Linuxptp-devel] strangeness
>
> On Tue, Mar 12, 2019 at 1:11 PM Keller, Jacob E
> <[email protected]> wrote:
> >
> > > -----Original Message-----
> > > From: Paul Thomas [mailto:[email protected]]
> > > Sent: Tuesday, March 12, 2019 10:00 AM
> > > To: Keller, Jacob E <[email protected]>
> > > Subject: Re: [Linuxptp-devel] strangeness
> > >
> > > On Tue, Mar 12, 2019 at 12:26 PM Keller, Jacob E
> > > <[email protected]> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: [email protected]
> > > > > [mailto:[email protected]]
> On
> > > > > Behalf Of Harini Katakam
> > > > > Sent: Tuesday, March 12, 2019 3:10 AM
> > > > > To: Paul Thomas <[email protected]>
> > > > > Cc: [email protected]; [email protected]
> > > > > Subject: Re: [Linuxptp-devel] strangeness
> > > > >
> > > > > Hi Paul,
> > > > > On Tue, Mar 12, 2019 at 8:26 AM Paul Thomas <[email protected]>
> > > wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Let me do a quick clean recap of this issue.
> > > > > >
> > > > > > On a Debian arm64 system with a 5.0rc8 kernel using the macb driver
> > > > > > on
> > > > > > zynqmp, enabling tx timestamping (1) breaks networking! The first
> > > > > > and
> > > > > > most noticeable way is that you can no longer connect with ssh. This
> > > > > > is a serious bug somewhere and merits some attention.
> > > > > >
> > > > > > Trying to debug ssh is a possibility, but I was trying to debug with
> > > > > > something easier and thus the netcat testing. The specific issue can
> > > > > > be seen in the following strace. In this setup nc just connects to a
> > > > > > server and tries to send two packets (2). The first packet goes
> > > > > > through fine, but the second doesn't because nc is stuck forever
> > > > > > trying to read from the socket.
> > > > > > pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [0]) <-- waiting
> > > > > > on
> > > > > > stdin and UDP sock
> > > > > > read(0, "c1\n", 8192) = 3 <-- read three chars from stdin
> > > > > > write(3, "c1\n", 3) = 3 <-- write those out on the UDP sock
> > > > > > pselect6(4, [0 3], NULL, NULL, NULL, NULL) = 1 (in [3]) <-- waiting
> > > > > > on stdin and UDP sock
> > > > > > read(3, <-- waits forever here as there is no data to read
> > > > > >
> > > > > > I've been reading more, an old patch and the timestamping.txt doc
> > > > > > helped me understand a little more of what's going on:
> > > > > > https://lore.kernel.org/netdev/20130328211925.7644.15781.stgit@jekeller-
> > > > > hub.jf.intel.com/
> > > > > > https://www.kernel.org/doc/Documentation/networking/timestamping.txt
> > > > > >
> > > > > > So it is clear that if the SO_SELECT_ERR_QUEUE flag is set then in
> > > > > > fact the select should return, but it is not set in this case. I can
> > > > > > see everything that is going on in datagram_poll() in datagram.c.
> > > > > > The
> > > > > > main difference being that in the broken case the mask is 0x30c and
> > > > > > in
> > > > > > the working case it is 0x304. The difference is EPOLLERR, which is
> > > > > > there clearly in the code if !skb_queue_empty(&sk->sk_error_queue).
> > > > > >
> > > > > > Then in select.c POLLIN_SET includes EPOLLERR. It almost looks as if
> > > > > > it's behaving as it should (except that things break). My first
> > > > > > question is should the sk_error_queue be empty if there is a tx
> > > > > > timestamp available (in datagram_poll() in datagram.c)? If it's not
> > > > > > empty I don't see what else SO_SELECT_ERR_QUEUE flag is doing for
> > > > > > the
> > > > > > select() and I don't see what would be different about the
> > > > > > macb/arm64
> > > > > > setup?
> > > > >
> > > > > Thanks for the summary.
> > > > > I think sk_error_queue should be empty because packets are queued to
> > > > > that via skb_complete_timestamp (sock_queue_err_skb) and this should
> > > > > not be called in this flow. I'm sorry if I'm missing something - I'll
> > > > > let others
> > > > > from netdev comment.
> > > > > I'm not sure why EPOLLERR in being set in this case.
> > > > >
> > > >
> > > > I believe at least historically that the Tx timestamps were notified to
> > > > the
> > > applications using the error queue.
> > > The documentation (Documentation/networking/timestamping.txt) says
> > > "recvmsg() with flag MSG_ERRQUEUE", so it's returned to userspace
> > > through the error queue, but at datagram_poll() in datagram.c:
> > > /* exceptional events? */
> > > if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
> > >
> > > Should this already be non-empty and EPOLLERR be returned as part of
> > > mask? Or should _only_ EPOLLPRI be returned when SOCK_SELECT_ERR_QUEUE
> > > is true? Surly someone on this list knows what should be happening? It
> > > will help direct where we are looking. Should we be looking at the
> > > queues? Or should be looking at do_select() in fs/select.c or
> > > somewhere else?
> > >
> >
>
> HI Jake,
> Thanks for looking at this.
>
> > It's been quite some time since I looked at this. Hmm. Can you reproduce
> > this with
> another board? I haven't seen the issue before.
> >
> > I remember adding SO_SELECT_ERR_QUEUE as an option, and it made the socket
> wake up when a Tx timestamp occurred. I believe this was eventually fixed in
> a more
> direct way, and now the socket option doesn't really do much.
> >
> > See Linux commit 7d4c04fc1700 ("net: add option to enable error queue
> > packets
> waking select", 2013-03-31)
> That looks like your netdev patch, and the text makes sense.
> >
> > Basically, the option now just adds POLLPRI along with POLLERR when the Tx
> timestamp is ready, and the socket will wake up when there is a Timestamp.
> >
> > This was eventually fixed so that you could select on just POLLERR, and
> > have it
> actually wake up on the error messages, and the option was no longer
> necessary.
> Ah, so is selecting on POLLERR a seperate flag?
>
> >
> > That's what I recall about SO_SELECT_ERR_QUEUE at least.
> >
> > So.. to clarify the behavior you're seeing is that the socket wakes up on
> > pselect, and
> then you go to read it, but it never returns because there's no data
> available? And the
> socket woke up because there was an error, but the error queue is also empty?
> The tx timestams are enabled globally across the system with
> hwstamp_ctl, there is just one option on or off. The issue seems to be
> with applications that are not expecting select to return on POLLERR.
> This certainly the case with netcat, and I suspect it is a similar
> thing that breaks ssh. I haven't checked if there is actually a tx
> timestamp in the error queue for the nc case (probably is). But maybe
> this points us in the right direction for a specific socket if the
> SOF_TIMESTAMPING_TX_HARDWARE option hasn't been requested should the
> timestamps be pushed to the error queue in the first place?
>
> thanks,
> Paul
Hmm. I haven't been able to reproduce this using other device drivers, so I
wonder if the specific device driver is doing something weird and indicating
that it can Tx timestamp every packet, in its hard_xmit routine... But it
should only be doing that for packets which are requested to timestamp from the
socket.
Thanks,
Jake
>
> >
> > Thanks,
> > Jake
> >
> > > thanks,
> > > Paul
> > >
> > > > Thanks,
> > > > Jake
_______________________________________________
Linuxptp-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/linuxptp-devel