> Date: Wed, 6 May 2020 09:53:56 +0200
> From: Martin Pieuchot <[email protected]>
> Cc: [email protected]
>
> On 02/05/20(Sat) 16:02, Mark Kettenis wrote:
> > > Date: Sat, 2 May 2020 11:33:17 +0200
> > > From: Martin Pieuchot <[email protected]>
> > > [...]
> > > Do we see that the issue is caused by the order in which descriptors are
> > > closed in fdfree()? The current deadlock occurs because the duped master
> > > has a higher fd number than the slave which means it is still open when
> > > the
> > > slave is closed.
> >
> > I'm sure we could construct an example where the file descriptors are
> > in a different oder. So changing the order is not going to help.
>
> Obviously :)
>
> > > But why would that be a problem? By default *close() functions,
> > > including ttylclose() are blocking. So any exiting process might end up
> > > hanging in fdfree(). Diff below illustrates that by forcing all *close()
> > > during exit1() to be non-blocking, it also fix the issue.
> >
> > I very much fear that is going to have unintended side-effects with
> > output not being flushed properly. And the process could still
> > deadlock itself by using close(2) directly isn't it?
>
> Indeed.
>
> > > Does it make sense to close fds as non-blocking when existing? What
> > > should a dying thread wait for? What can be the cons of such approach?
> > >
> > > Now regarding your fix, why does it make sense to wait 5sec instead of
> > > indefinitely? Did you look at r1.263 of NetBSD's kern/tty.c? If we go
> > > with this change could you please change the 'timo' suffix and variables
> > > to 'nsec' and use uint64_t instead of int?
> >
> > r1.263 was reverted in r1.264. Then r1.265 is the commit quoted by anton@.
> > There is also r2.267 which adds an additional fix to r1.265.
> >
> > In ttywait(), NetBSD only calls ttyflush() if there is a timeout. That
> > makes sense, because we have ttywflush() to combine the wait and flush
> > so ttywait() shouldn't flush when there is no error.
>
> Updated diff below reflecting those changes. I'm still questioning the
> 5sec timeout, but it is without doubt an improvement over the current
> behavior.
>
> The previously mentioned test as well as a modified version closing the
> slave before exit(2) now hang for 5 seconds instead of deadlocking
> indefinitely.
>
> I believe we want that for release, ok?
I'm not sure. I can't really judge the impact of this on serial ports
for example. I think it may take some time for any potential problems
to surface. And I think this bug has been with us for many years.
My understanding is that the impact of the bug is rather limited.
Users can DOS themselves with this artificially constructed testcase,
but there are many other ways to do this. I would wait with this
until after the release.
> Index: kern/tty.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/tty.c,v
> retrieving revision 1.154
> diff -u -p -r1.154 tty.c
> --- kern/tty.c 7 Apr 2020 13:27:51 -0000 1.154
> +++ kern/tty.c 6 May 2020 07:44:53 -0000
> @@ -80,6 +80,8 @@ void filt_ttyrdetach(struct knote *kn);
> int filt_ttywrite(struct knote *kn, long hint);
> void filt_ttywdetach(struct knote *kn);
> void ttystats_init(struct itty **, size_t *);
> +int ttywait_nsec(struct tty *tp, uint64_t nsecs);
> +int ttysleep_nsec(struct tty *, void *, int, char *, uint64_t);
>
> /* Symbolic sleep message strings. */
> char ttclos[] = "ttycls";
> @@ -1202,10 +1204,10 @@ ttnread(struct tty *tp)
> }
>
> /*
> - * Wait for output to drain.
> + * Wait for output to drain, or if this times out, flush it.
> */
> int
> -ttywait(struct tty *tp)
> +ttywait_nsec(struct tty *tp, uint64_t nsecs)
> {
> int error, s;
>
> @@ -1219,7 +1221,10 @@ ttywait(struct tty *tp)
> (ISSET(tp->t_state, TS_CARR_ON) || ISSET(tp->t_cflag,
> CLOCAL))
> && tp->t_oproc) {
> SET(tp->t_state, TS_ASLEEP);
> - error = ttysleep(tp, &tp->t_outq, TTOPRI | PCATCH,
> ttyout);
> + error = ttysleep_nsec(tp, &tp->t_outq, TTOPRI | PCATCH,
> + ttyout, nsecs);
> + if (error == EWOULDBLOCK)
> + ttyflush(tp, FWRITE);
> if (error)
> break;
> } else
> @@ -1229,6 +1234,12 @@ ttywait(struct tty *tp)
> return (error);
> }
>
> +int
> +ttywait(struct tty *tp)
> +{
> + return (ttywait_nsec(tp, INFSLP));
> +}
> +
> /*
> * Flush if successfully wait.
> */
> @@ -1237,7 +1248,8 @@ ttywflush(struct tty *tp)
> {
> int error;
>
> - if ((error = ttywait(tp)) == 0)
> + error = ttywait_nsec(tp, SEC_TO_NSEC(5));
> + if (error == 0 || error == EWOULDBLOCK)
> ttyflush(tp, FREAD);
> return (error);
> }
> @@ -2281,11 +2293,18 @@ tputchar(int c, struct tty *tp)
> int
> ttysleep(struct tty *tp, void *chan, int pri, char *wmesg)
> {
> +
Please don't add these silly NetBSD-style empty lines at the start of
functions without local variables.
> + return (ttysleep_nsec(tp, chan, pri, wmesg, INFSLP));
> +}
> +
> +int
> +ttysleep_nsec(struct tty *tp, void *chan, int pri, char *wmesg, uint64_t
> nsecs)
> +{
> int error;
> short gen;
>
> gen = tp->t_gen;
> - if ((error = tsleep_nsec(chan, pri, wmesg, INFSLP)) != 0)
> + if ((error = tsleep_nsec(chan, pri, wmesg, nsecs)) != 0)
> return (error);
> return (tp->t_gen == gen ? 0 : ERESTART);
> }
>