> Date: Sat, 2 May 2020 11:33:17 +0200
> From: Martin Pieuchot <[email protected]>
>
> On 02/05/20(Sat) 10:40, Anton Lindqvist wrote:
> > On Fri, May 01, 2020 at 05:17:36PM +0200, Martin Pieuchot wrote:
> > > On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> > > > The order in which the pty master/slave is closed seems to be the
> > > > trigger here. While not duping the master, it's closed before the slave.
> > > > In the opposite scenario, the slave is closed before the master. While
> > > > closing the slave, it ends up here expressed as a simplified backtrace:
> > > >
> > > > tsleep()
> > > > ttysleep()
> > > > ttywait()
> > > > ttywflush()
> > > > ttylclose()
> > > > ptsclose()
> > > > fdfree()
> > > > exit1()
> > > >
> > > > In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> > > > hang. Note that this is not the case when the master is closed before
> > > > the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.
> > >
> > > Why is the sleeper never awaken? Does that mean a ttwakeup() is missing?
> >
> > In this case, the process is single threaded, about to exit and the only
> > consumer of the pty. I don't see how it could be any other process
> > responsibility to perform the wakeup.
>
> Do we see that the issue is caused by the order in which descriptors are
> closed in fdfree()? The current deadlock occurs because the duped master
> has a higher fd number than the slave which means it is still open when the
> slave is closed.
I'm sure we could construct an example where the file descriptors are
in a different oder. So changing the order is not going to help.
> But why would that be a problem? By default *close() functions,
> including ttylclose() are blocking. So any exiting process might end up
> hanging in fdfree(). Diff below illustrates that by forcing all *close()
> during exit1() to be non-blocking, it also fix the issue.
I very much fear that is going to have unintended side-effects with
output not being flushed properly. And the process could still
deadlock itself by using close(2) directly isn't it?
> Does it make sense to close fds as non-blocking when existing? What
> should a dying thread wait for? What can be the cons of such approach?
>
> Now regarding your fix, why does it make sense to wait 5sec instead of
> indefinitely? Did you look at r1.263 of NetBSD's kern/tty.c? If we go
> with this change could you please change the 'timo' suffix and variables
> to 'nsec' and use uint64_t instead of int?
r1.263 was reverted in r1.264. Then r1.265 is the commit quoted by anton@.
There is also r2.267 which adds an additional fix to r1.265.
In ttywait(), NetBSD only calls ttyflush() if there is a timeout. That
makes sense, because we have ttywflush() to combine the wait and flush
so ttywait() shouldn't flush when there is no error.
> Index: kern/vfs_vnops.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/vfs_vnops.c,v
> retrieving revision 1.114
> diff -u -p -r1.114 vfs_vnops.c
> --- kern/vfs_vnops.c 8 Apr 2020 08:07:51 -0000 1.114
> +++ kern/vfs_vnops.c 2 May 2020 09:18:28 -0000
> @@ -601,6 +601,7 @@ vn_closefile(struct file *fp, struct pro
> {
> struct vnode *vp = fp->f_data;
> struct flock lf;
> + unsigned int flag;
> int error;
>
> KERNEL_LOCK();
> @@ -611,7 +612,10 @@ vn_closefile(struct file *fp, struct pro
> lf.l_type = F_UNLCK;
> (void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK);
> }
> - error = vn_close(vp, fp->f_flag, fp->f_cred, p);
> + flag = fp->f_flag;
> + if (p != NULL && p->p_flag & P_WEXIT)
> + flag |= O_NONBLOCK;
> + error = vn_close(vp, flag, fp->f_cred, p);
> KERNEL_UNLOCK();
> return (error);
> }
>
>