On 02/05/20(Sat) 10:40, Anton Lindqvist wrote:
> On Fri, May 01, 2020 at 05:17:36PM +0200, Martin Pieuchot wrote:
> > On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> > > The order in which the pty master/slave is closed seems to be the
> > > trigger here. While not duping the master, it's closed before the slave.
> > > In the opposite scenario, the slave is closed before the master. While
> > > closing the slave, it ends up here expressed as a simplified backtrace:
> > >
> > > tsleep()
> > > ttysleep()
> > > ttywait()
> > > ttywflush()
> > > ttylclose()
> > > ptsclose()
> > > fdfree()
> > > exit1()
> > >
> > > In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> > > hang. Note that this is not the case when the master is closed before
> > > the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.
> >
> > Why is the sleeper never awaken? Does that mean a ttwakeup() is missing?
>
> In this case, the process is single threaded, about to exit and the only
> consumer of the pty. I don't see how it could be any other process
> responsibility to perform the wakeup.
Do we see that the issue is caused by the order in which descriptors are
closed in fdfree()? The current deadlock occurs because the duped master
has a higher fd number than the slave which means it is still open when the
slave is closed.
But why would that be a problem? By default *close() functions,
including ttylclose() are blocking. So any exiting process might end up
hanging in fdfree(). Diff below illustrates that by forcing all *close()
during exit1() to be non-blocking, it also fix the issue.
Does it make sense to close fds as non-blocking when existing? What
should a dying thread wait for? What can be the cons of such approach?
Now regarding your fix, why does it make sense to wait 5sec instead of
indefinitely? Did you look at r1.263 of NetBSD's kern/tty.c? If we go
with this change could you please change the 'timo' suffix and variables
to 'nsec' and use uint64_t instead of int?
Index: kern/vfs_vnops.c
===================================================================
RCS file: /cvs/src/sys/kern/vfs_vnops.c,v
retrieving revision 1.114
diff -u -p -r1.114 vfs_vnops.c
--- kern/vfs_vnops.c 8 Apr 2020 08:07:51 -0000 1.114
+++ kern/vfs_vnops.c 2 May 2020 09:18:28 -0000
@@ -601,6 +601,7 @@ vn_closefile(struct file *fp, struct pro
{
struct vnode *vp = fp->f_data;
struct flock lf;
+ unsigned int flag;
int error;
KERNEL_LOCK();
@@ -611,7 +612,10 @@ vn_closefile(struct file *fp, struct pro
lf.l_type = F_UNLCK;
(void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK);
}
- error = vn_close(vp, fp->f_flag, fp->f_cred, p);
+ flag = fp->f_flag;
+ if (p != NULL && p->p_flag & P_WEXIT)
+ flag |= O_NONBLOCK;
+ error = vn_close(vp, flag, fp->f_cred, p);
KERNEL_UNLOCK();
return (error);
}