> Date: Sat, 2 May 2020 11:33:17 +0200
> From: Martin Pieuchot <[email protected]>
> 
> On 02/05/20(Sat) 10:40, Anton Lindqvist wrote:
> > On Fri, May 01, 2020 at 05:17:36PM +0200, Martin Pieuchot wrote:
> > > On 01/05/20(Fri) 12:13, Anton Lindqvist wrote:
> > > > The order in which the pty master/slave is closed seems to be the
> > > > trigger here. While not duping the master, it's closed before the slave.
> > > > In the opposite scenario, the slave is closed before the master. While
> > > > closing the slave, it ends up here expressed as a simplified backtrace:
> > > > 
> > > >   tsleep()
> > > >   ttysleep()
> > > >   ttywait()
> > > >   ttywflush()
> > > >   ttylclose()
> > > >   ptsclose()
> > > >   fdfree()
> > > >   exit1()
> > > > 
> > > > In order words, it ends up doing a tsleep(INFSLP) causing the thread to
> > > > hang. Note that this is not the case when the master is closed before
> > > > the slave since `tp->t_oproc == NULL' causing ttywait() to bail early.
> > > 
> > > Why is the sleeper never awaken?  Does that mean a ttwakeup() is missing?
> > 
> > In this case, the process is single threaded, about to exit and the only
> > consumer of the pty. I don't see how it could be any other process
> > responsibility to perform the wakeup.
> 
> Do we see that the issue is caused by the order in which descriptors are
> closed in fdfree()?  The current deadlock occurs because the duped master
> has a higher fd number than the slave which means it is still open when the
> slave is closed.

I'm sure we could construct an example where the file descriptors are
in a different oder.  So changing the order is not going to help.

> But why would that be a problem?  By default *close() functions,
> including ttylclose() are blocking.  So any exiting process might end up
> hanging in fdfree().  Diff below illustrates that by forcing all *close()
> during exit1() to be non-blocking, it also fix the issue.

I very much fear that is going to have unintended side-effects with
output not being flushed properly.  And the process could still
deadlock itself by using close(2) directly isn't it?

> Does it make sense to close fds as non-blocking when existing?  What
> should a dying thread wait for?  What can be the cons of such approach? 
> 
> Now regarding your fix, why does it make sense to wait 5sec instead of
> indefinitely?  Did you look at r1.263 of NetBSD's kern/tty.c?  If we go
> with this change could you please change the 'timo' suffix and variables
> to 'nsec' and use uint64_t instead of int?

r1.263 was reverted in r1.264.  Then r1.265 is the commit quoted by anton@.
There is also r2.267 which adds an additional fix to r1.265.

In ttywait(), NetBSD only calls ttyflush() if there is a timeout. That
makes sense, because we have ttywflush() to combine the wait and flush
so ttywait() shouldn't flush when there is no error.


> Index: kern/vfs_vnops.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/vfs_vnops.c,v
> retrieving revision 1.114
> diff -u -p -r1.114 vfs_vnops.c
> --- kern/vfs_vnops.c  8 Apr 2020 08:07:51 -0000       1.114
> +++ kern/vfs_vnops.c  2 May 2020 09:18:28 -0000
> @@ -601,6 +601,7 @@ vn_closefile(struct file *fp, struct pro
>  {
>       struct vnode *vp = fp->f_data;
>       struct flock lf;
> +     unsigned int flag;
>       int error;
>  
>       KERNEL_LOCK();
> @@ -611,7 +612,10 @@ vn_closefile(struct file *fp, struct pro
>               lf.l_type = F_UNLCK;
>               (void) VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK);
>       }
> -     error = vn_close(vp, fp->f_flag, fp->f_cred, p);
> +     flag = fp->f_flag;
> +     if (p != NULL && p->p_flag & P_WEXIT)
> +             flag |= O_NONBLOCK;
> +     error = vn_close(vp, flag, fp->f_cred, p);
>       KERNEL_UNLOCK();
>       return (error);
>  }
> 
> 

Reply via email to