On Mon, Feb 21, 2022 at 03:00:01PM +1000, David Gwynne wrote:
> On Sun, Feb 20, 2022 at 10:30:22AM +1000, David Gwynne wrote:
> > 
> > 
> > > On 20 Feb 2022, at 09:46, David Gwynne <da...@gwynne.id.au> wrote:
> > > 
> > > On Sat, Feb 19, 2022 at 02:58:08PM -0800, Greg Steuck wrote:
> > >> There's no reproducer, but maybe this race is approachable without one?
> > >> 
> > >> 	dev = sc->sc_dev;
> > >> 	if (dev) {
> > >> 		struct vnode *vp;
> > >>
> > >> 		if (vfinddev(dev, VCHR, &vp))
> > >> 			VOP_REVOKE(vp, REVOKEALL);
> > >>
> > >> 		KASSERT(sc->sc_dev == 0);
> > >> 	}
> > > 
> > > this was my last run at it:
> > > https://marc.info/?l=openbsd-tech&m=164489981621957&w=2
> > > 
> > > maybe i need another dvthing thread to push it a bit harder...
> > 
> > adding another dvthing thread or two made it blow up pretty quickly :(
> 
> so it is this section:
> 
>         /* kick userland off the device */
>         dev = sc->sc_dev;
>         if (dev) {
>                 struct vnode *vp;
> 
>                 if (vfinddev(dev, VCHR, &vp))
>                         VOP_REVOKE(vp, REVOKEALL);
> 
>                 KASSERT(sc->sc_dev == 0);
>         }
> 
> my assumption was/is that VOP_REVOKE would call tunclose (or tapclose)
> on the currently open tun (or tap) device, and swap the vfs backend out
> behind the scenes with deadfs.
> 
> the context behind this is that there isn't really a strong binding between an
> open /dev entry (eg, /dev/tun0) and an instance of an interface (eg,
> tun0). all the device entrypoints (eg, tunopen, tunread, tunwrite,
> etc) pass a dev_t, and that's used to look up an interface instance to
> work with.
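> 
> to make that concrete, the entrypoints are shaped roughly like this (a
> simplified sketch, not the actual if_tun.c code; tun_lookup and
> tun_dev_write stand in for whatever the real lookup and write helpers
> are):
> 
> 	int
> 	tunwrite(dev_t dev, struct uio *uio, int ioflag)
> 	{
> 		struct tun_softc *sc;
> 
> 		/* map the minor number back to an interface instance */
> 		sc = tun_lookup(minor(dev));
> 		if (sc == NULL)
> 			return (ENXIO);
> 
> 		/*
> 		 * nothing about dev says whether sc is the same instance
> 		 * of the interface that existed when this fd was opened.
> 		 */
> 		return (tun_dev_write(sc, uio, ioflag));
> 	}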
> 
> the problem with this is an interface could be destroyed and recreated
> in between calls to the device entrypoints. ie, you could do the
> following high level steps:
> 
> - ifconfig tun0 create
> - open /dev/tun0 -> fd 3
> - ifconfig tun0 destroy
> - ifconfig tun0 create
> - write to fd 3
> 
> and that would send a packet on the newly created tun0 because it had
> the same minor number as the previous one.
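> 
> as a self-contained illustration of those steps (assuming root and that
> tun0 doesn't already exist; a real write would also need the 4-byte
> address family header and a valid packet, but the point is which
> interface instance it lands on):
> 
> 	#include <fcntl.h>
> 	#include <stdlib.h>
> 	#include <unistd.h>
> 
> 	int
> 	main(void)
> 	{
> 		int fd;
> 
> 		system("ifconfig tun0 create");
> 		fd = open("/dev/tun0", O_RDWR);	/* refers to this tun0 */
> 		system("ifconfig tun0 destroy");
> 		system("ifconfig tun0 create");	/* a brand new tun0 */
> 
> 		/*
> 		 * this write is handled by the new tun0, purely because
> 		 * it has the same minor number as the instance we opened.
> 		 */
> 		write(fd, "x", 1);
> 
> 		close(fd);
> 		return (0);
> 	}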
> 
> there was a lot of consensus that this was Not The Best(tm), and that if
> a tun interface was destroyed while the /dev entry was open, it should
> act like the interface was detached and the open /dev entry should stop
> working. this is what VOP_REVOKE helps with. or it is supposed to help
> with.
> 
> when we create a tun interface, it's added to a global list of tun
> interfaces. when a tun device is opened, we look for the interface on
> that list (and create it if it doesn't exist), and then check to see if
> it is already open by looking at sc_dev. if sc_dev is 0, it's not open
> and tunopen can set sc_dev to claim ownership of it. if sc_dev is
> non-zero, then the device is busy and open fails.
> 
> tunclose clears sc_dev to say ownership is given up.
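> 
> in other words, the ownership handshake is roughly this (simplified
> from tun_dev_open/tun_dev_close, not the literal code):
> 
> 	/* open path, after the interface has been found or created */
> 	if (sc->sc_dev != 0) {
> 		error = EBUSY;		/* someone else has it open */
> 		goto done;
> 	}
> 	sc->sc_dev = dev;		/* claim ownership of this instance */
> 
> 	/* close path */
> 	sc->sc_dev = 0;			/* give ownership back */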
> 
> tun destroy checks sc_dev, and if it is != 0 then it knows something has
> it open and will call VOP_REVOKE. VOP_REVOKE is supposed to do what i
> described above, which is to call tunclose on the program's behalf and swap
> the vfs ops out.
> 
> what i'm seeing is that sometimes VOP_REVOKE gets called, it happily
> returns 0, but tunclose is not called. this means sc_dev is not cleared,
> and then this KASSERT fires.
> 
> i've tried changing it to something like this in the destroy path:
> 
> -             KASSERT(sc->sc_dev == 0);
> +             while (sc->sc_dev != 0)
> +                     tsleep_nsec(&sc->sc_dev, PWAIT, "tunclose", INFSLP);
> 
> with tunclose calling wakeup(&sc->sc_dev) when it's finished, but this ends
> up getting stuck in the tsleep.
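> 
> the idea was to pair that with something like this at the end of the
> close path (a sketch, not a tested diff):
> 
> 	sc->sc_dev = 0;
> 	wakeup(&sc->sc_dev);	/* let tun_clone_destroy continue */
> 
> but since tunclose never gets called after the revoke, the wakeup never
> comes and destroy sleeps forever.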
> 
> however, if i cut the KASSERT out and let destroy keep going, i do see
> tunclose get called against the now non-existent interface. this would
> be fine, but now we're back where we started. if someone recreates tun0
> after it's been destroyed, tunclose will find the new interface and try
> to close it.
> 
> i've tried to follow what VOP_REVOKE actually does and how it finds
> tunclose to call it, but it's pretty twisty and i got tired.
> 
> i guess my question at this point is are my assumptions about how
> VOP_REVOKE works valid? specifically, should it reliably be calling
> tunclose? if not, what causes tunclose to be called in the future and
> why can't i wait for it in tun_clone_destroy?

claudio figured it out. his clue was that multiple concurrent calls
to tunopen (or tapopen) will share a vnode. because tunopen can sleep,
multiple programs can be inside tunopen for the same tun interface at
the same time, all with references against the same vnode.

at the same time as this another thread/program can call VOP_REVOKE
via tun_clone_destroy (eg, ifconfig tun1 destroy does this).
VOP_REVOKE marks a vnode as bad, which in turn means that subsequent
open()s of a tun interface will get a brand new vnode.

so multiple threads holding references to a vnode can be sleeping in
tun_dev_open on the interface cloner lock. one thread wins and takes
ownership of the tun interface. then another thread can destroy that
tun interface; the destroy calls VOP_REVOKE, which calls tun_dev_close
to tear down the vnode's association with the tun interface and mark
the vnode as bad. the thread that called tun_clone_destroy can then
immediately create another instance of the interface by calling
tun_clone_create.

one of the original threads with the old vnode reference wakes up and
takes ownership of the new tun_softc. however, because the vnode is bad,
all its vnode ops have been replaced with the deadfs ops. the close() op
on the old vnode is now a nop from the point of view of tun interfaces.
the old vnode is no longer associated with tun or tap, so closing it
will never call tun_dev_close (via tunclose or tapclose), which in turn
means sc_dev won't get cleared.

another thread can now call tun_clone_destroy against the new instance
of tun_softc. this instance has sc_dev set, so it tries to revoke it,
but there's no vnode associated with it because the old vnode reference
is dead.

because this second attempt at revocation can't find a vnode (vfinddev
fails), tunclose is never called, so sc_dev is still set and this
KASSERT fires.

hairy.

claudio and i came up with the following, which is to have tun_dev_open
check the state of the vnode associated with the current open call
after all the sleeping and potential tun_clone_destroy and
tun_clone_create calls. if the vnode has been made bad/dead after
all the sleeping, it returns with ENXIO.

this appears to fix the issue. i had a test program that would trigger
the KASSERT in a matter of seconds and it's been running for an hour
now.

here's the diff.

Index: if_tun.c
===================================================================
RCS file: /cvs/src/sys/net/if_tun.c,v
retrieving revision 1.234
diff -u -p -r1.234 if_tun.c
--- if_tun.c    16 Feb 2022 02:22:39 -0000      1.234
+++ if_tun.c    24 Feb 2022 08:08:38 -0000
@@ -374,10 +374,19 @@ tun_dev_open(dev_t dev, const struct if_
        struct ifnet *ifp;
        int error;
        u_short stayup = 0;
+       struct vnode *vp;
 
        char name[IFNAMSIZ];
        unsigned int rdomain;
 
+       /*
+        * Find the vnode associated with this open before we sleep
+        * and let something else revoke it. Our caller has a reference
+        * to it so we don't need to account for it.
+        */
+       if (!vfinddev(dev, VCHR, &vp))
+               panic("%s vfinddev failed", __func__);
+
        snprintf(name, sizeof(name), "%s%u", ifc->ifc_name, minor(dev));
        rdomain = rtable_l2(p->p_p->ps_rtableid);
 
@@ -412,6 +421,12 @@ tun_dev_open(dev_t dev, const struct if_
                        /* XXX if_clone_destroy if stayup? */
                        goto done;
                }
+       }
+
+       /* Has tun_clone_destroy torn the rug out under us? */
+       if (vp->v_type == VBAD) {
+               error = ENXIO;
+               goto done;
        }
 
        if (sc->sc_dev != 0) {
