Re: What's going on with vnets and epairs w/ addresses?

2022-12-20 Thread Bjoern A. Zeeb

On Tue, 20 Dec 2022, Mark Johnston wrote:


On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote:

On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff  wrote:


  Zhenlei,

On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote:
Z> I managed to repeat this issue on CURRENT/14 with this small snip:
Z>
Z> ---
Z> #!/bin/sh
Z>
Z> # test jail name
Z> n="test_ref_leak"
Z>
Z> jail -c name=$n path=/ vnet persist
Z> # The following line triggers a jail pr_ref leak
Z> jexec $n ifconfig lo0 inet 127.0.0.1/8
Z>
Z> jail -R $n
Z>
Z> # wait a moment
Z> sleep 1
Z>
Z> jls -j $n
Z>
Z> After DDB debugging and tracing, it seems that this is triggered by a
combination of [1] and [2]
Z>
Z> [1] https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915 

Z> [2] https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b 

Z>
Z>
Z> In [1] the per-VNET uma zone is shared with the global one.
Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;`
Z>
Z> In [2] the unref of `inp->inp_cred` is deferred to inpcb_dtor(), called via
uma_zfree_smr().
Z>
Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor() is
not called immediately,
Z> thus leaking the `inp->inp_cred` ref and hence `prison->pr_ref`.
Z>
Z> And it is also not possible to free up the cache via the per-VNET SYSUNINIT
handlers tcp_destroy / udp_destroy / rip_destroy.
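
To make the mechanism concrete, here is a minimal C sketch of the shape
described in [2]. It is an illustration only, not the literal sys/netinet
code; the *_sketch names are invented for that reason.

#include <sys/param.h>
#include <sys/ucred.h>
#include <vm/uma.h>
#include <netinet/in_pcb.h>

/* After [2], the cred (and with it the prison reference) is dropped only
 * in the zone destructor. */
static void
inpcb_dtor_sketch(void *mem, int size __unused, void *arg __unused)
{
        struct inpcb *inp = mem;

        crfree(inp->inp_cred);          /* may drop the last prison ref */
}

static void
in_pcbfree_sketch(struct inpcbinfo *pcbinfo, struct inpcb *inp)
{
        /* The freed item is parked in a per-CPU SMR cache; the destructor
         * above does not run until the item is recycled or the cache is
         * drained, so on an idle machine the jail's pr_ref stays elevated. */
        uma_zfree_smr(pcbinfo->ipi_zone, inp);
}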

This is a known issue and I'd prefer not to call it a problem. The "leak" of a
jail happens only if the machine is idle wrt networking activity.

Getting back to the problem that started this thread - the epair(4)s not
immediately popping back to prison0. IMHO, the problem again lies in the
design of if_vmove and epair(4) in particular. if_vmove shall not exist;
instead we should do a full if_attach() and if_detach(). The state of an
ifnet when it undergoes if_vmove doesn't carry any useful information.
With Alexander melifaro@ we discussed better options than if_vmove for
creating or attaching interfaces to jails. Until they are ready, the
easiest workaround for the annoying epair(4) come-back problem is to
remove the epair manually before destroying the jail, like I did in
80fc25025ff.



It still behaved much better prior to eb93b99d6986, for which you and Mark
were going to work on a solution to allow the cred "leak" to close
up much more quickly. CC markj@, since I think it's been six months
since the last time I inquired about it, making this a good time to do
it again...


I spent some time trying to see if we could fix this in UMA/SMR and
talked to Jeff about it a bit.  At this point I don't think it's the
right approach, at least for now.  Really we have a composability
problem where different layers are using different techniques to signal
that they're done with a particular piece of memory, and they just
aren't compatible.

One thing I tried is to implement a UMA function which walks over all
SMR zones and synchronizes all cached items (so that their destructors
are called).  This is really expensive, at minimum it has to bind to all


A semi-unrelated question -- do we have any documentation around SMR
in the tree which is not in subr_smr.c?

(I have to admit I find it highly confusing that the acronym is more
easily found as "Shingled Magnetic Recording (SMR)" in a different
header file).



CPUs in the system so that it can flush per-CPU buckets.  If
jail_deref() calls that function, the bug goes away at least in my
limited testing, but its use is really a layering violation.
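
For illustration, a very rough sketch of such a flush routine follows. It is
not an existing KPI; the per-SMR-zone drain step is only a placeholder
comment, and only the CPU-binding pattern (thread_lock()/sched_bind()) is
real kernel API.

#include <sys/param.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

static void
uma_smr_synchronize_all_sketch(void)
{
        int cpu;

        CPU_FOREACH(cpu) {
                /* Migrate to each CPU so its private buckets can be
                 * flushed from the CPU that owns them; this migration is
                 * the expensive part. */
                thread_lock(curthread);
                sched_bind(curthread, cpu);
                thread_unlock(curthread);

                /* Placeholder: for every SMR zone, wait for readers and
                 * push the local bucket back so that cached items finally
                 * reach their destructors. */
        }
        thread_lock(curthread);
        sched_unbind(curthread);
        thread_unlock(curthread);
}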

We could, say, periodically scan cached UMA/SMR items and invoke their
destructors, but for most SMR consumers this is unnecessary, and again
there's a layering problem: the inpcb layer shouldn't "know" that it has
to do that for its zones, since it's the jail layer that actually cares.

It also seems kind of strange that dying jails still occupy a slot in
the jail namespace.  I don't really understand why the existence of a
dying jail prevents creation of a new jail with the same name, but
presumably there's a good reason for it?


You can create a new jail but if you have (physical) resources tied to
the old one which are not released, then you are stuck (physical
network interfaces for example).



Now my inclination is to try and fix this in the inpcb layer, by not
accessing the inp_cred at all in the lookup path until we hold the inpcb
lock, and then releasing the cred ref before freeing a PCB to its zone.
I think this is doable based on a few observations:
- When doing an SMR-protected lookup, we always lock the returned inpcb
 before handing it to the caller.  So we could in principle perform
 inp_cred checking after acquiring the lock but before returning.
- If there are no jailed PCBs in a hash chain, in_pcblookup_hash_locked()
 always scans the whole chain.
- If we match only one PCB in a lookup, we can probably(?) return that
 PCB without dereferencing the cred pointer at all.  If not, then the
 scan only has to keep track of a fixed number of PCBs before picking
 which one to return.
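
A rough sketch of that ordering might look as follows. The helpers
find_match_smr(), lock_and_exit_smr() and visible_to_prison() are invented
placeholders, and the ipi_smr member name is assumed; only the ordering
(SMR section, then inpcb lock, then cred checks) is the point.

static struct inpcb *find_match_smr(struct inpcbinfo *, struct in_addr,
    u_short, struct in_addr, u_short);                      /* placeholder */
static bool lock_and_exit_smr(struct inpcbinfo *, struct inpcb *);  /* placeholder */
static bool visible_to_prison(struct ucred *, struct inpcb *);      /* placeholder */

static struct inpcb *
in_pcblookup_sketch(struct inpcbinfo *pcbinfo, struct in_addr faddr,
    u_short fport, struct in_addr laddr, u_short lport, struct ucred *cred)
{
        struct inpcb *inp;

        smr_enter(pcbinfo->ipi_smr);
        inp = find_match_smr(pcbinfo, faddr, fport, laddr, lport);
        if (inp == NULL) {
                smr_exit(pcbinfo->ipi_smr);
                return (NULL);
        }
        /* Placeholder: take the inpcb lock, leave the SMR section and bail
         * out if the PCB was freed while the lock was being acquired. */
        if (!lock_and_exit_smr(pcbinfo, inp))
                return (NULL);
        /* Only now, with the lock held, is inp_cred consulted at all. */
        if (cred != NULL && !visible_to_prison(cred, inp)) {
                INP_RUNLOCK(inp);
                return (NULL);
        }
        return (inp);
}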

Re: prison_flag() check in hot path of in_pcblookup()

2022-12-20 Thread Mark Johnston
On Tue, Dec 13, 2022 at 11:54:17PM +, Bjoern A. Zeeb wrote:
> On Tue, 13 Dec 2022, Andrew Gallatin wrote:
> 
> > [ I added pjd, since the original patch came from him ]
> >
> > Just to make sure I understand, I have a simple yes/no question:
> >
> > Can  jails and the host ever share the same (local) port and the same IP?
> 
> Can they currently (I tested only for TCP)?
> 
> - local binds can overlap like they can with just the base system,
>   so bind(... {AF_INET, laddr, lport} ... ) works fine (REUSEPORT).
> 
> - tcp connect of a 2nd socket to the same {faddr, fport} from the above
>   bind will fail with 'Address already in use' [currently]
>   [I believe that would mean your patch could go in? Where does the
>   error come from [%]?] [*]
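
For reference, a small self-contained userland program that shows the same
two behaviours on a single host (no jail involved; the port numbers 9998 and
9999 are arbitrary): both binds to the same {laddr, lport} succeed thanks to
SO_REUSEPORT, and the second connect() to the same {faddr, fport} fails with
'Address already in use' because the full four-tuple would collide.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
mksock(uint16_t lport)
{
        struct sockaddr_in sin;
        int s, one = 1;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) == -1)
                err(1, "setsockopt");
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_len = sizeof(sin);
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sin.sin_port = htons(lport);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                err(1, "bind");
        return (s);
}

int
main(void)
{
        struct sockaddr_in dst;
        int lsock, c1, c2;

        /* Listener providing the {faddr, fport} side of the four-tuple. */
        lsock = mksock(9999);
        if (listen(lsock, 1) == -1)
                err(1, "listen");

        /* Two sockets bound to the same local {127.0.0.1, 9998}. */
        c1 = mksock(9998);
        c2 = mksock(9998);

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_len = sizeof(dst);
        dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        dst.sin_port = htons(9999);

        if (connect(c1, (struct sockaddr *)&dst, sizeof(dst)) == -1)
                err(1, "first connect");
        /* Same four-tuple as c1: expected to fail with EADDRINUSE. */
        if (connect(c2, (struct sockaddr *)&dst, sizeof(dst)) == -1)
                warn("second connect");
        else
                printf("second connect unexpectedly succeeded\n");

        close(c1);
        close(c2);
        close(lsock);
        return (0);
}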

I presume that the patch just causes the first loop in
in_pcblookup_hash_locked() to return immediately upon an exact match?  I
think this is only valid if in_pcblookup_hash_locked() can assume that
faddr != INADDR_ANY.  I'm pretty sure this is true but it's not entirely
obvious to me.
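
If that reading is right, the change would be something like the sketch below
(chain and member names are simplified and differ across branches; this is
not the literal in_pcb.c code):

/* Return on the first exact four-tuple match.  This is only safe if faddr
 * can never be INADDR_ANY here, so that at most one PCB can carry the
 * exact tuple. */
static struct inpcb *
exact_lookup_sketch(struct inpcbhead *head, struct in_addr faddr,
    u_short fport, struct in_addr laddr, u_short lport)
{
        struct inpcb *inp;

        CK_LIST_FOREACH(inp, head, inp_hash) {
                if (inp->inp_faddr.s_addr == faddr.s_addr &&
                    inp->inp_laddr.s_addr == laddr.s_addr &&
                    inp->inp_fport == fport && inp->inp_lport == lport)
                        return (inp);   /* stop scanning on the exact match */
        }
        return (NULL);
}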

> - tcp listen will work on {laddr, lport} if run in parallel (REUSEPORT)
>   or in base and jail at the same time.

I wonder if we can improve wildcard matching by enforcing an invariant
that jailed sockets always appear first in a hash chain?  That would
make hash insertion more expensive but that might be a reasonable
tradeoff.
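
Purely to illustrate the insertion-cost trade-off (this is not existing code
and the list/member names are simplified), maintaining such an invariant
could look like:

/* Keep jailed PCBs at the front of the chain; non-jailed PCBs are placed
 * after the last jailed entry, which requires walking the jailed prefix
 * and is the extra insertion cost mentioned above. */
static void
insert_sketch(struct inpcbhead *head, struct inpcb *inp)
{
        struct inpcb *cur, *last_jailed;

        if (prison_flag(inp->inp_cred, PR_IP4) != 0) {
                CK_LIST_INSERT_HEAD(head, inp, inp_hash);
                return;
        }
        last_jailed = NULL;
        CK_LIST_FOREACH(cur, head, inp_hash) {
                if (prison_flag(cur->inp_cred, PR_IP4) == 0)
                        break;
                last_jailed = cur;
        }
        if (last_jailed != NULL)
                CK_LIST_INSERT_AFTER(last_jailed, inp, inp_hash);
        else
                CK_LIST_INSERT_HEAD(head, inp, inp_hash);
}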

> [%] likely in_pcbconnect_setup() ?  Also one should check the other
>  order (jail first then base);  also we assume no other race
>  conditions in this rather simple testing...



Re: What's going on with vnets and epairs w/ addresses?

2022-12-20 Thread Mark Johnston
On Sun, Dec 18, 2022 at 10:52:58AM -0600, Kyle Evans wrote:
> On Sat, Dec 17, 2022 at 11:22 AM Gleb Smirnoff  wrote:
> >
> >   Zhenlei,
> >
> > On Fri, Dec 16, 2022 at 06:30:57PM +0800, Zhenlei Huang wrote:
> > Z> I managed to repeat this issue on CURRENT/14 with this small snip:
> > Z>
> > Z> ---
> > Z> #!/bin/sh
> > Z>
> > Z> # test jail name
> > Z> n="test_ref_leak"
> > Z>
> > Z> jail -c name=$n path=/ vnet persist
> > Z> # The following line triggers a jail pr_ref leak
> > Z> jexec $n ifconfig lo0 inet 127.0.0.1/8
> > Z>
> > Z> jail -R $n
> > Z>
> > Z> # wait a moment
> > Z> sleep 1
> > Z>
> > Z> jls -j $n
> > Z>
> > Z> After DDB debugging and tracing, it seems that this is triggered by a
> > combination of [1] and [2]
> > Z>
> > Z> [1] 
> > https://reviews.freebsd.org/rGfec8a8c7cbe4384c7e61d376f3aa5be5ac895915 
> > 
> > Z> [2] 
> > https://reviews.freebsd.org/rGeb93b99d698674e3b1cc7139fda98e2b175b8c5b 
> > 
> > Z>
> > Z>
> > Z> In [1] the per-VNET uma zone is shared with the global one.
> > Z> `pcbinfo->ipi_zone = pcbstor->ips_zone;`
> > Z>
> > Z> In [2] the unref of `inp->inp_cred` is deferred to inpcb_dtor(), called
> > via uma_zfree_smr().
> > Z>
> > Z> Unfortunately inps freed by uma_zfree_smr() are cached and inpcb_dtor()
> > is not called immediately,
> > Z> thus leaking the `inp->inp_cred` ref and hence `prison->pr_ref`.
> > Z>
> > Z> And it is also not possible to free up the cache via the per-VNET
> > SYSUNINIT handlers tcp_destroy / udp_destroy / rip_destroy.
> >
> > This is a known issue and I'd prefer not to call it a problem. The "leak"
> > of a jail happens only if the machine is idle wrt networking activity.
> >
> > Getting back to the problem that started this thread - the epair(4)s not
> > immediately popping back to prison0. IMHO, the problem again lies in the
> > design of if_vmove and epair(4) in particular. if_vmove shall not exist;
> > instead we should do a full if_attach() and if_detach(). The state of an
> > ifnet when it undergoes if_vmove doesn't carry any useful information.
> > With Alexander melifaro@ we discussed better options than if_vmove for
> > creating or attaching interfaces to jails. Until they are ready, the
> > easiest workaround for the annoying epair(4) come-back problem is to
> > remove the epair manually before destroying the jail, like I did in
> > 80fc25025ff.
> >
> 
> It still behaved much better prior to eb93b99d6986, for which you and Mark
> were going to work on a solution to allow the cred "leak" to close
> up much more quickly. CC markj@, since I think it's been six months
> since the last time I inquired about it, making this a good time to do
> it again...

I spent some time trying to see if we could fix this in UMA/SMR and
talked to Jeff about it a bit.  At this point I don't think it's the
right approach, at least for now.  Really we have a composability
problem where different layers are using different techniques to signal
that they're done with a particular piece of memory, and they just
aren't compatible.

One thing I tried is to implement a UMA function which walks over all
SMR zones and synchronizes all cached items (so that their destructors
are called).  This is really expensive, at minimum it has to bind to all
CPUs in the system so that it can flush per-CPU buckets.  If
jail_deref() calls that function, the bug goes away at least in my
limited testing, but its use is really a layering violation.

We could, say, periodically scan cached UMA/SMR items and invoke their
destructors, but for most SMR consumers this is unnecessary, and again
there's a layering problem: the inpcb layer shouldn't "know" that it has
to do that for its zones, since it's the jail layer that actually cares.

It also seems kind of strange that dying jails still occupy a slot in
the jail namespace.  I don't really understand why the existence of a
dying jail prevents creation of a new jail with the same name, but
presumably there's a good reason for it?

Now my inclination is to try and fix this in the inpcb layer, by not
accessing the inp_cred at all in the lookup path until we hold the inpcb
lock, and then releasing the cred ref before freeing a PCB to its zone.
I think this is doable based on a few observations:
- When doing an SMR-protected lookup, we always lock the returned inpcb
  before handing it to the caller.  So we could in principle perform
  inp_cred checking after acquiring the lock but before returning.
- If there are no jailed PCBs in a hash chain, in_pcblookup_hash_locked()
  always scans the whole chain.
- If we match only one PCB in a lookup, we can probably(?) return that
  PCB without dereferencing the cred pointer at all.  If not, then the
  scan only has to keep track of a fixed number of PCBs before picking
  which one to return.  So it looks like we can perform