On Sun, Sep 07, 2025 at 09:25:59AM -0700, James Gritton wrote:
> On 2025-09-06 17:26, Konstantin Belousov wrote:
> > On Fri, Sep 05, 2025 at 10:57:30AM -0700, James Gritton wrote:
> > > On 2025-09-04 22:14, Konstantin Belousov wrote:
> > > > BTW, you added some support for kqueue for jail events, but not to the
> > > > jail file descriptors.  This seems to be backward: if somebody wants to
> > > > monitor events for jails, then it is more reliable and straightforward
> > > > to do with the new jail fds rather than with ids.
> > > 
> > > It is at least incomplete, and not the state I want things to be at.
> > > There's a sticking point with jaildesc kqueue, so while I work that
> > > out I went with jid-based kqueue as a starter.
> > > 
> > > The trouble is child jails.  I took their handling from the existing
> > > child process handling, where I register a new kevent under the new
> > > jail's id.  But that's something I can't do with descriptors, since
> > > they have a process-specific identifier, the descriptor number.  The
> > > code that creates the new event, coming from the jail_set call that
> > > created a new jail, has access to the global descriptor (the struct
> > > file), but not to the process(es) that have it open, so I have no
> > > way of registering one or more events with that descriptor number.
> > > 
> > > One workaround is to have both jid- and jaildesc-based kevents, but
> > > both of them register a new jid-based kevent for a newly created child
> > > jail.  The caller may then get a descriptor with jail_get, and add a
> > > kevent for it and remove the old jid-based one.  This would work, but
> > > feels really klunky.
> > > 
> > > The other idea I've had is to register a temporary event, and then add
> > > code to kqueue_scan that converts that into a proper jaildesc event
> > > with the expected file descriptor number.  That would require either
> > > jaildesc-specific code in or around kqueue_scan, or adding another
> > > filterops function, neither of which is great.  Still, it seems the
> > > better solution.
> > 
> > This is not how the monitoring APIs work in general.  For instance,
> > when you register a listening socket in kqueue (or mark it for select or
> > poll), you do not get back a new connected file descriptor.  Kqueue only
> > provides a notification that new connection arrived, and then code
> > needs to accept it and get the file descriptor for new connection using
> > dedicated socket API.
> 
> True.  An accepted connection changes the network state, both locally
> and remotely, and automatically establishing that connection wouldn't
> be the right thing to do.  The existence of a listen queue also fits
> well with a notification system that doesn't do its own queueing.
> 
> Jail descriptors, on the other hand, only exist as a view to an
> existing jail, and don't establish anything other than that view.
> Jail creation also has no associated queue, so loss-free notification
> relies on the same hack that process forking already established,
> but requiring a little more in the way of making it fit.
> 
> An alternate way of solving the problem would be to create such
> a queue, allowing a single notification of such things as a jail
> attachment or child jail creation, or possibly more than one of
> them by the time the process reads the queue.

No, the queue is obviously overkill.  Still, no notification system
really requires the queue.

I have not thought too hard about what would be a good design for the jail
filedesc, but I have some ideas that feel worth communicating.

First, since you already mentioned a desire to capsicumize jfds, I think
there is already a huge wart in the interface.  The function that opens (or
creates) an fd from a jail id must not take just the jail id.  It should be
namespace-aware already.  In other words, it should take an existing jfd
and create a child jail, returning a jfd for it.  The existing jfd gives
the namespace container to start with, which is essentially how capsicum
organizes rights limiting.

For bootstrapping, a non-capentered process in prison0 can pass a special
id in place of the jfd to reference prison0, similar to how AT_FDCWD marks
'.' for the *at(2) syscalls.

Next, for notifications, the notification subsystem does not need to
indicate what happened.  In particular, it does not need to communicate
either the jid of the created (or destroyed) jail or a jfd for it.  It is
enough to make the listener aware that something happened.  Upon receiving
the notification, the listener would interrogate the system state and see
what changed.
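
As a rough sketch of the listener side, assuming it keeps a sorted snapshot
of the child ids it saw last time and re-reads the current list after each
wakeup, the "see what changed" step is just a merge of two sorted arrays.
The names struct id_diff and diff_snapshots() are invented for this
example; how the current list is obtained is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many ids appeared and disappeared. */
struct id_diff {
	size_t ncreated;
	size_t ndestroyed;
};

/*
 * Compare two ascending, duplicate-free id lists.  `old` is the snapshot
 * the listener kept, `cur` is what it just read back from the system.
 * Ids only in `cur` are stored in `created` (room for ncur entries);
 * ids only in `old` are stored in `destroyed` (room for nold entries).
 */
static struct id_diff
diff_snapshots(const int *old, size_t nold, const int *cur, size_t ncur,
    int *created, int *destroyed)
{
	struct id_diff d = { 0, 0 };
	size_t i = 0, j = 0;

	assert(old != NULL || nold == 0);
	assert(cur != NULL || ncur == 0);

	while (i < nold || j < ncur) {
		if (j == ncur || (i < nold && old[i] < cur[j])) {
			/* Present before, gone now. */
			destroyed[d.ndestroyed++] = old[i++];
		} else if (i == nold || cur[j] < old[i]) {
			/* Not seen before, present now. */
			created[d.ncreated++] = cur[j++];
		} else {
			/* Unchanged entry. */
			i++;
			j++;
		}
	}
	return (d);
}
```

The notification itself carries nothing; it only tells the listener when to
run the comparison.  A snapshot keyed on bare ids cannot see a jail that
was created and destroyed between two scans, nor an id that was reused, so
holding descriptors for the children would be the more robust variant.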

The proc knotes are a very bad example; in particular, the idea of passing
back the pid is wrong.  It is known to be racy, both because reported
pids might get lost, and because pids only behave handle-like in the
parent of the forked child, so a reported pid might be reused.

Proc knote + half-done procdesc is not a good example to start with.

