[ This mail is primarily directed at Thiru, since AFAIK he's the only
one who truly understands the dank depths of the IPSQ code. However, if
this approach proves feasible, it will ideally be used to elegantly
sidestep some really brutal race conditions in the new IPMP codebase.
If splitting headaches are your "thing", then you may enjoy trying to
make sense of all this ;-) ]
Thiru,
I believe I have modified the IPSQ framework to accommodate non-ioctl
based quiescent access, even when operations that are traditionally
handled in ioctl context (e.g., ill_up_ipifs()) need to be called.
It seems to be working well on my test machine (using ce, which supports
DL_NOTE_PHYS_ADDR).
Here's the proposed code:
http://atlantic.east/ws/meem/physaddr/webrev
It also includes the related cleanup of the DL_PHYS_ADDR_REQ and
DL_NOTE_PHYS_ADDR logic, and the removal of ill_lock where appropriate.
(Alas, my new implementation has six more lines of code than the original,
violating one of my big rules. But the new code has a lot more comments,
and fixes one panic, so I think that offsets it ;-)
Here are the salient points:
* Although only DL_NOTE_PHYS_ADDR and the old M_ERROR/M_HANGUP
handling currently make use of the new logic, it has been
designed so that future DL_NOTE_*'s could easily operate on
quiescent ill's. My plan is to use this to change fields that
have long been set-once but now occasionally need to be changed
on-the-fly in the new IPMP model (such as ill_phys_addr_length)
by injecting synthetic notifications onto ill_rq.
* Two new functions, ipsq_current_start() and
ipsq_current_finish(), are used to start and finish an
exclusive operation. The existing ip_*_ioctl() routines
automatically call these as necessary -- but having them be
independent enables use outside of ioctl processing.
* The ipsq_current_ipif field still tracks whether a current
operation is in progress (and its associated ipif). However, a
new ipsq_current_ioctl field is added to track whether the
operation is associated with an ioctl. This field also enables
the special handling of SIOCLIFREMOVEIF to be moved inside
ipsq_current_finish(), and for the `ipif' argument to be removed
from ip_ioctl_finish().
* The M_ERROR/M_HANGUP special handling in ipsq_pending_mp_add()
and ipif_all_down_tail() has been removed/replaced with
appropriate calls to ipsq_current_start()/ipsq_current_end().
For a DL_NOTE_PHYS_ADDR, the codepath is as follows:
1. DL_NOTE_PHYS_ADDR is processed in ip_rput_dlpi_writer(). The
thread is exclusive on the IPSQ but the ill is not necessarily
quiesced.
2. ill_set_phys_addr() is called to set the address. This calls
ipsq_current_start() to start a new operation (not associated
with an ioctl), and calls ill_down_ipifs() to bring down all
IFF_UP ipifs (and generally try to quiesce the ill). It then
checks to see if the ill has indeed quiesced, and if so
immediately calls ill_set_phys_addr_tail(). Otherwise, it
calls ipsq_pending_mp_add() to queue the operation once the
ILL_DOWN condition has been met.
3. If the ill didn't immediately quiesce, then processing is
deferred until the final reference is removed via
ipif_ill_refrele_tail(), which then demuxes based on the mblk_t
contents and calls ill_set_physaddr_tail().
4. Eventually, ill_set_phys_addr_tail() is called with the ill
quiesced. It unpacks the mblks it needs (which were
preallocated by ill_set_phys_addr() so that we can fail fast
if necessary), sets the information on the ill, and calls
ill_up_ipifs() to bring the ipifs back up.
5. If there were no up ipifs, then ill_up_ipifs() will return
zero, in which case ill_set_phys_addr_tail() simply calls
ipsq_current_finish() to complete the operation; when
ipsq_exit() is eventually called, the next operation will be
allowed to begin.
6. If there were up ipifs, then ill_up_ipifs() will return
EINPROGRESS. At this point, ill_up_ipifs(), ipif_up(), and
ip_rput_dlpi_writer() or ip_arp_done() begin their traditional
dance -- e.g.:
* ill_up_ipifs() finds an ipif_was_up ipif, sets the
ill_up_ipifs flag and calls ipif_up() to bring it up.
* ipif_up() starts bringup, then enqueues a request with the
resolver (e.g., ipif_resolver_up()) and returns
EINPROGRESS (which is in-turn returned by ill_up_ipifs()).
* Eventually, the resolver request is handled and we end up
in ip_arp_done(). It sees that ill_up_ipifs is set, and
rather than completing the operation, calls *back* to
ill_up_ipifs() to bring up another ipif.
* If there are more ipifs to bring up, then ill_up_ipifs()
will again call ipif_up() which will again enqueue a new
request with the resolver and again return EINPROGRESS --
in other words, we go to start of the dance.
* If there are no more ipifs to bring up, then
ill_up_ipifs() will return 0 and ip_arp_done() will
complete the IPSQ operation. If ipsq_current_ioctl
indicates the request is associated with an ioctl, then
ip_ioctl_finish() will be called to send up the ioctl
response and finish the IPSQ operation. Otherwise,
ipsq_current_finish() will be called.
One subtle bit: ill_up_ipifs() currently takes an `mp' (traditionally the
pending ioctl), which is passed to ipif_up() and tracks the state of the
pending operation. Right now, ill_set_phys_addr_tail() passes in the
DL_NOTIFY_IND that was cloned by ill_set_phys_addr() (and assigned to
either ill_phys_addr_mp or ill_nd_lla_mp) -- but this seems kind of
skanky. Maybe I should allocate an `mp' specifically for the purpose?
Or maybe ipif_up() can get be recoded to get by without one?
I also had to add some kinda-ugly changes to ipif_up() to allow calls
from "lower queue" context. Please look closely at these, because I'm
not convinced they're appropriate.
Thanks for your time and thoughts.
--
meem