[ This mail is primarily directed at Thiru, since AFAIK he's the only
  one who truly understands the dank depths of the IPSQ code.  However, if
  this approach proves feasible, it will ideally be used to elegantly
  sidestep some really brutal race conditions in the new IPMP codebase.

  If splitting headaches are your "thing", then you may enjoy trying to
  make sense of all this ;-) ]

Thiru,

I believe I have modified the IPSQ framework to accommodate non-ioctl
based quiescent access, even when operations that are traditionally
handled in ioctl context (e.g., ill_up_ipifs()) need to be called. 
It seems to be working well on my test machine (using ce, which supports
DL_NOTE_PHYS_ADDR).

Here's the proposed code:

      http://atlantic.east/ws/meem/physaddr/webrev

It also includes the related cleanup of the DL_PHYS_ADDR_REQ and
DL_NOTE_PHYS_ADDR logic, and the removal of ill_lock where appropriate.
(Alas, my new implementation has six more lines of code than the original,
violating one of my big rules.  But the new code has a lot more comments,
and fixes one panic, so I think that offsets it ;-)

Here are the salient points:

        * Although only DL_NOTE_PHYS_ADDR and the old M_ERROR/M_HANGUP
          handling currently make use of the new logic, it has been
          designed so that future DL_NOTE_*'s could easily operate on
          quiescent ill's.  My plan is to use this to change fields that
          have long been set-once but now occasionally need to be changed
          on-the-fly in the new IPMP model (such as ill_phys_addr_length)
          by injecting synthetic notifications onto ill_rq.

        * Two new functions, ipsq_current_start() and
          ipsq_current_finish(), are used to start and finish an
          exclusive operation.  The existing ip_*_ioctl() routines
          automatically call these as necessary -- but having them be
          independent enables use outside of ioctl processing.

        * The ipsq_current_ipif field still tracks whether a current
          operation is in progress (and its associated ipif).  However, a
          new ipsq_current_ioctl field is added to track whether the
          operation is associated with an ioctl.  This field also enables
          the special handling of SIOCLIFREMOVEIF to be moved inside
          ipsq_current_finish(), and for the `ipif' argument to be removed
          from ip_ioctl_finish().

        * The M_ERROR/M_HANGUP special handling in ipsq_pending_mp_add()
          and ipif_all_down_tail() has been removed/replaced with
          appropriate calls to ipsq_current_start()/ipsq_current_end().

For a DL_NOTE_PHYS_ADDR, the codepath is as follows:

        1. DL_NOTE_PHYS_ADDR is processed in ip_rput_dlpi_writer().  The
           thread is exclusive on the IPSQ but the ill is not necessarily
           quiesced.

        2. ill_set_phys_addr() is called to set the address.  This calls
           ipsq_current_start() to start a new operation (not associated
           with an ioctl), and calls ill_down_ipifs() to bring down all
           IFF_UP ipifs (and generally try to quiesce the ill).  It then
           checks to see if the ill has indeed quiesced, and if so
           immediately calls ill_set_phys_addr_tail().  Otherwise, it
           calls ipsq_pending_mp_add() to queue the operation once the
           ILL_DOWN condition has been met.

        3. If the ill didn't immediately quiesce, then processing is
           deferred until the final reference is removed via
           ipif_ill_refrele_tail(), which then demuxes based on the mblk_t
           contents and calls ill_set_physaddr_tail().

        4. Eventually, ill_set_phys_addr_tail() is called with the ill
           quiesced.  It unpacks the mblks it needs (which were
           preallocated by ill_set_phys_addr() so that we can fail fast
           if necessary), sets the information on the ill, and calls    
           ill_up_ipifs() to bring the ipifs back up.

        5. If there were no up ipifs, then ill_up_ipifs() will return
           zero, in which case ill_set_phys_addr_tail() simply calls
           ipsq_current_finish() to complete the operation; when
           ipsq_exit() is eventually called, the next operation will be
           allowed to begin.

        6. If there were up ipifs, then ill_up_ipifs() will return
           EINPROGRESS.  At this point, ill_up_ipifs(), ipif_up(), and
           ip_rput_dlpi_writer() or ip_arp_done() begin their traditional
           dance -- e.g.:

              * ill_up_ipifs() finds an ipif_was_up ipif, sets the
                ill_up_ipifs flag and calls ipif_up() to bring it up.

              * ipif_up() starts bringup, then enqueues a request with the
                resolver (e.g., ipif_resolver_up()) and returns
                EINPROGRESS (which is in-turn returned by ill_up_ipifs()).

              * Eventually, the resolver request is handled and we end up
                in ip_arp_done().  It sees that ill_up_ipifs is set, and
                rather than completing the operation, calls *back* to
                ill_up_ipifs() to bring up another ipif.

              * If there are more ipifs to bring up, then ill_up_ipifs()
                will again call ipif_up() which will again enqueue a new
                request with the resolver and again return EINPROGRESS --
                in other words, we go to start of the dance.

              * If there are no more ipifs to bring up, then
                ill_up_ipifs() will return 0 and ip_arp_done() will
                complete the IPSQ operation.  If ipsq_current_ioctl
                indicates the request is associated with an ioctl, then
                ip_ioctl_finish() will be called to send up the ioctl
                response and finish the IPSQ operation.  Otherwise,
                ipsq_current_finish() will be called.

One subtle bit: ill_up_ipifs() currently takes an `mp' (traditionally the
pending ioctl), which is passed to ipif_up() and tracks the state of the
pending operation.  Right now, ill_set_phys_addr_tail() passes in the
DL_NOTIFY_IND that was cloned by ill_set_phys_addr() (and assigned to
either ill_phys_addr_mp or ill_nd_lla_mp) -- but this seems kind of
skanky.  Maybe I should allocate an `mp' specifically for the purpose?
Or maybe ipif_up() can get be recoded to get by without one?

I also had to add some kinda-ugly changes to ipif_up() to allow calls
from "lower queue" context.  Please look closely at these, because I'm
not convinced they're appropriate.

Thanks for your time and thoughts.
--
meem

Reply via email to