Meem,

Overall this looks pretty reasonable and extensible for initiating
ipif/ill quiescence based on async messages from below IP.

Some minor things below.

* ill_set_phys_addr(): At L23612 ip_if.c if the first copyb fails we
  free up the original 'mp' itself, and we have another free in the
  caller ip_rput_dlpi_writer at L15547 ip.c

* ipif_up(): After the call to ipif_ndp_up, we want to return if
  the error is EINPROGRESS, but we fall through. But Iam not sure
  about ipif_ndp_up behavior, it does not seem to return the EINPROGRESS
  I guess something related to Jim's DAD changes ?

* ip_reprocess_ioctl() The check at L25359 is redundant. ipsq_current_ipif
  is bound to be NULL here. So we can call ipsq_current_start directly. 
Right ?
 
One issue that needs some thought below.

Currently (onnv) we try to become exclusive as NEW_OP when we get a
DL_NOTIFY_IND, the assumption being that this is an asynchronous message
unrelated to the current operation (if any) in progress. But it is possible
that drivers may send up DL_NOTE_LINK_UP/DOWN during an interface bring up.
May be there are other NOTEs also that could happen during a bringup 
sequence.
In the current (onnv) code, such messages would just be queued up and not
processed till the bring up ioctl completes.  May be what is needed is here
is to actually examine the sub type in the DL_NOTIFY_IND and then determine
whether it is actually part of the current operation in progress or whether
it should be treated as something unrelated i.e. as a new operation.

Thirumalai

Peter Memishian wrote:

>[ This mail is primarily directed at Thiru, since AFAIK he's the only
>  one who truly understands the dank depths of the IPSQ code.  However, if
>  this approach proves feasible, it will ideally be used to elegantly
>  sidestep some really brutal race conditions in the new IPMP codebase.
>
>  If splitting headaches are your "thing", then you may enjoy trying to
>  make sense of all this ;-) ]
>
>Thiru,
>
>I believe I have modified the IPSQ framework to accommodate non-ioctl
>based quiescent access, even when operations that are traditionally
>handled in ioctl context (e.g., ill_up_ipifs()) need to be called. 
>It seems to be working well on my test machine (using ce, which supports
>DL_NOTE_PHYS_ADDR).
>
>Here's the proposed code:
>
>      http://atlantic.east/ws/meem/physaddr/webrev
>
>It also includes the related cleanup of the DL_PHYS_ADDR_REQ and
>DL_NOTE_PHYS_ADDR logic, and the removal of ill_lock where appropriate.
>(Alas, my new implementation has six more lines of code than the original,
>violating one of my big rules.  But the new code has a lot more comments,
>and fixes one panic, so I think that offsets it ;-)
>
>Here are the salient points:
>
>       * Although only DL_NOTE_PHYS_ADDR and the old M_ERROR/M_HANGUP
>         handling currently make use of the new logic, it has been
>         designed so that future DL_NOTE_*'s could easily operate on
>         quiescent ill's.  My plan is to use this to change fields that
>         have long been set-once but now occasionally need to be changed
>         on-the-fly in the new IPMP model (such as ill_phys_addr_length)
>         by injecting synthetic notifications onto ill_rq.
>
>       * Two new functions, ipsq_current_start() and
>         ipsq_current_finish(), are used to start and finish an
>         exclusive operation.  The existing ip_*_ioctl() routines
>         automatically call these as necessary -- but having them be
>         independent enables use outside of ioctl processing.
>
>       * The ipsq_current_ipif field still tracks whether a current
>         operation is in progress (and its associated ipif).  However, a
>         new ipsq_current_ioctl field is added to track whether the
>         operation is associated with an ioctl.  This field also enables
>         the special handling of SIOCLIFREMOVEIF to be moved inside
>         ipsq_current_finish(), and for the `ipif' argument to be removed
>         from ip_ioctl_finish().
>
>       * The M_ERROR/M_HANGUP special handling in ipsq_pending_mp_add()
>         and ipif_all_down_tail() has been removed/replaced with
>         appropriate calls to ipsq_current_start()/ipsq_current_end().
>
>For a DL_NOTE_PHYS_ADDR, the codepath is as follows:
>
>       1. DL_NOTE_PHYS_ADDR is processed in ip_rput_dlpi_writer().  The
>          thread is exclusive on the IPSQ but the ill is not necessarily
>          quiesced.
>
>       2. ill_set_phys_addr() is called to set the address.  This calls
>          ipsq_current_start() to start a new operation (not associated
>          with an ioctl), and calls ill_down_ipifs() to bring down all
>          IFF_UP ipifs (and generally try to quiesce the ill).  It then
>          checks to see if the ill has indeed quiesced, and if so
>          immediately calls ill_set_phys_addr_tail().  Otherwise, it
>          calls ipsq_pending_mp_add() to queue the operation once the
>          ILL_DOWN condition has been met.
>
>       3. If the ill didn't immediately quiesce, then processing is
>          deferred until the final reference is removed via
>          ipif_ill_refrele_tail(), which then demuxes based on the mblk_t
>          contents and calls ill_set_physaddr_tail().
>
>       4. Eventually, ill_set_phys_addr_tail() is called with the ill
>          quiesced.  It unpacks the mblks it needs (which were
>          preallocated by ill_set_phys_addr() so that we can fail fast
>          if necessary), sets the information on the ill, and calls    
>          ill_up_ipifs() to bring the ipifs back up.
>
>       5. If there were no up ipifs, then ill_up_ipifs() will return
>          zero, in which case ill_set_phys_addr_tail() simply calls
>          ipsq_current_finish() to complete the operation; when
>          ipsq_exit() is eventually called, the next operation will be
>          allowed to begin.
>
>       6. If there were up ipifs, then ill_up_ipifs() will return
>          EINPROGRESS.  At this point, ill_up_ipifs(), ipif_up(), and
>          ip_rput_dlpi_writer() or ip_arp_done() begin their traditional
>          dance -- e.g.:
>
>             * ill_up_ipifs() finds an ipif_was_up ipif, sets the
>               ill_up_ipifs flag and calls ipif_up() to bring it up.
>
>             * ipif_up() starts bringup, then enqueues a request with the
>               resolver (e.g., ipif_resolver_up()) and returns
>               EINPROGRESS (which is in-turn returned by ill_up_ipifs()).
>
>             * Eventually, the resolver request is handled and we end up
>               in ip_arp_done().  It sees that ill_up_ipifs is set, and
>               rather than completing the operation, calls *back* to
>               ill_up_ipifs() to bring up another ipif.
>
>             * If there are more ipifs to bring up, then ill_up_ipifs()
>               will again call ipif_up() which will again enqueue a new
>               request with the resolver and again return EINPROGRESS --
>               in other words, we go to start of the dance.
>
>             * If there are no more ipifs to bring up, then
>               ill_up_ipifs() will return 0 and ip_arp_done() will
>               complete the IPSQ operation.  If ipsq_current_ioctl
>               indicates the request is associated with an ioctl, then
>               ip_ioctl_finish() will be called to send up the ioctl
>               response and finish the IPSQ operation.  Otherwise,
>               ipsq_current_finish() will be called.
>
>One subtle bit: ill_up_ipifs() currently takes an `mp' (traditionally the
>pending ioctl), which is passed to ipif_up() and tracks the state of the
>pending operation.  Right now, ill_set_phys_addr_tail() passes in the
>DL_NOTIFY_IND that was cloned by ill_set_phys_addr() (and assigned to
>either ill_phys_addr_mp or ill_nd_lla_mp) -- but this seems kind of
>skanky.  Maybe I should allocate an `mp' specifically for the purpose?
>Or maybe ipif_up() can get be recoded to get by without one?
>
>I also had to add some kinda-ugly changes to ipif_up() to allow calls
>from "lower queue" context.  Please look closely at these, because I'm
>not convinced they're appropriate.
>
>Thanks for your time and thoughts.
>--
>meem
>  
>


Reply via email to