Thiru,

Many thanks for your insightful comments, and the alternative radix tree locking scheme that you provided. It is much appreciated.

Attached is a txt file that contains all your comments and our responses. Please go over them. Also we need clarification on these:

thiru-10a, thiru-30, thiru-31, thiru-33

Sangeeta
Comments thiru-1a through thiru-4 are related to radix tree lookup/add/walk/delete locking
         The add, delete, lookup, and walk need a mutually consistent locking 
         mechanism. A suggested scheme is given in the attachment.

RESP: thiru-1a through thiru-4 will be addressed via implementation of the
      suggested locking scheme.
   
thiru-1a ire_add_v4 -> ire_get_bucket -> returns &rt->rt_irb after dropping the
         global radix node lock. Now the radix node is not protected henceforth
         and can disappear at any time before the ire is actually linked into the
         radix node. To see how this can happen, consider the following sequence.
  
  T1. route add dst = 192.168/16, ipif = bge0
  Thread T1 starts off, creates and links the radix node for 192.168/16
  into the tree and returns the irb. Let us suspend this thread at
  L3266 ip_ire.c. 

  T2. route add dst = 192.168/16, ipif = bge1

  Now T2 starts off, tries to add a duplicate (route, netmask) pair, fails,
  reuses the radix node that T1 added, links the ire into the irb,
  and runs to completion.
 
  T3. route delete dst = 192.168/16, ipif = bge1
  T4. netstat walker on the above bucket.

  T4 walks the above bucket. Now T3 -> ire_delete locates the ire and removes
  the radix node from the tree and marks the irb as DEAD and finishes the
  deletion actions. The walker finishes eventually and IRB_REFRELE
  ends up deleting the radix node.
 
  Now T1 resumes and finds a freed radix node. 

RESP: ACCEPT Will try alternative locking scheme that you suggested

thiru-1b. (We had to consider T4 a walker thread due to another bug.
          The radix node free only happens in IRB_REFRELE. If there
          is no walker during the deletion of an ire, the radix node
          is leaked.)

RESP: ACCEPT Will try alternative locking scheme that you suggested

thiru-2. In rn_walktree(), the global radix lock is held while doing
         the traversal.  But that is not sufficient. After dropping the
         lock at L1129 radix.c the access to the node in the block
         in L1121 - L1134 is unprotected and could be accessing freed
         memory. Similarly, we get a pointer to the next radix node at
         L1124 and then use it much later at L1139, even though we have
         dropped the lock in the interim. The walker callbacks rtfunc and
         ire_ihandle_onlink_match do IRB_REFHOLD to walk the list, but there is
         nothing to prevent the list from vanishing even before IRB_REFHOLD
         is called or even before the callbacks are called.

         In both 1. and 2. we need to increment a refcount on the radix node
         before dropping the global radix tree lock. (could use the
         irb_cnt as the refcount under the irb_lock).

RESP: ACCEPT Will try alternative locking scheme that you suggested
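
      For reference, the refcount suggestion at the end of thiru-2 is the
      classic hold-before-drop pattern; a generic, user-level illustration
      follows (pthreads; not the actual radix.c/ip_ire.c code, where the
      count would be irb_cnt taken under irb_lock):

      #include <pthread.h>
      #include <stddef.h>

      typedef struct node {
              pthread_mutex_t n_lock;
              int             n_refcnt;       /* node cannot be freed while > 0 */
      } node_t;

      static pthread_rwlock_t tree_lock = PTHREAD_RWLOCK_INITIALIZER;

      /*
       * Look the node up under the global tree lock and pin it with a
       * reference before that lock is dropped, so it cannot vanish
       * underneath the caller.  'find' stands in for the tree traversal.
       */
      node_t *
      lookup_and_hold(node_t *(*find)(void))
      {
              node_t *n;

              (void) pthread_rwlock_rdlock(&tree_lock);
              n = find();
              if (n != NULL) {
                      (void) pthread_mutex_lock(&n->n_lock);
                      n->n_refcnt++;          /* pin before unlock */
                      (void) pthread_mutex_unlock(&n->n_lock);
              }
              (void) pthread_rwlock_unlock(&tree_lock);
              return (n);                     /* caller drops the ref later */
      }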


thiru-3a In ire_delete(), we check for irb_ire_cnt of 1 and if so we remove the
         radix node from the tree. irb_ire_cnt is protected by the irb_lock and
         it can increase now if another thread is in ire_add_v4. Note that the
         thread in ire_add_v4 is not holding the global radix lock, but just
         the irb_lock, while it is adding the ire to the irb list.

RESP: ACCEPT Will try alternative locking scheme that you suggested

thiru-3b Another issue is that the radix node deletion may never happen.
         Consider that we have 2 ires in the bucket ire1, ire2 and consider
         2 threads T1 trying to delete ire1, and T2 trying to delete
         ire2 simultaneously.
   
   T1 -> ire_delete(ire1), grabs radix tree lock, finds irb_ire_cnt is 2
   and does not remove the radix node from the tree. Drops the radix tree lock
   and calls ire_delete_common(ire1). Let us suspend this thread now.

   T2 -> ire_delete(ire2), grabs radix tree lock, finds irb_ire_cnt is 2 and
   does not remove the radix node from the tree. Drops the radix tree lock
   and calls ire_delete_common(ire2). Proceeds to completion.
   Now let T1 also complete. Both ires are deleted, but radix node remains.

RESP: ACCEPT Will try alternative locking scheme that you suggested

thiru-3c The other issue is that if the radix node that is being
         removed from the tree is the 'next' in the tree walker, then
         even if the node itself is not freed (due to non-zero irb_cnt due
         to other walkers) the walk will be truncated, and that is bad. A
         truncated walk may cause unplumb to hang.

RESP: ACCEPT Will try alternative locking scheme that you suggested

thiru-3d L3857 ip_ire.c should never happen. 

RESP: ACCEPT


thiru-4  The last thread that brings the irb_refcnt to 0 frees the radix node
         if it has been removed from the tree. However, threads that located
         the ires of this node before the ires were deleted may still be
         using them, and may access ire_bucket, which would then access freed
         memory. Existing code assumes a non-null ire_bucket is always valid.
         Thus the radix node free must happen only when all the ires have been
         inactive'd (ire_inactive).

RESP: ACCEPT

thiru-5a. In the ND_INITIAL state the nce->nce_res_mp is a template or copy of
          the ill_resolver_mp. After successful resolution the nce_res_mp
          mutates to a dl_unitdata_req. This can lead to confusion since the
          type of mp is known only while holding nce_lock. Consider the
          following sequence with two threads running simultaneously: T1 in
          the forwarding path and T2 in the host path.

   T1.1 ire_forward -> ire_create -> ire_add creates and adds an incomplete ire.
        ire->ire_nce is in  ND_INITIAL state.
   T2.1 ip_newroute -> hits on above nce -> sends arp request resolution.
   T1.2 ire_forward -> ip_rput_process_forward -> ip_rput_forward -> ip_xmit_v4
        grabs nce_lock, checks the nce_state, changes it to ND_INCOMPLETE,
        drops nce_lock 
   T2.2 arp resolution initiated in T2.1 completes and nce goes to ND_REACHABLE
   T1.3 ip_xmit_v4 -> ire_arpresolve would end up sending a bad request to
        ARP. The nce_res_mp is now a dl_unitdata_req (after arp resolution)
        and not an ill_resolver_mp.

   i.e. essentially the nce is resolved in between L27091 and the execution
   of ire_arpresolve causing the problem.

RESP: ACCEPT 

thiru-5b To get the state right, ip_newroute() must transition the
         nce to ND_INCOMPLETE before sending the resolution to ARP.

RESP: ACCEPT
      We will change the code such that ip_newroute and ire_arpresolve
      will always get their dl_unitdata_req from the ill_resolver_mp. The only
      functions that actually get/set nce_res_mp will be:
             ire_arpresolve() and ip_wput_nondata()

thiru-5c. Since 1 thread in the forwarding path, and multiple threads
          in the host path can attempt to simultaneously resolve an nce,
          the nce_res_mp may be null and we have code to get it from
          the associated ill_resolver_mp in case it is null. L7714  ip.c,
          L6489 ip_ire.c. 

          It looks like we would be better off avoiding this mutation
          of nce_res_mp: leave it null to start with, and install
          nce_res_mp only after the nce is resolved. There is a comment at
          L6560 ip_ire.c which is somewhat similar.

RESP: ACCEPT

thiru-5d. The ire_nce must be set atomically under the nce_lock, while making
          sure it is not yet CONDEMNED, and be added to the table which in turn
          requires it to  be under ire_atomic_start. (Please see the way it
          is done in IPv6, especially the comments in L926, ip6_ire.c onnv
          gate, and the comments L345 - L348 ip_ndp.c).  You should be able
          to do it for IPv4 the same way. The block comment at L926 ip6_ire.c
          should explain it.

RESP: ACCEPT
      The ire_nce assignment for CACHE and BROADCAST ires should
      happen in ire_add_v4.

thiru-5e Surya should create IRE_CACHE for nexthop for the IF_NORESOLVER
         case to be consistent with V6

RESP: ACCEPT

thiru-6  In IPv6, only ire cache IREs have a non-null ire_nce. In IPv4,
         interface ires also have a non-null ire_nce. Why the dissimilarity ?
         Do you propose to extend IPv6 interface IREs also to have an ire_nce ?
          Look at L4881 ip6.c, L5816 ip6.c. This differs from
          L7473 ip.c, L7701 ip.c. In one case we have to call ill_dlur_gen
         and build it, while in the other we can do a copy from the
         ire->ire_nce->nce_res_mp.

RESP: Only IRE_CACHE, IRE_BROADCAST and Mobile IP ires need nce_t, and
      ip_newroute will always find the dl_unitdata_req from ill_resolver_mp.
      There should be no nces for the IRE_INTERFACE.  


      Line 7473 is totally unnecessary, as  discovered from punchin testing
      with Dan's help: the ndp_lookup_then_add will anyway discover
      ire->ire_nce->nce_res_mp as part of the nce_lookup_addr, so all that I
      will end up doing is to copyb and freeb in the process.

      Line 7701:
      Currently we are just trying to obsolete ire_dlureq_mp
      (which holds the template for the DL_UNITDATA_REQ for IPv4).
      You are correct in that we could just always use res_mp for
      this purpose.


thiru-7. esballoc/ire_freemblk should be removed. 
   The mblk callback for freeb (ire_freemblk) and esballoc logic are now
   highly avoidable (and convoluted) since the original design goal of the
   mblk holding a ref on the nce is no longer met. The attempt to access
   IP structures and drop nce references in ire_freemblk, when the mblk does
   not hold any ref on the nce is dangerous. Mblks can travel outside of IP,
   and any other entity that comes across this mblk and frees it will cause
   corruption.  Eg. Consider the following scenario.

   Assume unplumb starts and an arp request sent up ends up on the streamhead
   since ARP has been popped off. Now the comment at L7746 ip.c is this
   scenario. The mblk (arp request and ire mblk) is queued at the streamhead.
   IP finishes the unplumb, since there are no refs from the mblk to the nce.
   IP frees up all its structures. Now the close returns and the streamhead
   frees up the mblk causing ire_freemblk to be executed. Now any access to
   the ill at L6639 is accessing freed memory.

   The solution is to get rid of this esballoc and free. The essential part
   of ire_freemblk must be called from IP on getting back an arp reply.
   If the mblk is freed by ARP or the streamhead, it is fine, since there
   are no extra refs on any nce, thus any unplumb will go through. 
   All blocks of code that do freeb(ire->ire_mp) will need to be revisited and
   redone after removing ire_freemblk. An alternative is to add an nce
   parameter to ire_add_then_send(). This way the mblk ire's ire->ire_nce
   can always be null.

   ire_freemblk: This function is unstructured and convoluted.
   i. The argument name 'fake_ire' is inappropriate since it can
   be either fake_ire or the ire created through ire_create_mp. 
   ii. This function does not do any useful work in the normal successful
   case of arp resolution. 
   iii. In ire_freemblk() we need to locate the nce. Instead we try to locate
   the ire cache and then the nce from the ire cache. This is incorrect,
   since an ire cache may not exist and we won't do the cleanup. This happens
   in the case of ip_newroute initiated arp request. Instead we should use
   ndp_lookup to locate the nce.
   iv. If the arp resolution fails, we want to call ndp_delete(), i.e.
   logically in the block at L6699, deleting all associated ires since
   link layer resolution failed.
   v. If this ire is marked IRE_MARK_NOADD, why do you want to delete other
   IREs tied to the same nce ? 

RESP: DEFER the removal of esballoc/ire_freemblk until the ARP-IP merge.
      We have had offline conversation and have agreed that we should continue
      using a simplified and well commented version of ire_freemblk/esballoc to
      handle cleaning up of incomplete ires. This will take care of 3rd
      party ARP, and exception cases in Solaris ARP where mblk is silently
      dropped at ARP resolution failure without IP notification

      In future the ARP-IP merge can replace the esballoc/ire_freemblk logic
      with ARP directly cleaning up incomplete ires (nces) as
      part of ARP failure handling, when ARP is part of IP.

      We do accept your observation that the current ire_freemblk code is quite
      convoluted and needs simplification and comments.

      The ire_freemblk/esballoc scheme will take care of cleaning up incomplete
      ires (and their nce data structures). In addition, we need to time out
      stale complete nces. Refer to the response to thiru-12 for a detailed
      explanation of that part.
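
      For context, the mechanism being retained works roughly as follows (a
      simplified sketch of how ire_create_mp attaches ire_freemblk via
      esballoc(9F); the ire_frtn field name and the exact arguments are
      assumptions, not a quote of the actual ip_ire.c code):

              /*
               * The frtn_t has to outlive the mblk, so it is assumed to live
               * in the ire itself (ire_frtn is a hypothetical field name).
               */
              ire->ire_frtn.free_func = ire_freemblk;
              ire->ire_frtn.free_arg = (caddr_t)ire;
              mp = esballoc((uchar_t *)ire, sizeof (ire_t), BPRI_MED,
                  &ire->ire_frtn);

              /*
               * From here on, whoever does the last freeb(mp) -- IP, ARP, or
               * the stream head during close -- runs ire_freemblk(ire).  That
               * is exactly why thiru-7 worries about the callback touching IP
               * state that an unplumb may already have torn down.
               */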
 
thiru-8  ip_wput_nondata:
   Calling ire_nce_init at L27105 is convoluted; all we need to do when an
   arp response arrives is to locate the nce by doing an ndp_lookup using
   the ire->ire_addr and the ill on which the arp response arrives.

RESP: ACCEPT

thiru-9 ip_wput_ire: label noprepend: can be deleted

RESP: ACCEPT

thiru-10a ipfil_sendpkt: If ipfil_sendpkt -> ire_route_lookup happens for a
   destination that is onlink ire_route_lookup() would return an IRE_INTERFACE
   and sire would be null. The check at L731 in ire_forward must check whether
   supplied_ire is null to determine the caller.

RESP: CLARIFY. You are correct in that L731 in ire_forward is incorrect. But
      sire may be null (for an onlink host) and non-null for an offlink host.
      So should L731 just check for a non-null ire (and not check sire at all)
      to determine that the caller of ire_forward is ipfil_sendpkt instead?

thiru-10b At L1283 of ipfil_sendpkt(), if ire_type is IRE_IF_NORESOLVER/IRE_CACHE,
          then we end up calling ip_xmit_v4 at L1317 with a null value for ire.

RESP: ACCEPT
      To fix this we should do this: 
        1285                  case IRE_CACHE:
        1286                          if (sire != NULL) {
        1287                                  UPDATE_OB_PKT_COUNT(sire);
        1288                                  sire->ire_last_used_time = lbolt;
        1289                                  ire_refrele(sire);
        1290                          }
        +                             ire_cache = ire;
        1291                          break;
 

thiru-11 ip_arp_news() must delete the affected nces, otherwise
         gratuitous arp will be broken.

RESP: ACCEPT
      DAD touches ip_arp_news.  When an ace_t times out, DAD checks to see if
      there is an ire cache table entry that contains this ace_info. If so
      it marks the ace_t with ACE_F_OLD and keeps it around till the ire
      cache entry is deleted.  Link-layer info of ace_t's marked with ACE_F_OLD
      is not used to supply AR_ENTRY_QUERY responses. Instead, when a request
      is triggered and ARP finds that there exists a corresponding "old" ace_t,
      it sends out an ARP request on the wire to get updated info for
      this old ace_t.

      In order to make this new (non-deleting) logic work, DAD now has to
      delete ace_t entries when IP is no longer interested in them.  To do
      this, DAD just has ARP run a timer (which it's doing anyway for address
      defense and other things).  If the timer goes off and the ire_t is gone,
      then delete the ace_t because IP's no longer interested.
      That latter bit happens in ar_wsrv, at around lines 4167-4189.


      If DAD integrates before Surya:
        - ire_delete_hw_changed will have to
          check the nce_state and bail out if the entry is ND_INITIAL or
          ND_INCOMPLETE
        - ar_wsrv will have to check if the ire obtained from ire_cache_lookup
          is ND_INCOMPLETE or not before deciding to either blow away
          the old ace_t or mark it with ACE_F_OLD

     If Surya integrates before DAD, we have to make sure that:
          - ire_delete_cache_gw checks nce_state and does ndp_delete

thiru-12 The NCEs need to be periodically deleted. V6 has a protocol
         to transition NCEs through the reachable, delay, probe, and
         unreachable states and delete them. Given that V4 does not have
         anything, we must at least have some mechanism to delete the NCEs
         periodically.
         Both arp and ire caches have periodic timers to delete the
         entries, but v4 NCEs have nothing. So this is a regression.

RESP: The cleanup of incomplete ires (and their nces) will be done via
      esballoc/ire_freemblk.
      In addition we need a way to time out complete nces. We need to
      add code to the following functions:
        o ip_arp_news - this logic will update affected ires (and their nces)
                        when a link goes up/down and ARP notifies IP
                        via AR_CN_ANNOUNCE

        o In ire_inactive(), we should add code such that when one
          comes across an nce with a refcnt of 1 (i.e. there are no ires
          associated with this nce), it is marked ND_STALE. Then
          if ndp_lookup_addr() comes across an ND_STALE nce, it should
          delete the nce and return NULL to the caller. This will
          cause the caller to trigger a new ARP resolution (a sketch of
          this follows below).

        o ip_trash_ire_reclaim - this logic will clean up old ires and
          ire-less nces (i.e. nces with the ND_STALE flag that are sitting
          around in the V4 nce table) when there is a memory pinch
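
      A rough sketch of the mark-and-reap idea from the ire_inactive /
      ndp_lookup_addr bullet above (a toy user-level model: ND_STALE comes
      from this response, but the struct layout, the meaning of a refcnt
      of 1, and the single-list table are simplifications, not the real
      ip_ndp.c code):

      #include <stdlib.h>

      typedef enum { ND_REACHABLE, ND_STALE } nce_state_t;

      typedef struct nce_s {
              struct nce_s    *nce_next;
              unsigned int    nce_addr;       /* stand-in for the IPv4 address */
              int             nce_refcnt;     /* table ref + one per ire */
              nce_state_t     nce_state;
      } nce_t;

      static nce_t *nce_list;                 /* stand-in for the V4 nce table */

      /* Called when an ire that referenced this nce goes inactive. */
      static void
      nce_ire_inactive(nce_t *nce)
      {
              if (--nce->nce_refcnt == 1)     /* only the table's ref remains */
                      nce->nce_state = ND_STALE;
      }

      /* Lookup that reaps stale entries so the caller triggers a fresh ARP. */
      static nce_t *
      ndp_lookup_addr_model(unsigned int addr)
      {
              nce_t **prevp = &nce_list;
              nce_t *nce;

              for (nce = nce_list; nce != NULL;
                  prevp = &nce->nce_next, nce = nce->nce_next) {
                      if (nce->nce_addr != addr)
                              continue;
                      if (nce->nce_state == ND_STALE) {
                              *prevp = nce->nce_next; /* unlink and free */
                              free(nce);
                              return (NULL);          /* caller re-resolves */
                      }
                      return (nce);
              }
              return (NULL);
      }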

thiru-13a Rename
        G_LOCK_* to NDP_G_LOCK_*,
        G_WALKER to NDP_G_WALKER
        G_EXIT to NDP_G_EXIT
        v4ll_g_lock to ndp_g_lock_v4, 
        ndp_g_lock to ndp_g_lock_v6.
        ndp_g_walker_cleanup to ndp_g_walker_cleanup_v6
        v4ll_g_walker_cleanup to ndp_g_walker_cleanup_v4

RESP: ACCEPT.
      jdc-174 suggests reorganizing this area

thiru-13b   Please remove the reference to ire_dlureq_mp/fp_mp at L62 ip_ndp.h

RESP: ACCEPT 

thiru-14 ip_ndp.c delete L237 conditional, ill is known to be v6.

RESP: ACCEPT  ( also refer to jdc-177 and DM-7)

thiru-15 Please delete the ndp_walk_impl() and expand inline callers
         to call both ndp_walk_v4 and ndp_walk_v6().  (There aren't any
         other functions in ip_ndp.c with the _impl_ prefix/suffixes).

RESP: ACCEPT  (will be done as implementation of jdc-174, thiru-13a,b)

thiru-16 ip_if.c L6428 is useless. ire_nce is known to be NULL here. Since
         ipif_net_type is IRE_LOOPBACK, the rfq, stq passed in to ire_create at
         L6394 is null and hence ire_nce must be null.

RESP: ACCEPT

thiru-17 nce_fp_mp and nce_res_mp are protected by the nce_lock in the
         case of v6.  ire_dlureq_mp and ire_fp_mp were protected by
         ire_lock, but now that they are in the nce, shouldn't they be
         protected by nce_lock? This issue is only in the case of broadcast
         ires. For the others we don't need a lock.

RESP: ACCEPT

thiru-18. I think we should remove the ire_fastpath* functions and
          switch to using nce_fastpath functions, now that we have
          moved to putting the fastpath in the nces. At least please
          study if there are any issues.

RESP: ACCEPT

thiru-19 IRB_R_ENTER, IRB_W_ENTER, IRB_RW_EXIT. Either all code should uniformly
   use these macros, or all code should just call rw_enter directly. These
   macros don't seem to be particularly useful.

RESP: ACCEPT


thiru-20 In ire_get_bucket(), at L1056 rn_match will return the most
         specific route, which may not match the mask of the route we
         are actually trying to add.  We need to search the dupedkey
         list and get the node corresponding to our mask or call
         rnh_match_args instead.

RESP: ACCEPT

thiru-21 ire_get_bucket() assignment at L1047 is not required. This
         actually happens in rn_newpair() at L495 radix.c. The rdst variable
         and its initialization at L1030 - L1034 are not needed; instead
         rt->rt_dst can be directly initialized at L1048.

RESP: INVESTIGATE  Will see if the above suggestion provides any perf benefits.

thiru-22 ire_find_best_route() - Should not loop across the dupedkey chain.
         That loop happens in rn_match_args().

RESP: REJECT
      This is needed when, for example, we have an interface route
      with a /26 mask, and we do

    # route add 192.168.81.0/24 192.168.81.1

       Both routes are saved under the same radix node.  When we do
       ire_ftable_lookup we don't pass a mask to rn_match, so it
       will only look at the head of the dupedkey list for a match. So,
       if we now try to delete
           # route delete 192.168.81.0/24 192.168.81.1
       the delete will fail (because ire_match_args is only called for
       the /26 route, with your suggestion).

thiru-23 L58 down ire_ftable_args_t. Please always use a 1 to 3 character
         prefix for the structure members, e.g. ifa_addr, ifa_mask, etc.
         instead of addr, mask, etc. This will facilitate cscope searches
         for, say, assignments to ifa_mask.

RESP: ACCEPT
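
      For illustration, the renaming would look roughly like this (only a
      couple of members are shown and the _s struct tag is a guess; the real
      ire_ftable_args_t has more members, all of which would get the same
      ifa_ prefix):

      typedef struct ire_ftable_args_s {
              ipaddr_t        ifa_addr;       /* was: addr */
              ipaddr_t        ifa_mask;       /* was: mask */
              /* ... remaining members get the same ifa_ prefix ... */
      } ire_ftable_args_t;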

thiru-24 The assert at L179 ip_ftable.c in ire_ftable_lookup is bogus.
         A thread could be holding a pointer to the ire and could call
         ire_delete() any time after we look up the ire and drop the
         irb bucket lock, which would end up setting
         IRE_MARK_CONDEMNED.

RESP: ACCEPT

thiru-25 ire_ftable_lookup() - Default route support

        The support and distinction between the IRE_DEFAULT and
        IRE_INTERFACE for default routes seems to have been removed. Why ?
        The onnv MATCH_IRE_DEFAULT logic would prefer IRE_DEFAULT, if no
        IRE_DEFAULT is available only then would it return IRE_INTERFACE.
        Among IRE_DEFAULTs it round robins, but not among IRE_INTERFACE.
        ire_round_robin()'s behavior is different from the above. 

        At L185, there is a check for ire_type of IRE_DEFAULT and only then
        ire_round_robin is called. This behavior is also very different
        from onnv.  The default route list could consist of a list of
        IRE_INTERFACE and IRE_DEFAULT in no particular order, and what
        you encounter at the start of the list is not predictable.

RESP: ACCEPT We will restore onnv behaviour

thiru-26 ip_ire_delete() - same comments as above.

    I am not sure what is the intent of these changes 25, 26. If these are
    unintentional you should restore the onnv behavior.

RESP: ACCEPT We will restore onnv behaviour


thiru-27. The code in rn_match_args L382 down should be as follows.

                /*
                 * Although we found an exact match on the key, rn_leaf_fn
                 * is looking for some other criteria as well. Continue
                 * looking as if the exact match failed.
                 */
                b = 0;
                goto keeplooking;

        }
keydiff:
        test = (*cp ^ *cp2) & 0xff; /* find first bit that differs */
        for (b = 7; (test >>= 1) > 0; )
                b--;
keeplooking:;
        matched_off = cp - v;
        b += matched_off << 3;
        rn_bit = -1 - b;

         The search key and the route matched completely in all bits up to
         the netmask of the route, otherwise we would have gone to
         'keydiff'. So the matched offset is really cp - v. The parent's
         rn_offset represents the byte in which the search key is being
         tested, and that is not relevant.

RESP: INVESTIGATE - will investigate the optimization suggestion.

thiru-28 rn_lookup(): Multiple threads executing in rn_lookup() -> rn_addmask()
   can corrupt the addmask_key and last_zeroed variables which are global.
   The 'addmask_key' is only a scratch variable and can be made a stack
   variable. What is the logic doing with 'last_zeroed' ? That needs to
   be understood before you can decide what to do with it.

RESP: ACCEPT
      last_zeroed is just a one-behind fastpath var
      (used to zero out the trailing bits in the key passed to rn_mask).
       As you point out, both can be made into stack vars, and the
       one-behind feature of last_zeroed can be abandoned.

thiru-29 Why do we need MKGet and MKFree to be anything more than calls to kmem*
   i.e. we don't need another level of internal buffer management.

RESP: ACCEPT ( also refer to jdc-14)

thiru-30 Given the function pointer table rnh_* for the
         lookup/add/delete operations, it is not good to also have
         direct calls to parallel lookup functions that are not listed
         in the table. rn_match is the table driven function.  But
         rn_match_args is used privately by IP. One solution is to add
         another operation say rnh_matchaddr_args to the function table. 

RESP: CLARIFY What would the extra level of indirection buy us?
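
      For reference, the suggested table entry would look roughly like the
      following (the surrounding fields follow the BSD-derived
      radix_node_head; the parameter list of the new hook is an assumption
      modelled on rn_match_args and may not match the actual prototypes in
      radix.h):

      struct radix_node;                      /* node type, opaque here */

      struct radix_node_head {
              struct radix_node *rnh_treetop;
              /* ... other rnh_* hooks (addaddr, deladdr, walktree, ...) ... */
              struct radix_node *(*rnh_matchaddr)             /* rn_match */
                      (void *v, struct radix_node_head *head);
              /* proposed: table entry for the currently private rn_match_args() */
              struct radix_node *(*rnh_matchaddr_args)        /* rn_match_args */
                      (void *v, struct radix_node_head *head,
                      int (*rn_leaf_fn)(struct radix_node *, void *),
                      void *rn_arg);
      };

      rn_inithead() would then point the new member at rn_match_args, and
      ip_ftable.c would call through the table instead of calling
      rn_match_args() directly; presumably the gain is keeping radix.c's
      internals behind the rnh_* interface rather than anything functional.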

thiru-31 ire_forward: The block at L876 for the IRE_CACHE is not very
         useful. If the interface route is deleted the ire cache will
         also be eventually deleted through the ihandle linkage. Instead
         we should return much earlier in block L794 down, irrespective
         of whether the nce_state is reachable or not.

RESP: CLARIFY.
      Are you suggesting that the ire_ihandle_lookup_offlink() at L882 is
      useless or that it should be moved up somewhere else?

thiru-32 ip_xmit_v4: L26978. You should enqueue at the tail to preserve message
         ordering.  i.e. call nce_queue_mp_common.

RESP: ACCEPT ( refer to jdc-79)

thiru-33 It is good to avoid dropping packets in the outbound side and instead
   queue them and enable flow control if canputnext fails. In ip_wput_ire()
   we check for canputnext at label blocked: and subsequently just do putnext.
   Now the replacement of putnext with ip_xmit_v4, causes another check for
   canputnext and then drop the packet. Instead ip_xmit_v4() must be passed
   another boolean parameter. If this is true, we do the canputnext check
   in ip_xmit_v4 and drop the packet (eg. forwarding path). If it is false
   we just call putnext. (eg. ip_wput_ire path). 

RESP: CLARIFY
      There are two calls to ip_xmit_v4() in ip_wput_ire(): one is at L27105,
      before the "blocked" label at L20804; the other is after it, at L21118.
      Furthermore there is canput logic in ip_xmit_v4() itself at L27027.
      Are you suggesting the following changes?

      o We add another boolean arg (let's call it "canputnext_check") to
        ip_xmit_v4() and change the code in ip_xmit_v4() such that if this
        new arg is B_TRUE we do the canputnext check at L27027, and if it is
        B_FALSE we skip the canputnext check and just do a putnext.

      o Have all callers of ip_xmit_v4() pass B_TRUE *except* the two
        callers of ip_xmit_v4() from ip_wput_ire() (L27105, L21118).

      I suppose this way we won't be doing a double canput check in
      ip_wput_ire -> ip_xmit_v4, but other than the redundancy, is there a
      flaw in doing the canput check twice?


thiru-34 label send_pkt: at L26985 can be deleted.

RESP: ACCEPT 

thiru-35 default: the default case at L27100 must be an ASSERT(0);

RESP: ACCEPT  (refer to jdc-86)

thiru-36 The checks at L18033 tcp.c, L6071 udp.c,  L135 ip_impl.h
         related to the nce must be factored out into a macro and
         called from each of the above places to avoid duplication.
         The check can also be simplified as :

                (ire->ire_nce && ire->ire_nce->nce_fp_mp &&
                    MBLKL(ire->ire_nce->nce_fp_mp) <= MBLKHEAD(mp)) 

         We can avoid the state check because nce_fp_mp can be non-null
         only in ND_REACHABLE state (for IPv4).

RESP: ACCEPT Will use a macro instead. 
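
      One possible shape for that macro (the name IRE_FP_PREPEND_OK and
      placing it in ip_impl.h are our choices, not from the review; the
      predicate itself is the one quoted in the comment above):

      #define IRE_FP_PREPEND_OK(ire, mp)                                      \
              ((ire)->ire_nce != NULL && (ire)->ire_nce->nce_fp_mp != NULL && \
              MBLKL((ire)->ire_nce->nce_fp_mp) <= MBLKHEAD(mp))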


========================================================
Thiru's Suggested Alternative Locking Scheme Proposal
========================================================

Lock order: Radix tree lock, irb_lock.

Requirements:
i.  Walker must not hold any locks during the walker callback.
ii  Walker must not see a truncated tree during the walk because of any node
    deletion.
iii Existing code assumes ire_bucket is valid if it is non-null and is used
    in many places in the code to walk the irb list. Thus even if all the
    ires in a bucket have been deleted, we still can't free the radix node
    until the ires have actually been inactive'd (freed).

Tree traversal - Need to hold the global tree lock in read mode.
Before dropping the global tree lock, need to either increment the ire_refcnt
or the irb_refcnt. Either ensures the radix node can't be deleted.

Tree add - Need to hold the global tree lock in write mode to add a radix node.
To prevent the node from being deleted, increment the irb_refcnt, after the
node is added to the tree. The ire itself is added later while holding the
irb_lock, but not the tree lock.

Tree delete - Need to hold the global tree lock and irb_lock in write mode.
All associated ires must be inactive (i.e. freed), and irb_refcnt must be zero.

Walker - Increment irb_refcnt before calling the walker callback. Hold the
global tree lock (read mode) for traversal.

/*
 * Lookup:
 *
 * Hold radix tree read lock and do tree traversal
 * Locate radix node of interest
 * Grab bucket lock and traverse list of ires  
 * IRE_REFHOLD the ire of interest
 * Drop all locks, return ire.
 *
 * The radix node can't vanish until all associated ires
 * become inactive. Thus IRE_REFHOLD is an indirect hold
 * on the radix node.
 */
ire_find_best_route(radix node rn)
{
        grab bucket lock 
        list traversal, locate the ire of interest
        if found an ire of interest
                IRE_REFHOLD(ire)
        drop bucket lock        
        return (found ire);
}

ire_ftable_lookup()
{ 

        RADIX_NODE_HEAD_RLOCK;
        /* ire_find_best_route returns a REFHELD ire  */
        (rn, ire) = do tree traversal, call ire_find_best_route  

        RADIX_NODE_HEAD_UNLOCK;
        
        if (default route)
                do the loop and get the right ire in this bucket.

        return (ire);
}
 
/*
 * Add: Need radix tree write lock to add a new radix node to the tree.
 * ire addition to the irb list itself needs only irb lock and does not
 * need radix tree lock.
 */
ire_add_v4()
{ 

        /* The following in ire_get_bucket() */

        RADIX_NODE_HEAD_WLOCK
        rn = add new radix node
        if (rn == NULL) {
                /* The node already exists. So we locate the existing node */   
                rn = rn_match_args();
                if (rn == NULL) {
                        RADIX_NODE_HEAD_UNLOCK
                        return (failure);
                }
        }
        /*
         * Prevent the radix node from vanishing by incrementing irb_cnt.
         * The irb_cnt essentially serves as a refcnt on the radix node. The
         * radix node can't vanish until the irb_cnt drops to zero.
         */
        IRB_REFHOLD(rn);
        RADIX_NODE_HEAD_UNLOCK

        /* The following in ire_add_v4() */
        ....
        link the ire into the list
        if (ire->ire_type & IRE_FORWARDTABLE)
                irb->irb_ire_outstanding_cnt++;
        ...
        IRB_REFRELE(rn);
}
        
/*
 * Delete: Need irb_lock and zero irb_cnt (no walker) to remove ire from
 * irb list. Need radix tree write lock, irb_lock, zero irb_cnt, and
 * zero irb_ire_outstanding_cnt to delete the radix tree node. 
 */
ire_delete() - same as onnv, no changes (except for replacing
IRE_MARK_CONDEMNED by IRB_MARK_CONDEMNED)

ire_inactive()
{
end:
        /* is the ire anchored off the radix tree */
        if (ire->ire_type & IRE_FORWARDTABLE && v4ire) {
                irb = ire->ire_bucket;
                if (irb != NULL) {
                        rw_enter(&irb->irb_lock, RW_WRITER);      
                        irb->irb_ire_outstanding_cnt--;
                        /*
                         * Instead of examining the conditions for freeing
                         * the radix node here, we do it by calling
                         * IRB_REFRELE which is a single point in the code
                         * that embeds that logic. Bump up the refcnt to
                         * be able to call IRB_REFRELE
                         */
                        irb->irb_refcnt++;
                        rw_exit(&irb->irb_lock);
                        IRB_REFRELE(irb);
                }
        }

        ...

}                       
                        
        
/*
 * This assumes only radix node related irb. Please make any needed changes
 * to make this function usable for other IRBs as well.
 */
IRB_REFRELE(irb)
{
        ire_t *ire_list;                     

        rw_enter(&irb->irb_lock, RW_WRITER);      

        for (;;) {
                ASSERT(irb->irb_refcnt != 0);            
                if (irb->irb_refcnt != 1) {
                        /*
                         * Someone has a reference to this radix node
                         * or there is some bucket walker.
                         */
                        irb->irb_refcnt--;
                        rw_exit(&(irb)->irb_lock);    
                        return;
                }

                /*
                 * There is no other walker, nor is there any other thread
                 * that holds a direct ref to this radix node. Do the clean up
                 * if needed. Call to ire_unlink will clear the 
                 * IRB_MARK_CONDEMNED flag
                 */
                if (irb->irb_marks & IRB_MARK_CONDEMNED) {
                        ire_list = ire_unlink(irb);        
                        rw_exit(&(irb)->irb_lock);        

                        if (ire_list != NULL)
                                ire_cleanup(ire_list);          

                        rw_enter(&irb->irb_lock, RW_WRITER);      
                        continue;
                } 

                /*
                 * Now check if there are still any ires associated with
                 * this radix node.
                 */
                if (irb->irb_ire_outstanding_cnt != 0) {
                        irb->irb_refcnt--;
                        rw_exit(&(irb)->irb_lock);    
                        return;
                }

                /*
                 * Everything is clear. Zero walkers, Zero threads with a ref
                 * to this radix node, Zero ires associated with this radix
                 * node. Due to lock order, check the above conditions again
                 * after grabbing all locks in the right order
                 */
                rw_exit(&(irb)->irb_lock);    
                RADIX_NODE_WRITE_LOCK;
                rw_enter(&irb->irb_lock, RW_WRITER);      
                if (irb->irb_refcnt == 1 &&
                    irb->irb_ire_outstanding_cnt == 0) {
                        radix_node_delete_and_free(radix node of this irb);
                        rw_exit(&irb->irb_lock);
                        RADIX_NODE_UNLOCK;
                        return;
                }
                RADIX_NODE_UNLOCK;
        }
}

/*
 * get_next_leaf_node: get the next leaf node. same logic as BSD.
 * if arg is NULL, get 1st leaf node.
 */
rn_walktree()
{
        rn = NULL;
        RADIX_TREE_RLOCK;
        rn_next = get_next_leaf_node(NULL);

        while (rn_next != NULL) {
                /*
                 * IRB_REFHOLD ensures rn_next will not be deleted and its
                 * links remain intact when we return from the callback
                 */
                IRB_REFHOLD(RN_IRB(rn_next));
                RADIX_TREE_UNLOCK;
                if (rn != NULL)
                        IRB_REFRELE(RN_IRB(rn));

                rn = rn_next;
                callback(RN_IRB(rn));

                RADIX_TREE_RLOCK;
                rn_next = get_next_leaf_node(rn);
        }
        RADIX_TREE_UNLOCK;
        if (rn != NULL)
                IRB_REFRELE(RN_IRB(rn));
}

callback()

does not do IRB_REFHOLD, IRB_REFRELE. Instead rn_walktree does it
on behalf of the callback.
