file:///net/npt.sfbay/export/surya/surya-gate-review/webrev/index.html


Code review comments in the attachment.

Thirumalai

                radix tree lookup/add/walk/delete locking

1.ire_add_v4 -> ire_get_bucket -> returns &rt->rt_irb after dropping the
  global radix node lock. Now the radix node is not protected henceforth
  and can disappear at any time before the ire is actually linked into the
  radix node. To see how this can happen, consider the following sequence.
  
  T1. route add dst = 192.168/16, ipif = bge0
  Thread T1 starts off, creates and links the radix node for 192.168/16
  into the tree and returns the irb. Let us suspend this thread at
  L3266 ip_ire.c. 

  T2. route add dst = 192.168/16, ipif = bge1

  Now T2 starts off, tries to add a duplicate (route, netmask) pair, fails,
  uses the radix node that T1 added, links the ire into the irb, and runs
  to completion.
 
  T3. route delete dst = 192.168/16, ipif = bge1
  T4. netstat walker on the above bucket.

  T4 walks the above bucket. Now T3 -> ire_delete locates the ire, removes
  the radix node from the tree, marks the irb as DEAD, and finishes the
  deletion actions. The walker eventually finishes, and IRB_REFRELE
  ends up deleting the radix node.
 
  Now T1 resumes and finds a freed radix node. 

1b.(We had to make T4 a walker thread due to another bug: the radix node
    free only happens in IRB_REFRELE. If there is no walker during the
    deletion of an ire, the radix node is leaked.)

2. In rn_walktree(), the global radix lock is held while doing the traversal.
   But that is not sufficient. After dropping the lock at L1129 radix.c,
   the access to the node in the block at L1121 - L1134 is unprotected and
   could be touching freed memory. Similarly, we get a pointer to the next
   radix node at L1124 and then use it much later at L1139, having dropped
   the lock in the interregnum. The walker callbacks rtfunc and
   ire_ihandle_onlink_match do IRB_REFHOLD to walk the list, but there is
   nothing to prevent the list from vanishing before IRB_REFHOLD is called,
   or even before the callbacks are called.

In both 1. and 2. we need to increment a refcount on the radix node
before dropping the global radix tree lock (we could use the irb_cnt as
the refcount, under the irb_lock).

ire_delete() 

3a In ire_delete(), we check for an irb_ire_cnt of 1 and if so we remove the
   radix node from the tree. irb_ire_cnt is protected by the irb_lock and
   it can increase now, if another thread is in ire_add_v4. Note that the
   thread in ire_add_v4 is not holding the global radix lock, but just
   the irb_lock, while it is adding the ire to the irb list.

3b Another issue is that the radix node deletion may never happen. Consider
   that we have 2 ires in the bucket, ire1 and ire2, and 2 threads, T1
   trying to delete ire1 and T2 trying to delete ire2, running simultaneously.
   
   T1 -> ire_delete(ire1), grabs radix tree lock, finds irb_ire_cnt is 2
   and does not remove the radix node from the tree. Drops the radix tree lock
   and calls ire_delete_common(ire1). Let us suspend this thread now.

   T2 -> ire_delete(ire2), grabs radix tree lock, finds irb_ire_cnt is 2 and
   does not remove the radix node from the tree. Drops the radix tree lock
   and calls ire_delete_common(ire2). Proceeds to completion.
   Now let T1 also complete. Both ires are deleted, but radix node remains.

3c The other issue is that if the radix node being removed from the
   tree is the 'next' node of a tree walker, then even if the node itself is
   not freed (due to a non-zero irb_cnt from other walkers) the walk will be
   truncated, and that is bad. A truncated walk may cause unplumb to hang.

   The case at L3857 ip_ire.c should never happen.

4. IRB_REFRELE

   The last thread that brings the irb_refcnt to 0 frees the radix node
   if it has been removed from the tree. However, threads that located this
   node's ires before the ires were deleted may still be using them, and
   those threads may access ire_bucket, which would touch freed memory.
   Existing code assumes a non-null ire_bucket to be always valid.
   Thus the radix node free must happen only when all the ires have been
   inactive'd (ire_inactive).

Thus the add, delete, lookup, and walk need a mutually consistent locking 
mechanism. A suggested scheme is given in the attachment.

                NCE state, nce_res_mp and locking issues

5a. In the ND_INITIAL state, nce->nce_res_mp is a template or copy of
   the ill_resolver_mp. After successful resolution, nce_res_mp mutates
   into a dl_unitdata_req. This can lead to confusion, since the type of
   the mp is known only while holding nce_lock. Consider the following
   sequence, with thread T1 in the forwarding path and thread T2 in the
   host path running simultaneously.

   T1.1 ire_forward -> ire_create -> ire_add creates and adds an incomplete ire.
        ire->ire_nce is in  ND_INITIAL state.
   T2.1 ip_newroute -> hits on the above nce -> sends an ARP request for
        resolution.
   T1.2 ire_forward -> ip_rput_process_forward -> ip_rput_forward -> ip_xmit_v4
        grabs nce_lock, checks the nce_state, changes it to ND_INCOMPLETE,
        drops nce_lock 
   T2.2 arp resolution initiated in T2.1 completes and nce goes to ND_REACHABLE
   T1.3 ip_xmit_v4 -> ire_arpresolve would end up sending a bad request to
        ARP. The nce_res_mp is now a dl_unitdata_req (after arp resolution)
        and not an ill_resolver_mp.

   i.e. essentially the nce is resolved between L27091 and the execution
   of ire_arpresolve, causing the problem.

5b To get the state right, ip_newroute() must transition the nce to
   ND_INCOMPLETE before sending the resolution to ARP.

5c. Since one thread in the forwarding path and multiple threads in the host
   path can attempt to resolve an nce simultaneously, the nce_res_mp may be
   null, and we have code to get it from the associated ill_resolver_mp in
   case it is null. L7714 ip.c, L6489 ip_ire.c.

   It looks like we would be better off avoiding this mutation of nce_res_mp:
   leave it null to start with, and install nce_res_mp only after the nce is
   resolved. There is a comment at L6560 ip_ire.c which is somewhat similar.

6  In IPv6, only ire cache IREs have a non-null ire_nce; in IPv4,
   interface ires also have a non-null ire_nce. Why the dissimilarity?
   Do you propose to extend IPv6 interface IREs to have an ire_nce as well?
   Look at L4881 ip6.c and L5816 ip6.c; this stands different from L7473 ip.c
   and L7701 ip.c. In one case we have to call ill_dlur_gen and build it,
   while in the other we can do a copy from ire->ire_nce->nce_res_mp.

7. The mblk callback for freeb (ire_freemblk) and the esballoc logic are now
   highly avoidable (and convoluted), since the original design goal of the
   mblk holding a ref on the nce is no longer met. The attempt to access
   IP structures and drop nce references in ire_freemblk, when the mblk does
   not hold any ref on the nce, is dangerous. Mblks can travel outside of IP,
   and any other entity that comes across this mblk and frees it will cause
   corruption. E.g. consider the following scenario.

   Assume unplumb starts, and an arp request sent up ends up at the stream
   head since ARP has been popped off. (The comment at L7746 ip.c describes
   this scenario.) The mblk (arp request and ire mblk) is queued at the
   stream head. IP finishes the unplumb, since there are no refs from the
   mblk to the nce, and frees up all its structures. Now the close returns
   and the stream head frees the mblk, causing ire_freemblk to be executed.
   Any access to the ill at L6639 is now accessing freed memory.

   The solution is to get rid of this esballoc and free. The essential part
   of ire_freemblk must be called from IP on getting back an arp reply.
   If the mblk is freed by ARP or the stream head, that is fine, since there
   are no extra refs on any nce, and thus any unplumb will go through.
   All blocks of code that do freeb(ire->ire_mp) will need to be revisited
   and redone after removing ire_freemblk. An alternative is to add an nce
   parameter to ire_add_then_send(). This way the mblk ire's ire->ire_nce
   can always be null.

   ire_freemblk: This function is unstructured and convoluted.
   i. The argument name 'fake_ire' is inappropriate since it can
   be either fake_ire or the ire created through ire_create_mp. 
   ii. This function does not do any useful work in the normal successful
   case of arp resolution. 
   iii. In ire_freemblk() we need to locate the nce. Instead we try to locate
   the ire cache and then the nce from the ire cache. This is incorrect,
   since an ire cache may not exist, and then we won't do the cleanup. This
   happens in the case of an ip_newroute initiated arp request. Instead we
   should use ndp_lookup to locate the nce.
   iv. If the arp resolution fails, we want to call ndp_delete(), logically
   in the block at L6699, i.e. delete all associated ires since
   link layer resolution failed.
   v. If this ire is marked IRE_MARK_NOADD, why do you want to delete other
   IREs tied to the same nce?

8  ip_wput_nondata:
   Calling ire_nce_init at L27105 is convoluted; all we need to do when an
   arp response arrives is to locate the nce, by doing an ndp_lookup using
   the ire->ire_addr and the ill on which the arp response arrives.

9 ip_wput_ire: label noprepend: can be deleted

10 ipfil_sendpkt: If ipfil_sendpkt -> ire_route_lookup happens for a
   destination that is onlink, ire_route_lookup() would return an
   IRE_INTERFACE and sire would be null. The check at L731 in ire_forward
   must check whether supplied_ire is null to determine the caller.

   At L1283 if ire_type is IRE_IF_NORESOLVER/IRE_CACHE, then we end up calling
   ip_xmit_v4 at L1317 with a null value for ire.

11 ip_arp_news() must delete the affected nces, otherwise gratuitous ARP
   will be broken.

12 The NCEs need to be periodically deleted. IPv6 has a protocol to
   transition NCEs through the reachable, delay, probe, and unreachable
   states and then delete them. Given that IPv4 has nothing equivalent, we
   must at least have some mechanism to delete the NCEs periodically. Both
   the ARP cache and the IRE cache have periodic timers to delete their
   entries, but v4 NCEs have nothing. So this is a regression.

13 Rename
        G_LOCK_* to NDP_G_LOCK_*
        G_WALKER to NDP_G_WALKER
        G_EXIT to NDP_G_EXIT
        v4ll_g_lock to ndp_g_lock_v4
        ndp_g_lock to ndp_g_lock_v6
        ndp_g_walker_cleanup to ndp_g_walker_cleanup_v6
        v4ll_g_walker_cleanup to ndp_g_walker_cleanup_v4

   Please remove the reference to ire_dlureq_mp/fp_mp at L62 ip_ndp.h


14 ip_ndp.c: delete the conditional at L237; the ill is known to be v6.

15 Please delete ndp_walk_impl() and expand it inline, so callers call both
   ndp_walk_v4() and ndp_walk_v6(). (There aren't any other functions in
   ip_ndp.c with an _impl suffix.)

16 ip_if.c L6428 is useless; ire_nce is known to be NULL here. Since
   ipif_net_type is IRE_LOOPBACK, the rfq and stq passed to ire_create at
   L6394 are null, and hence ire_nce must be null.

17 nce_fp_mp and nce_res_mp are protected by the nce_lock in the case of v6.
   ire_dlureq_mp and ire_fp_mp were protected by ire_lock; now that they
   are in the nce, shouldn't they be protected by nce_lock? This issue
   arises only in the case of broadcast ires; for the others we don't need
   a lock.

18. I think we should remove the ire_fastpath* functions and switch to using
    nce_fastpath functions, now that we have moved to putting the fastpath
    in the nces. At least please study if there are any issues.

19 IRB_R_ENTER, IRB_W_ENTER, IRB_RW_EXIT: either all code should uniformly
   use these macros, or all code should just call rw_enter directly. These
   macros don't seem to be particularly useful.

                                ip_ftable.c

20 In ire_get_bucket(), at L1056 rn_match will return the most specific route,
   which may not match the mask of the route we are actually trying to add.
   We need to search the dupedkey list and get the node corresponding to our
   mask or call rnh_match_args instead.

21 The ire_get_bucket() assignment at L1047 is not required; it actually
   happens in rn_newpair() at L495 radix.c. rdst and the initialization at
   L1030 - L1034 are not needed either; instead rt->rt_dst can be directly
   initialized at L1048.

22 ire_find_best_route() - Should not loop across the dupedkey chain.
   That loop happens in rn_match_args().

23 L58 down, ire_ftable_args_t: please always use a 1 to 3 character prefix
   for the structure members, e.g. ifa_addr, ifa_mask, etc., instead of
   addr, mask, etc. This will facilitate cscope searches for, say,
   assignments to ifa_mask.

24 The assert at L179 ip_ftable.c in ire_ftable_lookup is bogus.
   A thread could be holding a pointer to the ire, and it could call
   ire_delete() any time after we look up the ire and drop the irb bucket
   lock, which would end up setting IRE_MARK_CONDEMNED.

25 ire_ftable_lookup() - Default route support

The support for, and distinction between, IRE_DEFAULT and IRE_INTERFACE for
default routes seems to have been removed. Why? The onnv MATCH_IRE_DEFAULT
logic would prefer IRE_DEFAULT; only if no IRE_DEFAULT is available would it
return an IRE_INTERFACE. Among IRE_DEFAULTs it round robins, but not among
IRE_INTERFACEs. ire_round_robin()'s behavior is different from the above.

At L185, there is a check for ire_type of IRE_DEFAULT and only then
ire_round_robin is called. This behavior is also very different from onnv.
The default route list could consist of a list of IRE_INTERFACE and
IRE_DEFAULT in no particular order, and what you encounter at the start
of the list is not predictable.

26 ip_ire_delete() - same comments as above.

I am not sure what the intent of changes 25 and 26 is. If they are
unintentional, you should restore the onnv behavior.

                                radix.c

27. The code in rn_match_args L382 down should be as follows.

                /*
                 * Although we found an exact match on the key, rn_leaf_fn
                 * is looking for some other criteria as well. Continue
                 * looking as if the exact match failed.
                 */
                b = 0;
                goto keeplooking;

        }
keydiff:
        test = (*cp ^ *cp2) & 0xff; /* find first bit that differs */
        for (b = 7; (test >>= 1) > 0; )
                b--;
keeplooking:;
        matched_off = cp - v;
        b += matched_off << 3;
        rn_bit = -1 - b;

The search key and the route matched completely in all bits up to the
netmask of the route; otherwise we would have gone to 'keydiff'. So the
matched offset is really cp - v. The parent's rn_offset represents the byte
in which the search key is being tested, and that is not relevant.

28 rn_lookup(): Multiple threads executing in rn_lookup() -> rn_addmask()
   can corrupt the addmask_key and last_zeroed variables, which are global.
   The 'addmask_key' is only a scratch variable and can be made a stack
   variable. What is the logic doing with 'last_zeroed'? That needs to
   be understood before you can decide what to do with it.

29 Why do we need MKGet and MKFree to be anything more than calls to kmem*?
   I.e. we don't need another level of internal buffer management.

30 Given the function pointer table rnh_* for the lookup/add/delete operations,
   it is not good to also have direct calls to parallel lookup functions
   that are not listed in the table. rn_match is the table driven function.
   But rn_match_args is used privately by IP. One solution is to add another
   operation say rnh_matchaddr_args to the function table. 

31 ire_forward: The block at L876 for the IRE_CACHE case is not very useful.
   If the interface route is deleted, the ire cache will also eventually be
   deleted through the ihandle linkage. Instead we should return much
   earlier, in the block from L794 down, irrespective of whether the
   nce_state is reachable or not.

                                ip.c

32 ip_xmit_v4: L26978. You should enqueue at the tail to preserve message
   ordering.  i.e. call nce_queue_mp_common.

33 It is good to avoid dropping packets on the outbound side, and instead
   queue them and enable flow control if canputnext fails. In ip_wput_ire()
   we check canputnext at label blocked: and subsequently just do putnext.
   The replacement of putnext with ip_xmit_v4 now causes another check of
   canputnext, followed by a packet drop. Instead, ip_xmit_v4() should be
   passed another boolean parameter. If it is true, we do the canputnext
   check in ip_xmit_v4 and drop the packet (e.g. the forwarding path). If
   it is false, we just call putnext (e.g. the ip_wput_ire path).

34 label send_pkt: at L26985 can be deleted.

35 The default case at L27100 must be an ASSERT(0);

36 The checks at L18033 tcp.c, L6071 udp.c, and L135 ip_impl.h related to
   the nce must be factored out into a macro that is invoked from each of
   the above places, to avoid duplication.
   The check can also be simplified as
   (ire->ire_nce && ire->ire_nce->nce_fp_mp &&
    MBLKL(ire->ire_nce->nce_fp_mp) <= MBLKHEAD(mp))
   We can avoid the state check because nce_fp_mp can be non-null only in
   ND_REACHABLE state (for IPv4).

-------------------------------------------------------------------------
Lock order: Radix tree lock, irb_lock.

Requirements:
i.   Walker must not hold any locks during the walker callback.
ii.  Walker must not see a truncated tree during the walk because of any node
     deletion.
iii. Existing code assumes ire_bucket is valid if it is non-null, and it is
     used in many places in the code to walk the irb list. Thus even if all
     the ires in a bucket have been deleted, we still can't free the radix
     node until the ires have actually been inactive'd (freed).

Tree traversal - Need to hold the global tree lock in read mode.
Before dropping the global tree lock, need to either increment the ire_refcnt
or the irb_refcnt. Either ensures the radix node can't be deleted.

Tree add - Need to hold the global tree lock in write mode to add a radix node.
To prevent the node from being deleted, increment the irb_refcnt, after the
node is added to the tree. The ire itself is added later while holding the
irb_lock, but not the tree lock.

Tree delete - Need to hold the global tree lock and irb_lock in write mode.
All associated ires must be inactive (i.e. freed), and irb_refcnt must be zero.

Walker - Increment irb_refcnt before calling the walker callback. Hold the
global tree lock (read mode) for traversal.

/*
 * Lookup:
 *
 * Hold radix tree read lock and do tree traversal
 * Locate radix node of interest
 * Grab bucket lock and traverse list of ires  
 * IRE_REFHOLD the ire of interest
 * Drop all locks, return ire.
 *
 * The radix node can't vanish until all associated ires
 * become inactive. Thus IRE_REFHOLD is an indirect hold
 * on the radix node.
 */
ire_find_best_route(radix node rn)
{
        grab bucket lock 
        list traversal, locate the ire of interest
        if found an ire of interest
                IRE_REFHOLD(ire)
        drop bucket lock        
        return (found ire);
}

ire_ftable_lookup()
{ 

        RADIX_NODE_HEAD_RLOCK;
        /* ire_find_best_route returns a REFHELD ire  */
        (rn, ire) = do tree traversal, call ire_find_best_route  

        RADIX_NODE_HEAD_UNLOCK;
        
        if (default route)
                do the loop and get the right ire in this bucket.

        return (ire);
}
 
/*
 * Add: Need radix tree write lock to add a new radix node to the tree.
 * ire addition to the irb list itself needs only irb lock and does not
 * need radix tree lock.
 */
ire_add_v4()
{ 

        /* The following in ire_get_bucket() */

        RADIX_NODE_HEAD_WLOCK
        rn = add new radix node
        if (rn == NULL) {
                /* The node already exists. So we locate the existing node */   
                rn = rn_match_args();
                if (rn == NULL) {
                        RADIX_NODE_HEAD_UNLOCK
                        return (failure);
                }
        }
        /*
         * Prevent the radix node from vanishing by incrementing irb_refcnt.
         * The irb_refcnt essentially serves as a refcnt on the radix node;
         * the radix node can't vanish while the irb_refcnt is non-zero.
         */
        IRB_REFHOLD(rn);
        RADIX_NODE_HEAD_UNLOCK

        /* The following in ire_add_v4() */
        ....
        link the ire into the list
        if (ire->ire_type & IRE_FORWARDTABLE)
                irb->irb_ire_outstanding_cnt++;
        ...
        IRB_REFRELE(rn);
}
        
/*
 * Delete: Need irb_lock and zero irb_refcnt (no walker) to remove an ire
 * from the irb list. Need radix tree write lock, irb_lock, zero irb_refcnt,
 * and zero irb_ire_outstanding_cnt to delete the radix tree node.
 */
ire_delete() - same as onnv, no changes (except for replacing
IRE_MARK_CONDEMNED by IRB_MARK_CONDEMNED)

ire_inactive()
{
end:
        /* is the ire anchored off the radix tree */
        if (ire->ire_type & IRE_FORWARDTABLE && v4ire) {
                irb = ire->ire_bucket;
                if (irb != NULL) {
                        rw_enter(&irb->irb_lock, RW_WRITER);      
                        irb->irb_ire_outstanding_cnt--;
                        /*
                         * Instead of examining the conditions for freeing
                         * the radix node here, we do it by calling
                         * IRB_REFRELE which is a single point in the code
                         * that embeds that logic. Bump up the refcnt to
                         * be able to call IRB_REFRELE
                         */
                        irb->irb_refcnt++;
                        rw_exit(&irb->irb_lock);
                        IRB_REFRELE(irb);
                }
        }

        ...

}                       
                        
        
/*
 * This assumes only radix node related irb. Please make any needed changes
 * to make this function usable for other IRBs as well.
 */
IRB_REFRELE(irb)
{
        ire_t *ire_list;                     

        rw_enter(&irb->irb_lock, RW_WRITER);      

        for (;;) {
                ASSERT(irb->irb_refcnt != 0);            
                if (irb->irb_refcnt != 1) {
                        /*
                         * Someone has a reference to this radix node
                         * or there is some bucket walker.
                         */
                        irb->irb_refcnt--;
                        rw_exit(&(irb)->irb_lock);    
                        return;
                }

                /*
                 * There is no other walker, nor is there any other thread
                 * that holds a direct ref to this radix node. Do the clean up
                 * if needed. Call to ire_unlink will clear the 
                 * IRB_MARK_CONDEMNED flag
                 */
                if (irb->irb_marks & IRB_MARK_CONDEMNED) {
                        ire_list = ire_unlink(irb);        
                        rw_exit(&(irb)->irb_lock);        

                        if (ire_list != NULL)
                                ire_cleanup(ire_list);          

                        rw_enter(&irb->irb_lock, RW_WRITER);      
                        continue;
                } 

                /*
                 * Now check if there are still any ires associated with
                 * this radix node.
                 */
                if (irb->irb_ire_outstanding_cnt != 0) {
                        irb->irb_refcnt--;
                        rw_exit(&(irb)->irb_lock);    
                        return;
                }

                /*
                 * Everything is clear. Zero walkers, Zero threads with a ref
                 * to this radix node, Zero ires associated with this radix
                 * node. Due to lock order, check the above conditions again
                 * after grabbing all locks in the right order
                 */
                rw_exit(&(irb)->irb_lock);    
                RADIX_NODE_WRITE_LOCK;
                rw_enter(&irb->irb_lock, RW_WRITER);      
                if (irb->irb_refcnt == 1 &&
                    irb->irb_ire_outstanding_cnt == 0) {
                        radix_node_delete_and_free(radix node of this irb);
                        rw_exit(&irb->irb_lock);
                        RADIX_NODE_UNLOCK;
                        return;
                }
                RADIX_NODE_UNLOCK;
        }
}

/*
 * get_next_leaf_node: get the next leaf node. same logic as BSD.
 * if arg is NULL, get 1st leaf node.
 */
rn_walktree()
{
        rn = NULL;
        RADIX_TREE_RLOCK;
        rn_next = get_next_leaf_node(NULL);

        while (rn_next != NULL) {
                /*
                 * IRB_REFHOLD ensures rn_next will not be deleted and its
                 * links remain intact when we return from the callback
                 */
                IRB_REFHOLD(RN_IRB(rn_next));
                RADIX_TREE_UNLOCK;
                if (rn != NULL)
                        IRB_REFRELE(RN_IRB(rn));

                rn = rn_next;
                callback(RN_IRB(rn));

                RADIX_TREE_RLOCK;
                rn_next = get_next_leaf_node(rn);
        }
        RADIX_TREE_UNLOCK;
        if (rn != NULL)
                IRB_REFRELE(RN_IRB(rn));
}

callback()

does not do IRB_REFHOLD / IRB_REFRELE itself. Instead, rn_walktree does it
on behalf of the callback.

_______________________________________________
networking-discuss mailing list
[email protected]
