> >  > External webrev:
 > >  > 
 > >  >        http://cr.grommit.com/~yun/webrev_uv_09_28
 > >  > 
 > >  > Internal webrev:
 > >  > 
 > >  > 
 > > http://jurassic.sfbay/net/forwar.prc/builds/yz147064/clearview-uv-review/webrev_review/
 > >  
 > >  > 
 > >  > Cscope:
 > >  > 
 > >  >        /net/forwar.prc/builds/yz147064/clearview-uv-review/
 > > 
 > Revised webrev:
 > 
 >       /net/aquila.prc/export/home/cathy/clearview/webrev_rcm
 >
 > > General comments:
 > 
 > >           Also, I'm a bit concerned about the use of link
 > >           identifiers here, since e.g. cfgadm will output them (right edge
 > >           truncated to avoid 80-column wrap):
 > > 
 > >           cfgadm: Component system is busy, try again:
 > >                          Resource                               Info
 > >           --------------------------------------  ------------------
 > >           /devices/pci at 9,700000/pci at 2/network at 0   Network 
 > > interface 
 > >           SUNW_network/1                          ce0 AGGRs: aggr0
 > >           SUNW_network/2                          aggr0 hosts IP add
 > >           SUNW_ip/10.8.57.204                     Plumbed IP Address
 > >            
 > >           What would be the impact of using link names instead?
 > > 
 >
 > Note that all datalink configuration is based on linkids, so a
 > rename-link should not affect the datalink configuration during
 > offline/online/configure. But if we used link names, there would be
 > races between the RCM operation and a "dladm rename-link" operation,
 > which could cause unexpected results.

What specific race are you concerned about?  Given that the network, VLAN,
and aggr RCM modules rebuild the resource name cache before processing a
notification, wouldn't renames be transparently handled?  Are you worried
about a rename-link running in parallel with a cfgadm operation?  That
seems like a recipe for confusion no matter what, and also pretty unlikely
given that it would require the datalinks to be unplumbed.

 > >           If (for some reason) linkids must be used at the link layer,
 > >           would it be possible to use link names at a higher level?  I'm
 > >           really uncomfortable with e.g. ip_rcm working with linkids.
 > > 
 > That is possible. But that would mean the link rcm modules (for example, 
 > aggr_rcm) need to post an event to their consumers twice (once in the form 
 > of SUNW_datalink/<linkid>, and once in the form of SUNW_network/<linkname>), 
 > because a consumer could be at or above the link layer (for example, IP 
 > and VLAN).
 > 
 > I am not sure whether that implementation is clearer.

Yeah, I agree that's messy.

My real concern here is the link identifiers appearing in the cfgadm
output -- I was really hoping to avoid that.

 > >         * Similar to the above: please be sure that link identifiers do
 > >           not appear in diagnostic messages the system administrator could
 > >           see.  (If possible, I'd prefer to avoid link identifiers in all
 > >           diagnostic messages).
 > > 
 > You mean aggr_log_err() etc., right? Sure. For debug information, I think 
 > we would want the link id, as that is the resource name that gets posted.

For debug information, could we output the resource name instead?  (I
realize that this may contain the linkid, depending on the outcome of
the discussion above).

 > >         * A number of functions have comments stating "Call with
 > >           cache_lock held".  I'd prefer the comment be removed and an
 > >           assert() added to the function in its place.  (I've also noted
 > >           the places I've found where we call them without holding
 > >           cache_lock in the specific comments below.)
 > > 
 > free_node() could be called without holding the lock. It is fine. See below.

OK, thanks.

 > >         * Are we sure it's safe to call *_consumer_offline() while holding
 > >           cache_lock?  Seems like it should be OK, but I'm slightly
 > >           nervous about calling other modules (via rcm_request_offline())
 > >           while holding locks.
 > > 
 > Yes. It should be fine to hold locks. Note that it is not safe to hold the 
 > lock when calling rcm_notify_event(RCM_RESOURCE_LINK_NEW), as it could 
 > result in another RCM_RESOURCE_LINK_NEW event.

Seems like that would only be problematic if the first
RCM_RESOURCE_LINK_NEW led to synchronous processing of the second one.
Could you elaborate on the issue?

 > >         * Occasionally, the vlan_cache and aggr_cache structure variables
 > >           use the name `probe' rather than `node' -- why?  What does
 > >           `probe' mean in this context?
 > > 
 > I don't know. ip_rcm and network_rcm sometimes use "probe". I changed 
 > vlan_rcm and aggr_rcm to use "node".

Indeed; thanks.

 > >         * The `vlan_cache' structure seems misnamed -- this appears to
 > >           represent a link, not a VLAN.  Likewise, it seems like
 > >           CACHE_VLAN_* should really be CACHE_LINK_*.
 > > 
 > Changed to link_cache_t. Changed to CACHE_NODE_*.

Sounds good.

 > >         * On a related note: some of the functions have confusing names.
 > >           For instance, the oddly-named vlan_offline_vlan() is really
 > >           semantically link_offline_vlan() (but e.g. vlan_update() is
 > >           properly named).
 > > 
 > The first vlan is really just a prefix.

Then shouldn't vlan_update() be vlan_update_vlan() or somesuch?  In any
case, a prefix like vrcm_ might be less confusing -- but I'm OK if you
want to leave things as-is.

 > > vlan_rcm.c:
 > >
 > >         * 73: Could we consistently use an enum for flags, rather than
 > >           using a combination of uint32_t and bitfields?  (enums are
 > >           debugger-friendly.)
 > > 
 > I changed it to an enum type, but it is still a bitfield.

That's exactly what I had in mind.  By "bitfields", I was referring to
things like `dv_force : 1'

 > >         * 197: Unclear how this cache_free() would ever have any work to
 > >           do, since if vlan_register() was never called the cache will be
 > >           empty, and if it was called, then upon unregistration
 > >           vlan_unregister() will have already emptied the cache.  (And if
 > >           items were left in the cache, is it safe to remove them without
 > >           calling rcm_unregister_interest()?)  If the cache_free() call is
 > >           needless, then the cache_free() routine can be removed as well.
 > > 
 > The xxx_unregister() function does not seem to be called at all, so I 
 > am not sure whether cache_free() could have any work to do. But at this 
 > point, since the framework does not provide the rcm_handle, it is not 
 > possible to call rcm_unregister_interest() here.
 > 
 > It seems to be a framework issue.

Uck.  A bug should be logged on this against the framework.  Also, a short
comment explaining this issue in rcm_mod_fini() would be useful.

 > >         * 416, 1194: Can you explain the need for the `break_on_failure'
 > >           flag?  Why not always try to online all VLANs?
 > > 
 > The reason was:
 > 
 > One is recovery after offline fails: break_on_failure would be set to 
 > B_FALSE, and vlan_online_vlan() and vlan_consumer_online_common() would 
 > try their best to online all VLANs and consumers.
 > 
 > The other is undo_offline(), which is called by the framework: 
 > break_on_failure would be set to B_TRUE, and the online operation would 
 > stop once a failure occurs.
 > 
 > But I looked at the cfgadm code again, and it seems that in the second 
 > case the framework does not have any mechanism to report a failure of 
 > undo_offline(); what it does is try its best to recover from a failed 
 > offline operation. I agree it should do the same as the first case.
 > 
 > I will make the change: ignore any failures that occur and online all the 
 > VLANs.
 > 
 > As a result, vlan_consumer_online() and vlan_online_vlan() will both 
 > return void instead of int.

Sounds good.

 > >         * 602, 1203, 1252: Would simplify code to use alloca() instead.
 > > 
 > Changed to declare rsrc as an array.

Even better.

 > >         * 610: Under what situation do we get here without
 > >           VLAN_CONSUMER_OFFLINED being set?  (When some force mode is
 > >           enabled?)  It seems like this would mean that we're about to
 > >           remove a link that still has consumers actively using one or
 > >           more of its VLANs, which would be problematic.
 > > 
 > I don't think this case can happen - xxx_remove() should be called (by 
 > confirm_rcm()) only after the device has already been successfully 
 > detached. I could change the check to an assertion if you prefer that.

Yes.

 > >         * 614: If rcm_notify_remove() fails part-way through, what
 > >           restores the state of the VLANs that were removed?
 > > 
 > No, the framework does not provide such an API, and each rcm plugin 
 > usually just removes the cached node and posts the REMOVE event here. It 
 > is really hard to say what to do to recover from this.

Indeed.  Minimally, I think we should print a warning in this case.

 > >         * 660: The ordering here seems strange -- shouldn't we defer the
 > >           removal of the node until after the consumers have agreed on
 > >           the remove?
 > > 
 > I don't think it matters.

Right, it only mattered if we could reasonably handle the failure of
vlan_consumer_remove().

 > >         * 661: This call to node_free() is done without holding cache_lock.
 > > 
 > It is already removed from the cache, so no need to hold the lock.

I see.

 > >         * 1004: This doesn't seem safe: we've already inserted `probe'
 > >           into the cache on line 987 -- and it has a reference to `rsrc'.
 > > 
 > I moved cache_insert() to after the allocation of "vlan" succeeds.

OK, that should work, though it seems more intuitive to check
CACHE_NODE_NEW rather than vc_vlan == NULL when deciding whether
to call cache_insert() -- or will that not work?

 > >         * 1164: As per previous comments, seems like vlan_log_err() should
 > >           take a linkid rather than a node (and print a link name).  Also,
 > >           a format like:
 > > 
 > >                 VLAN: vlan2: cannot offline consumers
 > > 
 > >           ... seems preferable to the current:
 > > 
 > >                 VLAN: cannot offline consumers(vlan2)
 > > 
 > >           ... but I realize the format you chose is well-established
 > >           practice with other RCM modules.
 > >
 > Like you said, I will follow the conventions and will not change the format.

Understood.

 >
 > >         * 1191: Remove "/* ARRGSUSED */" (the extra R wasn't causing it to
 > >           do anything anyway ;-)'
 > > 
 > errorp is not used, so I will just fix the typo.

Ah, I see that this is now a callback rather than a helper function, so
the unused argument makes sense.

 > >         * 1425: Shouldn't this message be at the top of the function?
 > > 
 > I don't like to print out the message for every persistent VLAN.

OK.

 > >         * 1461: CACHE_REFRESH is passed here, but it's ignored since
 > >           the handle is NULL.  But given that this codepath is only
 > >           reached via vlan_notify_event(), and that's already done a
 > >           cache_update(), it's probably OK to just pass CACHE_NO_REFRESH,
 > >           and then remove the hd != NULL check.
 > > 
 > I will add an rcm_handle_t argument to the vlan_configure() function. 
 > Since vlan_configure() itself changes the list of active VLANs, I'd 
 > prefer to refresh the cache here.

OK.

 > > aggr_rcm.c:
 >
 > >         * 221-227: As with vlan_rcm.c:197, it's unclear how this would
 > >           ever have any work to do -- and if it did have work to do, it
 > >           seems like it would need to call rcm_unregister_interest().
 > > 
 > The aggr_unregister() function does not seem to be called at all, so I 
 > am not sure whether 221-227 could have any work to do. But at this point, 
 > since the framework does not provide the rcm_handle, it is not possible 
 > to call rcm_unregister_interest() here.

As above, please add a comment (and file the bug).

 > >         * 494, 497: Should this be "IP/VLAN" consumers?  (Or perhaps we
 > >           should just refrain from guessing what the consumers might be.)
 > > 
 > Changed "IP consumers" to "Consumers"

OK.

 > >         * 577: What's the rationale behind setting *up to B_TRUE even
 > >           if dladm_aggr_up() fails?
 > > 
 > Fixed.
 > 
 > Just to note that it was not a problem, as *up is not used in the
 > failure case.

Understood.

 > >         * 610: Under what situation do we get here without
 > >           CACHE_AGGR_CONSUMER_OFFLINED being set?  As with VLANs, it 
 > >           seems odd to remove a link that still has consumers actively
 > >           using it.
 > > 
 > Did you mean line 710?

Sorry, I cut-and-pasted from the vlan_rcm.c comments and forgot to update
the line number.

 > If the link is not the only port in the aggregation, then 
 > CACHE_AGGR_CONSUMER_OFFLINED is not set.

Ah, yes.  Could you add a short comment covering this?

 > >         * 736: Don't we need to hold cache_lock here to safely access
 > >           vc_aggr?  (If not, why is it needed at line 1401?)
 > > 
 > No. We don't, because this node has already been removed from the cache list.

I see.

 > >         * 738: What does the name `exported' mean here?
 > > 
 > It seems to be a common practice throughout the existing rcm modules; it 
 > means the resource names that will be posted with the RCM events.

OK.

 > >         * 749: This call to node_free() is done without holding cache_lock.
 > > 
 > Again, this node is already removed from the cache list.

I see.

 > >         * 1100: If this calloc() fails and `exist' is true, we end up
 > >           leaving an aggr in aggr_head with a NULL da_ports which will
 > >           e.g. later cause problems in aggr_is_only_port().  
 > 
 > I changed cache_update() to continue to remove the stale node even if 
 > aggr_update_all() fails.
 > 
 > Therefore, no matter what fails, aggr_update_all() would try to walk through 
 > all the existing active aggregations and update the cache and aggr list. If 
 > "exist" is B_TRUE, the aggr would still be marked as AGGR_STALE and will 
 > later be freed in cache_update().

So in the case of memory failure, an existing aggregation may vanish?
As an alternative, perhaps we could remove da_ports entirely and have
a da_lastport field that is either an "invalid linkid" if the
aggregation has multiple ports, or the linkid of the remaining port if
only one remains?  (Or is there some other reason we need to track the
other ports?)

 > >           Same with the
 > >           logic on lines 1115-1117.  If `exist' is false, then we end up
 > >           freeing aggr even though it's currently inserted on the list via
 > >           the aggr_list_insert() on line 1092.
 > > 
 >          if (!exist)
 >                  aggr_list_insert(aggr);
 > 
 >          aggr->da_flags &= ~AGGR_STALE;

OK.

 > >         * 1244: As per previous comments, seems like aggr_log_err() could
 > >           take a linkid rather than a node (and print a link name).  Same
 > >           comments also apply on the format of the messages.
 > > 
 > For the rcm_log_message() part of this function, I think we should still 
 > print a linkid. Otherwise, I accept your comments.

What's special about the rcm_log_message() part?

 > >         * 1501: I'm a bit confused on the semantics of `retval' -- seems
 > >           like it provides the status of the last aggregation to be
 > >           examined, which doesn't seem like what we want.
 > > 
 > >         * 1553: If the configuration was successful and a port can only
 > >           be in one aggregation, why do we return DLADM_WALK_CONTINUE?
 > > 
 > I've made some changes based on your comments above. Please take a look 
 > once I send out the updated webrev.

Looks good.

--
meem
