Meem,

Sorry for this very late reply. See inline:

>  > > General comments:
>  > 
>  > >           Also, I'm a bit concerned about the use of link
>  > >           identifiers here, since e.g. cfgadm will output them (right 
> edge
>  > >           truncated to avoid 80-column wrap):
>  > > 
>  > >           cfgadm: Component system is busy, try again:
>  > >                          Resource                               Info
>  > >           --------------------------------------  ------------------
>  > >           /devices/pci at 9,700000/pci at 2/network at 0   Network 
> interface 
>  > >           SUNW_network/1                          ce0 AGGRs: aggr0
>  > >           SUNW_network/2                          aggr0 hosts IP add
>  > >           SUNW_ip/10.8.57.204                     Plumbed IP Address
>  > >            
>  > >           What would be the impact of using link names instead?
>  > > 
>  >
>  > Note that all datalink configuration are based on linkids, so a
>  > rename-link should not affect the datalink datalink configuration
>  > offline/online/configure. But if we use link names, there would be
>  > races between the RCM operation and a "dladm rename-link" operation,
>  > and would cause unexpected result.
> 
> What specific race are you concerned about?  Given that the network, VLAN,
> and aggr RCM modules rebuild the resource name cache before processing a
> notification, wouldn't renames be transparently handled?  Are you worried
> about a rename-link running in parallel with a cfgadm operation?  That
> seems like a recipe for confusion no matter what, and also pretty unlikely
> given that it would require the datalinks to be unplumbed.
> 
True, I agree a "dladm rename-link" operation in parallel with a DR
operation is rare, but it could happen. Note that it does not matter whether
the link is plumbed or not. For example, assuming an aggregation aggr1 which
  is created over bge0, bge0 can be renamed when we tries to offline/online
bge0, even when aggr1 is not plumbed.

If we don't care about the above cases at all, I guess we could change the
resource name from the linkid to the linkname. I suspect the code would be
much more messy though, as we often need to do the <linkname, linkid>
translation.

Regarding the RCM_RESOURCE_LINK_NEW event (mostly used in the configuration
process after the hardware is DRed in), it currently also carries linkids.
Because this configuration processing is mostly asynchronous, the chance of
the race would be greater than than the offline/online processing.

>  > >           If (for some reason) linkids must be used at the link layer,
>  > >           would it be possible to use link names at a higher level?  I'm
>  > >           really uncomfortable with e.g. ip_rcm working with linkids.
>  > > 
>  > That is possible. But that would mean link rcm modules (for example, 
>  > aggr_rcm), needs to populate a event to its consumer twice (one in the 
> form 
>  > of SUNW_datalink/<linkid>, another in the form of 
> SUNW_network/<linkname>). 
>  > Because its consumer could be at or above the link layer (for example, IP 
>  > and VLAN).
>  > 
>  > I am not sure whether that implementation is clearer.
> 
> Yeah, I agree that's messy.
> 
> My real concern here is the link identifiers appearing in the cfgadm
> output -- I was really hoping to avoid that.
> 
Understood. This is a tradeoff, let me know if you still prefer using link
names in the ip_rcm.

>  > >         * Similar to the above: please be sure that link identifiers do
>  > >           not appear in diagnostic messages the system administrator 
> could
>  > >           see.  (If possible, I'd prefer to avoid link identifiers in all
>  > >           diagnostic messages).
>  > > 
>  > You mean aggr_log_err() etc. right? Sure. For debug information, I think 
> we 
>  > would want the link id, as that is the resource name get populated.

>  > > aggr_rcm.c
> 
> For debug information, could we output the resource name instead?  (I
> realize that this may contain the linkid, depending on the outcome of
> the discussion above).
> 
I've tried to outcome the resource name as much as possible. But some of the
debug information cannot be changed. For example, in aggr_offline_port():

        rcm_log_message(RCM_TRACE2, "AGGR: delete aggregation %u\n",
            aggr->da_aggrid);

>  > >         * Are we sure it's safe to call *_consumer_offline() while 
> holding
>  > >           cache_lock?  Seems like it should be OK, but I'm slightly
>  > >           nervous about calling other modules (via rcm_request_offline())
>  > >           while holding locks.
>  > > 
>  > Yes. It should be fine to hold locks. Note that it is not safe to hold the 
>  > lock when calling rcm_notify_event(RCM_RESOURCE_LINK_NEW), as it could 
>  > result another RCM_RESOURCE_LINK_NEW event.
> 
> Seems like that would only be problematic if the first
> RCM_RESOURCE_LINK_NEW led to synchronous processing of the second one.
> Could you elaborate on the issue?
> 
I don't think we ever call rcm_notify_event(RCM_RESOURCE_LINK_NEW) with the
locks held. Did you see any code does that?

>  > > vlan_rcm.c:

>  > >         * 610: Under what situation do we get here without
>  > >           VLAN_CONSUMER_OFFLINED being set?  (When some force mode is
>  > >           enabled?)  It seems like this would mean that we're about to
>  > >           remove a link that still has consumers actively using one or
>  > >           more of its VLANs, which would be problematic.
>  > > 
>  > I don't think this case can happen - xxx_remove() should be called (by 
>  > confrim_rcm()) when the device is already successfully detached. I could 
>  > change the check to an assertion if you prefer that.
> 
> Yes.
> 
Changed.

>  > >         * 614: If rcm_notify_remove() fails part-way through, what
>  > >           restores the state of the VLANs that were removed?
>  > > 
>  > No, the framework does not provide such API, and each rcm plugin usually 
>  > just remove the cached node and populate the REMOVE event here. It is 
> really 
>  > hard to say what to do to recover this.
> 
> Indeed.  Minimally, I think we should print a warning in this case.
> 
Added.
> 
>  > >         * 1004: This doesn't seem safe: we've already inserted `probe'
>  > >           into the cache on line 987 -- and it has a reference to `rsrc'.
>  > > 
>  > I moved cache_insert() after allocation of "vlan" succeeds.
> 
> OK, that should work, though it seems more intuitive to check
> CACHE_NODE_NEW rather than vc_vlan == NULL when deciding whether
> to call cache_insert() -- or will that not work?
> 
Sure. I changed it.

>  > > aggr_rcm.c:
>  >
>  > >         * 221-227: As with vlan_rcm.c:197, it's unclear how this would
>  > >           ever have any work to do -- and if it did have work to do, it
>  > >           seems like it would need to call rcm_unregister_interest().
>  > > 
>  > The aggr_unregister() function does not seem to be called at all. So that 
> I 
>  > am not sure whether 221-227 could have work to do. But at this point, 
> since 
>  > the framework does not provide the rcm_handle, it is not possible to call 
>  > rcm_unregister_interest() here.
> 
> As above, please add a comment (and file the bug).
> 
>  > >         * 610: Under what situation do we get here without
>  > >           CAHCE_AGGR_CONSUMER_OFFLINED being set?  As with VLANs, it 
>  > >           seems odd to remove a link that still has consumers actively
>  > >           using it.
>  > > 
>  > If the link is not the only port in the aggregation. Then 
>  > CAHCE_AGGR_CONSUMER_OFFLINED is not set.
> 
> Ah, yes.  Could you add a short comment covering this?
> 
Added the comments.

>  > >         * 1100: If this calloc() fails and `exist' is true, we end up
>  > >           leaving an aggr in aggr_head with a NULL da_ports which will
>  > >           e.g. later cause problems in aggr_is_only_port().  
>  > 
>  > I changed cache_update() to continue to remove the staled node even 
>  > aggr_update_all() fails.
>  > 
>  > Therefore, no matter what fails, aggr_update_all() would try to walk 
> through 
>  > all the existing active aggregations and update the cache and aggr list. 
> If 
>  > "exist" is B_TRUE, the aggr would still be marked as AGGR_STALE and will 
>  > later be freed in cache_update().
> 
> So in the case of memory failure, an existing aggregation may vanish?
> As an alternative, perhaps we could remove da_ports entirely and have
> a da_lastport field that is either an "invalid linkid" if the
> aggregation has multiple ports, or the linkid of the remaining port if
> only one remains?  (Or is there some other reason we need to track the
> other ports?)
> 
Good idea. I've made the changes.

>  > >         * 1244: As per previous comments, seems like aggr_log_err() could
>  > >           take a linkid rather than a node (and print a link name).  Same
>  > >           comments also apply on the format of the messages.
>  > > 
>  > For the rcm_log_message() part of this function, I think we still should 
>  > print a linkid. Otherwise, I accept your comments.
> 
> What's special abou the rcm_log_message() part?
> 
Because I think that would be RCM related, which would print the resource
name which has linkid in it. But I agree that rsrc should be printed instead
of directly print out the linkid, I've changed it.

The recent webrev is here. Note that it includes my other bug fixes, and it 
hasn't been tested yet.

/net/aquila.prc//export/home/cathy/clearview-snapshot-fp-remove/webrev_1211/index.html

Thanks
- Cathy



Reply via email to