Peter Memishian wrote: > > > > > General comments: > > > > > > > > > Also, I'm a bit concerned about the use of link > > > > > identifiers here, since e.g. cfgadm will output them > (right edge > > > > > truncated to avoid 80-column wrap): > > > > > > > > > > cfgadm: Component system is busy, try again: > > > > > Resource Info > > > > > -------------------------------------- ------------------ > > > > > /devices/pci at 9,700000/pci at 2/network at 0 Network > interface > > > > > SUNW_network/1 ce0 AGGRs: aggr0 > > > > > SUNW_network/2 aggr0 hosts IP add > > > > > SUNW_ip/10.8.57.204 Plumbed IP Address > > > > > > > > > > What would be the impact of using link names instead? > > > > > > > > > > > > > Note that all datalink configuration are based on linkids, so a > > > > rename-link should not affect the datalink datalink configuration > > > > offline/online/configure. But if we use link names, there would be > > > > races between the RCM operation and a "dladm rename-link" operation, > > > > and would cause unexpected result. > > > > > > What specific race are you concerned about? Given that the network, > VLAN, > > > and aggr RCM modules rebuild the resource name cache before processing a > > > notification, wouldn't renames be transparently handled? Are you worried > > > about a rename-link running in parallel with a cfgadm operation? That > > > seems like a recipe for confusion no matter what, and also pretty > unlikely > > > given that it would require the datalinks to be unplumbed. > > > > > True, I agree a "dladm rename-link" operation in parallel with a DR > > operation is rare, but it could happen. Note that it does not matter > whether > > the link is plumbed or not. For example, assuming an aggregation aggr1 > which > > is created over bge0, bge0 can be renamed when we tries to offline/online > > bge0, even when aggr1 is not plumbed. > > Do you mean "even when aggr1 is plumbed"? > Ah, right.
> > If we don't care about the above cases at all, I guess we could change the > > resource name from the linkid to the linkname. I suspect the code would be > > much more messy though, as we often need to do the <linkname, linkid> > > translation. > > > > Regarding the RCM_RESOURCE_LINK_NEW event (mostly used in the configuration > > process after the hardware is DRed in), it currently also carries linkids. > > Because this configuration processing is mostly asynchronous, the chance of > > the race would be greater than than the offline/online processing. > > Understood. I'm OK with leaving this as-is given that the resource names > are already pretty cryptic (e.g. /devices/pci at 9,700000/pci at 2/network at > 0) > and that the Info column from cfgadm provides human-consumable > information, but we may need to revisit this in the future based on > operational experience from the field. > Okay. > > > For debug information, could we output the resource name instead? (I > > > realize that this may contain the linkid, depending on the outcome of > > > the discussion above). > > > > > I've tried to outcome the resource name as much as possible. But some of > the > > debug information cannot be changed. For example, in aggr_offline_port(): > > > > rcm_log_message(RCM_TRACE2, "AGGR: delete aggregation %u\n", > > aggr->da_aggrid); > > I presume we don't have the information at this point to map it back to a > linkname (has that info already been deleted)? > The <link name, linkid> mapping is still stored in the dlmgmtd daemon, and we could get the link name if we really need it. But I am not convinced that we need it, as the message is really for diagnose purpose, calling libdladm APIs to convert linkid to linkname seems overkilled. > > > > > * Are we sure it's safe to call *_consumer_offline() while > holding > > > > > cache_lock? Seems like it should be OK, but I'm slightly > > > > > nervous about calling other modules (via > rcm_request_offline()) > > > > > while holding locks. > > > > > > > > > Yes. It should be fine to hold locks. Note that it is not safe to > hold the > > > > lock when calling rcm_notify_event(RCM_RESOURCE_LINK_NEW), as it > could > > > > result another RCM_RESOURCE_LINK_NEW event. > > > > > > Seems like that would only be problematic if the first > > > RCM_RESOURCE_LINK_NEW led to synchronous processing of the second one. > > > Could you elaborate on the issue? > > > > > I don't think we ever call rcm_notify_event(RCM_RESOURCE_LINK_NEW) with the > > locks held. Did you see any code does that? > > No, but I'm questioning your original comment that it would be unsafe to > hold the lock when calling rcm_notify_event(RCM_RESOURCE_LINK_NEW). Seems > like this wouldn't be a problem even if we did hold the lock because the > handling of the notification is performed asynchronously. > The notification could be synchronous if it is called within the rcm_daemon. See rcm_direct_call(). > > > > > aggr_rcm.c: > > > > > > > > > * 221-227: As with vlan_rcm.c:197, it's unclear how this > would > > > > > ever have any work to do -- and if it did have work to > do, it > > > > > seems like it would need to call > rcm_unregister_interest(). > > > > > > > > > The aggr_unregister() function does not seem to be called at all. So > that I > > > > am not sure whether 221-227 could have work to do. But at this point, > since > > > > the framework does not provide the rcm_handle, it is not possible to > call > > > > rcm_unregister_interest() here. > > > > > > As above, please add a comment (and file the bug). > > So was the bug filed? > 6642490. Thanks - Cathy
