[clearview-discuss] re: DR and net_online

Peter Memishian Tue, 19 Dec 2006 14:20:28 -0500

 > As you might know, the net_online() function (in the network_rcm 
 > plugin) is actually a "net_undo_offline()" operation: it reverses all 
 > operations that were performed before the offline failed.
 > 
 > As an example, the ip_rcm plugin keeps a list of the IP interfaces on a 
 > specific device, and ip_undo_offline() will try to replumb all IP interfaces 
 > on that list (which were unplumbed in ip_offline()).


Indeed.
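
For readers unfamiliar with the pattern being described, here is a minimal
sketch of the "record on offline, replay on undo" idea.  It is illustrative
Python, not the actual ip_rcm code (which is C); the class name, method names
and interface names are all hypothetical.

```python
class UndoOfflineSketch:
    """Hypothetical stand-in for a plugin that can undo a failed offline."""

    def __init__(self, active):
        # active: dict mapping a device name to its plumbed IP interfaces
        self.active = active
        self.unplumbed = {}  # per-device list remembered by offline()

    def offline(self, device):
        # Remember what is about to be torn down, then tear it down.
        self.unplumbed[device] = self.active.pop(device, [])

    def undo_offline(self, device):
        # Replumb exactly what offline() removed for this device.
        self.active[device] = self.unplumbed.pop(device, [])
```

The point of the sketch is only that the undo path replays a list captured
at offline time, so it restores whatever was actually running.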

 > To do something similar after vanity naming, network_rcm needs to keep a 
 > list of the active aggregations and (explicitly created) VLANs which are 
 > associated with each specific device, and I expect that code will be 
 > complicated. An alternative is to create aggregations or VLANs based on the 
 > persistent configuration. The downside of this alternative is that it might 
 > not be an exact reverse operation because of the difference between the 
 > persistent configuration and the active configuration.
 > 
 > Note that after DR of a device, we will also try to restore all 
 > configuration associated with the device. At that time, we can only do that 
 > based on the persistent configuration because the active configuration is 
 > long gone.

Right.

 > Because of this, I am proposing to do the same (restore based on the 
 > persistent configuration) in net_online(). That is also the proposal in the 
 > UV design document.

Just to be clear: are you proposing to have *all* network configuration
restored from the persistent configuration, or just the configuration for
VLANs and aggregations?  Minimally, I think the behavior needs to be the
same across all links and IP interfaces.

That said, while I agree it's an annoying edge case and a lot of code, I'd
vote against this approach since:

        1. Having an "offline" operation fail is not unusual, especially
           given how poor the administrative interfaces are to map
           hardware resources (NICs, memory boards, ...) to hardware
           slots.  If the administrator names the wrong slot, the offline
           operation will likely fail, but any network configuration on
           the affected NICs that has been changed since boot will not be
           reapplied.

        2. Failed offline handling should be consistent for all resources.
           For instance, suppose both disk and NIC configuration must be
           restored after a failed offline.  Having the disk RCM code
           restore the mounts that used to exist (regardless of whether
           they were present at boot), but having the network RCM code
           restore the persistent configuration seems confusing and hard
           to justify administratively.

        3. In the future, there's a strong possibility we will allow link
           and IP configuration to remain in a "detached" state across a DR
           event, and then allow newly-inserted hardware to attach to that
           link and IP configuration.  That would move our model away from
           using persistent configuration across a DR -- but what you've
           proposed does the opposite, by further cementing the use of
           persistent configuration.

Note that if (3) is realized, we would actually end up with less code,
since we would no longer need to tear down the network configuration at
all as part of a DR event -- and since nothing is torn down, nothing would
need to be restored on failure.
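
To make the concern in (1) concrete, here is a small sketch comparing the two
ways of undoing a failed offline.  It is illustrative Python under stated
assumptions, not real RCM code: the function name, the dictionary layout, and
the link names ("aggr1", "vlan5", "bge0", "bge1") are all hypothetical.

```python
def failed_offline(active, persistent, use_persistent):
    """Return the configuration in place after a failed offline is undone."""
    snapshot = dict(active)  # what was actually running before the offline
    active.clear()           # the offline attempt tears everything down
    source = persistent if use_persistent else snapshot
    active.update(source)    # undo: put configuration back from the source
    return active


# Assumed scenario: vlan5 was created after boot and never made persistent.
persistent = {"aggr1": ("bge0", "bge1")}
running = {"aggr1": ("bge0", "bge1"), "vlan5": ("bge0",)}

after_snapshot = failed_offline(dict(running), persistent, use_persistent=False)
after_persistent = failed_offline(dict(running), persistent, use_persistent=True)

# Links present before the failed offline but missing after a persistent restore.
lost = sorted(set(after_snapshot) - set(after_persistent))
```

In this assumed scenario the snapshot-based undo returns the system to what
was running, while the persistent-based undo silently drops the link that was
added since boot, so `lost` ends up holding just "vlan5".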

-- 
meem
