On 11/10/2016 07:22 AM, Brent Eagles wrote:
> Hi all,
>
> A recent critical issue has come up that has compelled me to propose
> reconsidering our default and OVS based network configuration
> examples:
>
> https://bugs.launchpad.net/tripleo/+bug/1640812 - Network connectivity
> lost on node reboot
>
> I've been thinking about it for a while, but you could say this bug was
> the "last straw".
>
> While the precise root cause of this issue is still in question, part
> of the problem is that the overcloud nodes communicate with the
> undercloud and each other through an OVS bridge which is also used by
> the overcloud neutron service for external network traffic. For several
> valid reasons, neutron sets the OVS bridge fail_mode to secure (details
> in respective man pages, etc, etc). This mode is stored persistently so
> when the system is rebooted, the bridge is recreated with the secure
> fail_mode in place, blocking network traffic - including DHCP - until
> something comes along and starts setting up flow rules to allow traffic
> to flow. Without an IP address, the node is effectively "unplugged".
> For some reason this isn't happening 100% of the time on the current
> version of CentOS (7.2), but seems to be pretty much 100% on RHEL 7.3.
>
> It raises the question of whether it is valid for neutron to modify an
> OVS bridge that it *did not create* in a fundamental way like this. If
> so, it implies a contract between the deployer and neutron that the
> deployer can make "no assumptions" about what will happen with the
> bridge once neutron has been configured to access it. If this implied
> contract is valid, required and acceptable, then bridges used for
> neutron should not be used for anything else. The implication with
> respect to tripleo is that we should reconsider how we use OVS bridges
> for network configuration in the overcloud.
> For example, in single NIC situations, instead of having:
>
> (tripleo configured)
> - eth0
> - br-ex - used for control plane access, internal api, management,
>   external, etc. Neutron is also configured to use this for the
>   external traffic (e.g. dataplane) in our defaults, which is why the
>   fail_mode gets altered
>
> (neutron configured)
> - br-int
> - br-tun
>
> we would have something like:
>
> (tripleo configured)
> - eth0
> - br-ctl - used as br-ex is currently used except neutron knows
>   nothing about it.
> - br-ex - patched to br-ctl - ostensibly for external traffic and this
>   is what neutron in the overcloud is configured to use
>
> (neutron configured)
> - br-int
> - br-tun
>
> (In all cases, neutron configures patches, etc. between bridges *it
> knows about* as needed. That is, in the second case, tripleo would
> configure the patch between br-ctl and br-ex)
>
> At the cost of an extra bridge (ovs bridge to ovs bridge with patch
> ports is allegedly cheap btw) we get:
> 1. an independently configured bridge for overcloud traffic, which
>    insulates non-tenant node traffic against changes to neutron,
>    including upgrades, neutron bugs, etc.
> 2. insulation of neutron from changes to the underlying network that
>    it doesn't "care" about.
> 3. In OVS only environments, the only difference between a single nic
>    environment and one where there is a dedicated nic for external
>    traffic is that, instead of being connected by a patch port from
>    br-ctl, br-ex is directly connected to the nic for the external
>    traffic.
>
> Even without the issue that instigated this message, I think that this
> is a change worth considering.
>
> Cheers,
>
> Brent
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
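
To make the second layout in the quoted proposal concrete, here is a rough sketch that builds (but does not run) the ovs-vsctl commands for a single-NIC node: eth0 attached to br-ctl, and br-ex joined to br-ctl by a patch-port pair. The patch port names (ctl-to-ex, ex-to-ctl) and the function names are made up for illustration, and a real deployment would presumably express this through TripleO's network configuration templates rather than by hand.

```python
# Illustrative sketch only: builds ovs-vsctl command lines for the proposed
# single-NIC layout without executing them. The patch port names are
# hypothetical and not taken from any TripleO template.


def patch_pair(bridge_a, port_a, bridge_b, port_b):
    """Return the two commands that join bridge_a and bridge_b with patch ports."""
    def one_side(bridge, port, peer):
        return [
            "ovs-vsctl", "add-port", bridge, port,
            "--", "set", "interface", port,
            "type=patch", f"options:peer={peer}",
        ]

    return [
        one_side(bridge_a, port_a, port_b),
        one_side(bridge_b, port_b, port_a),
    ]


def proposed_single_nic_layout(nic="eth0", control_bridge="br-ctl",
                               external_bridge="br-ex"):
    """Commands for: nic on the control bridge, external bridge patched to it."""
    commands = [
        # Control-plane bridge that neutron is never told about.
        ["ovs-vsctl", "add-br", control_bridge],
        ["ovs-vsctl", "add-port", control_bridge, nic],
        # Bridge handed to neutron for external traffic; it owns no NIC.
        ["ovs-vsctl", "add-br", external_bridge],
    ]
    commands += patch_pair(control_bridge, "ctl-to-ex",
                           external_bridge, "ex-to-ctl")
    return commands


if __name__ == "__main__":
    for command in proposed_single_nic_layout():
        print(" ".join(command))
```

In this arrangement, a fail_mode of secure set on br-ex would leave br-ctl, and with it the NIC carrying control plane traffic, with whatever fail_mode the deployer chose.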

Brent,

Thanks for taking the time to analyze this situation. I see a couple of
potential issues with the topology you are suggesting.

First of all, what about the scenario where a system has only 2x10Gb
NICs, and the operator wishes to bond these together on a single bridge?
If we require a separate bridge for Neutron from the one used for the
control plane, then it would be impossible to configure a system with
only 2 NICs in a fault-tolerant way.

Second, there will be a large percentage of users who already have a
shared br-ex and wish to upgrade. Do we tell them that, due to an
architectural change, they must now redeploy a new cloud with a new
topology to use the latest version?

So while I would be on board with changing our default for new
installations, I don't think that relieves us of the responsibility to
figure out how to handle the edge cases where a separate bridge is not
feasible.

-- 
Dan Sneddon         |  Senior Principal OpenStack Engineer
dsned...@redhat.com |  redhat.com/openstack
dsneddon:irc        |  @dxs:twitter