Could this be caused by a case mismatch between the MAC address as it exists in the database and the MAC that comes from the agent?
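To make the question concrete, here is a minimal Python sketch of a lookup that tries the MAC first and falls back to the interface name. It is an illustration only, not the actual nailgun code: find_interface, normalize_mac, and the dict layout are hypothetical names made up for this example. With a case-sensitive comparison the MAC lookup misses and the name fallback picks the wrong record; lower-casing both sides before comparing picks the right one.

```python
def normalize_mac(mac):
    """Lower-case a MAC so that 00:AA and 00:aa compare equal."""
    return mac.strip().lower()


def find_interface(db_interfaces, agent_iface, normalize=True):
    """Match an agent-reported interface to a DB record: MAC first, then name.

    Hypothetical helper for illustration; records are plain dicts.
    """
    key = normalize_mac if normalize else (lambda m: m)
    # First pass: match on MAC address.
    for record in db_interfaces:
        if key(record["mac"]) == key(agent_iface["mac"]):
            return record
    # Second pass: fall back to matching on interface name.
    for record in db_interfaces:
        if record["name"] == agent_iface["name"]:
            return record
    return None


# Database state after first boot (lower-case MACs).
db = [
    {"id": 1, "name": "eth0", "mac": "00:aa"},
    {"id": 2, "name": "eth1", "mac": "00:bb"},
]

# Agent report after a reboot with the NICs enumerated in a different
# order (upper-case MACs).
agent = [
    {"name": "eth0", "mac": "00:BB"},
    {"name": "eth1", "mac": "00:AA"},
]

for iface in agent:
    naive = find_interface(db, iface, normalize=False)
    fixed = find_interface(db, iface, normalize=True)
    print(iface["name"], iface["mac"],
          "case-sensitive -> record", naive["id"],
          "| normalized -> record", fixed["id"])
```

Normalizing both sides, rather than only the agent data, is just the simplest assumption for the sketch; it avoids depending on how the MAC happens to be stored.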
When the interfaces are updated with data from the agent, we first attempt to match the MAC to an existing interface (https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/network/manager.py#L682-L690). If that doesn't work, we attempt to match by name. Looking at the data that comes from the agent, the MAC is always upper-case, while in the database it's lower-case. It seems like the MAC check will fail and we'll fall through to matching by name.

If the interfaces haven't been reordered, it doesn't matter whether we match on name or MAC. However, if the order has changed, we'll have an issue. When the interfaces are matched by name, they'll be updated with the agent info. Because we matched by name, the name stays the same and we update the MAC instead, which isn't what we want.

e.g. First boot:

1 | eth0 | 00:aa
2 | eth1 | 00:bb

If the interface order is changed we'll have (as sent by the agent):

eth0 (00:BB)
eth1 (00:AA)

Because the MAC case doesn't match, we'll end up matching by name. This means we update the wrong database record. We have:

1 | eth0 | 00:bb
2 | eth1 | 00:aa

Instead of:

1 | eth1 | 00:aa
2 | eth0 | 00:bb

On Thu, Nov 20, 2014 at 4:29 PM, Andrew Woodward <xar...@gmail.com> wrote:
> In order for this to occur, this means that the node has to be
> bootstrapped and discover to nailgun, added to a cluster, and then
> bootstrap again (reboot) and have the agent update with a different
> nic order?
>
> i think the issue will only occur when networks are mapped to the
> interfaces, in this case the root cause is that the ethX name is used
> as the key attribute for updates, but really the mac should be the
> real key. If we change this behavior, then we should be able to have
> it update properly regardless of the current interface name.
>
> On Thu, Nov 20, 2014 at 12:01 PM, Dmitriy Shulyak <dshul...@mirantis.com>
> wrote:
> > Hi folks,
> >
> > There was interesting research today on random nics ordering for nodes in
> > bootstrap stage.
> > And in my opinion it requires separate thread...
> > I will try to describe what the problem is and several ways to solve it.
> > Maybe i am missing the simple way, if you see it - please participate.
> > Link to LP bug: https://bugs.launchpad.net/fuel/+bug/1394466
> >
> > When a node is booted first time it registers its interfaces in nailgun, see
> > sample of data (only related to discussion parts):
> > - name: eth0
> >   ip: 10.0.0.3/24
> >   mac: 00:00:03
> > - name: eth1
> >   ip: None
> >   mac: 00:00:04
> > * eth0 is admin network interface which was used for initial pxe boot
> >
> > We have networks, for simplicity lets assume there is 2:
> > - admin
> > - public
> > When the node is added to cluster, in general you will see next schema:
> > - name: eth0
> >   ip: 10.0.0.3/24
> >   mac: 00:00:03
> >   networks:
> >     - admin
> >     - public
> > - name: eth1
> >   ip: None
> >   mac: 00:00:04
> >
> > At this stage node is still using default system with bootstrap profile, so
> > there is no custom system with udev rules. And on next reboot there is no
> > way to guarantee that network cards will be discovered by kernel in same
> > order. If network cards is discovered in order that is diffrent from
> > original and nics configuration is updated, it is possible to end up with:
> > - name: eth0
> >   ip: None
> >   mac: 00:00:04
> >   networks:
> >     - admin
> >     - public
> > - name: eth1
> >   mac: 00:00:03
> >   ip: 10.0.0.3/24
> > Here you can see that networks is left connected to eth0 (in db). And
> > ofcourse this schema doesnt reflect physical infrastructure. I hope it is
> > clear now what is the problem.
> > If you want to investigate it yourself, please find db dump in snapshot
> > attached to the bug, you will be able to find described here case.
> > What happens next:
> > 1. netcfg/choose_interface for kernel is misconfigured, and in my example it
> >    will be 00:00:04, but should be 00:00:03
> > 2. network configuration for l23network will be simply corrupted
> >
> > So - possible solutions:
> > 1. Reflect node interfaces ordering, with networks reassignment - Hard and
> >    hackish
> > 2. Do not update any interfaces info if networks assigned to them, then udev
> >    rules will be applied and nics will be reordered into original state - i
> >    would say easy and reliable solution
> > 3. Create cobbler system when node is booted first time, and add udev rules
> >    - it looks to me like proper solution, but requires design
> >
> > Please share your thoughts/ideas, afaik this issue is not rare on scale
> > deployments.
> > Thank you
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStackfirstname.lastname@example.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> --
> Andrew
> Mirantis
> Ceph community