On Wed, Dec 20, 2017 at 9:08 AM, Ben Nemec <[email protected]> wrote:
> On 12/19/2017 05:34 PM, Joe Talerico wrote:
>> On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <[email protected]> wrote:
>>> On 19 December 2017 at 22:23, Brian Haley <[email protected]> wrote:
>>>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>>>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>>>>> The reboot is done (mostly...see below).
>>>>>>>
>>>>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>>>>> Ben - Can you provide some links to the OVS port exhaustion
>>>>>>>> issue for some background?
>>>>>>>
>>>>>>> I don't know if we ever had a bug opened, but there's some
>>>>>>> discussion of it in
>>>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>>>> I've also copied Derek, since I believe he was the one who found
>>>>>>> it originally.
>>>>>>>
>>>>>>> The gist is that after about three months of tripleo-ci running
>>>>>>> in this cloud, we start to hit errors creating instances because
>>>>>>> of problems creating OVS ports on the compute nodes. Sometimes we
>>>>>>> see a huge number of ports in general; other times we see a lot
>>>>>>> of ports that look like this:
>>>>>>>
>>>>>>>     Port "qvod2cade14-7c"
>>>>>>>         tag: 4095
>>>>>>>         Interface "qvod2cade14-7c"
>>>>>>>
>>>>>>> Notably, they all have a tag of 4095, which seems suspicious to
>>>>>>> me. I don't know whether it's actually an issue, though.
>>>>>>
>>>>>> Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag in the
>>>>>> agent.
>>>>>>
>>>>>> The 'qvo' here shows it's part of the veth pair that os-vif
>>>>>> created when it plugged in the VM (the other half is 'qvb'), and
>>>>>> the pair is created so that iptables rules can be applied by
>>>>>> Neutron. It's part of the "old" way of doing security groups with
>>>>>> the OVSHybridIptablesFirewallDriver, and it can eventually go away
>>>>>> once the OVSFirewallDriver can be used everywhere (which requires
>>>>>> a newer OVS and agent).
>>>>>>
>>>>>> I wonder if you can run the ovs_cleanup utility to clean some of
>>>>>> these up?
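For anyone who wants to check a node for these stale ports, a rough
sketch (assuming the integration bridge is the usual br-int; the port
name is just the example from Ben's output above):

    # List ports stuck with the "dead" VLAN tag in the OVS database.
    ovs-vsctl --columns=name find Port tag=4095

    # Drop one stale port by name. The underlying veth pair sticks
    # around in the kernel and would need a separate 'ip link del'.
    ovs-vsctl del-port br-int qvod2cade14-7c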
>>>>> As in neutron-ovs-cleanup? Doesn't that wipe out everything,
>>>>> including any ports that are still in use? Or is there a different
>>>>> tool I'm not aware of that can do more targeted cleanup?
>>>>
>>>> Crap, I thought there was an option to just clean up these dead
>>>> devices. I should have read the code; it's either Neutron ports
>>>> (the default) or all ports. Maybe that should be an option.
>>>
>>> IIRC, neutron-ovs-cleanup was being run following the reboot as part
>>> of an ExecStartPre= on one of the neutron services; this is what
>>> essentially removed the ports for us.
>>
>> There are actually unit files for the cleanup utilities (netns|ovs|lb),
>> specifically for ovs-cleanup [1].
>>
>> Maybe this can be run to mitigate the need for a reboot?
>
> That's what Brian suggested too, but running it with instances on the
> node will cause an outage because it cleans up everything, including
> in-use ports. The reason a reboot works is basically that it causes
> this unit to run when the node comes back up, because it's a
> dependency of the other services. So it's possible we could use this
> to skip the complete reboot, but that's not the time-consuming part of
> the process. It's waiting for all the instances to cycle off so we
> don't cause spurious failures when we wipe the OVS ports. Actually
> rebooting the nodes takes about five minutes (and it's only that long
> because of an old TripleO bug).

ack. There are options you can pass to the cleanup so it doesn't nuke
everything. I wonder if a combination of ovs-cleanup + restarting the
ovs-agent would do it?
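Something along these lines, maybe (untested on rh1, and assuming the
service names and config paths from the unit file quoted below; the
option I have in mind is ovs_all_ports, which defaults to false so that
only Neutron-created ports are removed rather than literally everything
on the bridges):

    # Stop the agent so it doesn't race the cleanup.
    systemctl stop neutron-openvswitch-agent.service

    # Default behaviour (ovs_all_ports=false): remove only the ports
    # Neutron created. Per Ben's point, on a node with running
    # instances that still includes in-use ports.
    neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini

    # Bring the agent back.
    systemctl start neutron-openvswitch-agent.service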
Anyway, it doesn't seem like that big of a problem, then. /me gets off
his uptime soapbox

Joe

>> [1]
>> [Unit]
>> Description=OpenStack Neutron Open vSwitch Cleanup Utility
>> After=syslog.target network.target openvswitch.service
>> Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service neutron-l3-agent.service openstack-nova-compute.service
>>
>> [Service]
>> Type=oneshot
>> User=neutron
>> ExecStart=/usr/bin/neutron-ovs-cleanup \
>>     --config-file /usr/share/neutron/neutron-dist.conf \
>>     --config-file /etc/neutron/neutron.conf \
>>     --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini \
>>     --config-dir /etc/neutron/conf.d/common \
>>     --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup \
>>     --log-file /var/log/neutron/ovs-cleanup.log
>> ExecStop=/usr/bin/neutron-ovs-cleanup \
>>     --config-file /usr/share/neutron/neutron-dist.conf \
>>     --config-file /etc/neutron/neutron.conf \
>>     --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini \
>>     --config-dir /etc/neutron/conf.d/common \
>>     --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup \
>>     --log-file /var/log/neutron/ovs-cleanup.log
>> PrivateTmp=true
>> RemainAfterExit=yes
>>
>> [Install]
>> WantedBy=multi-user.target
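One note on that unit, if I'm reading it right: Type=oneshot together
with RemainAfterExit=yes means the cleanup fires once at boot, and the
Before= line orders it ahead of the agents and nova-compute, which is
why a reboot effectively does the cleanup for free. Re-running it on a
live node would be something like the following (assuming the unit is
installed as neutron-ovs-cleanup.service, with the same in-use-port
caveat as above):

    # restart re-runs ExecStop/ExecStart for a RemainAfterExit oneshot
    systemctl restart neutron-ovs-cleanup.service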
>>>> -Brian
>>>>
>>>>> Oh, also worth noting that I don't think we have os-vif in this
>>>>> cloud because it's so old. There's no os-vif package installed,
>>>>> anyway.
>>>>>
>>>>>> -Brian
>>>>>>
>>>>>>> I've had some offline discussions about getting someone on this
>>>>>>> cloud to debug the problem. Originally we decided not to pursue
>>>>>>> it, since it's not hard to work around and we didn't want to
>>>>>>> disrupt the environment by trying to move to later OpenStack
>>>>>>> code (we're still back on Mitaka), but it was pointed out to me
>>>>>>> this time around that from a downstream perspective we have
>>>>>>> users on older code as well, and it may be worth debugging to
>>>>>>> make sure they don't hit similar problems.
>>>>>>>
>>>>>>> To that end, I've left one compute node un-rebooted for
>>>>>>> debugging purposes. The downstream discussion is ongoing, but
>>>>>>> I'll update here if we find anything.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It's that magical time again. You know the one, when we reboot
>>>>>>>>> rh1 to avoid OVS port exhaustion. :-)
>>>>>>>>>
>>>>>>>>> If all goes well you won't even notice that this is happening,
>>>>>>>>> but there is the possibility that a few jobs will fail while
>>>>>>>>> the te-broker host is rebooted, so I wanted to let everyone
>>>>>>>>> know. If you notice anything else hosted in rh1 that is down
>>>>>>>>> (tripleo.org, zuul-status, etc.), let me know. I have been
>>>>>>>>> known to forget to restart services after the reboot.
>>>>>>>>>
>>>>>>>>> I'll send a followup when I'm done.
>>>>>>>>>
>>>>>>>>> -Ben
