On 12/19/2017 05:34 PM, Joe Talerico wrote:
On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <[email protected]> wrote:
On 19 December 2017 at 22:23, Brian Haley <[email protected]> wrote:
On 12/19/2017 04:00 PM, Ben Nemec wrote:
On 12/19/2017 02:43 PM, Brian Haley wrote:
On 12/19/2017 11:53 AM, Ben Nemec wrote:
The reboot is done (mostly...see below).
On 12/18/2017 05:11 PM, Joe Talerico wrote:
Ben - Can you provide some links to the OVS port exhaustion issue for
some background?
I don't know if we ever had a bug opened, but there's some discussion
of it in
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
I've also copied Derek since I believe he was the one who found it
originally.
The gist is that after about 3 months of tripleo-ci running in this
cloud, we start to hit errors creating instances because of problems creating
OVS ports on the compute nodes. Sometimes we see a huge number of ports in
general; other times we see a lot of ports that look like this:
Port "qvod2cade14-7c"
tag: 4095
Interface "qvod2cade14-7c"
Notably they all have a tag of 4095, which seems suspicious to me. I
don't know whether it's actually an issue though.
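(As a quick check, these stuck ports can be listed straight from the OVS
database; a minimal sketch, assuming ovs-vsctl is available on the compute
node:)

  # list ports carrying the "dead" tag 4095
  ovs-vsctl --columns=name find Port tag=4095
  # or just count them
  ovs-vsctl --columns=name find Port tag=4095 | grep -c '^name'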
Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag in the agent.
The 'qvo' here shows it's part of the veth pair that os-vif created when
it plugged in the VM (the other half is 'qvb'); the pair is created so that
iptables rules can be applied by neutron. It's part of the "old" way to do
security groups with the OVSHybridIptablesFirewallDriver, and it can
eventually go away once the OVSFirewallDriver can be used everywhere (which
requires a newer OVS and agent).
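(For reference, that driver switch is a one-line change in the L2 agent
config; a sketch only, since it assumes an OVS new enough for conntrack
support, which this cloud predates:)

  # /etc/neutron/plugins/ml2/openvswitch_agent.ini
  [securitygroup]
  firewall_driver = openvswitch     # native OVSFirewallDriver
  # firewall_driver = iptables_hybrid is the "old" qvo/qvb veth approach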
I wonder if you can run the ovs_cleanup utility to clean some of these
up?
As in neutron-ovs-cleanup? Doesn't that wipe out everything, including
any ports that are still in use? Or is there a different tool I'm not aware
of that can do more targeted cleanup?
Crap, I thought there was an option to just clean up these dead devices. I
should have read the code; it's either neutron ports (the default) or all
ports. Maybe that should be an option.
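(To spell out the two modes the code does support; the flag comes from the
cleanup utility's own config options, so treat the exact spelling as an
assumption:)

  # default: delete only neutron-created ports on the integration/external bridges
  neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf
  # delete every port on every OVS bridge
  neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf --ovs_all_ports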
IIRC neutron-ovs-cleanup was being run following the reboot as part of an
ExecStartPre= on one of the neutron services; this is what essentially
removed the ports for us.
There are actually unit files for cleanup (netns|ovs|lb), specifically
for ovs-cleanup [1].
Maybe this can be run to mitigate the need for a reboot?
That's what Brian suggested too, but running it with instances on the
node will cause an outage because it cleans up everything, including
in-use ports. The reason a reboot works is basically that it causes
this unit to run when the node comes back up because it's a dep of the
other services. So it's possible we could use this to skip the complete
reboot, but the reboot isn't the time-consuming part of the process. It's
waiting for all the instances to cycle off so we don't cause spurious
failures when we wipe the OVS ports. Actually rebooting the nodes takes
about five minutes (and it's only that long because of an old TripleO bug).
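(A rough sketch of that drain-then-clean flow, with a placeholder hostname
and assuming admin credentials; the wait is the slow part, exactly as
described above:)

  # stop scheduling new instances onto the node
  nova service-disable compute-0.rh1.example.com nova-compute
  # wait for the existing instances to cycle off
  nova list --all-tenants --host compute-0.rh1.example.com
  # once empty, re-run the cleanup unit from [1] below instead of rebooting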
[1]

[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup \
    --config-file /usr/share/neutron/neutron-dist.conf \
    --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini \
    --config-dir /etc/neutron/conf.d/common \
    --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup \
    --log-file /var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup \
    --config-file /usr/share/neutron/neutron-dist.conf \
    --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini \
    --config-dir /etc/neutron/conf.d/common \
    --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup \
    --log-file /var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
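(Since the unit is Type=oneshot with RemainAfterExit=yes, it stays "active"
after boot and a plain start would be a no-op; re-running it later would
look something like this, with output going to the --log-file above:)

  sudo systemctl restart neutron-ovs-cleanup
  sudo tail /var/log/neutron/ovs-cleanup.log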
-Brian
Oh, also worth noting that I don't think we have os-vif in this cloud
because it's so old. There's no os-vif package installed anyway.
-Brian
I've had some offline discussions about getting someone on this cloud
to debug the problem. Originally we decided not to pursue it since it's not
hard to work around and we didn't want to disrupt the environment by trying
to move to later OpenStack code (we're still back on Mitaka), but it was
pointed out to me this time around that from a downstream perspective we
have users on older code as well and it may be worth debugging to make sure
they don't hit similar problems.
To that end, I've left one compute node un-rebooted for debugging
purposes. The downstream discussion is ongoing, but I'll update here if we
find anything.
Thanks,
Joe
On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec <[email protected]> wrote:
Hi,
It's that magical time again. You know the one, when we reboot rh1 to avoid
OVS port exhaustion. :-)

If all goes well you won't even notice that this is happening, but there is
the possibility that a few jobs will fail while the te-broker host is
rebooted, so I wanted to let everyone know. If you notice anything else
hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know. I have
been known to forget to restart services after the reboot.
I'll send a followup when I'm done.
-Ben
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev