Re: [Openstack-operators] [Openstack] Recovering from full outage

Torin Woltjer Fri, 06 Jul 2018 11:17:32 -0700

I created a second "selfservice" VXLAN network to see whether DHCP would work 
on it as it does on my external "provider" network. The new VXLAN network has 
the same problem as the old one. Is the problem with VXLAN in particular?


Torin Woltjer

Grand Dial Communications - A ZK Tech Inc. Company

616.776.1066 ext. 2006
www.granddial.com

----------------------------------------
From: "Torin Woltjer" <[email protected]>
Sent: 7/6/18 12:05 PM
To: <[email protected]>
Cc: [email protected], [email protected]
Subject: Re: [Openstack] Recovering from full outage
Interestingly, I can ping the neutron router at 172.16.1.1 just fine, but 
pinging DHCP (located at 172.16.1.2 and 172.16.1.3) fails. The instance that I 
manually added the IP address to has a floating IP, and oddly enough I am able 
to ping DHCP on the provider network, which suggests that DHCP may be working 
on other networks but not on my selfservice network. I confirmed this by 
creating a new virtual machine directly on the provider network: it obtained 
the proper address on its own, and I was able to ping it and SSH into it right 
off the bat.
"/var/lib/neutron/dhcp/d85c2a00-a637-4109-83f0-7c2949be4cad/leases" is empty. 
"/var/lib/neutron/dhcp/d85c2a00-a637-4109-83f0-7c2949be4cad/host" contains:
fa:16:3e:3f:94:17,host-172-16-1-8.openstacklocal,172.16.1.8
fa:16:3e:e0:57:e7,host-172-16-1-7.openstacklocal,172.16.1.7
fa:16:3e:db:a7:cb,host-172-16-1-12.openstacklocal,172.16.1.12
fa:16:3e:f8:10:99,host-172-16-1-10.openstacklocal,172.16.1.10
fa:16:3e:a7:82:4c,host-172-16-1-3.openstacklocal,172.16.1.3
fa:16:3e:f8:23:1d,host-172-16-1-14.openstacklocal,172.16.1.14
fa:16:3e:63:53:a4,host-172-16-1-1.openstacklocal,172.16.1.1
fa:16:3e:b7:41:a8,host-172-16-1-2.openstacklocal,172.16.1.2
fa:16:3e:5e:25:5f,host-172-16-1-4.openstacklocal,172.16.1.4
fa:16:3e:3a:a2:53,host-172-16-1-100.openstacklocal,172.16.1.100
fa:16:3e:46:39:e2,host-172-16-1-13.openstacklocal,172.16.1.13
fa:16:3e:06:de:e0,host-172-16-1-18.openstacklocal,172.16.1.18
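
One way to check that an instance's port has an entry in a listing like the one 
above. This is only a minimal sketch, assuming the file keeps the 
comma-separated MAC,hostname,IP layout shown; `lookup_dhcp_entry` is a 
hypothetical helper name, not a neutron or dnsmasq tool.

```shell
# Hypothetical helper (not part of neutron or dnsmasq).
# $1 = MAC address to look for, $2 = path to a file of MAC,hostname,IP lines.
# Prints "hostname IP" for the first match; returns non-zero if none is found.
lookup_dhcp_entry() {
    awk -F',' -v mac="$1" '
        # Compare the first field case-insensitively against the wanted MAC.
        tolower($1) == tolower(mac) { print $2, $3; found = 1; exit }
        END { exit !found }
    ' "$2"
}
```

For the listing above, `lookup_dhcp_entry fa:16:3e:3f:94:17 <file>` would print 
`host-172-16-1-8.openstacklocal 172.16.1.8`, and a MAC with no entry would give 
a non-zero exit status.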

I've done system restarts since the power outage and the agent hasn't corrected 
itself. I've restarted all neutron services as I've gone along; I could also 
try stopping and starting dnsmasq.

Torin Woltjer


----------------------------------------
From: George Mihaiescu <[email protected]>
Sent: 7/6/18 11:15 AM
To: [email protected]
Cc: "[email protected]" <[email protected]>, 
"[email protected]" 
<[email protected]>, [email protected]
Subject: Re: [Openstack] Recovering from full outage
Can you manually assign an IP address to a VM and, once inside, ping the 
address of the DHCP server?
That would at least confirm whether there is connectivity.

Also, on the controller node where the dhcp server for that network is, check 
the "/var/lib/neutron/dhcp/d85c2a00-a637-4109-83f0-7c2949be4cad/leases" and 
make sure there are entries corresponding to your instances.

In my experience, if neutron is broken after working fine (so excluding any 
misconfiguration), then an agent is out of sync and a restart usually fixes 
things.

On Fri, Jul 6, 2018 at 9:38 AM, Torin Woltjer <[email protected]> 
wrote:
I have done tcpdumps on both the controllers and on a compute node.
Controller:
`ip netns exec qdhcp-d85c2a00-a637-4109-83f0-7c2949be4cad tcpdump -vnes0 -i 
ns-83d68c76-b8 port 67`
`tcpdump -vnes0 -i any port 67`
Compute:
`tcpdump -vnes0 -i brqd85c2a00-a6 port 68`

For the first command on the controller, there are no packets captured at all. 
The second command on the controller captures packets, but they don't appear to 
be relevant to openstack. The dump from the compute node shows constant 
requests are getting sent by openstack instances.

In summary: DHCP requests are being sent, but are never received.
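
The capture commands above follow a naming pattern visible in this thread: the 
namespace is `qdhcp-` plus the network ID, and the bridge is `brq` plus the 
first 11 characters of the network ID. Below is a minimal sketch that only 
prints the two commands for a given network, so they can be reviewed before 
being run as root. `dhcp_capture_cmds` is a hypothetical helper name, and the 
`ns-` interface name still has to be looked up inside the namespace (for 
example with `ip netns exec <namespace> ip addr`).

```shell
# Hypothetical helper: print (do not run) the DHCP capture commands for one
# network. $1 = neutron network ID, $2 = interface name inside the qdhcp netns.
dhcp_capture_cmds() {
    # Bridge name as seen in this thread: "brq" + first 11 chars of the ID.
    bridge="brq$(printf '%s' "$1" | cut -c1-11)"
    # Controller side: DHCP server port 67, inside the DHCP namespace.
    printf 'ip netns exec qdhcp-%s tcpdump -vnes0 -i %s port 67\n' "$1" "$2"
    # Compute side: DHCP client port 68, on the network's bridge.
    printf 'tcpdump -vnes0 -i %s port 68\n' "$bridge"
}

dhcp_capture_cmds d85c2a00-a637-4109-83f0-7c2949be4cad ns-83d68c76-b8
```

Printing rather than executing keeps the sketch safe to run anywhere; the 
output can be pasted into a root shell on the relevant node.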

Torin Woltjer


----------------------------------------
From: George Mihaiescu <[email protected]>
Sent: 7/5/18 4:50 PM
To: [email protected]
Subject: Re: [Openstack] Recovering from full outage

cloud-init requires network connectivity by default in order to reach the 
metadata server for the hostname, ssh-key, etc.

You can configure cloud-init to use the config-drive, but the lack of network 
connectivity will make the instance useless anyway, even though it will have 
your ssh-key and hostname...

Did you check the things I told you?

On Jul 5, 2018, at 16:06, Torin Woltjer <[email protected]> wrote:

Are IP addresses set by cloud-init on boot? I noticed that cloud-init isn't 
working on my VMs. I created a new instance from an Ubuntu 18.04 image to test 
with; the hostname was not set to the name of the instance and I could not log 
in as the users I had specified in the configuration.

Torin Woltjer


----------------------------------------
From: George Mihaiescu <[email protected]>
Sent: 7/5/18 12:57 PM
To: [email protected]
Cc: "[email protected]" <[email protected]>, 
"[email protected]" 
<[email protected]>
Subject: Re: [Openstack] Recovering from full outage
You should tcpdump inside the qdhcp namespace to see if the requests make it 
there, and also check iptables rules on the compute nodes for the return 
traffic.

On Thu, Jul 5, 2018 at 12:39 PM, Torin Woltjer <[email protected]> 
wrote:
Yes, I've done this. The VMs hang for a while waiting for DHCP and eventually 
come up with no addresses. neutron-dhcp-agent has been restarted on both 
controllers. The qdhcp netns's were all present; I stopped the service, removed 
the qdhcp netns's, confirmed with `neutron agent-list` that the DHCP agents 
showed offline, restarted all neutron services, confirmed that the qdhcp 
netns's were recreated, then restarted a VM, and it still fails to pull an IP 
address.

Torin Woltjer


----------------------------------------
From: George Mihaiescu <[email protected]>
Sent: 7/5/18 10:38 AM
To: [email protected]
Subject: Re: [Openstack] Recovering from full outage
Did you restart the neutron-dhcp-agent and reboot the VMs?

On Thu, Jul 5, 2018 at 10:30 AM, Torin Woltjer <[email protected]> 
wrote:
The qrouter netns appears once the lock_path is specified, and the neutron 
router is pingable as well. However, instances are not pingable. If I log in 
via console, the instances have not been given IP addresses; if I manually give 
them an address and route, they are pingable and seem to work. So the router is 
working correctly but DHCP is not.

No errors in any of the neutron or nova logs on controllers or compute nodes.

Torin Woltjer


----------------------------------------
From: "Torin Woltjer" <[email protected]>
Sent: 7/5/18 8:53 AM
To: <[email protected]>
Cc: [email protected], [email protected]
Subject: Re: [Openstack] Recovering from full outage
There is no lock path set in my neutron configuration. Does it ultimately 
matter what it is set to as long as it is consistent? Does it need to be set on 
compute nodes as well as controllers?
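
Whether lock_path is currently set can be checked on each controller and 
compute node before changing anything. The following is a minimal sketch, 
assuming the option lives in the `[oslo_concurrency]` section of 
`/etc/neutron/neutron.conf`, as in the OpenStack installation guide (which uses 
`/var/lib/neutron/tmp`). `get_lock_path` is a hypothetical helper name, and the 
config path may differ on this deployment.

```shell
# Hypothetical helper: print the lock_path value from the [oslo_concurrency]
# section of an INI-style config file ($1); returns non-zero if it is not set.
get_lock_path() {
    awk '
        # Track whether the current section header is [oslo_concurrency].
        /^\[/ { in_section = ($0 ~ /^\[oslo_concurrency\]/); next }
        # Inside that section, match an uncommented "lock_path = value" line.
        in_section && /^[ \t]*lock_path[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # strip everything up to the value
            print
            found = 1
            exit
        }
        END { exit !found }
    ' "$1"
}
```

Running `get_lock_path /etc/neutron/neutron.conf` on every node and comparing 
the output is one way to confirm the setting is present and consistent.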

Torin Woltjer


----------------------------------------
From: George Mihaiescu <[email protected]>
Sent: 7/3/18 7:47 PM
To: [email protected]
Cc: [email protected], [email protected]
Subject: Re: [Openstack] Recovering from full outage

Did you set a lock_path in the neutron’s config?

On Jul 3, 2018, at 17:34, Torin Woltjer <[email protected]> wrote:

The following errors appear in the neutron-linuxbridge-agent.log on both 
controllers: http://paste.openstack.org/show/724930/

No such errors are on the compute nodes themselves.

Torin Woltjer


----------------------------------------
From: "Torin Woltjer" <[email protected]>
Sent: 7/3/18 5:14 PM
To: <[email protected]>
Cc: "[email protected]" 
<[email protected]>, "[email protected]" 
<[email protected]>
Subject: Re: [Openstack] Recovering from full outage
Running `openstack server reboot` on an instance just causes the instance to be 
stuck in a rebooting status. Most notable of the logs is neutron-server.log 
which shows the following:
http://paste.openstack.org/show/724917/

I realized that rabbitmq was in a failed state, so I bootstrapped it, rebooted 
controllers, and all of the agents show online.
http://paste.openstack.org/show/724921/
All of the instances can be started properly; however, I cannot ping any of 
the instances' floating IPs or the neutron router. When logging into an 
instance with the console, there is no IP address on any interface.

Torin Woltjer


----------------------------------------
From: George Mihaiescu <[email protected]>
Sent: 7/3/18 11:50 AM
To: [email protected]
Subject: Re: [Openstack] Recovering from full outage
Try restarting them using "openstack server reboot" and also check the 
nova-compute.log and neutron agents logs on the compute nodes.

On Tue, Jul 3, 2018 at 11:28 AM, Torin Woltjer <[email protected]> 
wrote:
We just suffered a power outage in our data center and I'm having trouble 
recovering the OpenStack cluster. All of the nodes are back online and every 
instance shows active, but `virsh list --all` on the compute nodes shows that 
all of the VMs are actually shut down. Running `ip addr` on any of the nodes 
shows that none of the bridges are present and `ip netns` shows that all of the 
network namespaces are missing as well. So despite all of the neutron services 
running, none of the networking appears to be active, which is concerning. How 
do I solve this without recreating all of the networks?

Torin Woltjer


