On 03/21/2014 08:41 AM, Solly Ross wrote:
> Well, if messages are getting dropped on the floor due to communication 
> issues, that's not a good thing.
> If you have time, could you determine why the messages are getting dropped on 
> the floor?  We shouldn't be
> doing things that require both the controller and compute nodes until we have 
> a connection.

Currently, when doing a reboot, we set the REBOOTING task_state in the database 
in compute-api and then send an RPC cast. That seems awfully risky: if that 
message gets lost or the cast fails for any reason, the instance could end up 
stuck in the REBOOTING state forever.  I think it might make sense to have the 
power state audit clear the REBOOTING state if appropriate, but others with 
more experience should make that call.
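For reference, here's a rough sketch of the pattern I'm describing and of what
I have in mind for the audit.  The names and the staleness threshold are
illustrative only, not the actual nova code paths:

    import time

    REBOOTING = 'rebooting'
    REBOOT_TIMEOUT = 300  # hypothetical staleness threshold, in seconds

    def api_reboot(db, rpc, instance, reboot_type='SOFT'):
        # compute-api: the task_state is written first, then the cast is
        # sent.  If the cast never reaches the compute node, nothing ever
        # resets task_state.
        db.instance_update(instance['uuid'], {'task_state': REBOOTING})
        rpc.cast('compute', 'reboot_instance',
                 instance=instance, reboot_type=reboot_type)

    def power_state_audit(db, driver, instances):
        # Hypothetical extension of the periodic power-state audit: clear a
        # REBOOTING task_state that has been sitting around too long while
        # the hypervisor reports the guest is simply running.
        for inst in instances:
            stale = (time.time() - inst['updated_at']) > REBOOT_TIMEOUT
            if inst['task_state'] == REBOOTING and stale:
                if driver.get_power_state(inst) == 'running':
                    db.instance_update(inst['uuid'], {'task_state': None})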


The timeline that I have looks like this.  We had some buggy code that issued 
reboots for all of the instances when the controller came up.  The first two 
requests are in the controller logs below; these are the ones that failed.


controller: (running everything but nova-compute)
nova-api log:

/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.712 8187 INFO 
nova.compute.api [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:23.898 8187 INFO 
nova.osapi_compute.wsgi.server [req-a84e25bd-85b4-478c-a845-7e8034df3ab2 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 "POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action
 HTTP/1.1" status: 202 len: 185 time: 0.2299521
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.152 8128 INFO 
nova.compute.api [req-429feb82-a50d-4bf0-a9a4-bca036e55356 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:25.273 8128 INFO 
nova.osapi_compute.wsgi.server [req-429feb82-a50d-4bf0-a9a4-bca036e55356 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 "POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action
 HTTP/1.1" status: 202 len: 185 time: 0.1583798

After this there are reboot requests for the other instances, and those went 
through.


Interestingly, we later see this:
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.476 8134 INFO 
nova.compute.api [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:45.477 8134 INFO 
nova.osapi_compute.wsgi.server [req-2e0b67a0-0cd9-471f-b115-e4f07436f1c4 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 "POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/c967e4ef-8cf4-4fac-8aab-c5ea5c3c3bb4/action
 HTTP/1.1" status: 409 len: 303 time: 0.1177511
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.831 8143 INFO 
nova.compute.api [req-afeb680b-91fd-4446-b4d8-fd264541369d 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] [instance: 
17169e6d-6693-4e95-9900-ba250dad5a39] API::reboot reboot_type=SOFT
/var/log/nova/nova-api.log.2.gz:2014-03-20 11:33:48.832 8143 INFO 
nova.osapi_compute.wsgi.server [req-afeb680b-91fd-4446-b4d8-fd264541369d 
8162b2e247704e218ed13094889a5244 48c9875f2edb4a36bbe598effbe835cf] 
192.168.204.195 "POST 
/v2/48c9875f2edb4a36bbe598effbe835cf/servers/17169e6d-6693-4e95-9900-ba250dad5a39/action
 HTTP/1.1" status: 409 len: 303 time: 0.0366399


Presumably the 409 responses are because nova thinks that these instances are 
currently rebooting.
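Roughly, the guard looks something like this (simplified, not the exact nova
implementation): the point is just that a stuck task_state is enough to
trigger the conflict, even though no reboot is actually in progress:

    class InstanceInvalidState(Exception):
        """Maps to an HTTP 409 Conflict at the API layer."""

    def check_can_reboot(instance):
        # A reboot is only accepted when no other task is in flight.
        # Because the earlier cast left task_state stuck at 'rebooting',
        # this check fails even though the hypervisor never started a
        # reboot.
        if instance['task_state'] is not None:
            raise InstanceInvalidState(
                "Instance %s in task_state %s, cannot reboot"
                % (instance['uuid'], instance['task_state']))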



compute:
2014-03-20 11:33:14.213 12229 INFO nova.openstack.common.rpc.common [-] 
Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.225 12229 INFO nova.openstack.common.rpc.common [-] 
Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.244 12229 INFO nova.openstack.common.rpc.common [-] 
Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:14.246 12229 INFO nova.openstack.common.rpc.common [-] 
Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.234 12229 INFO nova.openstack.common.rpc.common [-] 
Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:26.277 12229 INFO nova.openstack.common.rpc.common [-] 
Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.240 12229 INFO nova.openstack.common.rpc.common [-] 
Reconnecting to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:29.276 12229 INFO nova.openstack.common.rpc.common [-] 
Connected to AMQP server on 192.168.204.2:5672
2014-03-20 11:33:35.871 12229 INFO nova.compute.manager 
[req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 
48c9875f2edb4a36bbe598effbe835cf] [instance: 
74a07f0b-0016-42c4-b625-59ef48254e7e] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:35.871 12229 AUDIT nova.compute.manager 
[req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 
48c9875f2edb4a36bbe598effbe835cf] [instance: 
74a07f0b-0016-42c4-b625-59ef48254e7e] Rebooting instance
2014-03-20 11:33:36.484 12229 INFO nova.virt.libvirt.driver 
[req-a10b008b-c9d0-4f31-8acb-e42fb43b64fe 8162b2e247704e218ed13094889a5244 
48c9875f2edb4a36bbe598effbe835cf] [instance: 
74a07f0b-0016-42c4-b625-59ef48254e7e] LIBVIRT::reboot reboot_type=SOFT
2014-03-20 11:33:38.367 12229 INFO nova.compute.manager 
[req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 
48c9875f2edb4a36bbe598effbe835cf] [instance: 
62e51539-e7a1-4560-9687-6ab07b953b9f] MANAGER::reboot_instance reboot_type=SOFT
2014-03-20 11:33:38.368 12229 AUDIT nova.compute.manager 
[req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 
48c9875f2edb4a36bbe598effbe835cf] [instance: 
62e51539-e7a1-4560-9687-6ab07b953b9f] Rebooting instance
2014-03-20 11:33:38.982 12229 INFO nova.virt.libvirt.driver 
[req-dc31d13e-e331-4ed9-a36f-1a58d363b459 8162b2e247704e218ed13094889a5244 
48c9875f2edb4a36bbe598effbe835cf] [instance: 
62e51539-e7a1-4560-9687-6ab07b953b9f] LIBVIRT::reboot reboot_type=SOFT
<etc>


As you can see, the two requests that got "lost" were sent in the window after 
the first batch of AMQP connections but before the second batch.  I didn't log 
controller-side timestamps for the successful reboot requests, but they came 
after the two failed ones.
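To make the failure mode concrete, here's a standalone snippet using pika
directly (not nova's RPC layer) that shows why a plain AMQP publish can vanish
silently, and one way a sender could detect it.  The queue name and payload
are made up, and it assumes a recent pika with publisher confirms:

    import pika

    params = pika.ConnectionParameters(host='192.168.204.2', port=5672)
    connection = pika.BlockingConnection(params)
    channel = connection.channel()

    # Without confirms, basic_publish returns immediately whether or not
    # anything was listening on the target queue.
    channel.confirm_delivery()

    try:
        channel.basic_publish(exchange='',
                              routing_key='compute',  # hypothetical queue
                              body='{"method": "reboot_instance"}',
                              mandatory=True)
    except pika.exceptions.UnroutableError:
        # The broker had nowhere to deliver the message, so the caller now
        # knows the cast was dropped instead of assuming it went through.
        print("reboot cast was not delivered")
    finally:
        connection.close()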

Thanks,
Chris
