Re: [openstack-dev] [tripleo] Suggestions for OOO

Dmitry Tantsur Tue, 11 Oct 2016 03:43:50 -0700

On 10/11/2016 02:39 AM, Joe Talerico wrote:

Hey all,
The past couple of days I have been making comments on IRC to discuss some
of the issues I have bumped into when scaling Newton to > 30 compute
nodes.


- `bulk import`, the operation to go from enroll -> manage can take
20-30 minutes to complete. Can we have this be a non-blocking
operation with a message to the user that they cannot continue until
the nodes they want to deploy on go from enroll->manage?

The only thing that enroll->manage does is to check the power credentials. It should never take more than 30-60 seconds (and even this is too much, and might be a sign of problems with the environment). I suspect that the workflow processes nodes sequentially, though, hence these 30-60 seconds multiply by the number of nodes. If so, the workflow definitely needs fixing.
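
As a rough check of that suspicion: 30 nodes at 30-60 seconds each, processed one after another, come to 15-30 minutes, which is in the same range as the reported 20-30 minutes. The bash sketch below only illustrates the difference between the two approaches. `check_power` is a hypothetical stand-in for the per-node credential check (it is not an Ironic or TripleO command), and `CHECK_SECONDS` is an assumed knob that simulates how long one check takes.

```shell
#!/usr/bin/env bash
# Sketch only: check_power is a hypothetical placeholder for the per-node
# power credential check. It is NOT a real Ironic or TripleO command.
check_power() {
    sleep "${CHECK_SECONDS:-1}"   # simulate the time one check takes
    echo "$1 ok"
}

# Sequential: total time grows as (number of nodes) x (time per check).
check_sequential() {
    local node
    for node in "$@"; do
        check_power "$node"
    done
}

# Concurrent: start every check in the background, then wait for all of them.
# Total time is roughly that of the slowest single check.
check_concurrent() {
    local node
    for node in "$@"; do
        check_power "$node" &
    done
    wait
}
```

Running `time check_sequential node-1 node-2 node-3` and then `time check_concurrent node-1 node-2 node-3` should show roughly three seconds against roughly one with the default one-second check. The order of the output lines in the concurrent case is not guaranteed.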

- overcloud deploy - when pxe completes I have seen a handful of
nodes not reboot, or just get jammed up in the pxe screen. When this
occurs I run:
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') off
    fi
  done
# (192 is the first octet)
- Then -
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') on
    fi
  done

This typically fixes the deployment so things can continue. However, it
would be great to have this type of logic added to OOO: if a node goes
from BUILD->ACTIVE and isn't reachable within 120 seconds, ironic
simply reboots the host.

Unfortunately, it's hard to define "reachable". Also, 120 seconds is way too little for some servers; it can well take them 5 minutes to boot.

I would rather figure out why PXE gets stuck on your environment. Maybe you need a firmware update.
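
To make the two objections concrete, here is a minimal bash sketch of what such a check might look like if both the timeout and the meaning of "reachable" were left configurable. Everything in it is an assumption for illustration: `wait_reachable`, `probe_ping` and the `PROBE` variable are hypothetical names, a single ICMP ping is only one possible definition of "reachable", and `-W 2` assumes the Linux iputils `ping` (per-reply timeout in seconds). It is not existing TripleO or Ironic behaviour.

```shell
#!/usr/bin/env bash
# Sketch only: all names here are hypothetical, not TripleO or Ironic code.

# One possible definition of "reachable": a single ICMP echo gets a reply.
# Assumes Linux iputils ping, where -W is the reply timeout in seconds.
probe_ping() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# wait_reachable HOST TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as the probe succeeds, 1 once the timeout has passed.
# Set PROBE to the name of another command to change what "reachable" means.
wait_reachable() {
    local host=$1 timeout=$2 interval=${3:-5}
    local deadline=$(( SECONDS + timeout ))
    while (( SECONDS < deadline )); do
        if "${PROBE:-probe_ping}" "$host"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}
```

A caller could then write, for example, `wait_reachable "$ip" 300 || echo "$ip did not come up"`, using 300 seconds to allow for the servers that take 5 minutes to boot.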


Also, I suggest that if the second attempt fails, we reschedule the host --
sometimes I have seen a RAID controller or something else go bad,
out of our control.

We do have reschedule in place, but I suspect the current Ironic timeout (1 hour?) is too large for Nova.


Thanks for listening!
rook

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


