On 10/11/2016 02:39 AM, Joe Talerico wrote:
Hey all,
For the past couple of days I have been making comments on IRC to discuss some
of the issues I have bumped into when scaling Newton to > 30 compute
nodes.
- `bulk import`: the operation to go from enroll -> manage can take
20-30 minutes to complete. Can we have this be a non-blocking
operation with a message to the user that they cannot continue until
the nodes they want to deploy on go from enroll->manage?
The only thing that enroll->manage does is to check the power credentials. It
should never take more than 30-60 seconds (and even this is too much, and might
be a sign of problems with the environment). I suspect that the workflow
processes nodes sequentially, though, so these 30-60 seconds get multiplied by
the number of nodes. If so, the workflow definitely needs fixing.
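To put rough numbers on the sequential-processing suspicion, here is a minimal
sketch. It assumes every node's power credential check takes the same fixed time
and that nodes are handled in batches. `estimate_seconds` and its concurrency
argument are hypothetical names used only for illustration; they are not part of
any TripleO or Ironic tool.

```shell
# Hypothetical back-of-the-envelope estimate, not a real TripleO/Ironic command.
# Usage: estimate_seconds NODES SECONDS_PER_NODE [CONCURRENCY]
estimate_seconds() {
  local nodes=$1 per_node=$2 concurrency=${3:-1}
  # ceil(nodes / concurrency) batches, each batch taking per_node seconds
  echo $(( (nodes + concurrency - 1) / concurrency * per_node ))
}

estimate_seconds 30 30      # one at a time, 30 s each: prints 900
estimate_seconds 30 60      # one at a time, 60 s each: prints 1800
estimate_seconds 30 60 30   # all 30 at once (assumed): prints 60
```

With 30 nodes, 30-60 seconds each comes to 15-30 minutes when done one at a
time, which is in the same range as the 20-30 minutes reported.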
- overcloud deploy - when PXE completes I have seen a handful of
nodes not reboot, or just get jammed up in the PXE screen. When this
occurs I run:
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') off
    fi
  done
# (192 is the first octet)
- Then -
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') on
    fi
  done
This typically fixes the deployment so things can continue. However, it
would be great to have this type of logic added to OOO: if a node goes
from BUILD->ACTIVE and isn't reachable within 120 seconds, ironic
simply reboots the host.
Unfortunately, it's hard to define "reachable". Also, 120 seconds is way too
little for some servers; it can well take them 5 minutes to boot.
I would rather figure out why PXE gets stuck in your environment. Maybe you need
a firmware update.
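For illustration only, the "power-cycle if not reachable in time" idea could be
sketched roughly as follows. Since "reachable" is hard to define and boot times
vary, both the probe command and the timeout are left as parameters.
`wait_reachable`, `NODE_IP` and `NODE_UUID` are hypothetical names, and nothing
like this exists in TripleO or Ironic as far as this thread shows.

```shell
# Hypothetical helper: poll a probe command until it succeeds or a deadline passes.
# Usage: wait_reachable TIMEOUT_SECONDS INTERVAL_SECONDS PROBE_COMMAND [ARGS...]
wait_reachable() {
  local timeout=$1 interval=$2 deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0   # probe succeeded: the node counts as reachable
    fi
    sleep "$interval"
  done
  return 1       # deadline passed without a successful probe
}

# Example (commented out; NODE_IP and NODE_UUID are placeholders).
# 300 s follows the "5 minutes to boot" remark rather than 120 s.
# if ! wait_reachable 300 10 ping -c 1 "$NODE_IP"; then
#   ironic node-set-power-state "$NODE_UUID" off
#   ironic node-set-power-state "$NODE_UUID" on
# fi
```

A single ping is only one possible definition of "reachable"; any command that
exits 0 on success can be passed as the probe instead.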
Also, I suggest that if the second attempt fails, we reschedule the host --
I have sometimes seen a RAID controller or something else go bad,
out of our control.
We do have reschedule in place, but I suspect the current Ironic timeout (1
hour?) is too large for Nova.
Thanks for listening!
rook
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev