On 10/11/2016 02:39 AM, Joe Talerico wrote:
Hey all,
The past couple of days I have been making comments on IRC to discuss some
of the issues I have bumped into when scaling Newton to > 30 compute
nodes.

- `bulk import`, the operation to go from enroll -> manage can take
20-30 minutes to complete. Can we have this be a non-blocking
operation with a message to the user that they cannot continue until
the nodes they want to deploy on go from enroll->manage?

The only thing that enroll->manage does is check the power credentials. It should never take more than 30-60 seconds (and even that is too long, and might be a sign of problems with the environment). I suspect that the workflow processes nodes sequentially, though, so these 30-60 seconds are multiplied by the number of nodes. If so, the workflow definitely needs fixing.
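
To make the arithmetic concrete: at 30-60 seconds per node, 30 nodes checked one after another take 15-30 minutes, which is in the range reported above. Below is a minimal sketch of the alternative, running the checks concurrently. Note that `check_power_credentials` is a hypothetical stand-in that only sleeps and prints; it is not an ironic or TripleO command, and `CHECK_DELAY` is an invented knob for the simulation.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the per-node power credential check.
# It only simulates a slow BMC; replace the body with the real check.
check_power_credentials() {
    local node="$1"
    sleep "${CHECK_DELAY:-1}"
    echo "$node ok"
}

# Start one background job per node, then block until all have finished.
# Total wall time is roughly that of the slowest node, not the sum.
check_all_parallel() {
    local node
    for node in "$@"; do
        check_power_credentials "$node" &
    done
    wait
}
```

Called as `check_all_parallel node1 node2 node3`, this prints one line per node in whatever order the jobs finish. A real workflow would probably also want to cap the number of concurrent checks.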

- overcloud deploy - when pxe completes I have seen a handful of
nodes not reboot, or just get jammed up in the pxe screen. When this
occurs I run:
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') off
    fi
  done
# (192 is the first octet)
- Then -
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') on
    fi
  done

This typically fixes the deployment so things can continue. However, it
would be great to have this type of logic added to OOO: if a
node goes from BUILD->ACTIVE and isn't reachable within 120 seconds,
ironic simply reboots the host.

Unfortunately, it's hard to define "reachable". Also, 120 seconds is far too short for some servers; it can easily take them 5 minutes to boot.

I would rather figure out why PXE gets stuck in your environment. Maybe you need a firmware update.
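
For illustration only, here is a sketch of what a reachability check with a configurable timeout could look like, so that slow-booting servers can be given 5 minutes rather than 120 seconds. It assumes "reachable" means "answers one ping", which is exactly the simplification questioned above. The names `probe` and `wait_reachable` are made up for this example, and the `-W` flag assumes Linux iputils ping.

```shell
#!/usr/bin/env bash

# Reachability probe, kept separate so a different definition of
# "reachable" (e.g. an SSH port check) can be swapped in.
# Assumption: Linux iputils ping, where -W is the reply timeout in seconds.
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# wait_reachable HOST TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as the probe succeeds, 1 once the timeout has elapsed.
wait_reachable() {
    local host="$1" timeout="$2" interval="${3:-5}"
    local deadline=$(( SECONDS + timeout ))
    while (( SECONDS < deadline )); do
        if probe "$host"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}
```

A caller would run something like `wait_reachable "$ip" 300` and, only on failure, fall back to the `ironic node-set-power-state ... off` / `on` commands quoted earlier.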


Also, I suggest that if the second attempt fails, we reschedule the host --
sometimes I have seen cases where a RAID controller or something else goes
bad, out of our control.

We do have reschedule in place, but I suspect the current Ironic timeout (1 hour?) is too large for Nova.


Thanks for listening!
rook

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


