On Fri, Jun 9, 2017 at 7:28 AM, Justin Kilpatrick <[email protected]> wrote:
> On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <[email protected]> wrote:
>> This number of "300", does it come from your testing or from other
>> sources? If the former, which driver were you using? What problems
>> exactly have you seen when approaching this number?
>
> I haven't encountered this issue personally, but talking to Joe
> Talerico and some operators at summit, around this number a single
> conductor begins to fall behind polling all of the out-of-band
> interfaces for the machines it's responsible for. You start to see
> what you would expect from polling running behind: incorrect power
> states listed for machines and a general inability to perform machine
> operations in a timely manner.
>
> Having spent some time at the Ironic operators forum, I can say this
> is pretty normal, and the correct response is simply to scale out
> conductors. This is a problem for TripleO because we don't really have
> a scale-out option with a single-machine design. Fortunately, just
> increasing the time between interface polls acts as a pretty good
> stopgap and lets Ironic catch up.
>
> I may get some time on a cloud of that scale in the future, at which
> point I will have hard numbers to give you. One of the reasons I made
> YODA was the frustrating prevalence of anecdotes instead of hard data
> when it came to one of the most important parts of the user
> experience. If it doesn't deploy, people don't use it, full stop.
>
>> Could you please elaborate? (a bug report could also help) What
>> exactly were you doing?
>
> https://bugs.launchpad.net/ironic/+bug/1680725
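
The stopgap Justin mentions is, if I'm reading it right, the conductor's
power state sync period. A minimal sketch of the knob in ironic.conf,
assuming the option is still [conductor]/sync_power_state_interval and
that a few minutes of power-state staleness is acceptable for your cloud:

    [conductor]
    # Seconds between polls of each node's out-of-band interface for its
    # power state (default 60). Raising it gives a single overloaded
    # conductor room to catch up. (Assumed to be the knob in question.)
    sync_power_state_interval = 180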

Additionally, I would like to see more verbose output from the cleaning
process: https://bugs.launchpad.net/ironic/+bug/1670893

> Describes exactly what I'm experiencing. Essentially the problem is
> that nodes can and do fail to PXE boot, then cleaning fails, and you
> just lose the nodes. Users have to spend time going back and
> babysitting these nodes, and there are no good instructions on what to
> do with failed nodes anyway. The answer is to move them to manageable
> and then to available, at which point they go back into cleaning until
> it finally works.
>
> Like introspection was a year ago, this is a cavalcade of
> documentation problems and software issues. I mean, really, everything
> *works* technically, but the documentation acts like cleaning will
> work all the time and so does the software, leaving the user to figure
> out how to accommodate the realities of the situation without so much
> as a warning that it might happen.
>
> This comes out as more of a UX issue than a software one, but we can't
> just ignore these.
>
> - Justin
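
For anyone hitting the failed-cleaning case above, the manageable ->
available dance looks roughly like this with the ironic OSC plugin (the
UUID is a placeholder; "provide" is what sends the node back through
cleaning):

    # Move the failed node back to the manageable state, then make it
    # available again, which triggers another cleaning pass.
    openstack baremetal node manage <node-uuid>
    openstack baremetal node provide <node-uuid>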

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev