This is great information Justin, thanks for sharing. It will prove useful as we scale up our ironic deployments.
It seems to me that a reference configuration of ironic would be a useful
resource for many people. Some key decisions may at first seem arbitrary but
have a real impact on performance and scalability, such as:

- BIOS vs. UEFI
- PXE vs. iPXE bootloader
- TFTP vs. HTTP for kernel/ramdisk transfer
- iSCSI vs. Swift (or one day standalone HTTP?) for image transfer
- Hardware-specific drivers vs. IPMI
- Local boot vs. netboot
- Fat images vs. slim + post-configuration
- Any particularly useful configuration tunables (power state polling
  interval, nova build concurrency, others?)

I personally use kolla + kolla-ansible, which by default uses PXE + TFTP +
iSCSI, arguably not the best combination.

Cheers,
Mark

On 9 June 2017 at 12:28, Justin Kilpatrick <jkilp...@redhat.com> wrote:
> On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <dtant...@redhat.com> wrote:
> > This number of "300", does it come from your testing or from other
> > sources? If the former, which driver were you using? Exactly what
> > problems have you seen approaching this number?
>
> I haven't encountered this issue personally, but talking to Joe
> Talerico and some operators at summit, around this number a single
> conductor begins to fall behind polling all of the out-of-band
> interfaces for the machines it's responsible for. You start to see
> what you would expect from polling running behind: incorrect power
> states listed for machines and a general inability to perform machine
> operations in a timely manner.
>
> Having spent some time at the Ironic operators forum, this is pretty
> normal, and the correct response is just to scale out conductors. This
> is a problem for TripleO because we don't really have a scale-out
> option with a single-machine design. Fortunately, just increasing the
> time between interface polls acts as a pretty good stopgap and lets
> Ironic catch up.
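For the polling-interval stopgap Justin mentions, a minimal ironic.conf sketch
of the relevant conductor knobs (option names and defaults assumed from the
`[conductor]` section; verify against your release's configuration reference):

```
# ironic.conf -- assumed option names, check your release's docs
[conductor]
# Seconds between out-of-band power state sync polls (default 60).
# Raising this reduces per-conductor load when one conductor is
# responsible for many nodes, at the cost of staler power states.
sync_power_state_interval = 120

# How many sync failures to tolerate before marking a node's
# power state as unknown (assumed default 3).
power_state_sync_max_retries = 3
```

This is only a stopgap, as Justin says; the real fix at scale is adding
conductors so the node load is partitioned across them.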
>
> I may get some time on a cloud of that scale in the future, at which
> point I will have hard numbers to give you. One of the reasons I made
> YODA was the frustrating prevalence of anecdotes instead of hard data
> when it came to one of the most important parts of the user
> experience. If it doesn't deploy, people don't use it, full stop.
>
> > Could you please elaborate? (a bug could also help). What exactly
> > were you doing?
>
> https://bugs.launchpad.net/ironic/+bug/1680725
>
> describes exactly what I'm experiencing. Essentially the problem is
> that nodes can and do fail to PXE boot, then cleaning fails and you
> just lose the nodes. Users have to spend time going back and
> babysitting these nodes, and there are no good instructions on what to
> do with failed nodes anyway. The answer is to move them to manageable
> and then to available, at which point they go back into cleaning until
> it finally works.
>
> Like introspection a year ago, this is a cavalcade of documentation
> problems and software issues. I mean, really, everything *works*
> technically, but the documentation acts like cleaning will work all
> the time, and so does the software, leaving the user to figure out how
> to accommodate the realities of the situation without so much as a
> warning that it might happen.
>
> This comes out as more of a UX issue than a software one, but we can't
> just ignore these.
>
> - Justin
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
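The manageable-then-available dance Justin describes for clean-failed nodes
can be done with the baremetal CLI; a hypothetical sketch (the node name is a
placeholder, and this assumes a reasonably recent python-ironicclient):

```
# Recover a node stuck in "clean failed" (node-0 is a placeholder name).
# Moving it to manageable and then providing it re-triggers cleaning,
# which may need to be repeated if cleaning fails again.
openstack baremetal node manage node-0
openstack baremetal node provide node-0
```

Each `provide` sends the node back through automated cleaning before it lands
in `available`, so a flaky PXE environment can require several passes.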