Others made good points for posting this on the ML, so here it is in full. Sorry for the markdown formatting, I just copied this from the blog post.
// jim

Another cycle, another summit. The ironic project had ten design summit sessions to get together and chat about some of our current and future work. We also led a cross-project session on bare metal networking, had a joint session with nova, and a contributor's meetup for the first half of Friday. The following is a summary of those sessions.

# Cross-project: the future of bare-metal networking

[Etherpad](https://etherpad.openstack.org/p/newton-baremetal-networking)

This session was meant to have the Nova, Ironic, and Neutron folks get together and figure out some of the details of the [work we're doing](https://review.openstack.org/#/c/277853/) to decouple the physical network infrastructure from the logical networking that users interact with.

Unfortunately, we spent most of the time explaining the problem and the goals, and not much time actually figuring out how things should work. We were able to decide that the trunk port work in neutron should mostly work for us.

There were plenty of hallway chats about this throughout the week, and from those I think we have a good idea of what needs to be done. The spec linked above will be updated soon to clarify where we are at here.

# Nova-compatible serial and graphical consoles

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-console)

This session began with a number of proposals to implement serial and graphical consoles that would work with Nova, and a goal to narrow them down so folks can move forward with the code.

The first thing we decided is that in the short term, we want to focus on the serial console. It's supported by almost all hardware and most cases where someone needs a console are covered by a simple serial console. We do want to do graphical consoles eventually, but would like to take one thing at a time.

We then spent some time dissecting our requirements (and preferences) for what we want an implementation to do, which are listed toward the bottom of the etherpad.
We narrowed the serial console work down to two implementations:

* [ironic-console-server](https://review.openstack.org/#/c/306755/). The tl;dr here is that the conductor will shell out to a command that creates a listening port, and forks a process that connects to the console, and pipes data between the two. This command is called once per console session. The upside with this approach is that operators don't need to do much when the change is deployed.

* [ironic-ipmiproxy](https://review.openstack.org/#/c/296869/). This is similar to ironic-console-server, except that it runs as its own daemon with a small REST API for start/stop/get. It spawns a process for each console, which does not fork itself. The upside here is that it can be scaled independently, and has no implications on conductor failover; however it will need its own HA model as well, and will be more work for deployers.

It seems like most folks agree that the latter is more desirable, in terms of scaling model and such, but we didn't quite come to consensus during the session. We need to do that ASAP.

We also talked a bit about console logging, and the pitfalls of doing it automatically for every instance. For example, some BMCs crash if power status is called repeatedly with a serial-over-lan session active (this is something to consider for regular console attach as well). We'll need to make this operator configurable, possibly per-node, so that we aren't automatically crashing bad BMCs for people. The nova team agreed later that this is fine, as long as a decent error is returned in this case.

# Status and future of our gate

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-gate)

We discussed the current status of our gate, and the plans for Newton.

We first talked about third-party CI, and where we're at with that. Kurt Taylor is doing the main tracking of that, and explained where and how we're tracking it.
There was a call for help for some of the missing data, and getting all the right pages updated (stackalytics, openstack.org marketplace, etc).

We also talked about the current changes going into our gate that we want to push forward. Moving to tinyipa and virtualbmc (with ipmitool drivers) are the main changes right now.

We discussed the progress on upgrade testing via grenade. There hasn't been a lot of progress made, but some of the groundwork to make local testing easy has been done. Later in the week, during the priorities session, we agreed that the upgrade testing was our top priority right now, and some folks volunteered to help move it along.

# Hardware pool management

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-hardware-pools)

This topic is talked about at nearly every summit, and we said that we need to at least solve the internals this round.

We narrowed in on what we think is a good architecture for this. Given that names are hard, we decided we would add a "thing" resource (yes, this name will change). This is some sort of management interface for a group of nodes, and every node will be mapped 1:1 to a "thing" by default. Credentials can be optionally placed in the "thing" instead of on the node, and there can be a 1:n thing:node mapping. This will allow ironic to do group operations for hardware that supports it. Of course, there will be "thing" drivers, because every hardware does this differently. :)

We also decided to make this an internal-only feature for now, and not expose it to the REST API yet. It can be used as an optimization in internal code, or to support hardware that can only be managed as a group. We may eventually decide to expose group management features to the REST API, but not yet.

# Driver composition

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-driver-composition)

This is another topic we keep discussing without coming to conclusions, and I think we made good progress this round.
Significant work went into the spec before the summit, and folks came prepared with a solid proposal.

We agreed on the concept of a "hardware type", which declares compatibility between driver interface implementations. These will be hard-coded into ironic, to what the vendor expects and the generic interfaces that ironic provides.

We also agreed that out of tree implementations of an interface should not be allowed to declare compatibility with in-tree vendor hardware types. For example, one could not make a power interface out of tree that declares compatibility with the in-tree "pizza box" hardware type. Out of tree drivers can, however, declare their own hardware types that may be used.

We also discussed upgrades, and using a new `hardware_type` field on the node object, to be used for the migration path. We didn't fully come to consensus on the upgrade path, but we're close, and the details are in the etherpad.

# Making ops less worse

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)

We discussed some common failure cases that operators see, and how we can solve them in code.

We discussed flaky BMCs, which end with the node in maintenance mode, and whether ironic can get them out of that mode automagically. We identified the need to distinguish between maintenance set by ironic and set by operators, and do things like attempt to connect to the BMC on a power state request, and turn off maintenance mode if successful. JayF is going to write a spec for this differentiation.

Folks also expressed the desire to be able to reset the BMC via APIs. We have a BMC reset function in the vendor interface for the ipmitool driver; dtantsur volunteered to write a spec to promote that method to an official ManagementInterface method.

We also talked for a while about stuck states. This has been mostly solved in code, but is still a problem for some deployers.
We decided that we should not have a "reset-state" API like nova does, but rather a command line tool to handle this. lintan has volunteered to write a proposal for this; I have also posted some [straw man code](https://review.openstack.org/#/c/311273/) that someone is welcome to take over or use.

# Anomaly detection and resolution

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-anomaly-detection)

This session kind of naturally flowed from the previous one, but was more about how we can automatically detect failure cases, and potentially automatically fix them. If we can't do these, then what do we need to build to allow external tooling to do it?

We concluded with a few things that we can do now to get started:

* Build a node error event DB table, such that errors can be fetched via our API.

* Send notifications on every state change (this is already in progress). Other tools can subscribe to them to watch for anomalies.

* Add a periodic task that polls BMCs for hardware event/error logs. We could store these or emit them as notifications.

# Ansible deploy driver

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ansible-deploy)

Some folks from Mirantis presented their proposal for an ansible-based deploy driver, and allowed us to ask questions.

The primary use case for this is to allow operators to easily change the deploy process; the ansible playbooks for this are configuration, not code. Some people had concerns about this, especially around supportability and the fact that we (upstream) effectively have no control over how a deployment works. This is analogous to allowing an operator to modify spawn() in the nova libvirt driver. However, most people present were okay with this.

We discussed how to build and secure ramdisks for this, and tossed some ideas around. We didn't come to any clear consensus, though.

Last, we found that currently each node is deployed in serial.
We noted that this driver is a non-starter until it can deploy many (50?) machines in parallel from a single conductor host. As such, we've decided that until this is possible, the team shouldn't be spending much review time on this.

# Live upgrades

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-live-upgrades)

Here, we discussed what we need to get to rolling upgrades, and what we should be testing to confirm these work.

We noted that the requirement for the "supports rolling upgrade" tag in the governance repo is only testing last-stable to master upgrade (and the equivalent for changes on the stable branch). Given that we do intermediate releases, we also want to test upgrades from last numbered release (because that may be more recent than last stable) to master. Last, we should run a job that upgrades ironic but does not upgrade nova, to make sure services can be upgraded independently.

We decided that for ironic upgrades, conductor should go first, followed by the API. This is so that the API doesn't expose functionality before a conductor supports it.

We decided for full cloud upgrades, ironic should go before nova, because older nova should always work with newer ironic. We should also upgrade neutron before ironic, because ironic consumes neutron and we don't want to depend on functionality that doesn't exist yet. There's an action item for me to check with the Neutron folks on this, to make sure Neutron before ironic before nova seems kosher to them.

# Inspector HA

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-inspector)

Milan gave a quick presentation on his proposal for an HA model for inspector, and we discussed. Things we agreed on:

* the general proposal
* use tooz for locking and leader election
* split it into an api and conductor service
* conductor runs active-active
* don't split firewall and dhcp services to a separate service

Details are in the etherpad.
:)

# Newton priorities

[Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-priorities)

We discussed our priorities for the Newton cycle here.

Of note, we decided that we need to get cold upgrade testing (i.e. grenade) running ASAP. We have lots of large changes lined up that feel like they could easily break upgrades, and want to be able to test them. Much of the team is jumping in to help get this going.

The priorities for the cycle have been published [here](http://specs.openstack.org/openstack/ironic-specs/priorities/newton-priorities.html).

The etherpad also lists some smaller work that we want to prioritize, but did not publish as such. My big task for early this cycle is to build a quick landing page that has all priorities with relevant links to them, and these small things will be included on this page.

# Nova/Ironic cross-project

[Etherpad](https://etherpad.openstack.org/p/newton-nova-ironic)

We started this session by updating the Nova team on the status of a few things.

We discussed the multitenant networking work, and what's left to do there. We wondered out loud if the "routed networks" feature planned for Nova will conflict with this work - johnthetubaguy and myself are to investigate this further.

We talked about the multiple-compute work, and if the generic-resource-pools work is a better route to getting there. This discussion has continued beyond the summit and is being investigated further.

We then talked about the future console work, and went over what we decided in the previous session we had about that.

We discussed what nova needs from the ironic team - full tempest runs (minus what ironic doesn't support) and faster CI runs. Surprise! We discussed some progress and some options here.

Last, we talked for a few minutes about passing configuration from flavors to ironic - think BIOS configuration on the fly, depending on the flavor requested.
This was obviously too big a topic to solve in a few minutes, but we got the wheels spinning.

# Summary

All in all, it was a productive summit for the ironic team, and we have a clear vision for the next six months.

On Mon, May 09, 2016 at 06:00:46PM -0400, Jim Rollenhagen wrote:
> Hey all,
>
> I wrote a recap of the summit on my blog:
> http://jroll.ghost.io/newton-summit-recap/
>
> I hope this covers everything that folks missed or couldn't remember. As
> always, questions/comments/concerns welcome.
>
> // jim
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev