Thank you, Julia, for sacrificing yourself and going to Australia; I'm glad the koalas didn't get you :)
This summary is GREAT! I'm trying to figure out how we take all these asks into consideration alongside the existing asks and TODOs already on our plate. I guess the best plan of action (and a bit more procrastination) is to discuss this at our virtual mid-cycle meetup next week [1].

--ruby

[1] http://lists.openstack.org/pipermail/openstack-dev/2017-November/124725.html

On Tue, Nov 14, 2017 at 11:18 AM, Julia Kreger <juliaashleykre...@gmail.com> wrote:
> Greetings ironic folk!
>
> Like many other teams, we had very few ironic contributors make it to Sydney. As such, I wanted to go ahead and write up a summary that covers takeaways, questions, and obvious action items for the community that were raised by operators and users present during the sessions, so that we can use this as feedback to help guide our next steps and feature planning.
>
> Much of this is from my memory combined with notes on the various etherpads. I would like to explicitly thank NobodyCam for reading through this in advance to see if I was missing anything at a high level, since he was present in the vast majority of these sessions, and dtantsur for sanity checking the content and asking for elaboration in some cases.
>
> -Julia
>
>
> Ironic Project Update
> =====================
>
> Questions largely arose around the use of boot from volume, including some scenarios we anticipated would arise, as well as new scenarios that we had not considered.
>
> Multiple nodes booting from the same volume
> -------------------------------------------
>
> From a technical standpoint, when BFV is used with iPXE chain loading, the chain loader reads the boot loader and related data from the cinder volume (or, realistically, any iSCSI volume). This means that a skilled operator is able to craft a specific volume that may just turn around and unpack a ramdisk and operate the machine solely from RAM, or that utilizes an NFS root.
>
> This sort of configuration would not be something an average user would make use of, but there are real use cases where some large scale deployment operators would make use of it and where it would provide them value.
>
> Additionally, this topic and the desire for this capability also came up during the “Building a bare metal cloud is hard” talk Q&A.
>
> Action Item: Check the data model to see whether we prohibit using the same volume across multiple nodes, and consider removing that prohibition if one exists.
>
> Cinder-less BFV support
> -----------------------
>
> Some operators are curious about booting ironic-managed nodes in a BFV context without cinder. This is something we anticipated, and we built the API and CLI interfaces to support it. Realistically, we just need to offer the ability for that data to be read and utilized.
>
> Action Item: Review the code and ensure that we have some sort of no-op driver or method that allows cinder-less node booting (see the sketch following these BFV items). For existing drivers, this would mean shipping the connection information to the BMC or writing out iPXE templates as necessary.
>
> Boot IPA from a cinder volume
> -----------------------------
>
> With larger IPA images, specifically in cases where the image contains a substantial amount of utilities or tooling to perform cleaning, providing a mechanism to point the deployment ramdisk at a cinder volume would allow more efficient IO access.
>
> Action Item: Discuss further, specifically how we could support this; we would need to better understand how some of the operators might use such functionality.
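>
> As a rough sketch of the cinder-less case mentioned above (the method names follow ironic's storage interface as it exists today, but the behavior shown is an assumption of what a no-op/"external" implementation might do, not a settled design):
>
>     from ironic.drivers import base
>
>     class ExternalStorage(base.StorageInterface):
>         """Sketch: trust operator-supplied volume targets, never call cinder."""
>
>         def get_properties(self):
>             return {}
>
>         def validate(self, task):
>             # A real implementation would verify the node carries volume
>             # target records with enough connection detail to boot from.
>             pass
>
>         def attach_volumes(self, task):
>             # Nothing to orchestrate; the volume already exists and is
>             # reachable from the node's storage network.
>             pass
>
>         def detach_volumes(self, task):
>             pass
>
>         def should_write_image(self, task):
>             # The node boots from the volume, so no image is written to
>             # local disk.
>             return False
>
> The boot interface would then consume the node's volume target information directly, whether that means writing an iPXE template or handing the details to the BMC.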
>
> Dedicated Storage Fabric support
> --------------------------------
>
> A question of dedicated storage fabric/networking support arose. Users of FibreChannel generally have a dedicated storage fabric by the very nature of the separate infrastructure. However, with ethernet networking where iSCSI software initiators are used, or even possibly converged network adapters, things get a little more complex.
>
> Presently, with the iPXE boot from volume support, we boot using the same interface details as the neutron VIF that the node is attached with.
>
> Moving forward, with BFV, the concept was to support the use of explicitly defined interfaces as storage interfaces, which could be denoted as "volume connectors" in ironic with a type of "mac". In theory, we begin to get functionality along these lines once https://review.openstack.org/#/c/468353/ lands, as the user could define two networks, and the storage network should then fall to the explicit volume connector interface(s). The operator would just need to ensure that the settings used on that storage network are such that the node can boot and reach the iSCSI endpoint, and that a default route is not provided.
>
> The question then may be: does ironic do this quietly for the user requesting the instance or not, and how do we document the usage such that operators can conceptualize it? How do we make this work at a larger scale? How could this fit or not fit into multi-site deployments?
>
> In order to determine if there is more to do, we need to have more discussions with operators.
>
> Action items:
>
> * Determine overall needs for operators, since this is implementation architecture centric.
> * Plan a forward path from there, if it makes sense.
>
> Note: This may require more information to be stored or leveraged in terms of structural or location based data.
>
> Migration questions from classic drivers to hardware types
> ----------------------------------------------------------
>
> One explicit question from the operator community was whether we intended to perform a migration from classic drivers to hardware types. In a sense, there are two issues here: the first is the perceived amount of work involved, and the second is whether there is a good way to cleanly identify and transform classic drivers during upgrade.
>
> Action item:
>
> * For whatever reason, the ironic community felt it was unnecessary to facilitate a migration for users from classic drivers to hardware types, even though we have direct analogs. The ironic community should re-evaluate and consider implementing migration logic to ease user migration.
> * In order to proceed, ironic does need to understand whether operators would be okay with the upgrade process failing if the pre-upgrade checks detected that the configuration was incompatible with a successful migration. This would allow an operator to correct their configuration file and re-run the upgrade.
>
>
> Ironic User Feedback Session
> ============================
>
> https://etherpad.openstack.org/p/SYD-forum-ironic-feedback
>
> The feedback session felt particularly productive because developers were far outnumbered by operators.
>
> Current Troubles/Not Working for Operators
> ------------------------------------------
>
> * The current RAID deployment process, where we generally apply RAID configuration during the cleaning step, prior to deploy.
> ** One of the proposed solutions was the marriage of traits, deploy templates, and the application of deployment templates upon deployment.
> ** The concern is that this will lead to an explosion of flavors, and some operators' environments are already extremely flavor-full. “I presently run `nova flavor-list`, and go get a coffee”
> ** The mitigating factor would be allowing the user initiating the deployment to propose additional traits on the command line at boot time. This was mentioned by Sam Betts, and one of the nova cores present indicated that it was part of their plan.
>
> * UEFI iPXE boot. Specifically, some operators are encountering issues with some vendors' hardware that “should” be compatible, but does not actually work except in specific scenarios.
> ** This is not an ironic bug.
> ** In the specific case that an operator reported, they were forced into the use of a vendor driver and specific settings, which seemed like something they would have preferred to avoid.
> ** The community members, along with the users and operators present, agreed that a good solution would be to propose documentation updates to our repository that detail when drivers _do not_ work, or when there are weird compatibility issues that are not quite visible.
> ** It may be worth considering some sort of matrix to raise visibility of driver compatibility/interoperability moving forward. The ironic team would not push back if an operator wished to begin updating our admin documentation with such information.
>
>
> Action Items:
> * The community should encourage operators to submit documentation changes when they become aware of such issues.
> * The community should also encourage vendor driver maintainers to explicitly document their known-good/tested scenarios, as some hardware within the same family can vary.
>
>
> What Operators are indicating that they need
> --------------------------------------------
>
> Firmware Updates
> ~~~~~~~~~~~~~~~~
>
> Our present firmware update model depends upon a hardware manager driving the process during cleaning, which presently requires the hardware manager to be built inside the ramdisk image. This is problematic, as it requires operators to craft and build hardware managers that fit their needs, and then ensure those are running on the specific hosts to upgrade their firmware.
>
> While this may seem easy and reasonable for a small deployment, there is an operational disconnect in many organizations between who blesses new firmware versions and who controls the hardware. In some cases, one team may be in charge of certifying and testing new firmware, while another team entirely operates the cloud. These process and operational constraints also prevent hardware managers from being shared in the open, because they could potentially reveal the security state of a deployment. Simply put, operators need something easier, especially when they may receive twenty different chassis in a single year.
>
> While we discussed this as a group, we did seem to begin to reach an understanding of what would be useful.
>
> Several operators made it clear that they feel that ironic is in a position to help drive standardization across vendors.
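>
> For context, the current model means something along these lines living inside the deploy ramdisk (a minimal sketch using the ironic-python-agent hardware manager interface; the step name, priority, and the omitted vendor flashing logic are illustrative assumptions):
>
>     from ironic_python_agent import hardware
>
>     class ExampleFirmwareManager(hardware.HardwareManager):
>         """Sketch of a site-specific hardware manager baked into the ramdisk."""
>
>         HARDWARE_MANAGER_NAME = 'ExampleFirmwareManager'
>         HARDWARE_MANAGER_VERSION = '1.0'
>
>         def evaluate_hardware_support(self):
>             # Claim support so the clean step below is offered on this node.
>             return hardware.HardwareSupport.SERVICE_PROVIDER
>
>         def get_clean_steps(self, node, ports):
>             return [{'step': 'update_firmware',
>                      'priority': 10,
>                      'interface': 'deploy',
>                      'reboot_requested': True,
>                      'abortable': False}]
>
>         def update_firmware(self, node, ports):
>             # A real implementation would compare the running firmware
>             # against a blessed version and invoke the vendor's flashing
>             # tool; both pieces are deployment and vendor specific, which
>             # is exactly the pain point operators described.
>             pass
>
> The ask below is essentially to replace the deployment-specific body of a step like this with a generic step that reports the current state to a central policy service and applies whatever that service hands back.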
>
> What operators are looking for:
>
> * A framework or scaffolding to facilitate centrally managed firmware updates, where the current state information is published upward and the system replies with the firmware to be applied.
> ** Depending on the deployment, an operator may choose to assert firmware upon every cleaning, but they need to be able to identify the hardware, current firmware, and necessary versions by some sort of policy.
> ** Any version policy may vary across the infrastructure, based on either resource class or hardware ownership concepts.
> ** This may, in itself, just be a hardware manager that calls out to an external service and executes based upon what the service tells it to do.
>
> * Ironic to work with vendors to encourage some sort of standardized packaging and/or installation process, such that the firmware updating code can be as generic as possible.
>
> One other note worth mentioning: some operators spoke of stress testing their hardware during cleaning. It seems like a logical thing to do; however, it would be best for a few operators to explicitly propose what they wish to test and how they presently do it, so we as a community can gain a better understanding.
>
> Action Items:
>
> * Poll hardware vendors during the next weekly meeting and attempt to build an understanding and consensus.
> * With that feedback, the next step is to determine how to fit such a service into ironic, along with what ironic's expectations are for drivers regarding firmware upgrades.
>
> TPM Key Assertion
> ~~~~~~~~~~~~~~~~~
>
> Some operators utilize their TPMs for secure key storage, and need a mechanism to update/assert a new key to overwrite existing key data. The key data in the TPM is used by the running system, and we have no operational need to store or utilize the data. Presently some operators perform this manually, and replacing keys on systems running elsewhere in the world is a human-intensive process.
>
> The consensus in the room was that this might be a good out-of-band management interface feature, which could live in the management interfaces of the vendor drivers. We presently make only minimal use of the management interface.
>
> From a security standpoint, this is also something we shouldn’t store locally; ironic should only be a clean pass-through conduit for the data, which makes explicitly out-of-band usage with vendor drivers even more appealing.
>
> Action Item: Poll hardware vendors during the next weekly meeting or two in order to begin discussing the viability/capability to support this. This could be passthru functionality in the driver, but if two drivers can eventually support such functionality, we should standardize it upfront.
>
> Reversing the communications flow
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> One of the often discussed items in security-conscious environments is the need to have the conductor initiate communication to the agent. While this model will not work for all deployments, we've had a consensus in the past that this is something we should consider offering as an optional setting.
>
> In other words, IPA could have a mode of operation where it no longer heartbeats to the API; instead, the conductor would look up the address from neutron and proceed to poll it until the node came online. The conductor would then poll that address on a regular basis, much like heartbeating works today. We should keep in mind that this polling operation will have an increased impact on conductor load.
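>
> As a very rough sketch of the conductor side of that flow (the agent status URL, port, and timings here are assumptions for illustration rather than an agreed design):
>
>     import time
>     import requests
>
>     def wait_for_agent(agent_ip, port=9999, timeout=1800, interval=30):
>         """Poll an agent that never heartbeats until it answers."""
>         url = 'http://%s:%s/v1/status' % (agent_ip, port)
>         deadline = time.time() + timeout
>         while time.time() < deadline:
>             try:
>                 resp = requests.get(url, timeout=10)
>                 if resp.status_code == 200:
>                     return resp.json()
>             except requests.RequestException:
>                 pass  # agent not up yet; keep polling
>             time.sleep(interval)
>         raise RuntimeError('Agent at %s never came online' % agent_ip)
>
>     # The conductor would learn agent_ip from the neutron port on the
>     # provisioning network rather than from an inbound heartbeat.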
>
> Several operators present in the session expressed interest, while others indicated this would be a breaking change for their environment's security model; as such, any movement in this direction must be optional.
>
> Action Item: Someone writes a specification and polls the larger operators we know in the community for thoughts, in order to see if it meets their needs.
>
> Documentation on known issues and fixes or incompatibilities
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Operators would like to see more information on known driver issues or incompatibilities explicitly stated in the documentation. Additionally, operators would like a single location for "how to fix this issue" that is not a bug tracking system.
>
> There seemed to be consensus that these are good things to document; operators would like this, and the community does not disagree. That being said, the operators are the ones who will tend to become aware of such issues first.
>
> The best way for the operator community to help the developers is to propose documentation patches to the ironic repository to raise awareness and visibility within our own documentation. We must keep in mind that we must curate this information as well, since some of these things are not necessarily “bugs”, much like the UEFI boot issues noted earlier.
>
> Automatic BIOS Configuration
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> tl;dr: We are working on it.
>
> Use of System UUID in addition to MAC addresses
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Some operators would like to see changes in order to allow booting based upon a recorded system UUID. In these cases, the operator may be using the "noop" network interface driver, or another custom driver that does not involve neutron.
>
> The reasons to support UUID-based booting are extensive:
>
> * iPXE can attempt the system UUID first, so support for utilizing the UUID could remove several possible transactions.
>
> * The first ethernet device to iPXE may not be the one known to ironic, and may still obtain an IP address based upon the environment configuration/operation. This will largely be the case in an environment where DHCP is not managed by neutron. Presently, operators have to wait for the unknown interfaces to fail completely and eventually reach a known network interface.
>
> ** Possibly worse, the order may not be consistent depending on the hardware boot order / switch configuration / cabling.
> ** Operators indicated that swapped cabling with link aggregates is a semi-common problem they encounter.
>
> * MAC addresses of nodes may simply not be known, and evolving to support hard-coded UUIDs gives us greater flexibility in terms of being able to boot a node where we neither know nor control the IP addressing assigned.
>
>
> In addition to UUIDs, an operator expressed interest in having the same boot behavior keyed on the allocated IP address, as opposed to the UUID. This also deserves consideration, as it may be useful for some operators.
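>
> To illustrate the mechanics behind the template-generation item in the action items below (the directory layout and config-file naming follow the usual pxelinux conventions that ironic's MAC-based symlinks use today; the UUID link is the proposed addition, not current behavior):
>
>     import os
>
>     def link_node_config(tftp_root, node_uuid, system_uuid, mac):
>         """Point UUID- and MAC-named pxelinux lookups at one node config."""
>         config = os.path.join(tftp_root, node_uuid, 'config')
>         cfg_dir = os.path.join(tftp_root, 'pxelinux.cfg')
>
>         # pxelinux tries a file named after the DMI system UUID first...
>         os.symlink(config, os.path.join(cfg_dir, system_uuid.lower()))
>         # ...then falls back to "01-" plus the MAC with dash separators,
>         # which is the symlink ironic already writes.
>         os.symlink(config, os.path.join(cfg_dir,
>                                         '01-' + mac.lower().replace(':', '-')))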
>
> Action Items:
>
> * We should determine a method of storing the UUID if it is discovered or already known. Some operators may already know the address.
> ** Suggestion from an operator: Maybe just allow setting of the uuid when a node is created, like we do with ports, so that an operator or inspector could set the node uuid to be the same as the system's uuid, thus eliminating the need for another field.
> ** Ironic contributor: Alternatively, we should just add a boolean that writes it out and offers it as an initial step, and then falls back to the MAC address based attempt.
>
> * Update template generation to support writing a symlink with the UUID and/or MAC addresses (along the lines of the sketch above).
>
> * Explore the possibility of doing the same with IP addresses.
>
> Diskless boot
> ~~~~~~~~~~~~~
>
> This is a repeat theme that has arisen before, and in many cases it could be solved via the BFV iPXE functionality; however, other operators have expressed a need in the past for more generic boot options in order to boot their nodes. There have been some prior specifications on making generic PXE interfaces available for things such as network switches. As such, we should likely re-evaluate those specifications.
>
> Action Item: Ironic should re-evaluate its current position and review the related specifications.
>
> Physical Location/Hardware Ownership
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This one didn't quite make the notes, but a couple of attendees seem to remember it and it is worth mentioning.
>
> Presently, there is no ability to have a geographically diverse ironic installation where a single pane of glass is provided by the API. To add further complexity, ironic may be in a situation where it is managing hardware that the operator might not explicitly own and that needs to be in a delineated pool. We presently have no way to represent or control scheduling in these cases.
>
> This quickly leads to tenant awareness, as in an operator may have a tenant that owns hardware in the datacenter. Naturally, this can get complex quite quickly, but it seems logical for many users: they may have trusted users who wish to manually deploy hardware, and in many of those cases it is desired that the hardware be used by no other tenant. This concept may also be extended to "authorized users" who have temporary implied rights to interact with ironic based upon permissions and current ownership.
>
> To keep this short, the impacts of this are _massive_, as they are intertwined, fundamental changes to how ironic represents data at the API level as well as to how ironic executes requests, since the end goal would be to provide the granularity to say "these two conductors are for x environment" or "you are only allowed to see x nodes". As a result of all of this, it would be a huge API change. The current concept is to simply build upon the existing 1.0 API as a 2.0 API.
>
> Action Item: TheJulia and Nobodycam volunteer as tributes... to start a spec.
>
>
> Ironic On-boarding Session
> ==========================
>
> The Sydney summit was the first attempt by the ironic team to run an on-boarding session for the community. As such, the intent was to take a free-form approach and attempt to answer whatever questions attendees had for community members, which would provide feedback on what new contributors might be interested in moving forward.
>
> By and large, the questions that were asked boiled down to the following:
>
> Where do I find the code?
> This was largely a question of which repository contains which pieces.
>
> How do I set up a test environment?
> This was very much a question of getting started, which led into the next logical question.
>
> How do I test without real physical servers?
> The answer became Devstack or Bifrost, depending on your use case and your desire to perform full-stack development or lightweight work with or alongside ironic.
>
> Can I test using remote VMs?
> Overall the answer was yes, but you need to handle the networking yourself to bridge it through, and you need some mechanism to control power. Ironic-staging-drivers was brought up as a repository that might have useful drivers in these cases. Ironic should look at improving some of our docs to highlight the possibilities.
>
> What alternatives to devstack are there?
> Bifrost was raised as an example. We failed to mention kolla as an option. :(
>
> How do we see community priorities?
> This was very easy for us, but for a new contributor coming into the community it is not as clear. Ironic should consider improving documentation to make it very clear where to find this information.
>
>
> Action Items:
> * Some of ironic's documentation for new contributors may need revision to provide some of these contextual details upfront, or we might need to consider a Q&A portion of the documentation.
> * The ironic community should ensure that the above questions are largely answered in whatever material is presented as part of the next on-boarding session at the Vancouver summit.
>
>
> Mogan/Nova/Ironic Session
> =========================
>
> https://etherpad.openstack.org/p/SYD-forum-baremetal-ironic-mogan-nova
>
> The purpose of this session was to help compare, contrast, and provide background to the community as to the purpose behind the Mogan project, which was to create a baremetal-centric, user-facing compute API allowing non-administrator users to provision and directly control baremetal. The baremetal in Mogan's context could be baremetal that already exists, or baremetal that is created from some sort of composable infrastructure.
>
> The Mogan PTL started with an overview to provide context to the community, and then the community shifted to asking questions to gain context, including polling operators for interest and concerns. The primary operator concern was over creating divergence and user confusion in the community.
>
> Once we had some operator input, we attempted to identify the differences and shortcomings in Ironic and Nova that primarily drove the effort. What we were able to identify from a work-in-progress comparison document was largely additional insight into aggregates, which was partly due to affinity/anti-affinity needs. Additional functionality exists in Mogan to list servers available for management and then directly act upon them, although the extent of what additional actions can be taken upon a baremetal node had not been identified.
>
> As the discussions went on, the ironic team members that were present were able to express our concerns over communication. It largely seemed to be a surprise that some of our hardware teams were working in the direction of composable hardware, and that the use model Mogan sought could fit into our scope and workflow for composable hardware. Largely, for composable hardware, we would need some way to represent a node that a user wishes to perform an action upon. In some cases now, that is done with placeholder records representing possible capacity.
>
> Naturally for ironic, making it user-facing would be a very large change to ironic's API; however, based on other sessions, these are changes that ironic may wish to explore given stated operator needs.
>
> The discussion for both Ironic and Nova was more of a “How do we best navigate” question instead of an “If we should navigate” question, which in itself is positive.
>
> Some of these items included improving the view of available physical baremetal, regional/availability zoning, tenant utilization of the API, and possibly hardware ownership concepts. Many of these items, as touched on in the feedback session, are intertwined.
>
> Overall, the session was good in that we were able to gain consensus that the core issues which spurred the creation of Mogan are addressable by the present Ironic and Nova contributors. A complete gap/feature comparison remains an outstanding item, which may still influence the discussion going forward.
>
>
> Baremetal Scheduling session
> ============================
>
> https://etherpad.openstack.org/p/SYD-forum-baremetal-scheduling
>
> We were originally hoping to cancel this session and redirect everyone into the nova placement status update, but we soon found out that there were some lingering questions, as well as operator concerns, that needed to be discussed.
>
> We started out in discussion and came to the realization that there could very well be a trait explosion in the attempt to support affinity and anti-affinity efforts. While for baremetal it could be a side effect, it does not line up with the nova model. Conceptually, we could quickly end up with trait lists that could look something like:
>
> CUSTOM_AC_GRID_C
> CUSTOM_ROOM1_POWER_GRID_C
> CUSTOM_CABINET_4
> CUSTOM_Charlotte_DC3
> CUSTOM_Charlotte_DC3_ROW2_CAB4
> CUSTOM_CUSTOMER_TAG
> CUSTOM_OWNED_ENV
> NET1GB
> NET2GB
> NET10GB
> NET10GB_DUAL
> CUSTOM_STORAGE_FABRIC_A
> CUSTOM_FC_FABRIC_B
> CUSTOM_REDUNDANT_COOLING
> CUSTOM_IS_A_BIKE_SHED_ON_THE_MOON
> CUSTOM_IS_NOT_LORD_VADERS_BIKESHED
>
> At some point, someone remarked, “It seems like there is just no solution that is going to work for everyone by default.” The remark covered not just resource class determination, but also trait identification, and it encompassed scheduling affinity and anti-affinity as well, which repeatedly came up in discussions over the week.
>
> This quickly raised an operator desire for the ironic community to solve for what would fit 80% of use cases, and then iterate moving forward. The example brought up in discussion was to give operators an explicit configuration parameter that they could use to assert a resource_class, or possibly even static trait lists, until they can populate the node with what should be there for their deployment, or for that individual hardware installation in their environment.
>
> While the ironic community's solution is "write introspection rules", it seems operators just want something simpler as a standing default, like an explicit default in the configuration file.
>
> Some operators pointed out that, with their processes, they would largely know or be able to reconcile the differences in their environment and make those changes in ironic as needed.
>
> Eventually, the discussion shifted to affinity/anti-affinity, which could partially make use of traits, although, as previously detailed, that would quickly result in a trait explosion depending on how an operator implements and chooses to manage their environment.
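>
> To make the explosion concrete, under nova's trait-based scheduling each flavor (or a per-request override, as discussed in the feedback session) would have to spell these requirements out as extra specs, roughly along these lines (the resource class and trait names are illustrative):
>
>     # Roughly what a single flavor would need to carry to land an instance
>     # on a node tagged with a handful of the traits listed above.
>     flavor_extra_specs = {
>         'resources:CUSTOM_BAREMETAL_GOLD': '1',
>         'trait:CUSTOM_CHARLOTTE_DC3_ROW2_CAB4': 'required',
>         'trait:CUSTOM_STORAGE_FABRIC_A': 'required',
>         'trait:CUSTOM_REDUNDANT_COOLING': 'required',
>     }
>
> Multiply that by every meaningful combination of location, fabric, and ownership, and the flavor-full environments operators described earlier follow quickly.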
>
> Action items:
> * Ironic needs to discuss as a group what the impact of this discussion means. Many of the themes, beyond providing configurable defaults to meet the 80% of users, have repeatedly come up and really drive towards some of the things detailed as part of the feedback session.
> * For "resource_class" defaults, Dmitry was kind enough to create https://bugs.launchpad.net/ironic/+bug/1732190 as this does seem like a quick and easy thing for us to address.
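>
> As a sketch of roughly what that default could amount to on the conductor side (the option name and where the hook lives are hypothetical; the bug may settle on something different):
>
>     from oslo_config import cfg
>
>     CONF = cfg.CONF
>     # Hypothetical option: a resource class applied to nodes created
>     # without one, instead of requiring introspection rules.
>     CONF.register_opts([cfg.StrOpt('default_resource_class')])
>
>     def apply_default_resource_class(node):
>         """Fill in resource_class at node creation if the operator set a default."""
>         if not node.resource_class and CONF.default_resource_class:
>             node.resource_class = CONF.default_resource_class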