Nice talk on this stuff: https://www.youtube.com/watch?v=GZeUntdObCA
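On the flaky-BMC item in the summit notes quoted below (telling maintenance set by Ironic apart from maintenance set by an operator, and clearing the former once the BMC answers a power state request), here is a rough sketch of how that bookkeeping could look. To be clear, this is not Ironic code: `Node`, `MaintenanceSource`, `mark_bmc_unreachable` and `sync_power_state` are names I made up for illustration, and the BMC call is passed in as a plain function so the logic can be tried without hardware.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class MaintenanceSource(Enum):
    """Who put the node into maintenance (hypothetical, not Ironic's schema)."""
    OPERATOR = "operator"
    SYSTEM = "system"


@dataclass
class Node:
    name: str
    maintenance: bool = False
    maintenance_source: Optional[MaintenanceSource] = None


def mark_bmc_unreachable(node: Node) -> None:
    """Set system maintenance, but never overwrite an operator's flag."""
    if node.maintenance and node.maintenance_source is MaintenanceSource.OPERATOR:
        return
    node.maintenance = True
    node.maintenance_source = MaintenanceSource.SYSTEM


def sync_power_state(node: Node,
                     get_power_state: Callable[[Node], str]) -> Optional[str]:
    """Ask the BMC for the power state; clear system-set maintenance on success.

    get_power_state is an assumed callable that raises OSError when the BMC
    cannot be reached.
    """
    try:
        state = get_power_state(node)
    except OSError:
        mark_bmc_unreachable(node)
        return None
    # The BMC answered, so maintenance that the system set itself can go.
    # Maintenance set by an operator is left alone.
    if node.maintenance and node.maintenance_source is MaintenanceSource.SYSTEM:
        node.maintenance = False
        node.maintenance_source = None
    return state
```

The point of the extra field is that a successful probe only ever undoes what the system did itself; a node an operator parked in maintenance stays there no matter how healthy its BMC looks.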
On 12 May 2016 at 10:54, Matt Jarvis <[email protected]> wrote:
> Very familiar list Tim, and we end up working around a lot of them with
> horrible hardware specific code. Our bugbears also include:
>
> - Required configuration only being available via a web interface - e.g.
>   setting hostname of the BMC on Supermicro hardware
> - IPMI hanging and requiring complete removal and reload of the kernel
>   modules to enable resetting
> - Undocumented functions requiring raw IPMI commands - again on Supermicro
>   there is some black magic to set dedicated ports, check power supply
>   status etc.
> - Web interfaces requiring Java, and totally broken on mainstream browsers -
>   HP iLOs in particular, which are almost impossible to use with a Mac.
> - Firmware and BIOSes which don't allow command line updating from inside a
>   running OS
>
> We're used to being able to flash BIOS images and CMOS settings by writing
> directly to the memory addresses, but more and more modern hardware won't
> let you do this anymore :(
>
> We're hoping Redfish will solve some of the configuration related issues,
> although obviously it won't make any difference to flaky BMC implementations
> and proprietary tooling to update firmware.
>
> On 12 May 2016 at 06:25, Tim Bell <[email protected]> wrote:
>>
>> On 12/05/16 06:22, "Stig Telfer" <[email protected]> wrote:
>>
>> >Hi All -
>> >
>> >Jim Rollenhagen from the Ironic project has just posted a great summit
>> >report of Ironic team activities on the openstack-dev mailing list[1],
>> >which included this item which will be of interest to the Scientific WG
>> >members who are looking to work on bare metal activities this cycle:
>> >
>> >> # Making ops less worse
>> >>
>> >> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
>> >>
>> >> We discussed some common failure cases that operators see, and how we
>> >> can solve them in code.
>> >>
>> >> We discussed flaky BMCs, which end with the node in maintenance mode,
>> >> and if Ironic can get them out of that mode automagically. We
>> >> identified the need to distinguish between maintenance set by ironic
>> >> and set by operators, and do things like attempt to connect to the BMC
>> >> on a power state request, and turn off maintenance mode if successful.
>> >> JayF is going to write a spec for this differentiation.
>> >>
>> >> Folks also expressed the desire to be able to reset the BMC via APIs.
>> >> We have a BMC reset function in the vendor interface for the ipmitool
>> >> driver; dtantsur volunteered to write a spec to promote that method to
>> >> an official ManagementInterface method.
>> >>
>> >> We also talked for a while about stuck states. This has been mostly
>> >> solved in code, but is still a problem for some deployers. We decided
>> >> that we should not have a "reset-state" API like nova does, but rather
>> >> a command line tool to handle this. lintan has volunteered to write a
>> >> proposal for this; I have also posted some [straw man
>> >> code](https://review.openstack.org/#/c/311273/) that someone is welcome
>> >> to take over or use.
>> >
>> >The operator issues already identified cover some things we’ve hit at
>> >Cambridge, please do scan through and contribute if there is anything they
>> >have not covered.
>> >
>>
>> We have certainly had our share of BMC problems through the years. It is
>> often frustrating as the very time you find you need the console, it is not
>> working. Having Ironic doing an active monitoring (without overloading)
>> would be a real help.
>>
>> The other item we’ve found difficult has been in the configuration:
>>
>> - Software maintenance is very limited. Some vendors choose to produce new
>>   versions of the BMC microcode without changing the version number reported
>>   by the BMC which makes consistent management difficult.
>>   There is no common API defined for updating the code.
>> - Implementations between IPMI 1.5 and IPMI 2.0 vary significantly and
>>   between commodity white boxes and blades
>> - BMCs have different LAN channels according to manufacturer for remote
>>   access
>> - The tty speeds vary which means that the booted OS needs to have
>>   different cmdlines for the kernel according to the underlying hardware
>> - The number of additional accounts is limited in some BMCs and password
>>   management is very basic. Currently, we define distinct users for
>>   read-only access to the SDRs (e.g. monitoring), console and power
>>   operations since these need to be kept in different systems. We also have
>>   unique passwords for each machine, all of which requires tracking.
>>   Foreman helps here but it is not ideal.
>> - BMC replacement is also frequent. A process to re-import a replacement
>>   BMC (new MAC, no user accounts defined) without re-installing the box is
>>   needed.
>> - We have a fairly complex reset process which hits the BMC with different
>>   levels of reset. We’ve also sometimes found the need to reset the IPMI
>>   kernel modules at the same time which go into a loop.
>>
>> I’m not expecting Ironic to fix all of this but it would be great to have
>> a block of code which we can gradually improve together. There are other
>> good initiatives like OpenBMC but they won’t help with the existing boxes.
>>
>> I think my best advice to Ironic for BMC management would be to consider
>> the BMC as a potentially unreliable device. Thus, along with performing the
>> actions, checking they completed and probing that a function which was
>> working an hour ago is still working now (but not overloading it)… we’ll be
>> looking at Ironic this year so we’ll be able to help on the failure cases.
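Tim's advice above (perform the action, check it completed, and re-probe now and then without overloading the BMC) can be sketched roughly like this. It is only an illustration under my own assumptions: `act_and_verify` and `RateLimitedProbe` are hypothetical names, the real BMC calls are passed in as callables expected to raise OSError on failure, and the retry counts and intervals are placeholders rather than recommendations.

```python
import time
from typing import Callable, Optional


def act_and_verify(action: Callable[[], None],
                   check: Callable[[], bool],
                   attempts: int = 3,
                   delay: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run action, then confirm it took effect; retry with a growing delay."""
    for attempt in range(attempts):
        try:
            action()
            if check():
                return True
        except OSError:
            pass  # an error from the BMC is treated like a failed check
        if attempt < attempts - 1:
            sleep(delay * (attempt + 1))
    return False


class RateLimitedProbe:
    """Re-check a BMC function, but at most once per min_interval seconds."""

    def __init__(self,
                 probe: Callable[[], bool],
                 min_interval: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._min_interval = min_interval
        self._clock = clock
        self._last_run: Optional[float] = None
        self._last_result = False

    def healthy(self) -> bool:
        """Return a cached result unless the interval has elapsed."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._min_interval:
            try:
                self._last_result = bool(self._probe())
            except OSError:
                self._last_result = False
            self._last_run = now
        return self._last_result
```

Callers can ask `healthy()` as often as they like; the BMC itself is only touched once per interval, which is the "without overloading" part.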
>>
>> Tim
>>
>> >Best wishes,
>> >Stig
>> >
>> >[1] http://lists.openstack.org/pipermail/openstack-dev/2016-May/094658.html
>> >_______________________________________________
>> >OpenStack-operators mailing list
>> >[email protected]
>> >http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
> DataCentred Limited registered in England and Wales no. 05611763
