I forgot how to reply-all here :)

// jim

On Wed, May 18, 2016 at 05:35:55PM -0400, Jim Rollenhagen wrote:
> On Tue, May 17, 2016 at 11:32:25PM +0100, Stig Telfer wrote:
> > Is there anywhere that these experiences can be captured in a way that 
> > might help?
> > 
> > For example, I have a few DRAC-managed servers.  About half have fallen 
> > into a state where the pxe_drac driver can’t do anything with them 
> > (python-dracclient claims another transaction is underway).  But 
> > pxe_ipmitool works happily.
> > 
> > I’m pretty sure Ironic is not at fault here, so it doesn’t seem fair to
> > catalogue these things as Ironic bugs.  Perhaps the best action would be
> > for Ironic to be more informative when it identifies that a BMC is
> > playing up.
> > 
> > Jay and Jim - any thoughts?
> 
> Yeah, unfortunately we can't fix the terribleness of all the BMCs in the
> world. We are working on a few different efforts to help operators deal
> with these in general (described in my summit wrapup): Nova-style
> notifications, BMC reset APIs, automatically returning nodes to service
> when a BMC is reachable again, etc.
> 
> I'd totally file a bug with python-dracclient for the specific DRAC
> thing you mentioned.
> 
> In general, feel free to file bugs. If it's something we can deal with,
> we will triage it; if not, we'll keep it in mind for the more general
> handling of these things.
> 
> Does that help?
> 
> // jim
> 
> > 
> > Best wishes,
> > Stig
> > 
> > 
> > > On 12 May 2016, at 11:37, Peter Love <p.l...@lancaster.ac.uk> wrote:
> > > 
> > > Nice talk on this stuff: https://www.youtube.com/watch?v=GZeUntdObCA
> > > 
> > > On 12 May 2016 at 10:54, Matt Jarvis <matt.jar...@datacentred.co.uk> wrote:
> > >> Very familiar list Tim, and we end up working around a lot of them with
> > >> horrible hardware-specific code. Our bugbears also include:
> > >> 
> > >> - Required configuration only being available via a web interface, e.g.
> > >> setting the hostname of the BMC on Supermicro hardware
> > >> - IPMI hanging and requiring complete removal and reload of the kernel
> > >> modules to enable resetting
> > >> - Undocumented functions requiring raw IPMI commands; again on Supermicro
> > >> there is some black magic to set dedicated ports, check power supply
> > >> status, etc.
> > >> - Web interfaces requiring Java, and totally broken on mainstream
> > >> browsers: HP iLOs in particular, which are almost impossible to use with
> > >> a Mac
> > >> - Firmware and BIOSes which don't allow command-line updating from inside
> > >> a running OS
> > >> 
> > >> We're used to being able to flash BIOS images and CMOS settings by
> > >> writing directly to the memory addresses, but more and more modern
> > >> hardware won't let you do this anymore :(
> > >> 
> > >> We're hoping Redfish will solve some of the configuration related issues,
> > >> although obviously it won't make any difference to flaky BMC
> > >> implementations and proprietary tooling to update firmware.
> > >> 
> > >> On 12 May 2016 at 06:25, Tim Bell <tim.b...@cern.ch> wrote:
> > >>> 
> > >>> 
> > >>> 
> > >>> On 12/05/16 06:22, "Stig Telfer" <stig.openst...@telfer.org> wrote:
> > >>> 
> > >>>> Hi All -
> > >>>> 
> > >>>> Jim Rollenhagen from the Ironic project has just posted a great summit
> > >>>> report of Ironic team activities on the openstack-dev mailing list [1].
> > >>>> It included this item, which will be of interest to the Scientific WG
> > >>>> members who are looking to work on bare metal activities this cycle:
> > >>>> 
> > >>>>> # Making ops less worse
> > >>>>> 
> > >>>>> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
> > >>>>> 
> > >>>>> We discussed some common failure cases that operators see, and how we
> > >>>>> can solve them in code.
> > >>>>> 
> > >>>>> We discussed flaky BMCs, which end with the node in maintenance mode,
> > >>>>> and if Ironic can get them out of that mode automagically. We
> > >>>>> identified the need to distinguish between maintenance set by ironic
> > >>>>> and set by operators, and do things like attempt to connect to the BMC
> > >>>>> on a power state request, and turn off maintenance mode if successful.
> > >>>>> JayF is going to write a spec for this differentiation.
> > >>>>> 
> > >>>>> Folks also expressed the desire to be able to reset the BMC via APIs.
> > >>>>> We have a BMC reset function in the vendor interface for the ipmitool
> > >>>>> driver; dtantsur volunteered to write a spec to promote that method to
> > >>>>> an official ManagementInterface method.
> > >>>>> 
> > >>>>> We also talked for a while about stuck states. This has been mostly
> > >>>>> solved in code, but is still a problem for some deployers. We decided
> > >>>>> that we should not have a "reset-state" API like nova does, but rather
> > >>>>> a command line tool to handle this. lintan has volunteered to write a
> > >>>>> proposal for this; I have also posted some [straw man
> > >>>>> code](https://review.openstack.org/#/c/311273/) that someone is
> > >>>>> welcome to take over or use.
> > >>>> 
> > >>>> The operator issues already identified cover some things we’ve hit at
> > >>>> Cambridge. Please do scan through and contribute if there is anything
> > >>>> they have not covered.
> > >>>> 
> > >>> 
> > >>> We have certainly had our share of BMC problems through the years. It is
> > >>> often frustrating, as the very time you find you need the console is
> > >>> when it is not working. Having Ironic do active monitoring (without
> > >>> overloading the BMC) would be a real help.
> > >>> 
> > >>> The other area we’ve found difficult is configuration:
> > >>> 
> > >>> - Software maintenance is very limited. Some vendors choose to produce
> > >>> new versions of the BMC microcode without changing the version number
> > >>> reported by the BMC, which makes consistent management difficult. There
> > >>> is no common API defined for updating the code.
> > >>> - Implementations vary significantly between IPMI 1.5 and IPMI 2.0, and
> > >>> between commodity white boxes and blades.
> > >>> - BMCs have different LAN channels for remote access, according to
> > >>> manufacturer.
> > >>> - The tty speeds vary, which means that the booted OS needs different
> > >>> kernel cmdlines according to the underlying hardware.
> > >>> - The number of additional accounts is limited in some BMCs, and password
> > >>> management is very basic. Currently, we define distinct users for
> > >>> read-only access to the SDRs (e.g. monitoring), console and power
> > >>> operations, since these need to be kept in different systems. We also
> > >>> have unique passwords for each machine, all of which requires tracking.
> > >>> Foreman helps here but it is not ideal.
> > >>> - BMC replacement is also frequent. A process to re-import a replacement
> > >>> BMC (new MAC, no user accounts defined) without re-installing the box is
> > >>> needed.
> > >>> - We have a fairly complex reset process which hits the BMC with
> > >>> different levels of reset. We’ve also sometimes found the need to reset
> > >>> the IPMI kernel modules, which go into a loop, at the same time.
> > >>> 
> > >>> I’m not expecting Ironic to fix all of this, but it would be great to
> > >>> have a block of code which we can gradually improve together. There are
> > >>> other good initiatives like OpenBMC, but they won’t help with the
> > >>> existing boxes.
> > >>> 
> > >>> I think my best advice to Ironic for BMC management would be to consider
> > >>> the BMC as a potentially unreliable device. Thus, along with performing
> > >>> the actions, it should check that they completed and probe that a
> > >>> function which was working an hour ago is still working now (without
> > >>> overloading it). We’ll be looking at Ironic this year, so we’ll be able
> > >>> to help on the failure cases.
> > >>> 
> > >>> Tim
> > >>> 
> > >>>> Best wishes,
> > >>>> Stig
> > >>>> 
> > >>>> [1]
> > >>>> http://lists.openstack.org/pipermail/openstack-dev/2016-May/094658.html
> > >>>> _______________________________________________
> > >>>> OpenStack-operators mailing list
> > >>>> OpenStack-operators@lists.openstack.org
> > >>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> > >>> 
> > >> 
> > >> 
> > >> 
> > >> DataCentred Limited registered in England and Wales no. 05611763
> > >> 
> > > 
> > 
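
For what it's worth, Tim's "treat the BMC as a potentially unreliable
device" advice quoted above could be sketched roughly like this. It is only
an illustration under my own assumptions: run_and_verify, ActionResult and
FlakyBMC are made-up names, not Ironic or python-dracclient APIs, and
treating every exception as transient is a simplification.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionResult:
    ok: bool        # True if the action was confirmed to have taken effect
    attempts: int   # number of attempts made


def run_and_verify(action: Callable[[], None],
                   verify: Callable[[], bool],
                   retries: int = 3,
                   delay: float = 1.0,
                   backoff: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> ActionResult:
    """Perform an action, then confirm it completed; retry with backoff.

    The growing wait between attempts is there so that a struggling BMC
    is not hammered with requests.
    """
    wait = delay
    for attempt in range(1, retries + 1):
        try:
            action()
            if verify():
                return ActionResult(True, attempt)
        except Exception:
            # Assumption: any error from the BMC is treated as transient.
            pass
        if attempt < retries:
            sleep(wait)
            wait *= backoff
    return ActionResult(False, retries)


class FlakyBMC:
    """Hypothetical stand-in for a BMC that fails its first few requests."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.power = "off"

    def power_on(self) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("another transaction is underway")
        self.power = "on"


if __name__ == "__main__":
    bmc = FlakyBMC(failures=2)
    result = run_and_verify(bmc.power_on, lambda: bmc.power == "on",
                            retries=4, delay=0.1)
    print(result)
```

A caller that gets ok=False back could then decide to put the node into
maintenance, rather than assuming the action succeeded.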
