On Tue, Sep 20, 2016 at 11:36:29AM -0400, Sean Dague wrote:
> On 09/20/2016 11:20 AM, Daniel P. Berrange wrote:
> > On Tue, Sep 20, 2016 at 11:01:23AM -0400, Sean Dague wrote:
> >> On 09/20/2016 10:38 AM, Daniel P. Berrange wrote:
> >>> On Tue, Sep 20, 2016 at 09:20:15AM -0400, Sean Dague wrote:
> >>>> This is a bit delayed due to the release rush; finally getting back to
> >>>> writing up my experiences at the Ops Meetup.
> >>>>
> >>>> Nova Feedback Session
> >>>> =====================
> >>>>
> >>>> We had a double session for feedback for Nova from operators, raw
> >>>> etherpad here - https://etherpad.openstack.org/p/NYC-ops-Nova.
> >>>>
> >>>> The median release people in the room were on was Kilo. Some were
> >>>> upgrading to Liberty; many had clouds older than Kilo. Remember that
> >>>> these are the larger ops environments that are engaged enough with the
> >>>> community to send people to the Ops Meetup.
> >>>>
> >>>>
> >>>> Performance Bottlenecks
> >>>> -----------------------
> >>>>
> >>>> * scheduling issues with Ironic (this is a bug we got through during
> >>>>   the week after the session)
> >>>> * live snapshots actually end up being a performance issue for people
> >>>>
> >>>> The workarounds config group was not well known, and everyone in the
> >>>> room wished we advertised it a bit more. The solution for snapshot
> >>>> performance is in there.
> >>>>
> >>>> There were also general questions about what scale cells should be
> >>>> considered at.
> >>>>
> >>>> ACTION: we should make sure workarounds are advertised better
> >>>
> >>> Workarounds ought to be something that admins are rarely, if
> >>> ever, having to deal with.
> >>>
> >>> If the lack of live snapshots is such a major performance problem
> >>> for ops, this tends to suggest that our default behaviour is wrong,
> >>> rather than that we need to publicise that operators should set this
> >>> workaround.
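[The workarounds group Sean mentions lives in nova.conf. For context, a minimal sketch of what an operator on a Kilo/Liberty-era cloud would set to opt in to live snapshots, assuming the option name discussed later in this thread:]

```ini
# Hypothetical nova.conf fragment (operator side), not an official
# recommendation from this thread.
[workarounds]
# The default at the time was True, i.e. live snapshots disabled;
# setting it to False opts in to live snapshots.
disable_libvirt_livesnapshot = False
```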
> >>>
> >>> eg, instead of optimizing for the case of broken live snapshot
> >>> support by default, we should optimize for the case of working
> >>> live snapshots by default. The broken live snapshot stuff was so
> >>> rare that no one has ever reproduced it outside of the gate,
> >>> AFAIK.
> >>>
> >>> IOW, rather than hardcoding disable_live_snapshot=True in Nova,
> >>> we should just set it in the gate CI configs, and leave it set
> >>> to False in Nova, so operators get good performance out of the
> >>> box.
> >>>
> >>> Also, it has been a while since we added the workaround, and IIRC
> >>> we've got newer Ubuntu available on at least some of the gate
> >>> hosts now, so we have the ability to test whether it still hits
> >>> newer Ubuntu.
> >>
> >> Here is my reconstruction of the snapshot issue from what I can
> >> remember of the conversation.
> >>
> >> Nova defaults to live snapshots. This uses the libvirt facility which
> >> dumps both memory and disk, and then we throw away the memory. For
> >> large-memory guests (especially volume-backed ones that might have a
> >> fast path for the disk) this leads to a lot of overhead for no gain.
> >> The workaround got them past it.
> >
> > I think you've got it backwards there.
> >
> > Nova defaults to *not* using live snapshots:
> >
> >     cfg.BoolOpt(
> >         'disable_libvirt_livesnapshot',
> >         default=True,
> >         help="""
> >     Disable live snapshots when using the libvirt driver.
> >     ...""")
> >
> > When live snapshots are disabled like this, the snapshot code is
> > unable to guarantee a consistent disk state, so the libvirt nova
> > driver will stop the guest by doing a managed save (this saves all
> > memory to disk), then does the disk snapshot, then restores the
> > managed save (which loads all memory from disk).
> >
> > This is terrible for multiple reasons:
> >
> > 1. the guest workload stops running while the snapshot is taken
> > 2. we churn disk I/O saving & loading VM memory
> > 3. you can't do it at all if host PCI devices are attached to the VM
> >
> > Enabling live snapshots by default fixes all these problems, at the
> > risk of hitting the live snapshot bug we saw in the gate CI but never
> > anywhere else.
>
> Ah, right. I'll propose inverting the default and we'll see if we can
> get past the testing in the gate - https://review.openstack.org/#/c/373430/
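[To make the tradeoff concrete: a simplified, hypothetical sketch in plain Python, not the actual Nova libvirt driver code, of the decision being discussed. With the workaround enabled, the driver falls back to the managed-save cold path, which is exactly what cannot work for guests with host PCI devices attached.]

```python
# Hypothetical sketch of the snapshot-path decision described above.
# The flag mirrors the disable_libvirt_livesnapshot workaround; the
# function name and return values are invented for illustration.

def choose_snapshot_method(disable_live_snapshot, has_host_pci_devices):
    """Return which snapshot strategy the driver would pick."""
    if not disable_live_snapshot:
        # Live snapshot: the guest keeps running and only disk state
        # is captured, so memory is never saved or restored.
        return "live_snapshot"
    if has_host_pci_devices:
        # Managed save cannot work with host PCI devices attached
        # (problem 3 above), so the cold path is impossible here.
        raise RuntimeError("cannot snapshot: host PCI devices attached")
    # Cold path: managed save dumps all guest memory to disk, the disk
    # is snapshotted, then the managed save is restored (problems 1
    # and 2 above: guest downtime plus heavy disk I/O churn).
    return "managed_save_then_snapshot"
```

[With the proposed default flip, the first branch becomes the out-of-the-box behaviour and the cold path only runs where an operator or the gate CI re-enables the workaround.]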
NB the bug was non-deterministic and rare, even in the gate, so the real
test is whether it gets past the gate 20 times in a row :-)

Regards,
Daniel
--
|: http://berrange.com  -o-  http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org         -o-          http://virt-manager.org :|
|: http://autobuild.org   -o-   http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org  -o-  http://live.gnome.org/gtk-vnc :|

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev