On Tue, Sep 20, 2016 at 11:01:23AM -0400, Sean Dague wrote:
> On 09/20/2016 10:38 AM, Daniel P. Berrange wrote:
> > On Tue, Sep 20, 2016 at 09:20:15AM -0400, Sean Dague wrote:
> >> This is a bit delayed due to the release rush, finally getting back to
> >> writing up my experiences at the Ops Meetup.
> >> Nova Feedback Session
> >> =====================
> >> We had a double session for Feedback for Nova from Operators, raw
> >> etherpad here - https://etherpad.openstack.org/p/NYC-ops-Nova.
> >> The median release people were on in the room was Kilo. Some were
> >> upgrading to Liberty, many had clouds older than Kilo. Keep in mind
> >> that these are the larger ops environments that are engaged enough
> >> with the community to send people to the Ops Meetup.
> >> Performance Bottlenecks
> >> -----------------------
> >> * scheduling issues with Ironic - (this is a bug we got through during
> >> the week after the session)
> >> * live snapshots actually end up being a performance issue for people
> >> The workarounds config group was not well known, and everyone in the
> >> room wished we advertised that a bit more. The solution for snapshot
> >> performance is in there.
> >> There were also general questions about what scale cells should be
> >> considered at.
> >> ACTION: we should make sure workarounds are advertised better
> > Workarounds ought to be something that admins are rarely, if
> > ever, having to deal with.
> > If the lack of live snapshot is such a major performance problem
> > for ops, this tends to suggest that our default behaviour is wrong,
> > rather than a need to publicise that operators should set this
> > workaround.
> > eg, instead of optimizing for the case of a broken live snapshot
> > support by default, we should optimize for the case of working
> > live snapshot by default. The broken live snapshot stuff was so
> > rare that no one has ever reproduced it outside of the gate
> > AFAIK.
> > IOW, rather than hardcoding disable_live_snapshot=True in nova,
> > we should just set it in the gate CI configs, and leave it set
> > to False in Nova, so operators get good performance out of the
> > box.
> > Also it has been a while since we added the workaround, and IIRC,
> > we've got newer Ubuntu available on at least some of the gate
> > hosts now, so we have the ability to test to see if it still
> > hits newer Ubuntu.
> Here is my reconstruction of the snapshot issue from what I can remember
> of the conversation.
> Nova defaults to live snapshots. This uses the libvirt facility which
> dumps both memory and disk. And then we throw away the memory. For large
> memory guests (especially volume backed ones that might have a fast path
> for the disk) this leads to a lot of overhead for no gain. The
> workaround got them past it.
I think you've got it backwards there.
Nova defaults to *not* using live snapshots:
Disable live snapshots when using the libvirt driver.
When live snapshot is disabled like this, the snapshot code is unable
to guarantee a consistent disk state. So the libvirt nova driver will
stop the guest by doing a managed save (this saves all memory to
disk), then do the disk snapshot, then restore the managed save
(which loads all memory from disk).
This is terrible for multiple reasons:
1. the guest workload stops running while snapshot is taken
2. we churn disk I/O saving & loading VM memory
3. you can't do it at all if host PCI devices are attached to the guest
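
To make the two code paths concrete, here is a rough Python sketch of the
behaviour described above. It is only a toy model and not Nova or libvirt
code: the Guest class, the snapshot() function, and the step strings are all
made up for illustration, and disable_live_snapshot is just a plain boolean
named after the setting discussed in this thread.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Guest:
    """Hypothetical stand-in for a running guest; not a Nova or libvirt object."""
    name: str
    memory_mb: int
    has_host_pci_devices: bool = False
    was_paused: bool = False
    steps: List[str] = field(default_factory=list)


def snapshot(guest: Guest, disable_live_snapshot: bool = True) -> List[str]:
    """Record the ordered steps taken to snapshot the guest's disk."""
    if not disable_live_snapshot:
        # Live path: the guest keeps running and its memory is never
        # written out to disk.
        guest.steps.append("live disk snapshot")
        return guest.steps

    # Cold path: without live snapshot the disk state can only be made
    # consistent by stopping the guest first.
    if guest.has_host_pci_devices:
        # Reason 3 in the list above: the managed save step cannot be done.
        raise RuntimeError("cannot managed-save a guest with host PCI devices")

    guest.was_paused = True  # reason 1: the workload stops running
    guest.steps.append(f"managed save: write {guest.memory_mb} MB to disk")  # reason 2
    guest.steps.append("disk snapshot")
    guest.steps.append(f"restore: read {guest.memory_mb} MB from disk")  # reason 2
    return guest.steps


if __name__ == "__main__":
    for flag in (True, False):
        g = Guest(name="demo", memory_mb=4096)
        print(f"disable_live_snapshot={flag}:", snapshot(g, flag))
```

With the flag set, a guest goes through three steps and is paused for the
duration; with it cleared there is a single step and no pause.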
Enabling live snapshots by default fixes all these problems, at the
risk of hitting the live snapshot bug we saw in the gate CI but never
OpenStack Development Mailing List (not for usage questions)