On 4/13/2016 8:21 PM, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>>> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
>>> as length parameter, and an address which is directly based on
>>> the page address supplied by the guest. Since the virtio-balloon
>>> protocol is always based on 4k addresses and sizes, no matter
>>> what the host and guest are using as page size, this has a couple
>>> of issues which could even lead to data loss in the worst case.
>>>
>>> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
>>> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
>>> 4k, we also destroy the 4k areas after the current one - which
>>> might be wrong since the guest did not want to free that area yet
>>> (in case the guest used a smaller MMU page size than the hard-coded
>>> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
>>> called BALLOON_PAGE_SIZE (which is 4096) and use it as the size
>>> parameter for the madvise() call instead.
>>
>> This makes absolute sense.
>>
>>> Then there's yet another problem: if the host page size is bigger
>>> than the 4k balloon page size, we cannot simply call madvise() on
>>> each of the 4k balloon addresses that we get from the guest, since
>>> madvise() always evicts the whole host page, not only a 4k area!
>>
>> Does it really round the length up?
>> Wow, it does:
>>
>>     len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>
>> which seems to be undocumented, but has been there forever.
>
> Yes, that's ugly - I also had to take a look at the kernel sources to
> understand what this call is supposed to do when called with a
> size < PAGE_SIZE.
>
>>> So in this case, we've got to track the 4k fragments of a host page
>>> and only call madvise(DONTNEED) once all fragments have been collected.
>>> This of course only works well if the guest sends consecutive 4k
>>> fragments - which is the case in the most important scenarios that
>>> I try to address here (like a ppc64 guest with 64k page size that
>>> is running on a ppc64 host with 64k page size). In case the guest
>>> uses a page size that is smaller than the host page size, we might
>>> need to add some additional logic here later to increase the
>>> probability of being able to release memory, but at least the guest
>>> should no longer crash due to unintentionally evicted pages.
> ...
>>>  static void virtio_balloon_instance_init(Object *obj)
>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>> index 35f62ac..04b7c0c 100644
>>> --- a/include/hw/virtio/virtio-balloon.h
>>> +++ b/include/hw/virtio/virtio-balloon.h
>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>>      int64_t stats_last_update;
>>>      int64_t stats_poll_interval;
>>>      uint32_t host_features;
>>> +    void *current_addr;
>>> +    unsigned long *fragment_bits;
>>> +    int fragment_bits_size;
>>>  } VirtIOBalloon;
>>>
>>>  #endif
>>> --
>>> 1.8.3.1
>>
>> It looks like fragment_bits would have to be migrated,
>> which is a lot of complexity.
>
> I think we could simply omit this for now. In case somebody migrates the
> VM while the ballooning is going on, we'd lose the information for one
> host page, so we might miss one madvise(DONTNEED), but I think we could
> live with that.
>
>> And workarounds for specific guest behaviour are really ugly.
>> There are patches on-list to maintain a balloon bitmap -
>> that will enable fixing it cleanly.
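For reference, a minimal sketch of the fragment-tracking idea discussed
above might look like the following. The function name and the
single-word bitmap are illustrative simplifications, not the actual
patch, and the sketch assumes power-of-two host pages with at most 64
fragments (i.e. up to 256k):

#include <stdint.h>
#include <sys/mman.h>

#define BALLOON_PAGE_SIZE 4096

static void *current_addr;          /* host page currently being collected */
static unsigned long fragment_bits; /* one bit per 4k fragment */

static void balloon_page(void *addr, size_t host_page_size)
{
    uintptr_t a = (uintptr_t)addr;
    void *host_page = (void *)(a & ~((uintptr_t)host_page_size - 1));
    unsigned frag = (a & (host_page_size - 1)) / BALLOON_PAGE_SIZE;
    unsigned nfrags = host_page_size / BALLOON_PAGE_SIZE;
    unsigned long all_set =
        nfrags >= 8 * sizeof(unsigned long) ? ~0UL : (1UL << nfrags) - 1;

    if (host_page_size == BALLOON_PAGE_SIZE) {
        /* Balloon page == host page: evict it directly. */
        madvise(addr, BALLOON_PAGE_SIZE, MADV_DONTNEED);
        return;
    }

    if (host_page != current_addr) {
        /* New host page: forget any partially collected predecessor. */
        current_addr = host_page;
        fragment_bits = 0;
    }
    fragment_bits |= 1UL << frag;

    if (fragment_bits == all_set) {
        /* The guest has ballooned every 4k fragment of this host page,
         * so it is now safe to evict the whole page at once. */
        madvise(host_page, host_page_size, MADV_DONTNEED);
        current_addr = NULL;
        fragment_bits = 0;
    }
}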
>
> Ah, good to know, I wasn't aware of them yet, so that will be a chance
> for a really proper final solution, I hope.
>
>> How about we just skip madvise if host page size is > balloon
>> page size, for 2.6?
>
> That would mean a regression compared to what we have today. Currently,
> ballooning is working OK for 64k guests on a 64k ppc host - rather by
> chance than on purpose, but it's working. The guest always sends all
> the 4k fragments of a 64k page, and QEMU tries to call madvise() for
> every one of them, but the kernel ignores madvise() on non-64k-aligned
> addresses, so we end up in a situation where the madvise() frees a
> whole 64k page which is also declared as free by the guest.
>
> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later adapt it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?
>
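For reference, the alignment behaviour described above can be reproduced
with a small stand-alone test program (illustrative only, not part of
any patch): on a 64k host, only the first, page-aligned call succeeds -
and the kernel rounds its 4k length up to the full host page - while the
remaining fifteen calls fail with EINVAL and are effectively ignored.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long psize = sysconf(_SC_PAGESIZE);   /* 65536 on a 64k ppc64 host */
    char *base = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    long off;

    if (base == MAP_FAILED) {
        return 1;
    }
    memset(base, 0xaa, psize);            /* fault the page in */

    for (off = 0; off < psize; off += 4096) {
        int ret = madvise(base + off, 4096, MADV_DONTNEED);
        printf("offset %5ld: madvise() = %d%s%s\n", off, ret,
               ret ? ", errno = " : "", ret ? strerror(errno) : "");
    }
    return 0;
}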
Please find the v3 version of the balloon bitmap patch series below, as
posted on the qemu-devel list:

balloon: maintain bitmap for pages released by guest balloon driver.
balloon: add balloon bitmap migration capability and setup bitmap migration status.
balloon: reset balloon bitmap ramblock size on source and target.
migration: skip scanning and migrating ram pages released by virtio-balloon driver.

- Jitendra

> Thomas
>