On 4/13/2016 8:21 PM, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
>>> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
>>> as length parameter, and an address which is directly based on
>>> the page address supplied by the guest. Since the virtio-balloon
>>> protocol is always based on 4k addresses and sizes, no matter
>>> what the host and guest are using as page size, this has a couple
>>> of issues which could even lead to data loss in the worst case.
>>>
>>> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
>>> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
>>> 4k, we also destroy the 4k areas after the current one - which
>>> might be wrong since the guest did not want to free that area yet
>>> (in case the guest used a smaller MMU page size than the hard-coded
>>> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
>>> called BALLOON_PAGE_SIZE (which is 4096) and use it as the size
>>> parameter for the madvise() call instead.
>>
>> This makes absolute sense.
>>
>>> Then there's yet another problem: if the host page size is bigger
>>> than the 4k balloon page size, we cannot simply call madvise() on
>>> each of the 4k balloon addresses that we get from the guest, since
>>> madvise() always evicts the whole host page, not only a 4k area!
>>
>> Does it really round the length up?
>> Wow, it does:
>>
>>     len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>
>> which seems to be undocumented, but has been there forever.
>
> Yes, that's ugly - I also had to take a look at the kernel sources to
> understand what this call is supposed to do when called with a
> size < PAGE_SIZE.
>
>>> So in this case, we've got to track the 4k fragments of a host page
>>> and only call madvise(DONTNEED) once all fragments have been collected.
>>> This of course only works well if the guest sends consecutive 4k
>>> fragments - which is the case in the most important scenarios that
>>> I try to address here (like a ppc64 guest with 64k page size that
>>> is running on a ppc64 host with 64k page size). In case the guest
>>> uses a page size that is smaller than the host page size, we might
>>> need to add some additional logic here later to increase the
>>> probability of being able to release memory, but at least the guest
>>> should no longer crash due to unintentionally evicted pages.
> ...
>>>  static void virtio_balloon_instance_init(Object *obj)
>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
>>> index 35f62ac..04b7c0c 100644
>>> --- a/include/hw/virtio/virtio-balloon.h
>>> +++ b/include/hw/virtio/virtio-balloon.h
>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>>>      int64_t stats_last_update;
>>>      int64_t stats_poll_interval;
>>>      uint32_t host_features;
>>> +    void *current_addr;
>>> +    unsigned long *fragment_bits;
>>> +    int fragment_bits_size;
>>>  } VirtIOBalloon;
>>>
>>>  #endif
>>> --
>>> 1.8.3.1
>>
>> It looks like fragment_bits would have to be migrated,
>> which is a lot of complexity.
>
> I think we could simply omit this for now. In case somebody migrates the
> VM while the ballooning is going on, we'd lose the information for one
> host page, so we might miss one madvise(DONTNEED), but I think we could
> live with that.
>
>> And workarounds for specific guest behaviour are really ugly.
>> There are patches on-list to maintain a balloon bitmap -
>> that will enable fixing it cleanly.
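For reference, a minimal sketch of the fragment-tracking idea discussed
above might look like the following. The function name and the
single-word bitmap are illustrative simplifications, not the actual
patch, and the sketch assumes power-of-two host pages with at most 64
fragments (i.e. up to 256k):

#include <stdint.h>
#include <sys/mman.h>

#define BALLOON_PAGE_SIZE 4096

static void *current_addr;          /* host page currently being collected */
static unsigned long fragment_bits; /* one bit per 4k fragment */

static void balloon_page(void *addr, size_t host_page_size)
{
    uintptr_t a = (uintptr_t)addr;
    void *host_page = (void *)(a & ~((uintptr_t)host_page_size - 1));
    unsigned frag = (a & (host_page_size - 1)) / BALLOON_PAGE_SIZE;
    unsigned nfrags = host_page_size / BALLOON_PAGE_SIZE;
    unsigned long all_set =
        nfrags >= 8 * sizeof(unsigned long) ? ~0UL : (1UL << nfrags) - 1;

    if (host_page_size == BALLOON_PAGE_SIZE) {
        /* Balloon page == host page: evict it directly. */
        madvise(addr, BALLOON_PAGE_SIZE, MADV_DONTNEED);
        return;
    }

    if (host_page != current_addr) {
        /* New host page: forget any partially collected predecessor. */
        current_addr = host_page;
        fragment_bits = 0;
    }
    fragment_bits |= 1UL << frag;

    if (fragment_bits == all_set) {
        /* The guest has ballooned every 4k fragment of this host page,
         * so it is now safe to evict the whole page at once. */
        madvise(host_page, host_page_size, MADV_DONTNEED);
        current_addr = NULL;
        fragment_bits = 0;
    }
}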
>
> Ah, good to know, I wasn't aware of them yet, so that will be a chance
> for a really proper final solution, I hope.
>
>> How about we just skip madvise if host page size is > balloon
>> page size, for 2.6?
>
> That would mean a regression compared to what we have today. Currently,
> ballooning is working OK for 64k guests on a 64k ppc host - rather by
> chance than on purpose, but it's working. The guest always sends all
> the 4k fragments of a 64k page, and QEMU tries to call madvise() for
> every one of them, but the kernel ignores madvise() on non-64k-aligned
> addresses, so we end up in a situation where the madvise() frees a
> whole 64k page which is also declared as free by the guest.
>
> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later adapt it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?
>
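For reference, the alignment behaviour described above can be reproduced
with a small stand-alone test program (illustrative only, not part of
any patch): on a 64k host, only the first, page-aligned call succeeds -
and the kernel rounds its 4k length up to the full host page - while the
remaining fifteen calls fail with EINVAL and are effectively ignored.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long psize = sysconf(_SC_PAGESIZE);   /* 65536 on a 64k ppc64 host */
    char *base = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    long off;

    if (base == MAP_FAILED) {
        return 1;
    }
    memset(base, 0xaa, psize);            /* fault the page in */

    for (off = 0; off < psize; off += 4096) {
        int ret = madvise(base + off, 4096, MADV_DONTNEED);
        printf("offset %5ld: madvise() = %d%s%s\n", off, ret,
               ret ? ", errno = " : "", ret ? strerror(errno) : "");
    }
    return 0;
}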
Please find the v3 version of the balloon bitmap patch series below, as
posted on the qemu-devel list:

balloon: maintain bitmap for pages released by guest balloon driver.
balloon: add balloon bitmap migration capability and setup bitmap migration status.
balloon: reset balloon bitmap ramblock size on source and target.
migration: skip scanning and migrating ram pages released by virtio-balloon driver.

- Jitendra

> Thomas
>