On Fri, 24 Feb 2023 22:41:16 +0100 "Maciej S. Szmigiero" <m...@maciej.szmigiero.name> wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com> > > This driver is like virtio-balloon on steroids: it allows both changing the > guest memory allocation via ballooning and inserting extra RAM into it by > adding required memory backends and providing them to the driver. this sounds pretty much like what virtio-mem does, modulo used protocol. Would it be too crazy ask to reuse virtio-mem by teaching it new protocol and avoid adding new device with all mgmt hurdles that virtio-mem has already solved? > One of advantages of these over ACPI-based PC DIMM hotplug is that such > memory can be hotplugged in much smaller granularity because the ACPI DIMM > slot limit does not apply. > > Hot-adding additional memory is done by creating a new memory backend (for > example by executing HMP command > "object_add memory-backend-ram,id=mem1,size=4G"), then executing a new > "hv-balloon-add-memory" QMP command, providing the id of that memory > backend as the "id" parameter. > > In contrast with ACPI DIMM hotplug where one can only request to unplug a > whole DIMM stick this driver allows removing memory from guest in single > page (4k) units via ballooning. > > After a VM reboot each previously hot-added memory backend gets released. > A "HV_BALLOON_MEMORY_BACKEND_UNUSED" QMP event is emitted in this case so > the software controlling QEMU knows that it either needs to delete that > memory backend (if no longer needed) or re-insert it. > > In the future, the guest boot memory size might be changed on reboot > instead, taking into account the effective size that VM had before that > reboot (much like Hyper-V does). > > For performance reasons, the guest-released memory is tracked in a few > range trees, as a series of (start, count) ranges. > Each time a new page range is inserted into such tree its neighbors are > checked as candidates for possible merging with it. > > Besides performance reasons, the Dynamic Memory protocol itself uses page > ranges as the data structure in its messages, so relevant pages need to be > merged into such ranges anyway. > > One has to be careful when tracking the guest-released pages, since the > guest can maliciously report returning pages outside its current address > space, which later clash with the address range of newly added memory. > Similarly, the guest can report freeing the same page twice. > > The above design results in much better ballooning performance than when > using virtio-balloon with the same guest: 230 GB / minute with this driver > versus 70 GB / minute with virtio-balloon. > > During a ballooning operation most of time is spent waiting for the guest > to come up with newly freed page ranges, processing the received ranges on > the host side (in QEMU and KVM) is nearly instantaneous. > > The unballoon operation is also pretty much instantaneous: > thanks to the merging of the ballooned out page ranges 200 GB of memory can > be returned to the guest in about 1 second. > With virtio-balloon this operation takes about 2.5 minutes. > > These tests were done against a Windows Server 2019 guest running on a > Xeon E5-2699, after dirtying the whole memory inside guest before each > balloon operation. > > Using a range tree instead of a bitmap to track the removed memory also > means that the solution scales well with the guest size: even a 1 TB range > takes just few bytes of memory. 
>
> Since the required GTree operations aren't present in every Glib version,
> a check for them was added to the "configure" script, together with the new
> "--enable-hv-balloon" and "--disable-hv-balloon" arguments.
> If these GTree operations are missing from the system's Glib version this
> driver will be skipped during the QEMU build.
>
> An optional "status-report=on" device parameter requests memory status
> events from the guest (typically sent every second), which let the host
> learn both the guest's available and in-use memory counts.
> They are emitted externally as "HV_BALLOON_STATUS_REPORT" QMP events.
>
> The driver is named hv-balloon since the Linux kernel client driver for
> the Dynamic Memory Protocol is named as such and to follow the naming
> pattern established by the virtio-balloon driver.
> The whole protocol runs over Hyper-V VMBus.
>
> The driver was tested against Windows Server 2012 R2, Windows Server 2016
> and Windows Server 2019 guests and obeys the guest alignment requirements
> reported to the host via the DM_CAPABILITIES_REPORT message.
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
> ---
>  Kconfig.host           |    3 +
>  configure              |   36 +
>  hw/hyperv/Kconfig      |    5 +
>  hw/hyperv/hv-balloon.c | 2185 ++++++++++++++++++++++++++++++++++++++++
>  hw/hyperv/meson.build  |    1 +
>  hw/hyperv/trace-events |   16 +
>  meson.build            |    4 +-
>  qapi/machine.json      |   68 ++
>  8 files changed, 2317 insertions(+), 1 deletion(-)
>  create mode 100644 hw/hyperv/hv-balloon.c
>
> diff --git a/Kconfig.host b/Kconfig.host
> index d763d89269..2ee71578f3 100644
> --- a/Kconfig.host
> +++ b/Kconfig.host
> @@ -46,3 +46,6 @@ config FUZZ
>  config VFIO_USER_SERVER_ALLOWED
>      bool
>      imply VFIO_USER_SERVER
> +
> +config HV_BALLOON_POSSIBLE
> +    bool
> diff --git a/configure b/configure
> index cf6db3d551..b534955f58 100755
> --- a/configure
> +++ b/configure
> @@ -283,6 +283,7 @@ bsd_user=""
>  pie=""
>  coroutine=""
>  plugins="$default_feature"
> +hv_balloon="$default_feature"
>  meson=""
>  ninja=""
>  bindir="bin"
> @@ -866,6 +867,10 @@ for opt do
>    ;;
>    --disable-vfio-user-server) vfio_user_server="disabled"
>    ;;
> +  --enable-hv-balloon) hv_balloon=yes
> +  ;;
> +  --disable-hv-balloon) hv_balloon=no
> +  ;;
>    # everything else has the same name in configure and meson
>    --*) meson_option_parse "$opt" "$optarg"
>    ;;
> @@ -1019,6 +1024,7 @@ cat << EOF
>   debug-info      debugging information
>   safe-stack      SafeStack Stack Smash Protection. Depends on
>                   clang/llvm and requires coroutine backend ucontext.
> + hv-balloon hv-balloon driver where supported (requires Glib 2.68+ > GTree API) > > NOTE: The object files are built at the place where configure is launched > EOF > @@ -1740,6 +1746,32 @@ EOF > fi > fi > > +########################################## > +# check for hv-balloon > + > +if test "$hv_balloon" != "no"; then > + cat > $TMPC << EOF > +#include <string.h> > +#include <gmodule.h> > +int main(void) { > + GTree *tree; > + > + tree = g_tree_new((GCompareFunc)strcmp); > + (void)g_tree_node_first(tree); > + g_tree_destroy(tree); > + return 0; > +} > +EOF > + if compile_prog "$glib_cflags" "$glib_libs" ; then > + hv_balloon=yes > + else > + if test "$hv_balloon" = "yes" ; then > + feature_not_found "hv-balloon" "Update Glib" > + fi > + hv_balloon="no" > + fi > +fi > + > ########################################## > # functions to probe cross compilers > > @@ -2336,6 +2368,10 @@ if test "$have_tsan" = "yes" && test > "$have_tsan_iface_fiber" = "yes" ; then > echo "CONFIG_TSAN=y" >> $config_host_mak > fi > > +if test "$hv_balloon" = "yes" ; then > + echo "CONFIG_HV_BALLOON_POSSIBLE=y" >> $config_host_mak > +fi > + > if test "$plugins" = "yes" ; then > echo "CONFIG_PLUGIN=y" >> $config_host_mak > fi > diff --git a/hw/hyperv/Kconfig b/hw/hyperv/Kconfig > index fcf65903bd..8f8be1bcce 100644 > --- a/hw/hyperv/Kconfig > +++ b/hw/hyperv/Kconfig > @@ -16,3 +16,8 @@ config SYNDBG > bool > default y > depends on VMBUS > + > +config HV_BALLOON > + bool > + default y > + depends on HV_BALLOON_POSSIBLE && VMBUS && HAPVDIMM > diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c > new file mode 100644 > index 0000000000..b11f005189 > --- /dev/null > +++ b/hw/hyperv/hv-balloon.c > @@ -0,0 +1,2185 @@ > +/* > + * QEMU Hyper-V Dynamic Memory Protocol driver > + * > + * Copyright (C) 2020-2023 Oracle and/or its affiliates. > + * > + * This work is licensed under the terms of the GNU GPL, version 2 or later. > + * See the COPYING file in the top-level directory. > + */ > + > +#include "qemu/osdep.h" > + > +#include "exec/address-spaces.h" > +#include "exec/cpu-common.h" > +#include "exec/memory.h" > +#include "exec/ramblock.h" > +#include "hw/boards.h" > +#include "hw/hyperv/dynmem-proto.h" > +#include "hw/hyperv/vmbus.h" > +#include "hw/mem/hapvdimm.h" > +#include "hw/mem/pc-dimm.h" > +#include "hw/qdev-core.h" > +#include "hw/qdev-properties.h" > +#include "monitor/qdev.h" > +#include "qapi/error.h" > +#include "qapi/qapi-commands-machine.h" > +#include "qapi/qapi-events-machine.h" > +#include "qapi/qmp/qdict.h" > +#include "qemu/error-report.h" > +#include "qemu/module.h" > +#include "qemu/units.h" > +#include "qemu/timer.h" > +#include "sysemu/balloon.h" > +#include "sysemu/reset.h" > +#include "trace.h" > + > +/* > + * temporarily avoid warnings about enhanced GTree API usage requiring a > + * too recent Glib version until GLIB_VERSION_MAX_ALLOWED finally reaches > + * the Glib version with this API > + */ > +#pragma GCC diagnostic ignored "-Wdeprecated-declarations" > + > +#define TYPE_HV_BALLOON "hv-balloon" > +#define HV_BALLOON_GUID "525074DC-8985-46e2-8057-A307DC18A502" > +#define HV_BALLOON_PFN_SHIFT 12 > +#define HV_BALLOON_PAGE_SIZE (1 << HV_BALLOON_PFN_SHIFT) > + > +/* > + * Some Windows versions (at least Server 2019) will crash with various > + * error codes when receiving DM protocol requests (at least > + * DM_MEM_HOT_ADD_REQUEST) immediately after boot. 
> + * > + * It looks like Hyper-V from Server 2016 uses a 50-second after-boot > + * delay, probably to workaround this issue, so we'll use this value, too. > + */ > +#define HV_BALLOON_POST_INIT_WAIT (50 * 1000) > + > +#define HV_BALLOON_HA_CHUNK_SIZE (2 * GiB) > +#define HV_BALLOON_HA_CHUNK_PAGES (HV_BALLOON_HA_CHUNK_SIZE / > HV_BALLOON_PAGE_SIZE) > + > +#define HV_BALLOON_HR_CHUNK_PAGES 585728 > +/* > + * ^ that's the maximum number of pages > + * that Windows returns in one hot remove response > + * > + * If the number requested is too high Windows will no longer honor > + * these requests > + */ > + > +typedef enum State { > + /* not a real state */ > + S_NO_CHANGE = 0, > + > + S_WAIT_RESET, > + S_CLOSED, > + S_VERSION, > + S_CAPS, > + S_POST_INIT_WAIT, > + S_IDLE, > + S_HOT_ADD_RB_WAIT, > + S_HOT_ADD_POSTING, > + S_HOT_ADD_REPLY_WAIT, > + S_HOT_ADD_SKIP_CURRENT, > + S_HOT_ADD_PROCESSED_CLEAR_PENDING, > + S_HOT_ADD_PROCESSED_NEXT, > + S_HOT_REMOVE, > + S_BALLOON_POSTING, > + S_BALLOON_RB_WAIT, > + S_BALLOON_REPLY_WAIT, > + S_UNBALLOON_POSTING, > + S_UNBALLOON_RB_WAIT, > + S_UNBALLOON_REPLY_WAIT, > +} State; > + > +typedef struct StateDesc { > + State state; > + const char *desc; > +} StateDesc; > + > +typedef struct PageRange { > + uint64_t start; > + uint64_t count; > +} PageRange; > + > +/* type safety */ > +typedef struct PageRangeTree { > + GTree *t; > +} PageRangeTree; > + > +typedef struct HAPVDIMMRange { > + HAPVDIMMDevice *hapvdimm; > + > + PageRange range; > + uint64_t used; > + > + /* > + * Pages not currently usable due to guest alignment reqs or > + * not hot added in the first place > + */ > + uint64_t unused_head, unused_tail; > + > + /* Memory removed from the guest backed by this HAPVDIMM */ > + PageRangeTree removed_guest, removed_both; > +} HAPVDIMMRange; > + > +/* type safety */ > +typedef struct HAPVDIMMRangeTree { > + GTree *t; > +} HAPVDIMMRangeTree; > + > +typedef struct HvBalloon { > + VMBusDevice parent; > + State state; > + bool status_reports; > + > + union dm_version version; > + union dm_caps caps; > + > + QEMUTimer post_init_timer; > + guint del_todo_process_timer; > + > + unsigned int trans_id; > + > + /* Guest target size */ > + uint64_t target; > + bool target_changed; > + uint64_t target_diff; > + > + /* > + * All HAPVDIMMs under control of this driver > + * (but excluding the ones in hapvdimms_del_todo) > + */ > + HAPVDIMMRangeTree hapvdimms; > + > + /* Non-HAPVDIMM removed memory */ > + PageRangeTree removed_guest, removed_both; > + > + /* Grand totals of removed memory (both HAPVDIMM and non-HAPVDIMM) */ > + uint64_t removed_guest_ctr, removed_both_ctr; > + > + /* HAPVDIMMs waiting to be added during current connection */ > + GSList *ha_todo; > + uint64_t ha_current_count; > + > + /* HAPVDIMMs waiting to be deleted, not in any of the above structures */ > + GSList *hapvdimms_del_todo; > +} HvBalloon; > + > +#define HV_BALLOON(obj) OBJECT_CHECK(HvBalloon, (obj), TYPE_HV_BALLOON) > + > +#define HV_BALLOON_SET_STATE(hvb, news) \ > + do { \ > + assert(news != S_NO_CHANGE); \ > + hv_balloon_state_set(hvb, news, # news); \ > + } while (0) > + > +#define HV_BALLOON_STATE_DESC_SET(stdesc, news) \ > + _hv_balloon_state_desc_set(stdesc, news, # news) > + > +#define HV_BALLOON_STATE_DESC_INIT \ > + { \ > + .state = S_NO_CHANGE, \ > + } > + > +#define SUM_OVERFLOW_U64(in1, in2) ((in1) > UINT64_MAX - (in2)) > +#define SUM_SATURATE_U64(in1, in2) \ > + ({ \ > + uint64_t _in1 = (in1), _in2 = (in2); \ > + uint64_t _result; \ > + \ > + if (!SUM_OVERFLOW_U64(_in1, 
_in2)) { \ > + _result = _in1 + _in2; \ > + } else { \ > + _result = UINT64_MAX; \ > + } \ > + \ > + _result; \ > + }) > + > +typedef struct HvBalloonReq { > + VMBusChanReq vmreq; > +} HvBalloonReq; > + > +/* PageRange */ > +static void page_range_intersect(const PageRange *range, > + uint64_t start, uint64_t count, > + PageRange *out) > +{ > + uint64_t end1 = range->start + range->count; > + uint64_t end2 = start + count; > + uint64_t end = MIN(end1, end2); > + > + out->start = MAX(range->start, start); > + out->count = out->start < end ? end - out->start : 0; > +} > + > +static uint64_t page_range_intersection_size(const PageRange *range, > + uint64_t start, uint64_t count) > +{ > + PageRange trange; > + > + page_range_intersect(range, start, count, &trange); > + return trange.count; > +} > + > +/* return just the part of range before (start) */ > +static void page_range_part_before(const PageRange *range, > + uint64_t start, PageRange *out) > +{ > + uint64_t endr = range->start + range->count; > + uint64_t end = MIN(endr, start); > + > + out->start = range->start; > + if (end > out->start) { > + out->count = end - out->start; > + } else { > + out->count = 0; > + } > +} > + > +/* return just the part of range after (start, count) */ > +static void page_range_part_after(const PageRange *range, > + uint64_t start, uint64_t count, > + PageRange *out) > +{ > + uint64_t end = range->start + range->count; > + uint64_t ends = start + count; > + > + out->start = MAX(range->start, ends); > + if (end > out->start) { > + out->count = end - out->start; > + } else { > + out->count = 0; > + } > +} > + > +static bool page_range_joinable_left(const PageRange *range, > + uint64_t start, uint64_t count) > +{ > + return start + count == range->start; > +} > + > +static bool page_range_joinable_right(const PageRange *range, > + uint64_t start, uint64_t count) > +{ > + return range->start + range->count == start; > +} > + > +static bool page_range_joinable(const PageRange *range, > + uint64_t start, uint64_t count) > +{ > + return page_range_joinable_left(range, start, count) || > + page_range_joinable_right(range, start, count); > +} > + > +/* PageRangeTree */ > +static gint page_range_tree_key_compare(gconstpointer leftp, > + gconstpointer rightp, > + gpointer user_data) > +{ > + const uint64_t *left = leftp, *right = rightp; > + > + if (*left < *right) { > + return -1; > + } else if (*left > *right) { > + return 1; > + } else { /* *left == *right */ > + return 0; > + } > +} > + > +static GTreeNode *page_range_tree_insert_new(PageRangeTree tree, > + uint64_t start, uint64_t count) > +{ > + uint64_t *key = g_malloc(sizeof(*key)); > + PageRange *range = g_malloc(sizeof(*range)); > + > + assert(count > 0); > + > + *key = range->start = start; > + range->count = count; > + > + return g_tree_insert_node(tree.t, key, range); > +} > + > +static void page_range_tree_insert(PageRangeTree tree, > + uint64_t start, uint64_t count, > + uint64_t *dupcount) > +{ > + GTreeNode *node; > + bool joinable; > + uint64_t intersection; > + PageRange *range; > + > + assert(!SUM_OVERFLOW_U64(start, count)); > + if (count == 0) { > + return; > + } > + > + node = g_tree_upper_bound(tree.t, &start); > + if (node) { > + node = g_tree_node_previous(node); > + } else { > + node = g_tree_node_last(tree.t); > + } > + > + if (node) { > + range = g_tree_node_value(node); > + assert(range); > + intersection = page_range_intersection_size(range, start, count); > + joinable = page_range_joinable_right(range, start, count); > + } > + > + if 
(!node || > + (!intersection && !joinable)) { > + /* > + * !node case: the tree is empty or the very first node in the tree > + * already has a higher key (the start of its range). > + * the other case: there is a gap in the tree between the new range > + * and the previous one. > + * anyway, let's just insert the new range into the tree. > + */ > + node = page_range_tree_insert_new(tree, start, count); > + assert(node); > + range = g_tree_node_value(node); > + assert(range); > + } else { > + /* > + * the previous range in the tree either partially covers the new > + * range or ends just at its beginning - extend it > + */ > + if (dupcount) { > + *dupcount += intersection; > + } > + > + count += start - range->start; > + range->count = MAX(range->count, count); > + } > + > + /* check next nodes for possible merging */ > + for (node = g_tree_node_next(node); node; ) { > + PageRange *rangecur; > + > + rangecur = g_tree_node_value(node); > + assert(rangecur); > + > + intersection = page_range_intersection_size(rangecur, > + range->start, > range->count); > + joinable = page_range_joinable_left(rangecur, > + range->start, range->count); > + if (!intersection && !joinable) { > + /* the current node is disjoint */ > + break; > + } > + > + if (dupcount) { > + *dupcount += intersection; > + } > + > + count = rangecur->count + (rangecur->start - range->start); > + range->count = MAX(range->count, count); > + > + /* the current node was merged in, remove it */ > + start = rangecur->start; > + node = g_tree_node_next(node); > + /* no hinted removal in GTree... */ > + g_tree_remove(tree.t, &start); > + } > +} > + > +static bool page_range_tree_pop(PageRangeTree tree, PageRange *out, > + uint64_t maxcount) > +{ > + GTreeNode *node; > + PageRange *range; > + > + node = g_tree_node_last(tree.t); > + if (!node) { > + return false; > + } > + > + range = g_tree_node_value(node); > + assert(range); > + > + out->start = range->start; > + > + /* can't modify range->start as it is the node key */ > + if (range->count > maxcount) { > + out->start += range->count - maxcount; > + out->count = maxcount; > + range->count -= maxcount; > + } else { > + out->count = range->count; > + /* no hinted removal in GTree... */ > + g_tree_remove(tree.t, &out->start); > + } > + > + return true; > +} > + > +static bool page_range_tree_intree_any(PageRangeTree tree, > + uint64_t start, uint64_t count) > +{ > + GTreeNode *node; > + > + if (count == 0) { > + return false; > + } > + > + /* find the first node that can possibly intersect our range */ > + node = g_tree_upper_bound(tree.t, &start); > + if (node) { > + /* > + * a NULL node below means that the very first node in the tree > + * already has a higher key (the start of its range). 
> + */ > + node = g_tree_node_previous(node); > + } else { > + /* a NULL node below means that the tree is empty */ > + node = g_tree_node_last(tree.t); > + } > + /* node range start <= range start */ > + > + if (!node) { > + /* node range start > range start */ > + node = g_tree_node_first(tree.t); > + } > + > + for ( ; node; node = g_tree_node_next(node)) { > + PageRange *range = g_tree_node_value(node); > + > + assert(range); > + /* > + * if this node starts beyond or at the end of our range so does > + * every next one > + */ > + if (range->start >= start + count) { > + break; > + } > + > + if (page_range_intersection_size(range, start, count) > 0) { > + return true; > + } > + } > + > + return false; > +} > + > +static PageRangeTree page_range_tree_new(void) > +{ > + PageRangeTree tree; > + > + tree.t = g_tree_new_full(page_range_tree_key_compare, NULL, > + g_free, g_free); > + return tree; > +} > + > +static void page_range_tree_destroy(PageRangeTree *tree) > +{ > + /* g_tree_destroy() is not NULL-safe */ > + if (!tree->t) { > + return; > + } > + > + g_tree_destroy(tree->t); > + tree->t = NULL; > +} > + > +/* HAPVDIMMDevice */ > +static uint64_t hapvdimm_get_addr(HAPVDIMMDevice *hapvdimm) > +{ > + return object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_ADDR_PROP, > + &error_abort) / HV_BALLOON_PAGE_SIZE; > +} > + > +static uint64_t hapvdimm_get_size(HAPVDIMMDevice *hapvdimm) > +{ > + return object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_SIZE_PROP, > + &error_abort) / HV_BALLOON_PAGE_SIZE; > +} > + > +static void hapvdimm_get_range(HAPVDIMMDevice *hapvdimm, PageRange *out) > +{ > + out->start = hapvdimm_get_addr(hapvdimm); > + assert(out->start > 0); > + > + out->count = hapvdimm_get_size(hapvdimm); > + assert(out->count > 0); > +} > + > +static HostMemoryBackend *hapvdimm_get_memdev(HAPVDIMMDevice *hapvdimm) > +{ > + Object *memdev_obj; > + > + memdev_obj = object_property_get_link(OBJECT(hapvdimm), > + HAPVDIMM_MEMDEV_PROP, > + &error_abort); > + return MEMORY_BACKEND(memdev_obj); > +} > + > +/* HAPVDIMMRange */ > +static HAPVDIMMRange *hapvdimm_range_new(HAPVDIMMDevice *hapvdimm) > +{ > + HAPVDIMMRange *hpr = g_malloc(sizeof(*hpr)); > + > + hpr->hapvdimm = HAPVDIMM(object_ref(hapvdimm)); > + hapvdimm_get_range(hapvdimm, &hpr->range); > + > + hpr->removed_guest = page_range_tree_new(); > + hpr->removed_both = page_range_tree_new(); > + > + /* mark the whole range as unused */ > + hpr->used = 0; > + hpr->unused_head = hpr->range.count; > + hpr->unused_tail = 0; > + > + return hpr; > +} > + > +static void hapvdimm_range_free(HAPVDIMMRange *hpr) > +{ > + g_autoptr(HAPVDIMMDevice) hapvdimm = g_steal_pointer(&hpr->hapvdimm); > + > + page_range_tree_destroy(&hpr->removed_guest); > + page_range_tree_destroy(&hpr->removed_both); > + > + g_free(hpr); > +} > + > +/* the hapvdimm range reduced by unused head and tail */ > +static void hapvdimm_range_get_effective_range(HAPVDIMMRange *hpr, > + PageRange *out) > +{ > + out->start = hpr->range.start + hpr->unused_head; > + out->count = hpr->range.count - hpr->unused_head - hpr->unused_tail; > +} > + > +/* HAPVDIMMRangeTree */ > +static gint hapvdimm_tree_key_compare(gconstpointer leftp, gconstpointer > rightp, > + gpointer user_data) > +{ > + /* > + * hapvdimm tree is also keyed on page range start, so we can simply > reuse > + * the comparison function from the page range tree > + */ > + return page_range_tree_key_compare(leftp, rightp, user_data); > +} > + > +static HAPVDIMMRange *hapvdimm_tree_insert_new(HvBalloon *balloon, > + 
HAPVDIMMDevice *hapvdimm) > +{ > + HAPVDIMMRange *hpr; > + uint64_t *key; > + > + hpr = hapvdimm_range_new(hapvdimm); > + > + key = g_malloc(sizeof(*key)); > + *key = hpr->range.start; > + > + g_tree_insert(balloon->hapvdimms.t, key, hpr); > + > + return hpr; > +} > + > +/* The HAPVDIMM must not be on the ha_todo list since it's going to get > unref'ed. */ > +static void hapvdimm_tree_remove(HvBalloon *balloon, HAPVDIMMDevice > *hapvdimm) > +{ > + uint64_t addr; > + > + addr = hapvdimm_get_addr(hapvdimm); > + assert(addr > 0); > + > + g_tree_remove(balloon->hapvdimms.t, &addr); > +} > + > +/* total RAM includes memory currently removed from the guest */ > +static gboolean hapvdimm_tree_total_ram_node(gpointer key, > + gpointer value, > + gpointer data) > +{ > + HAPVDIMMRange *hpr = value; > + uint64_t *size = data; > + PageRange rangeeff; > + > + hapvdimm_range_get_effective_range(hpr, &rangeeff); > + *size += rangeeff.count; > + > + return false; > +} > + > +static uint64_t hapvdimm_tree_total_ram(HvBalloon *balloon) > +{ > + uint64_t size = 0; > + > + g_tree_foreach(balloon->hapvdimms.t, hapvdimm_tree_total_ram_node, > &size); > + return size; > +} > + > +static void hapvdimm_tree_value_free(gpointer data) > +{ > + HAPVDIMMRange *hpr = data; > + > + hapvdimm_range_free(hpr); > +} > + > +static HAPVDIMMRangeTree hapvdimm_tree_new(void) > +{ > + HAPVDIMMRangeTree tree; > + > + tree.t = g_tree_new_full(hapvdimm_tree_key_compare, NULL, g_free, > + hapvdimm_tree_value_free); > + return tree; > +} > + > +static void hapvdimm_tree_destroy(HAPVDIMMRangeTree *tree) > +{ > + /* g_tree_destroy() is not NULL-safe */ > + if (!tree->t) { > + return; > + } > + > + g_tree_destroy(tree->t); > + tree->t = NULL; > +} > + > +static gboolean ha_todo_add_all_node(gpointer key, > + gpointer value, > + gpointer data) > +{ > + HAPVDIMMRange *hpr = value; > + HvBalloon *balloon = data; > + > + /* assume the hpr is fresh */ > + assert(hpr->used == 0); > + assert(hpr->unused_head == hpr->range.count); > + assert(hpr->unused_tail == 0); > + > + balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr); > + > + return false; > +} > + > +static void ha_todo_add_all(HvBalloon *balloon) > +{ > + assert(balloon->ha_todo == NULL); > + g_tree_foreach(balloon->hapvdimms.t, ha_todo_add_all_node, balloon); > +} > + > +static void ha_todo_clear(HvBalloon *balloon) > +{ > + g_slist_free(g_steal_pointer(&balloon->ha_todo)); > +} > + > +/* TODO: unify the code below with virtio-balloon and cache the value */ > +static int build_dimm_list(Object *obj, void *opaque) > +{ > + GSList **list = opaque; > + > + if (object_dynamic_cast(obj, TYPE_PC_DIMM)) { > + DeviceState *dev = DEVICE(obj); > + if (dev->realized) { /* only realized DIMMs matter */ > + *list = g_slist_prepend(*list, dev); > + } > + } > + > + object_child_foreach(obj, build_dimm_list, opaque); > + return 0; > +} > + > +static ram_addr_t get_current_ram_size(void) > +{ > + GSList *list = NULL, *item; > + ram_addr_t size = current_machine->ram_size; > + > + build_dimm_list(qdev_get_machine(), &list); > + for (item = list; item; item = g_slist_next(item)) { > + Object *obj = OBJECT(item->data); > + if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) > + size += object_property_get_int(obj, PC_DIMM_SIZE_PROP, > + &error_abort); > + } > + g_slist_free(list); > + > + return size; > +} > + > +/* total RAM includes memory currently removed from the guest */ > +static uint64_t hv_balloon_total_ram(HvBalloon *balloon) > +{ > + ram_addr_t ram_size = get_current_ram_size(); > + 
uint64_t ram_size_pages = ram_size >> HV_BALLOON_PFN_SHIFT; > + uint64_t hapvdimm_size_pages = hapvdimm_tree_total_ram(balloon); > + > + assert(ram_size_pages > 0); > + > + return SUM_SATURATE_U64(ram_size_pages, hapvdimm_size_pages); > +} > + > +/* > + * calculating the total RAM size is a slow operation, > + * avoid it as much as possible > + */ > +static uint64_t hv_balloon_total_removed_rs(HvBalloon *balloon, > + uint64_t ram_size_pages) > +{ > + uint64_t total_removed; > + > + total_removed = SUM_SATURATE_U64(balloon->removed_guest_ctr, > + balloon->removed_both_ctr); > + > + /* possible if guest returns pages outside actual RAM */ > + if (total_removed > ram_size_pages) { > + total_removed = ram_size_pages; > + } > + > + return total_removed; > +} > + > +static bool hv_balloon_state_is_init(HvBalloon *balloon) > +{ > + return balloon->state == S_WAIT_RESET || > + balloon->state == S_CLOSED || > + balloon->state == S_VERSION || > + balloon->state == S_CAPS; > +} > + > +/* Returns whether the state has actually changed */ > +static bool hv_balloon_state_set(HvBalloon *balloon, > + State newst, const char *newststr) > +{ > + if (newst == S_NO_CHANGE || balloon->state == newst) { > + return false; > + } > + > + balloon->state = newst; > + trace_hv_balloon_state_change(newststr); > + return true; > +} > + > +static void _hv_balloon_state_desc_set(StateDesc *stdesc, > + State newst, const char *newststr) > +{ > + /* state setting is only permitted on a freshly init desc */ > + assert(stdesc->state == S_NO_CHANGE); > + > + assert(newst != S_NO_CHANGE); > + > + stdesc->state = newst; > + stdesc->desc = newststr; > +} > + > +static void del_todo_process(HvBalloon *balloon) > +{ > + while (balloon->hapvdimms_del_todo) { > + HAPVDIMMDevice *hapvdimm = balloon->hapvdimms_del_todo->data; > + HostMemoryBackend *backend; > + const char *backend_id; > + > + backend = hapvdimm_get_memdev(hapvdimm); > + backend_id = object_get_canonical_path_component(OBJECT(backend)); > + > + object_unparent(OBJECT(hapvdimm)); > + object_unref(OBJECT(hapvdimm)); > + qapi_event_send_hv_balloon_memory_backend_unused(backend_id); > + > + balloon->hapvdimms_del_todo = > + g_slist_remove(balloon->hapvdimms_del_todo, hapvdimm); > + } > + > + if (balloon->del_todo_process_timer) { > + g_source_remove(balloon->del_todo_process_timer); > + balloon->del_todo_process_timer = 0; > + } > +} > + > +static gboolean del_todo_process_timer(gpointer user_data) > +{ > + HvBalloon *balloon = user_data; > + > + balloon->del_todo_process_timer = 0; > + > + del_todo_process(balloon); > + > + return G_SOURCE_REMOVE; > +} > + > +static void del_todo_append(HvBalloon *balloon, > + HAPVDIMMDevice *hapvdimm) > +{ > + balloon->hapvdimms_del_todo = g_slist_append(balloon->hapvdimms_del_todo, > + object_ref(hapvdimm)); > +} > + > +static void del_todo_add(HvBalloon *balloon, > + HAPVDIMMDevice *hapvdimm) > +{ > + hapvdimm_tree_remove(balloon, hapvdimm); > + del_todo_append(balloon, hapvdimm); > +} > + > +static gboolean del_todo_add_all_node(gpointer key, > + gpointer value, > + gpointer data) > +{ > + HAPVDIMMRange *hpr = value; > + HvBalloon *balloon = data; > + > + del_todo_append(balloon, hpr->hapvdimm); > + > + return false; > +} > + > +static void del_todo_add_all(HvBalloon *balloon) > +{ > + g_tree_foreach(balloon->hapvdimms.t, del_todo_add_all_node, balloon); > + hapvdimm_tree_destroy(&balloon->hapvdimms); > + > + balloon->hapvdimms = hapvdimm_tree_new(); > +} > + > +static void del_todo_add_all_from_ha_todo(HvBalloon *balloon) > +{ > + 
while (balloon->ha_todo) { > + HAPVDIMMRange *hpr = balloon->ha_todo->data; > + > + del_todo_add(balloon, hpr->hapvdimm); > + balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr); > + } > +} > + > +static VMBusChannel *hv_balloon_get_channel_maybe(HvBalloon *balloon) > +{ > + return vmbus_device_channel(&balloon->parent, 0); > +} > + > +static VMBusChannel *hv_balloon_get_channel(HvBalloon *balloon) > +{ > + VMBusChannel *chan; > + > + chan = hv_balloon_get_channel_maybe(balloon); > + assert(chan != NULL); > + return chan; > +} > + > +static ssize_t hv_balloon_send_packet(VMBusChannel *chan, > + struct dm_message *msg) > +{ > + int ret; > + > + ret = vmbus_channel_reserve(chan, 0, msg->hdr.size); > + if (ret < 0) { > + return ret; > + } > + > + return vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND, > + NULL, 0, msg, msg->hdr.size, false, > + msg->hdr.trans_id); > +} > + > +static bool hv_balloon_unballoon_get_source(HvBalloon *balloon, > + PageRangeTree *dtree, > + uint64_t **dctr, > + HAPVDIMMRange **hpr) > +{ > + /* Try the boot memory first */ > + if (g_tree_nnodes(balloon->removed_guest.t) > 0) { > + *dtree = balloon->removed_guest; > + *dctr = &balloon->removed_guest_ctr; > + *hpr = NULL; > + } else if (g_tree_nnodes(balloon->removed_both.t) > 0) { > + *dtree = balloon->removed_both; > + *dctr = &balloon->removed_both_ctr; > + *hpr = NULL; > + } else { > + GTreeNode *node; > + > + for (node = g_tree_node_first(balloon->hapvdimms.t); node; > + node = g_tree_node_next(node)) { > + HAPVDIMMRange *hprnode = g_tree_node_value(node); > + > + assert(hprnode); > + if (g_tree_nnodes(hprnode->removed_guest.t) > 0) { > + *dtree = hprnode->removed_guest; > + *dctr = &balloon->removed_guest_ctr; > + *hpr = hprnode; > + break; > + } else if (g_tree_nnodes(hprnode->removed_both.t) > 0) { > + *dtree = hprnode->removed_both; > + *dctr = &balloon->removed_both_ctr; > + *hpr = hprnode; > + break; > + } > + } > + > + if (!node) { > + return false; > + } > + } > + > + return true; > +} > + > +static void hv_balloon_balloon_unballoon_start(HvBalloon *balloon, > + uint64_t ram_size_pages, > + StateDesc *stdesc) > +{ > + uint64_t total_removed = hv_balloon_total_removed_rs(balloon, > + ram_size_pages); > + > + assert(balloon->state == S_IDLE); > + assert(ram_size_pages > 0); > + > + /* > + * we need to cache the value when starting the (un)balloon procedure > + * in case somebody changes the balloon target when the procedure is > + * in progress > + */ > + if (balloon->target < ram_size_pages - total_removed) { > + balloon->target_diff = ram_size_pages - total_removed - > balloon->target; > + HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_RB_WAIT); > + } else { > + balloon->target_diff = balloon->target - > + (ram_size_pages - total_removed); > + > + /* > + * careful here, the user might have set the balloon target > + * above the RAM size, so above the total removed count > + */ > + balloon->target_diff = MIN(balloon->target_diff, total_removed); > + HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_RB_WAIT); > + } > + > + balloon->target_changed = false; > +} > + > +static void hv_balloon_unballoon_rb_wait(HvBalloon *balloon, StateDesc > *stdesc) > +{ > + VMBusChannel *chan = hv_balloon_get_channel(balloon); > + struct dm_unballoon_request *ur; > + size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]); > + > + assert(balloon->state == S_UNBALLOON_RB_WAIT); > + > + if (vmbus_channel_reserve(chan, 0, ur_size) < 0) { > + return; > + } > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_POSTING); > +} > 
+ > +static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc > *stdesc) > +{ > + VMBusChannel *chan = hv_balloon_get_channel(balloon); > + PageRangeTree dtree; > + uint64_t *dctr; > + HAPVDIMMRange *hpr; > + struct dm_unballoon_request *ur; > + size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]); > + PageRange range; > + bool bret; > + ssize_t ret; > + > + assert(balloon->state == S_UNBALLOON_POSTING); > + assert(balloon->target_diff > 0); > + > + if (!hv_balloon_unballoon_get_source(balloon, &dtree, &dctr, &hpr)) { > + error_report("trying to unballoon but nothing ballooned"); > + /* > + * there is little we can do as we might have already > + * sent the guest a partial request we can't cancel > + */ > + return; > + } > + > + assert(dtree.t); > + assert(dctr); > + > + ur = alloca(ur_size); > + memset(ur, 0, ur_size); > + ur->hdr.type = DM_UNBALLOON_REQUEST; > + ur->hdr.size = ur_size; > + ur->hdr.trans_id = balloon->trans_id; > + > + bret = page_range_tree_pop(dtree, &range, MIN(balloon->target_diff, > + > HV_BALLOON_HA_CHUNK_PAGES)); > + assert(bret); > + /* TODO: madvise? */ > + > + *dctr -= range.count; > + balloon->target_diff -= range.count; > + if (hpr) { > + hpr->used += range.count; > + } > + > + ur->range_count = 1; > + ur->range_array[0].finfo.start_page = range.start; > + ur->range_array[0].finfo.page_cnt = range.count; > + ur->more_pages = balloon->target_diff > 0; > + > + trace_hv_balloon_outgoing_unballoon(ur->hdr.trans_id, > + range.count, range.start, > + balloon->target_diff); > + > + if (ur->more_pages) { > + HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_RB_WAIT); > + } else { > + HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_REPLY_WAIT); > + } > + > + ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND, > + NULL, 0, ur, ur_size, false, > + ur->hdr.trans_id); > + if (ret <= 0) { > + error_report("error %zd when posting unballoon msg, expect problems", > + ret); > + } > +} > + > +static void hv_balloon_hot_add_start(HvBalloon *balloon, StateDesc *stdesc) > +{ > + HAPVDIMMRange *hpr; > + PageRange range; > + > + assert(balloon->state == S_IDLE); > + assert(balloon->ha_todo); > + > + hpr = balloon->ha_todo->data; > + > + range.start = QEMU_ALIGN_UP(hpr->range.start, > + (1 << > balloon->caps.cap_bits.hot_add_alignment) > + * (MiB / HV_BALLOON_PAGE_SIZE)); > + hpr->unused_head = range.start - hpr->range.start; > + if (hpr->unused_head >= hpr->range.count) { > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT); > + return; > + } > + > + range.count = hpr->range.count - hpr->unused_head; > + range.count = QEMU_ALIGN_DOWN(range.count, > + (1 << > balloon->caps.cap_bits.hot_add_alignment) > + * (MiB / HV_BALLOON_PAGE_SIZE)); > + if (range.count == 0) { > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT); > + return; > + } > + hpr->unused_tail = hpr->range.count - hpr->unused_head - range.count; > + hpr->used = 0; > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_RB_WAIT); > +} > + > +static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc) > +{ > + VMBusChannel *chan = hv_balloon_get_channel(balloon); > + struct dm_hot_add *ha; > + size_t ha_size = sizeof(*ha) + sizeof(ha->range); > + > + assert(balloon->state == S_HOT_ADD_RB_WAIT); > + > + if (vmbus_channel_reserve(chan, 0, ha_size) < 0) { > + return; > + } > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_POSTING); > +} > + > +static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc) > +{ > + VMBusChannel *chan = 
hv_balloon_get_channel(balloon); > + HAPVDIMMRange *hpr; > + struct dm_hot_add *ha; > + size_t ha_size = sizeof(*ha) + sizeof(ha->range); > + union dm_mem_page_range *ha_region; > + PageRange range; > + uint64_t chunk_max_size; > + ssize_t ret; > + > + assert(balloon->state == S_HOT_ADD_POSTING); > + assert(balloon->ha_todo); > + > + hpr = balloon->ha_todo->data; > + > + range.start = hpr->range.start + hpr->unused_head + hpr->used; > + range.count = hpr->range.count; > + range.count -= hpr->unused_head; > + range.count -= hpr->used; > + range.count -= hpr->unused_tail; > + > + chunk_max_size = MAX((1 << balloon->caps.cap_bits.hot_add_alignment) * > + (MiB / HV_BALLOON_PAGE_SIZE), > + HV_BALLOON_HA_CHUNK_PAGES); > + range.count = MIN(range.count, chunk_max_size); > + balloon->ha_current_count = range.count; > + > + ha = alloca(ha_size); > + ha_region = &(&ha->range)[1]; > + memset(ha, 0, ha_size); > + ha->hdr.type = DM_MEM_HOT_ADD_REQUEST; > + ha->hdr.size = ha_size; > + ha->hdr.trans_id = balloon->trans_id; > + > + ha->range.finfo.start_page = range.start; > + ha->range.finfo.page_cnt = range.count; > + ha_region->finfo.start_page = range.start; > + ha_region->finfo.page_cnt = ha->range.finfo.page_cnt; > + > + trace_hv_balloon_outgoing_hot_add(ha->hdr.trans_id, > + range.count, range.start); > + > + ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND, > + NULL, 0, ha, ha_size, false, > + ha->hdr.trans_id); > + if (ret <= 0) { > + error_report("error %zd when posting hot add msg, expect problems", > + ret); > + } > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_REPLY_WAIT); > +} > + > +static void hv_balloon_hot_add_finish(HvBalloon *balloon, StateDesc *stdesc) > +{ > + HAPVDIMMRange *hpr; > + > + assert(balloon->state == S_HOT_ADD_SKIP_CURRENT || > + balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING || > + balloon->state == S_HOT_ADD_PROCESSED_NEXT); > + assert(balloon->ha_todo); > + > + hpr = balloon->ha_todo->data; > + > + balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr); > + if (balloon->state == S_HOT_ADD_SKIP_CURRENT) { > + del_todo_add(balloon, hpr->hapvdimm); > + } else if (balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING) { > + del_todo_add_all_from_ha_todo(balloon); > + } > + > + /* let other things happen, too, between hot adds to be done */ > + HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE); > +} > + > +static void hv_balloon_balloon_rb_wait(HvBalloon *balloon, StateDesc *stdesc) > +{ > + VMBusChannel *chan = hv_balloon_get_channel(balloon); > + size_t bl_size = sizeof(struct dm_balloon); > + > + assert(balloon->state == S_BALLOON_RB_WAIT); > + > + if (vmbus_channel_reserve(chan, 0, bl_size) < 0) { > + return; > + } > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_POSTING); > +} > + > +static void hv_balloon_balloon_posting(HvBalloon *balloon, StateDesc *stdesc) > +{ > + VMBusChannel *chan = hv_balloon_get_channel(balloon); > + struct dm_balloon bl; > + size_t bl_size = sizeof(bl); > + ssize_t ret; > + > + assert(balloon->state == S_BALLOON_POSTING); > + assert(balloon->target_diff > 0); > + > + memset(&bl, 0, sizeof(bl)); > + bl.hdr.type = DM_BALLOON_REQUEST; > + bl.hdr.size = bl_size; > + bl.hdr.trans_id = balloon->trans_id; > + bl.num_pages = MIN(balloon->target_diff, HV_BALLOON_HR_CHUNK_PAGES); > + > + trace_hv_balloon_outgoing_balloon(bl.hdr.trans_id, bl.num_pages, > + balloon->target_diff); > + > + ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND, > + NULL, 0, &bl, bl_size, false, > + bl.hdr.trans_id); > + if (ret <= 0) { > + error_report("error 
%zd when posting balloon msg, expect problems", > + ret); > + } > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_REPLY_WAIT); > +} > + > +static void hv_balloon_idle_state(HvBalloon *balloon, > + StateDesc *stdesc) > +{ > + bool can_balloon = balloon->caps.cap_bits.balloon; > + bool want_unballoon = false; > + bool want_hot_add = balloon->ha_todo != NULL; > + bool want_balloon = false; > + uint64_t ram_size_pages; > + > + assert(balloon->state == S_IDLE); > + > + if (can_balloon && balloon->target_changed) { > + uint64_t total_removed; > + > + ram_size_pages = hv_balloon_total_ram(balloon); > + total_removed = hv_balloon_total_removed_rs(balloon, > + ram_size_pages); > + > + want_unballoon = total_removed > 0 && > + balloon->target > ram_size_pages - total_removed; > + want_balloon = balloon->target < ram_size_pages - total_removed; > + } > + > + /* > + * the order here is important, first we unballoon, then hot add, > + * then balloon (or hot remove) > + */ > + if (want_unballoon) { > + hv_balloon_balloon_unballoon_start(balloon, ram_size_pages, stdesc); > + } else if (want_hot_add) { > + hv_balloon_hot_add_start(balloon, stdesc); > + } else if (want_balloon) { > + hv_balloon_balloon_unballoon_start(balloon, ram_size_pages, stdesc); > + } > +} > + > +static const struct { > + void (*handler)(HvBalloon *balloon, StateDesc *stdesc); > +} state_handlers[] = { > + [S_IDLE].handler = hv_balloon_idle_state, > + [S_UNBALLOON_RB_WAIT].handler = hv_balloon_unballoon_rb_wait, > + [S_UNBALLOON_POSTING].handler = hv_balloon_unballoon_posting, > + [S_HOT_ADD_RB_WAIT].handler = hv_balloon_hot_add_rb_wait, > + [S_HOT_ADD_POSTING].handler = hv_balloon_hot_add_posting, > + [S_HOT_ADD_SKIP_CURRENT].handler = hv_balloon_hot_add_finish, > + [S_HOT_ADD_PROCESSED_CLEAR_PENDING].handler = hv_balloon_hot_add_finish, > + [S_HOT_ADD_PROCESSED_NEXT].handler = hv_balloon_hot_add_finish, > + [S_BALLOON_RB_WAIT].handler = hv_balloon_balloon_rb_wait, > + [S_BALLOON_POSTING].handler = hv_balloon_balloon_posting, > +}; > + > +static void hv_balloon_handle_state(HvBalloon *balloon, StateDesc *stdesc) > +{ > + if (!state_handlers[balloon->state].handler) { > + return; > + } > + > + state_handlers[balloon->state].handler(balloon, stdesc); > +} > + > +static void hv_balloon_remove_response_insert_range(PageRangeTree tree, > + const PageRange *range, > + uint64_t *ctr1, > + uint64_t *ctr2, > + uint64_t *ctr3) > +{ > + uint64_t dupcount, effcount; > + > + if (range->count == 0) { > + return; > + } > + > + dupcount = 0; > + page_range_tree_insert(tree, range->start, range->count, &dupcount); > + > + assert(dupcount <= range->count); > + effcount = range->count - dupcount; > + > + *ctr1 += effcount; > + *ctr2 += effcount; > + if (ctr3) { > + *ctr3 += effcount; > + } > +} > + > +static void hv_balloon_remove_response_handle_range(HvBalloon *balloon, > + PageRange *range, > + bool both, > + uint64_t *removedctr) > +{ > + GTreeNode *node; > + PageRangeTree globaltree = both ? balloon->removed_both : > + balloon->removed_guest; > + uint64_t *globalctr = both ? &balloon->removed_both_ctr : > + &balloon->removed_guest_ctr; > + > + if (range->count == 0) { > + return; > + } > + > + trace_hv_balloon_remove_response(range->count, range->start, both); > + > + /* find the first node that can possibly intersect our range */ > + node = g_tree_upper_bound(balloon->hapvdimms.t, &range->start); > + if (node) { > + /* > + * a NULL node below means that the very first node in the tree > + * already has a higher key (the start of its range). 
> + */ > + node = g_tree_node_previous(node); > + } else { > + /* a NULL node below means that the tree is empty */ > + node = g_tree_node_last(balloon->hapvdimms.t); > + } > + /* node range start <= range start */ > + > + if (!node) { > + /* node range start > range start */ > + node = g_tree_node_first(balloon->hapvdimms.t); > + } > + > + for ( ; node && range->count > 0; node = g_tree_node_next(node)) { > + HAPVDIMMRange *hpr = g_tree_node_value(node); > + PageRangeTree hprtree; > + PageRange rangeeff, rangehole, rangecommon; > + uint64_t hprremoved = 0; > + > + assert(hpr); > + hprtree = both ? hpr->removed_both : hpr->removed_guest; > + hapvdimm_range_get_effective_range(hpr, &rangeeff); > + > + /* > + * if this node starts beyond or at the end of the range so does > + * every next one > + */ > + if (rangeeff.start >= range->start + range->count) { > + break; > + } > + > + /* process the hole before the current hpr, if it exists */ > + page_range_part_before(range, rangeeff.start, &rangehole); > + hv_balloon_remove_response_insert_range(globaltree, &rangehole, > + globalctr, removedctr, NULL); > + if (rangehole.count > 0) { > + trace_hv_balloon_remove_response_hole(rangehole.count, > + rangehole.start, > + range->count, range->start, > + rangeeff.start, both); > + } > + > + /* > + * process the hpr part, can be empty for the very first node > processed > + * or due to difference between the nominal and effective hpr start > + */ > + page_range_intersect(range, rangeeff.start, rangeeff.count, > + &rangecommon); > + hv_balloon_remove_response_insert_range(hprtree, &rangecommon, > + globalctr, removedctr, > + &hprremoved); > + hpr->used -= hprremoved; > + if (rangecommon.count > 0) { > + trace_hv_balloon_remove_response_common(rangecommon.count, > + rangecommon.start, > + range->count, > range->start, > + rangeeff.count, > + rangeeff.start, > hprremoved, > + both); > + } > + > + /* calculate what's left after the current hpr */ > + rangecommon = *range; > + page_range_part_after(&rangecommon, rangeeff.start, rangeeff.count, > + range); > + } > + > + /* process the remainder of the range that lies outside of the hpr tree > */ > + if (range->count > 0) { > + hv_balloon_remove_response_insert_range(globaltree, range, > + globalctr, removedctr, NULL); > + trace_hv_balloon_remove_response_remainder(range->count, > range->start, > + both); > + range->count = 0; > + } > +} > + > +static void hv_balloon_remove_response_handle_pages(HvBalloon *balloon, > + PageRange *range, > + uint64_t start, > + uint64_t count, > + bool both, > + uint64_t *removedctr) > +{ > + assert(count > 0); > + > + /* > + * if there is an existing range that the new range can't be joined to > + * dump it into tree(s) > + */ > + if (range->count > 0 && !page_range_joinable(range, start, count)) { > + hv_balloon_remove_response_handle_range(balloon, range, both, > + removedctr); > + } > + > + if (range->count == 0) { > + range->start = start; > + range->count = count; > + } else if (page_range_joinable_left(range, start, count)) { > + range->start = start; > + range->count += count; > + } else { /* page_range_joinable_right() */ > + range->count += count; > + } > +} > + > +static gboolean hv_balloon_handle_remove_host_addr_node(gpointer key, > + gpointer value, > + gpointer data) > +{ > + PageRange *range = value; > + uint64_t pageoff; > + > + for (pageoff = 0; pageoff < range->count; ) { > + void *addr = (void *)((range->start + pageoff) * > HV_BALLOON_PAGE_SIZE); > + RAMBlock *rb; > + ram_addr_t rb_offset; > + size_t 
rb_page_size; > + size_t discard_size; > + > + rb = qemu_ram_block_from_host(addr, false, &rb_offset); > + rb_page_size = qemu_ram_pagesize(rb); > + > + if (rb_page_size != HV_BALLOON_PAGE_SIZE) { > + /* TODO: these should end in "removed_guest" */ > + warn_report("guest reported removed page backed by unsupported > page size %zu", > + rb_page_size); > + pageoff++; > + continue; > + } > + > + discard_size = MIN(range->count - pageoff, > + (rb->max_length - rb_offset) / > + HV_BALLOON_PAGE_SIZE); > + discard_size = MAX(discard_size, 1); > + > + if (ram_block_discard_range(rb, rb_offset, discard_size * > + HV_BALLOON_PAGE_SIZE) != 0) { > + warn_report("guest reported removed page failed discard"); > + } > + > + pageoff += discard_size; > + } > + > + return false; > +} > + > +static void hv_balloon_handle_remove_host_addr_tree(PageRangeTree tree) > +{ > + g_tree_foreach(tree.t, hv_balloon_handle_remove_host_addr_node, NULL); > +} > + > +static int hv_balloon_handle_remove_section(PageRangeTree tree, > + const MemoryRegionSection > *section, > + uint64_t count) > +{ > + void *addr = memory_region_get_ram_ptr(section->mr) + > + section->offset_within_region; > + uint64_t addr_page; > + > + assert(count > 0); > + > + if ((uintptr_t)addr % HV_BALLOON_PAGE_SIZE) { > + warn_report("guest reported removed pages at an unaligned host addr > %p", > + addr); > + return -EINVAL; > + } > + > + addr_page = (uintptr_t)addr / HV_BALLOON_PAGE_SIZE; > + page_range_tree_insert(tree, addr_page, count, NULL); > + > + return 0; > +} > + > +static void hv_balloon_handle_remove_ranges(HvBalloon *balloon, > + union dm_mem_page_range ranges[], > + uint32_t count) > +{ > + uint64_t removedcnt; > + PageRangeTree removed_host_addr; > + PageRange range_guest, range_both; > + > + removed_host_addr = page_range_tree_new(); > + range_guest.count = range_both.count = removedcnt = 0; > + for (unsigned int ctr = 0; ctr < count; ctr++) { > + union dm_mem_page_range *mr = &ranges[ctr]; > + hwaddr pa; > + MemoryRegionSection section; > + > + for (unsigned int offset = 0; offset < mr->finfo.page_cnt; ) { > + int ret; > + uint64_t pageno = mr->finfo.start_page + offset; > + uint64_t pagecnt = 1; > + > + pa = (hwaddr)pageno << HV_BALLOON_PFN_SHIFT; > + section = memory_region_find(get_system_memory(), pa, > + (mr->finfo.page_cnt - offset) * > + HV_BALLOON_PAGE_SIZE); > + if (!section.mr) { > + warn_report("guest reported removed page %"PRIu64" not found > in RAM", > + pageno); > + ret = -EINVAL; > + goto finish_page; > + } > + > + pagecnt = section.size / HV_BALLOON_PAGE_SIZE; > + if (pagecnt <= 0) { > + warn_report("guest reported removed page %"PRIu64" in a > section smaller than page size", > + pageno); > + pagecnt = 1; /* skip the whole page */ > + ret = -EINVAL; > + goto finish_page; > + } > + > + if (!memory_region_is_ram(section.mr) || > + memory_region_is_rom(section.mr) || > + memory_region_is_romd(section.mr)) { > + warn_report("guest reported removed page %"PRIu64" in a > section that is not an ordinary RAM", > + pageno); > + ret = -EINVAL; > + goto finish_page; > + } > + > + ret = hv_balloon_handle_remove_section(removed_host_addr, > §ion, > + pagecnt); > + > + finish_page: > + if (ret == 0) { > + hv_balloon_remove_response_handle_pages(balloon, > + &range_both, > + pageno, pagecnt, > + true, &removedcnt); > + } else { > + hv_balloon_remove_response_handle_pages(balloon, > + &range_guest, > + pageno, pagecnt, > + false, &removedcnt); > + } > + > + if (section.mr) { > + memory_region_unref(section.mr); > + } > + > + offset 
+= pagecnt; > + } > + } > + > + hv_balloon_remove_response_handle_range(balloon, &range_both, true, > + &removedcnt); > + hv_balloon_remove_response_handle_range(balloon, &range_guest, false, > + &removedcnt); > + > + hv_balloon_handle_remove_host_addr_tree(removed_host_addr); > + page_range_tree_destroy(&removed_host_addr); > + > + if (removedcnt > balloon->target_diff) { > + warn_report("guest reported more pages removed than currently > pending (%"PRIu64" vs %"PRIu64")", > + removedcnt, balloon->target_diff); > + balloon->target_diff = 0; > + } else { > + balloon->target_diff -= removedcnt; > + } > +} > + > +static bool hv_balloon_handle_msg_size(HvBalloonReq *req, size_t minsize, > + const char *msgname) > +{ > + VMBusChanReq *vmreq = &req->vmreq; > + uint32_t msglen = vmreq->msglen; > + > + if (msglen >= minsize) { > + return true; > + } > + > + warn_report("%s message too short (%u vs %zu), ignoring", msgname, > + (unsigned int)msglen, minsize); > + return false; > +} > + > +static void hv_balloon_handle_version_request(HvBalloon *balloon, > + HvBalloonReq *req, > + StateDesc *stdesc) > +{ > + VMBusChanReq *vmreq = &req->vmreq; > + struct dm_version_request *msgVr = vmreq->msg; > + struct dm_version_response respVr; > + > + if (balloon->state != S_VERSION) { > + warn_report("unexpected DM_VERSION_REQUEST in %d state", > + balloon->state); > + return; > + } > + > + if (!hv_balloon_handle_msg_size(req, sizeof(*msgVr), > + "DM_VERSION_REQUEST")) { > + return; > + } > + > + trace_hv_balloon_incoming_version(msgVr->version.major_version, > + msgVr->version.minor_version); > + > + memset(&respVr, 0, sizeof(respVr)); > + respVr.hdr.type = DM_VERSION_RESPONSE; > + respVr.hdr.size = sizeof(respVr); > + respVr.hdr.trans_id = msgVr->hdr.trans_id; > + respVr.is_accepted = msgVr->version.version >= DYNMEM_PROTOCOL_VERSION_1 > && > + msgVr->version.version <= DYNMEM_PROTOCOL_VERSION_3; > + > + hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respVr); > + > + if (respVr.is_accepted) { > + HV_BALLOON_STATE_DESC_SET(stdesc, S_CAPS); > + } > +} > + > +static void hv_balloon_handle_caps_report(HvBalloon *balloon, > + HvBalloonReq *req, > + StateDesc *stdesc) > +{ > + VMBusChanReq *vmreq = &req->vmreq; > + struct dm_capabilities *msgCap = vmreq->msg; > + struct dm_capabilities_resp_msg respCap; > + > + if (balloon->state != S_CAPS) { > + warn_report("unexpected DM_CAPABILITIES_REPORT in %d state", > + balloon->state); > + return; > + } > + > + if (!hv_balloon_handle_msg_size(req, sizeof(*msgCap), > + "DM_CAPABILITIES_REPORT")) { > + return; > + } > + > + trace_hv_balloon_incoming_caps(msgCap->caps.caps); > + balloon->caps = msgCap->caps; > + > + memset(&respCap, 0, sizeof(respCap)); > + respCap.hdr.type = DM_CAPABILITIES_RESPONSE; > + respCap.hdr.size = sizeof(respCap); > + respCap.hdr.trans_id = msgCap->hdr.trans_id; > + respCap.is_accepted = 1; > + respCap.hot_remove = 1; > + respCap.suppress_pressure_reports = !balloon->status_reports; > + hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respCap); > + > + if (balloon->caps.cap_bits.hot_add) { > + ha_todo_add_all(balloon); > + } > + > + timer_mod(&balloon->post_init_timer, > + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + > + HV_BALLOON_POST_INIT_WAIT); > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_POST_INIT_WAIT); > +} > + > +static void hv_balloon_handle_status_report(HvBalloon *balloon, > + HvBalloonReq *req) > +{ > + VMBusChanReq *vmreq = &req->vmreq; > + struct dm_status *msgStatus = vmreq->msg; > + > + if 
(!hv_balloon_handle_msg_size(req, sizeof(*msgStatus), > + "DM_STATUS_REPORT")) { > + return; > + } > + > + if (!balloon->status_reports) { > + return; > + } > + > + > qapi_event_send_hv_balloon_status_report((uint64_t)msgStatus->num_committed * > + HV_BALLOON_PAGE_SIZE, > + (uint64_t)msgStatus->num_avail * > + HV_BALLOON_PAGE_SIZE); > +} > + > +static void hv_balloon_handle_unballoon_response(HvBalloon *balloon, > + HvBalloonReq *req, > + StateDesc *stdesc) > +{ > + VMBusChanReq *vmreq = &req->vmreq; > + struct dm_unballoon_response *msgUrR = vmreq->msg; > + > + if (balloon->state != S_UNBALLOON_REPLY_WAIT) { > + warn_report("unexpected DM_UNBALLOON_RESPONSE in %d state", > + balloon->state); > + return; > + } > + > + if (!hv_balloon_handle_msg_size(req, sizeof(*msgUrR), > + "DM_UNBALLOON_RESPONSE")) > + return; > + > + trace_hv_balloon_incoming_unballoon(msgUrR->hdr.trans_id); > + > + balloon->trans_id++; > + HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE); > +} > + > +static void hv_balloon_handle_hot_add_response(HvBalloon *balloon, > + HvBalloonReq *req, > + StateDesc *stdesc) > +{ > + VMBusChanReq *vmreq = &req->vmreq; > + struct dm_hot_add_response *msgHaR = vmreq->msg; > + HAPVDIMMRange *hpr; > + > + if (balloon->state != S_HOT_ADD_REPLY_WAIT) { > + warn_report("unexpected DM_HOT_ADD_RESPONSE in %d state", > + balloon->state); > + return; > + } > + > + if (!hv_balloon_handle_msg_size(req, sizeof(*msgHaR), > + "DM_HOT_ADD_RESPONSE")) > + return; > + > + trace_hv_balloon_incoming_hot_add(msgHaR->hdr.trans_id, msgHaR->result, > + msgHaR->page_count); > + > + balloon->trans_id++; > + > + assert(balloon->ha_todo); > + hpr = balloon->ha_todo->data; > + > + if (msgHaR->result) { > + if (msgHaR->page_count > balloon->ha_current_count) { > + warn_report("DM_HOT_ADD_RESPONSE page count higher than > requested (%"PRIu32" vs %"PRIu64")", > + msgHaR->page_count, balloon->ha_current_count); > + msgHaR->page_count = balloon->ha_current_count; > + } > + > + hpr->used += msgHaR->page_count; > + } > + > + if (!msgHaR->result || msgHaR->page_count < balloon->ha_current_count) { > + if (hpr->used == 0) { > + /* > + * apparently the guest didn't like the current range at all, > + * let's try the next one > + */ > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT); > + return; > + } > + > + /* > + * the current planned range was only partially hot-added, take note > + * how much of it remains and don't attempt any further hot adds > + */ > + hpr->unused_tail = hpr->range.count - hpr->unused_head - hpr->used; > + > + HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_PROCESSED_CLEAR_PENDING); > + return; > + } > + > + /* any pages remaining in this hpr? 
> +
> +static void hv_balloon_handle_balloon_response(HvBalloon *balloon,
> +                                               HvBalloonReq *req,
> +                                               StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_balloon_response *msgBR = vmreq->msg;
> +
> +    if (balloon->state != S_BALLOON_REPLY_WAIT) {
> +        warn_report("unexpected DM_BALLOON_RESPONSE in %d state",
> +                    balloon->state);
> +        return;
> +    }
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgBR),
> +                                    "DM_BALLOON_RESPONSE")) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_incoming_balloon(msgBR->hdr.trans_id, msgBR->range_count,
> +                                      msgBR->more_pages);
> +
> +    if (vmreq->msglen < sizeof(*msgBR) +
> +        (uint64_t)sizeof(msgBR->range_array[0]) * msgBR->range_count) {
> +        warn_report("DM_BALLOON_RESPONSE too short for the range count");
> +        return;
> +    }
> +
> +    if (msgBR->range_count == 0) {
> +        /* The guest is already at its minimum size */
> +        msgBR->more_pages = 0;
> +        balloon->target_diff = 0;
> +    } else {
> +        hv_balloon_handle_remove_ranges(balloon,
> +                                        msgBR->range_array,
> +                                        msgBR->range_count);
> +    }
> +
> +    if (!msgBR->more_pages) {
> +        balloon->trans_id++;
> +
> +        if (balloon->target_diff > 0) {
> +            HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_RB_WAIT);
> +        } else {
> +            HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
> +        }
> +    }
> +}
> +
> +static void hv_balloon_handle_packet(HvBalloon *balloon, HvBalloonReq *req,
> +                                     StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_message *msg = vmreq->msg;
> +
> +    if (vmreq->msglen < sizeof(msg->hdr)) {
> +        return;
> +    }
> +
> +    switch (msg->hdr.type) {
> +    case DM_VERSION_REQUEST:
> +        hv_balloon_handle_version_request(balloon, req, stdesc);
> +        break;
> +
> +    case DM_CAPABILITIES_REPORT:
> +        hv_balloon_handle_caps_report(balloon, req, stdesc);
> +        break;
> +
> +    case DM_STATUS_REPORT:
> +        hv_balloon_handle_status_report(balloon, req);
> +        break;
> +
> +    case DM_MEM_HOT_ADD_RESPONSE:
> +        hv_balloon_handle_hot_add_response(balloon, req, stdesc);
> +        break;
> +
> +    case DM_UNBALLOON_RESPONSE:
> +        hv_balloon_handle_unballoon_response(balloon, req, stdesc);
> +        break;
> +
> +    case DM_BALLOON_RESPONSE:
> +        hv_balloon_handle_balloon_response(balloon, req, stdesc);
> +        break;
> +
> +    default:
> +        warn_report("unknown DM message %u", msg->hdr.type);
> +        break;
> +    }
> +}
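Good that the DM_BALLOON_RESPONSE length check widens to uint64_t before multiplying: on a 32-bit host, a size_t product could wrap and let an undersized message through. A self-contained illustration of the pattern (the function and values are mine, not from the patch):

#include <stdint.h>
#include <stdbool.h>

/* true if a message of msglen bytes can hold the fixed part plus
 * elem_count array elements; multiply in 64 bits so a huge
 * elem_count cannot wrap the product */
static bool msg_len_ok(uint32_t msglen, size_t fixed_size,
                       size_t elem_size, uint32_t elem_count)
{
    return msglen >= fixed_size + (uint64_t)elem_size * elem_count;
}

With elem_size 8 and elem_count 0x20000000, a 32-bit multiply would give 0 and wrongly accept any msglen; the 64-bit product rejects it as expected.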
> +
> +static bool hv_balloon_recv_channel(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan;
> +    HvBalloonReq *req;
> +
> +    if (balloon->state == S_WAIT_RESET ||
> +        balloon->state == S_CLOSED) {
> +        return false;
> +    }
> +
> +    chan = hv_balloon_get_channel(balloon);
> +    if (vmbus_channel_recv_start(chan)) {
> +        return false;
> +    }
> +
> +    while ((req = vmbus_channel_recv_peek(chan, sizeof(*req)))) {
> +        hv_balloon_handle_packet(balloon, req, stdesc);
> +        vmbus_free_req(req);
> +        vmbus_channel_recv_pop(chan);
> +
> +        if (stdesc->state != S_NO_CHANGE) {
> +            break;
> +        }
> +    }
> +
> +    return vmbus_channel_recv_done(chan) > 0;
> +}
> +
> +static bool hv_balloon_event_loop_state(HvBalloon *balloon)
> +{
> +    StateDesc state_new = HV_BALLOON_STATE_DESC_INIT;
> +
> +    hv_balloon_handle_state(balloon, &state_new);
> +    return hv_balloon_state_set(balloon, state_new.state, state_new.desc);
> +}
> +
> +static bool hv_balloon_event_loop_recv(HvBalloon *balloon)
> +{
> +    StateDesc state_new = HV_BALLOON_STATE_DESC_INIT;
> +    bool any_recv, state_changed;
> +
> +    any_recv = hv_balloon_recv_channel(balloon, &state_new);
> +    state_changed = hv_balloon_state_set(balloon,
> +                                         state_new.state, state_new.desc);
> +
> +    return state_changed || any_recv;
> +}
> +
> +static void hv_balloon_event_loop(HvBalloon *balloon)
> +{
> +    bool state_repeat, recv_repeat;
> +
> +    do {
> +        state_repeat = hv_balloon_event_loop_state(balloon);
> +        recv_repeat = hv_balloon_event_loop_recv(balloon);
> +    } while (state_repeat || recv_repeat);
> +}
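The event loop above is a fixed-point iteration: it alternates between running the state machine and draining the channel until one full pass reports no progress, since a state change can unblock message processing and a received message can trigger a state change. The generic shape, for reference (names invented, a sketch only):

/* run both steps until neither makes progress; each step callback
 * returns true if it changed anything */
static void run_to_fixed_point(void *ctx,
                               bool (*step_state)(void *ctx),
                               bool (*step_recv)(void *ctx))
{
    bool progress;

    do {
        progress = step_state(ctx);
        /* call step_recv() even when step_state() made progress */
        progress = step_recv(ctx) || progress;
    } while (progress);
}

Note that hv_balloon_event_loop() deliberately stores both results before OR-ing them, so short-circuit evaluation cannot skip the receive step.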
> +
> +void qmp_hv_balloon_add_memory(const char *id, Error **errp)
> +{
> +    HvBalloon *balloon;
> +    uint64_t align;
> +    g_autofree gchar *align_str = NULL;
> +    g_autoptr(QDict) qdict = NULL;
> +    g_autoptr(DeviceState) dev = NULL;
> +    HAPVDIMMDevice *hapvdimm;
> +    PageRange range;
> +    HAPVDIMMRange *hpr;
> +
> +    balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL));
> +    if (!balloon) {
> +        error_setg(errp, "no %s device present", TYPE_HV_BALLOON);
> +        return;
> +    }
> +
> +    if (hv_balloon_state_is_init(balloon)) {
> +        error_setg(errp, "no guest attached to the DM protocol yet");
> +        return;
> +    }
> +
> +    if (!balloon->caps.cap_bits.hot_add) {
> +        error_setg(errp,
> +                   "the current DM protocol guest has no support for memory hot add");
> +        return;
> +    }
> +
> +    /* add device */
> +    qdict = qdict_new();
> +    qdict_put_str(qdict, "driver", TYPE_HAPVDIMM);
> +    qdict_put_str(qdict, HAPVDIMM_MEMDEV_PROP, id);
> +
> +    align = (1ULL << balloon->caps.cap_bits.hot_add_alignment) * MiB;
> +    align_str = g_strdup_printf("%" PRIu64, align);
> +    qdict_put_str(qdict, HAPVDIMM_ALIGN_PROP, align_str);
> +
> +    hapvdimm_allow_adding();
> +    dev = qdev_device_add_from_qdict(qdict, false, errp);
> +    hapvdimm_disallow_adding();
> +    if (!dev) {
> +        return;
> +    }
> +
> +    hapvdimm = HAPVDIMM(dev);
> +
> +    hapvdimm_get_range(hapvdimm, &range);
> +    if (page_range_tree_intree_any(balloon->removed_guest,
> +                                   range.start, range.count) ||
> +        page_range_tree_intree_any(balloon->removed_both,
> +                                   range.start, range.count)) {
> +        error_setg(errp,
> +                   "some of the device's new pages were already returned by the guest; this should not happen, please reboot the guest and try again");
> +        return;
> +    }
> +
> +    trace_hv_balloon_hapvdimm_range_add(range.count, range.start);
> +
> +    hpr = hapvdimm_tree_insert_new(balloon, hapvdimm);
> +
> +    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
> +
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_notify_cb(VMBusChannel *chan)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
> +
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_stat(void *opaque, BalloonInfo *info)
> +{
> +    HvBalloon *balloon = opaque;
> +    info->actual = (hv_balloon_total_ram(balloon) - balloon->removed_both_ctr)
> +                   << HV_BALLOON_PFN_SHIFT;
> +}
> +
> +static void hv_balloon_to_target(void *opaque, ram_addr_t target)
> +{
> +    HvBalloon *balloon = opaque;
> +    uint64_t target_pages = target >> HV_BALLOON_PFN_SHIFT;
> +
> +    if (!target_pages) {
> +        return;
> +    }
> +
> +    /*
> +     * always set target_changed, even when the target is unchanged, as the
> +     * user might be asking us to try to reach it again
> +     */
> +    balloon->target = target_pages;
> +    balloon->target_changed = true;
> +
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static int hv_balloon_open_channel(VMBusChannel *chan)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
> +
> +    if (balloon->state != S_CLOSED) {
> +        warn_report("guest trying to open a DM channel in invalid %d state",
> +                    balloon->state);
> +        return -EINVAL;
> +    }
> +
> +    HV_BALLOON_SET_STATE(balloon, S_VERSION);
> +    hv_balloon_event_loop(balloon);
> +
> +    return 0;
> +}
> +
> +static void hv_balloon_close_channel(VMBusChannel *chan)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
> +
> +    timer_del(&balloon->post_init_timer);
> +
> +    HV_BALLOON_SET_STATE(balloon, S_WAIT_RESET);
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_post_init_timer(void *opaque)
> +{
> +    HvBalloon *balloon = opaque;
> +
> +    if (balloon->state != S_POST_INIT_WAIT) {
> +        return;
> +    }
> +
> +    HV_BALLOON_SET_STATE(balloon, S_IDLE);
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_system_reset(void *opaque)
> +{
> +    HvBalloon *balloon = HV_BALLOON(opaque);
> +
> +    if (!balloon->hapvdimms_del_todo) {
> +        return;
> +    }
> +
> +    if (balloon->del_todo_process_timer) {
> +        return;
> +    }
> +
> +    balloon->del_todo_process_timer = g_idle_add(del_todo_process_timer,
> +                                                 balloon);
> +}
> +
> +static void hv_balloon_dev_realize(VMBusDevice *vdev, Error **errp)
> +{
> +    ERRP_GUARD();
> +    HvBalloon *balloon = HV_BALLOON(vdev);
> +    int ret;
> +
> +    /* used by hv_balloon_stat() */
> +    balloon->hapvdimms = hapvdimm_tree_new();
> +    balloon->state = S_WAIT_RESET;
> +
> +    ret = qemu_add_balloon_handler(hv_balloon_to_target, hv_balloon_stat,
> +                                   balloon);
> +    if (ret < 0) {
> +        /* This also protects against having multiple hv-balloon instances */
> +        error_setg(errp, "Only one balloon device is supported");
> +        goto ret_tree;
> +    }
> +
> +    timer_init_ms(&balloon->post_init_timer, QEMU_CLOCK_VIRTUAL,
> +                  hv_balloon_post_init_timer, balloon);
> +
> +    qemu_register_reset(hv_balloon_system_reset, balloon);
> +
> +    return;
> +
> +ret_tree:
> +    hapvdimm_tree_destroy(&balloon->hapvdimms);
> +}
> +
> +static void hv_balloon_reset_destroy_common(HvBalloon *balloon)
> +{
> +    ha_todo_clear(balloon);
> +    del_todo_add_all(balloon);
> +}
> +
> +static void hv_balloon_dev_reset(VMBusDevice *vdev)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vdev);
> +
> +    page_range_tree_destroy(&balloon->removed_guest);
> +    page_range_tree_destroy(&balloon->removed_both);
> +    balloon->removed_guest = page_range_tree_new();
> +    balloon->removed_both = page_range_tree_new();
> +
> +    hv_balloon_reset_destroy_common(balloon);
> +
> +    balloon->trans_id = 0;
> +    balloon->removed_guest_ctr = 0;
> +    balloon->removed_both_ctr = 0;
> +
> +    HV_BALLOON_SET_STATE(balloon, S_CLOSED);
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_dev_unrealize(VMBusDevice *vdev)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vdev);
> +
> +    qemu_unregister_reset(hv_balloon_system_reset, balloon);
> +
> +    hv_balloon_reset_destroy_common(balloon);
> +
> +    del_todo_process(balloon);
> +    assert(!balloon->del_todo_process_timer);
> +
> +    qemu_remove_balloon_handler(balloon);
> +
> +    page_range_tree_destroy(&balloon->removed_guest);
> +    page_range_tree_destroy(&balloon->removed_both);
> +    hapvdimm_tree_destroy(&balloon->hapvdimms);
> +}
> +
> +static Property hv_balloon_properties[] = {
> +    DEFINE_PROP_BOOL("status-report", HvBalloon,
> +                     status_reports, false),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void hv_balloon_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VMBusDeviceClass *vdc = VMBUS_DEVICE_CLASS(klass);
> +
> +    device_class_set_props(dc, hv_balloon_properties);
> +    qemu_uuid_parse(HV_BALLOON_GUID, &vdc->classid);
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    vdc->vmdev_realize = hv_balloon_dev_realize;
> +    vdc->vmdev_unrealize = hv_balloon_dev_unrealize;
> +    vdc->vmdev_reset = hv_balloon_dev_reset;
> +    vdc->open_channel = hv_balloon_open_channel;
> +    vdc->close_channel = hv_balloon_close_channel;
> +    vdc->chan_notify_cb = hv_balloon_notify_cb;
> +}
> +
> +static const TypeInfo hv_balloon_type_info = {
> +    .name = TYPE_HV_BALLOON,
> +    .parent = TYPE_VMBUS_DEVICE,
> +    .instance_size = sizeof(HvBalloon),
> +    .class_init = hv_balloon_class_init,
> +};
> +
> +static void hv_balloon_register_types(void)
> +{
> +    type_register_static(&hv_balloon_type_info);
> +}
> +
> +type_init(hv_balloon_register_types)
> diff --git a/hw/hyperv/meson.build b/hw/hyperv/meson.build
> index b43f119ea5..212e0ce51e 100644
> --- a/hw/hyperv/meson.build
> +++ b/hw/hyperv/meson.build
> @@ -2,3 +2,4 @@ specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'))
>  specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c'))
>  specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c'))
>  specific_ss.add(when: 'CONFIG_SYNDBG', if_true: files('syndbg.c'))
> +specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c'))
> diff --git a/hw/hyperv/trace-events b/hw/hyperv/trace-events
> index b4c35ca8e3..3b98ac3689 100644
> --- a/hw/hyperv/trace-events
> +++ b/hw/hyperv/trace-events
> @@ -16,3 +16,19 @@ vmbus_gpadl_torndown(uint32_t gpadl_id) "gpadl #%d"
>  vmbus_open_channel(uint32_t chan_id, uint32_t gpadl_id, uint32_t target_vp) "channel #%d gpadl #%d target vp %d"
>  vmbus_channel_open(uint32_t chan_id, uint32_t status) "channel #%d status %d"
>  vmbus_close_channel(uint32_t chan_id) "channel #%d"
> +
> +# hv-balloon
> +hv_balloon_state_change(const char *tostr) "-> %s"
> +hv_balloon_incoming_version(uint16_t major, uint16_t minor) "incoming proto version %u.%u"
> +hv_balloon_incoming_caps(uint32_t caps) "incoming caps 0x%x"
> +hv_balloon_outgoing_unballoon(uint32_t trans_id, uint64_t count, uint64_t start, uint64_t rempages) "posting unballoon %"PRIu32" for %"PRIu64" @ 0x%"PRIx64", remaining %"PRIu64
0x%"PRIx64", remaining %"PRIu64 > +hv_balloon_incoming_unballoon(uint32_t trans_id) "incoming unballoon > response %"PRIu32 > +hv_balloon_outgoing_hot_add(uint32_t trans_id, uint64_t count, uint64_t > start) "posting hot add %"PRIu32" for %"PRIu64" @ 0x%"PRIx64 > +hv_balloon_incoming_hot_add(uint32_t trans_id, uint32_t result, uint32_t > count) "incoming hot add response %"PRIu32", result %"PRIu32", count %"PRIu32 > +hv_balloon_outgoing_balloon(uint32_t trans_id, uint64_t count, uint64_t > rempages) "posting balloon %"PRIu32" for %"PRIu64", remaining %"PRIu64 > +hv_balloon_incoming_balloon(uint32_t trans_id, uint32_t range_count, > uint32_t more_pages) "incoming balloon response %"PRIu32", ranges %"PRIu32", > more %"PRIu32 > +hv_balloon_hapvdimm_range_add(uint64_t count, uint64_t start) "adding > hapvdimm range %"PRIu64" @ 0x%"PRIx64 > +hv_balloon_remove_response(uint64_t count, uint64_t start, unsigned int > both) "processing remove response range %"PRIu64" @ 0x%"PRIx64", both %u" > +hv_balloon_remove_response_hole(uint64_t counthole, uint64_t starthole, > uint64_t countrange, uint64_t startrange, uint64_t starthpr, unsigned int > both) "response range hole %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ > 0x%"PRIx64", before hpr start 0x%"PRIx64", both %u" > +hv_balloon_remove_response_common(uint64_t countcommon, uint64_t > startcommon, uint64_t countrange, uint64_t startrange, uint64_t counthpr, > uint64_t starthpr, uint64_t removed, unsigned int both) "response common > range %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64" with hpr > %"PRIu64" @ 0x%"PRIx64", removed %"PRIu64", both %u" > +hv_balloon_remove_response_remainder(uint64_t count, uint64_t start, > unsigned int both) "remove response remaining range %"PRIu64" @ 0x%"PRIx64", > both %u" > diff --git a/meson.build b/meson.build > index 6cb2b1a42f..2d9c01b6ec 100644 > --- a/meson.build > +++ b/meson.build > @@ -2550,7 +2550,8 @@ host_kconfig = \ > ('CONFIG_LINUX' in config_host ? ['CONFIG_LINUX=y'] : []) + \ > (have_pvrdma ? ['CONFIG_PVRDMA=y'] : []) + \ > (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \ > - (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : []) > + (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : []) + \ > + ('CONFIG_HV_BALLOON_POSSIBLE' in config_host ? > ['CONFIG_HV_BALLOON_POSSIBLE=y'] : []) > > ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ] > > @@ -4027,6 +4028,7 @@ summary_info += {'libudev': libudev} > summary_info += {'FUSE lseek': fuse_lseek.found()} > summary_info += {'selinux': selinux} > summary_info += {'libdw': libdw} > +summary_info += {'hv-balloon support': > config_host.has_key('CONFIG_HV_BALLOON_POSSIBLE')} > summary(summary_info, bool_yn: true, section: 'Dependencies') > > if not supported_cpus.contains(cpu) > diff --git a/qapi/machine.json b/qapi/machine.json > index b9228a5e46..04ff95337a 100644 > --- a/qapi/machine.json > +++ b/qapi/machine.json > @@ -1104,6 +1104,74 @@ > { 'event': 'BALLOON_CHANGE', > 'data': { 'actual': 'int' } } > > +## > +# @hv-balloon-add-memory: > +# > +# Hot-add memory backend via Hyper-V Dynamic Memory Protocol. > +# > +# @id: the name of the memory backend object to hot-add > +# > +# Returns: Nothing on success > +# Error if there's no guest connected with hot-add capability, > +# @id is not a valid memory backend or it's already in use. 
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# -> { "execute": "hv-balloon-add-memory", "arguments": { "id": "mb1" } }
> +# <- { "return": {} }
> +#
> +##
> +{ 'command': 'hv-balloon-add-memory', 'data': {'id': 'str'} }
> +
> +##
> +# @HV_BALLOON_STATUS_REPORT:
> +#
> +# Emitted when the hv-balloon driver receives a "STATUS" message from
> +# the guest.
> +#
> +# @committed: the amount of memory in use inside the guest plus the
> +#             amount of memory unusable inside the guest (ballooned out,
> +#             offline, etc.)
> +#
> +# @available: the amount of memory inside the guest available for new
> +#             allocations ("free")
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# <- { "event": "HV_BALLOON_STATUS_REPORT",
> +#      "data": { "committed": 816640000, "available": 3333054464 },
> +#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
> +#
> +##
> +{ 'event': 'HV_BALLOON_STATUS_REPORT',
> +  'data': { 'committed': 'size', 'available': 'size' } }
> +
> +##
> +# @HV_BALLOON_MEMORY_BACKEND_UNUSED:
> +#
> +# Emitted when the hv-balloon driver marks a memory backend object
> +# as unused, so it can be removed if no longer needed.
> +#
> +# This can happen because the VM was restarted.
> +#
> +# @id: the memory backend object id
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# <- { "event": "HV_BALLOON_MEMORY_BACKEND_UNUSED",
> +#      "data": { "id": "mb1" },
> +#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
> +#
> +##
> +{ 'event': 'HV_BALLOON_MEMORY_BACKEND_UNUSED',
> +  'data': { 'id': 'str' } }
> +
>  ##
>  # @MemoryInfo:
>  #
>