On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
>
>
>
> On 3/21/2024 10:08 PM, Jason Wang wrote:
> > On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>
> >>
> >> On 3/20/2024 8:56 PM, Jason Wang wrote:
> >>> On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>
> >>>> On 3/19/2024 8:27 PM, Jason Wang wrote:
> >>>>> On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>> On 3/17/2024 8:22 PM, Jason Wang wrote:
> >>>>>>> On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>> On 3/14/2024 9:03 PM, Jason Wang wrote:
> >>>>>>>>> On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu <si-wei....@oracle.com> wrote:
> >>>>>>>>>> On setups with one or more virtio-net devices with vhost on,
> >>>>>>>>>> dirty tracking iteration increases cost the bigger the number
> >>>>>>>>>> of queues that are set up, e.g. on idle guest migration the
> >>>>>>>>>> following is observed with virtio-net with vhost=on:
> >>>>>>>>>>
> >>>>>>>>>> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> >>>>>>>>>>  8 queues -> 40.50%  [.] vhost_dev_sync_region.isra.13
> >>>>>>>>>>  1 queue  ->  6.89%  [.] vhost_dev_sync_region.isra.13
> >>>>>>>>>> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> >>>>>>>>>>
> >>>>>>>>>> With high memory dirtying rates the symptom is lack of convergence
> >>>>>>>>>> as soon as there is a vhost device with a sufficiently high number
> >>>>>>>>>> of queues, or a sufficient number of vhost devices.
> >>>>>>>>>>
> >>>>>>>>>> On every migration iteration (every 100msecs) it will redundantly
> >>>>>>>>>> query the *shared log* as many times as the number of queues
> >>>>>>>>>> configured with vhost that exist in the guest. For the virtqueue
> >>>>>>>>>> data this is necessary, but not for the memory sections, which are
> >>>>>>>>>> the same. So essentially we end up scanning the dirty log too often.
> >>>>>>>>>>
> >>>>>>>>>> To fix that, select one vhost device to be responsible for scanning
> >>>>>>>>>> the log with regard to memory section dirty tracking. It is selected
> >>>>>>>>>> when we enable the logger (during migration) and cleared when we
> >>>>>>>>>> disable the logger. If the vhost logger device goes away for some
> >>>>>>>>>> reason, the logger will be re-selected from the rest of the vhost
> >>>>>>>>>> devices.
> >>>>>>>>>>
> >>>>>>>>>> After making the mem-section logger a singleton instance, a constant
> >>>>>>>>>> cost of 7%-9% (like the 1 queue report) will be seen, no matter how
> >>>>>>>>>> many queues or how many vhost devices are configured:
> >>>>>>>>>>
> >>>>>>>>>> 48 queues -> 8.71%  [.] vhost_dev_sync_region.isra.13
> >>>>>>>>>> 2 devices, 8 queues -> 7.97%  [.] vhost_dev_sync_region.isra.14
> >>>>>>>>>>
> >>>>>>>>>> Co-developed-by: Joao Martins <joao.m.mart...@oracle.com>
> >>>>>>>>>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
> >>>>>>>>>> Signed-off-by: Si-Wei Liu <si-wei....@oracle.com>
> >>>>>>>>>>
> >>>>>>>>>> ---
> >>>>>>>>>> v3 -> v4:
> >>>>>>>>>>   - add comment to clarify effect on cache locality and
> >>>>>>>>>>     performance
> >>>>>>>>>>
> >>>>>>>>>> v2 -> v3:
> >>>>>>>>>>   - add after-fix benchmark to commit log
> >>>>>>>>>>   - rename vhost_log_dev_enabled to vhost_dev_should_log
> >>>>>>>>>>   - remove unneeded comparisons for backend_type
> >>>>>>>>>>   - use QLIST array instead of single flat list to store vhost
> >>>>>>>>>>     logger devices
> >>>>>>>>>>   - simplify logger election logic
> >>>>>>>>>> ---
> >>>>>>>>>>  hw/virtio/vhost.c         | 67 ++++++++++++++++++++++++++++++++++++++++++-----
> >>>>>>>>>>  include/hw/virtio/vhost.h |  1 +
> >>>>>>>>>>  2 files changed, 62 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >>>>>>>>>> index 612f4db..58522f1 100644
> >>>>>>>>>> --- a/hw/virtio/vhost.c
> >>>>>>>>>> +++ b/hw/virtio/vhost.c
> >>>>>>>>>> @@ -45,6 +45,7 @@
> >>>>>>>>>>
> >>>>>>>>>>  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
> >>>>>>>>>>  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> >>>>>>>>>> +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> >>>>>>>>>>
> >>>>>>>>>>  /* Memslots used by backends that support private memslots (without an fd). */
> >>>>>>>>>>  static unsigned int used_memslots;
> >>>>>>>>>> @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
> >>>>>>>>>>      }
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> >>>>>>>>>> +{
> >>>>>>>>>> +    assert(dev->vhost_ops);
> >>>>>>>>>> +    assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
> >>>>>>>>>> +    assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> >>>>>>>>>> +
> >>>>>>>>>> +    return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
> >>>>>>>>> A dumb question, why not simply check
> >>>>>>>>>
> >>>>>>>>> dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
> >>>>>>>> Because we are not sure if the logger comes from vhost_log_shm[] or
> >>>>>>>> vhost_log[]. We don't want to complicate the check here by calling into
> >>>>>>>> vhost_dev_log_is_shared() every time .log_sync() is called.
> >>>>>>> It has very low overhead, doesn't it?
> >>>>>> Whether this has low overhead will have to depend on the specific
> >>>>>> backend's implementation of .vhost_requires_shm_log(), which the common
> >>>>>> vhost layer should not make assumptions about, nor rely on the current
> >>>>>> implementation of.
> >>>>>>
> >>>>>>> static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
> >>>>>>> {
> >>>>>>>     return dev->vhost_ops->vhost_requires_shm_log &&
> >>>>>>>            dev->vhost_ops->vhost_requires_shm_log(dev);
> >>>>>>> }
> >>>>> For example, if I understand the code correctly, the log type won't be
> >>>>> changed during runtime, so we can end up with a boolean to record that
> >>>>> instead of querying the ops?
> >>>> Right now the log type won't change during runtime, but I am not sure if
> >>>> this may prohibit a future revisit to allow changes at runtime,
> >>> We can be bothered when we have such a request then.
> >>>
> >>>> then
> >>>> there'll be complex code involved to maintain the state.
> >>>>
> >>>> Other than this, I think it's insufficient to just check the shm log
> >>>> vs. the normal log. The logging requires identifying a leading logger
> >>>> device that gets elected in vhost_dev_elect_mem_logger(); as all the
> >>>> dev->log pointers point to the same logger that is reference counted,
> >>>> we would have to add an extra field and complex logic to maintain the
> >>>> election status.
> >>> One thing I don't understand here (and in the changelog) is why we
> >>> need an election here.
> >> vhost_sync_dirty_bitmap() not only scans the guest memory sections but
> >> also the specific ones for the virtqueues (used rings). To save as many
> >> CPU cycles as possible, the guest memory must be scanned only once in
> >> each log iteration, though the logging for the used rings would still
> >> have to use the specific vhost instance, so all vhost_device instances
> >> still keep their dev->log pointer to the shared log as-is. Generally the
> >> shared memory logger can be picked from an arbitrary vhost_device
> >> instance, but to keep the code simple, performant and predictable
> > This is the point, I don't see why election is simpler than picking an
> > arbitrary shared log in this case.
> Maybe I missed your point, but I am confused and fail to understand why
> electing a fixed vhost_dev is not as simple? Regardless of the
> specifics, I think the point is that one _fixed_ vhost_dev has to be
> picked, amongst all the instances per backend type, to be in charge of
> logging guest memory, no matter whether it's at the start of the list
> or not.
See below.

>
> >
> >> , logger selection is
> >> made on the control path at vhost add/remove time rather than being
> >> determined at runtime during dirty log collection, the latter of which
> >> is in the hot path.
> >>
> >>>> I thought that Eugenio's previous suggestion tried to simplify
> >>>> the logic in vhost_dev_elect_mem_logger(), as the QLIST_FIRST macro that
> >>>> gets expanded to use the lh_first field of the QLIST would simply
> >>>> satisfy the basic need. Why add extra logic to make the check ever more
> >>>> complex, and is there any benefit in adding more fields to the vhost_dev?
> >>> I don't get it here, the idea is to just pick one shared log, which
> >>> should be much simpler than what is proposed here.
> >> The code you showed earlier won't work as all vhost_device instances
> >> point to the same dev->log device...
> > This part I don't understand.
>
> vhost_log_get() has the following:
>
> log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
>
> if (!log || log->size != size) {
>     log = vhost_log_alloc(size, share);
>     if (share) {
>         vhost_log_shm[backend_type] = log;
>     } else {
>         vhost_log[backend_type] = log;
>     }
> } else {
>     ++log->refcnt;
> }
>
> So for a specific backend type, vhost_log_get() would return the same
> logger device (stored at either vhost_log_shm[] or vhost_log[]) to all
> vhost_dev instances, and the check you suggested earlier:
>
> dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
>
> will always hold true if the vhost_dev instance (representing the
> specific virtqueue) is active.

Right, so the point is to see if we can find something simpler that
avoids the QLIST, as involving vhost_dev there seems unnecessary.

Does something like a counter work?

vhost_sync_dirty_bitmap()
{
    rounds++;
    vhost_dev_sync_region(rounds);
}

vhost_dev_sync_region(rounds)
{
    if (dev->log->rounds == rounds)
        return;
    else
        dev->log->rounds = rounds;
}

(pseudo code, just used to demonstrate the idea).

Thanks

>
> Regards,
> -Siwei
>
>
> > Thanks
> >
> >> Regards,
> >> -Siwei
> >>> Thanks
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>>>>> And it helps to simplify the logic.
> >>>>>> Generally yes, but when it comes to hot path operations the performance
> >>>>>> consideration could override this principle. I think there's no harm in
> >>>>>> checking against the logger device cached in the vhost layer itself, and
> >>>>>> the current patch does not create a lot of complexity or performance side
> >>>>>> effects (actually I think the conditional should be very straightforward
> >>>>>> to turn into just a couple of assembly compare and branch instructions
> >>>>>> rather than an indirection through another jmp call).
> >>>>> Thanks
> >>>>>
> >>>>>> -Siwei
> >>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>>> -Siwei
> >>>>>>>>> ?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
>
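For reference, below is a minimal standalone sketch of the counter idea outlined above. None of it is actual QEMU code: the structures, the synced_round field on the shared log, and the helper names are simplified stand-ins invented for the example. It only demonstrates how a per-iteration round number stored in the shared log could let the first device that syncs in a given iteration scan the memory sections while the remaining devices skip the redundant scan.

/*
 * Standalone sketch of the "counter" idea: every dirty-log iteration
 * bumps a global round number; the first vhost device that reaches the
 * shared log in a given round scans the memory sections, later devices
 * see the round already recorded in the shared log and skip the
 * redundant scan.  All names here are illustrative, not the real QEMU
 * ones.
 */
#include <stdint.h>
#include <stdio.h>

struct vhost_log {
    uint64_t synced_round;   /* last round the memory sections were scanned */
};

struct vhost_dev {
    struct vhost_log *log;   /* shared per backend type */
    int id;
};

static uint64_t log_round;   /* bumped once per dirty-log iteration */

static void sync_mem_sections(struct vhost_dev *dev)
{
    printf("dev %d scans the memory sections\n", dev->id);
}

static void sync_used_ring(struct vhost_dev *dev)
{
    printf("dev %d scans its used ring\n", dev->id);
}

static void vhost_dev_sync_region(struct vhost_dev *dev, uint64_t round)
{
    /* Used rings are per device and must always be synced. */
    sync_used_ring(dev);

    /* Memory sections are shared: only the first device in this round scans. */
    if (dev->log->synced_round == round) {
        return;
    }
    dev->log->synced_round = round;
    sync_mem_sections(dev);
}

static void vhost_sync_dirty_bitmap(struct vhost_dev *devs, int ndevs)
{
    log_round++;
    for (int i = 0; i < ndevs; i++) {
        vhost_dev_sync_region(&devs[i], log_round);
    }
}

int main(void)
{
    struct vhost_log shared = { .synced_round = 0 };
    struct vhost_dev devs[3] = {
        { .log = &shared, .id = 0 },
        { .log = &shared, .id = 1 },
        { .log = &shared, .id = 2 },
    };

    /* Two iterations: the memory sections are scanned once per iteration. */
    vhost_sync_dirty_bitmap(devs, 3);
    vhost_sync_dirty_bitmap(devs, 3);
    return 0;
}

Whether to keep that round state on the shared log (as in this sketch) or to elect a fixed vhost_dev via the QLIST (as in the patch) is exactly the trade-off being debated in this thread; the end result of scanning the guest memory sections once per iteration is the same either way.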