Re: [External] : Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-25 Thread Jason Wang
On Tue, Mar 26, 2024 at 7:21 AM Si-Wei Liu  wrote:
>
>
>
> On 3/24/2024 11:13 PM, Jason Wang wrote:
> > On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 3/21/2024 10:08 PM, Jason Wang wrote:
> >>> On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:
> 
>  On 3/20/2024 8:56 PM, Jason Wang wrote:
> > On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  
> > wrote:
> >> On 3/19/2024 8:27 PM, Jason Wang wrote:
> >>> On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  
> >>> wrote:
>  On 3/17/2024 8:22 PM, Jason Wang wrote:
> > On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  
> > wrote:
> >> On 3/14/2024 9:03 PM, Jason Wang wrote:
> >>> On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu 
> >>>  wrote:
>  On setups with one or more virtio-net devices with vhost on,
>  the cost of each dirty tracking iteration grows with the number
>  of queues that are set up, e.g. on an idle guest migration the
>  following is observed with virtio-net with vhost=on:
> 
>  48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
>  8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
>  1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
>  2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> 
>  With high memory dirtying rates the symptom is lack of convergence
>  as soon as there is a vhost device with a sufficiently high number
>  of queues, or a sufficient number of vhost devices.
> 
>  On every migration iteration (every 100 msecs) the *shared log* is
>  redundantly queried once for each queue configured with vhost that
>  exists in the guest. For the virtqueue data this is necessary, but
>  not for the memory sections, which are the same across queues. So
>  essentially we end up scanning the dirty log too often.
> 
>  To fix that, select a vhost device responsible for scanning the
>  log with regards to memory sections dirty tracking. It is 
>  selected
>  when we enable the logger (during migration) and cleared when we
>  disable the logger. If the vhost logger device goes away for some
>  reason, the logger will be re-selected from the rest of vhost
>  devices.
> 
>  After making mem-section logger a singleton instance, constant 
>  cost
>  of 7%-9% (like the 1 queue report) will be seen, no matter how 
>  many
>  queues or how many vhost devices are configured:
> 
>  48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
>  2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> 
>  Co-developed-by: Joao Martins 
>  Signed-off-by: Joao Martins 
>  Signed-off-by: Si-Wei Liu 
> 
>  ---
>  v3 -> v4:
>  - add comment to clarify effect on cache locality and
>    performance
> 
>  v2 -> v3:
>  - add after-fix benchmark to commit log
>  - rename vhost_log_dev_enabled to vhost_dev_should_log
>  - remove unneeded comparisons for backend_type
>  - use QLIST array instead of single flat list to store 
>  vhost
>    logger devices
>  - simplify logger election logic
>  ---
> hw/virtio/vhost.c | 67 
>  ++-
> include/hw/virtio/vhost.h |  1 +
> 2 files changed, 62 insertions(+), 6 deletions(-)
> 
>  diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>  index 612f4db..58522f1 100644
>  --- a/hw/virtio/vhost.c
>  +++ b/hw/virtio/vhost.c
>  @@ -45,6 +45,7 @@
> 
> static struct vhost_log 
>  *vhost_log[VHOST_BACKEND_TYPE_MAX];
> static struct vhost_log 
>  *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
>  +static QLIST_HEAD(, vhost_dev) 
>  vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> 
> /* Memslots used by backends that support private 
>  memslots (without an fd). */
> static unsigned int used_memslots;
>  @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev 
>  *dev)
> }
> }
> 
>  +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
>  +{
> >>

Re: [External] : Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-25 Thread Si-Wei Liu




On 3/24/2024 11:13 PM, Jason Wang wrote:

On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu  wrote:



On 3/21/2024 10:08 PM, Jason Wang wrote:

On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:


On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:

On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:

On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration grows with the number
of queues that are set up, e.g. on an idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once for each queue configured with vhost that
exists in the guest. For the virtqueue data this is necessary, but
not for the memory sections, which are the same across queues. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
- add comment to clarify effect on cache locality and
  performance

v2 -> v3:
- add after-fix benchmark to commit log
- rename vhost_log_dev_enabled to vhost_dev_should_log
- remove unneeded comparisons for backend_type
- use QLIST array instead of single flat list to store vhost
  logger devices
- simplify logger election logic
---
   hw/virtio/vhost.c | 67 
++-
   include/hw/virtio/vhost.h |  1 +
   2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an 
fd). */
   static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
   }
   }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead depends on the specific backend's
implementation of .vhost_requires_shm_log(); the common vhost layer
should not make assumptions about, or rely on, the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't
change during runtime, so we could end up with a boolean recording it
instead of querying the ops?

Right now the log type won't change during runtime, but I am not sure if
this would get in the way of a future change to allow it at runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involved to maintain the state.

Other than this, I think it's insufficient to just check the shm log
vs. the normal log. The l

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-24 Thread Jason Wang
On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu  wrote:
>
>
>
> On 3/21/2024 10:08 PM, Jason Wang wrote:
> > On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 3/20/2024 8:56 PM, Jason Wang wrote:
> >>> On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:
> 
>  On 3/19/2024 8:27 PM, Jason Wang wrote:
> > On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  
> > wrote:
> >> On 3/17/2024 8:22 PM, Jason Wang wrote:
> >>> On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  
> >>> wrote:
>  On 3/14/2024 9:03 PM, Jason Wang wrote:
> > On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  
> > wrote:
> >> On setups with one or more virtio-net devices with vhost on,
> >> the cost of each dirty tracking iteration grows with the number
> >> of queues that are set up, e.g. on an idle guest migration the
> >> following is observed with virtio-net with vhost=on:
> >>
> >> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> >> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> >> 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
> >> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> >>
> >> With high memory dirtying rates the symptom is lack of convergence
> >> as soon as there is a vhost device with a sufficiently high number
> >> of queues, or a sufficient number of vhost devices.
> >>
> >> On every migration iteration (every 100 msecs) the *shared log* is
> >> redundantly queried once for each queue configured with vhost that
> >> exists in the guest. For the virtqueue data this is necessary, but
> >> not for the memory sections, which are the same across queues. So
> >> essentially we end up scanning the dirty log too often.
> >>
> >> To fix that, select a vhost device responsible for scanning the
> >> log with regards to memory sections dirty tracking. It is selected
> >> when we enable the logger (during migration) and cleared when we
> >> disable the logger. If the vhost logger device goes away for some
> >> reason, the logger will be re-selected from the rest of vhost
> >> devices.
> >>
> >> After making mem-section logger a singleton instance, constant cost
> >> of 7%-9% (like the 1 queue report) will be seen, no matter how many
> >> queues or how many vhost devices are configured:
> >>
> >> 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
> >> 2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> >>
> >> Co-developed-by: Joao Martins 
> >> Signed-off-by: Joao Martins 
> >> Signed-off-by: Si-Wei Liu 
> >>
> >> ---
> >> v3 -> v4:
> >>- add comment to clarify effect on cache locality and
> >>  performance
> >>
> >> v2 -> v3:
> >>- add after-fix benchmark to commit log
> >>- rename vhost_log_dev_enabled to vhost_dev_should_log
> >>- remove unneeded comparisons for backend_type
> >>- use QLIST array instead of single flat list to store vhost
> >>  logger devices
> >>- simplify logger election logic
> >> ---
> >>   hw/virtio/vhost.c | 67 
> >> ++-
> >>   include/hw/virtio/vhost.h |  1 +
> >>   2 files changed, 62 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index 612f4db..58522f1 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -45,6 +45,7 @@
> >>
> >>   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
> >>   static struct vhost_log 
> >> *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> >> +static QLIST_HEAD(, vhost_dev) 
> >> vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> >>
> >>   /* Memslots used by backends that support private memslots 
> >> (without an fd). */
> >>   static unsigned int used_memslots;
> >> @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev 
> >> *dev)
> >>   }
> >>   }
> >>
> >> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> >> +{
> >> +assert(dev->vhost_ops);
> >> +assert(dev->vhost_ops->backend_type > 
> >> VHOST_BACKEND_TYPE_NONE);
> >> +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> >> +
> >> +return dev == 
> >> QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
> > A dumb question, why not simple check
> >
> > dev->log == vhost_log_shm[dev

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-22 Thread Si-Wei Liu




On 3/21/2024 10:08 PM, Jason Wang wrote:

On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:



On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:


On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:

On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration grows with the number
of queues that are set up, e.g. on an idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once for each queue configured with vhost that
exists in the guest. For the virtqueue data this is necessary, but
not for the memory sections, which are the same across queues. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
   - add comment to clarify effect on cache locality and
 performance

v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic
---
  hw/virtio/vhost.c | 67 
++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an 
fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead depends on the specific backend's
implementation of .vhost_requires_shm_log(); the common vhost layer
should not make assumptions about, or rely on, the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't
change during runtime, so we could end up with a boolean recording it
instead of querying the ops?

Right now the log type won't change during runtime, but I am not sure if
this would get in the way of a future change to allow it at runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involved to maintain the state.
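
For reference, a minimal sketch of the boolean-caching alternative being
discussed here (the helper and the field name log_is_shm are assumptions
for illustration, not code from the patch or the tree):

/*
 * Sketch only: cache the log type once when the backend is initialised,
 * so the hot path never has to call vhost_requires_shm_log() again.
 */
static void vhost_dev_cache_log_type(struct vhost_dev *hdev)
{
    hdev->log_is_shm = hdev->vhost_ops->vhost_requires_shm_log &&
                       hdev->vhost_ops->vhost_requires_shm_log(hdev);
}

The per-call check then becomes a plain read of hdev->log_is_shm, at the
price of keeping that cached state correct if the log type ever became
changeable at runtime.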

Other than this, I think it's insufficient to just check the shm log
vs. the normal log. The check needs to identify the leading logger
device that gets elected in vhost_dev_elect_mem_logger(), as a

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-21 Thread Jason Wang
On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:
>
>
>
> On 3/20/2024 8:56 PM, Jason Wang wrote:
> > On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 3/19/2024 8:27 PM, Jason Wang wrote:
> >>> On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:
> 
>  On 3/17/2024 8:22 PM, Jason Wang wrote:
> > On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  
> > wrote:
> >> On 3/14/2024 9:03 PM, Jason Wang wrote:
> >>> On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  
> >>> wrote:
>  On setups with one or more virtio-net devices with vhost on,
>  the cost of each dirty tracking iteration grows with the number
>  of queues that are set up, e.g. on an idle guest migration the
>  following is observed with virtio-net with vhost=on:
> 
>  48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
>  8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
>  1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
>  2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> 
>  With high memory dirtying rates the symptom is lack of convergence
>  as soon as there is a vhost device with a sufficiently high number
>  of queues, or a sufficient number of vhost devices.
> 
>  On every migration iteration (every 100 msecs) the *shared log* is
>  redundantly queried once for each queue configured with vhost that
>  exists in the guest. For the virtqueue data this is necessary, but
>  not for the memory sections, which are the same across queues. So
>  essentially we end up scanning the dirty log too often.
> 
>  To fix that, select a vhost device responsible for scanning the
>  log with regards to memory sections dirty tracking. It is selected
>  when we enable the logger (during migration) and cleared when we
>  disable the logger. If the vhost logger device goes away for some
>  reason, the logger will be re-selected from the rest of vhost
>  devices.
> 
>  After making mem-section logger a singleton instance, constant cost
>  of 7%-9% (like the 1 queue report) will be seen, no matter how many
>  queues or how many vhost devices are configured:
> 
>  48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
>  2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> 
>  Co-developed-by: Joao Martins 
>  Signed-off-by: Joao Martins 
>  Signed-off-by: Si-Wei Liu 
> 
>  ---
>  v3 -> v4:
>    - add comment to clarify effect on cache locality and
>  performance
> 
>  v2 -> v3:
>    - add after-fix benchmark to commit log
>    - rename vhost_log_dev_enabled to vhost_dev_should_log
>    - remove unneeded comparisons for backend_type
>    - use QLIST array instead of single flat list to store vhost
>  logger devices
>    - simplify logger election logic
>  ---
>   hw/virtio/vhost.c | 67 
>  ++-
>   include/hw/virtio/vhost.h |  1 +
>   2 files changed, 62 insertions(+), 6 deletions(-)
> 
>  diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>  index 612f4db..58522f1 100644
>  --- a/hw/virtio/vhost.c
>  +++ b/hw/virtio/vhost.c
>  @@ -45,6 +45,7 @@
> 
>   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
>   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
>  +static QLIST_HEAD(, vhost_dev) 
>  vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> 
>   /* Memslots used by backends that support private memslots 
>  (without an fd). */
>   static unsigned int used_memslots;
>  @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
>   }
>   }
> 
>  +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
>  +{
>  +assert(dev->vhost_ops);
>  +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
>  +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
>  +
>  +return dev == 
>  QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
> >>> A dumb question, why not simple check
> >>>
> >>> dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
> >> Because we are not sure if the logger comes from vhost_log_shm[] or
> >> vhost_log[]. We don't want to complicate the check here by calling into
> >> vhost_dev_log_is_shared() every time .log_sync() is called.
> > It has very low overhead, isn't it?
>  Whether this has low overhead will have to depe

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-21 Thread Si-Wei Liu




On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:



On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:


On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration grows with the number
of queues that are set up, e.g. on an idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once for each queue configured with vhost that
exists in the guest. For the virtqueue data this is necessary, but
not for the memory sections, which are the same across queues. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
  - add comment to clarify effect on cache locality and
performance

v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic
---
 hw/virtio/vhost.c | 67 
++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

 /* Memslots used by backends that support private memslots (without an 
fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead depends on the specific backend's
implementation of .vhost_requires_shm_log(); the common vhost layer
should not make assumptions about, or rely on, the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't
change during runtime, so we could end up with a boolean recording it
instead of querying the ops?

Right now the log type won't change during runtime, but I am not sure if
this would get in the way of a future change to allow it at runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involved to maintain the state.

Other than this, I think it's insufficient to just check the shm log
vs. the normal log. The check needs to identify the leading logger
device that gets elected in vhost_dev_elect_mem_logger(); as all the
dev->log pointers refer to the same reference-counted logger, we would
have to add an extra field and complex 
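
To make the intended flow concrete, here is a simplified sketch of how the
sync path could use the elected logger, based on the commit message above;
it is not the literal patch hunk, which is not fully quoted in this thread:

static void vhost_sync_dirty_bitmap(struct vhost_dev *dev,
                                    MemoryRegionSection *section,
                                    hwaddr first, hwaddr last)
{
    int i;

    if (!dev->log_enabled || !dev->started) {
        return;
    }

    /*
     * Guest memory sections are identical for every vhost device of the
     * same backend type, so only the elected logger scans them.
     */
    if (vhost_dev_should_log(dev)) {
        for (i = 0; i < dev->mem->nregions; ++i) {
            struct vhost_memory_region *reg = dev->mem->regions + i;

            vhost_dev_sync_region(dev, section, first, last,
                                  reg->guest_phys_addr,
                                  range_get_last(reg->guest_phys_addr,
                                                 reg->memory_size));
        }
    }

    /*
     * Virtqueue (used ring) dirty state is per device, so every device
     * still syncs its own rings.
     */
    for (i = 0; i < dev->nvqs; ++i) {
        struct vhost_virtqueue *vq = dev->vqs + i;

        vhost_dev_sync_region(dev, section, first, last, vq->used_phys,
                              range_get_last(vq->used_phys, vq->used_size));
    }
}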

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-20 Thread Jason Wang
On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:
>
>
>
> On 3/19/2024 8:27 PM, Jason Wang wrote:
> > On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 3/17/2024 8:22 PM, Jason Wang wrote:
> >>> On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:
> 
>  On 3/14/2024 9:03 PM, Jason Wang wrote:
> > On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  
> > wrote:
> >> On setups with one or more virtio-net devices with vhost on,
> >> the cost of each dirty tracking iteration grows with the number
> >> of queues that are set up, e.g. on an idle guest migration the
> >> following is observed with virtio-net with vhost=on:
> >>
> >> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> >> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> >> 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
> >> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> >>
> >> With high memory dirtying rates the symptom is lack of convergence
> >> as soon as there is a vhost device with a sufficiently high number
> >> of queues, or a sufficient number of vhost devices.
> >>
> >> On every migration iteration (every 100 msecs) the *shared log* is
> >> redundantly queried once for each queue configured with vhost that
> >> exists in the guest. For the virtqueue data this is necessary, but
> >> not for the memory sections, which are the same across queues. So
> >> essentially we end up scanning the dirty log too often.
> >>
> >> To fix that, select a vhost device responsible for scanning the
> >> log with regards to memory sections dirty tracking. It is selected
> >> when we enable the logger (during migration) and cleared when we
> >> disable the logger. If the vhost logger device goes away for some
> >> reason, the logger will be re-selected from the rest of vhost
> >> devices.
> >>
> >> After making mem-section logger a singleton instance, constant cost
> >> of 7%-9% (like the 1 queue report) will be seen, no matter how many
> >> queues or how many vhost devices are configured:
> >>
> >> 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
> >> 2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> >>
> >> Co-developed-by: Joao Martins 
> >> Signed-off-by: Joao Martins 
> >> Signed-off-by: Si-Wei Liu 
> >>
> >> ---
> >> v3 -> v4:
> >>  - add comment to clarify effect on cache locality and
> >>performance
> >>
> >> v2 -> v3:
> >>  - add after-fix benchmark to commit log
> >>  - rename vhost_log_dev_enabled to vhost_dev_should_log
> >>  - remove unneeded comparisons for backend_type
> >>  - use QLIST array instead of single flat list to store vhost
> >>logger devices
> >>  - simplify logger election logic
> >> ---
> >> hw/virtio/vhost.c | 67 
> >> ++-
> >> include/hw/virtio/vhost.h |  1 +
> >> 2 files changed, 62 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index 612f4db..58522f1 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -45,6 +45,7 @@
> >>
> >> static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
> >> static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> >> +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> >>
> >> /* Memslots used by backends that support private memslots 
> >> (without an fd). */
> >> static unsigned int used_memslots;
> >> @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
> >> }
> >> }
> >>
> >> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> >> +{
> >> +assert(dev->vhost_ops);
> >> +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
> >> +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> >> +
> >> +return dev == 
> >> QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
> > A dumb question, why not simple check
> >
> > dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
>  Because we are not sure if the logger comes from vhost_log_shm[] or
>  vhost_log[]. We don't want to complicate the check here by calling into
>  vhost_dev_log_is_shared() every time .log_sync() is called.
> >>> It has very low overhead, isn't it?
> Whether this has low overhead depends on the specific backend's
> implementation of .vhost_requires_shm_log(); the common vhost layer
> should not make assumptions about, or rely on, the current implementation.
> >>
> >>> static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
> >>> {
> >>>   return dev->vhost_ops->vhost_requires_shm_log &&
> >>>

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-20 Thread Si-Wei Liu




On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:



On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:


On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration grows with the number
of queues that are set up, e.g. on an idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once for each queue configured with vhost that
exists in the guest. For the virtqueue data this is necessary, but
not for the memory sections, which are the same across queues. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
 - add comment to clarify effect on cache locality and
   performance

v2 -> v3:
 - add after-fix benchmark to commit log
 - rename vhost_log_dev_enabled to vhost_dev_should_log
 - remove unneeded comparisons for backend_type
 - use QLIST array instead of single flat list to store vhost
   logger devices
 - simplify logger election logic
---
hw/virtio/vhost.c | 67 
++-
include/hw/virtio/vhost.h |  1 +
2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

/* Memslots used by backends that support private memslots (without an fd). 
*/
static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
}
}

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead depends on the specific backend's
implementation of .vhost_requires_shm_log(); the common vhost layer
should not make assumptions about, or rely on, the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't
change during runtime, so we could end up with a boolean recording it
instead of querying the ops?
Right now the log type won't change during runtime, but I am not sure if
this would get in the way of a future change to allow it at runtime; then
there'll be complex code involved to maintain the state.


Other than this, I think it's insufficient to just check the shm log
vs. the normal log. The check needs to identify the leading logger
device that gets elected in vhost_dev_elect_mem_logger(); as all the
dev->log pointers refer to the same reference-counted logger, we would
have to add an extra field and complex logic to maintain the election
status. I thought that Eugenio's previous suggestion tried to simplify
the logic in vhost_dev_elect_mem_logger(), as the QLIST_FI
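
Since vhost_dev_elect_mem_logger() itself is not quoted in this thread, a
rough sketch of the election described above could look like the following
(the list-entry field name logdev_entry is an assumption; the code in the
tree may differ):

static void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
{
    VhostBackendType backend_type;

    assert(hdev->vhost_ops);

    backend_type = hdev->vhost_ops->backend_type;
    assert(backend_type > VHOST_BACKEND_TYPE_NONE);
    assert(backend_type < VHOST_BACKEND_TYPE_MAX);

    if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
        if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
            QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
                              hdev, logdev_entry);
        } else {
            /*
             * Keep the elected logger (the list head) stable: new devices
             * go right after the head, so enabling another device does
             * not silently change which device scans the memory sections.
             */
            QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
                               hdev, logdev_entry);
        }
    } else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
        QLIST_REMOVE(hdev, logdev_entry);
    }
}

If the head device is removed, QLIST_FIRST() simply starts returning the
next entry, which is how the logger gets re-elected when a device goes away.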

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-19 Thread Jason Wang
On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:
>
>
>
> On 3/17/2024 8:22 PM, Jason Wang wrote:
> > On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 3/14/2024 9:03 PM, Jason Wang wrote:
> >>> On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:
>  On setups with one or more virtio-net devices with vhost on,
>  the cost of each dirty tracking iteration grows with the number
>  of queues that are set up, e.g. on an idle guest migration the
>  following is observed with virtio-net with vhost=on:
> 
>  48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
>  8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
>  1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
>  2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> 
>  With high memory dirtying rates the symptom is lack of convergence
>  as soon as there is a vhost device with a sufficiently high number
>  of queues, or a sufficient number of vhost devices.
> 
>  On every migration iteration (every 100 msecs) the *shared log* is
>  redundantly queried once for each queue configured with vhost that
>  exists in the guest. For the virtqueue data this is necessary, but
>  not for the memory sections, which are the same across queues. So
>  essentially we end up scanning the dirty log too often.
> 
>  To fix that, select a vhost device responsible for scanning the
>  log with regards to memory sections dirty tracking. It is selected
>  when we enable the logger (during migration) and cleared when we
>  disable the logger. If the vhost logger device goes away for some
>  reason, the logger will be re-selected from the rest of vhost
>  devices.
> 
>  After making mem-section logger a singleton instance, constant cost
>  of 7%-9% (like the 1 queue report) will be seen, no matter how many
>  queues or how many vhost devices are configured:
> 
>  48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
>  2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> 
>  Co-developed-by: Joao Martins 
>  Signed-off-by: Joao Martins 
>  Signed-off-by: Si-Wei Liu 
> 
>  ---
>  v3 -> v4:
>  - add comment to clarify effect on cache locality and
>    performance
> 
>  v2 -> v3:
>  - add after-fix benchmark to commit log
>  - rename vhost_log_dev_enabled to vhost_dev_should_log
>  - remove unneeded comparisons for backend_type
>  - use QLIST array instead of single flat list to store vhost
>    logger devices
>  - simplify logger election logic
>  ---
> hw/virtio/vhost.c | 67 
>  ++-
> include/hw/virtio/vhost.h |  1 +
> 2 files changed, 62 insertions(+), 6 deletions(-)
> 
>  diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>  index 612f4db..58522f1 100644
>  --- a/hw/virtio/vhost.c
>  +++ b/hw/virtio/vhost.c
>  @@ -45,6 +45,7 @@
> 
> static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
> static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
>  +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> 
> /* Memslots used by backends that support private memslots (without 
>  an fd). */
> static unsigned int used_memslots;
>  @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
> }
> }
> 
>  +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
>  +{
>  +assert(dev->vhost_ops);
>  +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
>  +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
>  +
>  +return dev == 
>  QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
> >>> A dumb question, why not simple check
> >>>
> >>> dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
> >> Because we are not sure if the logger comes from vhost_log_shm[] or
> >> vhost_log[]. We don't want to complicate the check here by calling into
> >> vhost_dev_log_is_shared() every time .log_sync() is called.
> > It has very low overhead, isn't it?
> Whether this has low overhead depends on the specific backend's
> implementation of .vhost_requires_shm_log(); the common vhost layer
> should not make assumptions about, or rely on, the current implementation.
>
> >
> > static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
> > {
> >     return dev->vhost_ops->vhost_requires_shm_log &&
> >            dev->vhost_ops->vhost_requires_shm_log(dev);
> > }

For example, if I understand the code correctly, the log type won't
change during runtime, so we could end up with a boolean recording it
instead of querying the ops?

> >
> > And it helps to simplify the logic.
> Generally yes, but when it comes to hot path operations the per

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-18 Thread Si-Wei Liu




On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:



On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration grows with the number
of queues that are set up, e.g. on an idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once for each queue configured with vhost that
exists in the guest. For the virtqueue data this is necessary, but
not for the memory sections, which are the same across queues. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
- add comment to clarify effect on cache locality and
  performance

v2 -> v3:
- add after-fix benchmark to commit log
- rename vhost_log_dev_enabled to vhost_dev_should_log
- remove unneeded comparisons for backend_type
- use QLIST array instead of single flat list to store vhost
  logger devices
- simplify logger election logic
---
   hw/virtio/vhost.c | 67 
++-
   include/hw/virtio/vhost.h |  1 +
   2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an fd). 
*/
   static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
   }
   }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?
Whether this has low overhead depends on the specific backend's
implementation of .vhost_requires_shm_log(); the common vhost layer
should not make assumptions about, or rely on, the current implementation.




static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}

And it helps to simplify the logic.
Generally yes, but when it comes to hot-path operations, performance
considerations can override this principle. I think there's no harm in
checking against the logger device cached in the vhost layer itself, and
the current patch does not add much complexity or any performance side
effect (the conditional should compile down to just a couple of
compare-and-branch instructions rather than an indirect call through
another function pointer).
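
Concretely, the two per-call checks being weighed here look roughly like
this (the helper names are made up for illustration):

/* What the patch does per log_sync(): one load plus a pointer compare. */
static inline bool elected_logger_check(struct vhost_dev *dev)
{
    return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
}

/*
 * What the alternative would add per log_sync(): a NULL test plus an
 * indirect call through the backend ops before any pointer compare.
 */
static inline bool shm_log_query(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}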


-Siwei



Thanks


-Siwei

?

Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-17 Thread Jason Wang
On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:
>
>
>
> On 3/14/2024 9:03 PM, Jason Wang wrote:
> > On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:
> >> On setups with one or more virtio-net devices with vhost on,
> >> the cost of each dirty tracking iteration grows with the number
> >> of queues that are set up, e.g. on an idle guest migration the
> >> following is observed with virtio-net with vhost=on:
> >>
> >> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> >> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> >> 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
> >> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
> >>
> >> With high memory dirtying rates the symptom is lack of convergence
> >> as soon as there is a vhost device with a sufficiently high number
> >> of queues, or a sufficient number of vhost devices.
> >>
> >> On every migration iteration (every 100 msecs) the *shared log* is
> >> redundantly queried once for each queue configured with vhost that
> >> exists in the guest. For the virtqueue data this is necessary, but
> >> not for the memory sections, which are the same across queues. So
> >> essentially we end up scanning the dirty log too often.
> >>
> >> To fix that, select a vhost device responsible for scanning the
> >> log with regards to memory sections dirty tracking. It is selected
> >> when we enable the logger (during migration) and cleared when we
> >> disable the logger. If the vhost logger device goes away for some
> >> reason, the logger will be re-selected from the rest of vhost
> >> devices.
> >>
> >> After making mem-section logger a singleton instance, constant cost
> >> of 7%-9% (like the 1 queue report) will be seen, no matter how many
> >> queues or how many vhost devices are configured:
> >>
> >> 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
> >> 2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
> >>
> >> Co-developed-by: Joao Martins 
> >> Signed-off-by: Joao Martins 
> >> Signed-off-by: Si-Wei Liu 
> >>
> >> ---
> >> v3 -> v4:
> >>- add comment to clarify effect on cache locality and
> >>  performance
> >>
> >> v2 -> v3:
> >>- add after-fix benchmark to commit log
> >>- rename vhost_log_dev_enabled to vhost_dev_should_log
> >>- remove unneeded comparisons for backend_type
> >>- use QLIST array instead of single flat list to store vhost
> >>  logger devices
> >>- simplify logger election logic
> >> ---
> >>   hw/virtio/vhost.c | 67 
> >> ++-
> >>   include/hw/virtio/vhost.h |  1 +
> >>   2 files changed, 62 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> >> index 612f4db..58522f1 100644
> >> --- a/hw/virtio/vhost.c
> >> +++ b/hw/virtio/vhost.c
> >> @@ -45,6 +45,7 @@
> >>
> >>   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
> >>   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> >> +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
> >>
> >>   /* Memslots used by backends that support private memslots (without an 
> >> fd). */
> >>   static unsigned int used_memslots;
> >> @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
> >>   }
> >>   }
> >>
> >> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> >> +{
> >> +assert(dev->vhost_ops);
> >> +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
> >> +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> >> +
> >> +return dev == 
> >> QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
> > A dumb question, why not simple check
> >
> > dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
> Because we are not sure if the logger comes from vhost_log_shm[] or
> vhost_log[]. We don't want to complicate the check here by calling into
> vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
    return dev->vhost_ops->vhost_requires_shm_log &&
           dev->vhost_ops->vhost_requires_shm_log(dev);
}

And it helps to simplify the logic.

Thanks

>
> -Siwei
> > ?
> >
> > Thanks
> >
>




Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-15 Thread Si-Wei Liu




On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration grows with the number
of queues that are set up, e.g. on an idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once for each queue configured with vhost that
exists in the guest. For the virtqueue data this is necessary, but
not for the memory sections, which are the same across queues. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
   - add comment to clarify effect on cache locality and
 performance

v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic
---
  hw/virtio/vhost.c | 67 ++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.
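
For context, the cached logger can indeed come from either array depending
on what the backend requires; a simplified sketch of the lookup (not the
exact code in the tree) looks like this:

static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
                                       uint64_t size, bool share)
{
    struct vhost_log *log;

    assert(backend_type > VHOST_BACKEND_TYPE_NONE);
    assert(backend_type < VHOST_BACKEND_TYPE_MAX);

    /* Shared (mmap-able) logs and normal logs live in separate caches. */
    log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];

    if (!log || log->size != size) {
        log = vhost_log_alloc(size, share);
        if (share) {
            vhost_log_shm[backend_type] = log;
        } else {
            vhost_log[backend_type] = log;
        }
    } else {
        ++log->refcnt;
    }

    return log;
}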


-Siwei

?

Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Jason Wang
On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:
>
> On setups with one or more virtio-net devices with vhost on,
> the cost of each dirty tracking iteration grows with the number
> of queues that are set up, e.g. on an idle guest migration the
> following is observed with virtio-net with vhost=on:
>
> 48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
> 8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
> 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
> 2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14
>
> With high memory dirtying rates the symptom is lack of convergence
> as soon as there is a vhost device with a sufficiently high number
> of queues, or a sufficient number of vhost devices.
>
> On every migration iteration (every 100 msecs) the *shared log* is
> redundantly queried once for each queue configured with vhost that
> exists in the guest. For the virtqueue data this is necessary, but
> not for the memory sections, which are the same across queues. So
> essentially we end up scanning the dirty log too often.
>
> To fix that, select a vhost device responsible for scanning the
> log with regards to memory sections dirty tracking. It is selected
> when we enable the logger (during migration) and cleared when we
> disable the logger. If the vhost logger device goes away for some
> reason, the logger will be re-selected from the rest of vhost
> devices.
>
> After making mem-section logger a singleton instance, constant cost
> of 7%-9% (like the 1 queue report) will be seen, no matter how many
> queues or how many vhost devices are configured:
>
> 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
> 2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14
>
> Co-developed-by: Joao Martins 
> Signed-off-by: Joao Martins 
> Signed-off-by: Si-Wei Liu 
>
> ---
> v3 -> v4:
>   - add comment to clarify effect on cache locality and
> performance
>
> v2 -> v3:
>   - add after-fix benchmark to commit log
>   - rename vhost_log_dev_enabled to vhost_dev_should_log
>   - remove unneeded comparisons for backend_type
>   - use QLIST array instead of single flat list to store vhost
> logger devices
>   - simplify logger election logic
> ---
>  hw/virtio/vhost.c | 67 
> ++-
>  include/hw/virtio/vhost.h |  1 +
>  2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 612f4db..58522f1 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -45,6 +45,7 @@
>
>  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
>  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
> +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
>
>  /* Memslots used by backends that support private memslots (without an fd). 
> */
>  static unsigned int used_memslots;
> @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
>  }
>  }
>
> +static inline bool vhost_dev_should_log(struct vhost_dev *dev)
> +{
> +assert(dev->vhost_ops);
> +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
> +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
> +
> +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

?

Thanks