Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-07 Thread Si-Wei Liu




On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:

On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu  wrote:



On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:

On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:


On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:

On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:

On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
include/qemu/iova-tree.h | 5 +++--
util/iova-tree.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
hwaddr iova;
hwaddr translated_addr;
hwaddr size;/* Inclusive */
+uint64_t id;
IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;
typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
 * @map: the mapping to search
 *
 * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
 *
 * Return: DMAMap pointer if found, or NULL if not found.  Note that
 * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

needle = args->needle;
if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
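For illustration, a minimal standalone sketch of the skip_id_match variant
suggested above; the DMAMap here is simplified and the skip_id_match field is
hypothetical, not part of the posted RFC:

#include <glib.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Simplified DMAMap with the hypothetical skip_id_match flag. */
typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;            /* Inclusive */
    uint64_t id;
    bool skip_id_match;     /* hypothetical: ignore 'id' during lookup */
} DMAMap;

struct IOVATreeFindIOVAArgs {
    const DMAMap *needle;
    const DMAMap *result;
};

/* GTraverseFunc for g_tree_foreach(): return TRUE to stop at a match. */
static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
                                                gpointer data)
{
    const DMAMap *map = value;
    struct IOVATreeFindIOVAArgs *args = data;
    const DMAMap *needle = args->needle;

    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr ||
        (!needle->skip_id_match && needle->id != map->id)) {
        return FALSE;   /* no HVA overlap (or id mismatch): keep iterating */
    }

    args->result = map;
    return TRUE;
}

SVQ would then set skip_id_match (or call the variant API) so a plain HVA
overlap is enough, while the removal path keeps the exact id match.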


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?
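As a rough illustration of the memory_region_from_host() direction (not code
from any posted patch; vaddr stands for whatever HVA SVQ needs to translate):

    ram_addr_t offset;
    MemoryRegion *mr = memory_region_from_host(vaddr, &offset);

    if (mr) {
        /* 'offset' is vaddr's offset inside mr's RAM block; turning
         * (mr, offset) back into a GPA/IOVA still needs the memory map,
         * which is the extra lookup cost being weighed here. */
    }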

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.
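A small self-contained example (plain GLib, not QEMU code) of why a lookup by
translated address cannot stop at the first node of a tree ordered by IOVA;
the addresses are made up to mirror the two translations above:

#include <glib.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;
    uint64_t translated_addr;
    uint64_t size;
} Map;

/* Tree ordering: by IOVA, like the SVQ iova tree. */
static gint cmp_iova(gconstpointer a, gconstpointer b, gpointer data)
{
    const Map *ma = a, *mb = b;
    return (ma->iova > mb->iova) - (ma->iova < mb->iova);
}

/* Reverse lookup: match on translated_addr, which the ordering ignores. */
static gboolean find_by_hva(gpointer key, gpointer value, gpointer data)
{
    const Map *m = value;
    uint64_t hva = *(uint64_t *)data;

    if (hva >= m->translated_addr && hva <= m->translated_addr + m->size) {
        printf("hva 0x%" PRIx64 " -> iova 0x%" PRIx64 "\n", hva, m->iova);
        return TRUE;    /* stop the traversal */
    }
    return FALSE;       /* keep walking: IOVA order says nothing about HVA */
}

int main(void)
{
    GTree *t = g_tree_new_full(cmp_iova, NULL, NULL, NULL);
    Map m1 = { .iova = 0x1000, .translated_addr = 0x10000, .size = 0xfff };
    Map m2 = { .iova = 0x2000, .translated_addr = 0x6000,  .size = 0xfff };
    uint64_t hva = 0x6800;      /* belongs to m2, which is visited second */

    g_tree_insert(t, &m1, &m1);
    g_tree_insert(t, &m2, &m2);
    g_tree_foreach(t, find_by_hva, &hva);
    g_tree_destroy(t);
    return 0;
}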

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.
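A rough sketch of that reverse-tree idea, reusing the Map type from the
example above (illustrative only, not the actual vhost-iova-tree layout):

/* Second index keyed by HVA, so the reverse lookup becomes O(log N). */
static gint cmp_hva(gconstpointer a, gconstpointer b, gpointer data)
{
    const Map *ma = a, *mb = b;
    return (ma->translated_addr > mb->translated_addr) -
           (ma->translated_addr < mb->translated_addr);
}

typedef struct {
    GTree *iova_taddr_map;      /* existing tree, keyed by IOVA */
    GTree *taddr_iova_map;      /* extra tree, keyed by translated_addr */
} ReverseTreeSketch;

static void sketch_insert(ReverseTreeSketch *t, Map *m)
{
    /* every mapping is indexed twice; both trees point at the same Map */
    g_tree_insert(t->iova_taddr_map, m, m);
    g_tree_insert(t->taddr_iova_map, m, m);
}

The cost is keeping both trees in sync on every map/unmap, which seems to be
why the early RFCs dropped it for simplicity.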

Agr

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-07 Thread Si-Wei Liu




On 5/1/2024 11:18 PM, Eugenio Perez Martin wrote:

On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu  wrote:



On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:

On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer  wrote:


On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:


On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:

On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
include/qemu/iova-tree.h | 5 +++--
util/iova-tree.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
hwaddr iova;
hwaddr translated_addr;
hwaddr size;/* Inclusive */
+uint64_t id;
IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;
typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
 * @map: the mapping to search
 *
 * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
 *
 * Return: DMAMap pointer if found, or NULL if not found.  Note that
 * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

needle = args->needle;
if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression w

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-01 Thread Si-Wei Liu




On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:

On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:



On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:


On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:

On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
   include/qemu/iova-tree.h | 5 +++--
   util/iova-tree.c | 3 ++-
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
   hwaddr iova;
   hwaddr translated_addr;
   hwaddr size;/* Inclusive */
+uint64_t id;
   IOMMUAccessFlags perm;
   } QEMU_PACKED DMAMap;
   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
* @map: the mapping to search
*
* Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
*
* Return: DMAMap pointer if found, or NULL if not found.  Note that
* the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

   needle = args->needle;
   if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Agreed, yeap we can use memory_region_from_host for now.  Any reason why
reverse IOVATree was dropped, lack of use

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-01 Thread Si-Wei Liu




On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:

On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer  wrote:



On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:



On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:


On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
   include/qemu/iova-tree.h | 5 +++--
   util/iova-tree.c | 3 ++-
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
   hwaddr iova;
   hwaddr translated_addr;
   hwaddr size;/* Inclusive */
+uint64_t id;
   IOMMUAccessFlags perm;
   } QEMU_PACKED DMAMap;
   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
* @map: the mapping to search
*
* Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
*
* Return: DMAMap pointer if found, or NULL if not found.  Note that
* the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

   needle = args->needle;
   if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did 

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-29 Thread Si-Wei Liu




On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:



On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:


On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
  include/qemu/iova-tree.h | 5 +++--
  util/iova-tree.c | 3 ++-
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
  hwaddr iova;
  hwaddr translated_addr;
  hwaddr size;/* Inclusive */
+uint64_t id;
  IOMMUAccessFlags perm;
  } QEMU_PACKED DMAMap;
  typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
   * @map: the mapping to search
   *
   * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
   *
   * Return: DMAMap pointer if found, or NULL if not found.  Note that
   * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

  needle = args->needle;
  if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Agreed, yeap we can use memory_region_from_host for now.  Any reason why
reverse IOVATree was dropped, lack of users? But now we have one!


No, it is just simplicity. We already have an user in the hot patch in
the master branch, vhost_svq

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-25 Thread Si-Wei Liu




On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:



On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:


On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
 include/qemu/iova-tree.h | 5 +++--
 util/iova-tree.c | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
 hwaddr iova;
 hwaddr translated_addr;
 hwaddr size;/* Inclusive */
+uint64_t id;
 IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
 typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
  * @map: the mapping to search
  *
  * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
  *
  * Return: DMAMap pointer if found, or NULL if not found.  Note that
  * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

 needle = args->needle;
 if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.




But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Agreed, yeap we can use memory_region_from_host for now.  Any reason why
reverse IOVATree was dropped, lack of users? But now we have one!


No, it is just simplicity. We already have an user in the hot patch in
the master branch, vhost_svq_vring_write_descs. But I never profiled
enough to find if it is a bottleneck or not to be honest.
Right, wi

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-23 Thread Si-Wei Liu




On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:



On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:


On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
include/qemu/iova-tree.h | 5 +++--
util/iova-tree.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
hwaddr iova;
hwaddr translated_addr;
hwaddr size;/* Inclusive */
+uint64_t id;
IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;
typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
 * @map: the mapping to search
 *
 * Search for a mapping in the iova tree that translated_addr overlaps with 
the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
 *
 * Return: DMAMap pointer if found, or NULL if not found.  Note that
 * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

needle = args->needle;
if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.
Oh Sorry, I misread the code and I should look for g_tree_foreach () 
instead of g_tree_search_node(). So the former is indeed linear 
iteration, but it looks to be ordered?


https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.
Agreed, yeap we can use memory_region_from_host for now.  Any reason why 
reverse IOVATree was dropped, lack of users? But now we have one!


Thanks,
-Siwei


Thanks!


Of course,
memory_region_from_host() won't search out of the guest memory space for
sure. As this could be on the hot data path I have a little bit
hesitance over the potential cost or performance regression this change
could bring in, but maybe I'm overthinking it too much...

Thanks,
-Siwei


Thanks,
-Siwei

return false;
}






Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-19 Thread Si-Wei Liu




On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:



On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
   include/qemu/iova-tree.h | 5 +++--
   util/iova-tree.c | 3 ++-
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
   hwaddr iova;
   hwaddr translated_addr;
   hwaddr size;/* Inclusive */
+uint64_t id;
   IOMMUAccessFlags perm;
   } QEMU_PACKED DMAMap;
   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
* @map: the mapping to search
*
* Search for a mapping in the iova tree that translated_addr overlaps with 
the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
*
* Return: DMAMap pointer if found, or NULL if not found.  Note that
* the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

   needle = args->needle;
   if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...
Yeah, that will be another means of doing translation without having to 
complicate the API around iova_tree. I wonder how the lookup through 
memory_region_from_host() may perform compared to the iova tree one, the 
former looks to be an O(N) linear search on a linked list while the 
latter would be roughly O(log N) on an AVL tree? Of course, 
memory_region_from_host() won't search out of the guest memory space for 
sure. As this could be on the hot data path I have a little bit 
hesitance over the potential cost or performance regression this change 
could bring in, but maybe I'm overthinking it too much...


Thanks,
-Siwei




Thanks,
-Siwei

   return false;
   }






Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-18 Thread Si-Wei Liu




On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
  include/qemu/iova-tree.h | 5 +++--
  util/iova-tree.c | 3 ++-
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
  hwaddr iova;
  hwaddr translated_addr;
  hwaddr size;/* Inclusive */
+uint64_t id;
  IOMMUAccessFlags perm;
  } QEMU_PACKED DMAMap;
  typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
   * @map: the mapping to search
   *
   * Search for a mapping in the iova tree that translated_addr overlaps with 
the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
   *
   * Return: DMAMap pointer if found, or NULL if not found.  Note that
   * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,
  
  needle = args->needle;

  if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {


It looks this iterator can also be invoked by SVQ from 
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA 
space will be searched on without passing in the ID (GPA), and exact 
match for the same GPA range is not actually needed unlike the mapping 
removal case. Could we create an API variant, for the SVQ lookup case 
specifically? Or alternatively, add a special flag, say skip_id_match to 
DMAMap, and the id match check may look like below:


(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or 
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


Thanks,
-Siwei

  return false;
  }
  





Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-03 Thread Si-Wei Liu



On 4/2/2024 5:01 AM, Eugenio Perez Martin wrote:

On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu  wrote:



On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the detail yet, thinking you would need to be
aware of ahead.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this?

Didn't get a chance to look into the detail yet in the past week, but
thought it may have something to do with the (internals of) iova tree
range allocation and the lookup routine. It started to fall apart at the
first vhost_vdpa_dma_unmap call showing up in the trace events, where it
should've gotten IOVA=0x201000, but an incorrect IOVA address
0x1000 ended up being returned from the iova tree lookup routine.

HVA                        GPA                      IOVA
---------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)   [0x0, 0x8000)            [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)   [0x1, 0x208000)          [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)   [0xfeda, 0xfedc)         [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)   [0xfeda, 0xfedc)         [0x1000, 0x2) ???
                                                    shouldn't it be [0x201000, 0x221000) ???


It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
which is called from vhost_vdpa_listener_region_del(), can't properly
deal with overlapped regions. Specifically, q35's mch_realize() has the
following:

579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
580                          mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
581                          MCH_HOST_BRIDGE_SMRAM_C_SIZE);
582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
583                                     &mch->open_high_smram, 1);
584 memory_region_set_enabled(&mch->open_high_smram, false);

#0  0x564c30bf6980 in iova_tree_find_address_iterator
    (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at
    ../util/iova-tree.c:96
#1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
#2  0x564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>,
    map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
#3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized out>,
    map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
#4  0x564c3085e49d in vhost_vdpa_listener_region_del
    (listener=0x564c331024c8, section=0x7fffb6d74aa0) at
    ../hw/virtio/vhost-vdpa.c:444
#5  0x564c309f4931 in address_space_update_topology_pass
    (as=as@entry=0x564c31ab1840 <address_space_memory>,
    old_view=old_view@entry=0x564c33364cc0,
    new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at
    ../system/memory.c:977
#6  0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840
    <address_space_memory>) at ../system/memory.c:1079
#7  0x564c309f86d0 in memory_region_transaction_commit () at
    ../system/memory.c:1132
#8  0x564c309f86d0 in memory_region_transaction_commit () at
    ../system/memory.c:1117
#9  0x564c307cce64 in mch_realize (d=<optimized out>,
    errp=<optimized out>) at ../hw/pci-host/q35.c:584

However, it looks like iova_tree_find_address_iterator() only checks whether
the translated address (HVA) falls into the range when trying to locate
the desired IOVA, causing the first DMAMap that happens to overlap in
the translated address (HVA) space to be returned prematurely:

   89 static gboolean iova_tree_find_address_iterator(gpointer key,
gpointer value,
   90 gpointer data)
   91 {
   :
   :
   99 if (map->translated_addr + map->size < needle->translated_addr ||
100 needle->translated_addr + needle->size < map->translated_addr) {
101 return false;
102 }
103
104 args->result = map;
105 return true;
106 }

In the QEMU trace file, it reveals that the first DMAMap as below gets
returned incorrectly instead of the second, the latter of which is what the
actual IOVA corresponds to:

HVA                        GPA                      IOVA
[0x7f7903e0, 0x7f7983e0)   [0x0, 0x8000)            [0x1000, 0x80001000)
[0x7f7903ea, 0x7f7903ec)   [0xfeda, 0xfedc)         [0x201000, 0x221000)


I think the analysis is totally accurate as no code expects to unmap /
map overlapping regions. 

Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-02 Thread Si-Wei Liu




On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the detail yet, thinking you would need to be
aware of ahead.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this?

Didn't get a chance to look into the detail yet in the past week, but
thought it may have something to do with the (internals of) iova tree
range allocation and the lookup routine. It started to fall apart at the
first vhost_vdpa_dma_unmap call showing up in the trace events, where it
should've gotten IOVA=0x201000, but an incorrect IOVA address
0x1000 ended up being returned from the iova tree lookup routine.

HVA                        GPA                      IOVA
---------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)   [0x0, 0x8000)            [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)   [0x1, 0x208000)          [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)   [0xfeda, 0xfedc)         [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)   [0xfeda, 0xfedc)         [0x1000, 0x2) ???
                                                    shouldn't it be [0x201000, 0x221000) ???

It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
which is called from vhost_vdpa_listener_region_del(), can't properly
deal with overlapped regions. Specifically, q35's mch_realize() has the
following:


579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
580                          mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
581                          MCH_HOST_BRIDGE_SMRAM_C_SIZE);
582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
583                                     &mch->open_high_smram, 1);
584 memory_region_set_enabled(&mch->open_high_smram, false);

#0  0x564c30bf6980 in iova_tree_find_address_iterator
    (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at
    ../util/iova-tree.c:96
#1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
#2  0x564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>,
    map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
#3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized out>,
    map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
#4  0x564c3085e49d in vhost_vdpa_listener_region_del
    (listener=0x564c331024c8, section=0x7fffb6d74aa0) at
    ../hw/virtio/vhost-vdpa.c:444
#5  0x564c309f4931 in address_space_update_topology_pass
    (as=as@entry=0x564c31ab1840 <address_space_memory>,
    old_view=old_view@entry=0x564c33364cc0,
    new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at
    ../system/memory.c:977
#6  0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840
    <address_space_memory>) at ../system/memory.c:1079
#7  0x564c309f86d0 in memory_region_transaction_commit () at
    ../system/memory.c:1132
#8  0x564c309f86d0 in memory_region_transaction_commit () at
    ../system/memory.c:1117
#9  0x564c307cce64 in mch_realize (d=<optimized out>,
    errp=<optimized out>) at ../hw/pci-host/q35.c:584


However, it looks like iova_tree_find_address_iterator() only checks whether
the translated address (HVA) falls into the range when trying to locate
the desired IOVA, causing the first DMAMap that happens to overlap in 
the translated address (HVA) space to be returned prematurely:


 89 static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

 90 gpointer data)
 91 {
 :
 :
 99 if (map->translated_addr + map->size < needle->translated_addr ||
100 needle->translated_addr + needle->size < map->translated_addr) {
101 return false;
102 }
103
104 args->result = map;
105 return true;
106 }

In the QEMU trace file, it reveals that the first DMAMap as below gets 
returned incorrectly instead of the second, the latter of which is what the
actual IOVA corresponds to:


HVA                        GPA                      IOVA
[0x7f7903e0, 0x7f7983e0)   [0x0, 0x8000)            [0x1000, 0x80001000)
[0x7f7903ea, 0x7f7903ec)   [0xfeda, 0xfedc)         [0x201000, 0x221000)


Maybe, other than checking the HVA range, we should also match the GPA, or at
least the size should exactly match?
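To make that concrete, a hypothetical tightened check along those lines,
assuming a GPA-carrying field like the id member from the [RFC 1/2] patch at
the top of this page (dma_map_matches is an illustrative helper, not existing
code):

static bool dma_map_matches(const DMAMap *map, const DMAMap *needle)
{
    bool hva_overlaps =
        !(map->translated_addr + map->size < needle->translated_addr ||
          needle->translated_addr + needle->size < map->translated_addr);

    return hva_overlaps &&
           map->id == needle->id &&      /* same GPA */
           map->size == needle->size;    /* exact size match */
}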



Yes, I'm still not able to reproduce. In particular, I don't know how

Re: [External] : Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-25 Thread Si-Wei Liu




On 3/24/2024 11:13 PM, Jason Wang wrote:

On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu  wrote:



On 3/21/2024 10:08 PM, Jason Wang wrote:

On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:


On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:

On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:

On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
of queues that are set up, e.g. on idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* once per queue configured with vhost that
exists in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
- add comment to clarify effect on cache locality and
  performance

v2 -> v3:
- add after-fix benchmark to commit log
- rename vhost_log_dev_enabled to vhost_dev_should_log
- remove unneeded comparisons for backend_type
- use QLIST array instead of single flat list to store vhost
  logger devices
- simplify logger election logic
---
   hw/virtio/vhost.c | 67 
++-
   include/hw/virtio/vhost.h |  1 +
   2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an 
fd). */
   static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
   }
   }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
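The hunks that actually add and remove devices from vhost_log_devs[] are
truncated in this quote; purely as an illustration, an election along the
lines the commit message describes could look like the sketch below
(the function name and the logdev_entry QLIST_ENTRY field are assumptions,
not taken from the posted hunks):

static void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
{
    VhostBackendType backend_type = hdev->vhost_ops->backend_type;

    if (add) {
        /* head of the per-backend list acts as the mem-section logger */
        QLIST_INSERT_HEAD(&vhost_log_devs[backend_type], hdev, logdev_entry);
    } else {
        QLIST_REMOVE(hdev, logdev_entry);
    }
}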

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead will have to depend on the specific
backend's implementation of .vhost_requires_shm_log(), whose current
implementation the common vhost layer should not assume or rely on.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
 return dev->vhost_ops->vhost_requires_shm_log &&
dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can end up with a boolean to record that
instead of querying the ops?
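For illustration, a sketch of that boolean-caching idea (log_is_shm is a
hypothetical field, not an existing vhost_dev member):

static void vhost_dev_record_log_type(struct vhost_dev *dev)
{
    /* cache the answer once, e.g. at init, instead of calling the
     * backend op from every log_sync */
    dev->log_is_shm = dev->vhost_ops->vhost_requires_shm_log &&
                      dev->vhost_ops->vhost_requires_shm_log(dev);
}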

Right now the log type won't change during runtime, but I am not sure if
this may prohibit a future revisit to allow changing it at runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involved to maintain the state.

Other than this,

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-22 Thread Si-Wei Liu




On 3/21/2024 10:08 PM, Jason Wang wrote:

On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:



On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:


On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:

On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
of queues that are set up, e.g. on idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* once per queue configured with vhost that
exists in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
   - add comment to clarify effect on cache locality and
 performance

v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic
---
  hw/virtio/vhost.c | 67 
++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an 
fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simply check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, doesn't it?

Whether this has low overhead will depend on the specific backend's
implementation of .vhost_requires_shm_log(), which the common vhost
layer should not make assumptions about or rely on.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
return dev->vhost_ops->vhost_requires_shm_log &&
   dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can endup with a boolean to record that
instead of a query ops?

Right now the log type won't change during runtime, but I am not sure if
this may prohibit a future revisit to allow changes at runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involved to maintain the state.

Other than this, I think it's insufficient to just check the shm log
vs. normal log. The logger device needs to identify a

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-21 Thread Si-Wei Liu




On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:



On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:


On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
  - add comment to clarify effect on cache locality and
performance

v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic
---
 hw/virtio/vhost.c | 67 
++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

 /* Memslots used by backends that support private memslots (without an 
fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simply check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, doesn't it?

Whether this has low overhead will depend on the specific backend's
implementation of .vhost_requires_shm_log(), which the common vhost
layer should not make assumptions about or rely on.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
   return dev->vhost_ops->vhost_requires_shm_log &&
  dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can endup with a boolean to record that
instead of a query ops?

Right now the log type won't change during runtime, but I am not sure if
this may prohibit a future revisit to allow changes at runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involved to maintain the state.

Other than this, I think it's insufficient to just check the shm log
vs. normal log. The logger device needs to identify a leading logger
device that gets elected in vhost_dev_elect_mem_logger(), as all the
dev->log points to the same 

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-20 Thread Si-Wei Liu




On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:



On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:


On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
 - add comment to clarify effect on cache locality and
   performance

v2 -> v3:
 - add after-fix benchmark to commit log
 - rename vhost_log_dev_enabled to vhost_dev_should_log
 - remove unneeded comparisons for backend_type
 - use QLIST array instead of single flat list to store vhost
   logger devices
 - simplify logger election logic
---
hw/virtio/vhost.c | 67 
++-
include/hw/virtio/vhost.h |  1 +
2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

/* Memslots used by backends that support private memslots (without an fd). 
*/
static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
}
}

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simply check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, doesn't it?

Whether this has low overhead will depend on the specific backend's
implementation of .vhost_requires_shm_log(), which the common vhost
layer should not make assumptions about or rely on.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
  return dev->vhost_ops->vhost_requires_shm_log &&
 dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can endup with a boolean to record that
instead of a query ops?
Right now the log type won't change during runtime, but I am not sure if 
this may prohibit a future revisit to allow changes at runtime, then
there'll be complex code involved to maintain the state.


Other than this, I think it's insufficient to just check the shm log 
vs. normal log. The logger device needs to identify a leading logger
device that gets elected in vhost_dev_elect_mem_logger(), as all the
dev->log pointers point to the same reference-counted logger, so we
would have to add an extra field and complex logic to maintain the election
status. I thought that Eugenio's previous suggesti

Re: [PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-20 Thread Si-Wei Liu




On 3/19/2024 8:25 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:06 AM Si-Wei Liu  wrote:



On 3/17/2024 8:20 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:33 AM Si-Wei Liu  wrote:


On 3/14/2024 8:50 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

It's better to describe what's the advantage of doing this.

Yes, I can add that to the log. Although it's a niche use case, it was
actually a long standing limitation / bug that vhost-user and
vhost-kernel loggers can't co-exist per QEMU process, but today it's
just silent failure that may be ended up with. This bug fix removes that
implicit limitation in the code.

Ok.


Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
 - remove checking NULL return value from vhost_log_get

v2->v3:
  - remove non-effective assertion that can never be reached
  - do not return NULL from vhost_log_get()
  - add necessary assertions to vhost_log_get()
---
hw/virtio/vhost.c | 45 +
1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
do { } while (0)
#endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

/* Memslots used by backends that support private memslots (without an fd). 
*/
static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
r = -1;
}

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+

Under which condition could we hit this?

Just in case some other function inadvertently corrupted this earlier,
we have to capture the discrepancy in the first place... On the other hand,
it will be helpful for other vhost backend writers to diagnose day-one
bugs in the code. I feel just a code comment here will not be
sufficient/helpful.

See below.


It seems not good to assert a local logic.

It seems to me quite a few local asserts are in the same file already,
vhost_save_backend_state,

For example it has assert for

assert(!dev->started);

which is not the logic of the function itself but require
vhost_dev_start() not to be called before.

But it looks like this patch you assert the code just a few lines
above the assert itself?

Yes, that was the intent - for e.g. xxx_ops may contain corrupted
xxx_ops.backend_type already before coming to this
vhost_set_backend_type() function. And we may capture this corrupted
state by asserting the expected xxx_ops.backend_type (to be consistent
with the backend_type passed in),

This can happen for all variables. Not sure why backend_ops is special.
The assert is just checking the backend_type field. The other op
fields in backend_ops have similar asserts within the op functions
themselves. For e.g. vhost_user_requires_shm_log() and a lot of other
vhost_user ops have the following:


    assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);

vhost_vdpa_vq_get_addr() and a lot of other vhost_vdpa ops have:

    assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);

vhost_kernel ops has similar assertions as well.

The reason why it has to be checked here is that the callers of
vhost_log_get() now pass dev->vhost_ops->backend_type to the API and
are unable to verify the validity of the backend_type by themselves.
vhost_log_get() has the necessary asserts to bound-check the
vhost_log[] and vhost_log_shm[] arrays, but a specific assert against
the exact backend type in vhost_set_backend_type() will further
harden the implementation of vhost_log_get() and the other backend ops.





which needs to be done in the first place
when this discrepancy is detected. In practice I think there should be
no harm in adding this assert, and it will add a warranted guarantee to the
current code.

For example, such corruption can happen after the assert() so a TOCTOU issue.
Sure, it's best effort only. As pointed out earlier, I think together 
with this, there are other similar asserts already in various backend 
ops, which could be helpful to nail down the earliest point or a 
specific range where things may go wrong in the first place.


Thanks,
-Siwei



Thanks


Regards,
-Siwei


dev->vhost_ops = _ops;

...

assert(dev->vhost_ops->backend_type == backend_type)

?

Thanks


vhost_load_backend_state,
vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why is a local
assert a problem?

Thanks,
-Siwei


Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-18 Thread Si-Wei Liu




On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:



On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
- add comment to clarify effect on cache locality and
  performance

v2 -> v3:
- add after-fix benchmark to commit log
- rename vhost_log_dev_enabled to vhost_dev_should_log
- remove unneeded comparisons for backend_type
- use QLIST array instead of single flat list to store vhost
  logger devices
- simplify logger election logic
---
   hw/virtio/vhost.c | 67 
++-
   include/hw/virtio/vhost.h |  1 +
   2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an fd). 
*/
   static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
   }
   }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simply check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, doesn't it?
Whether this has low overhead will depend on the specific backend's
implementation of .vhost_requires_shm_log(), which the common vhost
layer should not make assumptions about or rely on.




static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
 return dev->vhost_ops->vhost_requires_shm_log &&
dev->vhost_ops->vhost_requires_shm_log(dev);
}

And it helps to simplify the logic.
Generally yes, but when it comes to hot-path operations the performance
consideration could override this principle. I think there's no harm in
checking against the logger device cached in the vhost layer itself, and
the current patch does not create a lot of complexity or performance side
effects (actually I think the conditional should be very straightforward
to turn into just a couple of assembly compare-and-branch instructions
rather than an indirection through another jmp call).


-Siwei



Thanks


-Siwei

?

Thanks






Re: [PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-18 Thread Si-Wei Liu




On 3/17/2024 8:20 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:33 AM Si-Wei Liu  wrote:



On 3/14/2024 8:50 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

It's better to describe what's the advantage of doing this.

Yes, I can add that to the log. Although it's a niche use case, it was
actually a long standing limitation / bug that vhost-user and
vhost-kernel loggers can't co-exist per QEMU process, but today it's
just silent failure that may be ended up with. This bug fix removes that
implicit limitation in the code.

Ok.


Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
- remove checking NULL return value from vhost_log_get

v2->v3:
- remove non-effective assertion that can never be reached
- do not return NULL from vhost_log_get()
- add necessary assertions to vhost_log_get()
---
   hw/virtio/vhost.c | 45 +
   1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
   do { } while (0)
   #endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an fd). 
*/
   static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
   r = -1;
   }

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+

Under which condition could we hit this?

Just in case some other function inadvertently corrupted this earlier,
we have to capture the discrepancy in the first place... On the other hand,
it will be helpful for other vhost backend writers to diagnose day-one
bugs in the code. I feel just a code comment here will not be
sufficient/helpful.

See below.


   It seems not good to assert a local logic.

It seems to me quite a few local asserts are in the same file already,
vhost_save_backend_state,

For example it has assert for

assert(!dev->started);

which is not the logic of the function itself but require
vhost_dev_start() not to be called before.

But it looks like this patch you assert the code just a few lines
above the assert itself?
Yes, that was the intent - for e.g. xxx_ops may contain corrupted 
xxx_ops.backend_type already before coming to this 
vhost_set_backend_type() function. And we may capture this corrupted 
state by asserting the expected xxx_ops.backend_type (to be consistent 
with the backend_type passed in), which needs to be done in the first place
when this discrepancy is detected. In practice I think there should be
no harm in adding this assert, and it will add a warranted guarantee to the
current code.


Regards,
-Siwei



dev->vhost_ops = _ops;

...

assert(dev->vhost_ops->backend_type == backend_type)

?

Thanks


vhost_load_backend_state,
vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why is a local
assert a problem?

Thanks,
-Siwei


Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-15 Thread Si-Wei Liu




On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
   - add comment to clarify effect on cache locality and
 performance

v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic
---
  hw/virtio/vhost.c | 67 ++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simply check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
Because we are not sure if the logger comes from vhost_log_shm[] or 
vhost_log[]. Don't want to complicate the check here by calling into 
vhost_dev_log_is_shared() every time .log_sync() is called.


-Siwei

?

Thanks






Re: [PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-15 Thread Si-Wei Liu




On 3/14/2024 8:50 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

It's better to describe what's the advantage of doing this.
Yes, I can add that to the log. Although it's a niche use case, it was 
actually a long standing limitation / bug that vhost-user and 
vhost-kernel loggers can't co-exist per QEMU process, but today it's 
just a silent failure that one may end up with. This bug fix removes that
implicit limitation in the code.
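For reference, a rough sketch of the idea (hypothetical demo_* names and simplified allocation, not the actual QEMU implementation): keeping one logger slot per backend type lets a vhost-kernel logger and a vhost-user logger be held in the same process without clobbering each other.

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical backend types; QEMU's real enum is VhostBackendType. */
enum demo_backend { DEMO_BACKEND_KERNEL, DEMO_BACKEND_USER, DEMO_BACKEND_MAX };

struct demo_log {
    unsigned refcnt;
    size_t size;
};

/* One slot per backend type instead of a single process-wide pointer. */
static struct demo_log *demo_logs[DEMO_BACKEND_MAX];

static struct demo_log *demo_log_get(enum demo_backend type, size_t size)
{
    assert(type < DEMO_BACKEND_MAX);

    if (!demo_logs[type] || demo_logs[type]->size != size) {
        struct demo_log *log = calloc(1, sizeof(*log));
        assert(log);
        log->size = size;
        log->refcnt = 1;
        demo_logs[type] = log;   /* the old log, if any, is put/freed separately */
    } else {
        ++demo_logs[type]->refcnt;
    }
    return demo_logs[type];
}

int main(void)
{
    /* A vhost-kernel and a vhost-user device can now hold loggers at once. */
    struct demo_log *k = demo_log_get(DEMO_BACKEND_KERNEL, 4096);
    struct demo_log *u = demo_log_get(DEMO_BACKEND_USER, 4096);
    return (k != u) ? 0 : 1;
}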



Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
   - remove checking NULL return value from vhost_log_get

v2->v3:
   - remove non-effective assertion that can never be reached
   - do not return NULL from vhost_log_get()
   - add necessary assertions to vhost_log_get()
---
  hw/virtio/vhost.c | 45 +
  1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
  do { } while (0)
  #endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
  r = -1;
  }

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+

Under which condition could we hit this?
Just in case some other function inadvertently corrupted this earlier, 
we have to capture the discrepancy in the first place... On the other hand,
it will be helpful for other vhost backend writers to diagnose day-one
bugs in the code. I feel just a code comment here will not be
sufficient/helpful.



  It seems not good to assert a local logic.
It seems to me quite a few local asserts are in the same file already, 
vhost_save_backend_state, vhost_load_backend_state, 
vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why is a local
assert a problem?


Thanks,
-Siwei


Thanks






[PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
  - add comment to clarify effect on cache locality and
performance

v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic
---
 hw/virtio/vhost.c | 67 ++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ * This is done in order to get better cache locality and to avoid
+ * performance churn on the hot path for log scanning. Even when
+ * new devices come and go quickly, it wouldn't end up changing
+ * the active leading logger device at all.
+ */
+QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +208,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vhost_memory_region *reg = dev->mem->regions + i;
-   

[PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-14 Thread Si-Wei Liu
There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
  - remove checking NULL return value from vhost_log_get

v2->v3:
  - remove non-effective assertion that can never be reached
  - do not return NULL from vhost_log_get()
  - add necessary assertions to vhost_log_get()
---
 hw/virtio/vhost.c | 45 +
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
 do { } while (0)
 #endif
 
-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
 r = -1;
 }
 
+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+
 return r;
 }
 
@@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
 return log;
 }
 
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
 {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
 
 if (!log || log->size != size) {
 log = vhost_log_alloc(size, share);
 if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
 } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
 }
 } else {
 ++log->refcnt;
@@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
 static void vhost_log_put(struct vhost_dev *dev, bool sync)
 {
 struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
 
 if (!log) {
 return;
 }
 
+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
 --log->refcnt;
 if (log->refcnt == 0) {
 /* Sync only the range covered by the old log */
@@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
 vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
 }
 
-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
 g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
 qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
 log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
 }
 
 g_free(log);
@@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
 
 static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
 {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
 uint64_t log_base = (uintptr_t)log->log;
 int r;
 
@@ -2037,7 +2057,8 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
 uint64_t log_base;
 
 hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
   vhost_dev_log_is_shared(hdev));
 log_base = (uintptr_t)hdev->log->log;
 r = hdev->vhost_ops->vhost_set_log_base(hdev,
-- 
1.8.3.1




Re: [PATCH v3 1/2] vhost: dirty log should be per backend type

2024-03-14 Thread Si-Wei Liu




On 3/14/2024 8:25 AM, Eugenio Perez Martin wrote:

On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
v2->v3:
   - remove non-effective assertion that can never be reached
   - do not return NULL from vhost_log_get()
   - add necessary assertions to vhost_log_get()

---
  hw/virtio/vhost.c | 50 ++
  1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..efe2f74 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
  do { } while (0)
  #endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
  r = -1;
  }

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+
  return r;
  }

@@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
  return log;
  }

-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
  {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];

  if (!log || log->size != size) {
  log = vhost_log_alloc(size, share);
  if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
  } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
  }
  } else {
  ++log->refcnt;
@@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
  static void vhost_log_put(struct vhost_dev *dev, bool sync)
  {
  struct vhost_log *log = dev->log;
+VhostBackendType backend_type;

  if (!log) {
  return;
  }

+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
  --log->refcnt;
  if (log->refcnt == 0) {
  /* Sync only the range covered by the old log */
@@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
  vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
  }

-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
  g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
  qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
  log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
  }

  g_free(log);
@@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)

  static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
  {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
  uint64_t log_base = (uintptr_t)log->log;
  int r;

@@ -2037,8 +2057,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
  uint64_t log_base;

  hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {

I thought vhost_log_get couldn't return NULL :).

Sure, missed that. Will post a revised v4.

-Siwei


Other than that,

Acked-by: Eugenio Pérez 


+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
  log_base = (uintptr_t)hdev->log->log;
  r = hdev->

Re: [PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu




On 3/14/2024 8:34 AM, Eugenio Perez Martin wrote:

On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic

---
  hw/virtio/vhost.c | 63 ++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index efe2f74..d91858b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ */

Why is changing the logger device a problem? All the code paths are
either changing the QLIST or logging, isn't it?
Changing the logger device doesn't affect functionality for sure, but may
have an inadvertent effect on cache locality, which is particularly relevant
to the log scanning process in the hot path. The code makes sure there's no
churn in the leading logger selection as a result of adding a new vhost
device, unless the selected logger device goes away and a re-election
of another logger is needed.
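A small standalone demo of that behavior, using the BSD <sys/queue.h> LIST macros that QEMU's QLIST macros mirror (device names here are made up): the elected head only changes when the head itself is removed, no matter how many devices are added afterwards.

#include <stdio.h>
#include <sys/queue.h>

struct dev {
    const char *name;
    LIST_ENTRY(dev) entry;
};

static LIST_HEAD(, dev) loggers = LIST_HEAD_INITIALIZER(loggers);

static void add_dev(struct dev *d)
{
    if (LIST_EMPTY(&loggers)) {
        LIST_INSERT_HEAD(&loggers, d, entry);               /* first device is elected */
    } else {
        LIST_INSERT_AFTER(LIST_FIRST(&loggers), d, entry);  /* keep the current head */
    }
}

int main(void)
{
    struct dev a = { "vhost0" }, b = { "vhost1" }, c = { "vhost2" };

    add_dev(&a);
    add_dev(&b);
    add_dev(&c);
    printf("elected: %s\n", LIST_FIRST(&loggers)->name);    /* still vhost0 */

    LIST_REMOVE(&a, entry);                                 /* elected logger goes away */
    printf("re-elected: %s\n", LIST_FIRST(&loggers)->name); /* vhost2, the entry next to the old head */
    return 0;
}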


-Siwei




+QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
  static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 MemoryRegionSection *section,
 hwaddr first,
@@ -166,12 +204,14 @@ static 

[PATCH v3 1/2] vhost: dirty log should be per backend type

2024-03-14 Thread Si-Wei Liu
There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
v2->v3: 
  - remove non-effective assertion that can never be reached
  - do not return NULL from vhost_log_get()
  - add necessary assertions to vhost_log_get()

---
 hw/virtio/vhost.c | 50 ++
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..efe2f74 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
 do { } while (0)
 #endif
 
-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
 r = -1;
 }
 
+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+
 return r;
 }
 
@@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
 return log;
 }
 
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
 {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
 
 if (!log || log->size != size) {
 log = vhost_log_alloc(size, share);
 if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
 } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
 }
 } else {
 ++log->refcnt;
@@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
 static void vhost_log_put(struct vhost_dev *dev, bool sync)
 {
 struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
 
 if (!log) {
 return;
 }
 
+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
 --log->refcnt;
 if (log->refcnt == 0) {
 /* Sync only the range covered by the old log */
@@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
 vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
 }
 
-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
 g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
 qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
 log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
 }
 
 g_free(log);
@@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
 
 static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
 {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
 uint64_t log_base = (uintptr_t)log->log;
 int r;
 
@@ -2037,8 +2057,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
 uint64_t log_base;
 
 hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
   vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {
+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
 log_base = (uintptr_t)hdev->log->log;
 r = hdev->vhost_ops->vhost_set_log_base(hdev,
 hdev->log_size ? log_base : 0,
-- 
1.8.3.1




[PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost with the number of
queues set up, e.g. on idle guests during migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* as many times as the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic

---
 hw/virtio/vhost.c | 63 ++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index efe2f74..d91858b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ */
+QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +204,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vhost_memory_region *reg = dev->mem->regions + i;
-vhost_dev_sync_region(dev, section, start_addr, end_addr,
-  reg->guest_phys_addr,
-  range_get_last(reg->guest_phys_addr,
- reg->memory_size));
+if (vhost_dev_should_log(dev)) {
+for (i = 0; i < dev->mem->nregions; ++i) {
+struct vhost_m

Re: [PATCH v2 1/2] vhost: dirty log should be per backend type

2024-03-13 Thread Si-Wei Liu




On 3/12/2024 8:07 AM, Michael S. Tsirkin wrote:

On Wed, Feb 14, 2024 at 10:42:29AM -0800, Si-Wei Liu wrote:

Hi Michael,

I'm taking off for 2+ weeks, but please feel free to provide comment and
feedback while I'm off. I'll be checking emails still, and am about to
address any opens as soon as I am back.

Thanks,
-Siwei

Eugenio sent some comments. I don't have more, just address these
please. Thanks!


Thanks Michael, good to know you don't have more comments other than the
ones from Eugenio. I will post a v3 shortly to address his comments.


-Siwei



Re: [PATCH 12/12] vdpa: fix network breakage after cancelling migration

2024-03-13 Thread Si-Wei Liu




On 3/13/2024 11:12 AM, Michael Tokarev wrote:

14.02.2024 14:28, Si-Wei Liu wrote:

Fix an issue where cancellation of an ongoing migration ends up
with no network connectivity.

When canceling migration, SVQ will be switched back to
passthrough mode, but the right call fd is not programmed into
the device and the svq's own call fd is still used. At this
point of the transition, shadow_vqs_enabled has not been set
back to false yet, causing the installation of the call fd to
be inadvertently bypassed.

Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding 
capabilities")

Cc: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
  hw/virtio/vhost-vdpa.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)


Is this a -stable material?
Probably yes, the pre-requisites of this patch are PATCH #10 and #11 
from this series (where SVQ_TSTATE_DISABLING gets defined and set).




If yes, is it also applicable for stable-7.2 (mentioned commit is in 
7.2.0),

which lacks v7.2.0-2327-gb276524386 "vdpa: Remember last call fd set",
or should this one also be picked up?
Eugenio can judge, but it seems to me the relevant code path cannot be
effectively reached, as the dynamic SVQ feature (switching over to SVQ
dynamically when migration is started) is not supported on 7.2. It is
probably not worth cherry-picking this one to 7.2. Cherry-picking to
stable-8.0 and above should be applicable, though (it needs some tweaks
on patch #10 to move svq_switching from @struct VhostVDPAShared to
@struct vhost_vdpa).


Regards,
-Siwei



Thanks,

/mjt


diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 004110f..dfeca8b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1468,7 +1468,15 @@ static int vhost_vdpa_set_vring_call(struct 
vhost_dev *dev,
    /* Remember last call fd because we can switch to SVQ 
anytime. */

  vhost_svq_set_svq_call_fd(svq, file->fd);
-    if (v->shadow_vqs_enabled) {
+    /*
+ * When SVQ is transitioning to off, shadow_vqs_enabled has
+ * not been set back to false yet, but the underlying call fd
+ * will have to switch back to the guest notifier to signal the
+ * passthrough virtqueues. In other situations, SVQ's own call
+ * fd shall be used to signal the device model.
+ */
+    if (v->shadow_vqs_enabled &&
+    v->shared->svq_switching != SVQ_TSTATE_DISABLING) {
  return 0;
  }







Re: [PATCH 04/12] vdpa: factor out vhost_vdpa_net_get_nc_vdpa

2024-02-14 Thread Si-Wei Liu




On 2/14/2024 10:54 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 1:39 PM Si-Wei Liu  wrote:

Introduce new API. No functional change on existing API.

Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 

I'm ok with the new function, but doesn't the compiler complain
because the added static function is not used?
Hmmm, which one? vhost_vdpa_net_get_nc_vdpa is used by 
vhost_vdpa_net_first_nc_vdpa internally, and 
vhost_vdpa_net_first_nc_vdpa is used by vhost_vdpa_net_cvq_start (Patch 
01). I think we should be fine?


-Siwei



---
  net/vhost-vdpa.c | 13 +
  1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 06c83b4..4168cad 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -281,13 +281,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
  }


-/** From any vdpa net client, get the netclient of the first queue pair */
-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+/** From any vdpa net client, get the netclient of the i-th queue pair */
+static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i)
  {
  NICState *nic = qemu_get_nic(s->nc.peer);
-NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+NetClientState *nc_i = qemu_get_peer(nic->ncs, i);
+
+return DO_UPCAST(VhostVDPAState, nc, nc_i);
+}

-return DO_UPCAST(VhostVDPAState, nc, nc0);
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+return vhost_vdpa_net_get_nc_vdpa(s, 0);
  }

  static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
--
1.8.3.1






Re: [PATCH v2 1/2] vhost: dirty log should be per backend type

2024-02-14 Thread Si-Wei Liu

Hi Michael,

I'm taking off for 2+ weeks, but please feel free to provide comments and
feedback while I'm off. I'll still be checking emails, and will address
any open items as soon as I am back.


Thanks,
-Siwei

On 2/14/2024 3:50 AM, Si-Wei Liu wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
  hw/virtio/vhost.c | 49 +
  1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..ef6d9b5 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
  do { } while (0)
  #endif
  
-static struct vhost_log *vhost_log;

-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
  
  /* Memslots used by backends that support private memslots (without an fd). */

  static unsigned int used_memslots;
@@ -287,6 +287,8 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
  r = -1;
  }
  
+assert(dev->vhost_ops->backend_type == backend_type || r < 0);

+
  return r;
  }
  
@@ -319,16 +321,23 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share)

  return log;
  }
  
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)

+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
  {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX)
+return NULL;
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
  
  if (!log || log->size != size) {

  log = vhost_log_alloc(size, share);
  if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
  } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
  }
  } else {
  ++log->refcnt;
@@ -340,11 +349,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
  static void vhost_log_put(struct vhost_dev *dev, bool sync)
  {
  struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
  
  if (!log) {

  return;
  }
  
+assert(dev->vhost_ops);

+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
  --log->refcnt;
  if (log->refcnt == 0) {
  /* Sync only the range covered by the old log */
@@ -352,13 +370,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
  vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
  }
  
-if (vhost_log == log) {

+if (vhost_log[backend_type] == log) {
  g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
  qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
  log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
  }
  
  g_free(log);

@@ -376,7 +394,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
  
  static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)

  {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
  uint64_t log_base = (uintptr_t)log->log;
  int r;
  
@@ -2037,8 +2056,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)

  uint64_t log_base;
  
  hdev->log_size = vhost_get_log_size(hdev);

-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {
+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
  log_base = (uintptr_t)hdev->log->log;
  r = hdev->vhost_ops->vhost_set_log_base(hdev,
  hdev->log_size ? log_base : 0,





Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-02-14 Thread Si-Wei Liu

Hi Eugenio,

Just to answer the question you had in the sync meeting: as I've just
tried, it seems that the issue is also reproducible even with the VGA
device and VNC display removed, and also reproducible with 8G mem size.
You already knew that I can only repro with x-svq=on.


Regards,
-Siwei

On 2/13/2024 8:26 AM, Eugenio Perez Martin wrote:

On Tue, Feb 13, 2024 at 11:22 AM Michael S. Tsirkin  wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looked good to me and the original issue I saw
with x-svq=on should be gone. However, after rebasing my tree on top of
this, I found a new failure around setting up guest mappings at early
boot; please see attached the specific QEMU config and the corresponding
event traces. I haven't checked into the details yet, but thought you
would want to be aware of it ahead of time.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this? Can't merge patches which are known to break things ...


Sorry for the lack of news, I'll try to reproduce this week. Meanwhile
this patch should not be merged, as you mention.

Thanks!






Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-02-14 Thread Si-Wei Liu

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looked good to me and the original issue I saw
with x-svq=on should be gone. However, after rebasing my tree on top of
this, I found a new failure around setting up guest mappings at early
boot; please see attached the specific QEMU config and the corresponding
event traces. I haven't checked into the details yet, but thought you
would want to be aware of it ahead of time.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this?
I didn't get a chance to look into the details in the past week, but I
thought it may have something to do with the (internals of the) iova
tree range allocation and the lookup routine. It started to fall apart
at the first vhost_vdpa_dma_unmap call showing up in the trace events,
where it should have gotten IOVA=0x201000, but an incorrect IOVA address
0x1000 ended up being returned from the iova tree lookup routine.


HVA                             GPA                   IOVA
----------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)        [0x0, 0x8000)         [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)        [0x1, 0x208000)       [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)        [0xfeda, 0xfedc)      [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)        [0xfeda, 0xfedc)      [0x1000, 0x2) ???
                                shouldn't it be [0x201000, 0x221000) ???


PS, I will be taking off from today and for the next two weeks. Will try 
to help out looking more closely after I get back.


-Siwei

  Can't merge patches which are known to break things ...




[PATCH v2 2/2] vhost: Perform memory section dirty scans once per iteration

2024-02-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases in cost the more queues are
set up, e.g. on an idle guest migration the following is
observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once per queue configured with vhost in the
guest. For the virtqueue data this is necessary, but not for the
memory sections, which are the same for every queue. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regard to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of the vhost
devices.

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost.c | 75 +++
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ef6d9b5..997d560 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,9 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_dev *vhost_mem_logger[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_mlog_devices =
+QLIST_HEAD_INITIALIZER(vhost_mlog_devices);
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +152,53 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static bool vhost_log_dev_enabled(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == vhost_mem_logger[dev->vhost_ops->backend_type];
+}
+
+static void vhost_mlog_set_dev(struct vhost_dev *hdev, bool enable)
+{
+struct vhost_dev *logdev = NULL;
+VhostBackendType backend_type;
+bool reelect = false;
+
+assert(hdev->vhost_ops);
+assert(hdev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(hdev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+backend_type = hdev->vhost_ops->backend_type;
+
+if (enable && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+reelect = !vhost_mem_logger[backend_type];
+QLIST_INSERT_HEAD(&vhost_mlog_devices, hdev, logdev_entry);
+} else if (!enable && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+reelect = vhost_mem_logger[backend_type] == hdev;
+QLIST_REMOVE(hdev, logdev_entry);
+}
+
+if (!reelect)
+return;
+
+QLIST_FOREACH(hdev, &vhost_mlog_devices, logdev_entry) {
+if (!hdev->vhost_ops ||
+hdev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_NONE ||
+hdev->vhost_ops->backend_type >= VHOST_BACKEND_TYPE_MAX)
+continue;
+
+if (hdev->vhost_ops->backend_type == backend_type) {
+logdev = hdev;
+break;
+}
+}
+
+vhost_mem_logger[backend_type] = logdev;
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +216,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vhost_memory_region *reg = dev->mem->regions + i;
-vhost_dev_sync_region(dev, section, start_addr, end_addr,
-  reg->guest_phys_addr,
-  range_get_last(reg->guest_phys_addr,
- reg->memory_size));
+if (vhost_log_dev_enabled(dev)) {
+for (i = 0; i < dev->mem->nregions; ++i) {
+struct vhost_memory_region *reg = dev->mem->regions + i;
+vhost_dev_sync_region(dev, section, start_addr, end_addr,
+  reg->guest_phys_addr,
+  range_get_last(reg->guest_phys_addr,
+ reg->m
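
To illustrate the gating the hunk above adds to vhost_sync_dirty_bitmap(): every vhost device still syncs its own virtqueue areas, but only the elected logger walks the guest memory regions, which are identical across devices. The following is a rough, self-contained sketch with mocked types (mock_dev, sync_region and friends are illustrative, not QEMU code):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mem_region { uint64_t gpa, size; };

struct mock_dev {
    const char *name;
    bool elected;                    /* stand-in for vhost_dev_should_log() */
    const struct mem_region *mem;
    int nregions;
};

/* Pretend to scan one chunk of the shared dirty log. */
static void sync_region(const struct mock_dev *d, uint64_t gpa, uint64_t size)
{
    printf("%s scans [0x%" PRIx64 ", +0x%" PRIx64 ")\n", d->name, gpa, size);
}

/* Shape of the gating added to vhost_sync_dirty_bitmap(): the per-device
 * virtqueue area is always synced, the guest memory sections only by the
 * elected logger device. */
static void sync_dirty_bitmap(const struct mock_dev *d, uint64_t vring_gpa,
                              uint64_t vring_size)
{
    if (d->elected) {
        for (int i = 0; i < d->nregions; i++) {
            sync_region(d, d->mem[i].gpa, d->mem[i].size);
        }
    }
    sync_region(d, vring_gpa, vring_size);   /* this device's vring area */
}

int main(void)
{
    const struct mem_region guest_mem[] = { { 0x0, 0x80000000ULL } };
    struct mock_dev logger = { "dev0", true,  guest_mem, 1 };
    struct mock_dev other  = { "dev1", false, guest_mem, 1 };

    sync_dirty_bitmap(&logger, 0x1000, 0x1000);   /* memory + its own vring */
    sync_dirty_bitmap(&other,  0x2000, 0x1000);   /* only its own vring     */
    return 0;
}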

[PATCH v2 1/2] vhost: dirty log should be per backend type

2024-02-14 Thread Si-Wei Liu
There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost.c | 49 +
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..ef6d9b5 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
 do { } while (0)
 #endif
 
-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -287,6 +287,8 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
 r = -1;
 }
 
+assert(dev->vhost_ops->backend_type == backend_type || r < 0);
+
 return r;
 }
 
@@ -319,16 +321,23 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
 return log;
 }
 
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
 {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX)
+return NULL;
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
 
 if (!log || log->size != size) {
 log = vhost_log_alloc(size, share);
 if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
 } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
 }
 } else {
 ++log->refcnt;
@@ -340,11 +349,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
 static void vhost_log_put(struct vhost_dev *dev, bool sync)
 {
 struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
 
 if (!log) {
 return;
 }
 
+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
 --log->refcnt;
 if (log->refcnt == 0) {
 /* Sync only the range covered by the old log */
@@ -352,13 +370,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
 vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
 }
 
-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
 g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
 qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
 log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
 }
 
 g_free(log);
@@ -376,7 +394,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
 
 static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
 {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
 uint64_t log_base = (uintptr_t)log->log;
 int r;
 
@@ -2037,8 +2056,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
 uint64_t log_base;
 
 hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
   vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {
+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
 log_base = (uintptr_t)hdev->log->log;
 r = hdev->vhost_ops->vhost_set_log_base(hdev,
 hdev->log_size ? log_base : 0,
-- 
1.8.3.1
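
A rough standalone model of the lifetime rule this patch gives vhost_log_get()/vhost_log_put(): one cached, reference-counted log per backend type (the shared-memory variant gets its own array in the real code, omitted here), reallocated only when the requested size changes. All names and types below are simplified stand-ins, not the QEMU API:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { BT_NONE, BT_KERNEL, BT_USER, BT_MAX };   /* mock backend types */

struct log {
    uint64_t size;
    int refcnt;
};

static struct log *cache[BT_MAX];

/* Mirrors the shape of vhost_log_get(): reject invalid backend types,
 * reuse the cached log when the size matches, otherwise (re)allocate. */
static struct log *log_get(int bt, uint64_t size)
{
    if (bt <= BT_NONE || bt >= BT_MAX) {
        return NULL;
    }
    struct log *l = cache[bt];
    if (!l || l->size != size) {
        l = calloc(1, sizeof(*l));
        l->size = size;
        l->refcnt = 1;
        cache[bt] = l;        /* an old log, if any, lives on until its
                               * remaining users drop their references */
    } else {
        l->refcnt++;
    }
    return l;
}

static void log_put(int bt, struct log *l)
{
    if (!l || bt <= BT_NONE || bt >= BT_MAX) {
        return;
    }
    if (--l->refcnt == 0) {
        if (cache[bt] == l) {
            cache[bt] = NULL;
        }
        free(l);
    }
}

int main(void)
{
    struct log *a = log_get(BT_KERNEL, 4096);
    struct log *b = log_get(BT_KERNEL, 4096);   /* same size: shared, refcnt 2 */
    struct log *c = log_get(BT_USER, 4096);     /* other backend: its own log  */

    printf("shared=%d refcnt=%d\n", a == b, a->refcnt);
    log_put(BT_KERNEL, a);
    log_put(BT_KERNEL, b);
    log_put(BT_USER, c);
    return 0;
}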




[PATCH 07/12] vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events | 2 +-
 hw/virtio/vhost-vdpa.c | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 28d6d78..20577aa 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -57,7 +57,7 @@ vhost_vdpa_dev_start(void *dev, bool started) "dev: %p 
started: %d"
 vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int 
refcnt, int fd, void *log) "dev: %p base: 0x%"PRIx64" size: %llu refcnt: %d fd: 
%d log: %p"
 vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, 
uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, 
uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" 
log_guest_addr: 0x%"PRIx64
 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
-vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
+vhost_vdpa_set_dev_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 0de7bdf..004110f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -972,7 +972,10 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, 
uint8_t *config,
 static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev,
  struct vhost_vring_state *ring)
 {
-trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
+struct vhost_vdpa *v = dev->opaque;
+
+trace_vhost_vdpa_set_dev_vring_base(dev, ring->index, ring->num,
+v->shadow_vqs_enabled);
 return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
 }
 
-- 
1.8.3.1




[PATCH 04/12] vdpa: factor out vhost_vdpa_net_get_nc_vdpa

2024-02-14 Thread Si-Wei Liu
Introduce new API. No functional change on existing API.

Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 06c83b4..4168cad 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -281,13 +281,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 }
 
 
-/** From any vdpa net client, get the netclient of the first queue pair */
-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+/** From any vdpa net client, get the netclient of the i-th queue pair */
+static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i)
 {
 NICState *nic = qemu_get_nic(s->nc.peer);
-NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+NetClientState *nc_i = qemu_get_peer(nic->ncs, i);
+
+return DO_UPCAST(VhostVDPAState, nc, nc_i);
+}
 
-return DO_UPCAST(VhostVDPAState, nc, nc0);
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+return vhost_vdpa_net_get_nc_vdpa(s, 0);
 }
 
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
-- 
1.8.3.1




[PATCH 08/12] vdpa: add trace events for vhost_vdpa_net_load_cmd

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 2 ++
 net/vhost-vdpa.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index aab666a..88f56f2 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -26,3 +26,5 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": 
offset=%u"
 
 # vhost-vdpa.c
 vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
+vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int 
data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d"
+vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) 
"vdpa state: %p class: %u cmd: %u retval: %d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 48a5608..6ee438f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -677,6 +677,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
 
 assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
 cmd_size = sizeof(ctrl) + data_size;
+trace_vhost_vdpa_net_load_cmd(s, class, cmd, data_num, data_size);
 if (vhost_svq_available_slots(svq) < 2 ||
 iov_size(out_cursor, 1) < cmd_size) {
 /*
@@ -708,6 +709,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
 
 r = vhost_vdpa_net_cvq_add(s, , 1, , 1);
 if (unlikely(r < 0)) {
+trace_vhost_vdpa_net_load_cmd_retval(s, class, cmd, r);
 return r;
 }
 
-- 
1.8.3.1




[PATCH 10/12] vdpa: define SVQ transitioning state for mode switching

2024-02-14 Thread Si-Wei Liu
Will be used in following patches.

DISABLING(-1) means SVQ is being switched off to passthrough
mode.

ENABLING(1) means passthrough VQs are being switched to SVQ.

DONE(0) means SVQ switching is completed.

Signed-off-by: Si-Wei Liu 
---
 include/hw/virtio/vhost-vdpa.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index ad754eb..449bf5c 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -30,6 +30,12 @@ typedef struct VhostVDPAHostNotifier {
 void *addr;
 } VhostVDPAHostNotifier;
 
+typedef enum SVQTransitionState {
+SVQ_TSTATE_DISABLING = -1,
+SVQ_TSTATE_DONE,
+SVQ_TSTATE_ENABLING
+} SVQTransitionState;
+
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
 int device_fd;
@@ -67,6 +73,9 @@ typedef struct vhost_vdpa_shared {
 
 /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
 bool shadow_data;
+
+/* SVQ switching is in progress, or already completed? */
+SVQTransitionState svq_switching;
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
-- 
1.8.3.1
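
For context on how the three states are meant to be consumed by patches 11 and 12 of this series, here is a condensed, self-contained sketch. The enum mirrors the patch; the surrounding struct, the functions and the place where vhost_net_stop()/start() would run are mocked assumptions:

#include <stdio.h>

typedef enum SVQTransitionState {
    SVQ_TSTATE_DISABLING = -1,   /* SVQ being switched off to passthrough */
    SVQ_TSTATE_DONE      = 0,    /* no switch in progress                 */
    SVQ_TSTATE_ENABLING  = 1,    /* passthrough VQs being switched to SVQ */
} SVQTransitionState;

struct shared { SVQTransitionState svq_switching; };   /* mock of VhostVDPAShared */

/* Patch 11: bracket the device restart with the transition state so other
 * code can tell that a switch is in flight and in which direction. */
static void log_global_enable(struct shared *sh, int enable)
{
    sh->svq_switching = enable ? SVQ_TSTATE_ENABLING : SVQ_TSTATE_DISABLING;
    /* ... vhost_net_stop() and vhost_net_start() would run here ... */
    sh->svq_switching = SVQ_TSTATE_DONE;
}

/* Patch 12: while disabling, fall through and program the guest notifier
 * even though shadow_vqs_enabled has not been cleared yet. */
static int use_svq_call_fd(const struct shared *sh, int shadow_vqs_enabled)
{
    return shadow_vqs_enabled && sh->svq_switching != SVQ_TSTATE_DISABLING;
}

int main(void)
{
    struct shared sh = { SVQ_TSTATE_DONE };

    printf("%d\n", use_svq_call_fd(&sh, 1));   /* 1: steady state, keep SVQ's call fd */
    sh.svq_switching = SVQ_TSTATE_DISABLING;
    printf("%d\n", use_svq_call_fd(&sh, 1));   /* 0: mid-switch, use guest notifier   */
    log_global_enable(&sh, 0);                 /* switch completes */
    printf("%d\n", sh.svq_switching == SVQ_TSTATE_DONE);   /* 1 */
    return 0;
}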




[PATCH 01/12] vdpa: add back vhost_vdpa_net_first_nc_vdpa

2024-02-14 Thread Si-Wei Liu
Previous commits had it removed. Now adding it back because
this function will be needed by future patches.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 46e350a..4479ffa 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -280,6 +280,16 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 return size;
 }
 
+
+/** From any vdpa net client, get the netclient of the first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
 {
  struct vhost_vdpa *v = &s->vhost_vdpa;
@@ -492,7 +502,7 @@ dma_map_err:
 
 static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
 struct vhost_vdpa *v;
 int64_t cvq_group;
 int r;
@@ -503,7 +513,8 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 s = DO_UPCAST(VhostVDPAState, nc, nc);
  v = &s->vhost_vdpa;
 
-v->shadow_vqs_enabled = v->shared->shadow_data;
+s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
 s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
 
 if (v->shared->shadow_data) {
-- 
1.8.3.1




[PATCH 09/12] vdpa: add trace event for vhost_vdpa_net_load_mq

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 1 +
 net/vhost-vdpa.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index 88f56f2..cda960f 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -28,3 +28,4 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": 
offset=%u"
 vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
 vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int 
data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d"
 vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) 
"vdpa state: %p class: %u cmd: %u retval: %d"
+vhost_vdpa_net_load_mq(void *s, int ncurqps) "vdpa state: %p current_qpairs: 
%d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 6ee438f..9f25221 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -901,6 +901,8 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
 return 0;
 }
 
+trace_vhost_vdpa_net_load_mq(s, n->curr_queue_pairs);
+
 mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs);
 const struct iovec data = {
  .iov_base = &mq,
-- 
1.8.3.1




[PATCH 06/12] vdpa: add vhost_vdpa_get_vring_base trace for svq mode

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events | 2 +-
 hw/virtio/vhost-vdpa.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 77905d1..28d6d78 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -58,7 +58,7 @@ vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned 
long long size, int r
 vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, 
uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, 
uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" 
log_guest_addr: 0x%"PRIx64
 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
 vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
-vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
+vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 
0x%"PRIx64
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 1d3154a..0de7bdf 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1424,6 +1424,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
 
 if (v->shadow_vqs_enabled) {
 ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
+trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, true);
 return 0;
 }
 
@@ -1436,7 +1437,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
 }
 
 ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring);
-trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num);
+trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, false);
 return ret;
 }
 
-- 
1.8.3.1




[PATCH 03/12] vdpa: factor out vhost_vdpa_last_dev

2024-02-14 Thread Si-Wei Liu
Generalize the duplicated condition check for the last vq of a vdpa
device into a common function.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index f7162da..1d3154a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -551,6 +551,11 @@ static bool vhost_vdpa_first_dev(struct vhost_dev *dev)
 return v->index == 0;
 }
 
+static bool vhost_vdpa_last_dev(struct vhost_dev *dev)
+{
+return dev->vq_index + dev->nvqs == dev->vq_index_end;
+}
+
 static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
uint64_t *features)
 {
@@ -1317,7 +1322,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
 }
 
-if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+if (!vhost_vdpa_last_dev(dev)) {
 return 0;
 }
 
@@ -1347,7 +1352,7 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
 
-if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+if (!vhost_vdpa_last_dev(dev)) {
 return;
 }
 
-- 
1.8.3.1
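
As a quick illustration of the condition being factored out: a multiqueue vdpa device is split across several vhost_dev instances, each owning the range [vq_index, vq_index + nvqs), and the last one is the one whose range ends at vq_index_end. A self-contained example with made-up numbers and a mock struct:

#include <stdbool.h>
#include <stdio.h>

/* Only the fields relevant to the check; the real struct vhost_dev is larger. */
struct mock_vhost_dev {
    int vq_index;       /* first vq owned by this vhost_dev   */
    int nvqs;           /* how many vqs it owns               */
    int vq_index_end;   /* one past the last vq of the device */
};

static bool last_dev(const struct mock_vhost_dev *dev)
{
    return dev->vq_index + dev->nvqs == dev->vq_index_end;
}

int main(void)
{
    /* e.g. a net device with 4 data vqs split across two vhost_dev instances */
    struct mock_vhost_dev first = { 0, 2, 4 };
    struct mock_vhost_dev last  = { 2, 2, 4 };

    printf("%d %d\n", last_dev(&first), last_dev(&last));   /* prints: 0 1 */
    return 0;
}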




[PATCH 05/12] vdpa: add vhost_vdpa_set_address_space_id trace

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 3 +++
 net/vhost-vdpa.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index 823a071..aab666a 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -23,3 +23,6 @@ colo_compare_tcp_info(const char *pkt, uint32_t seq, uint32_t 
ack, int hdlen, in
 # filter-rewriter.c
 colo_filter_rewriter_pkt_info(const char *func, const char *src, const char 
*dst, uint32_t seq, uint32_t ack, uint32_t flag) "%s: src/dst: %s/%s p: 
seq/ack=%u/%u  flags=0x%x"
 colo_filter_rewriter_conn_offset(uint32_t offset) ": offset=%u"
+
+# vhost-vdpa.c
+vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4168cad..48a5608 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -29,6 +29,7 @@
 #include "migration/migration.h"
 #include "migration/misc.h"
 #include "hw/virtio/vhost.h"
+#include "trace.h"
 
 /* Todo:need to add the multiqueue support here */
 typedef struct VhostVDPAState {
@@ -440,6 +441,8 @@ static int vhost_vdpa_set_address_space_id(struct 
vhost_vdpa *v,
 };
 int r;
 
+trace_vhost_vdpa_set_address_space_id(v, vq_group, asid_num);
+
  r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
 if (unlikely(r < 0)) {
 error_report("Can't set vq group %u asid %u, errno=%d (%s)",
-- 
1.8.3.1




[PATCH 11/12] vdpa: indicate transitional state for SVQ switching

2024-02-14 Thread Si-Wei Liu
svq_switching indicates the transitional state: whether
or not SVQ mode switching is in progress, and in which
direction. Add the necessary state around where the
switching takes place.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 9f25221..96d95b9 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -317,6 +317,8 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState 
*s, bool enable)
 data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
 cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
   n->max_ncs - n->max_queue_pairs : 0;
+v->shared->svq_switching = enable ?
+SVQ_TSTATE_ENABLING : SVQ_TSTATE_DISABLING;
 /*
  * TODO: vhost_net_stop does suspend, get_base and reset. We can be smarter
  * in the future and resume the device if read-only operations between
@@ -329,6 +331,7 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState 
*s, bool enable)
 if (unlikely(r < 0)) {
 error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
 }
+v->shared->svq_switching = SVQ_TSTATE_DONE;
 }
 
 static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
-- 
1.8.3.1




[PATCH 12/12] vdpa: fix network breakage after cancelling migration

2024-02-14 Thread Si-Wei Liu
Fix an issue where cancellation of an ongoing migration ends up
with no network connectivity.

When canceling migration, SVQ will be switched back to
passthrough mode, but the right call fd is not programmed into
the device and the svq's own call fd is still used. At this
point of the transition, shadow_vqs_enabled has not been set
back to false yet, causing the installation of the call fd to
be inadvertently bypassed.

Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding capabilities")
Cc: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 004110f..dfeca8b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1468,7 +1468,15 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev 
*dev,
 
 /* Remember last call fd because we can switch to SVQ anytime. */
 vhost_svq_set_svq_call_fd(svq, file->fd);
-if (v->shadow_vqs_enabled) {
+/*
+ * When SVQ is transitioning to off, shadow_vqs_enabled has
+ * not been set back to false yet, but the underlying call fd
+ * will have to switch back to the guest notifier to signal the
+ * passthrough virtqueues. In other situations, SVQ's own call
+ * fd shall be used to signal the device model.
+ */
+if (v->shadow_vqs_enabled &&
+v->shared->svq_switching != SVQ_TSTATE_DISABLING) {
 return 0;
 }
 
-- 
1.8.3.1




[PATCH 02/12] vdpa: no repeat setting shadow_data

2024-02-14 Thread Si-Wei Liu
Since shadow_data is now shared in the parent data struct, it
only needs to be set once, by the first vq. This change makes
shadow_data independent of the svq enabled state, which can be
optionally turned off when the SVQ descriptors and device
driver areas are all isolated to a separate address space.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4479ffa..06c83b4 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -354,13 +354,12 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
 if (s->always_svq ||
 migration_is_setup_or_active(migrate_get_current()->state)) {
 v->shadow_vqs_enabled = true;
-v->shared->shadow_data = true;
 } else {
 v->shadow_vqs_enabled = false;
-v->shared->shadow_data = false;
 }
 
 if (v->index == 0) {
+v->shared->shadow_data = v->shadow_vqs_enabled;
 vhost_vdpa_net_data_start_first(s);
 return 0;
 }
-- 
1.8.3.1
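
A tiny sketch of the pattern this patch settles on: every queue-pair client computes its own shadow_vqs_enabled, but only the client with index 0 publishes shadow_data into the shared parent struct. The types and the migrating flag below are mocked for illustration:

#include <stdbool.h>
#include <stdio.h>

struct shared_state { bool shadow_data; };

struct nc_state {
    int index;                  /* queue index of this client */
    bool shadow_vqs_enabled;    /* per-queue decision         */
    struct shared_state *shared;
};

/* Mirrors the idea of the patch: each client decides shadow_vqs_enabled for
 * itself, but only the first queue writes the shared shadow_data field. */
static void data_start(struct nc_state *s, bool migrating)
{
    s->shadow_vqs_enabled = migrating;   /* always_svq case omitted here */
    if (s->index == 0) {
        s->shared->shadow_data = s->shadow_vqs_enabled;
    }
}

int main(void)
{
    struct shared_state shared = { false };
    struct nc_state q0 = { 0, false, &shared }, q1 = { 1, false, &shared };

    data_start(&q0, true);
    data_start(&q1, true);
    printf("shadow_data=%d\n", shared.shadow_data);   /* 1, set once by q0 */
    return 0;
}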




[PATCH 00/12] Preparatory patches for live migration downtime improvement

2024-02-14 Thread Si-Wei Liu
This small series is a spin-off from [1], so that the patches
already acked from that large patchset may get merged earlier
without having to wait for those that are still in review.

The last 3 patches (10 - 12) are a bug fix for an issue where
cancellation of an ongoing migration may lead to broken
networking. These are the only patches in this series with no
acknowledgement received as yet. Please try to review them at
the earliest opportunity. Thanks!

Regards,
-Siwei

[1] [PATCH 00/40] vdpa-net: improve migration downtime through descriptor ASID 
and persistent IOTLB
https://lore.kernel.org/qemu-devel/1701970793-6865-1-git-send-email-si-wei@oracle.com/

---

Si-Wei Liu (12):
  vdpa: add back vhost_vdpa_net_first_nc_vdpa
  vdpa: no repeat setting shadow_data
  vdpa: factor out vhost_vdpa_last_dev
  vdpa: factor out vhost_vdpa_net_get_nc_vdpa
  vdpa: add vhost_vdpa_set_address_space_id trace
  vdpa: add vhost_vdpa_get_vring_base trace for svq mode
  vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode
  vdpa: add trace events for vhost_vdpa_net_load_cmd
  vdpa: add trace event for vhost_vdpa_net_load_mq
  vdpa: define SVQ transitioning state for mode switching
  vdpa: indicate transitional state for SVQ switching
  vdpa: fix network breakage after cancelling migration

 hw/virtio/trace-events |  4 ++--
 hw/virtio/vhost-vdpa.c | 27 ++-
 include/hw/virtio/vhost-vdpa.h |  9 +
 net/trace-events   |  6 ++
 net/vhost-vdpa.c   | 33 +
 5 files changed, 68 insertions(+), 11 deletions(-)

-- 
1.8.3.1




Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-02-05 Thread Si-Wei Liu

Hi Eugenio,

I thought this new code looked good to me and the original issue I saw
with x-svq=on should be gone. However, after rebasing my tree on top of
this, I found a new failure around setting up guest mappings at early
boot; please see attached the specific QEMU config and the corresponding
event traces. I haven't checked into the details yet, but thought you
would want to be aware of it ahead of time.


Regards,
-Siwei

On 2/1/2024 10:09 AM, Eugenio Pérez wrote:

As we are moving to keep the mapping through all the vdpa device life
instead of resetting it at VirtIO reset, we need to move all its
dependencies to the initialization too.  In particular devices with
x-svq=on need a valid iova_tree from the beginning.

Simplify the code also consolidating the two creation points: the first
data vq in case of SVQ active and CVQ start in case only CVQ uses it.

Suggested-by: Si-Wei Liu 
Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h | 16 ++-
  net/vhost-vdpa.c   | 36 +++---
  2 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 03ed2f2be3..ad754eb803 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -37,7 +37,21 @@ typedef struct vhost_vdpa_shared {
  struct vhost_vdpa_iova_range iova_range;
  QLIST_HEAD(, vdpa_iommu) iommu_list;
  
-/* IOVA mapping used by the Shadow Virtqueue */

+/*
+ * IOVA mapping used by the Shadow Virtqueue
+ *
+ * It is shared among all ASID for simplicity, whether CVQ shares ASID with
+ * guest or not:
+ * - Memory listener need access to guest's memory addresses allocated in
+ *   the IOVA tree.
+ * - There should be plenty of IOVA address space for both ASID not to
+ *   worry about collisions between them.  Guest's translations are still
+ *   validated with virtio virtqueue_pop so there is no risk for the guest
+ *   to access memory that it shouldn't.
+ *
+ * To allocate a iova tree per ASID is doable but it complicates the code
+ * and it is not worth it for the moment.
+ */
  VhostIOVATree *iova_tree;
  
  /* Copy of backend features */

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index cc589dd148..57edcf34d0 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -232,6 +232,7 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
  return;
  }
  qemu_close(s->vhost_vdpa.shared->device_fd);
+g_clear_pointer(&s->vhost_vdpa.shared->iova_tree, vhost_iova_tree_delete);
  g_free(s->vhost_vdpa.shared);
  }
  
@@ -329,16 +330,8 @@ static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
  
  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)

  {
-struct vhost_vdpa *v = &s->vhost_vdpa;
-
  migration_add_notifier(&s->migration_state,
 vdpa_net_migration_state_notifier);
-
-/* iova_tree may be initialized by vhost_vdpa_net_load_setup */
-if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
-v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
-   v->shared->iova_range.last);
-}
  }
  
  static int vhost_vdpa_net_data_start(NetClientState *nc)

@@ -383,19 +376,12 @@ static int vhost_vdpa_net_data_load(NetClientState *nc)
  static void vhost_vdpa_net_client_stop(NetClientState *nc)
  {
  VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev;
  
  assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
  
  if (s->vhost_vdpa.index == 0) {

  migration_remove_notifier(&s->migration_state);
  }
-
-dev = s->vhost_vdpa.dev;
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
-vhost_iova_tree_delete);
-}
  }
  
  static NetClientInfo net_vhost_vdpa_info = {

@@ -557,24 +543,6 @@ out:
  return 0;
  }
  
-/*

- * If other vhost_vdpa already have an iova_tree, reuse it for simplicity,
- * whether CVQ shares ASID with guest or not, because:
- * - Memory listener need access to guest's memory addresses allocated in
- *   the IOVA tree.
- * - There should be plenty of IOVA address space for both ASID not to
- *   worry about collisions between them.  Guest's translations are still
- *   validated with virtio virtqueue_pop so there is no risk for the guest
- *   to access memory that it shouldn't.
- *
- * To allocate a iova tree per ASID is doable but it complicates the code
- * and it is not worth it for the moment.
- */
-if (!v->shared->iova_tree) {
-v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
- 

Re: [PATCH 1/6] vdpa: check for iova tree initialized at net_client_start

2024-01-31 Thread Si-Wei Liu

Hi Eugenio,

Maybe there's some patch missing, but I saw this core dump when x-svq=on
is specified while waiting for the incoming migration on the destination host:


(gdb) bt
#0  0x5643b24cc13c in vhost_iova_tree_map_alloc (tree=0x0, 
map=map@entry=0x7ffd58c54830) at ../hw/virtio/vhost-iova-tree.c:89
#1  0x5643b234f193 in vhost_vdpa_listener_region_add 
(listener=0x5643b4403fd8, section=0x7ffd58c548d0) at 
/home/opc/qemu-upstream/include/qemu/int128.h:34
#2  0x5643b24e6a61 in address_space_update_topology_pass 
(as=as@entry=0x5643b35a3840 , 
old_view=old_view@entry=0x5643b442b5f0, 
new_view=new_view@entry=0x5643b44a2130, adding=adding@entry=true) at 
../system/memory.c:1004
#3  0x5643b24e6e60 in address_space_set_flatview (as=0x5643b35a3840 
) at ../system/memory.c:1080
#4  0x5643b24ea750 in memory_region_transaction_commit () at 
../system/memory.c:1132
#5  0x5643b24ea750 in memory_region_transaction_commit () at 
../system/memory.c:1117
#6  0x5643b241f4c1 in pc_memory_init 
(pcms=pcms@entry=0x5643b43c8400, 
system_memory=system_memory@entry=0x5643b43d18b0, 
rom_memory=rom_memory@entry=0x5643b449a960, pci_hole64_size=out>) at ../hw/i386/pc.c:954
#7  0x5643b240d088 in pc_q35_init (machine=0x5643b43c8400) at 
../hw/i386/pc_q35.c:222
#8  0x5643b21e1da8 in machine_run_board_init (machine=out>, mem_path=, errp=, 
errp@entry=0x5643b35b7958 )

    at ../hw/core/machine.c:1509
#9  0x5643b237c0f6 in qmp_x_exit_preconfig () at ../system/vl.c:2613
#10 0x5643b237c0f6 in qmp_x_exit_preconfig (errp=) at 
../system/vl.c:2704
#11 0x5643b237fcdd in qemu_init (errp=) at 
../system/vl.c:3753
#12 0x5643b237fcdd in qemu_init (argc=, 
argv=) at ../system/vl.c:3753
#13 0x5643b2158249 in main (argc=, argv=out>) at ../system/main.c:47


Shall we create the iova tree early during vdpa dev init for the x-svq=on
case?


+    if (s->always_svq) {
+    /* iova tree is needed because of SVQ */
+    shared->iova_tree = vhost_iova_tree_new(shared->iova_range.first,
+ shared->iova_range.last);
+    }
+

Regards,
-Siwei

On 1/11/2024 11:02 AM, Eugenio Pérez wrote:

To map the guest memory while it is migrating we need to create the
iova_tree, as long as the destination uses x-svq=on. Checking to not
override it.

The function vhost_vdpa_net_client_stop clear it if the device is
stopped. If the guest starts the device again, the iova tree is
recreated by vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start
if needed, so old behavior is kept.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 3726ee5d67..e11b390466 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -341,7 +341,9 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState 
*s)
  
   migration_add_notifier(&s->migration_state,

 vdpa_net_migration_state_notifier);
-if (v->shadow_vqs_enabled) {
+
+/* iova_tree may be initialized by vhost_vdpa_net_load_setup */
+if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
  v->shared->iova_tree = 
vhost_iova_tree_new(v->shared->iova_range.first,
 
v->shared->iova_range.last);
  }





[PATCH 31/40] vdpa: batch map and unmap around cvq svq start/stop

2023-12-07 Thread Si-Wei Liu
Coalesce map or unmap operations into exactly one DMA
batch to reduce the potential impact on performance.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index bc72345..1c1d61f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -715,10 +715,11 @@ out:
v->shared->iova_range.last);
 }
 
+vhost_vdpa_dma_batch_begin_once(v->shared, v->address_space_id);
 r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
vhost_vdpa_net_cvq_cmd_page_len(), false);
 if (unlikely(r < 0)) {
-return r;
+goto err;
 }
 
  r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->status,
@@ -727,18 +728,23 @@ out:
  vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
 }
 
+err:
+vhost_vdpa_dma_batch_end_once(v->shared, v->address_space_id);
 return r;
 }
 
 static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
 {
 VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
 
 assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 
 if (s->vhost_vdpa.shadow_vqs_enabled) {
+vhost_vdpa_dma_batch_begin_once(v->shared, v->address_space_id);
 vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
 vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
+vhost_vdpa_dma_batch_end_once(v->shared, v->address_space_id);
 }
 
 vhost_vdpa_net_client_stop(nc);
-- 
1.8.3.1
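
The coalescing described above boils down to bracketing the two map (or unmap) calls with a single begin/end pair, so the device sees one IOTLB batch instead of two, and the error path still closes the batch. A minimal sketch of that shape; the *_once naming follows the patch, while the mocked helpers and the asid argument are illustrative only:

#include <stdio.h>

/* Mocked DMA helpers: each would normally issue VHOST_IOTLB_* messages. */
static void dma_batch_begin_once(int asid) { printf("batch begin (asid %d)\n", asid); }
static void dma_batch_end_once(int asid)   { printf("batch end   (asid %d)\n", asid); }
static int  map_buf(const char *what)      { printf("map %s\n", what); return 0; }
static void unmap_buf(const char *what)    { printf("unmap %s\n", what); }

/* CVQ start: both command and status buffers are mapped inside one batch;
 * the error path still closes the batch before returning. */
static int cvq_start(int asid)
{
    int r;

    dma_batch_begin_once(asid);
    r = map_buf("cvq cmd");
    if (r < 0) {
        goto out;
    }
    r = map_buf("cvq status");
    if (r < 0) {
        unmap_buf("cvq cmd");
    }
out:
    dma_batch_end_once(asid);
    return r;
}

int main(void)
{
    return cvq_start(1) < 0;
}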




[PATCH 28/40] vdpa: support iotlb_batch_asid

2023-12-07 Thread Si-Wei Liu
Then it's possible to specify the ASID when calling the DMA
batching API. If the ASID to work on doesn't align with the
ASID of the ongoing transaction, the API will fail the
request and return a negative value, and the transaction will
remain intact as if the failed request had never occurred.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 25 +++--
 include/hw/virtio/vhost-vdpa.h |  1 +
 net/vhost-vdpa.c   |  1 +
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index d3f5721..b7896a8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -189,15 +189,25 @@ static bool vhost_vdpa_map_batch_begin(VhostVDPAShared 
*s, uint32_t asid)
 
 static int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s, uint32_t asid)
 {
-if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
-s->iotlb_batch_begin_sent) {
+if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
 return 0;
 }
 
-if (vhost_vdpa_map_batch_begin(s, asid)) {
-s->iotlb_batch_begin_sent = true;
+if (s->iotlb_batch_begin_sent && s->iotlb_batch_asid != asid) {
+return -1;
+}
+
+if (s->iotlb_batch_begin_sent) {
+return 0;
 }
 
+if (!vhost_vdpa_map_batch_begin(s, asid)) {
+return 0;
+}
+
+s->iotlb_batch_begin_sent = true;
+s->iotlb_batch_asid = asid;
+
 return 0;
 }
 
@@ -237,10 +247,13 @@ static int vhost_vdpa_dma_batch_end_once(VhostVDPAShared 
*s, uint32_t asid)
 return 0;
 }
 
-if (vhost_vdpa_dma_batch_end(s, asid)) {
-s->iotlb_batch_begin_sent = false;
+if (!vhost_vdpa_dma_batch_end(s, asid)) {
+return 0;
 }
 
+s->iotlb_batch_begin_sent = false;
+s->iotlb_batch_asid = -1;
+
 return 0;
 }
 
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 0fe0f60..219316f 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -61,6 +61,7 @@ typedef struct vhost_vdpa_shared {
 bool map_thread_enabled;
 
 bool iotlb_batch_begin_sent;
+uint32_t iotlb_batch_asid;
 
 /*
  * The memory listener has been registered, so DMA maps have been sent to
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index e9b96ed..bc72345 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -1933,6 +1933,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 s->vhost_vdpa.shared->device_fd = vdpa_device_fd;
 s->vhost_vdpa.shared->iova_range = iova_range;
 s->vhost_vdpa.shared->shadow_data = svq;
+s->vhost_vdpa.shared->iotlb_batch_asid = -1;
 s->vhost_vdpa.shared->refcnt++;
 } else if (!is_datapath) {
 s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
-- 
1.8.3.1
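
A reduced model of the rule this patch adds to the batching API: a batch is opened lazily for a given ASID, a request for a different ASID while a batch is open is rejected, and closing the batch clears the recorded ASID. The state struct and helpers below are simplified stand-ins, not the vhost-vdpa code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct batch_state {
    bool begin_sent;
    uint32_t asid;
};

/* Open a batch for @asid if none is open; refuse to mix ASIDs. */
static int batch_begin_once(struct batch_state *s, uint32_t asid)
{
    if (s->begin_sent && s->asid != asid) {
        return -1;                     /* ongoing transaction, other ASID */
    }
    if (s->begin_sent) {
        return 0;                      /* already inside the right batch  */
    }
    /* ...send VHOST_IOTLB_BATCH_BEGIN to the device here... */
    s->begin_sent = true;
    s->asid = asid;
    return 0;
}

static int batch_end_once(struct batch_state *s, uint32_t asid)
{
    if (!s->begin_sent || s->asid != asid) {
        return 0;                      /* nothing to close for this ASID */
    }
    /* ...send VHOST_IOTLB_BATCH_END... */
    s->begin_sent = false;
    s->asid = (uint32_t)-1;
    return 0;
}

int main(void)
{
    struct batch_state s = { false, (uint32_t)-1 };

    printf("%d\n", batch_begin_once(&s, 1));   /* 0: batch opened for ASID 1 */
    printf("%d\n", batch_begin_once(&s, 2));   /* -1: refused, wrong ASID    */
    printf("%d\n", batch_begin_once(&s, 1));   /* 0: already open            */
    printf("%d\n", batch_end_once(&s, 1));     /* 0: closed                  */
    return 0;
}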




[PATCH 40/40] vdpa: add trace event for vhost_vdpa_net_load_mq

2023-12-07 Thread Si-Wei Liu
For better debuggability and observability.

Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 1 +
 net/vhost-vdpa.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index be087e6..c128cc4 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -30,3 +30,4 @@ vhost_vdpa_net_data_eval_flush(void *s, int qindex, int 
svq_switch, bool svq_flu
 vhost_vdpa_net_cvq_eval_flush(void *s, int qindex, int svq_switch, bool 
svq_flush) "vhost_vdpa: %p qp: %d svq_switch: %d flush_map: %d"
 vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int 
data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d"
 vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) 
"vdpa state: %p class: %u cmd: %u retval: %d"
+vhost_vdpa_net_load_mq(void *s, int ncurqps) "vdpa state: %p current_qpairs: 
%d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 61da8b4..17b8d01 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -1109,6 +1109,8 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
 return 0;
 }
 
+trace_vhost_vdpa_net_load_mq(s, n->curr_queue_pairs);
+
 mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs);
 const struct iovec data = {
  .iov_base = &mq,
-- 
1.8.3.1




[PATCH 07/40] vdpa: move around vhost_vdpa_set_address_space_id

2023-12-07 Thread Si-Wei Liu
Move it a few lines up so the functions before it can call it
more easily.  No functional change involved.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1a738b2..dbfa192 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -335,6 +335,24 @@ static void vdpa_net_migration_state_notifier(Notifier 
*notifier, void *data)
 }
 }
 
+static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
+   unsigned vq_group,
+   unsigned asid_num)
+{
+struct vhost_vring_state asid = {
+.index = vq_group,
+.num = asid_num,
+};
+int r;
+
+r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
+if (unlikely(r < 0)) {
+error_report("Can't set vq group %u asid %u, errno=%d (%s)",
+ asid.index, asid.num, errno, g_strerror(errno));
+}
+return r;
+}
+
 static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 {
 struct vhost_vdpa *v = >vhost_vdpa;
@@ -490,24 +508,6 @@ static int64_t vhost_vdpa_get_vring_desc_group(int 
device_fd,
 return state.num;
 }
 
-static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
-   unsigned vq_group,
-   unsigned asid_num)
-{
-struct vhost_vring_state asid = {
-.index = vq_group,
-.num = asid_num,
-};
-int r;
-
-r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
-if (unlikely(r < 0)) {
-error_report("Can't set vq group %u asid %u, errno=%d (%s)",
- asid.index, asid.num, errno, g_strerror(errno));
-}
-return r;
-}
-
 static void vhost_vdpa_cvq_unmap_buf(struct vhost_vdpa *v, void *addr)
 {
 VhostIOVATree *tree = v->shared->iova_tree;
-- 
1.8.3.1




[PATCH 22/40] vdpa: factor out vhost_vdpa_map_batch_begin

2023-12-07 Thread Si-Wei Liu
Refactoring only. No functional change.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events |  2 +-
 hw/virtio/vhost-vdpa.c | 25 -
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 9725d44..b0239b8 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -32,7 +32,7 @@ vhost_user_create_notifier(int idx, void *n) "idx:%d n:%p"
 # vhost-vdpa.c
 vhost_vdpa_dma_map(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, 
uint64_t iova, uint64_t size, uint64_t uaddr, uint8_t perm, uint8_t type) 
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" 
size: 0x%"PRIx64" uaddr: 0x%"PRIx64" perm: 0x%"PRIx8" type: %"PRIu8
 vhost_vdpa_dma_unmap(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, 
uint64_t iova, uint64_t size, uint8_t type) "vdpa_shared:%p fd: %d msg_type: 
%"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" type: %"PRIu8
-vhost_vdpa_listener_begin_batch(void *v, int fd, uint32_t msg_type, uint8_t 
type)  "vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
+vhost_vdpa_map_batch_begin(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
 vhost_vdpa_listener_commit(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
 vhost_vdpa_listener_region_add_unaligned(void *v, const char *name, uint64_t 
offset_as, uint64_t offset_page) "vdpa_shared: %p region %s 
offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
 vhost_vdpa_listener_region_add(void *vdpa, uint64_t iova, uint64_t llend, void 
*vaddr, bool readonly) "vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64" vaddr: %p 
read-only: %d"
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 013bfa2..7a1b7f4 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -161,7 +161,7 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, 
hwaddr iova,
 return ret;
 }
 
-static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
+static bool vhost_vdpa_map_batch_begin(VhostVDPAShared *s)
 {
 int fd = s->device_fd;
 struct vhost_msg_v2 msg = {
@@ -169,26 +169,33 @@ static void 
vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
 .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
 };
 
-if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
-s->iotlb_batch_begin_sent) {
-return;
-}
-
if (s->map_thread_enabled && !qemu_thread_is_self(&s->map_thread)) {
 struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
 
 *new_msg = msg;
 g_async_queue_push(s->map_queue, new_msg);
 
-return;
+return false;
 }
 
-trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
+trace_vhost_vdpa_map_batch_begin(s, fd, msg.type, msg.iotlb.type);
if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
 }
-s->iotlb_batch_begin_sent = true;
+return true;
+}
+
+static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
+{
+if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
+s->iotlb_batch_begin_sent) {
+return;
+}
+
+if (vhost_vdpa_map_batch_begin(s)) {
+s->iotlb_batch_begin_sent = true;
+}
 }
 
 static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
-- 
1.8.3.1




[PATCH 12/40] vdpa: check map_thread_enabled before join maps thread

2023-12-07 Thread Si-Wei Liu
The next patches will also register the memory listener on
demand, hence the need to differentiate the map_thread
case from the rest.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 2b1cc14..4f026db 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1450,7 +1450,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 if (!v->shared->listener_registered) {
memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
 v->shared->listener_registered = true;
-} else {
+} else if (v->shared->map_thread_enabled) {
 ok = vhost_vdpa_join_maps_thread(v->shared);
 if (unlikely(!ok)) {
 goto out_stop;
-- 
1.8.3.1




[PATCH 06/40] vhost: make svq work with gpa without iova translation

2023-12-07 Thread Si-Wei Liu
Make vhost_svq_vring_write_descs able to work with GPA directly,
without going through the iova tree for translation. This will be
needed in the next few patches, where the SVQ has a dedicated
address space to host its virtqueues. With a dedicated or
isolated address space for the SVQ descriptors, the IOVA is
exactly the same as the guest GPA space, so there is no need to
translate qemu's VA to IOVA via the iova tree any more. A sketch
of the resulting addressing decision follows the patch.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-shadow-virtqueue.c | 35 +++
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index fc5f408..97ccd45 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -136,8 +136,8 @@ static bool vhost_svq_translate_addr(const 
VhostShadowVirtqueue *svq,
  * Return true if success, false otherwise and print error.
  */
 static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
-const struct iovec *iovec, size_t num,
-bool more_descs, bool write)
+const struct iovec *iovec, hwaddr 
*addr,
+size_t num, bool more_descs, bool 
write)
 {
 uint16_t i = svq->free_head, last = svq->free_head;
 unsigned n;
@@ -149,8 +149,15 @@ static bool 
vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
 return true;
 }
 
-ok = vhost_svq_translate_addr(svq, sg, iovec, num);
-if (unlikely(!ok)) {
+if (svq->iova_tree) {
+ok = vhost_svq_translate_addr(svq, sg, iovec, num);
+if (unlikely(!ok)) {
+return false;
+}
+} else if (!addr) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "No translation found for vaddr 0x%p\n",
+  iovec[0].iov_base);
 return false;
 }
 
@@ -161,7 +168,7 @@ static bool 
vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
 } else {
 descs[i].flags = flags;
 }
-descs[i].addr = cpu_to_le64(sg[n]);
+descs[i].addr = cpu_to_le64(svq->iova_tree ? sg[n] : addr[n]);
 descs[i].len = cpu_to_le32(iovec[n].iov_len);
 
 last = i;
@@ -173,9 +180,10 @@ static bool 
vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
 }
 
 static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
-const struct iovec *out_sg, size_t out_num,
-const struct iovec *in_sg, size_t in_num,
-unsigned *head)
+const struct iovec *out_sg, hwaddr *out_addr,
+size_t out_num,
+const struct iovec *in_sg, hwaddr *in_addr,
+size_t in_num, unsigned *head)
 {
 unsigned avail_idx;
 vring_avail_t *avail = svq->vring.avail;
@@ -191,13 +199,14 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
 return false;
 }
 
-ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, in_num > 0,
- false);
+ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_addr, out_num,
+ in_num > 0, false);
 if (unlikely(!ok)) {
 return false;
 }
 
-ok = vhost_svq_vring_write_descs(svq, sgs, in_sg, in_num, false, true);
+ok = vhost_svq_vring_write_descs(svq, sgs, in_sg, in_addr, in_num,
+ false, true);
 if (unlikely(!ok)) {
 return false;
 }
@@ -258,7 +267,9 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct 
iovec *out_sg,
 return -ENOSPC;
 }
 
-ok = vhost_svq_add_split(svq, out_sg, out_num, in_sg, in_num, &qemu_head);
+ok = vhost_svq_add_split(svq, out_sg, elem ? elem->out_addr : NULL,
+ out_num, in_sg, elem ? elem->in_addr : NULL,
+ in_num, &qemu_head);
 if (unlikely(!ok)) {
 return -EINVAL;
 }
-- 
1.8.3.1
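
A restatement of the addressing rule introduced above, as a sketch
(the code line is taken from the diff; the explanatory comment is
not part of the patch):

    /*
     * svq->iova_tree != NULL: the SVQ lives in an IOVA space decoupled
     * from the guest, so qemu's VA is translated through the tree and
     * sg[n] holds the resulting IOVA.
     * svq->iova_tree == NULL: the SVQ's (isolated) address space is 1:1
     * with the guest GPA space, so the element's GPA array addr[n] is
     * used as-is and no lookup is performed.
     */
    descs[i].addr = cpu_to_le64(svq->iova_tree ? sg[n] : addr[n]);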




[PATCH 04/40] vdpa: piggyback desc_group index when probing isolated cvq

2023-12-07 Thread Si-Wei Liu
Same as the previous commit, but do it for cvq instead of data vqs.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 0cf3147..cb5705d 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -1601,16 +1601,19 @@ static const VhostShadowVirtqueueOps 
vhost_vdpa_net_svq_ops = {
 };
 
 /**
- * Probe if CVQ is isolated
+ * Probe if CVQ is isolated, and piggyback its descriptor group
+ * index if supported
  *
  * @device_fd The vdpa device fd
  * @features  Features offered by the device.
  * @cvq_index The control vq pair index
+ * @desc_grpidx   The CVQ's descriptor group index to return
  *
- * Returns <0 in case of failure, 0 if false and 1 if true.
+ * Returns <0 in case of failure, 0 if false and 1 if true (isolated).
  */
 static int vhost_vdpa_probe_cvq_isolation(int device_fd, uint64_t features,
-  int cvq_index, Error **errp)
+  int cvq_index, int64_t *desc_grpidx,
+  Error **errp)
 {
 uint64_t backend_features;
 int64_t cvq_group;
@@ -1667,6 +1670,13 @@ static int vhost_vdpa_probe_cvq_isolation(int device_fd, 
uint64_t features,
 goto out;
 }
 
+if (backend_features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) {
+int64_t desc_group = vhost_vdpa_get_vring_desc_group(device_fd,
+ cvq_index, errp);
+if (likely(desc_group >= 0) && desc_group != cvq_group)
+*desc_grpidx = desc_group;
+}
+
 for (int i = 0; i < cvq_index; ++i) {
 int64_t group = vhost_vdpa_get_vring_group(device_fd, i, errp);
 if (unlikely(group < 0)) {
@@ -1685,6 +1695,8 @@ static int vhost_vdpa_probe_cvq_isolation(int device_fd, 
uint64_t features,
 out:
 status = 0;
ioctl(device_fd, VHOST_VDPA_SET_STATUS, &status);
+status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+ioctl(device_fd, VHOST_VDPA_SET_STATUS, &status);
 return r;
 }
 
@@ -1791,6 +1803,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
Error **errp)
 {
 NetClientState *nc = NULL;
+int64_t desc_group = -1;
 VhostVDPAState *s;
 int ret = 0;
 assert(name);
@@ -1802,7 +1815,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 } else {
 cvq_isolated = vhost_vdpa_probe_cvq_isolation(vdpa_device_fd, features,
   queue_pair_index * 2,
-  errp);
+  &desc_group, errp);
 if (unlikely(cvq_isolated < 0)) {
 return NULL;
 }
-- 
1.8.3.1




[PATCH 25/40] vdpa: add asid to dma_batch_once API

2023-12-07 Thread Si-Wei Liu
So that the DMA batching API can operate on ASIDs other than 0.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events |  4 ++--
 hw/virtio/vhost-vdpa.c | 14 --
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 3411a07..196f32f 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -32,8 +32,8 @@ vhost_user_create_notifier(int idx, void *n) "idx:%d n:%p"
 # vhost-vdpa.c
 vhost_vdpa_dma_map(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, 
uint64_t iova, uint64_t size, uint64_t uaddr, uint8_t perm, uint8_t type) 
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" 
size: 0x%"PRIx64" uaddr: 0x%"PRIx64" perm: 0x%"PRIx8" type: %"PRIu8
 vhost_vdpa_dma_unmap(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, 
uint64_t iova, uint64_t size, uint8_t type) "vdpa_shared:%p fd: %d msg_type: 
%"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" type: %"PRIu8
-vhost_vdpa_map_batch_begin(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
-vhost_vdpa_dma_batch_end(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
+vhost_vdpa_map_batch_begin(void *v, int fd, uint32_t msg_type, uint8_t type, 
uint32_t asid)  "vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8" asid: 
%"PRIu32
+vhost_vdpa_dma_batch_end(void *v, int fd, uint32_t msg_type, uint8_t type, 
uint32_t asid)  "vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8" asid: 
%"PRIu32
 vhost_vdpa_listener_region_add_unaligned(void *v, const char *name, uint64_t 
offset_as, uint64_t offset_page) "vdpa_shared: %p region %s 
offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
 vhost_vdpa_listener_region_add(void *vdpa, uint64_t iova, uint64_t llend, void 
*vaddr, bool readonly) "vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64" vaddr: %p 
read-only: %d"
 vhost_vdpa_listener_region_del_unaligned(void *v, const char *name, uint64_t 
offset_as, uint64_t offset_page) "vdpa_shared: %p region %s 
offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 999a97a..2db2832 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -161,11 +161,12 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t 
asid, hwaddr iova,
 return ret;
 }
 
-static bool vhost_vdpa_map_batch_begin(VhostVDPAShared *s)
+static bool vhost_vdpa_map_batch_begin(VhostVDPAShared *s, uint32_t asid)
 {
 int fd = s->device_fd;
 struct vhost_msg_v2 msg = {
 .type = VHOST_IOTLB_MSG_V2,
+.asid = asid,
 .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
 };
 
@@ -178,7 +179,7 @@ static bool vhost_vdpa_map_batch_begin(VhostVDPAShared *s)
 return false;
 }
 
-trace_vhost_vdpa_map_batch_begin(s, fd, msg.type, msg.iotlb.type);
+trace_vhost_vdpa_map_batch_begin(s, fd, msg.type, msg.iotlb.type, 
msg.asid);
if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
@@ -193,17 +194,18 @@ static void 
vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s)
 return;
 }
 
-if (vhost_vdpa_map_batch_begin(s)) {
+if (vhost_vdpa_map_batch_begin(s, 0)) {
 s->iotlb_batch_begin_sent = true;
 }
 }
 
-static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s)
+static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s, uint32_t asid)
 {
 struct vhost_msg_v2 msg = {};
 int fd = s->device_fd;
 
 msg.type = VHOST_IOTLB_MSG_V2;
+msg.asid = asid;
 msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
if (s->map_thread_enabled && !qemu_thread_is_self(&s->map_thread)) {
@@ -215,7 +217,7 @@ static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s)
 return false;
 }
 
-trace_vhost_vdpa_dma_batch_end(s, fd, msg.type, msg.iotlb.type);
+trace_vhost_vdpa_dma_batch_end(s, fd, msg.type, msg.iotlb.type, msg.asid);
if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
@@ -233,7 +235,7 @@ static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared 
*s)
 return;
 }
 
-if (vhost_vdpa_dma_batch_end(s)) {
+if (vhost_vdpa_dma_batch_end(s, 0)) {
 s->iotlb_batch_begin_sent = false;
 }
 }
-- 
1.8.3.1




[PATCH 35/40] vdpa: add vhost_vdpa_set_address_space_id trace

2023-12-07 Thread Si-Wei Liu
For better debuggability and observability.

Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 3 +++
 net/vhost-vdpa.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index 823a071..aab666a 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -23,3 +23,6 @@ colo_compare_tcp_info(const char *pkt, uint32_t seq, uint32_t 
ack, int hdlen, in
 # filter-rewriter.c
 colo_filter_rewriter_pkt_info(const char *func, const char *src, const char 
*dst, uint32_t seq, uint32_t ack, uint32_t flag) "%s: src/dst: %s/%s p: 
seq/ack=%u/%u  flags=0x%x"
 colo_filter_rewriter_conn_offset(uint32_t offset) ": offset=%u"
+
+# vhost-vdpa.c
+vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 41714d1..84876b0 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -30,6 +30,7 @@
 #include "migration/misc.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-vdpa.h"
+#include "trace.h"
 
 /* Todo:need to add the multiqueue support here */
 typedef struct VhostVDPAState {
@@ -365,6 +366,8 @@ static int vhost_vdpa_set_address_space_id(struct 
vhost_vdpa *v,
 };
 int r;
 
+trace_vhost_vdpa_set_address_space_id(v, vq_group, asid_num);
+
r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
 if (unlikely(r < 0)) {
 error_report("Can't set vq group %u asid %u, errno=%d (%s)",
-- 
1.8.3.1




[PATCH 21/40] vdpa: vhost_vdpa_dma_batch_end_once rename

2023-12-07 Thread Si-Wei Liu
No functional changes. Rename only.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 47c764b..013bfa2 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -191,7 +191,7 @@ static void 
vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
 s->iotlb_batch_begin_sent = true;
 }
 
-static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
+static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
 {
 struct vhost_msg_v2 msg = {};
 int fd = s->device_fd;
@@ -229,7 +229,7 @@ static void vhost_vdpa_listener_commit(MemoryListener 
*listener)
 {
 VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
 
-vhost_vdpa_dma_end_batch(s);
+vhost_vdpa_dma_batch_end_once(s);
 }
 
 static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
@@ -1367,7 +1367,7 @@ static void *vhost_vdpa_load_map(void *opaque)
 vhost_vdpa_iotlb_batch_begin_once(shared);
 break;
 case VHOST_IOTLB_BATCH_END:
-vhost_vdpa_dma_end_batch(shared);
+vhost_vdpa_dma_batch_end_once(shared);
 break;
 default:
 error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
-- 
1.8.3.1




[PATCH 14/40] vdpa: convert iova_tree to ref count based

2023-12-07 Thread Si-Wei Liu
So that it can be freed from vhost_vdpa_cleanup on
the last deref. The next few patches will make the
iova tree life cycle independent of the memory
listener, opening the possibility of keeping the iova
tree around when the memory mapping does not change
across device reset.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index a126e5c..7b8f047 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -238,6 +238,8 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
 }
 if (--s->vhost_vdpa.shared->refcnt == 0) {
 qemu_close(s->vhost_vdpa.shared->device_fd);
+g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
+vhost_iova_tree_delete);
 g_free(s->vhost_vdpa.shared);
 }
 s->vhost_vdpa.shared = NULL;
@@ -461,19 +463,12 @@ static int vhost_vdpa_net_data_load(NetClientState *nc)
 static void vhost_vdpa_net_client_stop(NetClientState *nc)
 {
 VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev;
 
 assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 
 if (s->vhost_vdpa.index == 0) {
migration_remove_notifier(&s->migration_state);
 }
-
-dev = s->vhost_vdpa.dev;
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
-vhost_iova_tree_delete);
-}
 }
 
 static int vhost_vdpa_net_load_setup(NetClientState *nc, NICState *nic)
-- 
1.8.3.1




[PATCH 11/40] vdpa: factor out vhost_vdpa_last_dev

2023-12-07 Thread Si-Wei Liu
Generalize the duplicated condition check for the last vq of a
vdpa device into a common function.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 30dff95..2b1cc14 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -593,6 +593,11 @@ static bool vhost_vdpa_first_dev(struct vhost_dev *dev)
 return v->index == 0;
 }
 
+static bool vhost_vdpa_last_dev(struct vhost_dev *dev)
+{
+return dev->vq_index + dev->nvqs == dev->vq_index_end;
+}
+
 static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
uint64_t *features)
 {
@@ -1432,7 +1437,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 goto out_stop;
 }
 
-if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+if (!vhost_vdpa_last_dev(dev)) {
 return 0;
 }
 
@@ -1467,7 +1472,7 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
 
-if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+if (!vhost_vdpa_last_dev(dev)) {
 return;
 }
 
-- 
1.8.3.1




[PATCH 15/40] vdpa: add svq_switching and flush_map to header

2023-12-07 Thread Si-Wei Liu
Will be used in next patches.

Signed-off-by: Si-Wei Liu 
---
 include/hw/virtio/vhost-vdpa.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 7b8d3bf..0fe0f60 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -72,6 +72,12 @@ typedef struct vhost_vdpa_shared {
 bool shadow_data;
 
 unsigned refcnt;
+
+/* SVQ switching is in progress? 1: turn on SVQ, -1: turn off SVQ */
+int svq_switching;
+
+/* Flush mappings on reset due to shared address space */
+bool flush_map;
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
-- 
1.8.3.1




[PATCH 34/40] vdpa: fix network breakage after cancelling migration

2023-12-07 Thread Si-Wei Liu
Fix an issue where cancellation of an ongoing migration ends up
with no network connectivity.

When canceling migration, SVQ will be switched back to the
passthrough mode, but the right call fd is not programmed into
the device and the svq's own call fd is still used. During this
transition period, shadow_vqs_enabled has not been set back to
false yet, causing the installation of the call fd to be
inadvertently bypassed.

Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding capabilities")
Cc: Eugenio Pérez 

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 4010fd9..8ba390d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1647,7 +1647,12 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev 
*dev,
 
 /* Remember last call fd because we can switch to SVQ anytime. */
 vhost_svq_set_svq_call_fd(svq, file->fd);
-if (v->shadow_vqs_enabled) {
+/*
+ * In the event of SVQ switching to off, shadow_vqs_enabled has
+ * not been set to false yet, but the underlying call fd will
+ * switch back to the guest notifier for passthrough VQs.
+ */
+if (v->shadow_vqs_enabled && v->shared->svq_switching >= 0) {
 return 0;
 }
 
-- 
1.8.3.1




[PATCH 09/40] vdpa: no repeat setting shadow_data

2023-12-07 Thread Si-Wei Liu
Since shadow_data is now shared in the parent data struct, it
only needs to be set once, by the first vq. This change makes
shadow_data independent of the svq enabled state, which can be
optionally turned off when SVQ descriptors and device driver
areas are all isolated to a separate address space.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index c9bfc6f..2555897 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -387,13 +387,12 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
 if (s->always_svq ||
 migration_is_setup_or_active(migrate_get_current()->state)) {
 v->shadow_vqs_enabled = true;
-v->shared->shadow_data = true;
 } else {
 v->shadow_vqs_enabled = false;
-v->shared->shadow_data = false;
 }
 
 if (v->index == 0) {
+v->shared->shadow_data = v->shadow_vqs_enabled;
 vhost_vdpa_net_data_start_first(s);
 return 0;
 }
-- 
1.8.3.1




[PATCH 13/40] vdpa: ref counting VhostVDPAShared

2023-12-07 Thread Si-Wei Liu
Subsequent patches attempt to release VhostVDPAShared resources,
for example the iova tree to free and the memory listener to
unregister, in vdpa_dev_cleanup(). Instead of checking against
the vq index, which is not always available in all of the
callers, count the usage by reference. Then it is easy to free
the resources upon the last deref. A sketch of the resulting
acquire/release pattern follows the patch.

Signed-off-by: Si-Wei Liu 
---
 include/hw/virtio/vhost-vdpa.h |  2 ++
 net/vhost-vdpa.c   | 14 ++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 63493ff..7b8d3bf 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -70,6 +70,8 @@ typedef struct vhost_vdpa_shared {
 
 /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
 bool shadow_data;
+
+unsigned refcnt;
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index aebaa53..a126e5c 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -236,11 +236,11 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
 g_free(s->vhost_net);
 s->vhost_net = NULL;
 }
-if (s->vhost_vdpa.index != 0) {
-return;
+if (--s->vhost_vdpa.shared->refcnt == 0) {
+qemu_close(s->vhost_vdpa.shared->device_fd);
+g_free(s->vhost_vdpa.shared);
 }
-qemu_close(s->vhost_vdpa.shared->device_fd);
-g_free(s->vhost_vdpa.shared);
+s->vhost_vdpa.shared = NULL;
 }
 
 /** Dummy SetSteeringEBPF to support RSS for vhost-vdpa backend  */
@@ -1896,6 +1896,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 s->vhost_vdpa.shared->device_fd = vdpa_device_fd;
 s->vhost_vdpa.shared->iova_range = iova_range;
 s->vhost_vdpa.shared->shadow_data = svq;
+s->vhost_vdpa.shared->refcnt++;
 } else if (!is_datapath) {
 s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
  PROT_READ | PROT_WRITE,
@@ -1910,6 +1911,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 }
 if (queue_pair_index != 0) {
 s->vhost_vdpa.shared = shared;
+s->vhost_vdpa.shared->refcnt++;
 }
 
 ret = vhost_vdpa_add(nc, (void *)>vhost_vdpa, queue_pair_index, nvqs);
@@ -1928,6 +1930,10 @@ static NetClientState 
*net_vhost_vdpa_init(NetClientState *peer,
 return nc;
 
 err:
+if (--s->vhost_vdpa.shared->refcnt == 0) {
+g_free(s->vhost_vdpa.shared);
+}
+s->vhost_vdpa.shared = NULL;
 qemu_del_net_client(nc);
 return NULL;
 }
-- 
1.8.3.1
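
The reference counting added here is open-coded at the call sites
shown in the diff. A minimal sketch of the intended acquire/release
pattern, with hypothetical helper names (the patch itself does not
add these helpers):

    static VhostVDPAShared *vhost_vdpa_shared_get(VhostVDPAShared *shared)
    {
        shared->refcnt++;              /* every vdpa net client takes one */
        return shared;
    }

    static void vhost_vdpa_shared_put(VhostVDPAShared *shared)
    {
        if (--shared->refcnt == 0) {   /* last client going away */
            qemu_close(shared->device_fd);
            g_free(shared);
        }
    }

Each queue pair (and the CVQ client) ends up holding exactly one
reference, so the device fd is closed exactly once, on the last put.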




[PATCH 30/40] vdpa: batch map/unmap op per svq pair basis

2023-12-07 Thread Si-Wei Liu
Coalesce multiple map or unmap operations into just one,
so that all mapping setup or teardown can occur in a
single DMA batch.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 68dc01b..d98704a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1288,6 +1288,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
 return true;
 }
 
+vhost_vdpa_dma_batch_begin_once(v->shared, v->address_space_id);
 for (i = 0; i < v->shadow_vqs->len; ++i) {
 VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
@@ -1315,6 +1316,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
 goto err_set_addr;
 }
 }
+vhost_vdpa_dma_batch_end_once(v->shared, v->address_space_id);
 
 return true;
 
@@ -1323,6 +1325,7 @@ err_set_addr:
 
 err_map:
 vhost_svq_stop(g_ptr_array_index(v->shadow_vqs, i));
+vhost_vdpa_dma_batch_end_once(v->shared, v->address_space_id);
 
 err:
 error_reportf_err(err, "Cannot setup SVQ %u: ", i);
@@ -1343,6 +1346,7 @@ static void vhost_vdpa_svqs_stop(struct vhost_dev *dev)
 return;
 }
 
+vhost_vdpa_dma_batch_begin_once(v->shared, v->address_space_id);
 for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
 
@@ -1352,6 +1356,7 @@ static void vhost_vdpa_svqs_stop(struct vhost_dev *dev)
event_notifier_cleanup(&svq->hdev_kick);
event_notifier_cleanup(&svq->hdev_call);
 }
+vhost_vdpa_dma_batch_end_once(v->shared, v->address_space_id);
 }
 
 static void vhost_vdpa_suspend(struct vhost_dev *dev)
-- 
1.8.3.1




[PATCH 18/40] vdpa: unregister listener on last dev cleanup

2023-12-07 Thread Si-Wei Liu
So that freeing the iova tree struct can be safely deferred
until the last vq referencing it goes away.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 4f026db..ea2dfc8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -815,7 +815,10 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
 }
 
 vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
-memory_listener_unregister(&v->shared->listener);
+if (vhost_vdpa_last_dev(dev) && v->shared->listener_registered) {
+memory_listener_unregister(&v->shared->listener);
+v->shared->listener_registered = false;
+}
 vhost_vdpa_svq_cleanup(dev);
 
 dev->opaque = NULL;
-- 
1.8.3.1




[PATCH 17/40] vdpa: judge if map can be kept across reset

2023-12-07 Thread Si-Wei Liu
A dedicated descriptor group for the SVQ ASID allows the guest
memory mapping to be retained across SVQ switching, the same way
an isolated CVQ can use a different ASID than the guest GPA
space. Introduce an evaluation function to judge whether to flush
or keep the iotlb maps, based on the virtqueue's descriptor group
and the cvq isolation capability.

The evaluation function has to be hooked to NetClient's .poll op,
because .vhost_reset_status runs ahead of .stop, and
.vhost_dev_start does not have access to the vhost-vdpa net's
information. A sketch of the resulting decision follows the patch.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 04718b2..e9b96ed 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -504,12 +504,36 @@ static int vhost_vdpa_net_load_cleanup(NetClientState 
*nc, NICState *nic)
  n->parent_obj.status & VIRTIO_CONFIG_S_DRIVER_OK);
 }
 
+static void vhost_vdpa_net_data_eval_flush(NetClientState *nc, bool stop)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+if (!stop) {
+return;
+}
+
+if (s->vhost_vdpa.index == 0) {
+if (s->always_svq) {
+v->shared->flush_map = true;
+} else if (!v->shared->svq_switching || v->desc_group >= 0) {
+v->shared->flush_map = false;
+} else {
+v->shared->flush_map = true;
+}
+} else if (!s->always_svq && v->shared->svq_switching &&
+   v->desc_group < 0) {
+v->shared->flush_map = true;
+}
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
 .receive = vhost_vdpa_receive,
 .start = vhost_vdpa_net_data_start,
 .load = vhost_vdpa_net_data_load,
+.poll = vhost_vdpa_net_data_eval_flush,
 .stop = vhost_vdpa_net_client_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
@@ -1368,12 +1392,28 @@ static int vhost_vdpa_net_cvq_load(NetClientState *nc)
 return 0;
 }
 
+static void vhost_vdpa_net_cvq_eval_flush(NetClientState *nc, bool stop)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+if (!stop) {
+return;
+}
+
+if (!v->shared->flush_map && !v->shared->svq_switching &&
+!s->cvq_isolated && v->desc_group < 0) {
+v->shared->flush_map = true;
+}
+}
+
 static NetClientInfo net_vhost_vdpa_cvq_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
 .receive = vhost_vdpa_receive,
 .start = vhost_vdpa_net_cvq_start,
 .load = vhost_vdpa_net_cvq_load,
+.poll = vhost_vdpa_net_cvq_eval_flush,
 .stop = vhost_vdpa_net_cvq_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
-- 
1.8.3.1
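
The nested conditions in the two eval_flush hooks are easier to read
as a single predicate. The sketch below restates the per-device
outcome for the data vqs; it is not part of the patch:

    /* Does the data-vq iotlb need to be flushed when the client stops? */
    static bool data_vq_needs_flush(bool always_svq, int svq_switching,
                                    int64_t desc_group)
    {
        if (always_svq) {
            return true;        /* x-svq=on: the patch always flushes */
        }
        if (!svq_switching) {
            return false;       /* no SVQ transition is in flight */
        }
        /*
         * Switching SVQ on or off: the guest mapping can only be kept
         * when the descriptors have their own group (desc_group >= 0),
         * and hence their own ASID.
         */
        return desc_group < 0;
    }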




[PATCH 38/40] vdpa: add trace events for eval_flush

2023-12-07 Thread Si-Wei Liu
For better debuggability and observability.

Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 2 ++
 net/vhost-vdpa.c | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index aab666a..d650c71 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -26,3 +26,5 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": 
offset=%u"
 
 # vhost-vdpa.c
 vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
+vhost_vdpa_net_data_eval_flush(void *s, int qindex, int svq_switch, bool 
svq_flush) "vhost_vdpa: %p qp: %d svq_switch: %d flush_map: %d"
+vhost_vdpa_net_cvq_eval_flush(void *s, int qindex, int svq_switch, bool 
svq_flush) "vhost_vdpa: %p qp: %d svq_switch: %d flush_map: %d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 84876b0..a0bd8cd 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -608,6 +608,9 @@ static void vhost_vdpa_net_data_eval_flush(NetClientState 
*nc, bool stop)
v->desc_group < 0) {
 v->shared->flush_map = true;
 }
+trace_vhost_vdpa_net_data_eval_flush(v, s->vhost_vdpa.index,
+v->shared->svq_switching,
+v->shared->flush_map);
 }
 
 static NetClientInfo net_vhost_vdpa_info = {
@@ -1457,6 +1460,10 @@ static void vhost_vdpa_net_cvq_eval_flush(NetClientState 
*nc, bool stop)
 !s->cvq_isolated && v->desc_group < 0) {
 v->shared->flush_map = true;
 }
+
+trace_vhost_vdpa_net_cvq_eval_flush(v, s->vhost_vdpa.index,
+   v->shared->svq_switching,
+   v->shared->flush_map);
 }
 
 static NetClientInfo net_vhost_vdpa_cvq_info = {
-- 
1.8.3.1




[PATCH 33/40] vdpa: batch multiple dma_unmap to a single call for vm stop

2023-12-07 Thread Si-Wei Liu
Should help live migration downtime on the source host. Below is
the coalesced dma_unmap time series on a 2 queue pair config (no
dedicated descriptor group ASID for SVQ). A sketch of the
resulting stop path follows the patch.

109531@1693367276.853503:vhost_vdpa_reset_device dev: 0x55c933926890
109531@1693367276.853513:vhost_vdpa_add_status dev: 0x55c933926890 status: 0x3
109531@1693367276.853520:vhost_vdpa_flush_map dev: 0x55c933926890 doit: 1 
svq_flush: 0 persist: 1
109531@1693367276.853524:vhost_vdpa_set_config_call dev: 0x55c933926890 fd: -1
109531@1693367276.853579:vhost_vdpa_iotlb_begin_batch vdpa:0x7fa2aa895190 fd: 
16 msg_type: 2 type: 5
109531@1693367276.853586:vhost_vdpa_dma_unmap vdpa:0x7fa2aa895190 fd: 16 
msg_type: 2 asid: 0 iova: 0x1000 size: 0x2000 type: 3
109531@1693367276.853600:vhost_vdpa_dma_unmap vdpa:0x7fa2aa895190 fd: 16 
msg_type: 2 asid: 0 iova: 0x3000 size: 0x1000 type: 3
109531@1693367276.853618:vhost_vdpa_dma_unmap vdpa:0x7fa2aa895190 fd: 16 
msg_type: 2 asid: 0 iova: 0x4000 size: 0x2000 type: 3
109531@1693367276.853625:vhost_vdpa_dma_unmap vdpa:0x7fa2aa895190 fd: 16 
msg_type: 2 asid: 0 iova: 0x6000 size: 0x1000 type: 3
109531@1693367276.853630:vhost_vdpa_dma_unmap vdpa:0x7fa2aa84c190 fd: 16 
msg_type: 2 asid: 0 iova: 0x7000 size: 0x2000 type: 3
109531@1693367276.853636:vhost_vdpa_dma_unmap vdpa:0x7fa2aa84c190 fd: 16 
msg_type: 2 asid: 0 iova: 0x9000 size: 0x1000 type: 3
109531@1693367276.853642:vhost_vdpa_dma_unmap vdpa:0x7fa2aa84c190 fd: 16 
msg_type: 2 asid: 0 iova: 0xa000 size: 0x2000 type: 3
109531@1693367276.853648:vhost_vdpa_dma_unmap vdpa:0x7fa2aa84c190 fd: 16 
msg_type: 2 asid: 0 iova: 0xc000 size: 0x1000 type: 3
109531@1693367276.853654:vhost_vdpa_dma_unmap vdpa:0x7fa2aa6b6190 fd: 16 
msg_type: 2 asid: 0 iova: 0xf000 size: 0x1000 type: 3
109531@1693367276.853660:vhost_vdpa_dma_unmap vdpa:0x7fa2aa6b6190 fd: 16 
msg_type: 2 asid: 0 iova: 0x1 size: 0x1000 type: 3
109531@1693367276.853666:vhost_vdpa_dma_unmap vdpa:0x7fa2aa6b6190 fd: 16 
msg_type: 2 asid: 0 iova: 0xd000 size: 0x1000 type: 3
109531@1693367276.853670:vhost_vdpa_dma_unmap vdpa:0x7fa2aa6b6190 fd: 16 
msg_type: 2 asid: 0 iova: 0xe000 size: 0x1000 type: 3
109531@1693367276.853675:vhost_vdpa_iotlb_end_batch vdpa:0x7fa2aa895190 fd: 16 
msg_type: 2 type: 6
109531@1693367277.014697:vhost_vdpa_get_vq_index dev: 0x55c933925de0 idx: 0 vq 
idx: 0
109531@1693367277.014747:vhost_vdpa_get_vq_index dev: 0x55c933925de0 idx: 1 vq 
idx: 1
109531@1693367277.014753:vhost_vdpa_get_vq_index dev: 0x55c9339262e0 idx: 2 vq 
idx: 2
109531@1693367277.014756:vhost_vdpa_get_vq_index dev: 0x55c9339262e0 idx: 3 vq 
idx: 3

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c |   7 +--
 include/hw/virtio/vhost-vdpa.h |   3 ++
 net/vhost-vdpa.c   | 112 +++--
 3 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index d98704a..4010fd9 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1162,8 +1162,8 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa 
*v, hwaddr addr)
 vhost_iova_tree_remove(v->shared->iova_tree, *result);
 }
 
-static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
-   const VhostShadowVirtqueue *svq)
+void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
+const VhostShadowVirtqueue *svq)
 {
 struct vhost_vdpa *v = dev->opaque;
 struct vhost_vring_addr svq_addr;
@@ -1346,17 +1346,14 @@ static void vhost_vdpa_svqs_stop(struct vhost_dev *dev)
 return;
 }
 
-vhost_vdpa_dma_batch_begin_once(v->shared, v->address_space_id);
 for (unsigned i = 0; i < v->shadow_vqs->len; ++i) {
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
 
 vhost_svq_stop(svq);
-vhost_vdpa_svq_unmap_rings(dev, svq);
 
event_notifier_cleanup(&svq->hdev_kick);
event_notifier_cleanup(&svq->hdev_call);
 }
-vhost_vdpa_dma_batch_end_once(v->shared, v->address_space_id);
 }
 
 static void vhost_vdpa_suspend(struct vhost_dev *dev)
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index aa13679..f426e2c 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -112,6 +112,9 @@ int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s, 
uint32_t asid);
 int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
 int vhost_vdpa_load_cleanup(VhostVDPAShared *s, bool vhost_will_start);
 
+void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
+const VhostShadowVirtqueue *svq);
+
 typedef struct vdpa_iommu {
 VhostVDPAShared *dev_shared;
 IOMMUMemoryRegion *iommu_mr;
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 683619f..41714d1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -29,6 +29,7 @@
 #include "migration/migration.h"
#include "migration/misc.h"
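
The net/vhost-vdpa.c hunk is cut off above. Judging from the commit
message, the trace output and the newly exported
vhost_vdpa_svq_unmap_rings(), the stop path takes roughly the shape
below; this is a hedged sketch, not the missing hunk:

    /* One DMA batch covers every SVQ ring unmap of the NIC at VM stop,
     * instead of one batch per unmap call. */
    vhost_vdpa_dma_batch_begin_once(shared, asid);
    /*
     * for each vhost_dev of the NIC:
     *     for each of its shadow virtqueues:
     *         vhost_vdpa_svq_unmap_rings(dev, svq);
     */
    vhost_vdpa_dma_batch_end_once(shared, asid);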

[PATCH 39/40] vdpa: add trace events for vhost_vdpa_net_load_cmd

2023-12-07 Thread Si-Wei Liu
For better debuggability and observability.

Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 2 ++
 net/vhost-vdpa.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index d650c71..be087e6 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -28,3 +28,5 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": 
offset=%u"
 vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
 vhost_vdpa_net_data_eval_flush(void *s, int qindex, int svq_switch, bool 
svq_flush) "vhost_vdpa: %p qp: %d svq_switch: %d flush_map: %d"
 vhost_vdpa_net_cvq_eval_flush(void *s, int qindex, int svq_switch, bool 
svq_flush) "vhost_vdpa: %p qp: %d svq_switch: %d flush_map: %d"
+vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int 
data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d"
+vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) 
"vdpa state: %p class: %u cmd: %u retval: %d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index a0bd8cd..61da8b4 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -885,6 +885,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
 
 assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
 cmd_size = sizeof(ctrl) + data_size;
+trace_vhost_vdpa_net_load_cmd(s, class, cmd, data_num, data_size);
 if (vhost_svq_available_slots(svq) < 2 ||
 iov_size(out_cursor, 1) < cmd_size) {
 /*
@@ -916,6 +917,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
 
 r = vhost_vdpa_net_cvq_add(s, , 1, , 1);
 if (unlikely(r < 0)) {
+trace_vhost_vdpa_net_load_cmd_retval(s, class, cmd, r);
 return r;
 }
 
-- 
1.8.3.1




[PATCH 29/40] vdpa: expose API vhost_vdpa_dma_batch_once

2023-12-07 Thread Si-Wei Liu
So that the batching API can be called externally, from files
other than the one where it is defined. A short usage sketch
follows the patch.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 21 +++--
 include/hw/virtio/vhost-vdpa.h |  3 +++
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index b7896a8..68dc01b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -187,7 +187,7 @@ static bool vhost_vdpa_map_batch_begin(VhostVDPAShared *s, 
uint32_t asid)
 return true;
 }
 
-static int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s, uint32_t asid)
+int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s, uint32_t asid)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
 return 0;
@@ -237,7 +237,7 @@ static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s, 
uint32_t asid)
 return true;
 }
 
-static int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s, uint32_t asid)
+int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s, uint32_t asid)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
 return 0;
@@ -436,7 +436,12 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
 iova = mem_region.iova;
 }
 
-vhost_vdpa_dma_batch_begin_once(s, VHOST_VDPA_GUEST_PA_ASID);
+ret = vhost_vdpa_dma_batch_begin_once(s, VHOST_VDPA_GUEST_PA_ASID);
+if (unlikely(ret)) {
+error_report("Can't batch mapping on asid 0 (%p)", s);
+goto fail_map;
+}
+
 ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
  int128_get64(llsize), vaddr, section->readonly);
 if (ret) {
@@ -518,7 +523,11 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
 iova = result->iova;
 vhost_iova_tree_remove(s->iova_tree, *result);
 }
-vhost_vdpa_dma_batch_begin_once(s, VHOST_VDPA_GUEST_PA_ASID);
+ret = vhost_vdpa_dma_batch_begin_once(s, VHOST_VDPA_GUEST_PA_ASID);
+if (ret) {
+error_report("Can't batch mapping on asid 0 (%p)", s);
+}
+
 /*
  * The unmap ioctl doesn't accept a full 64-bit. need to check it
  */
@@ -1396,10 +1405,10 @@ static void *vhost_vdpa_load_map(void *opaque)
  msg->iotlb.size);
 break;
 case VHOST_IOTLB_BATCH_BEGIN:
-vhost_vdpa_dma_batch_begin_once(shared, msg->asid);
+r = vhost_vdpa_dma_batch_begin_once(shared, msg->asid);
 break;
 case VHOST_IOTLB_BATCH_END:
-vhost_vdpa_dma_batch_end_once(shared, msg->asid);
+r = vhost_vdpa_dma_batch_end_once(shared, msg->asid);
 break;
 default:
 error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 219316f..aa13679 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -106,6 +106,9 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, 
hwaddr iova,
hwaddr size, void *vaddr, bool readonly);
 int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
  hwaddr size);
+int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s, uint32_t asid);
+int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s, uint32_t asid);
+
 int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
 int vhost_vdpa_load_cleanup(VhostVDPAShared *s, bool vhost_will_start);
 
-- 
1.8.3.1
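
A short sketch of how an external caller is expected to use the
newly exported pair; shared, asid, iova, size and vaddr are
placeholders, not values taken from the patch:

    int r = vhost_vdpa_dma_batch_begin_once(shared, asid);
    if (unlikely(r)) {
        return r;
    }
    /* any number of map/unmap calls against the same asid go here */
    r = vhost_vdpa_dma_map(shared, asid, iova, size, vaddr, false);
    if (unlikely(r)) {
        /* close the batch anyway so the device sees a balanced pair */
        vhost_vdpa_dma_batch_end_once(shared, asid);
        return r;
    }
    return vhost_vdpa_dma_batch_end_once(shared, asid);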




[PATCH 20/40] vdpa: avoid mapping flush across reset

2023-12-07 Thread Si-Wei Liu
Leverage the IOTLB_PERSIST and DESC_ASID features to achieve
a lighter weight reset path, without resorting to suspend and
resume. It is not the best possible path, but it still offers
significant time savings, which should play its part in
reducing live migration downtime.

It benefits two cases:
  - normal virtio reset in the VM, e.g. guest reboot, which no
    longer has to tear down all iotlb mappings and set them up
    again.
  - SVQ switching, in which the data vq's descriptor table and
    vrings are moved to a different ASID than where its buffers
    reside. Along with the use of persistent iotlb, this saves
    substantial time otherwise spent pinning and mapping
    unnecessarily when moving descriptors into or out of shadow
    mode.

A sketch of the resulting reset decision follows the patch.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 31e0a55..47c764b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -633,6 +633,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
  0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
  0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
  0x1ULL << VHOST_BACKEND_F_DESC_ASID |
+ 0x1ULL << VHOST_BACKEND_F_IOTLB_PERSIST |
  0x1ULL << VHOST_BACKEND_F_SUSPEND;
 int ret;
 
@@ -1493,8 +1494,6 @@ static void vhost_vdpa_maybe_flush_map(struct vhost_dev 
*dev)
 
 static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 {
-struct vhost_vdpa *v = dev->opaque;
-
 if (!vhost_vdpa_last_dev(dev)) {
 return;
 }
@@ -1502,9 +1501,7 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 vhost_vdpa_reset_device(dev);
 vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
VIRTIO_CONFIG_S_DRIVER);
-memory_listener_unregister(&v->shared->listener);
-v->shared->listener_registered = false;
-
+vhost_vdpa_maybe_flush_map(dev);
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
-- 
1.8.3.1
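
Combined with the previous patch, the reset path now decides whether
to keep the maps as summarized below (a restatement of the two diffs,
not new code):

    /*
     * vhost_vdpa_reset_status(), after this patch:
     *
     *   vhost_vdpa_reset_device(dev);
     *   vhost_vdpa_add_status(dev, ACKNOWLEDGE | DRIVER);
     *   vhost_vdpa_maybe_flush_map(dev);
     *       - listener not registered          -> nothing to do
     *       - no VHOST_BACKEND_F_IOTLB_PERSIST -> unregister listener,
     *                                             maps are flushed
     *       - flush_map set by eval_flush      -> unregister listener,
     *                                             maps are flushed
     *       - otherwise                        -> listener and maps kept
     */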




[PATCH 19/40] vdpa: should avoid map flushing with persistent iotlb

2023-12-07 Thread Si-Wei Liu
Today the memory listener is unregistered in vhost_vdpa_reset_status
unconditionally, due to which all the maps will be flushed away
from the iotlb. However, a map flush is not always needed, and
doing it from the performance hot path may have a non-negligible
latency impact that affects VM reboot time or the brownout period
during live migration.

Leverage the IOTLB_PERSIST backend feature, which ensures that
iotlb maps are durable and do not disappear across reset. When it
is supported, we may conditionally keep the maps for cases where
the guest memory mapping doesn't change. Prepare a function so that
the next patch will be able to use it to keep the maps.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events |  1 +
 hw/virtio/vhost-vdpa.c | 20 
 2 files changed, 21 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 77905d1..9725d44 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -66,6 +66,7 @@ vhost_vdpa_set_owner(void *dev) "dev: %p"
 vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t 
avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 
0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
 vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p 
first: 0x%"PRIx64" last: 0x%"PRIx64
 vhost_vdpa_set_config_call(void *dev, int fd)"dev: %p fd: %d"
+vhost_vdpa_maybe_flush_map(void *dev, bool reg, bool flush, bool persist) 
"dev: %p registered: %d flush_map: %d iotlb_persistent: %d"
 
 # virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned 
out_num) "elem %p size %zd in_num %u out_num %u"
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index ea2dfc8..31e0a55 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1471,6 +1471,26 @@ out_stop:
 return ok ? 0 : -1;
 }
 
+static void vhost_vdpa_maybe_flush_map(struct vhost_dev *dev)
+{
+struct vhost_vdpa *v = dev->opaque;
+
+trace_vhost_vdpa_maybe_flush_map(dev, v->shared->listener_registered,
+ v->shared->flush_map,
+ !!(dev->backend_cap &
+ BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)));
+
+if (!v->shared->listener_registered) {
+return;
+}
+
+if (!(dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) ||
+v->shared->flush_map) {
+memory_listener_unregister(&v->shared->listener);
+v->shared->listener_registered = false;
+}
+}
+
 static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
-- 
1.8.3.1




[PATCH 26/40] vdpa: return int for dma_batch_once API

2023-12-07 Thread Si-Wei Liu
Return zero for success for now. Prepare for non-zero return
in the next few patches.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 2db2832..e0137f0 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -187,16 +187,18 @@ static bool vhost_vdpa_map_batch_begin(VhostVDPAShared 
*s, uint32_t asid)
 return true;
 }
 
-static void vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s)
+static int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
 s->iotlb_batch_begin_sent) {
-return;
+return 0;
 }
 
 if (vhost_vdpa_map_batch_begin(s, 0)) {
 s->iotlb_batch_begin_sent = true;
 }
+
+return 0;
 }
 
 static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s, uint32_t asid)
@@ -225,19 +227,21 @@ static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s, 
uint32_t asid)
 return true;
 }
 
-static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
+static int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
-return;
+return 0;
 }
 
 if (!s->iotlb_batch_begin_sent) {
-return;
+return 0;
 }
 
 if (vhost_vdpa_dma_batch_end(s, 0)) {
 s->iotlb_batch_begin_sent = false;
 }
+
+return 0;
 }
 
 static void vhost_vdpa_listener_commit(MemoryListener *listener)
-- 
1.8.3.1




[PATCH 03/40] vdpa: probe descriptor group index for data vqs

2023-12-07 Thread Si-Wei Liu
Getting it ahead of time, at initialization instead of at start
time, allows decision making independent of device status, while
reducing the possibility of failure when starting the device or
during migration.

Add function vhost_vdpa_probe_desc_group() for that end. This
function will be used to probe the descriptor group for data vqs.
A usage sketch follows the patch.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 89 
 1 file changed, 89 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 887c329..0cf3147 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -1688,6 +1688,95 @@ out:
 return r;
 }
 
+static int vhost_vdpa_probe_desc_group(int device_fd, uint64_t features,
+   int vq_index, int64_t *desc_grpidx,
+   Error **errp)
+{
+uint64_t backend_features;
+int64_t vq_group, desc_group;
+uint8_t saved_status = 0;
+uint8_t status = 0;
+int r;
+
+ERRP_GUARD();
+
+r = ioctl(device_fd, VHOST_GET_BACKEND_FEATURES, &backend_features);
+if (unlikely(r < 0)) {
+error_setg_errno(errp, errno, "Cannot get vdpa backend_features");
+return r;
+}
+
+if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID))) {
+return 0;
+}
+
+if (!(backend_features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID))) {
+return 0;
+}
+
+r = ioctl(device_fd, VHOST_VDPA_GET_STATUS, &saved_status);
+if (unlikely(r)) {
+error_setg_errno(errp, -r, "Cannot get device status");
+goto out;
+}
+
+r = ioctl(device_fd, VHOST_VDPA_SET_STATUS, &status);
+if (unlikely(r)) {
+error_setg_errno(errp, -r, "Cannot reset device");
+goto out;
+}
+
+r = ioctl(device_fd, VHOST_SET_FEATURES, &features);
+if (unlikely(r)) {
+error_setg_errno(errp, errno, "Cannot set features");
+}
+
+status = VIRTIO_CONFIG_S_ACKNOWLEDGE |
+ VIRTIO_CONFIG_S_DRIVER |
+ VIRTIO_CONFIG_S_FEATURES_OK;
+
+r = ioctl(device_fd, VHOST_VDPA_SET_STATUS, &status);
+if (unlikely(r)) {
+error_setg_errno(errp, -r, "Cannot set device status");
+goto out;
+}
+
+vq_group = vhost_vdpa_get_vring_group(device_fd, vq_index, errp);
+if (unlikely(vq_group < 0)) {
+if (vq_group != -ENOTSUP) {
+r = vq_group;
+goto out;
+}
+
+/*
+ * The kernel report VHOST_BACKEND_F_IOTLB_ASID if the vdpa frontend
+ * support ASID even if the parent driver does not.
+ */
+error_free(*errp);
+*errp = NULL;
+r = 0;
+goto out;
+}
+
+desc_group = vhost_vdpa_get_vring_desc_group(device_fd, vq_index,
+ errp);
+if (unlikely(desc_group < 0)) {
+r = desc_group;
+goto out;
+} else if (desc_group != vq_group) {
+*desc_grpidx = desc_group;
+}
+r = 1;
+
+out:
+status = 0;
+ioctl(device_fd, VHOST_VDPA_SET_STATUS, &status);
+if (saved_status) {
+ioctl(device_fd, VHOST_VDPA_SET_STATUS, &saved_status);
+}
+return r;
+}
+
 static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
const char *device,
const char *name,
-- 
1.8.3.1
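
A usage sketch of the new probe at init time. The caller below is
hypothetical (the real caller appears later in the series); only the
vhost_vdpa_probe_desc_group() signature is taken from the patch:

    /* Record the probed descriptor group of each data vq, or -1. */
    static int probe_data_vq_desc_groups(int device_fd, uint64_t features,
                                         int num_data_vqs,
                                         int64_t *desc_grpidx, Error **errp)
    {
        for (int i = 0; i < num_data_vqs; i++) {
            desc_grpidx[i] = -1;
            int r = vhost_vdpa_probe_desc_group(device_fd, features, i,
                                                &desc_grpidx[i], errp);
            if (unlikely(r < 0)) {
                return r;          /* probing failed, reported via errp */
            }
            /* desc_grpidx[i] is only written when the vq has a dedicated
             * descriptor group; otherwise it stays -1. */
        }
        return 0;
    }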




[PATCH 24/40] vdpa: factor out vhost_vdpa_dma_batch_end

2023-12-07 Thread Si-Wei Liu
Refactoring only. No functional change.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events |  2 +-
 hw/virtio/vhost-vdpa.c | 30 ++
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index b0239b8..3411a07 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -33,7 +33,7 @@ vhost_user_create_notifier(int idx, void *n) "idx:%d n:%p"
 vhost_vdpa_dma_map(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, 
uint64_t iova, uint64_t size, uint64_t uaddr, uint8_t perm, uint8_t type) 
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" 
size: 0x%"PRIx64" uaddr: 0x%"PRIx64" perm: 0x%"PRIx8" type: %"PRIu8
 vhost_vdpa_dma_unmap(void *vdpa, int fd, uint32_t msg_type, uint32_t asid, 
uint64_t iova, uint64_t size, uint8_t type) "vdpa_shared:%p fd: %d msg_type: 
%"PRIu32" asid: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" type: %"PRIu8
 vhost_vdpa_map_batch_begin(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
-vhost_vdpa_listener_commit(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
+vhost_vdpa_dma_batch_end(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa_shared:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
 vhost_vdpa_listener_region_add_unaligned(void *v, const char *name, uint64_t 
offset_as, uint64_t offset_page) "vdpa_shared: %p region %s 
offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
 vhost_vdpa_listener_region_add(void *vdpa, uint64_t iova, uint64_t llend, void 
*vaddr, bool readonly) "vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64" vaddr: %p 
read-only: %d"
 vhost_vdpa_listener_region_del_unaligned(void *v, const char *name, uint64_t 
offset_as, uint64_t offset_page) "vdpa_shared: %p region %s 
offset_within_address_space %"PRIu64" offset_within_region %"PRIu64
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index a6c6fe5..999a97a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -198,19 +198,11 @@ static void 
vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s)
 }
 }
 
-static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
+static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s)
 {
 struct vhost_msg_v2 msg = {};
 int fd = s->device_fd;
 
-if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
-return;
-}
-
-if (!s->iotlb_batch_begin_sent) {
-return;
-}
-
 msg.type = VHOST_IOTLB_MSG_V2;
 msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
@@ -220,16 +212,30 @@ static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared 
*s)
 *new_msg = msg;
 g_async_queue_push(s->map_queue, new_msg);
 
-return;
+return false;
 }
 
-trace_vhost_vdpa_listener_commit(s, fd, msg.type, msg.iotlb.type);
+trace_vhost_vdpa_dma_batch_end(s, fd, msg.type, msg.iotlb.type);
if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
 }
+return true;
+}
+
+static void vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
+{
+if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
+return;
+}
+
+if (!s->iotlb_batch_begin_sent) {
+return;
+}
 
-s->iotlb_batch_begin_sent = false;
+if (vhost_vdpa_dma_batch_end(s)) {
+s->iotlb_batch_begin_sent = false;
+}
 }
 
 static void vhost_vdpa_listener_commit(MemoryListener *listener)
-- 
1.8.3.1




[PATCH 32/40] vdpa: factor out vhost_vdpa_net_get_nc_vdpa

2023-12-07 Thread Si-Wei Liu
Introduce a new API. No functional change to the existing API.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1c1d61f..683619f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -290,13 +290,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 }
 
 
-/** From any vdpa net client, get the netclient of the first queue pair */
-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+/** From any vdpa net client, get the netclient of the i-th queue pair */
+static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i)
 {
 NICState *nic = qemu_get_nic(s->nc.peer);
-NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+NetClientState *nc_i = qemu_get_peer(nic->ncs, i);
+
+return DO_UPCAST(VhostVDPAState, nc, nc_i);
+}
 
-return DO_UPCAST(VhostVDPAState, nc, nc0);
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+return vhost_vdpa_net_get_nc_vdpa(s, 0);
 }
 
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
-- 
1.8.3.1




[PATCH 08/40] vdpa: add back vhost_vdpa_net_first_nc_vdpa

2023-12-07 Thread Si-Wei Liu
A previous commit removed it. Add it back, because this function
will be needed by the next patches.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index dbfa192..c9bfc6f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -287,6 +287,16 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 return size;
 }
 
+
+/** From any vdpa net client, get the netclient of the first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
 {
 struct vhost_vdpa *v = >vhost_vdpa;
@@ -566,7 +576,7 @@ dma_map_err:
 
 static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
 struct vhost_vdpa *v;
 int64_t cvq_group;
 int r;
@@ -577,7 +587,8 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 s = DO_UPCAST(VhostVDPAState, nc, nc);
v = &s->vhost_vdpa;
 
-v->shadow_vqs_enabled = v->shared->shadow_data;
+s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
 s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
 
 if (v->shared->shadow_data) {
-- 
1.8.3.1




[PATCH 27/40] vdpa: add asid to all dma_batch call sites

2023-12-07 Thread Si-Wei Liu
Will allow other callers to specify the asid when calling the
dma_batch API.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e0137f0..d3f5721 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -187,14 +187,14 @@ static bool vhost_vdpa_map_batch_begin(VhostVDPAShared 
*s, uint32_t asid)
 return true;
 }
 
-static int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s)
+static int vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s, uint32_t asid)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
 s->iotlb_batch_begin_sent) {
 return 0;
 }
 
-if (vhost_vdpa_map_batch_begin(s, 0)) {
+if (vhost_vdpa_map_batch_begin(s, asid)) {
 s->iotlb_batch_begin_sent = true;
 }
 
@@ -227,7 +227,7 @@ static bool vhost_vdpa_dma_batch_end(VhostVDPAShared *s, 
uint32_t asid)
 return true;
 }
 
-static int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
+static int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s, uint32_t asid)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH))) {
 return 0;
@@ -237,7 +237,7 @@ static int vhost_vdpa_dma_batch_end_once(VhostVDPAShared *s)
 return 0;
 }
 
-if (vhost_vdpa_dma_batch_end(s, 0)) {
+if (vhost_vdpa_dma_batch_end(s, asid)) {
 s->iotlb_batch_begin_sent = false;
 }
 
@@ -248,7 +248,7 @@ static void vhost_vdpa_listener_commit(MemoryListener 
*listener)
 {
 VhostVDPAShared *s = container_of(listener, VhostVDPAShared, listener);
 
-vhost_vdpa_dma_batch_end_once(s);
+vhost_vdpa_dma_batch_end_once(s, VHOST_VDPA_GUEST_PA_ASID);
 }
 
 static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
@@ -423,7 +423,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
 iova = mem_region.iova;
 }
 
-vhost_vdpa_dma_batch_begin_once(s);
+vhost_vdpa_dma_batch_begin_once(s, VHOST_VDPA_GUEST_PA_ASID);
 ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
  int128_get64(llsize), vaddr, section->readonly);
 if (ret) {
@@ -505,7 +505,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
 iova = result->iova;
 vhost_iova_tree_remove(s->iova_tree, *result);
 }
-vhost_vdpa_dma_batch_begin_once(s);
+vhost_vdpa_dma_batch_begin_once(s, VHOST_VDPA_GUEST_PA_ASID);
 /*
  * The unmap ioctl doesn't accept a full 64-bit. need to check it
  */
@@ -1383,10 +1383,10 @@ static void *vhost_vdpa_load_map(void *opaque)
  msg->iotlb.size);
 break;
 case VHOST_IOTLB_BATCH_BEGIN:
-vhost_vdpa_dma_batch_begin_once(shared);
+vhost_vdpa_dma_batch_begin_once(shared, msg->asid);
 break;
 case VHOST_IOTLB_BATCH_END:
-vhost_vdpa_dma_batch_end_once(shared);
+vhost_vdpa_dma_batch_end_once(shared, msg->asid);
 break;
 default:
 error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
-- 
1.8.3.1
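
To illustrate the call pattern this enables, below is a rough sketch (not taken from the series) that brackets a group of maps for one ASID with the batch helpers. The begin/end helpers and vhost_vdpa_dma_map() are the real functions touched above; the wrapper and its DMAMap array argument are placeholders for illustration:

/* Sketch only: batch several DMA maps under a single ASID. */
static int map_regions_batched(VhostVDPAShared *s, uint32_t asid,
                               const DMAMap *maps, size_t n)
{
    int r = 0;

    vhost_vdpa_dma_batch_begin_once(s, asid);
    for (size_t i = 0; i < n; i++) {
        /* DMAMap.size is inclusive, hence the +1 */
        r = vhost_vdpa_dma_map(s, asid, maps[i].iova, maps[i].size + 1,
                               (void *)(uintptr_t)maps[i].translated_addr,
                               maps[i].perm == IOMMU_RO);
        if (r) {
            break;
        }
    }
    vhost_vdpa_dma_batch_end_once(s, asid);
    return r;
}

With batching acked by the backend, the kernel sees one VHOST_IOTLB_BATCH_BEGIN/END pair around the whole group instead of per-map updates.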




[PATCH 16/40] vdpa: indicate SVQ switching via flag

2023-12-07 Thread Si-Wei Liu
svq_switching indicates that an SVQ mode switch is ongoing.
Positive (1) means switching from normal passthrough mode to
SVQ mode, negative (-1) means switching from SVQ back to
passthrough, and zero (0) indicates that no SVQ mode switch
is taking place.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 7b8f047..04718b2 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -320,6 +320,7 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState 
*s, bool enable)
 data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
 cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
   n->max_ncs - n->max_queue_pairs : 0;
+v->shared->svq_switching = enable ? 1 : -1;
 /*
  * TODO: vhost_net_stop does suspend, get_base and reset. We can be smarter
  * in the future and resume the device if read-only operations between
@@ -332,6 +333,7 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState 
*s, bool enable)
 if (unlikely(r < 0)) {
 error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
 }
+v->shared->svq_switching = 0;
 }
 
 static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
-- 
1.8.3.1
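
A hedged sketch of how later patches in the series (e.g. the map-flush decision) might consult this flag; the field comes from the patch above, while the helper below is purely illustrative and not part of the series:

/* Illustrative helper: a non-zero svq_switching means an SVQ mode
 * switch is in flight, in either direction. */
static bool svq_switch_in_progress(const VhostVDPAShared *shared)
{
    return shared->svq_switching != 0;
}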




[PATCH 37/40] vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode

2023-12-07 Thread Si-Wei Liu
For better debuggability and observability.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events | 2 +-
 hw/virtio/vhost-vdpa.c | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index a8d3321..5085607 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -57,7 +57,7 @@ vhost_vdpa_dev_start(void *dev, bool started) "dev: %p 
started: %d"
 vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int 
refcnt, int fd, void *log) "dev: %p base: 0x%"PRIx64" size: %llu refcnt: %d fd: 
%d log: %p"
 vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, 
uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, 
uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" 
log_guest_addr: 0x%"PRIx64
 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
-vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
+vhost_vdpa_set_dev_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index d66936f..ff4f218 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1043,7 +1043,10 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, 
uint8_t *config,
 static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev,
  struct vhost_vring_state *ring)
 {
-trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
+struct vhost_vdpa *v = dev->opaque;
+
+trace_vhost_vdpa_set_dev_vring_base(dev, ring->index, ring->num,
+v->shadow_vqs_enabled);
 return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
 }
 
-- 
1.8.3.1




[PATCH 36/40] vdpa: add vhost_vdpa_get_vring_base trace for svq mode

2023-12-07 Thread Si-Wei Liu
For better debuggability and observability.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events | 2 +-
 hw/virtio/vhost-vdpa.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 196f32f..a8d3321 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -58,7 +58,7 @@ vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned 
long long size, int r
 vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, 
uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, 
uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" 
log_guest_addr: 0x%"PRIx64
 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
 vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
-vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
+vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 
0x%"PRIx64
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 8ba390d..d66936f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1607,6 +1607,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
 
 if (v->shadow_vqs_enabled) {
 ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
+trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, true);
 return 0;
 }
 
@@ -1619,7 +1620,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
 }
 
 ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring);
-trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num);
+trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, false);
 return ret;
 }
 
-- 
1.8.3.1




[PATCH 01/40] linux-headers: add vhost_types.h and vhost.h

2023-12-07 Thread Si-Wei Liu
Signed-off-by: Si-Wei Liu 
---
 include/standard-headers/linux/vhost_types.h | 13 +
 linux-headers/linux/vhost.h  |  9 +
 2 files changed, 22 insertions(+)

diff --git a/include/standard-headers/linux/vhost_types.h 
b/include/standard-headers/linux/vhost_types.h
index 5ad07e1..c39199b 100644
--- a/include/standard-headers/linux/vhost_types.h
+++ b/include/standard-headers/linux/vhost_types.h
@@ -185,5 +185,18 @@ struct vhost_vdpa_iova_range {
  * DRIVER_OK
  */
 #define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
+/* Device can be resumed */
+#define VHOST_BACKEND_F_RESUME  0x5
+/* Device supports the driver enabling virtqueues both before and after
+ * DRIVER_OK
+ */
+#define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
+/* Device may expose the virtqueue's descriptor area, driver area and
+ * device area to a different group for ASID binding than where its
+ * buffers may reside. Requires VHOST_BACKEND_F_IOTLB_ASID.
+ */
+#define VHOST_BACKEND_F_DESC_ASID0x7
+/* IOTLB don't flush memory mapping across device reset */
+#define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
 
 #endif
diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
index f5c48b6..c61c687 100644
--- a/linux-headers/linux/vhost.h
+++ b/linux-headers/linux/vhost.h
@@ -219,4 +219,13 @@
  */
 #define VHOST_VDPA_RESUME  _IO(VHOST_VIRTIO, 0x7E)
 
+/* Get the dedicated group for the descriptor table of a virtqueue:
+ * read index, write group in num.
+ * The virtqueue index is stored in the index field of vhost_vring_state.
+ * The group id for the descriptor table of this specific virtqueue
+ * is returned via num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_GET_VRING_DESC_GROUP_IOWR(VHOST_VIRTIO, 0x7F,   
\
+ struct vhost_vring_state)
+
 #endif
-- 
1.8.3.1
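
For reference, a minimal user-space sketch of how the new backend feature bits could be probed before relying on descriptor ASID or persistent IOTLB. VHOST_GET_BACKEND_FEATURES is the existing vhost ioctl; the wrapper function and its error handling are illustrative assumptions:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Sketch: check whether a vhost-vdpa backend advertises the capabilities
 * added above.  'fd' is an open /dev/vhost-vdpa-* file descriptor. */
static int backend_has_desc_asid_and_persist(int fd)
{
    uint64_t features = 0;

    if (ioctl(fd, VHOST_GET_BACKEND_FEATURES, &features) < 0) {
        return -1;
    }
    return !!(features & (1ULL << VHOST_BACKEND_F_DESC_ASID)) &&
           !!(features & (1ULL << VHOST_BACKEND_F_IOTLB_PERSIST));
}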




[PATCH 10/40] vdpa: assign svq descriptors a separate ASID when possible

2023-12-07 Thread Si-Wei Liu
When the backend supports the VHOST_BACKEND_F_DESC_ASID feature
and all the data vqs can use one or more descriptor groups to
host SVQ vrings and descriptors, we assign those groups an ASID
different from the one where the guest buffers reside in guest
memory address space. With this dedicated ASID for SVQs, the
IOVA the vdpa device cares about effectively becomes the GPA,
so there is no need to translate IOVA addresses. For this
reason, shadow_data can be turned off accordingly. That doesn't
mean SVQ is not enabled, just that no translation is needed
from the iova tree's perspective.

We can reuse CVQ's address space ID to host SVQ descriptors
because both CVQ and SVQ are emulated in the same QEMU process,
which shares the same VA address space.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c |  5 -
 net/vhost-vdpa.c   | 57 ++
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 24844b5..30dff95 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -627,6 +627,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
 uint64_t qemu_backend_features = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
  0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
  0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
+ 0x1ULL << VHOST_BACKEND_F_DESC_ASID |
  0x1ULL << VHOST_BACKEND_F_SUSPEND;
 int ret;
 
@@ -1249,7 +1250,9 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
 goto err;
 }
 
-vhost_svq_start(svq, dev->vdev, vq, v->shared->iova_tree);
+vhost_svq_start(svq, dev->vdev, vq,
+v->desc_group >= 0 && v->address_space_id ?
+NULL : v->shared->iova_tree);
 ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
 if (unlikely(!ok)) {
 goto err_map;
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 2555897..aebaa53 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -366,20 +366,50 @@ static int vhost_vdpa_set_address_space_id(struct 
vhost_vdpa *v,
 static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
 {
 struct vhost_vdpa *v = &s->vhost_vdpa;
+int r;
 
 migration_add_notifier(>migration_state,
vdpa_net_migration_state_notifier);
 
+if (!v->shadow_vqs_enabled) {
+if (v->desc_group >= 0 &&
+v->address_space_id != VHOST_VDPA_GUEST_PA_ASID) {
+vhost_vdpa_set_address_space_id(v, v->desc_group,
+VHOST_VDPA_GUEST_PA_ASID);
+s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
+}
+return;
+}
+
 /* iova_tree may be initialized by vhost_vdpa_net_load_setup */
-if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
+if (!v->shared->iova_tree) {
 v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
v->shared->iova_range.last);
 }
+
+if (s->always_svq || v->desc_group < 0) {
+return;
+}
+
+r = vhost_vdpa_set_address_space_id(v, v->desc_group,
+VHOST_VDPA_NET_CVQ_ASID);
+if (unlikely(r < 0)) {
+/* The other data vqs should also fall back to using the same ASID */
+s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
+return;
+}
+
+/* No translation needed on data SVQ when descriptor group is used */
+s->vhost_vdpa.address_space_id = VHOST_VDPA_NET_CVQ_ASID;
+s->vhost_vdpa.shared->shadow_data = false;
+return;
 }
 
 static int vhost_vdpa_net_data_start(NetClientState *nc)
 {
 VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
+
 struct vhost_vdpa *v = &s->vhost_vdpa;
 
 assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
@@ -397,6 +427,18 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
 return 0;
 }
 
+if (v->desc_group >= 0 && v->desc_group != s0->vhost_vdpa.desc_group) {
+unsigned asid;
+asid = v->shadow_vqs_enabled ?
+s0->vhost_vdpa.address_space_id : VHOST_VDPA_GUEST_PA_ASID;
+if (asid != s->vhost_vdpa.address_space_id) {
+vhost_vdpa_set_address_space_id(v, v->desc_group, asid);
+}
+s->vhost_vdpa.address_space_id = asid;
+} else {
+s->vhost_vdpa.address_space_id = s0->vhost_vdpa.address_space_id;
+}
+
 return 0;
 }
 
@@ -603,13 +645,19 @@ static int vhost_vdpa_net_cvq_start(

[PATCH 23/40] vdpa: vhost_vdpa_dma_batch_begin_once rename

2023-12-07 Thread Si-Wei Liu
No functional changes. Rename only.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 7a1b7f4..a6c6fe5 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -186,7 +186,7 @@ static bool vhost_vdpa_map_batch_begin(VhostVDPAShared *s)
 return true;
 }
 
-static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
+static void vhost_vdpa_dma_batch_begin_once(VhostVDPAShared *s)
 {
 if (!(s->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH)) ||
 s->iotlb_batch_begin_sent) {
@@ -411,7 +411,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
 iova = mem_region.iova;
 }
 
-vhost_vdpa_iotlb_batch_begin_once(s);
+vhost_vdpa_dma_batch_begin_once(s);
 ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
  int128_get64(llsize), vaddr, section->readonly);
 if (ret) {
@@ -493,7 +493,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
 iova = result->iova;
 vhost_iova_tree_remove(s->iova_tree, *result);
 }
-vhost_vdpa_iotlb_batch_begin_once(s);
+vhost_vdpa_dma_batch_begin_once(s);
 /*
  * The unmap ioctl doesn't accept a full 64-bit. need to check it
  */
@@ -1371,7 +1371,7 @@ static void *vhost_vdpa_load_map(void *opaque)
  msg->iotlb.size);
 break;
 case VHOST_IOTLB_BATCH_BEGIN:
-vhost_vdpa_iotlb_batch_begin_once(shared);
+vhost_vdpa_dma_batch_begin_once(shared);
 break;
 case VHOST_IOTLB_BATCH_END:
 vhost_vdpa_dma_batch_end_once(shared);
-- 
1.8.3.1




[PATCH 02/40] vdpa: add vhost_vdpa_get_vring_desc_group

2023-12-07 Thread Si-Wei Liu
Internal API to get the descriptor group index for a specific virtqueue
through the VHOST_VDPA_GET_VRING_DESC_GROUP ioctl.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 90f4128..887c329 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -471,6 +471,25 @@ static int64_t vhost_vdpa_get_vring_group(int device_fd, 
unsigned vq_index,
 return state.num;
 }
 
+static int64_t vhost_vdpa_get_vring_desc_group(int device_fd,
+   unsigned vq_index,
+   Error **errp)
+{
+struct vhost_vring_state state = {
+.index = vq_index,
+};
+int r = ioctl(device_fd, VHOST_VDPA_GET_VRING_DESC_GROUP, &state);
+
+if (unlikely(r < 0)) {
+r = -errno;
+error_setg_errno(errp, errno, "Cannot get VQ %u descriptor group",
+ vq_index);
+return r;
+}
+
+return state.num;
+}
+
 static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v,
unsigned vq_group,
unsigned asid_num)
-- 
1.8.3.1
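
A hedged example of how this getter could be combined with the existing vhost_vdpa_get_vring_group() to tell whether a vq's descriptor table sits in its own group, which is the premise for the probing done in patches 03/04. The comparison wrapper itself is illustrative, not lifted from the series:

/* Sketch only: 1 if vq_index has a dedicated descriptor group, 0 if it
 * shares the buffer group, negative errno on failure. */
static int vq_has_dedicated_desc_group(int device_fd, unsigned vq_index,
                                       Error **errp)
{
    int64_t buf_group = vhost_vdpa_get_vring_group(device_fd, vq_index, errp);
    int64_t desc_group;

    if (buf_group < 0) {
        return buf_group;
    }
    desc_group = vhost_vdpa_get_vring_desc_group(device_fd, vq_index, errp);
    if (desc_group < 0) {
        return desc_group;
    }
    return desc_group != buf_group;
}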




[PATCH 05/40] vdpa: populate desc_group from net_vhost_vdpa_init

2023-12-07 Thread Si-Wei Liu
Add the desc_group field to struct vhost_vdpa, and get it
populated when the corresponding vq is initialized at
net_vhost_vdpa_init. If the vq does not have descriptor
group capability, or it doesn't have a dedicated ASID
group to host descriptors other than the data buffers,
desc_group will be set to a negative value (-1).

Signed-off-by: Si-Wei Liu 
---
 include/hw/virtio/vhost-vdpa.h |  1 +
 net/vhost-vdpa.c   | 15 +--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 6533ad2..63493ff 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -87,6 +87,7 @@ typedef struct vhost_vdpa {
 Error *migration_blocker;
 VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 IOMMUNotifier n;
+int64_t desc_group;
 } VhostVDPA;
 
 int vhost_vdpa_get_iova_range(int fd, struct vhost_vdpa_iova_range 
*iova_range);
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index cb5705d..1a738b2 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -1855,11 +1855,22 @@ static NetClientState 
*net_vhost_vdpa_init(NetClientState *peer,
 
 ret = vhost_vdpa_add(nc, (void *)&s->vhost_vdpa, queue_pair_index, nvqs);
 if (ret) {
-qemu_del_net_client(nc);
-return NULL;
+goto err;
 }
 
+if (is_datapath) {
+ret = vhost_vdpa_probe_desc_group(vdpa_device_fd, features,
+  0, &desc_group, errp);
+if (unlikely(ret < 0)) {
+goto err;
+}
+}
+s->vhost_vdpa.desc_group = desc_group;
 return nc;
+
+err:
+qemu_del_net_client(nc);
+return NULL;
 }
 
 static int vhost_vdpa_get_features(int fd, uint64_t *features, Error **errp)
-- 
1.8.3.1
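
For illustration, a tiny hedged helper showing how the -1 sentinel might be interpreted downstream; patch 10 applies the same desc_group >= 0 test inline, and the named helper here is not part of the series:

/* Sketch: a vq can get a dedicated descriptor ASID only when probing
 * found a separate descriptor group (desc_group >= 0). */
static bool vhost_vdpa_can_use_desc_asid(const struct vhost_vdpa *v)
{
    return v->desc_group >= 0;
}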




[PATCH 00/40] vdpa-net: improve migration downtime through descriptor ASID and persistent IOTLB

2023-12-07 Thread Si-Wei Liu
This patch series contains several enhancements to SVQ live migration downtime
for vDPA-net hardware devices, specifically on mlx5_vdpa. It is currently based
on Eugenio's RFC v2 .load_setup series [1] to utilize the shared facility and
reduce friction in merging or duplicating code where possible.

The patches are stacked in a particular order, as each optimization toward the
top depends on the ones below it. Here's a breakdown of what each part does:

Patch #  |  Feature / optimization
---------+-----------------------------------------------------------------
35 - 40  | trace events
34       | migrate_cancel bug fix
21 - 33  | (Un)map batching at stop-n-copy to further optimize LM downtime
11 - 20  | persistent IOTLB [3] to improve LM downtime
02 - 10  | SVQ descriptor ASID [2] to optimize SVQ switching
01       | dependent linux headers

Let's first define 2 sources of downtime that this work is concerned with:

* SVQ switching downtime (Downtime #1): downtime at the start of migration.
  Time spent on teardown and setup for SVQ mode switching; this downtime is
  regarded as the maximum time for an individual vdpa-net device. No memory
  transfer is involved during SVQ switching.

* LM downtime (Downtime #2): aggregated downtime for all vdpa-net devices on
  resource teardown and setup in the last stop-n-copy phase on the source host.

With each part of the optimizations applied bottom up, the effective downtime
(in seconds) can be observed in the table below:


                    | Downtime #1 | Downtime #2
--------------------+-------------+------------
Baseline QEMU       |  20s ~ 30s  |    20s
Iterative map       |             |
at destination [1]  |     5s      |    20s
SVQ descriptor      |             |
ASID [2]            |     2s      |     5s
persistent IOTLB    |             |
  [3]               |     2s      |     2s
(Un)map batching    |             |
at stop-n-copy      |    1.7s     |   1.5s
before switchover   |             |

(VM config: 128GB mem, 2 mlx5_vdpa devices, each w/ 4 data vqs)

Please find the details regarding each enhancement in the commit logs.

Thanks,
-Siwei


[1] [RFC PATCH v2 00/10] Map memory at destination .load_setup in vDPA-net 
migration
https://lists.nongnu.org/archive/html/qemu-devel/2023-11/msg05711.html
[2] VHOST_BACKEND_F_DESC_ASID
https://lore.kernel.org/virtualization/20231018171456.1624030-2-dtatu...@nvidia.com/
[3] VHOST_BACKEND_F_IOTLB_PERSIST
https://lore.kernel.org/virtualization/1698304480-18463-1-git-send-email-si-wei@oracle.com/

---

Si-Wei Liu (40):
  linux-headers: add vhost_types.h and vhost.h
  vdpa: add vhost_vdpa_get_vring_desc_group
  vdpa: probe descriptor group index for data vqs
  vdpa: piggyback desc_group index when probing isolated cvq
  vdpa: populate desc_group from net_vhost_vdpa_init
  vhost: make svq work with gpa without iova translation
  vdpa: move around vhost_vdpa_set_address_space_id
  vdpa: add back vhost_vdpa_net_first_nc_vdpa
  vdpa: no repeat setting shadow_data
  vdpa: assign svq descriptors a separate ASID when possible
  vdpa: factor out vhost_vdpa_last_dev
  vdpa: check map_thread_enabled before join maps thread
  vdpa: ref counting VhostVDPAShared
  vdpa: convert iova_tree to ref count based
  vdpa: add svq_switching and flush_map to header
  vdpa: indicate SVQ switching via flag
  vdpa: judge if map can be kept across reset
  vdpa: unregister listener on last dev cleanup
  vdpa: should avoid map flushing with persistent iotlb
  vdpa: avoid mapping flush across reset
  vdpa: vhost_vdpa_dma_batch_end_once rename
  vdpa: factor out vhost_vdpa_map_batch_begin
  vdpa: vhost_vdpa_dma_batch_begin_once rename
  vdpa: factor out vhost_vdpa_dma_batch_end
  vdpa: add asid to dma_batch_once API
  vdpa: return int for dma_batch_once API
  vdpa: add asid to all dma_batch call sites
  vdpa: support iotlb_batch_asid
  vdpa: expose API vhost_vdpa_dma_batch_once
  vdpa: batch map/unmap op per svq pair basis
  vdpa: batch map and unmap around cvq svq start/stop
  vdpa: factor out vhost_vdpa_net_get_nc_vdpa
  vdpa: batch multiple dma_unmap to a single call for vm stop
  vdpa: fix network breakage after cancelling migration
  vdpa: add vhost_vdpa_set_address_space_id trace
  vdpa: add vhost_vdpa_get_vring_base trace for svq mode
  vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode
  vdpa: add trace events for eval_flush
  vdpa: add trace events for vhost_vdpa_net_load_cmd
  vdpa: add trace event for vhost_vdpa_net_load_mq

 hw/virtio/trace-events   |   9 +-
 hw/virtio/vhost-shadow-virtqueue.c

Re: [PATCH 9.0 04/13] vdpa: move shadow_data to vhost_vdpa_shared

2023-12-05 Thread Si-Wei Liu
>vhost_vdpa;
  
-s0 = vhost_vdpa_net_first_nc_vdpa(s);

-v->shadow_data = s0->vhost_vdpa.shadow_vqs_enabled;
-v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
+    v->shadow_vqs_enabled = v->shared->shadow_data;

This new code looks fine.

Reviewed-by: Si-Wei Liu 


  s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
  
-if (s->vhost_vdpa.shadow_data) {

+if (v->shared->shadow_data) {
  /* SVQ is already configured for all virtqueues */
  goto out;
  }
@@ -1688,12 +1677,12 @@ static NetClientState 
*net_vhost_vdpa_init(NetClientState *peer,
  s->always_svq = svq;
  s->migration_state.notify = NULL;
  s->vhost_vdpa.shadow_vqs_enabled = svq;
-s->vhost_vdpa.shadow_data = svq;
  if (queue_pair_index == 0) {
  vhost_vdpa_net_valid_svq_features(features,
&s->vhost_vdpa.migration_blocker);
  s->vhost_vdpa.shared = g_new0(VhostVDPAShared, 1);
  s->vhost_vdpa.shared->iova_range = iova_range;
+s->vhost_vdpa.shared->shadow_data = svq;
  } else if (!is_datapath) {
  s->cvq_cmd_out_buffer = mmap(NULL, vhost_vdpa_net_cvq_cmd_page_len(),
   PROT_READ | PROT_WRITE,





Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration

2023-12-05 Thread Si-Wei Liu




On 12/5/2023 6:23 AM, Eugenio Perez Martin wrote:

On Fri, Nov 3, 2023 at 9:19 PM Si-Wei Liu  wrote:



On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:

On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu  wrote:


On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Current memory operations like pinning may take a lot of time at the

destination.  Currently they are done after the source of the migration is

stopped, and before the workload is resumed at the destination.  This is a

period where neither traffic can flow, nor the VM workload can continue

(downtime).



We can do better as we know the memory layout of the guest RAM at the

destination from the moment the migration starts.  Moving that operation allows

QEMU to communicate the kernel the maps while the workload is still running in

the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,

but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still

saving all the pinning time.

I get what you want to say, though not sure how pinning is relevant to
on-chip IOMMU and .set_map here, essentially pinning is required for all
parent vdpa drivers that perform DMA hence don't want VM pages to move
around.

Basically highlighting that the work done under .set_map is not only
pinning, but it is a significant fraction of it. It can be reworded or
deleted for sure.



Note that further devices setup at the end of the migration may alter the guest

memory layout. But same as the previous point, many operations are still done

incrementally, like memory pinning, so we're saving time anyway.



The first bunch of patches just reorganizes the code, so memory related

operation parameters are shared between all vhost_vdpa devices.  This is

because the destination does not know what vhost_vdpa struct will have the

registered listener member, so it is easier to place them in a shared struct

rather to keep them in vhost_vdpa struct.  Future version may squash or omit

these patches.

It looks this VhostVDPAShared facility (patch 1-13) is also what I need
in my SVQ descriptor group series [*], for which I've built similar
construct there. If possible please try to merge this in ASAP. I'll
rework my series on top of that.

[*]
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5


I can send it individually, for sure.

MST, Jason, can this first part be merged? It doesn't add a lot by
itself but it helps pave the way for future changes.

If it cannot, it doesn't matter. I can pick it from here and get my
series posted with your patches 1-13 applied upfront. This should work,
I think?



Only tested with vdpa_sim. I'm sending this before full benchmark, as some work

like [1] can be based on it, and Si-Wei agreed on benchmark this series with

his experience.

Haven't done the full benchmark compared to pre-map at destination yet,
though an observation is that the destination QEMU seems very easy to
get stuck for very long time while in mid of pinning pages. During this
period, any client doing read-only QMP query or executing HMP info
command got frozen indefinitely (subject to how large size the memory is
being pinned). Is it possible to unblock those QMP request or HMP
command from being executed (at least the read-only ones) while in
migration? Yield from the load_setup coroutine and spawn another thread?


Ok, I wasn't aware of that.

I think we cannot yield in a coroutine and wait for an ioctl.

I was wondering if we need a separate coroutine out of the general
migration path to support this special code without overloading
load_setup or its callers. For instance, unblock the source from sending
guest rams while allow destination pin pages in parallel should be
possible.


Hi Si-Wei,

I'm working on this, I think I'll be able to send a new version soon.
Just a question, when the mapping is done in vhost_vdpa_dev_start as
the current upstream master does, are you able to interact with QMP?

Hi Eugenio,

Yes, the latest version works pretty well! Did not get to all of the QMP 
commands, but at least I can do read-only QMP without a problem. That 
addresses our typical usage. Thanks for the prompt fix!


I've rebased my series on top the .load_setup series instead of the top 
13 patches for 9.0, as there are some other dependent patches from this 
series to avoid duplicate work. Am debugging some problems I ran into 
after the code merge. Once they are sorted out I'll post my patch series 
soon!


Thanks,
-Siwei





Thanks!


Regardless, a separate thread is needed to carry out all the heavy
lifting, i.e. ioctl(2) or write(2) syscalls to map pages.



One
option that came to my mind is to effectively use another thread, and
use a POSIX barrier (or equivalent on glib / QEMU) before finishing
the migration.

Yes, a separate thread is needed anyway.


   I'm not sure if there are more points where we can
check the barrier and tell the migration to continue or stop though.

I think

Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration

2023-11-03 Thread Si-Wei Liu




On 11/2/2023 3:12 AM, Si-Wei Liu wrote:



On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Current memory operations like pinning may take a lot of time at the

destination.  Currently they are done after the source of the 
migration is


stopped, and before the workload is resumed at the destination. This 
is a


period where neither traffic can flow, nor the VM workload can continue

(downtime).



We can do better as we know the memory layout of the guest RAM at the

destination from the moment the migration starts.  Moving that 
operation allows


QEMU to communicate the kernel the maps while the workload is still 
running in


the source, so Linux can start mapping them.  Ideally, all IOMMU is 
configured,


but if the vDPA parent driver uses on-chip IOMMU and .set_map we're 
still


saving all the pinning time.
I get what you want to say, though not sure how pinning is relevant to 
on-chip IOMMU and .set_map here, essentially pinning is required for 
all parent vdpa drivers that perform DMA hence don't want VM pages to 
move around.




Note that further devices setup at the end of the migration may alter 
the guest


memory layout. But same as the previous point, many operations are 
still done


incrementally, like memory pinning, so we're saving time anyway.



The first bunch of patches just reorganizes the code, so memory related

operation parameters are shared between all vhost_vdpa devices. This is

because the destination does not know what vhost_vdpa struct will 
have the


registered listener member, so it is easier to place them in a shared 
struct


rather to keep them in vhost_vdpa struct.  Future version may squash 
or omit


these patches.
It looks this VhostVDPAShared facility (patch 1-13) is also what I 
need in my SVQ descriptor group series [*], for which I've built 
similar construct there. If possible please try to merge this in ASAP. 
I'll rework my series on top of that.


[*] 
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5






Only tested with vdpa_sim. I'm sending this before full benchmark, as 
some work


like [1] can be based on it, and Si-Wei agreed on benchmark this 
series with


his experience.

Haven't done the full benchmark compared to pre-map at destination yet,

Hi Eugenio,

I just notice one thing that affects the performance benchmark for this 
series in terms of migration total_time (to be fair, it's mlx5_vdpa 
specific). It looks like iotlb map batching is not acked (via 
vhost_vdpa_set_backend_cap) at the point of vhost-vdpa_load_setup, 
effectively causing quite extensive time spent on hundreds of dma_map 
calls from listener_register(), while the equivalent code had been 
implemented in my destination pre-map patch [1]. Although I can 
benchmark the current patchset by removing batching from my code, I guess 
that's not the goal of this benchmark, right?


It would be best to have map batching in place, so the benchmark for 
both options could match. What do you think?


Thanks,
-Siwei

[1]
https://github.com/siwliu-kernel/qemu/commit/0ce225b0c7e618163ea09da3846c93c4de2f85ed#diff-45489c6f25dc36fd84e1cd28cbf3b8ff03301e2d24dadb6d1c334c9e8f14c00cR639

though an observation is that the destination QEMU seems very easy to 
get stuck for very long time while in mid of pinning pages. During 
this period, any client doing read-only QMP query or executing HMP 
info command got frozen indefinitely (subject to how large size the 
memory is being pinned). Is it possible to unblock those QMP request 
or HMP command from being executed (at least the read-only ones) while 
in migration? Yield from the load_setup coroutine and spawn another 
thread?


Having said that, not sure if .load_setup is a good fit for what we want to 
do. Searching all current users of .load_setup, either the job can be 
done instantly or the task is time bound without trapping into kernel 
for too long. Maybe pinning is too special use case here...


-Siwei




Future directions on top of this series may include:

* Iterative migration of virtio-net devices, as it may reduce 
downtime per [1].


   vhost-vdpa net can apply the configuration through CVQ in the 
destination


   while the source is still migrating.

* Move more things ahead of migration time, like DRIVER_OK.

* Check that the devices of the destination are valid, and cancel the 
migration


   in case it is not.



[1] 
https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/




Eugenio Pérez (18):

   vdpa: add VhostVDPAShared

   vdpa: move iova tree to the shared struct

   vdpa: move iova_range to vhost_vdpa_shared

   vdpa: move shadow_data to vhost_vdpa_shared

   vdpa: use vdpa shared for tracing

   vdpa: move file descriptor to vhost_vdpa_shared

   vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared

   vdpa: move backend_cap to vhost_vdpa_shared

   vdpa: remove msg type of vhost_vdpa

   vdpa: move iommu_list to vhost_vdpa_shared

   vdpa: use

Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration

2023-11-03 Thread Si-Wei Liu




On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:

On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu  wrote:



On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Current memory operations like pinning may take a lot of time at the

destination.  Currently they are done after the source of the migration is

stopped, and before the workload is resumed at the destination.  This is a

period where neither traffic can flow, nor the VM workload can continue

(downtime).



We can do better as we know the memory layout of the guest RAM at the

destination from the moment the migration starts.  Moving that operation allows

QEMU to communicate the kernel the maps while the workload is still running in

the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,

but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still

saving all the pinning time.

I get what you want to say, though not sure how pinning is relevant to
on-chip IOMMU and .set_map here, essentially pinning is required for all
parent vdpa drivers that perform DMA hence don't want VM pages to move
around.

Basically highlighting that the work done under .set_map is not only
pinning, but it is a significant fraction of it. It can be reworded or
deleted for sure.




Note that further devices setup at the end of the migration may alter the guest

memory layout. But same as the previous point, many operations are still done

incrementally, like memory pinning, so we're saving time anyway.



The first bunch of patches just reorganizes the code, so memory related

operation parameters are shared between all vhost_vdpa devices.  This is

because the destination does not know what vhost_vdpa struct will have the

registered listener member, so it is easier to place them in a shared struct

rather to keep them in vhost_vdpa struct.  Future version may squash or omit

these patches.

It looks this VhostVDPAShared facility (patch 1-13) is also what I need
in my SVQ descriptor group series [*], for which I've built similar
construct there. If possible please try to merge this in ASAP. I'll
rework my series on top of that.

[*]
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5


I can send it individually, for sure.

MST, Jason, can this first part be merged? It doesn't add a lot by
itself but it helps pave the way for future changes.
If it cannot, it doesn't matter. I can pick it from here and get my 
series posted with your patches 1-13 applied upfront. This should work, 
I think?





Only tested with vdpa_sim. I'm sending this before full benchmark, as some work

like [1] can be based on it, and Si-Wei agreed on benchmark this series with

his experience.

Haven't done the full benchmark compared to pre-map at destination yet,
though an observation is that the destination QEMU seems very easy to
get stuck for very long time while in mid of pinning pages. During this
period, any client doing read-only QMP query or executing HMP info
command got frozen indefinitely (subject to how large size the memory is
being pinned). Is it possible to unblock those QMP request or HMP
command from being executed (at least the read-only ones) while in
migration? Yield from the load_setup coroutine and spawn another thread?


Ok, I wasn't aware of that.

I think we cannot yield in a coroutine and wait for an ioctl.
I was wondering if we need a separate coroutine out of the general 
migration path to support this special code without overloading 
load_setup or its callers. For instance, unblocking the source from sending 
guest RAM while allowing the destination to pin pages in parallel should be 
possible.


Regardless, a separate thread is needed to carry out all the heavy 
lifting, i.e. ioctl(2) or write(2) syscalls to map pages.
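
To make that concrete, a minimal sketch (not from any posted patch) of pushing the map calls onto a QEMU thread so the main loop keeps serving QMP; qemu_thread_create()/qemu_thread_join() are existing QEMU helpers, while reusing vhost_vdpa_load_map() as the thread function and joining it before switchover are assumptions of this sketch:

/* Illustrative only: run the expensive map ioctls on a worker thread. */
static QemuThread load_map_thread;

static void start_load_map_thread(VhostVDPAShared *shared)
{
    qemu_thread_create(&load_map_thread, "vdpa-load-map",
                       vhost_vdpa_load_map, shared, QEMU_THREAD_JOINABLE);
}

static void join_load_map_thread(void)
{
    /* acts as the "barrier" before completing the switchover */
    qemu_thread_join(&load_map_thread);
}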




One
option that came to my mind is to effectively use another thread, and
use a POSIX barrier (or equivalent on glib / QEMU) before finishing
the migration.

Yes, a separate thread is needed anyway.


  I'm not sure if there are more points where we can
check the barrier and tell the migration to continue or stop though.
I think there is, e.g. what if the dma_map fails? There must be a 
checkpoint for that.




Another option is to effectively start doing these ioctls in an
asynchronous way, io_uring cmds like, but I'd like to achieve this
first.
Yes, io_uring or any async API could be another option. Though this 
needs a new uAPI through additional kernel facilities to support it. Anyway, 
it's up to you to decide. :)


Regards,
-Siwei

Having said that, not sure if .load_setup is a good fit for what we want to
do. Searching all current users of .load_setup, either the job can be
done instantly or the task is time bound without trapping into kernel
for too long. Maybe pinning is too special use case here...

-Siwei



Future directions on top of this series may include:

* Iterative migration of virtio-net devices, as it may reduce downtime per [1].

vhost-vdpa net can apply the configuration

Re: [RFC PATCH 00/18] Map memory at destination .load_setup in vDPA-net migration

2023-11-02 Thread Si-Wei Liu




On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Current memory operations like pinning may take a lot of time at the

destination.  Currently they are done after the source of the migration is

stopped, and before the workload is resumed at the destination.  This is a

period where neither traffic can flow, nor the VM workload can continue

(downtime).



We can do better as we know the memory layout of the guest RAM at the

destination from the moment the migration starts.  Moving that operation allows

QEMU to communicate the kernel the maps while the workload is still running in

the source, so Linux can start mapping them.  Ideally, all IOMMU is configured,

but if the vDPA parent driver uses on-chip IOMMU and .set_map we're still

saving all the pinning time.
I get what you want to say, though not sure how pinning is relevant to 
on-chip IOMMU and .set_map here, essentially pinning is required for all 
parent vdpa drivers that perform DMA hence don't want VM pages to move 
around.




Note that further devices setup at the end of the migration may alter the guest

memory layout. But same as the previous point, many operations are still done

incrementally, like memory pinning, so we're saving time anyway.



The first bunch of patches just reorganizes the code, so memory related

operation parameters are shared between all vhost_vdpa devices.  This is

because the destination does not know what vhost_vdpa struct will have the

registered listener member, so it is easier to place them in a shared struct

rather to keep them in vhost_vdpa struct.  Future version may squash or omit

these patches.
It looks this VhostVDPAShared facility (patch 1-13) is also what I need 
in my SVQ descriptor group series [*], for which I've built similar 
construct there. If possible please try to merge this in ASAP. I'll 
rework my series on top of that.


[*] 
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5






Only tested with vdpa_sim. I'm sending this before full benchmark, as some work

like [1] can be based on it, and Si-Wei agreed on benchmark this series with

his experience.
Haven't done the full benchmark compared to pre-map at destination yet, 
though an observation is that the destination QEMU seems very easy to 
get stuck for very long time while in mid of pinning pages. During this 
period, any client doing read-only QMP query or executing HMP info 
command got frozen indefinitely (subject to how large size the memory is 
being pinned). Is it possible to unblock those QMP request or HMP 
command from being executed (at least the read-only ones) while in 
migration? Yield from the load_setup coroutine and spawn another thread?


Having said that, not sure if .load_setup is a good fit for what we want to 
do. Searching all current users of .load_setup, either the job can be 
done instantly or the task is time bound without trapping into kernel 
for too long. Maybe pinning is too special use case here...


-Siwei




Future directions on top of this series may include:

* Iterative migration of virtio-net devices, as it may reduce downtime per [1].

   vhost-vdpa net can apply the configuration through CVQ in the destination

   while the source is still migrating.

* Move more things ahead of migration time, like DRIVER_OK.

* Check that the devices of the destination are valid, and cancel the migration

   in case it is not.



[1] 
https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/



Eugenio Pérez (18):

   vdpa: add VhostVDPAShared

   vdpa: move iova tree to the shared struct

   vdpa: move iova_range to vhost_vdpa_shared

   vdpa: move shadow_data to vhost_vdpa_shared

   vdpa: use vdpa shared for tracing

   vdpa: move file descriptor to vhost_vdpa_shared

   vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared

   vdpa: move backend_cap to vhost_vdpa_shared

   vdpa: remove msg type of vhost_vdpa

   vdpa: move iommu_list to vhost_vdpa_shared

   vdpa: use VhostVDPAShared in vdpa_dma_map and unmap

   vdpa: use dev_shared in vdpa_iommu

   vdpa: move memory listener to vhost_vdpa_shared

   vdpa: do not set virtio status bits if unneeded

   vdpa: add vhost_vdpa_load_setup

   vdpa: add vhost_vdpa_net_load_setup NetClient callback

   vdpa: use shadow_data instead of first device v->shadow_vqs_enabled

   virtio_net: register incremental migration handlers



  include/hw/virtio/vhost-vdpa.h |  43 +---

  include/net/net.h  |   4 +

  hw/net/virtio-net.c|  23 +

  hw/virtio/vdpa-dev.c   |   7 +-

  hw/virtio/vhost-vdpa.c | 183 ++---

  net/vhost-vdpa.c   | 127 ---

  hw/virtio/trace-events |  14 +--

  7 files changed, 239 insertions(+), 162 deletions(-)








Re: [RFC PATCH 02/18] vdpa: move iova tree to the shared struct

2023-11-02 Thread Si-Wei Liu




On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the iova tree to VhostVDPAShared so all vhost_vdpa can use it,
rather than always in the first or last vhost_vdpa.

Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h |  4 +--
  hw/virtio/vhost-vdpa.c | 19 ++--
  net/vhost-vdpa.c   | 54 +++---
  3 files changed, 35 insertions(+), 42 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index eb1a56d75a..ac036055d3 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -32,6 +32,8 @@ typedef struct VhostVDPAHostNotifier {
  
  /* Info shared by all vhost_vdpa device models */

  typedef struct vhost_vdpa_shared {
+/* IOVA mapping used by the Shadow Virtqueue */
+VhostIOVATree *iova_tree;
  } VhostVDPAShared;
  
  typedef struct vhost_vdpa {

@@ -48,8 +50,6 @@ typedef struct vhost_vdpa {
  bool shadow_data;
  /* Device suspended successfully */
  bool suspended;
-/* IOVA mapping used by the Shadow Virtqueue */
-VhostIOVATree *iova_tree;
  VhostVDPAShared *shared;
  GPtrArray *shadow_vqs;
  const VhostShadowVirtqueueOps *shadow_vq_ops;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 819b2d811a..9cee38cb6d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -358,7 +358,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
  mem_region.size = int128_get64(llsize) - 1,
  mem_region.perm = IOMMU_ACCESS_FLAG(true, section->readonly),
  
-r = vhost_iova_tree_map_alloc(v->iova_tree, &mem_region);

+r = vhost_iova_tree_map_alloc(v->shared->iova_tree, &mem_region);
  if (unlikely(r != IOVA_OK)) {
  error_report("Can't allocate a mapping (%d)", r);
  goto fail;
@@ -379,7 +379,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
  
  fail_map:

  if (v->shadow_data) {
-vhost_iova_tree_remove(v->iova_tree, mem_region);
+vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
  }
  
  fail:

@@ -441,13 +441,13 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
  .size = int128_get64(llsize) - 1,
  };
  
-result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);

+result = vhost_iova_tree_find_iova(v->shared->iova_tree, &mem_region);
  if (!result) {
  /* The memory listener map wasn't mapped */
  return;
  }
  iova = result->iova;
-vhost_iova_tree_remove(v->iova_tree, *result);
+vhost_iova_tree_remove(v->shared->iova_tree, *result);
  }
  vhost_vdpa_iotlb_batch_begin_once(v);
  /*
@@ -1059,7 +1059,8 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa 
*v, hwaddr addr)
  const DMAMap needle = {
  .translated_addr = addr,
  };
-const DMAMap *result = vhost_iova_tree_find_iova(v->iova_tree, &needle);
+const DMAMap *result = vhost_iova_tree_find_iova(v->shared->iova_tree,
+ &needle);
  hwaddr size;
  int r;
  
@@ -1075,7 +1076,7 @@ static void vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v, hwaddr addr)

  return;
  }
  
-vhost_iova_tree_remove(v->iova_tree, *result);

+vhost_iova_tree_remove(v->shared->iova_tree, *result);
  }
  
  static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,

@@ -1103,7 +1104,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, 
DMAMap *needle,
  {
  int r;
  
-r = vhost_iova_tree_map_alloc(v->iova_tree, needle);

+r = vhost_iova_tree_map_alloc(v->shared->iova_tree, needle);
  if (unlikely(r != IOVA_OK)) {
  error_setg(errp, "Cannot allocate iova (%d)", r);
  return false;
@@ -1115,7 +1116,7 @@ static bool vhost_vdpa_svq_map_ring(struct vhost_vdpa *v, 
DMAMap *needle,
 needle->perm == IOMMU_RO);
  if (unlikely(r != 0)) {
  error_setg_errno(errp, -r, "Cannot map region to device");
-vhost_iova_tree_remove(v->iova_tree, *needle);
+vhost_iova_tree_remove(v->shared->iova_tree, *needle);
  }
  
  return r == 0;

@@ -1216,7 +1217,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
  goto err;
  }
  
-vhost_svq_start(svq, dev->vdev, vq, v->iova_tree);

+vhost_svq_start(svq, dev->vdev, vq, v->shared->iova_tree);
  ok = vhost_vdpa_svq_map_rings(dev, svq, &addr, &err);
 

Re: [RFC PATCH 15/18] vdpa: add vhost_vdpa_load_setup

2023-11-02 Thread Si-Wei Liu




On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Callers can use this function to setup the incoming migration.

Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h |  7 +++
  hw/virtio/vhost-vdpa.c | 17 -
  2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8f54e5edd4..edc08b7a02 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -45,6 +45,12 @@ typedef struct vhost_vdpa_shared {
  
  bool iotlb_batch_begin_sent;
  
+/*

+ * The memory listener has been registered, so DMA maps have been sent to
+ * the device.
+ */
+bool listener_registered;
+
  /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA 
*/
  bool shadow_data;
  } VhostVDPAShared;
@@ -73,6 +79,7 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, 
hwaddr iova,
 hwaddr size, void *vaddr, bool readonly);
  int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
   hwaddr size);
+int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
  
  typedef struct vdpa_iommu {

  VhostVDPAShared *dev_shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index cc252fc2d8..bfbe4673af 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1325,7 +1325,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
   "IOMMU and try again");
  return -1;
  }
-memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+if (!v->shared->listener_registered) {
+memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+}
Set listener_registered to true after registration; in addition, it 
looks like the memory_listener_unregister in vhost_vdpa_reset_status 
doesn't clear the listener_registered flag after unregistration. This 
code path can be called during SVQ switching; if the flag isn't cleared, 
mappings can't be added back after a couple of rounds of SVQ switching or 
live migration.


-Siwei

  
  return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);

  }
@@ -1528,3 +1530,16 @@ const VhostOps vdpa_ops = {
  .vhost_set_config_call = vhost_vdpa_set_config_call,
  .vhost_reset_status = vhost_vdpa_reset_status,
  };
+
+int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
+{
+uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
+if (unlikely(r < 0)) {
+return r;
+}
+
+memory_listener_register(&shared->listener, dma_as);
+shared->listener_registered = true;
+return 0;
+}
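
A hedged sketch of the bookkeeping being asked for here, assuming the flag stays in VhostVDPAShared as in the quoted patch; the wrapper names are illustrative, and the real fix would land in vhost_vdpa_dev_start/vhost_vdpa_reset_status as discussed above:

/* Illustrative only: keep listener_registered in sync with the actual
 * registration state so a later SVQ switch can re-register the listener. */
static void vhost_vdpa_register_listener_once(VhostVDPAShared *shared,
                                              AddressSpace *dma_as)
{
    if (!shared->listener_registered) {
        memory_listener_register(&shared->listener, dma_as);
        shared->listener_registered = true;
    }
}

static void vhost_vdpa_unregister_listener(VhostVDPAShared *shared)
{
    if (shared->listener_registered) {
        memory_listener_unregister(&shared->listener);
        shared->listener_registered = false;
    }
}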





Re: [RFC PATCH 04/18] vdpa: move shadow_data to vhost_vdpa_shared

2023-11-02 Thread Si-Wei Liu




On 10/19/2023 7:34 AM, Eugenio Pérez wrote:

Next patches will register the vhost_vdpa memory listener while the VM
is migrating at the destination, so we can map the memory to the device
before stopping the VM at the source.  The main goal is to reduce the
downtime.

However, the destination QEMU is unaware of which vhost_vdpa device will
register its memory_listener.  If the source guest has CVQ enabled, it
will be the CVQ device.  Otherwise, it  will be the first one.

Move the shadow_data member to VhostVDPAShared so all vhost_vdpa can use
it, rather than always in the first or last vhost_vdpa.

Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h |  5 +++--
  hw/virtio/vhost-vdpa.c |  6 +++---
  net/vhost-vdpa.c   | 23 ++-
  3 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8d52a7e498..01e0f25e27 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -36,6 +36,9 @@ typedef struct vhost_vdpa_shared {
  
  /* IOVA mapping used by the Shadow Virtqueue */

  VhostIOVATree *iova_tree;
+
+/* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
+bool shadow_data;
  } VhostVDPAShared;
  
  typedef struct vhost_vdpa {

@@ -47,8 +50,6 @@ typedef struct vhost_vdpa {
  MemoryListener listener;
  uint64_t acked_features;
  bool shadow_vqs_enabled;
-/* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
-bool shadow_data;
  /* Device suspended successfully */
  bool suspended;
  VhostVDPAShared *shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 2bceadd118..ec028e4c56 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -353,7 +353,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
   vaddr, section->readonly);
  
  llsize = int128_sub(llend, int128_make64(iova));

-if (v->shadow_data) {
+if (v->shared->shadow_data) {
  int r;
  
  mem_region.translated_addr = (hwaddr)(uintptr_t)vaddr,

@@ -380,7 +380,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
  return;
  
  fail_map:

-if (v->shadow_data) {
+if (v->shared->shadow_data) {
  vhost_iova_tree_remove(v->shared->iova_tree, mem_region);
  }
  
@@ -435,7 +435,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
  
  llsize = int128_sub(llend, int128_make64(iova));
  
-if (v->shadow_data) {

+if (v->shared->shadow_data) {
  const DMAMap *result;
  const void *vaddr = memory_region_get_ram_ptr(section->mr) +
  section->offset_within_region +
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 9648b0ef7e..01202350ea 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -282,15 +282,6 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
  return size;
  }
  
-/** From any vdpa net client, get the netclient of the first queue pair */

-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
-{
-NICState *nic = qemu_get_nic(s->nc.peer);
-NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
-
-return DO_UPCAST(VhostVDPAState, nc, nc0);
-}
-
  static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
  {
  struct vhost_vdpa *v = &s->vhost_vdpa;
@@ -360,10 +351,10 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
  if (s->always_svq ||
  migration_is_setup_or_active(migrate_get_current()->state)) {
  v->shadow_vqs_enabled = true;
-v->shadow_data = true;
+v->shared->shadow_data = true;
  } else {
  v->shadow_vqs_enabled = false;
-v->shadow_data = false;
+v->shared->shadow_data = false;
  }
  
  if (v->index == 0) {

@@ -513,7 +504,7 @@ dma_map_err:
  
  static int vhost_vdpa_net_cvq_start(NetClientState *nc)

  {
-VhostVDPAState *s, *s0;
+VhostVDPAState *s;
  struct vhost_vdpa *v;
  int64_t cvq_group;
  int r;
@@ -524,12 +515,10 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
  s = DO_UPCAST(VhostVDPAState, nc, nc);
  v = &s->vhost_vdpa;
  
-s0 = vhost_vdpa_net_first_nc_vdpa(s);

-v->shadow_data = s0->vhost_vdpa.shadow_vqs_enabled;
-v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
+v->shadow_vqs_enabled = s->always_svq;
This doesn't seem equivalent to the previous code. If always_svq is not 
set and migration is active, will it cause CVQ not to be shadowed at all? 
The "goto out;" line below would effectively return from this function, 
leaving cvq's shadow_vqs_enabled as false.




  s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
  
-if (s->vhost_vdpa.shadow_data) {

+if (v->shared->shadow_data) {
  /* SVQ is already 

Re: [PATCH] vhost: Perform memory section dirty scans once per iteration

2023-10-17 Thread Si-Wei Liu




On 10/6/2023 2:48 AM, Michael S. Tsirkin wrote:

On Fri, Oct 06, 2023 at 09:58:30AM +0100, Joao Martins wrote:

On 03/10/2023 15:01, Michael S. Tsirkin wrote:

On Wed, Sep 27, 2023 at 12:14:28PM +0100, Joao Martins wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
of queues that are set up, e.g. on idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon as
there is a vhost device with a sufficiently high number of queues, or a
sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly query
the *shared log* as many times as there are queues configured with vhost
in the guest. For the virtqueue data this is necessary, but not for the
memory sections, which are the same. So essentially we end up scanning
the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger.

The real problem, however, is exactly that: a device per vhost worker/qp,
when there should be a device representing a netdev (for N vhost workers).
Given this problem exists for any Qemu these days, figured a simpler
solution is better to increase stable tree's coverage; thus don't
change the device model of sw vhost to fix this "over log scan" issue.

Signed-off-by: Joao Martins 
---
I am not fully sure the heuristic captures the myriad of different vhost
devices -- I think so. IIUC, the log is always shared; it's just whether
it's qemu heap memory or via /dev/shm when other processes want to
access it.
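
To make the "select one device to scan the log" heuristic concrete, a
minimal sketch under the assumptions above (all names here are
hypothetical, not necessarily what the patch uses): one vhost_dev is
elected as the memory-section logger when logging is enabled, cleared when
logging is disabled, and only the elected device scans the shared dirty
log for memory sections; every device still syncs its own vrings.

    static struct vhost_dev *mem_logger;   /* elected memory-section logger */

    static bool vhost_dev_should_log_mem(struct vhost_dev *dev)
    {
        if (!dev->started || !dev->log_enabled) {
            return false;                  /* disconnected or not logging */
        }
        if (!mem_logger) {
            mem_logger = dev;              /* elected when the logger is enabled */
        }
        return mem_logger == dev;          /* only this device scans memory sections */
    }

    static void vhost_dev_mem_logger_clear(struct vhost_dev *dev)
    {
        if (mem_logger == dev) {
            mem_logger = NULL;             /* cleared when the logger is disabled */
        }
    }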

Thanks for working on this.

I don't think this works like this, because different types of vhost
devices have different regions - see e.g. vhost_region_add_section.
I am also not sure all devices are running at the same time - e.g.
some could be disconnected, and vhost_sync_dirty_bitmap takes this
into account.


Good point. But this all means the logic for selecting the 'logger' needs to
take into consideration whether vhost_dev::log_enabled or vhost_dev::started
is set, right?

With respect to regions, it seems like this can only change depending on
whether one of the vhost devices' backend_type is VHOST_BACKEND_TYPE_USER
*and* whether the backend sets vhost_backend_can_merge?

With respect to 'could be disconnected': devices cannot be added or removed
during migration, so that might not be something that occurs during
migration. I placed this in log_sync exactly to just cover migration, unless
there's some other way that disconnects the vhost and changes these variables
during migration.

The *frontend* can't be added or removed (ATM - this is just because we lack
good ways to describe devices that can be migrated, so all we
came up with is passing the same command line on both sides,
and this breaks if you add/remove things in the process).
We really shouldn't bake this assumption into code if we can
help it though.

But I digress.

The *backend* can disconnect at any time as this is not guest visible.


But the idea is I think a good one - I just feel more refactoring is
needed.

Can you expand on what refactoring you were thinking for this fix?

Better to separate the idea of logging from the device. Then we can
have a single logger that collects data from devices to decide
what needs to be logged.
I think the troublemaker here is the vhost-user clients that attempt to
round down to a (huge) page boundary and then have to merge adjacent
sections, leading to differing views between vhost devices. While I agree
it is a great idea to separate logging from the device, it isn't clear to
me how that helps the case where a mix of both vhost-user and vhost-kernel
clients exists in the same QEMU process, in which case we would need at
least 2 separate vhost loggers, one per vhost type. Or do you think there's
value in unifying the two distinct subsystems under one single vhost logger
facility?

Note that the vhost logging interface (vhost kernel or vhost userspace)
doesn't support the notion of logging memory buffer sections separately
from those for VQs; all QEMU can rely on is the various sections in the
memory table, and basically a single dirty bitmap for both guest buffers
and VQs is indistinctly shared by all vhost devices. How it would help to
just refactor the QEMU side of the code on top of today's vhost backend
interface, I am not sure.
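
To make the "differing views" point concrete, a toy illustration (not the
actual QEMU code, which lives around vhost_region_add_section() and the
vhost_backend_can_merge() callback): a backend that aligns sections to a
(huge) page boundary may merge two sections that another backend keeps
separate, so the two no longer agree on the memory-section layout they
log against.

    #include <stdbool.h>
    #include <stdint.h>

    struct section { uint64_t start, end; };

    /* Align a section outward to the page size and merge it into the
     * previous one if the two now touch or overlap. */
    static bool align_and_maybe_merge(struct section *prev, struct section *cur,
                                      uint64_t page_size)
    {
        cur->start &= ~(page_size - 1);                           /* round down */
        cur->end = (cur->end + page_size - 1) & ~(page_size - 1); /* round up */

        if (prev != NULL && prev->end >= cur->start) {
            prev->end = cur->end;                                 /* merged */
            return true;
        }
        return false;
    }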


Regardless, IMHO from a stable-fix point of view it might be less risky and
more valuable to just limit the fix to the vhost-kernel case (to be more
precise, non-vhost-user type and without 

Re: [PATCH v3 0/5] Enable vdpa net migration with features depending on CVQ

2023-09-15 Thread Si-Wei Liu
Does this series need to work with the recently merged
ENABLE_AFTER_DRIVER_OK series from the kernel?


-Siwei

On 8/22/2023 1:53 AM, Eugenio Pérez wrote:

At this moment the migration of net features that depends on CVQ is not

possible, as there is no reliable way to restore the device state like mac

address, number of enabled queues, etc to the destination.  This is mainly

caused because the device must only read CVQ, and process all the commands

before resuming the dataplane.



This series lift that requirement, sending the VHOST_VDPA_SET_VRING_ENABLE

ioctl for dataplane vqs only after the device has processed all commands.

---

v3:

* Fix subject typo and expand message of patch ("vdpa: move

   vhost_vdpa_set_vring_ready to the caller").



v2:

* Factor out VRING_ENABLE ioctls from vhost_vdpa_dev_start to the caller,

   instead of providing a callback to know if it must be called or not.

* at https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg05447.html



RFC:

* Enable vqs early in case CVQ cannot be shadowed.

* at https://lists.gnu.org/archive/html/qemu-devel/2023-07/msg01325.html



Eugenio Pérez (5):

   vdpa: use first queue SVQ state for CVQ default

   vdpa: export vhost_vdpa_set_vring_ready

   vdpa: rename vhost_vdpa_net_load to vhost_vdpa_net_cvq_load

   vdpa: move vhost_vdpa_set_vring_ready to the caller

   vdpa: remove net cvq migration blocker



  include/hw/virtio/vhost-vdpa.h |  1 +

  hw/virtio/vdpa-dev.c   |  3 ++

  hw/virtio/vhost-vdpa.c | 22 +-

  net/vhost-vdpa.c   | 75 +++---

  hw/virtio/trace-events |  2 +-

  5 files changed, 57 insertions(+), 46 deletions(-)
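
For reference, a hedged sketch of the ordering the cover letter describes
(the two function names come from the patch titles above; the variables and
surrounding control flow are assumptions, not the actual code of the series):

    /* The shadow CVQ is enabled first so the device state can be restored... */
    vhost_vdpa_set_vring_ready(v, cvq_index);
    r = vhost_vdpa_net_cvq_load(s);              /* replay mac, mq, ... commands */
    if (r < 0) {
        return r;
    }
    /* ...and only afterwards are the dataplane vrings enabled. */
    for (i = 0; i < dataplane_vq_num; i++) {
        vhost_vdpa_set_vring_ready(v, i);
    }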







