On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
>> Hi Maxime,
>>
>>> -----Original Message-----
>>> From: Maxime Coquelin <maxime.coque...@redhat.com>
>>> Sent: Monday, July 5, 2021 8:46 PM
>>> To: Burakov, Anatoly <anatoly.bura...@intel.com>; Ding, Xuan
>>> <xuan.d...@intel.com>; Xia, Chenbo <chenbo....@intel.com>; Thomas
>>> Monjalon <tho...@monjalon.net>; David Marchand
>>> <david.march...@redhat.com>
>>> Cc: dev@dpdk.org; Hu, Jiayu <jiayu...@intel.com>; Pai G, Sunil
>>> <sunil.pa...@intel.com>; Richardson, Bruce <bruce.richard...@intel.com>;
>>> Van Haaren, Harry <harry.van.haa...@intel.com>; Liu, Yong
>>> <yong....@intel.com>; Ma, WenwuX <wenwux...@intel.com>
>>> Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
>>> vhost
>>>
>>>
>>>
>>> On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
>>>> On 05-Jul-21 9:40 AM, Xuan Ding wrote:
>>>>> The use of IOMMU has many advantages, such as isolation and address
>>>>> translation. This patch extends the capbility of DMA engine to use
>>>>> IOMMU if the DMA device is bound to vfio.
>>>>>
>>>>> When set memory table, the guest memory will be mapped
>>>>> into the default container of DPDK.
>>>>>
>>>>> Signed-off-by: Xuan Ding <xuan.d...@intel.com>
>>>>> ---
>>>>> doc/guides/prog_guide/vhost_lib.rst | 9 ++++++
>>>>> lib/vhost/rte_vhost.h | 1 +
>>>>> lib/vhost/socket.c | 9 ++++++
>>>>> lib/vhost/vhost.h | 1 +
>>>>> lib/vhost/vhost_user.c | 46
>>>>> ++++++++++++++++++++++++++++-
>>>>> 5 files changed, 65 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/doc/guides/prog_guide/vhost_lib.rst
>>>>> b/doc/guides/prog_guide/vhost_lib.rst
>>>>> index 05c42c9b11..c3beda23d9 100644
>>>>> --- a/doc/guides/prog_guide/vhost_lib.rst
>>>>> +++ b/doc/guides/prog_guide/vhost_lib.rst
>>>>> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost
>>>>> API functions:
>>>>> It is disabled by default.
>>>>> + - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
>>>>> +
>>>>> + In asynchronous data path, vhost liarary is not aware of which
>>>>> driver
>>>>> + (igb_uio/vfio) the DMA device is bound to. Application should
>>>>> pass
>>>>> + this flag to tell vhost library whether IOMMU should be
>>>>> programmed
>>>>> + for guest memory.
>>>>> +
>>>>> + It is disabled by default.
>>>>> +
>>>>> - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
>>>>> Since v16.04, the vhost library forwards checksum and gso
>>>>> requests for
>>>>> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
>>>>> index 8d875e9322..a766ea7b6b 100644
>>>>> --- a/lib/vhost/rte_vhost.h
>>>>> +++ b/lib/vhost/rte_vhost.h
>>>>> @@ -37,6 +37,7 @@ extern "C" {
>>>>> #define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
>>>>> #define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
>>>>> #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
>>>>> +#define RTE_VHOST_USER_ASYNC_USE_VFIO (1ULL << 9)
>>>>> /* Features. */
>>>>> #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
>>>>> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
>>>>> index 5d0d728d52..77c722c86b 100644
>>>>> --- a/lib/vhost/socket.c
>>>>> +++ b/lib/vhost/socket.c
>>>>> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>>>>> bool extbuf;
>>>>> bool linearbuf;
>>>>> bool async_copy;
>>>>> + bool async_use_vfio;
>>>>> bool net_compliant_ol_flags;
>>>>> /*
>>>>> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct
>>>>> vhost_user_socket *vsocket)
>>>>> dev->async_copy = 1;
>>>>> }
>>>>> + if (vsocket->async_use_vfio) {
>>>>> + dev = get_device(vid);
>>>>> +
>>>>> + if (dev)
>>>>> + dev->async_use_vfio = 1;
>>>>> + }
>>>>> +
>>>>> VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>>>>> if (vsocket->notify_ops->new_connection) {
>>>>> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,
>>>>> uint64_t flags)
>>>>> vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
>>>>> vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
>>>>> vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
>>>>> + vsocket->async_use_vfio = flags &
>>> RTE_VHOST_USER_ASYNC_USE_VFIO;
>>>>> vsocket->net_compliant_ol_flags = flags &
>>>>> RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
>>>>> if (vsocket->async_copy &&
>>>>> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
>>>>> index 8078ddff79..fb775ce4ed 100644
>>>>> --- a/lib/vhost/vhost.h
>>>>> +++ b/lib/vhost/vhost.h
>>>>> @@ -370,6 +370,7 @@ struct virtio_net {
>>>>> int16_t broadcast_rarp;
>>>>> uint32_t nr_vring;
>>>>> int async_copy;
>>>>> + int async_use_vfio;
>>>>> int extbuf;
>>>>> int linearbuf;
>>>>> struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS *
>>>>> 2];
>>>>> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
>>>>> index 8f0eba6412..f3703f2e72 100644
>>>>> --- a/lib/vhost/vhost_user.c
>>>>> +++ b/lib/vhost/vhost_user.c
>>>>> @@ -45,6 +45,7 @@
>>>>> #include <rte_common.h>
>>>>> #include <rte_malloc.h>
>>>>> #include <rte_log.h>
>>>>> +#include <rte_vfio.h>
>>>>> #include "iotlb.h"
>>>>> #include "vhost.h"
>>>>> @@ -141,6 +142,36 @@ get_blk_size(int fd)
>>>>> return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>>>>> }
>>>>> +static int
>>>>> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
>>>>> +{
>>>>> + int ret = 0;
>>>>> + uint64_t host_iova;
>>>>> + host_iova = rte_mem_virt2iova((void
>>>>> *)(uintptr_t)region->host_user_addr);
>>>>> + if (do_map) {
>>>>> + /* Add mapped region into the default container of DPDK. */
>>>>> + ret =
>>> rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> + region->host_user_addr,
>>>>> + host_iova,
>>>>> + region->size);
>>>>> + if (ret) {
>>>>> + VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
>>>>> + return ret;
>>>>> + }
>>>>> + } else {
>>>>> + /* Remove mapped region from the default container of
>>>>> DPDK. */
>>>>> + ret =
>>>>> rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>> + region->host_user_addr,
>>>>> + host_iova,
>>>>> + region->size);
>>>>> + if (ret) {
>>>>> + VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
>>>>> + return ret;
>>>>> + }
>>>>> + }
>>>>> + return ret;
>>>>> +}
>>>>
>>>> We've been discussing this off list with Xuan, and unfortunately
>>>> this is
>>>> a blocker for now.
>>>>
>>>> Currently, the x86 IOMMU does not support partial unmap - the segments
>>>> have to be unmapped exactly the same addr/len as they were mapped. We
>>>> also concatenate adjacent mappings to prevent filling up the DMA
>>>> mapping
>>>> entry table with superfluous entries.
>>>>
>>>> This means that, when two unrelated mappings are contiguous in memory
>>>> (e.g. if you map regions 1 and 2 independently, but they happen to be
>>>> sitting right next to each other in virtual memory), we cannot later
>>>> unmap one of them because, even though these are two separate
>>> mappings
>>>> as far as kernel VFIO infrastructure is concerned, the mapping gets
>>>> compacted and looks like one single mapping to VFIO, so DPDK API will
>>>> not let us unmap region 1 without also unmapping region 2.
>>>>
>>>> The proper fix for this problem would be to always map memory
>>>> page-by-page regardless of where it comes from (we already do that for
>>>> internal memory, but not for external). However, the reason this works
>>>> for internal memory is because when mapping internal memory segments,
>>>> *we know the page size*. For external memory segments, there is no such
>>>> guarantee, so we cannot deduce page size for a given memory segment,
>>> and
>>>> thus can't map things page-by-page.
>>>>
>>>> So, the proper fix for it would be to add page size to the VFIO DMA
>>>> API.
>>>> Unfortunately, it probably has to wait until 21.11 because it is an API
>>>> change.
>>>>
>>>> The slightly hacky fix for this would be to forego user mem map
>>>> concatenation and trust that user is not going to do anything stupid,
>>>> and will not spam the VFIO DMA API without reason. I would rather
>>>> not go
>>>> down this road, but this could be an option in this case.
>>>>
>>>> Thoughts?
>>>>
>>>
>>> Thanks Anatoly for the detailed description of the issue.
>>> It may be possible to either create a versioned symbol for this API
>>> change, or maybe even to have a temporary internal API.
>>>
>>> But I think this series in its current form is not acceptable, so
>>> waiting for v21.11 would be the best option (we may want to send the
>>> deprecation notice in this release though).
>>>
>>> In this series, I don't like the user application has to pass a flag to
>>> state whether the DMA engine uses VFIO or not. AFAICT, this new revision
>>> does not implement what was discussed in the previous one, i.e.
>>> supporting both IOVA_AS_VA and IOVA_AS_PA.
>>
>> Thanks for your comments. Here I hope to explain some questions:
>> 1. Whether both IOVA_AS_VA and IOVA_AS_PA are supported now?
>> A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version, the
>> virtual address is replaced with iova address of mapped region, and
>> the iova
>> address is selected to program the IOMMU instead of virtual address only.
Good!
>>
>> 2. Why a flag is chosen to be passed by application?
>> A: Yes, as we discussed before, the rte_eal_iova_mode() API can be
>> used to
>> get the IOVA mode, so as to determine whether IOMMU should be programmed.
>> However, in the implementation process, I found a problem. That is how to
>> distinguish the VFIO PA and IGB_UIO PA. Because for VFIO cases, we should
>> always program the IOMMU. While in IGB_UIO cases, it depends on IOMMU
>> capability of platform.
>
> How does one program IOMMU with igb_uio? I was under impression that
> igb_uio (and uio_pci_generic for that matter) does not provide such
> facilities.
+1
>>
>> So a flag is selected, but this requires the application to do extra
>> things.
>> I find another solution, is to use
>> #ifdef VFIO_PRESENT
>> If(rte_vfio_is_enabled("vfio"))
>> program_iommu;
>> #endif
>>
>> Because all the devices are managed by DPDK, we can follow DPDK to do the
>> decision. Does this make sense for you, or any some suggestions?
>
> IMO the #ifdef is not needed. The API will always work, it's just that
> if VFIO is not compiled, it'll just compile down to noops.
Agree the #ifdef is not necessary.
To be clear, rte_vfio_is_enabled() check is going to be done in the
Vhost library, making this transparent to the application?
>>
>> 3. The partial unmap issue
>> A: Thanks Anatoly for the detailed explanation. This problem was found in
>> reconnection cases. After our off list discussion, the solution requires
>> rte_vfio_container_dma_map/unmap API change. Here I want to consult
>> if there are some hope for versioned symbol or a temporary internal API
>> be used in this release.
>
> I don't think we can add a versioned symbol in this release unless
> there's an exception to rc1 feature freeze. I also don't like the idea
> of a temporary internal API because vhost is not in EAL, it's a library
> - meaning, the "internal" API has to in fact be external API, added to
> the .map file etc., otherwise it won't work with shared library builds.
>
> That said, i'm not an expert on versioning, so maybe there are other
> ways i'm not aware of, or i have some misconceptions about how it works :)
Ok, it maybe indeed be better to wait for v21.11, it is too late for
this release.
Thanks,
Maxime
>>
>> Thanks for your time!
>>
>> Regards,
>> Xuan
>>
>>>
>>> Regards,
>>> Maxime
>>
>
>