On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote: > > On 07/06/2017 11:49 AM, Jerin Jacob wrote: >> -----Original Message----- >>> Date: Thu, 6 Jul 2017 09:58:41 +0200 >>> From: Maxime Coquelin <maxime.coque...@redhat.com> >>> To: Jerin Jacob <jerin.ja...@caviumnetworks.com> >>> CC: Santosh Shukla <santosh.shu...@caviumnetworks.com>, >>> tho...@monjalon.net, bruce.richard...@intel.com, dev@dpdk.org, >>> hemant.agra...@nxp.com, shreyansh.j...@nxp.com, gaetan.ri...@6wind.com >>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode >>> before mapping >>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 >>> Thunderbird/52.1.0 >>> >>> >>> >>> On 07/05/2017 05:43 PM, Jerin Jacob wrote: >>>> -----Original Message----- >>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200 >>>>> From: Maxime Coquelin <maxime.coque...@redhat.com> >>>>> To: Santosh Shukla <santosh.shu...@caviumnetworks.com>, >>>>> tho...@monjalon.net, bruce.richard...@intel.com, dev@dpdk.org >>>>> CC: jerin.ja...@caviumnetworks.com, hemant.agra...@nxp.com, >>>>> shreyansh.j...@nxp.com, gaetan.ri...@6wind.com >>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode >>>>> before mapping >>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 >>>>> Thunderbird/52.1.0 >>>>> >>>>> >>>>> >>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote: >>>>>> Check iova mode and accordingly map iova to pa or va. >>>>>> >>>>>> Signed-off-by: Santosh Shukla<santosh.shu...@caviumnetworks.com> >>>>>> Signed-off-by: Jerin Jacob<jerin.ja...@caviumnetworks.com> >>>>>> --- >>>>>> lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++-- >>>>>> 1 file changed, 8 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>>> b/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>>> index 04914406f..348b7a7f4 100644 >>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd) >>>>>> dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); >>>>>> dma_map.vaddr = ms[i].addr_64; >>>>>> dma_map.size = ms[i].len; >>>>>> - dma_map.iova = ms[i].phys_addr; >>>>>> + if (rte_eal_iova_mode() == RTE_IOVA_VA) >>>>>> + dma_map.iova = dma_map.vaddr; >>>>>> + else >>>>>> + dma_map.iova = ms[i].phys_addr; >>>>>> dma_map.flags = VFIO_DMA_MAP_FLAG_READ | >>>>>> VFIO_DMA_MAP_FLAG_WRITE; >>>>> >>>>> IIUC, it is changing default behavior for VFIO devices. >>>>> >>>>> I see a possible problem, but I'm not sure the case is valid. >>>>> >>>>> Imagine you have two devices in the iommu group, and the two devices are >>>>> used in separate processes. Each process could try two different >>>>> physical addresses at the same virtual address, and so the second map >>>>> would fail. >>>> >>>> IMO, Doesn't look like a problem. Here is the data flow >>>> >>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only >>>> on primary process >>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359 >>>> >>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure >>>> that, the Secondary process has the _same_ virtual address as primary or >>>> exit from on attach. >>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452 >>>> >>>> 3) Since secondary process adds the mapped the virtual address in step (2). >>>> in the page table in OS. On SMMU entry miss(When device >>>> request from I/O transaction), OS will load the mapping and update the SMMU >>>> "context" with page tables from MMU. >>> >>> Ok thanks for the detailed info, but what about the case where the same >>> iommu group is used by two primary processes? >> >> Does that case exist with DPDK? We always need to blacklist same BDF in >> the secondary process to make things work with existing DPDK setup. Which >> make sense as well. Only primary process configures the HW blocks. > > I meant the case when two BDF are in the same IOMMU group (if ACS is not > supported at some point in the hierarchy). And I meant two primary > processes running, like for example two containers running each a DPDK > application. > > Maybe this is not a valid use-case (it is not secure, as it would break > isolation between the two containers), but it seems that it is something > DPDK allows today, if I'm not mistaken. > I'm not sure how two primary process could run, as because latter primary process would try accessing /var/run/.rte_config and would fail at this [1] point.
It's not valid use-case for dpdk (imo). [1] http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204 >>> >>> I don't know how frequent it is, but if ACS is not supported by either the >>> endpoint or the the root port, then you would have to share the same IOMMU >>> group for all the ports of your card. Right? >> >> ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF >> comes as a IOMMU group. >> >> If it in bypass mode anyway you use in vfio-noiommu mode as >> there is no protection anyway. > Yes. > >>> >>>> Let me add the background for why this feature is required in DPDK to >>>> enable NPU style co-processors. >>>> >>>> The traditional NICs the Rx path code look like this: >>>> 1) On control path, Fill the mempool with buffers >>>> 2) on rx_burst(), alloc the mbuf from mempool >>>> 3) SW has the mbuf in hand(which is a virtual address) and program the >>>> HW with mbuf->buf_physaddr) >>>> 4) Return the last pushed mbuf(will be updated by HW by now) >>>> >>>> >>>> On NPU style co-processors, situation is different as the buffer recycling >>>> has been done in HW unlike SW model. Here is the data flow: >>>> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA >>>> address, which is PA in existing model) >>>> 2) on rx_burst, HW gives you IOVA address(as address as step 1) >>>> 3) As application expects VA to operate on it, rx_burst() needs to >>>> convert to VA from PAA. Which is very costly. >>>> Instead with this IOVA as VA scheme, We can avoid the cost of converting >>>> with help of IOMMU/SMMU. >>>> >>>> This patch set auto detects the mode based available of type devices in >>>> bus and provides an option to override mode based on eal argument, so we >>>> don't foresee any issue with this approach and welcome any alternative >>>> approaches. >>> >>> I don't question the need of the feature for these kind of >>> co-processors, using VA as IOVA in your case seems very valid. >>> >>> What concerns me is that we change the default behavior for all other >>> devices. Having an option to override is fine to me, but the default >>> mode should remain the same IMHO. >> >> Doesn't seems to be a technical point. But I agree with your concern. >> we will address it. >> I think, we have two ways to address it. >> >> option 1: >> - In existing patch, >> a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA) >> http://dpdk.org/dev/patchwork/patch/25192 >> b) only when with eal argument sets to RTE_IOVA_VA and then bus probed >> value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA >> http://dpdk.org/dev/patchwork/patch/25193/ >> check the code after rte_bus_scan() >> >> option 2: >> On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/ >> we can check the rte_pci_device.id.vendor_id == CAVIUM to select the >> mode so other type of devices safe. >> >> I think, option 2 makes sense, as it gives foolproof auto detection scheme >> and >> without effecting any other devices that not interested in this scheme >> >> Does that address your concern about the patchset? > > Yes it does, or maybe create a new flag in struct rte_pci_driver's > drv_flags to provide a hint it prefers to use VA as IOVA? > > It, of course, would just be a hint, and should be set only when other > conditions are met. > >>> Wouldn't it be possible to default to VA as IOVA only when an HW mempool >>> is in use? >> >> It will be too late as in the normal scheme of things, application >> creates the pool. > > OK, makes sense. > > Thanks, > Maxime > >>> >>>> Similar problem exists in another part of the code in DPDK, >>>> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231 >>>> Its a conditional compilation based approach with duplicating the vfio >>>> code and we are trying to fix the problem in a generic way so that >>>> everyone can get benefited out of it. >>>> >>>> Comments are welcome. >>> >>> Thanks, >>> Maxime >>> >>>> /Jerin >>>> >>>>> >>>>> By using physical addresses, you are safe against this problem. >>>>> >>>>> Any thoughts? >>>>> >>>>> Cheers, >>>>> Maxime