Re: [RFC 2/2] drm/msm/gem: Make use of the system cache
On Wed, Nov 17, 2021 at 12:16 AM Georgi Djakov wrote:
>
> Instead of writing to WC cmdstream buffers that go all the way to the main
> memory, let's use the system cache to improve the performance.
>
> Signed-off-by: Georgi Djakov
> ---
>  drivers/gpu/drm/msm/msm_gem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
> index 104fdfc14027..921a1c24721e 100644
> --- a/drivers/gpu/drm/msm/msm_gem.c
> +++ b/drivers/gpu/drm/msm/msm_gem.c
> @@ -214,7 +214,7 @@ void msm_gem_put_pages(struct drm_gem_object *obj)
>  static pgprot_t msm_gem_pgprot(struct msm_gem_object *msm_obj, pgprot_t prot)
>  {
>         if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
> -               return pgprot_writecombine(prot);
> +               return pgprot_syscached(prot);
>         return prot;
>  }

Based on the definition in patch 1, doesn't this mean that 32-bit kernels
degrade from writecombined to uncached, making them a lot slower?

My feeling about this series is that there should be a clearer definition
of what exactly happens on systems with and without a system cache.

Arnd
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: dma-direct fixes and cleanups v3
On Tue, Nov 16, 2021 at 11:31:49AM +, Robin Murphy wrote:
> On 2021-11-11 06:50, Christoph Hellwig wrote:
>> Hi all,
>>
>> Linus complained about the complex flow in dma_direct_alloc, so this
>> tries to simplify it a bit, and while I was at it I also made sure that
>> unencrypted pages never leak back into the page allocator.
>
> Before I forget, I've had a quick skim of the remaining patches and
> nothing more stands out. Let me know if you'd like me to find time to
> check everything over in detail again for a proper review, but otherwise
> I reckon we may as well get this baking in -next sooner rather than
> later.

I'd rather wait for a proper review; I don't think we are in a rush with
any of these patches.
Re: [PATCH 01/11] iommu: Add device dma ownership set/release interfaces
Hi Jason,

On 11/16/21 9:46 PM, Jason Gunthorpe wrote:
> On Tue, Nov 16, 2021 at 09:57:30AM +0800, Lu Baolu wrote:
>> Hi Christoph,
>>
>> On 11/15/21 9:14 PM, Christoph Hellwig wrote:
>>> On Mon, Nov 15, 2021 at 10:05:42AM +0800, Lu Baolu wrote:
>>>> +enum iommu_dma_owner {
>>>> +	DMA_OWNER_NONE,
>>>> +	DMA_OWNER_KERNEL,
>>>> +	DMA_OWNER_USER,
>>>> +};
>>>> +
>>>> +	enum iommu_dma_owner dma_owner;
>>>> +	refcount_t owner_cnt;
>>>> +	struct file *owner_user_file;
>>>
>>> I'd just overload the ownership into owner_user_file:
>>>
>>>   NULL -> no owner
>>>   (struct file *)1UL -> kernel
>>>   real pointer -> user
>>>
>>> Which could simplify a lot of the code dealing with the owner.
>>
>> Yeah! Sounds reasonable. I will make this change in the next version.
>
> It would be good to figure out how to make iommu_attach_device() enforce
> no other driver binding as a kernel user without a file *, as Robin
> pointed out, before optimizing this.
>
> This fixes an existing bug where iommu_attach_device() only checks the
> group size and is vulnerable to a hot plug increasing the group size
> after it returns. That check should be replaced by this series's logic
> instead.

To my understanding, the essence of this problem is that only the user
owner of the iommu_group may attach an UNMANAGED domain to it. If I
understand it right, how about introducing a new interface to allocate a
user-managed domain and storing the user file pointer in it?
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -94,6 +94,7 @@ struct iommu_domain {
 	void *handler_token;
 	struct iommu_domain_geometry geometry;
 	struct iommu_dma_cookie *iova_cookie;
+	struct file *owner_user_file;
 };

--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1902,6 +1902,18 @@ struct iommu_domain *iommu_domain_alloc(struct bus_type *bus)
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);

+struct iommu_domain *iommu_domain_alloc_user(struct bus_type *bus,
+					     struct file *filep)
+{
+	struct iommu_domain *domain;
+
+	domain = __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED);
+	if (domain)
+		domain->owner_user_file = filep;
+
+	return domain;
+}

When attaching a domain to a user-owned iommu_group, both the group and
the domain should have a matching user fd. Does the above help here?

Best regards,
baolu
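For illustration, the matching rule described above (attach succeeds only when the domain's and the group's user file owners agree) can be modeled in plain C. The struct layouts and the helper name below are hypothetical stand-ins for the kernel structures under discussion, not the proposed API:

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel structures in the thread. */
struct file { int dummy; };

struct iommu_group_model {
	struct file *owner_user_file;	/* set when userspace claims the group */
};

struct iommu_domain_model {
	struct file *owner_user_file;	/* set by iommu_domain_alloc_user() */
};

/*
 * Sketch of the proposed check: a user-allocated UNMANAGED domain may
 * only be attached to a group whose user owner holds the same file.
 * Returns 0 on success, -1 (standing in for -EPERM) otherwise.
 */
static int attach_user_domain(struct iommu_group_model *group,
			      struct iommu_domain_model *domain)
{
	if (!domain->owner_user_file ||
	    domain->owner_user_file != group->owner_user_file)
		return -1;
	return 0;
}
```

A kernel-allocated domain (no owner file) or a domain allocated under a different fd would be rejected by this check, which is the property the thread is after.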
Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.
On 16 Nov 2021, at 3:58, David Hildenbrand wrote:

> On 15.11.21 20:37, Zi Yan wrote:
>> From: Zi Yan
>>
>> Hi David,
>
> Hi,
>
> thanks for looking into this.
>
>> You suggested to make alloc_contig_range() deal with pageblock_order
>> instead of MAX_ORDER - 1 and get rid of the MAX_ORDER - 1 dependency
>> in virtio_mem[1]. This patchset is my attempt to achieve that. Please
>> take a look and let me know if I am doing it correctly or not.
>>
>> From my understanding, cma required an alignment of
>> max(MAX_ORDER - 1, pageblock_order) because, when MIGRATE_CMA was
>> introduced, __free_one_page() did not prevent merging two different
>> pageblocks when MAX_ORDER - 1 > pageblock_order. But the current
>> __free_one_page() implementation does prevent that. It should be OK to
>> just align cma to pageblock_order. alloc_contig_range() relies on
>> MIGRATE_CMA to get free pages, so it can use pageblock_order as its
>> alignment too.
>
> I wonder if that's sufficient. Especially the outer_start logic in
> alloc_contig_range() might be problematic. There are some ugly corner
> cases with free pages/allocations spanning multiple pageblocks and we
> only isolated a single pageblock.

Thank you a lot for writing the list of these corner cases. They are
very helpful!

> Regarding CMA, we have to keep the following cases working:
>
> a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in a
>    MAX_ORDER - 1 page:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is MIGRATE_CMA,
> but not both. We have to make sure alloc_contig_range() keeps working
> correctly. This should be the case even with your change, as we won't
> merge pages across differing migratetypes.

Yes.

> b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume both are MIGRATE_CMA. Assume we want to either allocate from
> pageblock 0 or pageblock 1.
> Especially, assume we want to allocate from pageblock 1. While we would
> isolate pageblock 1, we wouldn't isolate pageblock 0.
>
> What happens if we either have a free page spanning the MAX_ORDER - 1
> range already OR if we have to migrate a MAX_ORDER - 1 page, resulting
> in a free MAX_ORDER - 1 page of which only the second pageblock is
> isolated? We would end up essentially freeing a page that has mixed
> pageblocks, essentially placing it in !MIGRATE_ISOLATE free lists ... I
> might be wrong but I have the feeling that this would be problematic.

This could happen when start_isolate_page_range() stumbles upon a
compound page with order >= pageblock_order or a free page with order >=
pageblock_order, but it should not. start_isolate_page_range() should
check the actual page size, either compound page size or free page size,
and set the migratetype across pageblocks if the page is bigger than the
pageblock size. More precisely, set_migratetype_isolate() should do
that.

> c) Concurrent allocations:
>    [       MAX_ORDER - 1       ]
>    [ pageblock 0 | pageblock 1 ]
>
> Assume b) but we have two concurrent CMA allocations to pageblock 0 and
> pageblock 1, which would now be possible as start_isolate_page_range()
> isolate would succeed on both.

Two isolations will be serialized by the zone lock taken by
set_migratetype_isolate(), so the concurrent allocation would not be a
problem. If it is a MAX_ORDER-1 free page, the first comer should split
it and isolate only one of the pageblocks, then the second one can
isolate the other pageblock. If it is a MAX_ORDER-1 compound page, the
first comer should isolate both pageblocks, then the second one would
fail. WDYT?

In sum, it seems to me that the issue is that the page isolation code
only sees pageblocks without checking the actual page. When there are
multiple pageblocks belonging to one page, the problem appears. This
should be fixed.
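The fix Zi Yan proposes above (have the isolation path cover every pageblock a large free or compound page spans, rather than only the one it was called on) can be modeled in a few lines of plain C. pageblock_order, the migratetype array, and the helper name are simplified stand-ins for the real page allocator state, not kernel code:

```c
#include <assert.h>

#define PAGEBLOCK_ORDER	9	/* assumed typical value for illustration */
#define MIGRATE_ISOLATE	1

/* Toy model: one migratetype slot per pageblock. */
static int pageblock_migratetype[16];

/*
 * Sketch: when the page at pfn has order > pageblock_order (a free or
 * compound page spanning several pageblocks), mark MIGRATE_ISOLATE on
 * every pageblock it covers, not only the first one.  Returns the
 * number of pageblocks isolated.
 */
static unsigned long isolate_spanning_page(unsigned long pfn,
					   unsigned int page_order)
{
	unsigned long first_block = pfn >> PAGEBLOCK_ORDER;
	unsigned long nr_blocks = 1;
	unsigned long i;

	if (page_order > PAGEBLOCK_ORDER)
		nr_blocks = 1UL << (page_order - PAGEBLOCK_ORDER);

	for (i = 0; i < nr_blocks; i++)
		pageblock_migratetype[first_block + i] = MIGRATE_ISOLATE;
	return nr_blocks;
}
```

With this rule, the MAX_ORDER-1 case from b) above isolates both pageblocks of the spanning page, so a concurrent isolation of the sibling pageblock cannot race in and see a half-isolated page.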
> > > Regarding virtio-mem, we care about the following cases: > > a) Allocating parts from completely movable MAX_ ORDER - 1 page: >[ MAX_ ORDER - 1 ] >[ pageblock 0 | pageblock 1] > > Assume pageblock 0 and pageblock 1 are either free or contain only > movable pages. Assume we allocated pageblock 0. We have to make sure we > can allocate pageblock 1. The other way around, assume we allocated > pageblock 1, we have to make sure we can allocate pageblock 0. > > Free pages spanning both pageblocks might be problematic. Can you elaborate a bit? If either of pageblock 0 and 1 is used by virtio-mem, why do we care the other? If pageblock 0 and 1 belong to the same page (either free or compound), they should have the same migratetype. If we want to just allocate one of them, we can split the free page or migrate the compound page then split the remaining free page. > > b) Allocate parts of partially movable MAX_ ORDER - 1 page: >[ MAX_ ORDER - 1 ] >[ pageblock 0 | pageblock 1] > > Assume pageblock 0 contains unmovable data but
[RFC 1/2] arm64: Add support for system cache memory type
From: "Isaac J. Manjarres"

Non-coherent devices on systems that support a system or last level
cache may want to request that allocations be cached in the system
cache. For memory that is allocated by the kernel and used for DMA with
devices, the memory attributes used for CPU access should match the
memory attributes that will be used for device access.

The memory attribute that needs to be programmed into the MAIR for
system cache usage is 0xf4 - Normal memory, outer write-back
read/write-allocate, inner non-cacheable. There is currently no support
for this memory attribute for CPU mappings, so add it.

Signed-off-by: Isaac J. Manjarres
Signed-off-by: Georgi Djakov
---
 arch/arm64/include/asm/memory.h  | 1 +
 arch/arm64/include/asm/pgtable.h | 9 +
 arch/arm64/include/asm/sysreg.h  | 1 +
 arch/arm64/mm/proc.S             | 3 ++-
 include/linux/dma-map-ops.h      | 8
 5 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0af70d9abede..22553aab67a4 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -134,6 +134,7 @@
 #define MT_NORMAL_NC		2
 #define MT_DEVICE_nGnRnE	3
 #define MT_DEVICE_nGnRE		4
+#define MT_NORMAL_iNC_oWB	5

 /*
  * Memory types for Stage-2 translation
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c4ba047a82d2..681c294c364e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -524,6 +524,15 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
 			PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)

+/*
+ * Mark the prot value as outer cacheable and inner non-cacheable. Non-coherent
+ * devices on a system with support for a system or last level cache use these
+ * attributes to cache allocations in the system cache.
+ */
+#define pgprot_syscached(prot) \
+	__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
+			PTE_ATTRINDX(MT_NORMAL_iNC_oWB) | PTE_PXN | PTE_UXN)
+
 #define __HAVE_PHYS_MEM_ACCESS_PROT
 struct file;
 extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 16b3f1a1d468..7c50b1840532 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -715,6 +715,7 @@
 #define MAIR_ATTR_NORMAL_TAGGED	UL(0xf0)
 #define MAIR_ATTR_NORMAL	UL(0xff)
 #define MAIR_ATTR_MASK		UL(0xff)
+#define MAIR_ATTR_NORMAL_iNC_oWB	UL(0xf4)

 /* Position the attr at the correct index */
 #define MAIR_ATTRIDX(attr, idx)	((attr) << ((idx) * 8))
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index d35c90d2e47a..8a75973e5148 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -64,7 +64,8 @@
 	MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRE, MT_DEVICE_nGnRE) | \
 	MAIR_ATTRIDX(MAIR_ATTR_NORMAL_NC, MT_NORMAL_NC) | \
 	MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL) | \
-	MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL_TAGGED))
+	MAIR_ATTRIDX(MAIR_ATTR_NORMAL, MT_NORMAL_TAGGED) | \
+	MAIR_ATTRIDX(MAIR_ATTR_NORMAL_iNC_oWB, MT_NORMAL_iNC_oWB))

 #ifdef CONFIG_CPU_PM
 /**
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..1f7d75201577 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -277,6 +277,14 @@ void arch_dma_free(struct device *dev, size_t size, void *cpu_addr,
 #define pgprot_dmacoherent(prot)	pgprot_noncached(prot)
 #endif

+/*
+ * If there is no system cache pgprot, then fallback to dmacoherent
+ * pgprot, as the expectation is that the device is not coherent.
+ */
+#ifndef pgprot_syscached
+#define pgprot_syscached(prot)	pgprot_dmacoherent(prot)
+#endif
+
 pgprot_t dma_pgprot(struct device *dev, pgprot_t prot, unsigned long attrs);
 #else
 static inline pgprot_t dma_pgprot(struct device *dev, pgprot_t prot,
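The fallback above is what drives the question raised in the reply to patch 2: on an architecture that does not define pgprot_syscached, the macro resolves to pgprot_dmacoherent, which in turn typically defaults to pgprot_noncached rather than pgprot_writecombine. The toy preprocessor model below mimics that chain with strings standing in for the real pgprot_t attribute bits; it is an illustration of the macro fallback only, not kernel code:

```c
#include <string.h>
#include <assert.h>

/* Strings stand in for real pgprot_t attribute encodings. */
#define pgprot_noncached(p)	"uncached"
#define pgprot_writecombine(p)	"writecombine"

/* Generic fallback when the arch provides no pgprot_dmacoherent. */
#ifndef pgprot_dmacoherent
#define pgprot_dmacoherent(p)	pgprot_noncached(p)
#endif

/*
 * Arch without MT_NORMAL_iNC_oWB support: pgprot_syscached is not
 * defined, so the new fallback resolves to pgprot_dmacoherent, i.e.
 * uncached -- a downgrade from the writecombine such mappings use
 * today.
 */
#ifndef pgprot_syscached
#define pgprot_syscached(p)	pgprot_dmacoherent(p)
#endif

static const char *mapping_before_patch(void) { return pgprot_writecombine(0); }
static const char *mapping_after_patch(void)  { return pgprot_syscached(0); }
```

Running this model shows the before/after attributes diverging exactly as Arnd describes for kernels whose architecture lacks the new memory type.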
[RFC 2/2] drm/msm/gem: Make use of the system cache
Instead of writing to WC cmdstream buffers that go all the way to the
main memory, let's use the system cache to improve the performance.

Signed-off-by: Georgi Djakov
---
 drivers/gpu/drm/msm/msm_gem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 104fdfc14027..921a1c24721e 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -214,7 +214,7 @@ void msm_gem_put_pages(struct drm_gem_object *obj)
 static pgprot_t msm_gem_pgprot(struct msm_gem_object *msm_obj, pgprot_t prot)
 {
 	if (msm_obj->flags & (MSM_BO_WC|MSM_BO_UNCACHED))
-		return pgprot_writecombine(prot);
+		return pgprot_syscached(prot);
 	return prot;
 }
Re: [PATCH 04/11] PCI: portdrv: Suppress kernel DMA ownership auto-claiming
On Tue, Nov 16, 2021 at 02:22:01PM -0600, Bjorn Helgaas wrote: > On Tue, Nov 16, 2021 at 03:24:29PM +0800, Lu Baolu wrote: > > On 2021/11/16 4:44, Bjorn Helgaas wrote: > > > On Mon, Nov 15, 2021 at 10:05:45AM +0800, Lu Baolu wrote: > > > > IOMMU grouping on PCI necessitates that if we lack isolation on a bridge > > > > then all of the downstream devices will be part of the same IOMMU group > > > > as the bridge. > > > > > > I think this means something like: "If a PCIe Switch Downstream Port > > > lacks , all downstream devices > > > will be part of the same IOMMU group as the switch," right? > > > > For this patch, yes. > > > > > If so, can you fill in the details to make it specific and concrete? > > > > The existing vfio implementation allows a kernel driver to bind with a > > PCI bridge while its downstream devices are assigned to the user space > > though there lacks ACS-like isolation in bridge. > > > > drivers/vfio/vfio.c: > > 540 static bool vfio_dev_driver_allowed(struct device *dev, > > 541 struct device_driver *drv) > > 542 { > > 543 if (dev_is_pci(dev)) { > > 544 struct pci_dev *pdev = to_pci_dev(dev); > > 545 > > 546 if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL) > > 547 return true; > > 548 } > > > > We are moving the group viability check to IOMMU core, and trying to > > make it compatible with the current vfio policy. We saw three types of > > bridges: > > > > #1) PCIe/PCI-to-PCI bridges > > These bridges are configured in the PCI framework, there's no > > dedicated driver for such devices. > > > > #2) Generic PCIe switch downstream port > > The port driver doesn't map and access any MMIO in the PCI BAR. > > The iommu group is viable to user even this driver is bound. > > > > #3) Hot Plug Controller > > The controller driver maps and access the device MMIO. The iommu > > group is not viable to user with this driver bound to its device. > > I *guess* the question here is whether the bridge can or will do DMA? 
> I think that's orthogonal to the question of whether it implements
> BARs, so I'm not sure why the MMIO BARs are part of this discussion.
> I assume it's theoretically possible for a driver to use registers in
> config space to program a device to do DMA, even if the device has no
> BARs.

There are two questions Lu is trying to get at:

1) Does the bridge driver use DMA?

   Calling pci_set_master() or a dma_map_* API is a sure indication
   that the driver is doing DMA. Kernel DMA doesn't work if the PCI
   device is attached to a non-default iommu domain.

2) If the bridge driver uses MMIO, is it tolerant to hostile userspace
   also touching the same MMIO registers via P2P DMA attacks?

   Conservatively, if the driver maps an MMIO region at all, we can say
   it fails this test. Unless someone wants to do the audit work,
   identifying MMIO usage alone is sufficient to disqualify a driver.

Jason
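The two criteria above amount to a simple disqualification rule. A minimal sketch, with an illustrative struct and helper that are not kernel API, just a summary of the thread's logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative summary of a bound bridge driver's behaviour. */
struct bridge_driver_profile {
	bool uses_dma;	/* calls pci_set_master() or any dma_map_* API */
	bool maps_mmio;	/* ioremaps any MMIO region, BAR or otherwise */
};

/*
 * Sketch of the rule from the thread: absent a per-driver audit, the
 * iommu group stays viable for userspace only if the bound driver
 * neither does kernel DMA nor touches MMIO that userspace could also
 * reach via P2P DMA.
 */
static bool group_viable_with_driver(const struct bridge_driver_profile *p)
{
	return !p->uses_dma && !p->maps_mmio;
}
```

Mapped onto Lu's three bridge classes, a generic PCIe switch downstream port (no DMA, no MMIO) passes, while a hotplug controller (maps MMIO) fails.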
Re: [PATCH 04/11] PCI: portdrv: Suppress kernel DMA ownership auto-claiming
On Tue, Nov 16, 2021 at 03:24:29PM +0800, Lu Baolu wrote: > On 2021/11/16 4:44, Bjorn Helgaas wrote: > > On Mon, Nov 15, 2021 at 10:05:45AM +0800, Lu Baolu wrote: > > > IOMMU grouping on PCI necessitates that if we lack isolation on a bridge > > > then all of the downstream devices will be part of the same IOMMU group > > > as the bridge. > > > > I think this means something like: "If a PCIe Switch Downstream Port > > lacks , all downstream devices > > will be part of the same IOMMU group as the switch," right? > > For this patch, yes. > > > If so, can you fill in the details to make it specific and concrete? > > The existing vfio implementation allows a kernel driver to bind with a > PCI bridge while its downstream devices are assigned to the user space > though there lacks ACS-like isolation in bridge. > > drivers/vfio/vfio.c: > 540 static bool vfio_dev_driver_allowed(struct device *dev, > 541 struct device_driver *drv) > 542 { > 543 if (dev_is_pci(dev)) { > 544 struct pci_dev *pdev = to_pci_dev(dev); > 545 > 546 if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL) > 547 return true; > 548 } > > We are moving the group viability check to IOMMU core, and trying to > make it compatible with the current vfio policy. We saw three types of > bridges: > > #1) PCIe/PCI-to-PCI bridges > These bridges are configured in the PCI framework, there's no > dedicated driver for such devices. > > #2) Generic PCIe switch downstream port > The port driver doesn't map and access any MMIO in the PCI BAR. > The iommu group is viable to user even this driver is bound. > > #3) Hot Plug Controller > The controller driver maps and access the device MMIO. The iommu > group is not viable to user with this driver bound to its device. I *guess* the question here is whether the bridge can or will do DMA? I think that's orthogonal to the question of whether it implements BARs, so I'm not sure why the MMIO BARs are part of this discussion. 
I assume it's theoretically possible for a driver to use registers in
config space to program a device to do DMA, even if the device has no
BARs.

Bjorn
Re: [PATCH 0/2] perf/smmuv3: Support devicetree
On Tue, Nov 16, 2021 at 05:00:14PM +, Robin Murphy wrote: > On 2021-11-16 15:42, Jean-Philippe Brucker wrote: > > On Tue, Nov 16, 2021 at 12:02:47PM +, Robin Murphy wrote: > > > On 2021-11-16 11:35, Jean-Philippe Brucker wrote: > > > > Add devicetree binding for the SMMUv3 PMU, called Performance Monitoring > > > > Counter Group (PMCG) in the spec. Each SMMUv3 implementation can have > > > > multiple independent PMCGs, for example one for the Translation Control > > > > Unit (TCU) and one per Translation Buffer Unit (TBU). > > > > > > > > I previously sent the binding as reply to Jay Chen's thread implementing > > > > device tree support [1]. This posting addresses the comments from that > > > > thread. > > > > > > Ha, I'd also resurrected this and was planning to post it at some point > > > this > > > week[0] - you should have said :) > > > > Ah sorry about that, I just resent because there was some demand for it at > > Linaro > > Heh, no worries - it's not like you were even CC'ed on the thread where I > only mentioned I *might* do it. > > Can I get away with being cheeky and just saying that my review comments are > the diff between my branch and yours, I wonder... Sure, that works for me, I'll send a v2 this week or so Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/2] perf/smmuv3: Support devicetree
On 2021-11-16 15:42, Jean-Philippe Brucker wrote: On Tue, Nov 16, 2021 at 12:02:47PM +, Robin Murphy wrote: On 2021-11-16 11:35, Jean-Philippe Brucker wrote: Add devicetree binding for the SMMUv3 PMU, called Performance Monitoring Counter Group (PMCG) in the spec. Each SMMUv3 implementation can have multiple independent PMCGs, for example one for the Translation Control Unit (TCU) and one per Translation Buffer Unit (TBU). I previously sent the binding as reply to Jay Chen's thread implementing device tree support [1]. This posting addresses the comments from that thread. Ha, I'd also resurrected this and was planning to post it at some point this week[0] - you should have said :) Ah sorry about that, I just resent because there was some demand for it at Linaro Heh, no worries - it's not like you were even CC'ed on the thread where I only mentioned I *might* do it. Can I get away with being cheeky and just saying that my review comments are the diff between my branch and yours, I wonder... Patch 1 adds two compatible strings. "arm,smmu-v3-pmcg" is common to all PMCGs. "hisilicon,smmu-v3-pmcg-hip08" allows to support the same quirk as IORT for that implementation (see patch 2). We'll probably want to also introduce compatible strings for each implementation that has additional perf events. For example the MMU-600 implementation has different events for TCU and TBU PMCGs [2], but both components have the same device IDs. So the driver could differentiate them if they had two distinct compatible strings such as "arm,mmu-600-pmcg-tbu" and "arm,mmu-600-pmcg-tcu". Actually it only needs a general MMU-600 compatible, since once you know it's an Arm Ltd. implementation, you can assume the pattern for the IMP_DEF ID registers to figure out the rest. It might be an error in the MMU-600 spec specifically, both TBU and TCU PMU registers have a 0x83 PIDR0, where I think the TBU should be 0x84 (the revC model uses that value). 
It's possible that the implementation actually has 0x84 instead.

Yup, it's a mistake in the TRM. I just checked a real MMU-600 and the
PMU PIDRs match the main TCU/TBU PIDRs as expected. At least the MMU-700
docs haven't repeated the same error.

Cheers,
Robin.
Re: [PATCH 1/2] dt-bindings: Add Arm SMMUv3 PMCG binding
On Tue, Nov 16, 2021 at 08:02:53AM -0600, Rob Herring wrote:
> My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check'
> on your patch (DT_CHECKER_FLAGS is new in v5.13):
>
> yamllint warnings/errors:
> ./Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml:24:9: [warning] wrong indentation: expected 10 but found 8 (indentation)
> ./Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml:25:11: [warning] wrong indentation: expected 12 but found 10 (indentation)
>
> dtschema/dtc warnings/errors:
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b42:reg:0: [0, 725745664, 0, 4096] is too long
> 	From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b42:reg:1: [0, 725811200, 0, 4096] is too long
> 	From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b44:reg:0: [0, 725876736, 0, 4096] is too long
> 	From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml
> /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b44:reg:1: [0, 725942272, 0, 4096] is too long
> 	From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml
>
> doc reference errors (make refcheckdocs):
>
> See https://patchwork.ozlabs.org/patch/1555758
>
> This check can fail if there are any dependencies. The base for a patch
> series is generally the most recent rc1.
>
> If you already ran 'make dt_binding_check' and didn't see the above
> error(s), then make sure 'yamllint' is installed and dt-schema is up to
> date:
>
>   pip3 install dtschema --upgrade
>
> Please check and re-submit.

Right, I'll fix those; I had only run dtbs_check.

Thanks,
Jean
Re: [PATCH 2/2] perf/smmuv3: Add devicetree support
On Tue, Nov 16, 2021 at 12:06:36PM +, John Garry wrote: > On 16/11/2021 11:35, Jean-Philippe Brucker wrote: > > Add device-tree support to the SMMUv3 PMCG. One small cosmetic change > > while factoring the option mask printout: don't display it when zero, it > > only contains one erratum at the moment. > > > > Signed-off-by: Jay Chen > > Signed-off-by: Jean-Philippe Brucker > > --- > > drivers/perf/arm_smmuv3_pmu.c | 25 +++-- > > 1 file changed, 23 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c > > index 226348822ab3..958325ac103a 100644 > > --- a/drivers/perf/arm_smmuv3_pmu.c > > +++ b/drivers/perf/arm_smmuv3_pmu.c > > @@ -47,6 +47,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -750,8 +751,15 @@ static void smmu_pmu_get_acpi_options(struct smmu_pmu > > *smmu_pmu) > > smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY; > > break; > > } > > +} > > + > > +static void smmu_pmu_get_of_options(struct smmu_pmu *smmu_pmu) > > +{ > > + struct device_node *node = smmu_pmu->dev->of_node; > > - dev_notice(smmu_pmu->dev, "option mask 0x%x\n", smmu_pmu->options); > > + if (of_device_is_compatible(node, "hisilicon,smmu-v3-pmcg-hip08")) > > I don't think that this is necessary. We don't support DT for hip08, nor > have any plans to. Incidentally, was this binding missing in your series? Ok I'll drop this (and the compatible value from patch 1) Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/2] perf/smmuv3: Support devicetree
On Tue, Nov 16, 2021 at 12:02:47PM +, Robin Murphy wrote: > On 2021-11-16 11:35, Jean-Philippe Brucker wrote: > > Add devicetree binding for the SMMUv3 PMU, called Performance Monitoring > > Counter Group (PMCG) in the spec. Each SMMUv3 implementation can have > > multiple independent PMCGs, for example one for the Translation Control > > Unit (TCU) and one per Translation Buffer Unit (TBU). > > > > I previously sent the binding as reply to Jay Chen's thread implementing > > device tree support [1]. This posting addresses the comments from that > > thread. > > Ha, I'd also resurrected this and was planning to post it at some point this > week[0] - you should have said :) Ah sorry about that, I just resent because there was some demand for it at Linaro > > Patch 1 adds two compatible strings. "arm,smmu-v3-pmcg" is common to all > > PMCGs. "hisilicon,smmu-v3-pmcg-hip08" allows to support the same quirk > > as IORT for that implementation (see patch 2). We'll probably want to > > also introduce compatible strings for each implementation that has > > additional perf events. For example the MMU-600 implementation has > > different events for TCU and TBU PMCGs [2], but both components have the > > same device IDs. So the driver could differentiate them if they had two > > distinct compatible strings such as "arm,mmu-600-pmcg-tbu" and > > "arm,mmu-600-pmcg-tcu". > > Actually it only needs a general MMU-600 compatible, since once you know > it's an Arm Ltd. implementation, you can assume the pattern for the IMP_DEF > ID registers to figure out the rest. It might be an error in the MMU-600 spec specifically, both TBU and TCU PMU registers have a 0x83 PIDR0, where I think the TBU should be 0x84 (the revC model uses that value). It's possible that the implementation actually has 0x84 instead. Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 5/5] scsi: storvsc: Add Isolation VM support for storvsc driver
From: Tianyu Lan In Isolation VM, all shared memory with host needs to mark visible to host via hvcall. vmbus_establish_gpadl() has already done it for storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_ mpb_desc() still needs to be handled. Use DMA API(scsi_dma_map/unmap) to map these memory during sending/receiving packet and return swiotlb bounce buffer dma address. In Isolation VM, swiotlb bounce buffer is marked to be visible to host and the swiotlb force mode is enabled. Set device's dma min align mask to HV_HYP_PAGE_SIZE - 1 in order to keep the original data offset in the bounce buffer. Signed-off-by: Tianyu Lan --- drivers/scsi/storvsc_drv.c | 37 + include/linux/hyperv.h | 1 + 2 files changed, 22 insertions(+), 16 deletions(-) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 20595c0ba0ae..ae293600d799 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -21,6 +21,8 @@ #include #include #include +#include + #include #include #include @@ -1336,6 +1338,7 @@ static void storvsc_on_channel_callback(void *context) continue; } request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd); + scsi_dma_unmap(scmnd); } storvsc_on_receive(stor_device, packet, request); @@ -1749,7 +1752,6 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd) struct hv_host_device *host_dev = shost_priv(host); struct hv_device *dev = host_dev->dev; struct storvsc_cmd_request *cmd_request = scsi_cmd_priv(scmnd); - int i; struct scatterlist *sgl; unsigned int sg_count; struct vmscsi_request *vm_srb; @@ -1831,10 +1833,11 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd) payload_sz = sizeof(cmd_request->mpb); if (sg_count) { - unsigned int hvpgoff, hvpfns_to_add; unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset); unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length); - u64 hvpfn; + struct scatterlist *sg; + unsigned long hvpfn, hvpfns_to_add; + 
int j, i = 0; if (hvpg_count > MAX_PAGE_BUFFER_COUNT) { @@ -1848,21 +1851,22 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd) payload->range.len = length; payload->range.offset = offset_in_hvpg; + sg_count = scsi_dma_map(scmnd); + if (sg_count < 0) + return SCSI_MLQUEUE_DEVICE_BUSY; - for (i = 0; sgl != NULL; sgl = sg_next(sgl)) { + for_each_sg(sgl, sg, sg_count, j) { /* -* Init values for the current sgl entry. hvpgoff -* and hvpfns_to_add are in units of Hyper-V size -* pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE -* case also handles values of sgl->offset that are -* larger than PAGE_SIZE. Such offsets are handled -* even on other than the first sgl entry, provided -* they are a multiple of PAGE_SIZE. +* Init values for the current sgl entry. hvpfns_to_add +* is in units of Hyper-V size pages. Handling the +* PAGE_SIZE != HV_HYP_PAGE_SIZE case also handles +* values of sgl->offset that are larger than PAGE_SIZE. +* Such offsets are handled even on other than the first +* sgl entry, provided they are a multiple of PAGE_SIZE. */ - hvpgoff = HVPFN_DOWN(sgl->offset); - hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff; - hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) - - hvpgoff; + hvpfn = HVPFN_DOWN(sg_dma_address(sg)); + hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) + +sg_dma_len(sg)) - hvpfn; /* * Fill the next portion of the PFN array with @@ -1872,7 +1876,7 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd) * the PFN array is filled. */ while (hvpfns_to_add--) - payload->range.pfn_array[i++] = hvpfn++; + payload->range.pfn_array[i++] = hvpfn++; } } @@ -2016,6
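The PFN arithmetic in the rewritten loop above can be checked in isolation: for each DMA-mapped scatterlist entry, the patch derives the first Hyper-V PFN and the PFN count from sg_dma_address()/sg_dma_len(). A self-contained model, with plain integers standing in for the scatterlist entry:

```c
#include <assert.h>

#define HV_HYP_PAGE_SHIFT	12
#define HV_HYP_PAGE_SIZE	(1UL << HV_HYP_PAGE_SHIFT)
#define HVPFN_DOWN(x)	((x) >> HV_HYP_PAGE_SHIFT)
#define HVPFN_UP(x)	(((x) + HV_HYP_PAGE_SIZE - 1) >> HV_HYP_PAGE_SHIFT)

/*
 * Model of the patch's inner loop for one sg entry: compute the first
 * Hyper-V PFN and how many PFNs the [dma_addr, dma_addr + len) range
 * spans, then append them to pfn_array.  Returns the number appended.
 */
static unsigned long fill_pfn_array(unsigned long dma_addr, unsigned long len,
				    unsigned long *pfn_array)
{
	unsigned long hvpfn = HVPFN_DOWN(dma_addr);
	unsigned long hvpfns_to_add = HVPFN_UP(dma_addr + len) - hvpfn;
	unsigned long i;

	for (i = 0; i < hvpfns_to_add; i++)
		pfn_array[i] = hvpfn + i;
	return hvpfns_to_add;
}
```

Note how an entry that starts partway into a Hyper-V page spans one more PFN than len / HV_HYP_PAGE_SIZE would suggest, which is why the driver sets dma_min_align_mask to HV_HYP_PAGE_SIZE - 1 to preserve the original offset through the swiotlb bounce buffer.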
[PATCH 4/5] net: netvsc: Add Isolation VM support for netvsc driver
From: Tianyu Lan In an Isolation VM, all memory shared with the host needs to be marked visible to the host via hvcall. vmbus_establish_gpadl() has already done this for the netvsc rx/tx ring buffers. The page buffer used by vmbus_sendpacket_pagebuffer() still needs to be handled. Use the DMA API to map/unmap this memory while sending/receiving packets; the Hyper-V swiotlb bounce buffer dma address will be returned. The swiotlb bounce buffer has been marked visible to the host during boot-up. Allocate the rx/tx ring buffers via dma_alloc_noncontiguous() in an Isolation VM. After calling vmbus_establish_gpadl(), which marks these pages visible to the host, map these pages into unencrypted address space via dma_vmap_noncontiguous(). Signed-off-by: Tianyu Lan --- drivers/net/hyperv/hyperv_net.h | 5 + drivers/net/hyperv/netvsc.c | 192 +++--- drivers/net/hyperv/rndis_filter.c | 2 + include/linux/hyperv.h| 6 + 4 files changed, 190 insertions(+), 15 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 315278a7cf88..31c77a00d01e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -164,6 +164,7 @@ struct hv_netvsc_packet { u32 total_bytes; u32 send_buf_index; u32 total_data_buflen; + struct hv_dma_range *dma_range; }; #define NETVSC_HASH_KEYLEN 40 @@ -1074,6 +1075,7 @@ struct netvsc_device { /* Receive buffer allocated by us but manages by NetVSP */ void *recv_buf; + struct sg_table *recv_sgt; u32 recv_buf_size; /* allocated bytes */ struct vmbus_gpadl recv_buf_gpadl_handle; u32 recv_section_cnt; @@ -1082,6 +1084,7 @@ struct netvsc_device { /* Send buffer allocated by us */ void *send_buf; + struct sg_table *send_sgt; u32 send_buf_size; struct vmbus_gpadl send_buf_gpadl_handle; u32 send_section_cnt; @@ -1731,4 +1734,6 @@ struct rndis_message { #define RETRY_US_HI1 #define RETRY_MAX 2000/* >10 sec */ +void netvsc_dma_unmap(struct hv_device *hv_dev, + struct hv_netvsc_packet *packet); #endif /* _HYPERV_NET_H */ diff --git 
a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 396bc1c204e6..9cdc71930830 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -146,15 +147,39 @@ static struct netvsc_device *alloc_net_device(void) return net_device; } +static struct hv_device *netvsc_channel_to_device(struct vmbus_channel *channel) +{ + struct vmbus_channel *primary = channel->primary_channel; + + return primary ? primary->device_obj : channel->device_obj; +} + static void free_netvsc_device(struct rcu_head *head) { struct netvsc_device *nvdev = container_of(head, struct netvsc_device, rcu); + struct hv_device *dev = + netvsc_channel_to_device(nvdev->chan_table[0].channel); int i; kfree(nvdev->extension); - vfree(nvdev->recv_buf); - vfree(nvdev->send_buf); + + if (nvdev->recv_sgt) { + dma_vunmap_noncontiguous(>device, nvdev->recv_buf); + dma_free_noncontiguous(>device, nvdev->recv_buf_size, + nvdev->recv_sgt, DMA_FROM_DEVICE); + } else { + vfree(nvdev->recv_buf); + } + + if (nvdev->send_sgt) { + dma_vunmap_noncontiguous(>device, nvdev->send_buf); + dma_free_noncontiguous(>device, nvdev->send_buf_size, + nvdev->send_sgt, DMA_TO_DEVICE); + } else { + vfree(nvdev->send_buf); + } + kfree(nvdev->send_section_map); for (i = 0; i < VRSS_CHANNEL_MAX; i++) { @@ -348,7 +373,21 @@ static int netvsc_init_buf(struct hv_device *device, buf_size = min_t(unsigned int, buf_size, NETVSC_RECEIVE_BUFFER_SIZE_LEGACY); - net_device->recv_buf = vzalloc(buf_size); + if (hv_isolation_type_snp()) { + net_device->recv_sgt = + dma_alloc_noncontiguous(>device, buf_size, + DMA_FROM_DEVICE, GFP_KERNEL, 0); + if (!net_device->recv_sgt) { + pr_err("Fail to allocate recv buffer buf_size %d.\n.", buf_size); + ret = -ENOMEM; + goto cleanup; + } + + net_device->recv_buf = (void *)net_device->recv_sgt->sgl->dma_address; + } else { + net_device->recv_buf = vzalloc(buf_size); + } + if (!net_device->recv_buf) { 
netdev_err(ndev, "unable to allocate receive buffer of size %u\n", @@ -357,8 +396,6 @@ static int netvsc_init_buf(struct hv_device *device,
[PATCH 2/5] dma-mapping: Add vmap/vunmap_noncontiguous() callback in dma ops
From: Tianyu Lan Hyper-V netvsc driver needs to allocate noncontiguous DMA memory and remap it into unencrypted address space before sharing with host. Add vmap/vunmap_noncontiguous() callback and handle the remap in the Hyper-V dma ops callback. Signed-off-by: Tianyu Lan --- include/linux/dma-map-ops.h | 3 +++ kernel/dma/mapping.c| 18 ++ 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 0d5b06b3a4a6..f7b9958ca20a 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -27,6 +27,9 @@ struct dma_map_ops { unsigned long attrs); void (*free_noncontiguous)(struct device *dev, size_t size, struct sg_table *sgt, enum dma_data_direction dir); + void *(*vmap_noncontiguous)(struct device *dev, size_t size, + struct sg_table *sgt); + void (*vunmap_noncontiguous)(struct device *dev, void *addr); int (*mmap)(struct device *, struct vm_area_struct *, void *, dma_addr_t, size_t, unsigned long attrs); diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 9478eccd1c8e..7fd751d866cc 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -674,8 +674,14 @@ void *dma_vmap_noncontiguous(struct device *dev, size_t size, const struct dma_map_ops *ops = get_dma_ops(dev); unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT; - if (ops && ops->alloc_noncontiguous) - return vmap(sgt_handle(sgt)->pages, count, VM_MAP, PAGE_KERNEL); + if (ops) { + if (ops->vmap_noncontiguous) + return ops->vmap_noncontiguous(dev, size, sgt); + else if (ops->alloc_noncontiguous) + return vmap(sgt_handle(sgt)->pages, count, VM_MAP, + PAGE_KERNEL); + } + return page_address(sg_page(sgt->sgl)); } EXPORT_SYMBOL_GPL(dma_vmap_noncontiguous); @@ -684,8 +690,12 @@ void dma_vunmap_noncontiguous(struct device *dev, void *vaddr) { const struct dma_map_ops *ops = get_dma_ops(dev); - if (ops && ops->alloc_noncontiguous) - vunmap(vaddr); + if (ops) { + if (ops->vunmap_noncontiguous) + 
ops->vunmap_noncontiguous(dev, vaddr); + else if (ops->alloc_noncontiguous) + vunmap(vaddr); + } } EXPORT_SYMBOL_GPL(dma_vunmap_noncontiguous); -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/5] x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
From: Tianyu Lan In an Isolation VM with AMD SEV, the bounce buffer needs to be accessed via an extra address space above the shared_gpa_boundary (e.g. the 39-bit address line) reported by the Hyper-V ISOLATION_CONFIG CPUID leaf. The access physical address will be the original physical address + shared_gpa_boundary. In the AMD SEV-SNP spec, the shared_gpa_boundary is called the virtual top of memory (vTOM). Memory addresses below vTOM are automatically treated as private, while memory above vTOM is treated as shared. Expose swiotlb_unencrypted_base for platforms to set the unencrypted memory base offset; the platform calls swiotlb_update_mem_attributes() to remap the swiotlb memory into the unencrypted address space. memremap() cannot be called at the early stage, so the remapping code goes into swiotlb_update_mem_attributes(). Store the remap address and use it to copy data from/to the swiotlb bounce buffer. Signed-off-by: Tianyu Lan --- include/linux/swiotlb.h | 6 kernel/dma/swiotlb.c| 75 - 2 files changed, 73 insertions(+), 8 deletions(-) diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 569272871375..09a140d617fa 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force; * @end: The end address of the swiotlb memory pool. Used to do a quick * range check to see if the memory was in fact allocated by this * API. + * @vaddr: The vaddr of the swiotlb memory pool. The swiotlb + * memory pool may be remapped in the memory encrypted case and store + * virtual address for bounce buffer operation. * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and * @end. For default swiotlb, this is command line adjustable via * setup_io_tlb_npages. 
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force; struct io_tlb_mem { phys_addr_t start; phys_addr_t end; + void *vaddr; unsigned long nslabs; unsigned long used; unsigned int index; @@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev) } #endif /* CONFIG_DMA_RESTRICTED_POOL */ +extern phys_addr_t swiotlb_unencrypted_base; + #endif /* __LINUX_SWIOTLB_H */ diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 8e840fbbed7c..4735c5e0f44d 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -50,6 +50,7 @@ #include #include +#include #include #include #include @@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force; struct io_tlb_mem io_tlb_default_mem; +phys_addr_t swiotlb_unencrypted_base; + /* * Max segment that we can provide which (if pages are contingous) will * not be bounced (unless SWIOTLB_FORCE is set). @@ -155,6 +158,31 @@ static inline unsigned long nr_slots(u64 val) return DIV_ROUND_UP(val, IO_TLB_SIZE); } +/* + * Remap swioltb memory in the unencrypted physical address space + * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP + * Isolation VMs). + */ +void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes) +{ + void *vaddr; + + if (swiotlb_unencrypted_base) { + phys_addr_t paddr = mem->start + swiotlb_unencrypted_base; + + vaddr = memremap(paddr, bytes, MEMREMAP_WB); + if (!vaddr) { + pr_err("Failed to map the unencrypted memory %llx size %lx.\n", + paddr, bytes); + return NULL; + } + + return vaddr; + } + + return phys_to_virt(mem->start); +} + /* * Early SWIOTLB allocation may be too early to allow an architecture to * perform the desired operations. 
This function allows the architecture to @@ -172,10 +200,17 @@ void __init swiotlb_update_mem_attributes(void) vaddr = phys_to_virt(mem->start); bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT); set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT); - memset(vaddr, 0, bytes); + + mem->vaddr = swiotlb_mem_remap(mem, bytes); + if (!mem->vaddr) { + pr_err("Fail to remap swiotlb mem.\n"); + return; + } + + memset(mem->vaddr, 0, bytes); } -static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start, +static int swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start, unsigned long nslabs, bool late_alloc) { void *vaddr = phys_to_virt(start); @@ -196,13 +231,28 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start, mem->slots[i].orig_addr = INVALID_PHYS_ADDR; mem->slots[i].alloc_size = 0; } + + /* +* With swiotlb_unencrypted_base setting, swiotlb bounce buffer will +* be remapped in the swiotlb_update_mem_attributes() and return
[PATCH 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
From: Tianyu Lan A Hyper-V Isolation VM requires bounce buffer support to copy data from/to encrypted memory, so enable swiotlb force mode to use the swiotlb bounce buffer for DMA transactions. In an Isolation VM with AMD SEV, the bounce buffer needs to be accessed via an extra address space above the shared_gpa_boundary (e.g. the 39-bit address line) reported by the Hyper-V ISOLATION_CONFIG CPUID leaf. The access physical address will be the original physical address + shared_gpa_boundary. In the AMD SEV-SNP spec, the shared_gpa_boundary is called the virtual top of memory (vTOM). Memory addresses below vTOM are automatically treated as private, while memory above vTOM is treated as shared. Hyper-V initializes the swiotlb bounce buffer, and the default swiotlb needs to be disabled; pci_swiotlb_detect_override() and pci_swiotlb_detect_4gb() enable the default one. To override that setting, hyperv_swiotlb_detect() needs to run before these detect functions, which depend on pci_xen_swiotlb_init(). Make pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect() to keep the order. The swiotlb bounce buffer code calls set_memory_decrypted() to mark the bounce buffer visible to the host and maps it in the extra address space via memremap(). Populate the shared_gpa_boundary (vTOM) via the swiotlb_unencrypted_base variable. memremap() can't work as early as hyperv_iommu_swiotlb_init(), so call swiotlb_update_mem_attributes() in hyperv_iommu_swiotlb_later_init(). Add Hyper-V dma ops and provide alloc/free and vmap/vunmap noncontiguous callbacks to handle requests to allocate and map noncontiguous dma memory in vmbus device drivers; the netvsc driver will use this. Set the dma_ops_bypass flag for hv devices so that dma direct functions are used when mapping/unmapping dma pages. 
Signed-off-by: Tianyu Lan --- arch/x86/mm/mem_encrypt.c | 4 +- arch/x86/xen/pci-swiotlb-xen.c | 3 +- drivers/hv/Kconfig | 1 + drivers/hv/vmbus_drv.c | 6 ++ drivers/iommu/hyperv-iommu.c | 164 + include/linux/hyperv.h | 10 ++ 6 files changed, 186 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c index 35487305d8af..65bc385ae07a 100644 --- a/arch/x86/mm/mem_encrypt.c +++ b/arch/x86/mm/mem_encrypt.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "mm_internal.h" @@ -203,7 +204,8 @@ void __init sev_setup_arch(void) phys_addr_t total_mem = memblock_phys_mem_size(); unsigned long size; - if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) + if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) + && !hv_is_isolation_supported()) return; /* diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c index 46df59aeaa06..30fd0600b008 100644 --- a/arch/x86/xen/pci-swiotlb-xen.c +++ b/arch/x86/xen/pci-swiotlb-xen.c @@ -4,6 +4,7 @@ #include #include +#include #include #include @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void) EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late); IOMMU_INIT_FINISH(pci_xen_swiotlb_detect, - NULL, + hyperv_swiotlb_detect, pci_xen_swiotlb_init, NULL); diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index dd12af20e467..d43b4cd88f57 100644 --- a/drivers/hv/Kconfig +++ b/drivers/hv/Kconfig @@ -9,6 +9,7 @@ config HYPERV select PARAVIRT select X86_HV_CALLBACK_VECTOR if X86 select VMAP_PFN + select DMA_OPS_BYPASS help Select this option to run Linux as a Hyper-V client operating system. 
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 392c1ac4f819..32dc193e31cd 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include "hyperv_vmbus.h" @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type, return child_device_obj; } +static u64 vmbus_dma_mask = DMA_BIT_MASK(64); /* * vmbus_device_register - Register the child device */ @@ -2118,6 +2120,10 @@ int vmbus_device_register(struct hv_device *child_device_obj) } hv_debug_add_dev_dir(child_device_obj); + child_device_obj->device.dma_ops_bypass = true; + child_device_obj->device.dma_ops = _iommu_dma_ops; + child_device_obj->device.dma_mask = _dma_mask; + child_device_obj->device.dma_parms = _device_obj->dma_parms; return 0; err_kset_unregister: diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c index e285a220c913..ebcb628e7e8f 100644 --- a/drivers/iommu/hyperv-iommu.c +++ b/drivers/iommu/hyperv-iommu.c @@ -13,14 +13,21 @@ #include #include #include +#include +#include #include #include #include #include +#include +#include #include #include #include +#include +#include +#include #include
[PATCH 0/5] x86/Hyper-V: Add Hyper-V Isolation VM support(Second part)
From: Tianyu Lan Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset adds support for these Isolation VMs in Linux. The memory of these VMs is encrypted, and the host can't access guest memory directly. Hyper-V provides a new host-visibility hvcall, and the guest needs to call it to mark memory visible to the host before sharing that memory. For security, no network/storage stack memory should be shared with the host, hence the need for bounce buffers. The vmbus channel ring buffer already plays the bounce buffer role, because all data from/to the host is copied between the ring buffer and IO stack memory; so mark the vmbus channel ring buffer visible. In an SNP Isolation VM, the guest needs to access the shared memory via an extra address space, which is specified by the Hyper-V HYPERV_CPUID_ISOLATION_CONFIG CPUID leaf. The access physical address of the shared memory is the bounce buffer memory GPA plus the shared_gpa_boundary reported by CPUID. This patchset enables the swiotlb bounce buffer for netvsc/storvsc in Isolation VMs: add Hyper-V dma ops providing dma_alloc/free_noncontiguous and vmap/vunmap_noncontiguous callbacks, allocate the rx/tx rings via dma_alloc_noncontiguous(), and map them into the extra address space via dma_vmap_noncontiguous(). 
Tianyu Lan (5): x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM dma-mapping: Add vmap/vunmap_noncontiguous() callback in dma ops hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM net: netvsc: Add Isolation VM support for netvsc driver scsi: storvsc: Add Isolation VM support for storvsc driver arch/x86/mm/mem_encrypt.c | 4 +- arch/x86/xen/pci-swiotlb-xen.c| 3 +- drivers/hv/Kconfig| 1 + drivers/hv/vmbus_drv.c| 6 + drivers/iommu/hyperv-iommu.c | 164 + drivers/net/hyperv/hyperv_net.h | 5 + drivers/net/hyperv/netvsc.c | 192 +++--- drivers/net/hyperv/rndis_filter.c | 2 + drivers/scsi/storvsc_drv.c| 37 +++--- include/linux/dma-map-ops.h | 3 + include/linux/hyperv.h| 17 +++ include/linux/swiotlb.h | 6 + kernel/dma/mapping.c | 18 ++- kernel/dma/swiotlb.c | 75 ++-- 14 files changed, 488 insertions(+), 45 deletions(-) -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/5] iommu: Some IOVA code reorganisation
On 2021-11-16 14:21, John Garry wrote: On 04/10/2021 12:44, Will Deacon wrote: On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote: The IOVA domain structure is a bit overloaded, holding: - IOVA tree management - FQ control - IOVA rcache memories Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c uses the FQ feature. This series separates out that structure. In addition, it moves the FQ code into dma-iommu.c . This is not strictly necessary, but it does make it easier for the FQ domain lookup the rcache domain. The rcache code stays where it is, as it may be reworked in future, so there is not much point in relocating and then discarding. This topic was initially discussed and suggested (I think) by Robin here: https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/ It would be useful to have Robin's Ack on patches 2-4. The implementation looks straightforward to me, but the thread above isn't very clear about what is being suggested. Hi Robin, Just wondering if you had made any progress on your FQ code rework or your own re-org? Hey John - as it happens I started hacking on that in earnest about half an hour ago, aiming to get something out later this week. Cheers, Robin. I wasn't planning on progressing https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/ until this is done first (and that is still a big issue), even though not strictly necessary. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/5] iommu: Some IOVA code reorganisation
On 04/10/2021 12:44, Will Deacon wrote: On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote: The IOVA domain structure is a bit overloaded, holding: - IOVA tree management - FQ control - IOVA rcache memories Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c uses the FQ feature. This series separates out that structure. In addition, it moves the FQ code into dma-iommu.c . This is not strictly necessary, but it does make it easier for the FQ domain lookup the rcache domain. The rcache code stays where it is, as it may be reworked in future, so there is not much point in relocating and then discarding. This topic was initially discussed and suggested (I think) by Robin here: https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/ It would be useful to have Robin's Ack on patches 2-4. The implementation looks straightforward to me, but the thread above isn't very clear about what is being suggested. Hi Robin, Just wondering if you had made any progress on your FQ code rework or your own re-org? I wasn't planning on progressing https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/ until this is done first (and that is still a big issue), even though not strictly necessary. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/2] dt-bindings: Add Arm SMMUv3 PMCG binding
On Tue, 16 Nov 2021 11:35:36 +, Jean-Philippe Brucker wrote: > Add binding for the Arm SMMUv3 PMU. Each node represents a PMCG, and is > placed as a sibling node of the SMMU. Although the PMCGs registers may > be within the SMMU MMIO region, they are separate devices, and there can > be multiple PMCG devices for each SMMU (for example one for the TCU and > one for each TBU). > > Signed-off-by: Jean-Philippe Brucker > --- > .../bindings/iommu/arm,smmu-v3-pmcg.yaml | 67 +++ > 1 file changed, 67 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml > My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check' on your patch (DT_CHECKER_FLAGS is new in v5.13): yamllint warnings/errors: ./Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml:24:9: [warning] wrong indentation: expected 10 but found 8 (indentation) ./Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml:25:11: [warning] wrong indentation: expected 12 but found 10 (indentation) dtschema/dtc warnings/errors: /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b42:reg:0: [0, 725745664, 0, 4096] is too long From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b42:reg:1: [0, 725811200, 0, 4096] is too long From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b44:reg:0: [0, 725876736, 0, 4096] is too long From schema: /usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml /builds/robherring/linux-dt-review/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.example.dt.yaml: example-0: pmu@2b44:reg:1: [0, 725942272, 0, 4096] is too long From schema: 
/usr/local/lib/python3.8/dist-packages/dtschema/schemas/reg.yaml doc reference errors (make refcheckdocs): See https://patchwork.ozlabs.org/patch/1555758 This check can fail if there are any dependencies. The base for a patch series is generally the most recent rc1. If you already ran 'make dt_binding_check' and didn't see the above error(s), then make sure 'yamllint' is installed and dt-schema is up to date: pip3 install dtschema --upgrade Please check and re-submit. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/11] iommu: Add device dma ownership set/release interfaces
On Tue, Nov 16, 2021 at 09:57:30AM +0800, Lu Baolu wrote: > Hi Christoph, > > On 11/15/21 9:14 PM, Christoph Hellwig wrote: > > On Mon, Nov 15, 2021 at 10:05:42AM +0800, Lu Baolu wrote: > > > +enum iommu_dma_owner { > > > + DMA_OWNER_NONE, > > > + DMA_OWNER_KERNEL, > > > + DMA_OWNER_USER, > > > +}; > > > + > > > > > + enum iommu_dma_owner dma_owner; > > > + refcount_t owner_cnt; > > > + struct file *owner_user_file; > > > > I'd just overload the ownership into owner_user_file, > > > > NULL -> no owner > > (struct file *)1UL) -> kernel > > real pointer -> user > > > > Which could simplify a lot of the code dealing with the owner. > > > > Yeah! Sounds reasonable. I will make this in the next version. It would be good to figure out how to make iommu_attach_device() enforce no other driver binding as a kernel user without a file *, as Robin pointed to, before optimizing this. This fixes an existing bug where iommu_attach_device() only checks the group size and is vunerable to a hot plug increasing the group size after it returns. That check should be replaced by this series's logic instead. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] perf/smmuv3: Add devicetree support
On 16/11/2021 11:35, Jean-Philippe Brucker wrote: Add device-tree support to the SMMUv3 PMCG. One small cosmetic change while factoring the option mask printout: don't display it when zero, it only contains one erratum at the moment. Signed-off-by: Jay Chen Signed-off-by: Jean-Philippe Brucker --- drivers/perf/arm_smmuv3_pmu.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c index 226348822ab3..958325ac103a 100644 --- a/drivers/perf/arm_smmuv3_pmu.c +++ b/drivers/perf/arm_smmuv3_pmu.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include #include @@ -750,8 +751,15 @@ static void smmu_pmu_get_acpi_options(struct smmu_pmu *smmu_pmu) smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY; break; } +} + +static void smmu_pmu_get_of_options(struct smmu_pmu *smmu_pmu) +{ + struct device_node *node = smmu_pmu->dev->of_node; - dev_notice(smmu_pmu->dev, "option mask 0x%x\n", smmu_pmu->options); + if (of_device_is_compatible(node, "hisilicon,smmu-v3-pmcg-hip08")) I don't think that this is necessary. We don't support DT for hip08, nor have any plans to. Incidentally, was this binding missing in your series? 
Thanks, John + /* HiSilicon Erratum 162001800 */ + smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY; } static int smmu_pmu_probe(struct platform_device *pdev) @@ -834,7 +842,13 @@ static int smmu_pmu_probe(struct platform_device *pdev) return -EINVAL; } - smmu_pmu_get_acpi_options(smmu_pmu); + if (dev->of_node) + smmu_pmu_get_of_options(smmu_pmu); + else + smmu_pmu_get_acpi_options(smmu_pmu); + + if (smmu_pmu->options) + dev_notice(dev, "option mask 0x%x\n", smmu_pmu->options); /* Pick one CPU to be the preferred one to use */ smmu_pmu->on_cpu = raw_smp_processor_id(); @@ -884,10 +898,17 @@ static void smmu_pmu_shutdown(struct platform_device *pdev) smmu_pmu_disable(_pmu->pmu); } +static const struct of_device_id arm_smmu_pmu_match[] = { + { .compatible = "arm,smmu-v3-pmcg" }, + {}, +}; +MODULE_DEVICE_TABLE(of, arm_smmu_pmu_match); + static struct platform_driver smmu_pmu_driver = { .driver = { .name = "arm-smmu-v3-pmcg", .suppress_bind_attrs = true, + .of_match_table = of_match_ptr(arm_smmu_pmu_match), }, .probe = smmu_pmu_probe, .remove = smmu_pmu_remove, ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/2] perf/smmuv3: Support devicetree
On 2021-11-16 11:35, Jean-Philippe Brucker wrote: Add devicetree binding for the SMMUv3 PMU, called Performance Monitoring Counter Group (PMCG) in the spec. Each SMMUv3 implementation can have multiple independent PMCGs, for example one for the Translation Control Unit (TCU) and one per Translation Buffer Unit (TBU). I previously sent the binding as reply to Jay Chen's thread implementing device tree support [1]. This posting addresses the comments from that thread. Ha, I'd also resurrected this and was planning to post it at some point this week[0] - you should have said :) Patch 1 adds two compatible strings. "arm,smmu-v3-pmcg" is common to all PMCGs. "hisilicon,smmu-v3-pmcg-hip08" allows to support the same quirk as IORT for that implementation (see patch 2). We'll probably want to also introduce compatible strings for each implementation that has additional perf events. For example the MMU-600 implementation has different events for TCU and TBU PMCGs [2], but both components have the same device IDs. So the driver could differentiate them if they had two distinct compatible strings such as "arm,mmu-600-pmcg-tbu" and "arm,mmu-600-pmcg-tcu". Actually it only needs a general MMU-600 compatible, since once you know it's an Arm Ltd. implementation, you can assume the pattern for the IMP_DEF ID registers to figure out the rest. Robin. [0] https://gitlab.arm.com/linux-arm/linux-rm/-/commits/smmu-pmcg The series doesn't deal with this because for testing I use a software model which only implements architected events. I do not include DTS change for that platform because enabling PMCGs requires an additional model option. See my branch smmu/pmu-dt [3] for details. 
[1] https://lore.kernel.org/all/20200707150114.GC159413@myrica/ [2] https://developer.arm.com/documentation/100310/0202/Functional-description/Operation/Performance-Monitoring-Unit [3] https://jpbrucker.net/git/linux/log/?h=smmu/pmu-dt Jean-Philippe Brucker (2): dt-bindings: Add Arm SMMUv3 PMCG binding perf/smmuv3: Add devicetree support .../bindings/iommu/arm,smmu-v3-pmcg.yaml | 67 +++ drivers/perf/arm_smmuv3_pmu.c | 25 ++- 2 files changed, 90 insertions(+), 2 deletions(-) create mode 100644 Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/2] dt-bindings: Add Arm SMMUv3 PMCG binding
Add binding for the Arm SMMUv3 PMU. Each node represents a PMCG, and is placed as a sibling node of the SMMU. Although the PMCGs registers may be within the SMMU MMIO region, they are separate devices, and there can be multiple PMCG devices for each SMMU (for example one for the TCU and one for each TBU). Signed-off-by: Jean-Philippe Brucker --- .../bindings/iommu/arm,smmu-v3-pmcg.yaml | 67 +++ 1 file changed, 67 insertions(+) create mode 100644 Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml b/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml new file mode 100644 index ..a893e071fdb4 --- /dev/null +++ b/Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml @@ -0,0 +1,67 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iommu/arm,smmu-v3-pmcg.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Arm SMMUv3 Performance Monitor Counter Group + +maintainers: + - Will Deacon + - Robin Murphy + +description: |+ + An SMMUv3 may have several Performance Monitor Counter Group (PMCG). + They are standalone performance monitoring units that support both + architected and IMPLEMENTATION DEFINED event counters. + +properties: + $nodename: +pattern: "^pmu@[0-9a-f]*" + compatible: +oneOf: + - items: +- enum: + - hisilicon,smmu-v3-pmcg-hip08 +- const: arm,smmu-v3-pmcg + - const: arm,smmu-v3-pmcg + + reg: +description: | + Base addresses of the PMCG registers. Either a single address for Page 0 + or an additional address for Page 1, where some registers can be + relocated with SMMU_PMCG_CFGR.RELOC_CTRS. 
+minItems: 1 +maxItems: 2 + + interrupts: +maxItems: 1 + + msi-parent: true + +required: + - compatible + - reg + +additionalProperties: false + +examples: + - |+ +#include +#include + +pmu@2b42 { +compatible = "arm,smmu-v3-pmcg"; +reg = <0 0x2b42 0 0x1000>, + <0 0x2b43 0 0x1000>; +interrupts = ; +msi-parent = < 0xff>; +}; + +pmu@2b44 { +compatible = "arm,smmu-v3-pmcg"; +reg = <0 0x2b44 0 0x1000>, + <0 0x2b45 0 0x1000>; +interrupts = ; +msi-parent = < 0xff>; +}; -- 2.33.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 0/2] perf/smmuv3: Support devicetree
Add devicetree binding for the SMMUv3 PMU, called Performance Monitoring Counter Group (PMCG) in the spec. Each SMMUv3 implementation can have multiple independent PMCGs, for example one for the Translation Control Unit (TCU) and one per Translation Buffer Unit (TBU). I previously sent the binding as reply to Jay Chen's thread implementing device tree support [1]. This posting addresses the comments from that thread. Patch 1 adds two compatible strings. "arm,smmu-v3-pmcg" is common to all PMCGs. "hisilicon,smmu-v3-pmcg-hip08" allows to support the same quirk as IORT for that implementation (see patch 2). We'll probably want to also introduce compatible strings for each implementation that has additional perf events. For example the MMU-600 implementation has different events for TCU and TBU PMCGs [2], but both components have the same device IDs. So the driver could differentiate them if they had two distinct compatible strings such as "arm,mmu-600-pmcg-tbu" and "arm,mmu-600-pmcg-tcu". The series doesn't deal with this because for testing I use a software model which only implements architected events. I do not include DTS change for that platform because enabling PMCGs requires an additional model option. See my branch smmu/pmu-dt [3] for details. [1] https://lore.kernel.org/all/20200707150114.GC159413@myrica/ [2] https://developer.arm.com/documentation/100310/0202/Functional-description/Operation/Performance-Monitoring-Unit [3] https://jpbrucker.net/git/linux/log/?h=smmu/pmu-dt Jean-Philippe Brucker (2): dt-bindings: Add Arm SMMUv3 PMCG binding perf/smmuv3: Add devicetree support .../bindings/iommu/arm,smmu-v3-pmcg.yaml | 67 +++ drivers/perf/arm_smmuv3_pmu.c | 25 ++- 2 files changed, 90 insertions(+), 2 deletions(-) create mode 100644 Documentation/devicetree/bindings/iommu/arm,smmu-v3-pmcg.yaml -- 2.33.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 2/2] perf/smmuv3: Add devicetree support
Add device-tree support to the SMMUv3 PMCG. One small cosmetic change while factoring the option mask printout: don't display it when zero, it only contains one erratum at the moment. Signed-off-by: Jay Chen Signed-off-by: Jean-Philippe Brucker --- drivers/perf/arm_smmuv3_pmu.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c index 226348822ab3..958325ac103a 100644 --- a/drivers/perf/arm_smmuv3_pmu.c +++ b/drivers/perf/arm_smmuv3_pmu.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include #include @@ -750,8 +751,15 @@ static void smmu_pmu_get_acpi_options(struct smmu_pmu *smmu_pmu) smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY; break; } +} + +static void smmu_pmu_get_of_options(struct smmu_pmu *smmu_pmu) +{ + struct device_node *node = smmu_pmu->dev->of_node; - dev_notice(smmu_pmu->dev, "option mask 0x%x\n", smmu_pmu->options); + if (of_device_is_compatible(node, "hisilicon,smmu-v3-pmcg-hip08")) + /* HiSilicon Erratum 162001800 */ + smmu_pmu->options |= SMMU_PMCG_EVCNTR_RDONLY; } static int smmu_pmu_probe(struct platform_device *pdev) @@ -834,7 +842,13 @@ static int smmu_pmu_probe(struct platform_device *pdev) return -EINVAL; } - smmu_pmu_get_acpi_options(smmu_pmu); + if (dev->of_node) + smmu_pmu_get_of_options(smmu_pmu); + else + smmu_pmu_get_acpi_options(smmu_pmu); + + if (smmu_pmu->options) + dev_notice(dev, "option mask 0x%x\n", smmu_pmu->options); /* Pick one CPU to be the preferred one to use */ smmu_pmu->on_cpu = raw_smp_processor_id(); @@ -884,10 +898,17 @@ static void smmu_pmu_shutdown(struct platform_device *pdev) smmu_pmu_disable(_pmu->pmu); } +static const struct of_device_id arm_smmu_pmu_match[] = { + { .compatible = "arm,smmu-v3-pmcg" }, + {}, +}; +MODULE_DEVICE_TABLE(of, arm_smmu_pmu_match); + static struct platform_driver smmu_pmu_driver = { .driver = { .name = "arm-smmu-v3-pmcg", .suppress_bind_attrs = true, + .of_match_table = 
of_match_ptr(arm_smmu_pmu_match), }, .probe = smmu_pmu_probe, .remove = smmu_pmu_remove, -- 2.33.1
Re: [PATCH v2 2/6] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 2021/11/16 18:56, Robin Murphy wrote: > On 2021-11-16 09:06, Yicong Yang via iommu wrote: > [...] >> +/* >> + * Get RMR address if provided by the firmware. >> + * Return 0 if the IOMMU doesn't present or the policy of the >> + * IOMMU domain is passthrough or we get a usable RMR region. >> + * Otherwise a negative value is returned. >> + */ >> +static int hisi_ptt_get_rmr(struct hisi_ptt *hisi_ptt) >> +{ >> + struct pci_dev *pdev = hisi_ptt->pdev; >> + struct iommu_domain *iommu_domain; >> + struct iommu_resv_region *region; >> + LIST_HEAD(list); >> + >> + /* >> + * Use direct DMA if IOMMU does not present or the policy of the >> + * IOMMU domain is passthrough. >> + */ >> + iommu_domain = iommu_get_domain_for_dev(>dev); >> + if (!iommu_domain || iommu_domain->type == IOMMU_DOMAIN_IDENTITY) >> + return 0; >> + >> + iommu_get_resv_regions(>dev, ); >> + list_for_each_entry(region, , list) >> + if (region->type == IOMMU_RESV_DIRECT && >> + region->length >= HISI_PTT_TRACE_BUFFER_SIZE) { >> + hisi_ptt->trace_ctrl.has_rmr = true; >> + hisi_ptt->trace_ctrl.rmr_addr = region->start; >> + hisi_ptt->trace_ctrl.rmr_length = region->length; >> + break; >> + } >> + >> + iommu_put_resv_regions(>dev, ); >> + return hisi_ptt->trace_ctrl.has_rmr ? 0 : -ENOMEM; >> +} > > No. > > The whole point of RMRs is for devices that are already configured to access > the given address range in a manner beyond the kernel's control. If you can > do this, it proves that you should not have an RMR in the first place. > > The notion of a kernel driver explicitly configuring its device to DMA into > any random RMR that looks big enough is so egregiously wrong that I'm almost > lost for words... > our bios will reserve such a region and reported it through iort. the device will write to the region and in the driver we need to access the region to get the traced data. the region is reserved exclusively and will not be accessed by kernel or other devices. 
Is it OK to let the BIOS configure the address into the device, so that from the CPU side we just read it? Thanks, Yicong
Re: dma-direct fixes and cleanups v3
On 2021-11-11 06:50, Christoph Hellwig wrote: Hi all, Linus complained about the complex flow in dma_direct_alloc, so this tries to simplify it a bit, and while I was at it I also made sure that unencrypted pages never leak back into the page allocator. Before I forget, I've had a quick skim of the remaining patches and nothing more stands out. Let me know if you'd like me to find time to check everything over in detail again for a proper review, but otherwise I reckon we may as well get this baking in -next sooner rather than later. Cheers, Robin. Changes since v2: - don't call dma_set_decrypted on remapped memory - move the leak printk into dma_set_encrypted - add another local variable to clean up dma_direct_alloc - return NULL when the is no way to make the memory coherent Changes since v1: - fix a missing return - add a new patch to fix a pre-existing missing unmap - various additional cleanups Diffstat: direct.c | 234 +-- 1 file changed, 138 insertions(+), 96 deletions(-) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 2/6] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 2021-11-16 09:06, Yicong Yang via iommu wrote: [...] +/* + * Get RMR address if provided by the firmware. + * Return 0 if the IOMMU doesn't present or the policy of the + * IOMMU domain is passthrough or we get a usable RMR region. + * Otherwise a negative value is returned. + */ +static int hisi_ptt_get_rmr(struct hisi_ptt *hisi_ptt) +{ + struct pci_dev *pdev = hisi_ptt->pdev; + struct iommu_domain *iommu_domain; + struct iommu_resv_region *region; + LIST_HEAD(list); + + /* +* Use direct DMA if IOMMU does not present or the policy of the +* IOMMU domain is passthrough. +*/ + iommu_domain = iommu_get_domain_for_dev(&pdev->dev); + if (!iommu_domain || iommu_domain->type == IOMMU_DOMAIN_IDENTITY) + return 0; + + iommu_get_resv_regions(&pdev->dev, &list); + list_for_each_entry(region, &list, list) + if (region->type == IOMMU_RESV_DIRECT && + region->length >= HISI_PTT_TRACE_BUFFER_SIZE) { + hisi_ptt->trace_ctrl.has_rmr = true; + hisi_ptt->trace_ctrl.rmr_addr = region->start; + hisi_ptt->trace_ctrl.rmr_length = region->length; + break; + } + + iommu_put_resv_regions(&pdev->dev, &list); + return hisi_ptt->trace_ctrl.has_rmr ? 0 : -ENOMEM; +} No. The whole point of RMRs is for devices that are already configured to access the given address range in a manner beyond the kernel's control. If you can do this, it proves that you should not have an RMR in the first place. The notion of a kernel driver explicitly configuring its device to DMA into any random RMR that looks big enough is so egregiously wrong that I'm almost lost for words... Robin.
Re: [PATCH 06/11] iommu: Expose group variants of dma ownership interfaces
Hi Christoph, On 2021/11/15 21:27, Christoph Hellwig wrote: On Mon, Nov 15, 2021 at 10:05:47AM +0800, Lu Baolu wrote: The vfio needs to set DMA_OWNER_USER for the entire group when attaching The vfio subsystem? driver? "vfio subsystem" it to a vfio container. So expose group variants of setting/releasing dma ownership for this purpose. This also exposes the helper iommu_group_dma_owner_unclaimed() for vfio report to userspace if the group is viable to user assignment, for .. for vfio to report .. ? Yes. void iommu_device_release_dma_owner(struct device *dev, enum iommu_dma_owner owner); +int iommu_group_set_dma_owner(struct iommu_group *group, enum iommu_dma_owner owner, + struct file *user_file); +void iommu_group_release_dma_owner(struct iommu_group *group, enum iommu_dma_owner owner); Pleae avoid all these overly long lines. Sure. Thanks! +static inline int iommu_group_set_dma_owner(struct iommu_group *group, + enum iommu_dma_owner owner, + struct file *user_file) +{ + return -EINVAL; +} + +static inline void iommu_group_release_dma_owner(struct iommu_group *group, +enum iommu_dma_owner owner) +{ +} + +static inline bool iommu_group_dma_owner_unclaimed(struct iommu_group *group) +{ + return false; +} Why do we need these stubs? All potential callers should already require CONFIG_IOMMU_API? Same for the helpers added in patch 1, btw. You are right. This helper is only for vfio which requires IOMMU_API. I will remove this. The helpers in patch 1 seem not the same. The driver core (or bus dma_configure() callback as suggested) will also call them. + mutex_lock(>mutex); + ret = __iommu_group_set_dma_owner(group, owner, user_file); + mutex_unlock(>mutex); + mutex_lock(>mutex); + __iommu_group_release_dma_owner(group, owner); + mutex_unlock(>mutex); Unless I'm missing something (just skipping over the patches), the existing callers also take the lock just around these calls, so we don't really need the __-prefixed lowlevel helpers. 
Moving mutex_lock/unlock will make the helper implementation easier. :-) It seems to be the common code style in the iommu core. For example, __iommu_attach_group(), __iommu_group_for_each_dev(), etc. + mutex_lock(&group->mutex); + owner = group->dma_owner; + mutex_unlock(&group->mutex); No need for a lock to read a single scalar. Adding the lock will make KCSAN happy. Jason G also told me that [citing from his review comment] "It is always incorrect to read concurrent data without an annotation of some kind. For instance it can cause mis-execution of logic where the compiler is unaware that a value it loads is allowed to change - i.e. no READ_ONCE/WRITE_ONCE semantic." + + return owner == DMA_OWNER_NONE; +} +EXPORT_SYMBOL_GPL(iommu_group_dma_owner_unclaimed); Best regards, baolu
[PATCH v2 4/6] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
From: Qi Liu 'perf record' and 'perf report --dump-raw-trace' supported in this patch. Example usage: Output will contain raw PTT data and its textual representation, such as: 0 0 0x5810 [0x30]: PERF_RECORD_AUXTRACE size: 0x40 offset: 0 ref: 0xa5d50c725 idx: 0 tid: -1 cpu: 0 . . ... HISI PTT data: size 4194304 bytes . : 00 00 00 00 Prefix . 0004: 08 20 00 60 Header DW0 . 0008: ff 02 00 01 Header DW1 . 000c: 20 08 00 00 Header DW2 . 0010: 10 e7 44 ab Header DW3 . 0014: 2a a8 1e 01 Time . 0020: 00 00 00 00 Prefix . 0024: 01 00 00 60 Header DW0 . 0028: 0f 1e 00 01 Header DW1 . 002c: 04 00 00 00 Header DW2 . 0030: 40 00 81 02 Header DW3 . 0034: ee 02 00 00 Time Signed-off-by: Qi Liu Signed-off-by: Yicong Yang --- tools/perf/arch/arm/util/auxtrace.c | 56 - tools/perf/arch/arm/util/pmu.c| 3 + tools/perf/arch/arm64/util/Build | 2 +- tools/perf/arch/arm64/util/hisi_ptt.c | 195 +++ tools/perf/util/Build | 2 + tools/perf/util/auxtrace.c| 4 + tools/perf/util/auxtrace.h| 1 + tools/perf/util/hisi-ptt-decoder/Build| 1 + .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c | 170 + .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h | 28 +++ tools/perf/util/hisi_ptt.c| 228 ++ tools/perf/util/hisi_ptt.h| 28 +++ 12 files changed, 714 insertions(+), 4 deletions(-) create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi-ptt-decoder/Build create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h create mode 100644 tools/perf/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi_ptt.h diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c index 5fc6a2a3dbc5..dc9d2172464e 100644 --- a/tools/perf/arch/arm/util/auxtrace.c +++ b/tools/perf/arch/arm/util/auxtrace.c @@ -4,6 +4,7 @@ * Author: Mathieu Poirier */ +#include #include #include #include @@ -14,6 +15,7 @@ #include "../../../util/pmu.h" #include "cs-etm.h" #include "arm-spe.h" +#include 
"hisi_ptt.h" static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) { @@ -50,6 +52,39 @@ static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) return arm_spe_pmus; } +static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err) +{ + struct perf_pmu **hisi_ptt_pmus = NULL; + int nr_cpus = sysconf(_SC_NPROCESSORS_CONF); + DIR *dir = NULL; + struct dirent *dent; + + hisi_ptt_pmus = zalloc(sizeof(struct perf_pmu *) * nr_cpus); + if (!hisi_ptt_pmus) { + pr_err("hisi_ptt alloc failed\n"); + *err = -ENOMEM; + return NULL; + } + + dir = opendir("/sys/devices"); + dent = readdir(dir); + while (dent) { + if (strstr(dent->d_name, HISI_PTT_PMU_NAME)) { + hisi_ptt_pmus[*nr_ptts] = perf_pmu__find(dent->d_name); + if (hisi_ptt_pmus[*nr_ptts]) { + pr_debug2("%s %d: arm_spe_pmu %d type %d name %s\n", + __func__, __LINE__, *nr_ptts, + hisi_ptt_pmus[*nr_ptts]->type, + hisi_ptt_pmus[*nr_ptts]->name); + (*nr_ptts)++; + } + } + dent = readdir(dir); + } + + return hisi_ptt_pmus; +} + struct auxtrace_record *auxtrace_record__init(struct evlist *evlist, int *err) { @@ -57,8 +92,12 @@ struct auxtrace_record struct evsel *evsel; bool found_etm = false; struct perf_pmu *found_spe = NULL; + struct perf_pmu *found_ptt = NULL; struct perf_pmu **arm_spe_pmus = NULL; + struct perf_pmu **hisi_ptt_pmus = NULL; + int nr_spes = 0; + int nr_ptts = 0; int i = 0; if (!evlist) @@ -66,13 +105,14 @@ struct auxtrace_record cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME); arm_spe_pmus = find_all_arm_spe_pmus(_spes, err); + hisi_ptt_pmus = find_all_hisi_ptt_pmus(_ptts, err); evlist__for_each_entry(evlist, evsel) { if (cs_etm_pmu && evsel->core.attr.type ==
[PATCH v2 2/6] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic(tune), and trace the TLP headers(trace). Add the driver for the device to enable the trace function. The driver will create PMU device for each PTT device, and users can start trace through perf command. Signed-off-by: Yicong Yang --- drivers/Makefile |1 + drivers/hwtracing/Kconfig |2 + drivers/hwtracing/hisilicon/Kconfig|8 + drivers/hwtracing/hisilicon/Makefile |2 + drivers/hwtracing/hisilicon/hisi_ptt.c | 1146 5 files changed, 1159 insertions(+) create mode 100644 drivers/hwtracing/hisilicon/Kconfig create mode 100644 drivers/hwtracing/hisilicon/Makefile create mode 100644 drivers/hwtracing/hisilicon/hisi_ptt.c diff --git a/drivers/Makefile b/drivers/Makefile index be5d40ae1488..bb0dc9b55ea2 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -176,6 +176,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT)+= hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/hisilicon/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM)+= nvmem/ obj-$(CONFIG_FPGA) += fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..e3796b17541a 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/hisilicon/Kconfig" + endmenu diff --git a/drivers/hwtracing/hisilicon/Kconfig b/drivers/hwtracing/hisilicon/Kconfig new file mode 100644 index ..9c3ab80d99fe --- /dev/null +++ b/drivers/hwtracing/hisilicon/Kconfig @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on PCI && HAS_DMA && HAS_IOMEM + help + HiSilicon PCIe Tune and Trace Device exist as a PCIe RCiEP + device, provides 
support for PCIe traffic tuning and + tracing TLP headers to the memory. diff --git a/drivers/hwtracing/hisilicon/Makefile b/drivers/hwtracing/hisilicon/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/hisilicon/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/hisilicon/hisi_ptt.c b/drivers/hwtracing/hisilicon/hisi_ptt.c new file mode 100644 index ..e11e9b6cc2a8 --- /dev/null +++ b/drivers/hwtracing/hisilicon/hisi_ptt.c @@ -0,0 +1,1146 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2021 HiSilicon Limited. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define HISI_PTT_TRACE_ADDR_SIZE 0x0800 +#define HISI_PTT_TRACE_ADDR_BASE_LO_0 0x0810 +#define HISI_PTT_TRACE_ADDR_BASE_HI_0 0x0814 +#define HISI_PTT_TRACE_ADDR_STRIDE 0x8 +#define HISI_PTT_TRACE_CTRL0x0850 +#define HISI_PTT_TRACE_CTRL_EN BIT(0) +#define HISI_PTT_TRACE_CTRL_RST BIT(1) +#define HISI_PTT_TRACE_CTRL_RXTX_SEL GENMASK(3, 2) +#define HISI_PTT_TRACE_CTRL_TYPE_SEL GENMASK(7, 4) +#define HISI_PTT_TRACE_CTRL_DATA_FORMAT BIT(14) +#define HISI_PTT_TRACE_CTRL_FILTER_MODE BIT(15) +#define HISI_PTT_TRACE_CTRL_TARGET_SEL GENMASK(31, 16) +#define HISI_PTT_TRACE_INT_STAT0x0890 +#define HISI_PTT_TRACE_INT_STAT_MASK GENMASK(3, 0) +#define HISI_PTT_TRACE_INT_MASK0x0894 +#define HISI_PTT_TRACE_WR_STS 0x08a0 +#define HISI_PTT_TRACE_WR_STS_WRITE GENMASK(27, 0) +#define HISI_PTT_TRACE_WR_STS_BUFFER GENMASK(29, 28) +#define HISI_PTT_TRACE_STS 0x08b0 +#define HISI_PTT_TRACE_IDLE BIT(0) +#define HISI_PTT_DEVICE_RANGE 0x0fe0 +#define HISI_PTT_LOCATION 0x0fe8 +#define HISI_PTT_CORE_ID GENMASK(15, 0) +#define HISI_PTT_SICL_ID GENMASK(31, 16) + +#define HISI_PTT_TRACE_DMA_IRQ 0 +#define 
HISI_PTT_TRACE_BUFLETS_CNT 4 +#define HISI_PTT_TRACE_BUFLET_SIZE SZ_4M +#define HISI_PTT_TRACE_BUFFER_SIZE (HISI_PTT_TRACE_BUFLET_SIZE * \ +HISI_PTT_TRACE_BUFLETS_CNT) +#define HISI_PTT_FILTER_UPDATE_FIFO_SIZE 16 + +/* Delay time for filter updating work */ +#define HISI_PTT_WORK_DELAY_MS 100UL +/* Wait time for DMA hardware to reset */ +#define HISI_PTT_RESET_WAIT_MS 1000UL +/* Poll timeout and
[PATCH v2 5/6] docs: Add HiSilicon PTT device driver documentation
Document the introduction and usage of the HiSilicon PTT device driver. Signed-off-by: Yicong Yang --- Documentation/trace/hisi-ptt.rst | 305 +++ 1 file changed, 305 insertions(+) create mode 100644 Documentation/trace/hisi-ptt.rst diff --git a/Documentation/trace/hisi-ptt.rst b/Documentation/trace/hisi-ptt.rst new file mode 100644 index ..a7ef6438e4aa --- /dev/null +++ b/Documentation/trace/hisi-ptt.rst @@ -0,0 +1,305 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +HiSilicon PCIe Tune and Trace device +== + +Introduction + + +HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex +integrated Endpoint (RCiEP) device, providing the capability +to dynamically monitor and tune the PCIe link's events (tune), +and trace the TLP headers (trace). The two functions are independent, +but it is recommended to use them together to analyze and enhance the +PCIe link's performance. + +On Kunpeng 930 SoC, the PCIe Root Complex is composed of several +PCIe cores. Each PCIe core includes several Root Ports and a PTT +RCiEP, like below. The PTT device is capable of tuning and +tracing the link of the PCIe core. +:: + +--Core 0---+ + | | [ PTT ] | + | | [Root Port]---[Endpoint] + | | [Root Port]---[Endpoint] + | | [Root Port]---[Endpoint] +Root Complex |--Core 1---+ + | | [ PTT ] | + | | [Root Port]---[ Switch ]---[Endpoint] + | | [Root Port]---[Endpoint] `-[Endpoint] + | | [Root Port]---[Endpoint] + +---+ + +The PTT device driver registers a PMU device for each PTT device. +The name of each PTT device is composed of the 'hisi_ptt' prefix with +the id of the SICL and the Core where it is located. The Kunpeng 930 +SoC encapsulates multiple CPU dies (SCCL, Super CPU Cluster) and +IO dies (SICL, Super I/O Cluster), where there's one PCIe Root +Complex for each SICL. +:: +/sys/devices/hisi_ptt_ + +Tune + + +PTT tune is designed for monitoring and adjusting PCIe link parameters (events). +Currently we support events in 4 classes. The scope of the events +covers the PCIe core to which the PTT device belongs. + +Each event is presented as a file under $(PTT PMU dir)/tune, and +mostly a simple open/read/write/close cycle will be used to tune +the event. +:: +$ cd /sys/devices/hisi_ptt_/tune +$ ls +qos_tx_cpl qos_tx_np qos_tx_p +tx_path_rx_req_alloc_buf_level +tx_path_tx_req_alloc_buf_level +$ cat qos_tx_p +1 +$ echo 2 > qos_tx_p +$ cat qos_tx_p +2 + +The current value (numerical value) of the event can simply be read +from the file, and the desired value written to the file to tune it. + +1. Tx path QoS control + + +The following files are provided to tune the QoS of the tx path of +the PCIe core. + +- qos_tx_cpl: weight of Tx completion TLPs +- qos_tx_np: weight of Tx non-posted TLPs +- qos_tx_p: weight of Tx posted TLPs + +The weight influences the proportion of certain packets on the PCIe link. +For example, for the storage scenario, increase the proportion +of the completion packets on the link to enhance the performance as +more completions are consumed. + +The available tune data of these events is [0, 1, 2]. +Writing a negative value will return an error, and out of range +values will be converted to 2. Note that the event value just +indicates a probable level, but is not precise. + +2. Tx path buffer control +- + +The following files are provided to tune the buffer of the tx path of the PCIe core. + +- tx_path_rx_req_alloc_buf_level: watermark of Rx requested +- tx_path_tx_req_alloc_buf_level: watermark of Tx requested + +These events influence the watermark of the buffer allocated for each +type. Rx means the inbound while Tx means outbound. The packets will +be stored in the buffer first and then posted either when the watermark +is reached or when timed out. For a busy direction, you should increase +the related buffer watermark to avoid frequent posting and thus +enhance the performance. In most cases just keep the default value. + +The available tune data of the above events is [0, 1, 2].
+Writing a negative value will return an error, and out of range +values will be converted to 2. Note that the event value just +indicates a probable level, but is not precise. + +Trace += + +PTT trace is designed for dumping the TLP headers to the memory, which +can be used to analyze the transactions and usage condition of the PCIe +Link. You can choose to filter the traced headers by either requester ID, +or those downstream of a set of Root Ports on the same core of the PTT +device. It's also supported to trace the headers of certain type and of +certain direction. + +You can
[PATCH v2 3/6] hwtracing: Add tune function support for HiSilicon PCIe Tune and Trace device
Add tune function for the HiSilicon Tune and Trace device. The interface of tune is exposed through sysfs attributes of PTT PMU device. Signed-off-by: Yicong Yang --- drivers/hwtracing/hisilicon/hisi_ptt.c | 167 + 1 file changed, 167 insertions(+) diff --git a/drivers/hwtracing/hisilicon/hisi_ptt.c b/drivers/hwtracing/hisilicon/hisi_ptt.c index e11e9b6cc2a8..a1e1fb766376 100644 --- a/drivers/hwtracing/hisilicon/hisi_ptt.c +++ b/drivers/hwtracing/hisilicon/hisi_ptt.c @@ -24,6 +24,11 @@ #include #include +#define HISI_PTT_TUNING_CTRL 0x +#define HISI_PTT_TUNING_CTRL_CODEGENMASK(15, 0) +#define HISI_PTT_TUNING_CTRL_SUB GENMASK(23, 16) +#define HISI_PTT_TUNING_DATA 0x0004 +#define HISI_PTT_TUNING_DATA_VAL_MASKGENMASK(15, 0) #define HISI_PTT_TRACE_ADDR_SIZE 0x0800 #define HISI_PTT_TRACE_ADDR_BASE_LO_0 0x0810 #define HISI_PTT_TRACE_ADDR_BASE_HI_0 0x0814 @@ -39,6 +44,8 @@ #define HISI_PTT_TRACE_INT_STAT0x0890 #define HISI_PTT_TRACE_INT_STAT_MASK GENMASK(3, 0) #define HISI_PTT_TRACE_INT_MASK0x0894 +#define HISI_PTT_TUNING_INT_STAT 0x0898 +#define HISI_PTT_TUNING_INT_STAT_MASKBIT(0) #define HISI_PTT_TRACE_WR_STS 0x08a0 #define HISI_PTT_TRACE_WR_STS_WRITE GENMASK(27, 0) #define HISI_PTT_TRACE_WR_STS_BUFFER GENMASK(29, 28) @@ -71,6 +78,12 @@ enum hisi_ptt_trace_status { HISI_PTT_TRACE_STATUS_ON, }; +struct hisi_ptt_tune_desc { + struct hisi_ptt *hisi_ptt; + const char *name; + u32 event_code; +}; + struct hisi_ptt_dma_buflet { struct list_head list; unsigned int size; @@ -177,6 +190,159 @@ static inline struct hisi_ptt *to_hisi_ptt(struct pmu *pmu) return container_of(pmu, struct hisi_ptt, hisi_ptt_pmu); } +static int hisi_ptt_wait_tuning_finish(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout(hisi_ptt->iobase + HISI_PTT_TUNING_INT_STAT, + val, !(val & HISI_PTT_TUNING_INT_STAT_MASK), + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TIMEOUT_US); +} + +static int hisi_ptt_tune_data_get(struct hisi_ptt *hisi_ptt, + u32 event, u16 *data) +{ + u32 reg; + + 
reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + event); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + /* Write all 1 to indicates it's the read process */ + writel(~0UL, hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (hisi_ptt_wait_tuning_finish(hisi_ptt)) + return -ETIMEDOUT; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + reg &= HISI_PTT_TUNING_DATA_VAL_MASK; + *data = (u16)reg; + + return 0; +} + +static int hisi_ptt_tune_data_set(struct hisi_ptt *hisi_ptt, + u32 event, u16 data) +{ + u32 reg; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + event); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + reg = data; + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (hisi_ptt_wait_tuning_finish(hisi_ptt)) + return -ETIMEDOUT; + + return 0; +} + +static ssize_t hisi_ptt_tune_attr_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + int ret; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + if (!mutex_trylock(_ptt->mutex)) + return -EBUSY; + + ret = hisi_ptt_tune_data_get(hisi_ptt, desc->event_code, ); + + mutex_unlock(_ptt->mutex); + return ret ? 
ret : sysfs_emit(buf, "%u\n", val); +} + +static ssize_t hisi_ptt_tune_attr_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + int ret; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + if (kstrtou16(buf, 10, )) + return -EINVAL; + + if
[PATCH v2 6/6] MAINTAINERS: Add maintainer for HiSilicon PTT driver
Add maintainer for driver and documentation of HiSilicon PTT device. Signed-off-by: Yicong Yang --- MAINTAINERS | 7 +++ 1 file changed, 7 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 7a2345ce8521..823d495ca0d5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8601,6 +8601,13 @@ W: http://www.hisilicon.com F: Documentation/admin-guide/perf/hisi-pmu.rst F: drivers/perf/hisilicon +HISILICON PTT DRIVER +M: Yicong Yang +L: linux-ker...@vger.kernel.org +S: Maintained +F: Documentation/trace/hisi-ptt.rst +F: drivers/hwtracing/hisilicon/ + HISILICON QM AND ZIP Controller DRIVER M: Zhou Wang L: linux-cry...@vger.kernel.org -- 2.33.0
[PATCH v2 0/6] Add support for HiSilicon PCIe Tune and Trace device
HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex integrated Endpoint (RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic (tune), and trace the TLP headers (trace). PTT tune is designed for monitoring and adjusting PCIe link parameters. We provide several parameters of the PCIe link. Through the driver, users can adjust the value of a certain parameter to affect the PCIe link for the purpose of enhancing the performance in certain situations. PTT trace is designed for dumping the TLP headers to the memory, which can be used to analyze the transactions and usage condition of the PCIe Link. Users can choose filters to trace headers, by either requester ID, or those downstream of a set of Root Ports on the same core of the PTT device. It's also supported to trace the headers of a certain type and of a certain direction. The driver registers a PMU device for each PTT device. The trace can be used through `perf record` and the traced headers can be decoded by `perf report`. The perf command support for the device is also added in this patchset. The tune can be used through the sysfs attributes of the related PMU device. See the documentation for the detailed usage. Changes since v1: - switch the user interface of trace to perf from debugfs - switch the user interface of tune to sysfs from debugfs - add perf tool support to start trace and decode the trace data - address the comments of documentation from Bjorn - add RMR[1] support of the device as trace works in RMR mode or direct DMA mode. RMR support is achieved by common APIs rather than the APIs implemented in [1].
Link: https://lore.kernel.org/lkml/1618654631-42454-1-git-send-email-yangyic...@hisilicon.com/

[1] https://lore.kernel.org/linux-acpi/20210805080724.480-1-shameerali.kolothum.th...@huawei.com/

Qi Liu (1):
  perf tool: Add support for HiSilicon PCIe Tune and Trace device driver

Yicong Yang (5):
  iommu: Export iommu_{get,put}_resv_regions()
  hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
  hwtracing: Add tune function support for HiSilicon PCIe Tune and Trace device
  docs: Add HiSilicon PTT device driver documentation
  MAINTAINERS: Add maintainer for HiSilicon PTT driver

 Documentation/trace/hisi-ptt.rst                    |  305
 MAINTAINERS                                         |    7 +
 drivers/Makefile                                    |    1 +
 drivers/hwtracing/Kconfig                           |    2 +
 drivers/hwtracing/hisilicon/Kconfig                 |    8 +
 drivers/hwtracing/hisilicon/Makefile                |    2 +
 drivers/hwtracing/hisilicon/hisi_ptt.c              | 1313 +
 drivers/iommu/iommu.c                               |    2 +
 include/linux/iommu.h                               |    4 +-
 tools/perf/arch/arm/util/auxtrace.c                 |   56 +-
 tools/perf/arch/arm/util/pmu.c                      |    3 +
 tools/perf/arch/arm64/util/Build                    |    2 +-
 tools/perf/arch/arm64/util/hisi_ptt.c               |  195 +++
 tools/perf/util/Build                               |    2 +
 tools/perf/util/auxtrace.c                          |    4 +
 tools/perf/util/auxtrace.h                          |    1 +
 tools/perf/util/hisi-ptt-decoder/Build              |    1 +
 .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c         |  170 +++
 .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h         |   28 +
 tools/perf/util/hisi_ptt.c                          |  228 +++
 tools/perf/util/hisi_ptt.h                          |   28 +
 21 files changed, 2356 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/trace/hisi-ptt.rst
 create mode 100644 drivers/hwtracing/hisilicon/Kconfig
 create mode 100644 drivers/hwtracing/hisilicon/Makefile
 create mode 100644 drivers/hwtracing/hisilicon/hisi_ptt.c
 create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c
 create mode 100644 tools/perf/util/hisi-ptt-decoder/Build
 create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c
 create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h
 create mode 100644 tools/perf/util/hisi_ptt.c
 create mode 100644 tools/perf/util/hisi_ptt.h

--
2.33.0
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 1/6] iommu: Export iommu_{get,put}_resv_regions()
Export iommu_{get,put}_resv_regions() to the modules so that the driver
can retrieve and use the reserved regions of the device.

Signed-off-by: Yicong Yang
---
 drivers/iommu/iommu.c | 2 ++
 include/linux/iommu.h | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index dd7863e453a5..e96711eee965 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2792,6 +2792,7 @@ void iommu_get_resv_regions(struct device *dev, struct list_head *list)
 	if (ops && ops->get_resv_regions)
 		ops->get_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_get_resv_regions);
 
 void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 {
@@ -2800,6 +2801,7 @@ void iommu_put_resv_regions(struct device *dev, struct list_head *list)
 	if (ops && ops->put_resv_regions)
 		ops->put_resv_regions(dev, list);
 }
+EXPORT_SYMBOL_GPL(iommu_put_resv_regions);
 
 /**
  * generic_iommu_put_resv_regions - Reserved region driver helper
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2f3435e7d17..1b7b0f370e28 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -450,8 +450,8 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t io
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
 			iommu_fault_handler_t handler, void *token);
 
-extern void iommu_get_resv_regions(struct device *dev, struct list_head *list);
-extern void iommu_put_resv_regions(struct device *dev, struct list_head *list);
+void iommu_get_resv_regions(struct device *dev, struct list_head *list);
+void iommu_put_resv_regions(struct device *dev, struct list_head *list);
 extern void generic_iommu_put_resv_regions(struct device *dev, struct list_head *list);
 extern void iommu_set_default_passthrough(bool cmd_line);
--
2.33.0
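As a hypothetical illustration of why the export is needed, a driver built as a module could walk its device's reserved regions roughly like this. The function name and log format below are made up for the sketch; only the two exported calls and the `struct iommu_resv_region` fields (`start`, `length`, `type`, `list`) are from the kernel API. Without the `EXPORT_SYMBOL_GPL()` additions above, this would fail to link as a module:

```c
#include <linux/device.h>
#include <linux/iommu.h>
#include <linux/list.h>

/*
 * Hypothetical helper: enumerate the reserved regions of a device so
 * the driver can avoid them when carving up its own IOVA space.
 */
static void example_walk_resv_regions(struct device *dev)
{
	struct iommu_resv_region *region;
	LIST_HEAD(resv_regions);

	iommu_get_resv_regions(dev, &resv_regions);
	list_for_each_entry(region, &resv_regions, list)
		dev_info(dev, "resv region: start %pa, length %zu, type %d\n",
			 &region->start, region->length, region->type);
	iommu_put_resv_regions(dev, &resv_regions);
}
```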
Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.
On 15.11.21 20:37, Zi Yan wrote:
> From: Zi Yan
>
> Hi David,

Hi,

thanks for looking into this.

> You suggested to make alloc_contig_range() deal with pageblock_order
> instead of MAX_ORDER - 1 and get rid of the MAX_ORDER - 1 dependency
> in virtio_mem [1]. This patchset is my attempt to achieve that. Please
> take a look and let me know if I am doing it correctly or not.
>
> From my understanding, CMA required an alignment of
> max(MAX_ORDER - 1, pageblock_order) because, when MIGRATE_CMA was
> introduced, __free_one_page() did not prevent merging two different
> pageblocks when MAX_ORDER - 1 > pageblock_order. But the current
> __free_one_page() implementation does prevent that. It should be OK to
> just align CMA to pageblock_order. alloc_contig_range() relies on
> MIGRATE_CMA to get free pages, so it can use pageblock_order as its
> alignment too.

I wonder if that's sufficient. Especially the outer_start logic in
alloc_contig_range() might be problematic.

There are some ugly corner cases with free pages/allocations spanning
multiple pageblocks when we only isolated a single pageblock.

Regarding CMA, we have to keep the following cases working:

a) Different pageblock types (MIGRATE_CMA and !MIGRATE_CMA) in a
   MAX_ORDER - 1 page:

   [       MAX_ORDER - 1        ]
   [ pageblock 0 | pageblock 1 ]

   Assume either pageblock 0 is MIGRATE_CMA or pageblock 1 is
   MIGRATE_CMA, but not both. We have to make sure alloc_contig_range()
   keeps working correctly. This should be the case even with your
   change, as we won't merge pages across differing migratetypes.

b) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:

   [       MAX_ORDER - 1        ]
   [ pageblock 0 | pageblock 1 ]

   Assume both are MIGRATE_CMA. Assume we want to allocate from either
   pageblock 0 or pageblock 1. Especially, assume we want to allocate
   from pageblock 1. While we would isolate pageblock 1, we wouldn't
   isolate pageblock 0.
   What happens if we either have a free page spanning the
   MAX_ORDER - 1 range already, OR if we have to migrate a
   MAX_ORDER - 1 page, resulting in a free MAX_ORDER - 1 page of which
   only the second pageblock is isolated? We would end up essentially
   freeing a page that has mixed pageblocks, essentially placing it on
   !MIGRATE_ISOLATE free lists ... I might be wrong, but I have the
   feeling that this would be problematic.

c) Concurrent allocations:

   [       MAX_ORDER - 1        ]
   [ pageblock 0 | pageblock 1 ]

   Assume b), but we have two concurrent CMA allocations to pageblock 0
   and pageblock 1, which would now be possible, as
   start_isolate_page_range() would succeed on both.

Regarding virtio-mem, we care about the following cases:

a) Allocating parts from a completely movable MAX_ORDER - 1 page:

   [       MAX_ORDER - 1        ]
   [ pageblock 0 | pageblock 1 ]

   Assume pageblock 0 and pageblock 1 are either free or contain only
   movable pages. Assume we allocated pageblock 0; we have to make sure
   we can allocate pageblock 1. The other way around, assume we
   allocated pageblock 1; we have to make sure we can allocate
   pageblock 0. Free pages spanning both pageblocks might be
   problematic.

b) Allocating parts of a partially movable MAX_ORDER - 1 page:

   [       MAX_ORDER - 1        ]
   [ pageblock 0 | pageblock 1 ]

   Assume pageblock 0 contains unmovable data but pageblock 1 does not:
   we have to make sure we can allocate pageblock 1. Similarly, assume
   pageblock 1 contains unmovable data but pageblock 0 does not: we
   have to make sure we can allocate pageblock 0.
   has_unmovable_pages() might allow for that.

   But we want to fail early in case we want to allocate a single
   pageblock and it contains unmovable data, either directly or
   indirectly. If we have an unmovable (compound) MAX_ORDER - 1 page
   and we'd try isolating pageblock 1, has_unmovable_pages() would
   always return "false", because we'd simply be skipping over any tail
   pages and would not detect the unmovability.
c) Migrating/freeing a MAX_ORDER - 1 page while partially isolated:

   Same concern as for CMA b).

So the biggest concern I have is dealing with migrating/freeing pages
of order > pageblock_order while only having isolated a single
pageblock.

> In terms of virtio_mem, if I understand correctly, it relies on
> alloc_contig_range() to obtain contiguous free pages and offlines them
> to reduce the guest memory size. As a result of the
> alloc_contig_range() alignment change, virtio_mem should be able to
> just align PFNs to pageblock_order.

For virtio-mem it will most probably be desirable to first try
allocating the MAX_ORDER - 1 range if possible, and then fall back to
pageblock_order. But that's an additional change on top in virtio-mem
code.

My take on teaching alloc_contig_range() to handle this properly would
be the following:

a) Convert MIGRATE_ISOLATE into a separate pageblock flag

We would want to convert MIGRATE_ISOLATE into a separate pageblock
flag, such that when we isolate a pageblock we preserve the original
migratetype. While start_isolate_page_range()