Re: [PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver
On 12/4/2021 2:59 AM, Michael Kelley (LINUX) wrote:
>> +/*
>> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
>> + */
>> +void *hv_map_memory(void *addr, unsigned long size)
>> +{
>> +	unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
>
> This should be just PAGE_SIZE, as this code is unrelated to
> communication with Hyper-V.

Yes, agree. Will update.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
On 12/4/2021 3:17 AM, Michael Kelley (LINUX) wrote:
>> +static void __init hyperv_iommu_swiotlb_init(void)
>> +{
>> +	unsigned long hyperv_io_tlb_size;
>> +	void *hyperv_io_tlb_start;
>> +
>> +	/*
>> +	 * Allocate Hyper-V swiotlb bounce buffer at early place
>> +	 * to reserve large contiguous memory.
>> +	 */
>> +	hyperv_io_tlb_size = swiotlb_size_or_default();
>> +	hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
>> +
>> +	if (!hyperv_io_tlb_start)
>> +		pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
>
> In the error case, won't swiotlb_init_with_tbl() end up panic'ing when
> it tries to zero out the memory?  The only real choice here is to
> return immediately after printing the message, and not call
> swiotlb_init_with_tbl().

Yes, agree. Will update.
Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
On 12/4/2021 4:06 AM, Tom Lendacky wrote:
>> Hi Tom:
>>       Thanks for your test. Could you help to test the following patch
>> and check whether it can fix the issue.
>
> The patch is mangled. Is the only difference where set_memory_decrypted()
> is called?
>
> I de-mangled the patch. No more stack traces with SME active.
>
> Thanks,
> Tom

Hi Tom:
      Thanks a lot for your rework and test. I will update in the next
version. Thanks.
Re: [PATCH 1/4] dt-bindings: memory: mediatek: Correct the minItems of clk for larbs
On Fri, 03 Dec 2021 14:40:24 +0800, Yong Wu wrote:
> If a platform's larb support gals, there will be some larbs have a one
> more "gals" clock while the others still only need "apb"/"smi" clocks.
> then the minItems is 2 and the maxItems is 3.
>
> Fixes: 27bb0e42855a ("dt-bindings: memory: mediatek: Convert SMI to DT schema")
> Signed-off-by: Yong Wu
> ---
>  .../bindings/memory-controllers/mediatek,smi-larb.yaml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

Running 'make dtbs_check' with the schema in this patch gives the
following warnings. Consider if they are expected or the schema is
incorrect. These may not be new warnings.

Note that it is not yet a requirement to have 0 warnings for dtbs_check.
This will change in the future.

Full log is available here: https://patchwork.ozlabs.org/patch/1563127

larb@14016000: 'mediatek,larb-id' is a required property
	arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

larb@14017000: clock-names: ['apb', 'smi'] is too short
	arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku16.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku272.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku288.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku32.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku0.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku176.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-pumpkin.dt.yaml

larb@15001000: 'mediatek,larb-id' is a required property
	arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

larb@1601: clock-names: ['apb', 'smi'] is too short
	arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku16.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku272.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku288.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku32.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku0.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku176.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-pumpkin.dt.yaml

larb@1601: 'mediatek,larb-id' is a required property
	arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

larb@1701: clock-names: ['apb', 'smi'] is too short
	arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
	arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
	arch/arm64/boot/dts/
Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
On 12/3/21 1:11 PM, Tom Lendacky wrote:
> On 12/3/21 5:20 AM, Tianyu Lan wrote:
>> On 12/2/2021 10:42 PM, Tom Lendacky wrote:
>>> On 12/1/21 10:02 AM, Tianyu Lan wrote:
>>>> From: Tianyu Lan
>>>>
>>>> In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
>>>> extra address space which is above shared_gpa_boundary (E.G 39 bit
>>>> address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access
>>>> physical address will be original physical address + shared_gpa_boundary.
>>>> The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top
>>>> of memory(vTOM). Memory addresses below vTOM are automatically treated
>>>> as private while memory above vTOM is treated as shared.
>>>>
>>>> Expose swiotlb_unencrypted_base for platforms to set unencrypted
>>>> memory base offset and platform calls swiotlb_update_mem_attributes()
>>>> to remap swiotlb mem to unencrypted address space. memremap() can
>>>> not be called in the early stage and so put remapping code into
>>>> swiotlb_update_mem_attributes(). Store remap address and use it to
>>>> copy data from/to swiotlb bounce buffer.
>>>>
>>>> Signed-off-by: Tianyu Lan
>>>
>>> This patch results in the following stack trace during a bare-metal
>>> boot on my EPYC system with SME active (e.g. mem_encrypt=on):
>>>
>>> [    0.123932] BUG: Bad page state in process swapper  pfn:108001
>>> [    0.123942] page:(ptrval) refcount:0 mapcount:-128 mapping: index:0x0 pfn:0x108001
>>> [    0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
>>> [    0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80
>>> [    0.123954] raw: ff7f
>>> [    0.123955] page dumped because: nonzero mapcount
>>> [    0.123957] Modules linked in:
>>> [    0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2
>>> [    0.123964] Hardware name: AMD Corporation
>>> [    0.123967] Call Trace:
>>> [    0.123971]
>>> [    0.123975]  dump_stack_lvl+0x48/0x5e
>>> [    0.123985]  bad_page.cold+0x65/0x96
>>> [    0.123990]  __free_pages_ok+0x3a8/0x410
>>> [    0.123996]  memblock_free_all+0x171/0x1dc
>>> [    0.124005]  mem_init+0x1f/0x14b
>>> [    0.124011]  start_kernel+0x3b5/0x6a1
>>> [    0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
>>> [    0.124022]
>>>
>>> I see ~40 of these traces, each for different pfns.
>>>
>>> Thanks,
>>> Tom
>>
>> Hi Tom:
>>       Thanks for your test. Could you help to test the following patch
>> and check whether it can fix the issue.
>
> The patch is mangled. Is the only difference where set_memory_decrypted()
> is called?
>
> I de-mangled the patch. No more stack traces with SME active.
>
> Thanks,
> Tom

>> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
>> index 569272871375..f6c3638255d5 100644
>> --- a/include/linux/swiotlb.h
>> +++ b/include/linux/swiotlb.h
>> @@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
>>   * @end:	The end address of the swiotlb memory pool. Used to do a quick
>>   *		range check to see if the memory was in fact allocated by this
>>   *		API.
>> + * @vaddr:	The vaddr of the swiotlb memory pool. The swiotlb memory pool
>> + *		may be remapped in the memory encrypted case and store virtual
>> + *		address for bounce buffer operation.
>>   * @nslabs:	The number of IO TLB blocks (in groups of 64) between @start and
>>   *		@end. For default swiotlb, this is command line adjustable via
>>   *		setup_io_tlb_npages.
>> @@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
>>  struct io_tlb_mem {
>>  	phys_addr_t start;
>>  	phys_addr_t end;
>> +	void *vaddr;
>>  	unsigned long nslabs;
>>  	unsigned long used;
>>  	unsigned int index;
>> @@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>>  }
>>  #endif /* CONFIG_DMA_RESTRICTED_POOL */
>>
>> +extern phys_addr_t swiotlb_unencrypted_base;
>> +
>>  #endif /* __LINUX_SWIOTLB_H */
>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> index 8e840fbbed7c..34e6ade4f73c 100644
>> --- a/kernel/dma/swiotlb.c
>> +++ b/kernel/dma/swiotlb.c
>> @@ -50,6 +50,7 @@
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -72,6 +73,8 @@
>>  enum swiotlb_force swiotlb_force;
>>  struct io_tlb_mem io_tlb_default_mem;
>>
>> +phys_addr_t swiotlb_unencrypted_base;
>> +
>>  /*
>>   * Max segment that we can provide which (if pages are contingous) will
>>   * not be bounced (unless SWIOTLB_FORCE is set).
>> @@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
>>  	return DIV_ROUND_UP(val, IO_TLB_SIZE);
>>  }
>>
>> +/*
>> + * Remap swioltb memory in the unencrypted physical address space
>> + * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
>> + * Isolation VMs).
>> + */
>> +void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
>> +{
>> +	void *vaddr = NULL;
>> +
>> +	if (swiotlb_unencrypted_base) {
>> +		phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
>> +
>> +		vaddr = memremap(paddr, bytes, MEMREMAP_WB);
>> +		if (!vaddr)
>> +			pr_err("Failed to map the unencrypted memory %llx size %lx.\n",
RE: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
From: Tianyu Lan
Sent: Wednesday, December 1, 2021 8:03 AM

> hyperv Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory and so enable swiotlb force
> mode to use swiotlb bounce buffer for DMA transaction.
>
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via extra address space which is above shared_gpa_boundary
> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The access physical address will be original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
> spec is called virtual top of memory(vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
>
> Hyper-V initalizes swiotlb bounce buffer and default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions which depends on the pci_xen_swiotlb_
> init(). Make pci_xen_swiotlb_init() depends on the hyperv_swiotlb
> _detect() to keep the order.
>
> Swiotlb bounce buffer code calls set_memory_decrypted()
> to mark bounce buffer visible to host and map it in extra
> address space via memremap. Populate the shared_gpa_boundary
> (vTOM) via swiotlb_unencrypted_base variable.
>
> The map function memremap() can't work in the early place
> hyperv_iommu_swiotlb_init() and so call swiotlb_update_mem_attributes()
> in the hyperv_iommu_swiotlb_later_init().
>
> Signed-off-by: Tianyu Lan
> ---
>  arch/x86/xen/pci-swiotlb-xen.c |  3 +-
>  drivers/hv/vmbus_drv.c         |  3 ++
>  drivers/iommu/hyperv-iommu.c   | 56 ++
>  include/linux/hyperv.h         |  8 +
>  4 files changed, 69 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 46df59aeaa06..30fd0600b008 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
>
>  #include
>  #include
> +#include
>  #include
>
>  #include
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
>
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -		  NULL,
> +		  hyperv_swiotlb_detect,
>  		  pci_xen_swiotlb_init,
>  		  NULL);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 392c1ac4f819..0a64ccfafb8b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -33,6 +33,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include "hyperv_vmbus.h"
>
> @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
>  	return child_device_obj;
>  }
>
> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
>  /*
>   * vmbus_device_register - Register the child device
>   */
> @@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
>  	}
>  	hv_debug_add_dev_dir(child_device_obj);
>
> +	child_device_obj->device.dma_mask = &vmbus_dma_mask;
>  	return 0;
>
>  err_kset_unregister:
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index e285a220c913..dd729d49a1eb 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -13,14 +13,20 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  #include
>  #include
>  #include
>  #include
> +#include
> +#include
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  #include "irq_remapping.h"
>
> @@ -337,4 +343,54 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>  	.free = hyperv_root_irq_remapping_free,
>  };
>
> +static void __init hyperv_iommu_swiotlb_init(void)
> +{
> +	unsigned long hyperv_io_tlb_size;
> +	void *hyperv_io_tlb_start;
> +
> +	/*
> +	 * Allocate Hyper-V swiotlb bounce buffer at early place
> +	 * to reserve large contiguous memory.
> +	 */
> +	hyperv_io_tlb_size = swiotlb_size_or_default();
> +	hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
> +
> +	if (!hyperv_io_tlb_start)
> +		pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");

In the error case, won't swiotlb_init_with_tbl() end up panic'ing when
it tries to zero out the memory?  The only real choice here is to
return immediately after printing the message, and not call
swiotlb_init_with_tbl().

> +
> +	swiotlb_init_with_tbl(hyperv_io_tlb_start,
> +			      hyperv_io_tlb_size >> IO_TLB_SHIFT, true);
> +}
> +
> +int __init hyperv_swiotlb_detect(void)
> +{
> +	if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> +		return 0;
> +
> +	if (!hv_is_isolation_supported())
> +		return 0;
> +
> +	/*
> +	 * Enable swiotlb force mode in Isolation VM to
> +	 * use swiotlb bounce
Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
On 12/3/21 5:20 AM, Tianyu Lan wrote:
> On 12/2/2021 10:42 PM, Tom Lendacky wrote:
>> On 12/1/21 10:02 AM, Tianyu Lan wrote:
>>> From: Tianyu Lan
>>>
>>> In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
>>> extra address space which is above shared_gpa_boundary (E.G 39 bit
>>> address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access
>>> physical address will be original physical address + shared_gpa_boundary.
>>> The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top
>>> of memory(vTOM). Memory addresses below vTOM are automatically treated
>>> as private while memory above vTOM is treated as shared.
>>>
>>> Expose swiotlb_unencrypted_base for platforms to set unencrypted
>>> memory base offset and platform calls swiotlb_update_mem_attributes()
>>> to remap swiotlb mem to unencrypted address space. memremap() can
>>> not be called in the early stage and so put remapping code into
>>> swiotlb_update_mem_attributes(). Store remap address and use it to
>>> copy data from/to swiotlb bounce buffer.
>>>
>>> Signed-off-by: Tianyu Lan
>>
>> This patch results in the following stack trace during a bare-metal
>> boot on my EPYC system with SME active (e.g. mem_encrypt=on):
>>
>> [    0.123932] BUG: Bad page state in process swapper  pfn:108001
>> [    0.123942] page:(ptrval) refcount:0 mapcount:-128 mapping: index:0x0 pfn:0x108001
>> [    0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
>> [    0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80
>> [    0.123954] raw: ff7f
>> [    0.123955] page dumped because: nonzero mapcount
>> [    0.123957] Modules linked in:
>> [    0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2
>> [    0.123964] Hardware name: AMD Corporation
>> [    0.123967] Call Trace:
>> [    0.123971]
>> [    0.123975]  dump_stack_lvl+0x48/0x5e
>> [    0.123985]  bad_page.cold+0x65/0x96
>> [    0.123990]  __free_pages_ok+0x3a8/0x410
>> [    0.123996]  memblock_free_all+0x171/0x1dc
>> [    0.124005]  mem_init+0x1f/0x14b
>> [    0.124011]  start_kernel+0x3b5/0x6a1
>> [    0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
>> [    0.124022]
>>
>> I see ~40 of these traces, each for different pfns.
>
> Hi Tom:
>       Thanks for your test. Could you help to test the following patch
> and check whether it can fix the issue.

The patch is mangled. Is the only difference where set_memory_decrypted()
is called?

Thanks,
Tom

> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 569272871375..f6c3638255d5 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
>   * @end:	The end address of the swiotlb memory pool. Used to do a quick
>   *		range check to see if the memory was in fact allocated by this
>   *		API.
> + * @vaddr:	The vaddr of the swiotlb memory pool. The swiotlb memory pool
> + *		may be remapped in the memory encrypted case and store virtual
> + *		address for bounce buffer operation.
>   * @nslabs:	The number of IO TLB blocks (in groups of 64) between @start and
>   *		@end. For default swiotlb, this is command line adjustable via
>   *		setup_io_tlb_npages.
> @@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
>  struct io_tlb_mem {
>  	phys_addr_t start;
>  	phys_addr_t end;
> +	void *vaddr;
>  	unsigned long nslabs;
>  	unsigned long used;
>  	unsigned int index;
> @@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
>  }
>  #endif /* CONFIG_DMA_RESTRICTED_POOL */
>
> +extern phys_addr_t swiotlb_unencrypted_base;
> +
>  #endif /* __LINUX_SWIOTLB_H */
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 8e840fbbed7c..34e6ade4f73c 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -50,6 +50,7 @@
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -72,6 +73,8 @@
>  enum swiotlb_force swiotlb_force;
>  struct io_tlb_mem io_tlb_default_mem;
>
> +phys_addr_t swiotlb_unencrypted_base;
> +
>  /*
>   * Max segment that we can provide which (if pages are contingous) will
>   * not be bounced (unless SWIOTLB_FORCE is set).
> @@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
>  	return DIV_ROUND_UP(val, IO_TLB_SIZE);
>  }
>
> +/*
> + * Remap swioltb memory in the unencrypted physical address space
> + * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
> + * Isolation VMs).
> + */
> +void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
> +{
> +	void *vaddr = NULL;
> +
> +	if (swiotlb_unencrypted_base) {
> +		phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
> +
> +		vaddr = memremap(paddr, bytes, MEMREMAP_WB);
> +		if (!vaddr)
> +			pr_err("Failed to map the unencrypted memory %llx size %lx.\n",
> +			       paddr, bytes);
> +	}
> +
> +	r
RE: [PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver
From: Tianyu Lan
Sent: Wednesday, December 1, 2021 8:03 AM

> In Isolation VM, all shared memory with host needs to mark visible
> to host via hvcall. vmbus_establish_gpadl() has already done it for
> netvsc rx/tx ring buffer. The page buffer used by
> vmbus_sendpacket_pagebuffer() stills need to be handled. Use DMA API
> to map/umap these memory during sending/receiving packet and Hyper-V
> swiotlb bounce buffer dma adress will be returned. The swiotlb bounce
> buffer has been masked to be visible to host during boot up.
>
> rx/tx ring buffer is allocated via vzalloc() and they need to be
> mapped into unencrypted address space(above vTOM) before sharing
> with host and accessing. Add hv_map/unmap_memory() to map/umap rx
> /tx ring buffer.
>
> Signed-off-by: Tianyu Lan
> ---
> Change since v2:
>        * Add hv_map/unmap_memory() to map/umap rx/tx ring buffer.
> ---
>  arch/x86/hyperv/ivm.c             |  28 ++
>  drivers/hv/hv_common.c            |  11 +++
>  drivers/net/hyperv/hyperv_net.h   |   5 ++
>  drivers/net/hyperv/netvsc.c       | 136 +-
>  drivers/net/hyperv/netvsc_drv.c   |   1 +
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/asm-generic/mshyperv.h    |   2 +
>  include/linux/hyperv.h            |   5 ++
>  8 files changed, 187 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 69c7a57f3307..9f78d8f67ea3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -287,3 +287,31 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visibl
>  	kfree(pfn_array);
>  	return ret;
>  }
> +
> +/*
> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
> + */
> +void *hv_map_memory(void *addr, unsigned long size)
> +{
> +	unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,

This should be just PAGE_SIZE, as this code is unrelated to
communication with Hyper-V.

> +			sizeof(unsigned long), GFP_KERNEL);
> +	void *vaddr;
> +	int i;
> +
> +	if (!pfns)
> +		return NULL;
> +
> +	for (i = 0; i < size / PAGE_SIZE; i++)
> +		pfns[i] = virt_to_hvpfn(addr + i * PAGE_SIZE) +

Same here: Use virt_to_pfn().

> +			(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> +
> +	vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
> +	kfree(pfns);
> +
> +	return vaddr;
> +}
> +
> +void hv_unmap_memory(void *addr)
> +{
> +	vunmap(addr);
> +}
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 7be173a99f27..3c5cb1f70319 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -295,3 +295,14 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_s
>  	return HV_STATUS_INVALID_PARAMETER;
>  }
>  EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
> +
> +void __weak *hv_map_memory(void *addr, unsigned long size)
> +{
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(hv_map_memory);
> +
> +void __weak hv_unmap_memory(void *addr)
> +{
> +}
> +EXPORT_SYMBOL_GPL(hv_unmap_memory);
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..cf69da0e296c 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>  	u32 total_bytes;
>  	u32 send_buf_index;
>  	u32 total_data_buflen;
> +	struct hv_dma_range *dma_range;
>  };
>
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
>
>  	/* Receive buffer allocated by us but manages by NetVSP */
>  	void *recv_buf;
> +	void *recv_original_buf;
>  	u32 recv_buf_size; /* allocated bytes */
>  	struct vmbus_gpadl recv_buf_gpadl_handle;
>  	u32 recv_section_cnt;
> @@ -1082,6 +1084,7 @@ struct netvsc_device {
>
>  	/* Send buffer allocated by us */
>  	void *send_buf;
> +	void *send_original_buf;
>  	u32 send_buf_size;
>  	struct vmbus_gpadl send_buf_gpadl_handle;
>  	u32 send_section_cnt;
> @@ -1731,4 +1734,6 @@ struct rndis_message {
>  #define RETRY_US_HI	1
>  #define RETRY_MAX	2000	/* >10 sec */
>
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +		      struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 396bc1c204e6..b7ade735a806 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
>  	int i;
>
>  	kfree(nvdev->extension);
> -	vfree(nvdev->recv_buf);
> -	vfree(nvdev->send_buf);
> +
> +	if (nvdev->recv_original_buf) {
> +		hv_unmap_memory(nvdev->recv_buf);
> +		vfree(nvdev->recv_original_buf);
> +	} else {
> +		vfree(nvdev->recv_buf);
> +	}
> +
> +	if (nvdev->send_
Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation
On 12/3/21 11:03, Robin Murphy wrote:
> On 2021-12-03 15:47, Ross Philipson wrote:
>> On 12/2/21 12:26, Robin Murphy wrote:
>>> On 2021-08-27 14:28, Ross Philipson wrote:
[...]
>>>> +IOMMU Configuration
>>>> +---
>>>> +
>>>> +When doing a Secure Launch, the IOMMU should always be enabled and
>>>> +the drivers loaded. However, IOMMU passthrough mode should never be
>>>> +used. This leaves the MLE completely exposed to DMA after the PMR's
>>>> +[2]_ are disabled. First, IOMMU passthrough should be disabled by
>>>> +default in the build configuration::
>>>> +
>>>> +  "Device Drivers" -->
>>>> +      "IOMMU Hardware Support" -->
>>>> +          "IOMMU passthrough by default [ ]"
>>>> +
>>>> +This unset the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
>>>
>>> Note that the config structure has now changed, and if set, passthrough
>>> is deselected by choosing a different default domain type.
>>
>> Thanks for the heads up. We will have to modify this for how things
>> exist today.
>>
>>>> +In addition, passthrough must be disabled on the kernel command line
>>>> +when doing a Secure Launch as follows::
>>>> +
>>>> +  iommu=nopt iommu.passthrough=0
>>>
>>> This part is a bit silly - those options are literally aliases for the
>>> exact same thing, and furthermore if the config is already set as
>>> required then the sole effect either of them will have is to cause "(set
>>> by kernel command line)" to be printed. There is no value in explicitly
>>> overriding the default value with the default value - if anyone can
>>> append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of
>>> the command line they'll still win.
>>
>> I feel like when we worked on this, it was still important to set those
>> values. This could have been in an older kernel version. We will go back
>> and verify what you are saying here and adjust the documentation
>> accordingly.
>>
>> As to anyone just adding values to the command line, that is why the
>> command line is part of the DRTM measurements.
>
> Yeah, I had a vague memory that that was the case - basically if you can
> trust the command line as much as the config then it's definitely
> redundant to pass an option for this (see iommu_subsys_init() - it's now
> all plumbed through iommu_def_domain_type), and if you can't then
> passing them is futile anyway.

Thank you for your feedback.

Ross

> Cheers,
> Robin.
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Jason,
>
> On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
> >> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> >> pro and cons.
> >> >
> >> > This decision seems to drive the question of how many 'struct devices'
> >> > do we need, and where do we get them..
> >>
> >> Not really. There is nothing what enforces to make the MSI irqdomain
> >> storage strictly hang off struct device. There has to be a connection to
> >> a struct device in some way obviously to make IOMMU happy.
> >
> > Yes, I thought this too, OK we agree
>
> One thing which just occured to me and I missed so far is the
> association of the irqdomain.

Oh? I thought this was part of what you were thinking when talking
about using the mdev struct device.

> but I really need to look at all of the MSI infrastructure code whether
> it makes assumptions about this:
>
>    msi_desc->dev->irqdomain
>
> being the correct one. Which would obviously just be correct when we'd
> have a per queue struct device :)

Right, I'd half expect the msi_desc to gain a pointer to the 'lfunc'.

Then msi_desc->dev becomes msi_desc->origin_physical_device and is
only used by IOMMU code to figure out MSI remapping/etc.

Maybe that allows solving your aliasing problem (Is this the RID
aliasing we see with iommu_groups too?)

> > Lets imagine a simple probe function for a simple timer device. It has
> > no cdev, vfio, queues, etc. However, the device only supports IMS. No
> > MSI, MSI-X, nothing else.
> >
> > What does the probe function look like to get to request_irq()?
>
> The timer device would be represented by what? A struct device?

Lets say it is a pci_device and probe is a pci_driver probe function

> If so then something needs to set device->irqdomain and then you can
> just do:
>
>     msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

Okay, but that doesn't work if it is a PCI device and dev->irqdomain is
already the PCI MSI and INTx irqdomain?

> To do that, I need to understand the bigger picture and the intended
> usage because otherwise we end up with an utter mess.

Sure

> >> The idea is to have a common representation for these type of things
> >> which allows:
> >>
> >>    1) Have common code for exposing queues to VFIO, cdev, sysfs...
> >>
> >>       You still need myqueue specific code, but the common stuff which is
> >>       in struct logical_function can be generic and device independent.
> >
> > I will quote you: "Seriously, you cannot make something uniform which
> > is by definition non-uniform." :)
> >
> > We will find there is no common stuff here, we did this exercise
> > already when we designed struct auxiliary_device, and came up
> > empty.
>
> Really?

Yes.. It was a long drawn out affair, and auxiliary_device is just a
simple container

> >>    2) Having the MSI storage per logical function (queue) allows to have
> >>       a queue relative 0 based MSI index space.
> >
> > Can you explain a bit what you see 0 meaning in this idx numbering?
> >
> > I also don't understand what 'queue relative' means?
> >
> >>       The actual index in the physical table (think IMS) would be held in
> >>       the msi descriptor itself.
> >
> > This means that a device like IDXD would store the phsical IMS table
> > entry # in the msi descriptor? What is idx then?
> >
> > For a device like IDXD with a simple linear table, how does the driver
> > request a specific entry in the IMS table? eg maybe IMS table entry #0
> > is special like we often see in MSI?
>
> If there is a hardwired entry then this hardwired entry belongs to the
> controlblock (status, error or whatever), but certainly not to a
> dynamically allocatable queue as that would defeat the whole purpose.
>
> That hardwired entry could surely exist in a IDXD type setup where the
> message storage is in an defined array on the device.

Agree - so how does it get accessed?

> But with the use case you described where you store the message at some
> place in per queue system memory, the MSI index is not relevant at all,
> because there is no table. So it's completely meaningless for the device
> and just useful for finding the related MSI descriptor again.

Yes

What I don't see is how exactly we make this situation work. The big
picture sequence would be:

 - Device allocates a queue and gets a queue handle
 - Device finds a free msi_desc and associates that msi_desc with the
   queue handle
 - Device specific irq_chip uses the msi_desc to get the queue handle
   and programs the addr/data

In this regard the msi_desc is just an artifact of the API, the real
work is in the irq_chip.

> Or has each queue and controlblock and whatever access to a shared large
> array where the messages are stored and the indices are handed out to
> the queues and con
Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation
On 2021-12-03 15:47, Ross Philipson wrote:
> On 12/2/21 12:26, Robin Murphy wrote:
>> On 2021-08-27 14:28, Ross Philipson wrote:
>> [...]
>>> +IOMMU Configuration
>>> +---
>>> +
>>> +When doing a Secure Launch, the IOMMU should always be enabled and the drivers
>>> +loaded. However, IOMMU passthrough mode should never be used. This leaves the
>>> +MLE completely exposed to DMA after the PMR's [2]_ are disabled. First, IOMMU
>>> +passthrough should be disabled by default in the build configuration::
>>> +
>>> +  "Device Drivers" -->
>>> +  "IOMMU Hardware Support" -->
>>> +  "IOMMU passthrough by default [ ]"
>>> +
>>> +This unset the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
>>
>> Note that the config structure has now changed, and if set, passthrough
>> is deselected by choosing a different default domain type.
>
> Thanks for the heads up. We will have to modify this for how things
> exist today.
>
>>> +In addition, passthrough must be disabled on the kernel command line when doing
>>> +a Secure Launch as follows::
>>> +
>>> +  iommu=nopt iommu.passthrough=0
>>
>> This part is a bit silly - those options are literally aliases for the
>> exact same thing, and furthermore if the config is already set as
>> required then the sole effect either of them will have is to cause "(set
>> by kernel command line)" to be printed. There is no value in explicitly
>> overriding the default value with the default value - if anyone can
>> append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of
>> the command line they'll still win.
>
> I feel like when we worked on this, it was still important to set those
> values. This could have been in an older kernel version. We will go back
> and verify what you are saying here and adjust the documentation
> accordingly. As to anyone just adding values to the command line, that
> is why the command line is part of the DRTM measurements.
Yeah, I had a vague memory that that was the case - basically if you can
trust the command line as much as the config then it's definitely
redundant to pass an option for this (see iommu_subsys_init() - it's now
all plumbed through iommu_def_domain_type), and if you can't then
passing them is futile anyway.

Cheers,
Robin.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation
On 12/2/21 12:26, Robin Murphy wrote:
> On 2021-08-27 14:28, Ross Philipson wrote:
> [...]
>> +IOMMU Configuration
>> +---
>> +
>> +When doing a Secure Launch, the IOMMU should always be enabled and
>> the drivers
>> +loaded. However, IOMMU passthrough mode should never be used. This
>> leaves the
>> +MLE completely exposed to DMA after the PMR's [2]_ are disabled.
>> First, IOMMU
>> +passthrough should be disabled by default in the build configuration::
>> +
>> +  "Device Drivers" -->
>> +  "IOMMU Hardware Support" -->
>> +  "IOMMU passthrough by default [ ]"
>> +
>> +This unset the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
>
> Note that the config structure has now changed, and if set, passthrough
> is deselected by choosing a different default domain type.

Thanks for the heads up. We will have to modify this for how things
exist today.

>
>> +In addition, passthrough must be disabled on the kernel command line
>> when doing
>> +a Secure Launch as follows::
>> +
>> +  iommu=nopt iommu.passthrough=0
>
> This part is a bit silly - those options are literally aliases for the
> exact same thing, and furthermore if the config is already set as
> required then the sole effect either of them will have is to cause "(set
> by kernel command line)" to be printed. There is no value in explicitly
> overriding the default value with the default value - if anyone can
> append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of
> the command line they'll still win.

I feel like when we worked on this, it was still important to set those
values. This could have been in an older kernel version. We will go back
and verify what you are saying here and adjust the documentation
accordingly. As to anyone just adding values to the command line, that
is why the command line is part of the DRTM measurements.

Thank you,
Ross

>
> Robin.
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
Jason,

On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
>> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> >> dedicated xarray or by partitioning the xarray space. Both have their
>> >> pro and cons.
>> >
>> > This decision seems to drive the question of how many 'struct devices'
>> > do we need, and where do we get them..
>>
>> Not really. There is nothing what enforces to make the MSI irqdomain
>> storage strictly hang off struct device. There has to be a connection to
>> a struct device in some way obviously to make IOMMU happy.
>
> Yes, I thought this too, OK we agree

One thing which just occured to me and I missed so far is the
association of the irqdomain.

Right now we know for each device which irqdomain it belongs to. That's
part of device discovery (PCI, device tree or whatever) which set's up

    device->irqdomain

That obviously cannot work for that device specific IMS domain. Why?

Because the PCIe device provides MSI[x] which obviously requires that
device->irqdomain is pointing to the PCI/MSI irqdomain. Otherwise the
PCI/MSI allocation mechanism wont work.

Sure each driver can have

    mystruct->ims_irqdomain

and then do the allocation via

    msi_irq_domain_alloc(mystruct->ims_irqdomain)

but I really need to look at all of the MSI infrastructure code whether
it makes assumptions about this:

    msi_desc->dev->irqdomain

being the correct one. Which would obviously just be correct when we'd
have a per queue struct device :)

>> Just badly named because the table itself is where the resulting message
>> is stored, which is composed with the help of the relevant MSI
>> descriptor. See above.
>
> I picked the name because it looks like it will primarily contain an
> xarray and the API you suggested to query it is idx based. To me that
> is a table. A table of msi_desc storage to be specific.
> It seems we have some agreement here as your lfunc also included the
> same xarray and uses an idx as part of the API, right?

It's a per lfunc xarray, yes.

>> We really should not try to make up an artifical table representation
>> for something which does not necessarily have a table at all, i.e. the
>> devices you talk about which store the message in queue specific system
>> memory. Pretending that this is a table is just silly.
>
> Now I'm a bit confused, what is the idx in the lfunc?
>
> I think this is language again, I would call idx an artificial table
> representation?

Ok. I prefer the term indexed storage, but yeah.

>> I really meant a container like this:
>>
>> struct logical_function {
>>     /* Pointer to the physical device */
>>     struct device    *phys_device;
>>     /* MSI descriptor storage */
>>     struct msi_data  msi;
>
> Up to here is basically what I thought the "message table" would
> contain. Possibly a pointer to the iommu_domain too?
>
>>     /* The queue number */
>>     unsigned int     queue_nr;
>>     /* Add more information which is common to these things */
>
> Not sure why do we need this?
>
> Lets imagine a simple probe function for a simple timer device. It has
> no cdev, vfio, queues, etc. However, the device only supports IMS. No
> MSI, MSI-X, nothing else.
>
> What does the probe function look like to get to request_irq()?

The timer device would be represented by what? A struct device? If so
then something needs to set device->irqdomain and then you can just do:

    msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

> Why does this simple driver create something called a 'logical
> function' to access its only interrupt?

It does not have to when it's struct device based.

> I think you have the right idea here, just forget about VFIO, the IDXD
> use case, all of it. Provide a way to use IMS cleanly and concurrently
> with MSI.
>
> Do that and everything else should have sane solutions too.
To do that, I need to understand the bigger picture and the intended
usage because otherwise we end up with an utter mess.

>> The idea is to have a common representation for these type of things
>> which allows:
>>
>>    1) Have common code for exposing queues to VFIO, cdev, sysfs...
>>
>>       You still need myqueue specific code, but the common stuff which is
>>       in struct logical_function can be generic and device independent.
>
> I will quote you: "Seriously, you cannot make something uniform which
> is by definition non-uniform." :)
>
> We will find there is no common stuff here, we did this exercise
> already when we designed struct auxiliary_device, and came up
> empty.

Really?

>> 2) Having the MSI storage per logical function (queue) allows to have
>>    a queue relative 0 based MSI index space.
>
> Can you explain a bit what you see 0 meaning in this idx numbering?
>
> I also don't understand what 'queue relative' means?
>
>> The actual index in the physical table (think IMS) woul
Re: [RFC v16 0/9] SMMUv3 Nested Stage Setup (IOMMU part)
Hi Eric,

This series brings the IOMMU part of HW nested paging support in the
SMMUv3. The SMMUv3 driver is adapted to support 2 nested stages.

The IOMMU API is extended to convey the guest stage 1 configuration and
the hook is implemented in the SMMUv3 driver.

This allows the guest to own the stage 1 tables and context descriptors
(so-called PASID table) while the host owns the stage 2 tables and main
configuration structures (STE).

This work mainly is provided for test purpose as the upper layer
integration is under rework and bound to be based on /dev/iommu instead
of VFIO tunneling. In this version we also get rid of the MSI BINDING
ioctl, assuming the guest enforces flat mapping of host IOVAs used to
bind physical MSI doorbells. In the current QEMU integration this is
achieved by exposing RMRs to the guest, using Shameer's series [1]. This
approach is RFC as the IORT spec is not really meant to do that (single
mapping flag limitation).

Best Regards

Eric

This series (Host) can be found at:
https://github.com/eauger/linux/tree/v5.15-rc7-nested-v16
This includes a rebased VFIO integration (although not meant to be
upstreamed)

Guest kernel branch can be found at:
https://github.com/eauger/linux/tree/shameer_rmrr_v7
featuring [1]

QEMU integration (still based on VFIO and exposing RMRs) can be found at:
https://github.com/eauger/qemu/tree/v6.1.0-rmr-v2-nested_smmuv3_v10
(use iommu=nested-smmuv3 ARM virt option)

Guest dependency:
[1] [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

History:

v15 -> v16:
- guest RIL must support RIL
- additional checks in the cache invalidation hook
- removal of the MSI BINDING ioctl (tentative replacement by RMRs)

Eric Auger (9):
  iommu: Introduce attach/detach_pasid_table API
  iommu: Introduce iommu_get_nesting
  iommu/smmuv3: Allow s1 and s2 configs to coexist
  iommu/smmuv3: Get prepared for nested stage support
  iommu/smmuv3: Implement attach/detach_pasid_table
  iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs
  iommu/smmuv3: Implement cache_invalidate
  iommu/smmuv3: report additional recoverable faults
  iommu/smmuv3: Disallow nested mode in presence of HW MSI regions

Hi Eric,

I validated the reworked test patches in v16 from the given branches
with Kernel v5.15 & Qemu v6.2. Verified them with NVMe PCI device
assigned to Guest VM. Sorry, forgot to update earlier.

Tested-by: Sumit Gupta

Thanks,
Sumit Gupta
Re: [PATCH 0/3] Allow restricted-dma-pool to customize IO_TLB_SEGSIZE
On 2021-11-25 07:35, Tomasz Figa wrote:

Hi Robin,

On Tue, Nov 23, 2021 at 8:59 PM Robin Murphy wrote:

On 2021-11-23 11:21, Hsin-Yi Wang wrote:

Default IO_TLB_SEGSIZE (128) slabs may be not enough for some use cases.
This series adds support to customize io_tlb_segsize for each
restricted-dma-pool.

Example use case: mtk-isp drivers[1] are controlled by mtk-scp[2] and
allocate memory through mtk-scp. In order to use the noncontiguous DMA
API[3], we need to use the swiotlb pool. mtk-scp needs to allocate
memory with 2560 slabs. mtk-isp drivers also needs to allocate memory
with 200+ slabs. Both are larger than the default IO_TLB_SEGSIZE (128)
slabs.

Are drivers really doing streaming DMA mappings that large? If so, that
seems like it might be worth trying to address in its own right for the
sake of efficiency - allocating ~5MB of memory twice and copying it back
and forth doesn't sound like the ideal thing to do.

If it's really about coherent DMA buffer allocation, I thought the plan
was that devices which expect to use a significant amount and/or size of
coherent buffers would continue to use a shared-dma-pool for that? It's
still what the binding implies. My understanding was that
swiotlb_alloc() is mostly just a fallback for the sake of drivers which
mostly do streaming DMA but may allocate a handful of pages worth of
coherent buffers here and there. Certainly looking at the mtk_scp
driver, that seems like it shouldn't be going anywhere near SWIOTLB at
all.

First, thanks a lot for taking a look at this patch series.

The drivers would do streaming DMA within a reserved region that is the
only memory accessible to them for security reasons. This seems to
exactly match the definition of the restricted pool as merged recently.

Huh? Of the drivers indicated, the SCP driver is doing nothing but
coherent allocations, and I'm not entirely sure what those ISP driver
patches are supposed to be doing but I suspect it's probably just buffer
allocation too. I don't see any actual streaming DMA anywhere :/

The new dma_alloc_noncontiguous() API would allow allocating suitable
memory directly from the pool, which would eliminate the need to copy.

Can you clarify what's being copied, and where? I'm not all that
familiar with the media APIs, but I thought it was all based around
preallocated DMA buffers (the whole dedicated "videobuf" thing)? The few
instances of actual streaming DMA I can see in drivers/media/ look to be
mostly PCI drivers mapping private descriptors, whereas the MTK ISP
appears to be entirely register-based.

However, for a restricted pool, this would exercise the SWIOTLB
allocator, which currently suffers from the limitation as described by
Hsin-Yi. Since the allocator in general is quite general purpose and
already used for coherent allocations as per the current restricted pool
implementation, I think it indeed makes sense to lift the limitation,
rather than trying to come up with yet another thing.

No, just fix the dma_alloc_noncontiguous() fallback case to split the
allocation into dma_max_mapping_size() chunks. *That* makes sense.

Thanks,
Robin.

Best regards,
Tomasz

Robin.

[1] (not in upstream)
https://patchwork.kernel.org/project/linux-media/cover/20190611035344.29814-1-jungo@mediatek.com/
[2] https://elixir.bootlin.com/linux/latest/source/drivers/remoteproc/mtk_scp.c
[3] https://patchwork.kernel.org/project/linux-media/cover/20210909112430.61243-1-senozhat...@chromium.org/

Hsin-Yi Wang (3):
  dma: swiotlb: Allow restricted-dma-pool to customize IO_TLB_SEGSIZE
  dt-bindings: Add io-tlb-segsize property for restricted-dma-pool
  arm64: dts: mt8183: use restricted swiotlb for scp mem

 .../reserved-memory/shared-dma-pool.yaml       |  8 +
 .../arm64/boot/dts/mediatek/mt8183-kukui.dtsi  |  4 +--
 include/linux/swiotlb.h                        |  1 +
 kernel/dma/swiotlb.c                           | 34 ++-
 4 files changed, 37 insertions(+), 10 deletions(-)
Re: [RFC v16 0/9] SMMUv3 Nested Stage Setup (IOMMU part)
Hi, Eric

On 2021/10/27 6:44 PM, Eric Auger wrote:

This series brings the IOMMU part of HW nested paging support in the
SMMUv3. The SMMUv3 driver is adapted to support 2 nested stages.

The IOMMU API is extended to convey the guest stage 1 configuration and
the hook is implemented in the SMMUv3 driver.

This allows the guest to own the stage 1 tables and context descriptors
(so-called PASID table) while the host owns the stage 2 tables and main
configuration structures (STE).

This work mainly is provided for test purpose as the upper layer
integration is under rework and bound to be based on /dev/iommu instead
of VFIO tunneling. In this version we also get rid of the MSI BINDING
ioctl, assuming the guest enforces flat mapping of host IOVAs used to
bind physical MSI doorbells. In the current QEMU integration this is
achieved by exposing RMRs to the guest, using Shameer's series [1]. This
approach is RFC as the IORT spec is not really meant to do that (single
mapping flag limitation).

Best Regards

Eric

This series (Host) can be found at:
https://github.com/eauger/linux/tree/v5.15-rc7-nested-v16
This includes a rebased VFIO integration (although not meant to be
upstreamed)

Guest kernel branch can be found at:
https://github.com/eauger/linux/tree/shameer_rmrr_v7
featuring [1]

QEMU integration (still based on VFIO and exposing RMRs) can be found at:
https://github.com/eauger/qemu/tree/v6.1.0-rmr-v2-nested_smmuv3_v10
(use iommu=nested-smmuv3 ARM virt option)

Guest dependency:
[1] [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

Thanks a lot for upgrading these patches. I have basically verified
these patches on HiSilicon Kunpeng920. And integrated them to these
branches.

https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.16
https://github.com/Linaro/qemu/tree/v6.1.0-rmr-v2-nested_smmuv3_v10

Though they are provided for test purpose,

Tested-by: Zhangfei Gao

Thanks
Re: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
On 12/2/2021 10:43 PM, Wei Liu wrote:
On Wed, Dec 01, 2021 at 11:02:54AM -0500, Tianyu Lan wrote:
[...]

diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 46df59aeaa06..30fd0600b008 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
 #include
 #include
+#include
 #include
 #include
@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
 EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);

 IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
-                  NULL,
+                  hyperv_swiotlb_detect,

It is not immediately obvious why this is needed just by reading the
code. Please consider copying some of the text in the commit message to
a comment here.

Thanks for suggestion. Will update.
Re: [PATCH V3 2/5] x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
On 12/2/2021 10:39 PM, Wei Liu wrote:

+static bool hyperv_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_HYPERV
+	if (attr == CC_ATTR_GUEST_MEM_ENCRYPT)
+		return true;
+	else
+		return false;

This can be simplified as

	return attr == CC_ATTR_GUEST_MEM_ENCRYPT;

Wei.

Hi Wei:
Thanks for your review. Will update.
Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
On 12/2/2021 10:42 PM, Tom Lendacky wrote:
On 12/1/21 10:02 AM, Tianyu Lan wrote:

From: Tianyu Lan

In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
extra address space which is above shared_gpa_boundary (E.G 39 bit
address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access
physical address will be original physical address + shared_gpa_boundary.
The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of
memory(vTOM). Memory addresses below vTOM are automatically treated as
private while memory above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set unencrypted memory
base offset and platform calls swiotlb_update_mem_attributes() to remap
swiotlb mem to unencrypted address space. memremap() can not be called
in the early stage and so put remapping code into
swiotlb_update_mem_attributes(). Store remap address and use it to copy
data from/to swiotlb bounce buffer.

Signed-off-by: Tianyu Lan

This patch results in the following stack trace during a bare-metal boot
on my EPYC system with SME active (e.g. mem_encrypt=on):

[ 0.123932] BUG: Bad page state in process swapper  pfn:108001
[ 0.123942] page:(ptrval) refcount:0 mapcount:-128 mapping: index:0x0 pfn:0x108001
[ 0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
[ 0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80
[ 0.123954] raw: ff7f
[ 0.123955] page dumped because: nonzero mapcount
[ 0.123957] Modules linked in:
[ 0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2
[ 0.123964] Hardware name: AMD Corporation
[ 0.123967] Call Trace:
[ 0.123971]
[ 0.123975]  dump_stack_lvl+0x48/0x5e
[ 0.123985]  bad_page.cold+0x65/0x96
[ 0.123990]  __free_pages_ok+0x3a8/0x410
[ 0.123996]  memblock_free_all+0x171/0x1dc
[ 0.124005]  mem_init+0x1f/0x14b
[ 0.124011]  start_kernel+0x3b5/0x6a1
[ 0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
[ 0.124022]

I see ~40 of these traces, each for different pfns.
Thanks,
Tom

Hi Tom:
Thanks for your test. Could you help to test the following patch and
check whether it can fix the issue.

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:	The end address of the swiotlb memory pool. Used to do a quick
  *		range check to see if the memory was in fact allocated by this
  *		API.
+ * @vaddr:	The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ *		may be remapped in the memory encrypted case and store virtual
+ *		address for bounce buffer operation.
  * @nslabs:	The number of IO TLB blocks (in groups of 64) between @start and
  *		@end. For default swiotlb, this is command line adjustable via
  *		setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
 struct io_tlb_mem {
 	phys_addr_t start;
 	phys_addr_t end;
+	void *vaddr;
 	unsigned long nslabs;
 	unsigned long used;
 	unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
 }
 #endif /* CONFIG_DMA_RESTRICTED_POOL */

+extern phys_addr_t swiotlb_unencrypted_base;
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..34e6ade4f73c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;

 struct io_tlb_mem io_tlb_default_mem;

+phys_addr_t swiotlb_unencrypted_base;
+
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
 	return DIV_ROUND_UP(val, IO_TLB_SIZE);
 }

+/*
+ * Remap swioltb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+	void *vaddr = NULL;
+
+	if (swiotlb_unencrypted_base) {
+		phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+		vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+		if (!vaddr)
+			pr_err("Failed to map the unencrypted memory %llx size %lx.\n",
+			       paddr, bytes);
+	}
+
+	return vaddr;
+}
+
 /*
  * Early SWIOTLB allocation may be too early to allow an architecture to
  * perform the desired operations. This function allows the archite