Re: [PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver

2021-12-03 Thread Tianyu Lan

On 12/4/2021 2:59 AM, Michael Kelley (LINUX) wrote:

+
+/*
+ * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
+ */
+void *hv_map_memory(void *addr, unsigned long size)
+{
+   unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,

This should be just PAGE_SIZE, as this code is unrelated to communication
with Hyper-V.



Yes, agree. Will update.
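
Applied to the quoted hunk, the reworked helper would look roughly like this (a sketch only; it also folds in the virt_to_pfn() suggestion made against the same patch later in this digest):

void *hv_map_memory(void *addr, unsigned long size)
{
	unsigned long *pfns = kcalloc(size / PAGE_SIZE,
				      sizeof(unsigned long), GFP_KERNEL);
	void *vaddr;
	int i;

	if (!pfns)
		return NULL;

	/* Shift each guest PFN above the shared_gpa_boundary (vTOM). */
	for (i = 0; i < size / PAGE_SIZE; i++)
		pfns[i] = virt_to_pfn(addr + i * PAGE_SIZE) +
			(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);

	vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
	kfree(pfns);

	return vaddr;
}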



Re: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-12-03 Thread Tianyu Lan




On 12/4/2021 3:17 AM, Michael Kelley (LINUX) wrote:

+static void __init hyperv_iommu_swiotlb_init(void)
+{
+   unsigned long hyperv_io_tlb_size;
+   void *hyperv_io_tlb_start;
+
+   /*
+* Allocate Hyper-V swiotlb bounce buffer at early place
+* to reserve large contiguous memory.
+*/
+   hyperv_io_tlb_size = swiotlb_size_or_default();
+   hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
+
+   if (!hyperv_io_tlb_start)
+   pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");

In the error case, won't swiotlb_init_with_tbl() end up panic'ing when
it tries to zero out the memory?  The only real choice here is to
return immediately after printing the message, and not call
swiotlb_init_with_tbl().



Yes, agree. Will update.
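
With that change, the early init would look roughly like this (a sketch, not the posted v4):

static void __init hyperv_iommu_swiotlb_init(void)
{
	unsigned long hyperv_io_tlb_size;
	void *hyperv_io_tlb_start;

	/*
	 * Allocate the Hyper-V swiotlb bounce buffer early so that a
	 * large contiguous region can still be reserved.
	 */
	hyperv_io_tlb_size = swiotlb_size_or_default();
	hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);

	if (!hyperv_io_tlb_start) {
		pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
		return;		/* don't hand a NULL buffer to swiotlb */
	}

	swiotlb_init_with_tbl(hyperv_io_tlb_start,
			      hyperv_io_tlb_size >> IO_TLB_SHIFT, true);
}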


Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-12-03 Thread Tianyu Lan

On 12/4/2021 4:06 AM, Tom Lendacky wrote:

Hi Tom:
   Thanks for your test. Could you help test the following patch
and check whether it fixes the issue?


The patch is mangled. Is the only difference where 
set_memory_decrypted() is called?


I de-mangled the patch. No more stack traces with SME active.

Thanks,
Tom


Hi Tom:
Thanks a lot for your rework and test. I will update in the next 
version.

Thanks.

Re: [PATCH 1/4] dt-bindings: memory: mediatek: Correct the minItems of clk for larbs

2021-12-03 Thread Rob Herring
On Fri, 03 Dec 2021 14:40:24 +0800, Yong Wu wrote:
> If a platform's larbs support GALS, some larbs will have one more
> "gals" clock while the others still only need the "apb"/"smi" clocks.
> Then the minItems is 2 and the maxItems is 3.
> 
> Fixes: 27bb0e42855a ("dt-bindings: memory: mediatek: Convert SMI to DT schema")
> Signed-off-by: Yong Wu 
> ---
>  .../bindings/memory-controllers/mediatek,smi-larb.yaml  | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Running 'make dtbs_check' with the schema in this patch gives the
following warnings. Consider if they are expected or the schema is
incorrect. These may not be new warnings.

Note that it is not yet a requirement to have 0 warnings for dtbs_check.
This will change in the future.

Full log is available here: https://patchwork.ozlabs.org/patch/1563127


larb@14016000: 'mediatek,larb-id' is a required property
arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

larb@14017000: clock-names: ['apb', 'smi'] is too short
arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku16.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku272.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku288.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku32.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku0.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku176.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-pumpkin.dt.yaml

larb@15001000: 'mediatek,larb-id' is a required property
arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

larb@1601: clock-names: ['apb', 'smi'] is too short
arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku16.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku272.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku288.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kodama-sku32.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku0.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-krane-sku176.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-pumpkin.dt.yaml

larb@1601: 'mediatek,larb-id' is a required property
arch/arm64/boot/dts/mediatek/mt8167-pumpkin.dt.yaml

larb@1701: clock-names: ['apb', 'smi'] is too short
arch/arm64/boot/dts/mediatek/mt8183-evb.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-burnet.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-damu.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel14.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku1.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-fennel-sku6.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-juniper-sku16.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kappa.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-kenzo.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku0.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-jacuzzi-willow-sku1.dt.yaml
arch/arm64/boot/dts/mediatek/mt8183-kukui-kakadu.dt.yaml
arch/arm64/boot/dts/

Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-12-03 Thread Tom Lendacky via iommu

On 12/3/21 1:11 PM, Tom Lendacky wrote:

On 12/3/21 5:20 AM, Tianyu Lan wrote:

On 12/2/2021 10:42 PM, Tom Lendacky wrote:

On 12/1/21 10:02 AM, Tianyu Lan wrote:

From: Tianyu Lan 

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed via
an extra address space which is above shared_gpa_boundary (e.g. the 39-bit
address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The accessed
physical address will be the original physical address + shared_gpa_boundary.
The shared_gpa_boundary in the AMD SEV-SNP spec is called virtual top of
memory (vTOM). Memory addresses below vTOM are automatically treated as
private while memory above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set the unencrypted
memory base offset; the platform calls swiotlb_update_mem_attributes()
to remap the swiotlb mem to the unencrypted address space. memremap()
cannot be called in the early stage, so the remapping code is put into
swiotlb_update_mem_attributes(). Store the remap address and use it to
copy data from/to the swiotlb bounce buffer.

Signed-off-by: Tianyu Lan 


This patch results in the following stack trace during a bare-metal boot
on my EPYC system with SME active (e.g. mem_encrypt=on):

[    0.123932] BUG: Bad page state in process swapper  pfn:108001
[    0.123942] page:(ptrval) refcount:0 mapcount:-128 mapping: index:0x0 pfn:0x108001
[    0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
[    0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80 
[    0.123954] raw:   ff7f 
[    0.123955] page dumped because: nonzero mapcount
[    0.123957] Modules linked in:
[    0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2

[    0.123964] Hardware name: AMD Corporation
[    0.123967] Call Trace:
[    0.123971]  
[    0.123975]  dump_stack_lvl+0x48/0x5e
[    0.123985]  bad_page.cold+0x65/0x96
[    0.123990]  __free_pages_ok+0x3a8/0x410
[    0.123996]  memblock_free_all+0x171/0x1dc
[    0.124005]  mem_init+0x1f/0x14b
[    0.124011]  start_kernel+0x3b5/0x6a1
[    0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
[    0.124022]  

I see ~40 of these traces, each for different pfns.

Thanks,
Tom


Hi Tom:
   Thanks for your test. Could you help test the following patch
and check whether it fixes the issue?


The patch is mangled. Is the only difference where set_memory_decrypted() 
is called?


I de-mangled the patch. No more stack traces with SME active.

Thanks,
Tom



Thanks,
Tom




diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:       The end address of the swiotlb memory pool. Used to do a quick
  *             range check to see if the memory was in fact allocated by this
  *             API.
+ * @vaddr:     The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ *             may be remapped in the memory encrypted case and store virtual
+ *             address for bounce buffer operation.
  * @nslabs:    The number of IO TLB blocks (in groups of 64) between @start and
  *             @end. For default swiotlb, this is command line adjustable via
  *             setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
  struct io_tlb_mem {
 phys_addr_t start;
 phys_addr_t end;
+   void *vaddr;
 unsigned long nslabs;
 unsigned long used;
 unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)

  }
  #endif /* CONFIG_DMA_RESTRICTED_POOL */

+extern phys_addr_t swiotlb_unencrypted_base;
+
  #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..34e6ade4f73c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
  #include 
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;

  struct io_tlb_mem io_tlb_default_mem;

+phys_addr_t swiotlb_unencrypted_base;
+
  /*
   * Max segment that we can provide which (if pages are contiguous) will
   * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
 return DIV_ROUND_UP(val, IO_TLB_SIZE);
  }

+/*
+ * Remap swiotlb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+   void *vaddr = NULL;
+
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+   vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+   if (!vaddr)
+   pr_err("Faile

RE: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-12-03 Thread Michael Kelley (LINUX) via iommu
From: Tianyu Lan  Sent: Wednesday, December 1, 2021 8:03 AM
> 
> Hyper-V Isolation VM requires bounce buffer support to copy
> data from/to encrypted memory, so enable swiotlb force
> mode to use the swiotlb bounce buffer for DMA transactions.
> 
> In Isolation VM with AMD SEV, the bounce buffer needs to be
> accessed via an extra address space which is above shared_gpa_boundary
> (e.g. the 39-bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
> The accessed physical address will be the original physical address +
> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV-SNP
> spec is called virtual top of memory (vTOM). Memory addresses below
> vTOM are automatically treated as private while memory above
> vTOM is treated as shared.
> 
> Hyper-V initializes the swiotlb bounce buffer and the default swiotlb
> needs to be disabled. pci_swiotlb_detect_override() and
> pci_swiotlb_detect_4gb() enable the default one. To override
> the setting, hyperv_swiotlb_detect() needs to run before
> these detect functions, which depend on pci_xen_swiotlb_init().
> Make pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect()
> to keep the order.
> 
> Swiotlb bounce buffer code calls set_memory_decrypted()
> to mark the bounce buffer visible to the host and maps it in the extra
> address space via memremap. Populate the shared_gpa_boundary
> (vTOM) via the swiotlb_unencrypted_base variable.
> 
> The map function memremap() can't work in the early
> hyperv_iommu_swiotlb_init(), so call swiotlb_update_mem_attributes()
> in hyperv_iommu_swiotlb_later_init().
> 
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/xen/pci-swiotlb-xen.c |  3 +-
>  drivers/hv/vmbus_drv.c |  3 ++
>  drivers/iommu/hyperv-iommu.c   | 56 ++
>  include/linux/hyperv.h |  8 +
>  4 files changed, 69 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
> index 46df59aeaa06..30fd0600b008 100644
> --- a/arch/x86/xen/pci-swiotlb-xen.c
> +++ b/arch/x86/xen/pci-swiotlb-xen.c
> @@ -4,6 +4,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include 
> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
> 
>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
> -   NULL,
> +   hyperv_swiotlb_detect,
> pci_xen_swiotlb_init,
> NULL);
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 392c1ac4f819..0a64ccfafb8b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "hyperv_vmbus.h"
> 
> @@ -2078,6 +2079,7 @@ struct hv_device *vmbus_device_create(const guid_t 
> *type,
>   return child_device_obj;
>  }
> 
> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
>  /*
>   * vmbus_device_register - Register the child device
>   */
> @@ -2118,6 +2120,7 @@ int vmbus_device_register(struct hv_device 
> *child_device_obj)
>   }
>   hv_debug_add_dev_dir(child_device_obj);
> 
> + child_device_obj->device.dma_mask = &vmbus_dma_mask;
>   return 0;
> 
>  err_kset_unregister:
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index e285a220c913..dd729d49a1eb 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -13,14 +13,20 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "irq_remapping.h"
> 
> @@ -337,4 +343,54 @@ static const struct irq_domain_ops 
> hyperv_root_ir_domain_ops = {
>   .free = hyperv_root_irq_remapping_free,
>  };
> 
> +static void __init hyperv_iommu_swiotlb_init(void)
> +{
> + unsigned long hyperv_io_tlb_size;
> + void *hyperv_io_tlb_start;
> +
> + /*
> +  * Allocate Hyper-V swiotlb bounce buffer at early place
> +  * to reserve large contiguous memory.
> +  */
> + hyperv_io_tlb_size = swiotlb_size_or_default();
> + hyperv_io_tlb_start = memblock_alloc(hyperv_io_tlb_size, PAGE_SIZE);
> +
> + if (!hyperv_io_tlb_start)
> + pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");

In the error case, won't swiotlb_init_with_tbl() end up panic'ing when
it tries to zero out the memory?  The only real choice here is to
return immediately after printing the message, and not call
swiotlb_init_with_tbl().

> +
> + swiotlb_init_with_tbl(hyperv_io_tlb_start,
> +   hyperv_io_tlb_size >> IO_TLB_SHIFT, true);
> +}
> +
> +int __init hyperv_swiotlb_detect(void)
> +{
> + if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))
> + return 0;
> +
> + if (!hv_is_isolation_supported())
> + return 0;
> +
> + /*
> +  * Enable swiotlb force mode in Isolation VM to
> +  * use swiotlb bounce buffer for dma transaction.
> +  */
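
The archived message is truncated here. Based on the commit message above, the remainder plausibly continues along these lines (a sketch, not the verbatim patch):

	swiotlb_force = SWIOTLB_FORCE;
	return 1;
}

static void __init hyperv_iommu_swiotlb_later_init(void)
{
	/*
	 * The bounce buffer must be remapped into the extra (above
	 * vTOM) address space; memremap() does not work in the early
	 * hyperv_iommu_swiotlb_init(), so do the remap from here.
	 */
	swiotlb_update_mem_attributes();
}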

Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-12-03 Thread Tom Lendacky via iommu

On 12/3/21 5:20 AM, Tianyu Lan wrote:

On 12/2/2021 10:42 PM, Tom Lendacky wrote:

On 12/1/21 10:02 AM, Tianyu Lan wrote:

From: Tianyu Lan 

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed via
an extra address space which is above shared_gpa_boundary (e.g. the 39-bit
address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The accessed
physical address will be the original physical address + shared_gpa_boundary.
The shared_gpa_boundary in the AMD SEV-SNP spec is called virtual top of
memory (vTOM). Memory addresses below vTOM are automatically treated as
private while memory above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set the unencrypted
memory base offset; the platform calls swiotlb_update_mem_attributes()
to remap the swiotlb mem to the unencrypted address space. memremap()
cannot be called in the early stage, so the remapping code is put into
swiotlb_update_mem_attributes(). Store the remap address and use it to
copy data from/to the swiotlb bounce buffer.

Signed-off-by: Tianyu Lan 


This patch results in the following stack trace during a bare-metal boot
on my EPYC system with SME active (e.g. mem_encrypt=on):

[    0.123932] BUG: Bad page state in process swapper  pfn:108001
[    0.123942] page:(ptrval) refcount:0 mapcount:-128 mapping: index:0x0 pfn:0x108001
[    0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
[    0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80 
[    0.123954] raw:   ff7f 
[    0.123955] page dumped because: nonzero mapcount
[    0.123957] Modules linked in:
[    0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2

[    0.123964] Hardware name: AMD Corporation
[    0.123967] Call Trace:
[    0.123971]  
[    0.123975]  dump_stack_lvl+0x48/0x5e
[    0.123985]  bad_page.cold+0x65/0x96
[    0.123990]  __free_pages_ok+0x3a8/0x410
[    0.123996]  memblock_free_all+0x171/0x1dc
[    0.124005]  mem_init+0x1f/0x14b
[    0.124011]  start_kernel+0x3b5/0x6a1
[    0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
[    0.124022]  

I see ~40 of these traces, each for different pfns.

Thanks,
Tom


Hi Tom:
   Thanks for your test. Could you help test the following patch
and check whether it fixes the issue?


The patch is mangled. Is the only difference where set_memory_decrypted() 
is called?


Thanks,
Tom




diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:       The end address of the swiotlb memory pool. Used to do a quick
  *             range check to see if the memory was in fact allocated by this
  *             API.
+ * @vaddr:     The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ *             may be remapped in the memory encrypted case and store virtual
+ *             address for bounce buffer operation.
  * @nslabs:    The number of IO TLB blocks (in groups of 64) between @start and
  *             @end. For default swiotlb, this is command line adjustable via
  *             setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
  struct io_tlb_mem {
     phys_addr_t start;
     phys_addr_t end;
+   void *vaddr;
     unsigned long nslabs;
     unsigned long used;
     unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)

  }
  #endif /* CONFIG_DMA_RESTRICTED_POOL */

+extern phys_addr_t swiotlb_unencrypted_base;
+
  #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..34e6ade4f73c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
  #include 
  #include 

+#include 
  #include 
  #include 
  #include 
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;

  struct io_tlb_mem io_tlb_default_mem;

+phys_addr_t swiotlb_unencrypted_base;
+
  /*
   * Max segment that we can provide which (if pages are contiguous) will
   * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
     return DIV_ROUND_UP(val, IO_TLB_SIZE);
  }

+/*
+ * Remap swiotlb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+   void *vaddr = NULL;
+
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+   vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+   if (!vaddr)
+   pr_err("Failed to map the unencrypted memory %llx 
size %lx.\n",

+  paddr, bytes);
+   }
+
+   r

RE: [PATCH V3 5/5] hv_netvsc: Add Isolation VM support for netvsc driver

2021-12-03 Thread Michael Kelley (LINUX) via iommu
From: Tianyu Lan  Sent: Wednesday, December 1, 2021 8:03 AM
> 
> In Isolation VM, all shared memory with the host needs to be marked
> visible to the host via a hvcall. vmbus_establish_gpadl() has already
> done it for the netvsc rx/tx ring buffer. The page buffer used by
> vmbus_sendpacket_pagebuffer() still needs to be handled. Use the DMA
> API to map/unmap this memory when sending/receiving packets, and the
> Hyper-V swiotlb bounce buffer DMA address will be returned. The
> swiotlb bounce buffer has been marked visible to the host during boot.
> 
> The rx/tx ring buffer is allocated via vzalloc() and needs to be
> mapped into the unencrypted address space (above vTOM) before sharing
> it with the host and accessing it. Add hv_map/unmap_memory() to
> map/unmap the rx/tx ring buffer.
> 
> Signed-off-by: Tianyu Lan 
> ---
> Change since v2:
>* Add hv_map/unmap_memory() to map/unmap rx/tx ring buffer.
> ---
>  arch/x86/hyperv/ivm.c |  28 ++
>  drivers/hv/hv_common.c|  11 +++
>  drivers/net/hyperv/hyperv_net.h   |   5 ++
>  drivers/net/hyperv/netvsc.c   | 136 +-
>  drivers/net/hyperv/netvsc_drv.c   |   1 +
>  drivers/net/hyperv/rndis_filter.c |   2 +
>  include/asm-generic/mshyperv.h|   2 +
>  include/linux/hyperv.h|   5 ++
>  8 files changed, 187 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 69c7a57f3307..9f78d8f67ea3 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -287,3 +287,31 @@ int hv_set_mem_host_visibility(unsigned long kbuffer, int pagecount, bool visible)
>   kfree(pfn_array);
>   return ret;
>  }
> +
> +/*
> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
> + */
> +void *hv_map_memory(void *addr, unsigned long size)
> +{
> + unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,

This should be just PAGE_SIZE, as this code is unrelated to communication
with Hyper-V.

> +   sizeof(unsigned long), GFP_KERNEL);
> + void *vaddr;
> + int i;
> +
> + if (!pfns)
> + return NULL;
> +
> + for (i = 0; i < size / PAGE_SIZE; i++)
> + pfns[i] = virt_to_hvpfn(addr + i * PAGE_SIZE) +

Same here:  Use virt_to_pfn().

> + (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
> +
> + vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
> + kfree(pfns);
> +
> + return vaddr;
> +}
> +
> +void hv_unmap_memory(void *addr)
> +{
> + vunmap(addr);
> +}
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 7be173a99f27..3c5cb1f70319 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -295,3 +295,14 @@ u64 __weak hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
>   return HV_STATUS_INVALID_PARAMETER;
>  }
>  EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
> +
> +void __weak *hv_map_memory(void *addr, unsigned long size)
> +{
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(hv_map_memory);
> +
> +void __weak hv_unmap_memory(void *addr)
> +{
> +}
> +EXPORT_SYMBOL_GPL(hv_unmap_memory);
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 315278a7cf88..cf69da0e296c 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -164,6 +164,7 @@ struct hv_netvsc_packet {
>   u32 total_bytes;
>   u32 send_buf_index;
>   u32 total_data_buflen;
> + struct hv_dma_range *dma_range;
>  };
> 
>  #define NETVSC_HASH_KEYLEN 40
> @@ -1074,6 +1075,7 @@ struct netvsc_device {
> 
>   /* Receive buffer allocated by us but manages by NetVSP */
>   void *recv_buf;
> + void *recv_original_buf;
>   u32 recv_buf_size; /* allocated bytes */
>   struct vmbus_gpadl recv_buf_gpadl_handle;
>   u32 recv_section_cnt;
> @@ -1082,6 +1084,7 @@ struct netvsc_device {
> 
>   /* Send buffer allocated by us */
>   void *send_buf;
> + void *send_original_buf;
>   u32 send_buf_size;
>   struct vmbus_gpadl send_buf_gpadl_handle;
>   u32 send_section_cnt;
> @@ -1731,4 +1734,6 @@ struct rndis_message {
>  #define RETRY_US_HI  1
>  #define RETRY_MAX 2000 /* >10 sec */
> 
> +void netvsc_dma_unmap(struct hv_device *hv_dev,
> +   struct hv_netvsc_packet *packet);
>  #endif /* _HYPERV_NET_H */
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 396bc1c204e6..b7ade735a806 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
>   int i;
> 
>   kfree(nvdev->extension);
> - vfree(nvdev->recv_buf);
> - vfree(nvdev->send_buf);
> +
> + if (nvdev->recv_original_buf) {
> + hv_unmap_memory(nvdev->recv_buf);
> + vfree(nvdev->recv_original_buf);
> + } else {
> + vfree(nvdev->recv_buf);
> + }
> +
> + if (nvdev->send_original_buf) {
> + hv_unmap_memory(nvdev->send_buf);
> + vfree(nvdev->send_original_buf);
> + } else {
> + vfree(nvdev->send_buf);
> + }
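
The archive cuts off the rest of the patch (the send-buffer branch above is completed from the symmetric recv-buffer case). For reference, a sketch of the netvsc_dma_unmap() helper declared in the header hunk above, assuming struct hv_dma_range carries a dma address plus a mapping_size (inferred from the description, not quoted from the patch):

void netvsc_dma_unmap(struct hv_device *hv_dev,
		      struct hv_netvsc_packet *packet)
{
	int i;

	if (!hv_is_isolation_supported())
		return;

	if (!packet->dma_range)
		return;

	/* Undo the per-page-buffer dma_map_single() done at send time. */
	for (i = 0; i < packet->page_buf_cnt; i++)
		dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
				 packet->dma_range[i].mapping_size,
				 DMA_TO_DEVICE);

	kfree(packet->dma_range);
}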

Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation

2021-12-03 Thread Ross Philipson
On 12/3/21 11:03, Robin Murphy wrote:
> On 2021-12-03 15:47, Ross Philipson wrote:
>> On 12/2/21 12:26, Robin Murphy wrote:
>>> On 2021-08-27 14:28, Ross Philipson wrote:
>>> [...]
 +IOMMU Configuration
 +---
 +
 +When doing a Secure Launch, the IOMMU should always be enabled and the drivers
 +loaded. However, IOMMU passthrough mode should never be used. This leaves the
 +MLE completely exposed to DMA after the PMRs [2]_ are disabled. First, IOMMU
 +passthrough should be disabled by default in the build configuration::
 +
 +  "Device Drivers" -->
 +  "IOMMU Hardware Support" -->
 +  "IOMMU passthrough by default [ ]"
 +
 +This unsets the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
>>>
>>> Note that the config structure has now changed, and if set, passthrough
>>> is deselected by choosing a different default domain type.
>>
>> Thanks for the heads up. We will have to modify this for how things
>> exist today.
>>
>>>
 +In addition, passthrough must be disabled on the kernel command line when doing
 +a Secure Launch as follows::
 +
 +  iommu=nopt iommu.passthrough=0
>>>
>>> This part is a bit silly - those options are literally aliases for the
>>> exact same thing, and furthermore if the config is already set as
>>> required then the sole effect either of them will have is to cause "(set
>>> by kernel command line)" to be printed. There is no value in explicitly
>>> overriding the default value with the default value - if anyone can
>>> append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of
>>> the command line they'll still win.
>>
>> I feel like when we worked on this, it was still important to set those
>> values. This could have been in an older kernel version. We will go back
>> and verify what you are saying here and adjust the documentation
>> accordingly.
>>
>> As to anyone just adding values to the command line, that is why the
>> command line is part of the DRTM measurements.
> 
> Yeah, I had a vague memory that that was the case - basically if you can
> trust the command line as much as the config then it's definitely
> redundant to pass an option for this (see iommu_subsys_init() - it's now
> all plumbed through iommu_def_domain_type), and if you can't then
> passing them is futile anyway.

Thank you for your feedback.

Ross

> 
> Cheers,
> Robin.


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-03 Thread Jason Gunthorpe via iommu
On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Jason,
> 
> On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
> >> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> >> pro and cons.
> >> >
> >> > This decision seems to drive the question of how many 'struct devices'
> >> > do we need, and where do we get them..
> >> 
> >> Not really. There is nothing what enforces to make the MSI irqdomain
> >> storage strictly hang off struct device. There has to be a connection to
> >> a struct device in some way obviously to make IOMMU happy.
> >
> > Yes, I thought this too, OK we agree
> 
> One thing which just occured to me and I missed so far is the
> association of the irqdomain.

Oh? I thought this was part of what you were thinking when talking
about using the mdev struct device.

> but I really need to look at all of the MSI infrastructure code whether
> it makes assumptions about this:
> 
>msi_desc->dev->irqdomain
> 
> being the correct one. Which would obviously just be correct when we'd
> have a per queue struct device :)

Right

I'd half expect the msi_desc to gain a pointer to the 'lfunc'

Then msi_desc->dev becomes msi_desc->origin_physical_device and is
only used by IOMMU code to figure out MSI remapping/etc. Maybe that
allows solving your aliasing problem (is this the RID aliasing we see
with iommu_groups too?)

> > Lets imagine a simple probe function for a simple timer device. It has
> > no cdev, vfio, queues, etc. However, the device only supports IMS. No
> > MSI, MSI-X, nothing else.
> >
> > What does the probe function look like to get to request_irq()?
> 
> The timer device would be represented by what? A struct device?

Lets say it is a pci_device and probe is a pci_driver probe function

> If so then something needs to set device->irqdomain and then you can
> just do:
> 
>  msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

Okay, but that doesn't work if it is a PCI device and dev->irqdomain
is already the PCI MSI and INTx irqdomain?

> To do that, I need to understand the bigger picture and the intended
> usage because otherwise we end up with an utter mess.

Sure
 
> >> The idea is to have a common representation for these type of things
> >> which allows:
> >> 
> >>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
> >> 
> >> You still need myqueue specific code, but the common stuff which is
> >> in struct logical_function can be generic and device independent.
> >
> > I will quote you: "Seriously, you cannot make something uniform which
> > is by definition non-uniform." :)
> >
> > We will find there is no common stuff here, we did this exercise
> > already when we designed struct auxiliary_device, and came up
> > empty.
> 
> Really?

Yes.. It was a long drawn out affair, and auxiliary_device is just a
simple container

> >>  2) Having the MSI storage per logical function (queue) allows to have
> >> a queue relative 0 based MSI index space.
> >
> > Can you explain a bit what you see 0 meaning in this idx numbering?
> >
> > I also don't understand what 'queue relative' means?
> >
> >> The actual index in the physical table (think IMS) would be held in
> >> the msi descriptor itself.
> >
> > This means that a device like IDXD would store the phsical IMS table
> > entry # in the msi descriptor? What is idx then?
> >
> > For a device like IDXD with a simple linear table, how does the driver
> > request a specific entry in the IMS table? eg maybe IMS table entry #0
> > is special like we often see in MSI?
> 
> If there is a hardwired entry then this hardwired entry belongs to the
> controlblock (status, error or whatever), but certainly not to a
> dynamically allocatable queue as that would defeat the whole purpose.
> 
> That hardwired entry could surely exist in a IDXD type setup where the
> message storage is in an defined array on the device.

Agree - so how does it get accessed?

> But with the use case you described where you store the message at some
> place in per queue system memory, the MSI index is not relevant at all,
> because there is no table. So it's completely meaningless for the device
> and just useful for finding the related MSI descriptor again.

Yes

What I don't see is how exactly we make this situation work. The big
picture sequence would be:

 - Device allocates a queue and gets a queue handle
 - Device finds a free msi_desc and associates that msi_desc with the
   queue handle
 - Device specific irq_chip uses the msi_desc to get the queue handle
   and programs the addr/data

In this regard the msi_desc is just an artifact of the API, the real
work is in the msi_desc.

> Or has each queue and controlblock and whatever access to a shared large
> array where the messages are stored and the indices are handed out to
> the queues and con
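
(The archived message truncates above.) To make the three-step sequence Jason sketches concrete, a purely hypothetical illustration; none of these names (mydev_alloc_queue(), an ims_irqdomain driver field, passing the queue handle into the MSI allocation call) exist as kernel APIs today:

/* Hypothetical driver-side view of the sequence sketched above. */
struct myqueue *myqueue_create(struct mydev *mydev)
{
	struct myqueue *q;
	int virq;

	/* 1) The device allocates a queue and gets a queue handle. */
	q = mydev_alloc_queue(mydev);
	if (!q)
		return NULL;

	/*
	 * 2) Allocate an MSI descriptor from the per-device IMS domain
	 *    and associate it with the queue handle (hypothetical API).
	 */
	virq = msi_irq_domain_alloc_irqs(mydev->ims_irqdomain,
					 &mydev->pdev->dev, q);
	if (virq < 0) {
		mydev_free_queue(q);
		return NULL;
	}

	/*
	 * 3) The device-specific irq_chip later retrieves the queue
	 *    handle from the msi_desc and programs addr/data into the
	 *    queue's memory-resident message slot.
	 */
	q->virq = virq;
	return q;
}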

Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation

2021-12-03 Thread Robin Murphy

On 2021-12-03 15:47, Ross Philipson wrote:

On 12/2/21 12:26, Robin Murphy wrote:

On 2021-08-27 14:28, Ross Philipson wrote:
[...]

+IOMMU Configuration
+---
+
+When doing a Secure Launch, the IOMMU should always be enabled and the drivers
+loaded. However, IOMMU passthrough mode should never be used. This leaves the
+MLE completely exposed to DMA after the PMRs [2]_ are disabled. First, IOMMU
+passthrough should be disabled by default in the build configuration::
+
+  "Device Drivers" -->
+  "IOMMU Hardware Support" -->
+  "IOMMU passthrough by default [ ]"
+
+This unsets the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.


Note that the config structure has now changed, and if set, passthrough
is deselected by choosing a different default domain type.


Thanks for the heads up. We will have to modify this for how things
exist today.




+In addition, passthrough must be disabled on the kernel command line when doing
+a Secure Launch as follows::
+
+  iommu=nopt iommu.passthrough=0


This part is a bit silly - those options are literally aliases for the
exact same thing, and furthermore if the config is already set as
required then the sole effect either of them will have is to cause "(set
by kernel command line)" to be printed. There is no value in explicitly
overriding the default value with the default value - if anyone can
append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of
the command line they'll still win.


I feel like when we worked on this, it was still important to set those
values. This could have been in an older kernel version. We will go back
and verify what you are saying here and adjust the documentation
accordingly.

As to anyone just adding values to the command line, that is why the
command line is part of the DRTM measurements.


Yeah, I had a vague memory that that was the case - basically if you can 
trust the command line as much as the config then it's definitely 
redundant to pass an option for this (see iommu_subsys_init() - it's now 
all plumbed through iommu_def_domain_type), and if you can't then 
passing them is futile anyway.


Cheers,
Robin.

Re: [PATCH v4 04/14] Documentation/x86: Secure Launch kernel documentation

2021-12-03 Thread Ross Philipson
On 12/2/21 12:26, Robin Murphy wrote:
> On 2021-08-27 14:28, Ross Philipson wrote:
> [...]
>> +IOMMU Configuration
>> +---
>> +
>> +When doing a Secure Launch, the IOMMU should always be enabled and the drivers
>> +loaded. However, IOMMU passthrough mode should never be used. This leaves the
>> +MLE completely exposed to DMA after the PMRs [2]_ are disabled. First, IOMMU
>> +passthrough should be disabled by default in the build configuration::
>> +
>> +  "Device Drivers" -->
>> +  "IOMMU Hardware Support" -->
>> +  "IOMMU passthrough by default [ ]"
>> +
>> +This unsets the Kconfig value CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
> 
> Note that the config structure has now changed, and if set, passthrough
> is deselected by choosing a different default domain type.

Thanks for the heads up. We will have to modify this for how things
exist today.

> 
>> +In addition, passthrough must be disabled on the kernel command line when doing
>> +a Secure Launch as follows::
>> +
>> +  iommu=nopt iommu.passthrough=0
> 
> This part is a bit silly - those options are literally aliases for the
> exact same thing, and furthermore if the config is already set as
> required then the sole effect either of them will have is to cause "(set
> by kernel command line)" to be printed. There is no value in explicitly
> overriding the default value with the default value - if anyone can
> append an additional "iommu.passthrough=1" (or "iommu=pt") to the end of
> the command line they'll still win.

I feel like when we worked on this, it was still important to set those
values. This could have been in an older kernel version. We will go back
and verify what you are saying here and adjust the documentation
accordingly.

As to anyone just adding values to the command line, that is why the
command line is part of the DRTM measurements.

Thank you,
Ross

> 
> Robin.


Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-03 Thread Thomas Gleixner
Jason,

On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
>> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> >> dedicated xarray or by partitioning the xarray space. Both have their
>> >> pro and cons.
>> >
>> > This decision seems to drive the question of how many 'struct devices'
>> > do we need, and where do we get them..
>> 
>> Not really. There is nothing what enforces to make the MSI irqdomain
>> storage strictly hang off struct device. There has to be a connection to
>> a struct device in some way obviously to make IOMMU happy.
>
> Yes, I thought this too, OK we agree

One thing which just occured to me and I missed so far is the
association of the irqdomain.

Right now we know for each device which irqdomain it belongs to. That's
part of device discovery (PCI, device tree or whatever) which sets up

 device->irqdomain

That obviously cannot work for that device specific IMS domain. Why?

Because the PCIe device provides MSI[x] which obviously requires that
device->irqdomain is pointing to the PCI/MSI irqdomain. Otherwise the
PCI/MSI allocation mechanism won't work.

Sure each driver can have

 mystruct->ims_irqdomain

and then do the allocation via

msi_irq_domain_alloc(mystruct->ims_irqdomain)

but I really need to look at all of the MSI infrastructure code whether
it makes assumptions about this:

   msi_desc->dev->irqdomain

being the correct one. Which would obviously just be correct when we'd
have a per queue struct device :)

>> Just badly named because the table itself is where the resulting message
>> is stored, which is composed with the help of the relevant MSI
>> descriptor. See above.
>
> I picked the name because it looks like it will primarily contain an
> xarray and the API you suggested to query it is idx based. To me that
> is a table. A table of msi_desc storage to be specific.
>
> It seems we have some agreement here as your lfunc also included the
> same xarray and uses an idx as part of the API, right?

It's a per lfunc xarray, yes.

>> We really should not try to make up an artifical table representation
>> for something which does not necessarily have a table at all, i.e. the
>> devices you talk about which store the message in queue specific system
>> memory. Pretending that this is a table is just silly.
>
> Now I'm a bit confused, what is the idx in the lfunc? 
>
> I think this is language again, I would call idx an artificial table
> representation?

Ok. I prefer the term indexed storage, but yeah.

>> I really meant a container like this:
>> 
>> struct logical_function {
>> /* Pointer to the physical device */
>> struct device *phys_device;
>> /* MSI descriptor storage */
>>  struct msi_data msi;
>
> Up to here is basically what I thought the "message table" would
> contain. Possibly a pointer to the iommu_domain too?
>
>> /* The queue number */
>> unsigned int queue_nr;
>> /* Add more information which is common to these things */
>
> Not sure why do we need this?
>
> Lets imagine a simple probe function for a simple timer device. It has
> no cdev, vfio, queues, etc. However, the device only supports IMS. No
> MSI, MSI-X, nothing else.
>
> What does the probe function look like to get to request_irq()?

The timer device would be represented by what? A struct device?

If so then something needs to set device->irqdomain and then you can
just do:

 msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

> Why does this simple driver create something called a 'logical
> function' to access its only interrupt?

It does not have to when it's struct device based.

> I think you have the right idea here, just forget about VFIO, the IDXD
> use case, all of it. Provide a way to use IMS cleanly and concurrently
> with MSI.
>
> Do that and everything else should have sane solutions too.

To do that, I need to understand the bigger picture and the intended
usage because otherwise we end up with an utter mess.

>> The idea is to have a common representation for these type of things
>> which allows:
>> 
>>  1) Have common code for exposing queues to VFIO, cdev, sysfs...
>> 
>> You still need myqueue specific code, but the common stuff which is
>> in struct logical_function can be generic and device independent.
>
> I will quote you: "Seriously, you cannot make something uniform which
> is by definition non-uniform." :)
>
> We will find there is no common stuff here, we did this exercise
> already when we designed struct auxiliary_device, and came up
> empty.

Really?

>>  2) Having the MSI storage per logical function (queue) allows to have
>> a queue relative 0 based MSI index space.
>
> Can you explain a bit what you see 0 meaning in this idx numbering?
>
> I also don't understand what 'queue relative' means?
>
>> The actual index in the physical table (think IMS) would be held in
>> the msi descriptor itself.

Re: [RFC v16 0/9] SMMUv3 Nested Stage Setup (IOMMU part)

2021-12-03 Thread Sumit Gupta via iommu

Hi Eric,


This series brings the IOMMU part of HW nested paging support
in the SMMUv3.

The SMMUv3 driver is adapted to support 2 nested stages.

The IOMMU API is extended to convey the guest stage 1
configuration and the hook is implemented in the SMMUv3 driver.

This allows the guest to own the stage 1 tables and context
descriptors (so-called PASID table) while the host owns the
stage 2 tables and main configuration structures (STE).

This work mainly is provided for test purpose as the upper
layer integration is under rework and bound to be based on
/dev/iommu instead of VFIO tunneling. In this version we also get
rid of the MSI BINDING ioctl, assuming the guest enforces
flat mapping of host IOVAs used to bind physical MSI doorbells.
In the current QEMU integration this is achieved by exposing
RMRs to the guest, using Shameer's series [1]. This approach
is RFC as the IORT spec is not really meant to do that
(single mapping flag limitation).

Best Regards

Eric

This series (Host) can be found at:
https://github.com/eauger/linux/tree/v5.15-rc7-nested-v16
This includes a rebased VFIO integration (although not meant
to be upstreamed)

Guest kernel branch can be found at:
https://github.com/eauger/linux/tree/shameer_rmrr_v7
featuring [1]

QEMU integration (still based on VFIO and exposing RMRs)
can be found at:
https://github.com/eauger/qemu/tree/v6.1.0-rmr-v2-nested_smmuv3_v10
(use iommu=nested-smmuv3 ARM virt option)

Guest dependency:
[1] [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

History:

v15 -> v16:
- guest must support RIL
- additional checks in the cache invalidation hook
- removal of the MSI BINDING ioctl (tentative replacement by RMRs)


Eric Auger (9):
   iommu: Introduce attach/detach_pasid_table API
   iommu: Introduce iommu_get_nesting
   iommu/smmuv3: Allow s1 and s2 configs to coexist
   iommu/smmuv3: Get prepared for nested stage support
   iommu/smmuv3: Implement attach/detach_pasid_table
   iommu/smmuv3: Allow stage 1 invalidation with unmanaged ASIDs
   iommu/smmuv3: Implement cache_invalidate
   iommu/smmuv3: report additional recoverable faults
   iommu/smmuv3: Disallow nested mode in presence of HW MSI regions

Hi Eric,

I validated the reworked test patches in v16 from the given
branches with kernel v5.15 & QEMU v6.2, and verified them with an
NVMe PCI device assigned to a guest VM.
Sorry, I forgot to update earlier.

Tested-by: Sumit Gupta 

Thanks,
Sumit Gupta



Re: [PATCH 0/3] Allow restricted-dma-pool to customize IO_TLB_SEGSIZE

2021-12-03 Thread Robin Murphy

On 2021-11-25 07:35, Tomasz Figa wrote:

Hi Robin,

On Tue, Nov 23, 2021 at 8:59 PM Robin Murphy  wrote:


On 2021-11-23 11:21, Hsin-Yi Wang wrote:

Default IO_TLB_SEGSIZE (128) slabs may not be enough for some use cases.
This series adds support to customize io_tlb_segsize for each
restricted-dma-pool.

Example use case:

mtk-isp drivers[1] are controlled by mtk-scp[2] and allocate memory through
mtk-scp. In order to use the noncontiguous DMA API[3], we need to use
the swiotlb pool. mtk-scp needs to allocate memory with 2560 slabs.
mtk-isp drivers also need to allocate memory with 200+ slabs. Both are
larger than the default IO_TLB_SEGSIZE (128) slabs.


Are drivers really doing streaming DMA mappings that large? If so, that
seems like it might be worth trying to address in its own right for the
sake of efficiency - allocating ~5MB of memory twice and copying it back
and forth doesn't sound like the ideal thing to do.

If it's really about coherent DMA buffer allocation, I thought the plan
was that devices which expect to use a significant amount and/or size of
coherent buffers would continue to use a shared-dma-pool for that? It's
still what the binding implies. My understanding was that
swiotlb_alloc() is mostly just a fallback for the sake of drivers which
mostly do streaming DMA but may allocate a handful of pages worth of
coherent buffers here and there. Certainly looking at the mtk_scp
driver, that seems like it shouldn't be going anywhere near SWIOTLB at all.


First, thanks a lot for taking a look at this patch series.

The drivers would do streaming DMA within a reserved region that is
the only memory accessible to them for security reasons. This seems to
exactly match the definition of the restricted pool as merged
recently.


Huh? Of the drivers indicated, the SCP driver is doing nothing but 
coherent allocations, and I'm not entirely sure what those ISP driver 
patches are supposed to be doing but I suspect it's probably just buffer 
allocation too. I don't see any actual streaming DMA anywhere :/



The new dma_alloc_noncontiguous() API would allow allocating suitable
memory directly from the pool, which would eliminate the need to copy.


Can you clarify what's being copied, and where? I'm not all that 
familiar with the media APIs, but I thought it was all based around 
preallocated DMA buffers (the whole dedicated "videobuf" thing)? The few 
instances of actual streaming DMA I can see in drivers/media/ look to be 
mostly PCI drivers mapping private descriptors, whereas the MTK ISP 
appears to be entirely register-based.



However, for a restricted pool, this would exercise the SWIOTLB
allocator, which currently suffers from the limitation as described by
Hsin-Yi. Since the allocator in general is quite general purpose and
already used for coherent allocations as per the current restricted
pool implementation, I think it indeed makes sense to lift the
limitation, rather than trying to come up with yet another thing.


No, just fix the dma_alloc_noncontiguous() fallback case to split the 
allocation into dma_max_mapping_size() chunks. *That* makes sense.


Thanks,
Robin.
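
For illustration, the chunked fallback Robin describes could be shaped like this sketch (hypothetical helper, not the existing dma_alloc_noncontiguous() code; error unwinding is elided):

/*
 * Allocate and map a buffer as dma_max_mapping_size() sized chunks so
 * that no single streaming mapping exceeds what the device (or its
 * SWIOTLB pool) can bounce.
 */
static struct sg_table *alloc_noncontiguous_chunked(struct device *dev,
		size_t size, enum dma_data_direction dir, gfp_t gfp)
{
	size_t max = PAGE_ALIGN(dma_max_mapping_size(dev));
	unsigned int nents = DIV_ROUND_UP(size, max);
	struct scatterlist *sg;
	struct sg_table *sgt;
	size_t left = size;
	int i;

	sgt = kzalloc(sizeof(*sgt), gfp);
	if (!sgt || sg_alloc_table(sgt, nents, gfp))
		return NULL;			/* free sgt on error */

	for_each_sg(sgt->sgl, sg, nents, i) {
		size_t len = min(left, max);
		/* alloc_pages() rounds len up to a power-of-two order */
		struct page *page = alloc_pages(gfp, get_order(len));

		if (!page)
			return NULL;		/* unwind pages + table */
		sg_set_page(sg, page, len, 0);
		left -= len;
	}

	/* Each entry is at most dma_max_mapping_size() long, so no
	 * single bounce request can be over-sized. */
	if (dma_map_sgtable(dev, sgt, dir, 0))
		return NULL;			/* unwind here too */

	return sgt;
}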



Best regards,
Tomasz



Robin.


[1] (not in upstream) 
https://patchwork.kernel.org/project/linux-media/cover/20190611035344.29814-1-jungo@mediatek.com/
[2] https://elixir.bootlin.com/linux/latest/source/drivers/remoteproc/mtk_scp.c
[3] 
https://patchwork.kernel.org/project/linux-media/cover/20210909112430.61243-1-senozhat...@chromium.org/

Hsin-Yi Wang (3):
dma: swiotlb: Allow restricted-dma-pool to customize IO_TLB_SEGSIZE
dt-bindings: Add io-tlb-segsize property for restricted-dma-pool
arm64: dts: mt8183: use restricted swiotlb for scp mem

   .../reserved-memory/shared-dma-pool.yaml  |  8 +
   .../arm64/boot/dts/mediatek/mt8183-kukui.dtsi |  4 +--
   include/linux/swiotlb.h   |  1 +
   kernel/dma/swiotlb.c  | 34 ++-
   4 files changed, 37 insertions(+), 10 deletions(-)




Re: [RFC v16 0/9] SMMUv3 Nested Stage Setup (IOMMU part)

2021-12-03 Thread Zhangfei Gao


Hi, Eric

On 2021/10/27 6:44 PM, Eric Auger wrote:

This series brings the IOMMU part of HW nested paging support
in the SMMUv3.

The SMMUv3 driver is adapted to support 2 nested stages.

The IOMMU API is extended to convey the guest stage 1
configuration and the hook is implemented in the SMMUv3 driver.

This allows the guest to own the stage 1 tables and context
descriptors (so-called PASID table) while the host owns the
stage 2 tables and main configuration structures (STE).

This work mainly is provided for test purpose as the upper
layer integration is under rework and bound to be based on
/dev/iommu instead of VFIO tunneling. In this version we also get
rid of the MSI BINDING ioctl, assuming the guest enforces
flat mapping of host IOVAs used to bind physical MSI doorbells.
In the current QEMU integration this is achieved by exposing
RMRs to the guest, using Shameer's series [1]. This approach
is RFC as the IORT spec is not really meant to do that
(single mapping flag limitation).

Best Regards

Eric

This series (Host) can be found at:
https://github.com/eauger/linux/tree/v5.15-rc7-nested-v16
This includes a rebased VFIO integration (although not meant
to be upstreamed)

Guest kernel branch can be found at:
https://github.com/eauger/linux/tree/shameer_rmrr_v7
featuring [1]

QEMU integration (still based on VFIO and exposing RMRs)
can be found at:
https://github.com/eauger/qemu/tree/v6.1.0-rmr-v2-nested_smmuv3_v10
(use iommu=nested-smmuv3 ARM virt option)

Guest dependency:
[1] [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node


Thanks a lot for upgrading these patches.

I have basically verified these patches on HiSilicon Kunpeng920
and integrated them into these branches:
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.16
https://github.com/Linaro/qemu/tree/v6.1.0-rmr-v2-nested_smmuv3_v10

Though they are provided for test purpose,

Tested-by: Zhangfei Gao 

Thanks

Re: [PATCH V3 3/5] hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-12-03 Thread Tianyu Lan

On 12/2/2021 10:43 PM, Wei Liu wrote:

On Wed, Dec 01, 2021 at 11:02:54AM -0500, Tianyu Lan wrote:
[...]

diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 46df59aeaa06..30fd0600b008 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
  
  #include 

  #include 
+#include 
  #include 
  
  #include 

@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
  
  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,

- NULL,
+ hyperv_swiotlb_detect,


It is not immediately obvious why this is needed just by reading the
code. Please consider copying some of the text in the commit message to
a comment here.



Thanks for the suggestion. Will update.
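
For instance, the ordering could be documented right at the use site along these lines (a sketch of the suggested comment, not the actual follow-up):

IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
		  /*
		   * hyperv_swiotlb_detect() must run before the generic
		   * pci_swiotlb_detect_override()/pci_swiotlb_detect_4gb(),
		   * which would otherwise enable the default swiotlb.
		   * Listing it as the "depend" hook enforces that order.
		   */
		  hyperv_swiotlb_detect,
		  pci_xen_swiotlb_init,
		  NULL);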


Re: [PATCH V3 2/5] x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()

2021-12-03 Thread Tianyu Lan

On 12/2/2021 10:39 PM, Wei Liu wrote:

+static bool hyperv_cc_platform_has(enum cc_attr attr)
+{
+#ifdef CONFIG_HYPERV
+   if (attr == CC_ATTR_GUEST_MEM_ENCRYPT)
+   return true;
+   else
+   return false;

This can be simplified as

return attr == CC_ATTR_GUEST_MEM_ENCRYPT;


Wei.


Hi Wei: 
Thanks for your review. Will update.
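
i.e. the helper body would become (sketch, with the CONFIG_HYPERV guard kept as in the posted patch):

static bool hyperv_cc_platform_has(enum cc_attr attr)
{
	/* Hyper-V Isolation VMs advertise only guest memory encryption. */
	return attr == CC_ATTR_GUEST_MEM_ENCRYPT;
}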


Re: [PATCH V3 1/5] Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM

2021-12-03 Thread Tianyu Lan



On 12/2/2021 10:42 PM, Tom Lendacky wrote:

On 12/1/21 10:02 AM, Tianyu Lan wrote:

From: Tianyu Lan 

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed via
an extra address space which is above shared_gpa_boundary (e.g. the 39-bit
address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The accessed
physical address will be the original physical address + shared_gpa_boundary.
The shared_gpa_boundary in the AMD SEV-SNP spec is called virtual top of
memory (vTOM). Memory addresses below vTOM are automatically treated as
private while memory above vTOM is treated as shared.

Expose swiotlb_unencrypted_base for platforms to set the unencrypted
memory base offset; the platform calls swiotlb_update_mem_attributes()
to remap the swiotlb mem to the unencrypted address space. memremap()
cannot be called in the early stage, so the remapping code is put into
swiotlb_update_mem_attributes(). Store the remap address and use it to
copy data from/to the swiotlb bounce buffer.

Signed-off-by: Tianyu Lan 


This patch results in the following stack trace during a bare-metal boot
on my EPYC system with SME active (e.g. mem_encrypt=on):

[    0.123932] BUG: Bad page state in process swapper  pfn:108001
[    0.123942] page:(ptrval) refcount:0 mapcount:-128 mapping: index:0x0 pfn:0x108001
[    0.123946] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
[    0.123952] raw: 0017c000 88904f2d5e80 88904f2d5e80 
[    0.123954] raw:   ff7f 
[    0.123955] page dumped because: nonzero mapcount
[    0.123957] Modules linked in:
[    0.123961] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc3-sos-custom #2

[    0.123964] Hardware name: AMD Corporation
[    0.123967] Call Trace:
[    0.123971]  
[    0.123975]  dump_stack_lvl+0x48/0x5e
[    0.123985]  bad_page.cold+0x65/0x96
[    0.123990]  __free_pages_ok+0x3a8/0x410
[    0.123996]  memblock_free_all+0x171/0x1dc
[    0.124005]  mem_init+0x1f/0x14b
[    0.124011]  start_kernel+0x3b5/0x6a1
[    0.124016]  secondary_startup_64_no_verify+0xb0/0xbb
[    0.124022]  

I see ~40 of these traces, each for different pfns.

Thanks,
Tom


Hi Tom:
  Thanks for your test. Could you help test the following patch
and check whether it fixes the issue?



diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 569272871375..f6c3638255d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,6 +73,9 @@ extern enum swiotlb_force swiotlb_force;
  * @end:       The end address of the swiotlb memory pool. Used to do a quick
  *             range check to see if the memory was in fact allocated by this
  *             API.
+ * @vaddr:     The vaddr of the swiotlb memory pool. The swiotlb memory pool
+ *             may be remapped in the memory encrypted case and store virtual
+ *             address for bounce buffer operation.
  * @nslabs:    The number of IO TLB blocks (in groups of 64) between @start and
  *             @end. For default swiotlb, this is command line adjustable via
  *             setup_io_tlb_npages.
@@ -92,6 +95,7 @@ extern enum swiotlb_force swiotlb_force;
 struct io_tlb_mem {
phys_addr_t start;
phys_addr_t end;
+   void *vaddr;
unsigned long nslabs;
unsigned long used;
unsigned int index;
@@ -186,4 +190,6 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)

 }
 #endif /* CONFIG_DMA_RESTRICTED_POOL */

+extern phys_addr_t swiotlb_unencrypted_base;
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8e840fbbed7c..34e6ade4f73c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -50,6 +50,7 @@
 #include 
 #include 

+#include 
 #include 
 #include 
 #include 
@@ -72,6 +73,8 @@ enum swiotlb_force swiotlb_force;

 struct io_tlb_mem io_tlb_default_mem;

+phys_addr_t swiotlb_unencrypted_base;
+
 /*
  * Max segment that we can provide which (if pages are contiguous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
@@ -155,6 +158,27 @@ static inline unsigned long nr_slots(u64 val)
return DIV_ROUND_UP(val, IO_TLB_SIZE);
 }

+/*
+ * Remap swiotlb memory in the unencrypted physical address space
+ * when swiotlb_unencrypted_base is set. (e.g. for Hyper-V AMD SEV-SNP
+ * Isolation VMs).
+ */
+void *swiotlb_mem_remap(struct io_tlb_mem *mem, unsigned long bytes)
+{
+   void *vaddr = NULL;
+
+   if (swiotlb_unencrypted_base) {
+   phys_addr_t paddr = mem->start + swiotlb_unencrypted_base;
+
+   vaddr = memremap(paddr, bytes, MEMREMAP_WB);
+   if (!vaddr)
+   pr_err("Failed to map the unencrypted memory 
%llx size %lx.\n",

+  paddr, bytes);
+   }
+
+   return vaddr;
+}
+
 /*
  * Early SWIOTLB allocation may be too early to allow an architecture to
  * perform the desired operations.  This function allows the architecture
  * to call SWIOTLB when the operations are possible.  It needs to be called
  * before the SWIOTLB memory is used.
  */
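
The archive entry is cut off above. For completeness, a sketch of how swiotlb_update_mem_attributes() plausibly ties the remap hook together, inferred from the hunks quoted in this thread (the exact upstream code may differ):

void __init swiotlb_update_mem_attributes(void)
{
	struct io_tlb_mem *mem = &io_tlb_default_mem;
	unsigned long bytes;
	void *vaddr;

	if (!mem->nslabs || mem->late_alloc)
		return;

	vaddr = phys_to_virt(mem->start);
	bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);

	/* Mark the bounce buffer decrypted (shared with the host)... */
	set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);

	/*
	 * ...and, when a platform set swiotlb_unencrypted_base, switch
	 * bounce-buffer copies to the remapped (above-vTOM) alias.
	 */
	mem->vaddr = swiotlb_mem_remap(mem, bytes);
	if (!mem->vaddr)
		mem->vaddr = vaddr;

	memset(mem->vaddr, 0, bytes);
}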