Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Jason Wang


On 2021/7/7 11:54 PM, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:

On 2021/7/7 4:55 PM, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:

On 2021/7/7 1:11 AM, Stefan Hajnoczi wrote:

On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:

On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  wrote:

On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:

On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

On 2021/7/4 5:49 PM, Yongji Xie wrote:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters", I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagation. And it would be very hard
to implement such semantics in some transports. A virtqueue should be
much better. As Yongji quoted, the config space is used for
"rarely-changing or initialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption that get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Because:

1) The kernel cannot wait forever in get_config(); this is the major
difference from vhost-user.

virtio_cread() can sleep:

 #define virtio_cread(vdev, structname, member, ptr) \
 do {\
 typeof(((structname*)0)->member) virtio_cread_v;\
 \
 might_sleep();  \
 ^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me that the ideal case for VDUSE, with the current virtio spec, is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work; an example is TUN/TAP:
the netdevice is abstracted in the kernel and the datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.

For config space writes, more thought is required, but fortunately they're
not commonly used. So VDUSE can choose to filter out the
devices/features that depend on config writes.

This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.

Probably not; the userspace device can tell the kernel about the device_features
and mandated_features during 
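
To make the kernel-cached config model outlined above concrete, here is a
minimal sketch in kernel-style C. All names (vduse_dev,
vduse_dev_get_config, vduse_dev_update_config) are hypothetical, not the
real VDUSE uAPI; the point is only that reads are served from a
kernel-side shadow copy and therefore can never block on userspace:

/* Hypothetical sketch: the kernel keeps a shadow copy of the virtio
 * config space, so get_config is served from the cache and cannot be
 * stalled by a buggy/malicious userspace device. */
struct vduse_dev {
        struct mutex lock;
        u32 config_size;
        u8 config[256];                 /* kernel-side shadow copy */
};

static void vduse_dev_get_config(struct vduse_dev *dev, unsigned int offset,
                                 void *buf, unsigned int len)
{
        mutex_lock(&dev->lock);
        if (offset < dev->config_size && len <= dev->config_size - offset)
                memcpy(buf, dev->config + offset, len);
        mutex_unlock(&dev->lock);
}

/* Hypothetical update ioctl: userspace pushes a new snapshot into the
 * shadow copy whenever its device-side config changes. */
static long vduse_dev_update_config(struct vduse_dev *dev,
                                    const void __user *ubuf,
                                    u32 offset, u32 len)
{
        long ret = 0;

        if (offset >= dev->config_size || len > dev->config_size - offset)
                return -EINVAL;

        mutex_lock(&dev->lock);
        if (copy_from_user(dev->config + offset, ubuf, len))
                ret = -EFAULT;
        mutex_unlock(&dev->lock);
        return ret;
}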

RE: [PATCH] swiotlb: add overflow checks to swiotlb_bounce

2021-07-07 Thread 이범용
> This is a follow-up on 5f89468e2f06 ("swiotlb: manipulate orig_addr when
> tlb_addr has offset") which fixed unaligned dma mappings, making sure the
> following overflows are caught:
> 
> - offset of the start of the slot within the device bigger than requested
> address' offset, in other words if the base address given in
> swiotlb_tbl_map_single to create the mapping (orig_addr) was after the
> requested address for the sync (tlb_offset) in the same block:
> 
>  |------------------------| block
>            <--------------> mapped part of the block
>            ^
>            orig_addr
>       ^
>       invalid tlb_addr for sync
> 
> - if the resulting offset was bigger than the allocation size; this one
> could happen if the mapping did not extend to the end of the block, e.g.:
> 
>  |------------------------| block
>       <---------->          mapped part of the block
>       ^          ^
>       orig_addr  invalid tlb_addr
> 
> Both should never happen so print a warning and bail out without trying to
> adjust the sizes/offsets: the first one could try to sync from orig_addr
> to whatever is left of the requested size, but the latter really has
> nothing to sync there...
> 
> Signed-off-by: Dominique Martinet 
> Cc: Konrad Rzeszutek Wilk 
> Cc: Bumyong Lee 

Reviewed-by: Bumyong Lee 

> Cc: Chanho Park 
> Cc: Christoph Hellwig 
> ---
> 
> Hi Konrad,
> 
> here's the follow-up for the swiotlb/caamjr regression I had promised.
> It doesn't really change anything, and I confirmed I don't hit either of
> the warnings on our board, but it's probably best to have them, as either
> could really happen.
> 
> 
>  kernel/dma/swiotlb.c | 20 +---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index e50df8d8f87e..23f8d0b168c5 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -354,13 +354,27 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
>   size_t alloc_size = mem->slots[index].alloc_size;
>   unsigned long pfn = PFN_DOWN(orig_addr);
>   unsigned char *vaddr = phys_to_virt(tlb_addr);
> - unsigned int tlb_offset;
> + unsigned int tlb_offset, orig_addr_offset;
> 
>   if (orig_addr == INVALID_PHYS_ADDR)
>   return;
> 
> - tlb_offset = (tlb_addr & (IO_TLB_SIZE - 1)) -
> -  swiotlb_align_offset(dev, orig_addr);
> + tlb_offset = tlb_addr & (IO_TLB_SIZE - 1);
> + orig_addr_offset = swiotlb_align_offset(dev, orig_addr);
> + if (tlb_offset < orig_addr_offset) {
> + dev_WARN_ONCE(dev, 1,
> + "Access before mapping start detected. orig offset
%u,
> requested offset %u.\n",
> + orig_addr_offset, tlb_offset);
> + return;
> + }
> +
> + tlb_offset -= orig_addr_offset;
> + if (tlb_offset > alloc_size) {
> + dev_WARN_ONCE(dev, 1,
> + "Buffer overflow detected. Allocation size: %zu.
> Mapping size: %zu+%u.\n",
> + alloc_size, size, tlb_offset);
> + return;
> + }
> 
>   orig_addr += tlb_offset;
>   alloc_size -= tlb_offset;
> --
> 2.30.2
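
As a standalone illustration of the two checks added above (with made-up
numbers; IO_TLB_SIZE is 2 KiB in the kernel, and orig_addr_offset stands
in for swiotlb_align_offset(dev, orig_addr)):

#include <stdio.h>

#define IO_TLB_SIZE 2048

static void check(unsigned long tlb_addr, unsigned int orig_addr_offset,
                  unsigned long alloc_size)
{
        unsigned int tlb_offset = tlb_addr & (IO_TLB_SIZE - 1);

        if (tlb_offset < orig_addr_offset) {
                printf("sync before mapping start (offset %u < %u)\n",
                       tlb_offset, orig_addr_offset);
                return;
        }
        tlb_offset -= orig_addr_offset;
        if (tlb_offset > alloc_size) {
                printf("sync past end of mapping (%u > %lu)\n",
                       tlb_offset, alloc_size);
                return;
        }
        printf("ok: sync at offset %u within %lu-byte mapping\n",
               tlb_offset, alloc_size);
}

int main(void)
{
        check(0x800 + 0x100, 0x200, 0x400);     /* first warning case  */
        check(0x800 + 0x700, 0x200, 0x400);     /* second warning case */
        check(0x800 + 0x300, 0x200, 0x400);     /* valid sync          */
        return 0;
}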




Re: [PATCH v2 0/3] iommu: Enable non-strict DMA on QCom SD/MMC

2021-07-07 Thread Doug Anderson
Hi,

On Fri, Jun 25, 2021 at 7:42 AM Doug Anderson  wrote:
>
> Hi,
>
> On Fri, Jun 25, 2021 at 6:19 AM Joerg Roedel  wrote:
> >
> > Hi Douglas,
> >
> > On Thu, Jun 24, 2021 at 10:17:56AM -0700, Douglas Anderson wrote:
> > > The goal of this patch series is to get better SD/MMC performance on
> > > Qualcomm eMMC controllers and in generally nudge us forward on the
> > > path of allowing some devices to be in strict mode and others to be in
> > > non-strict mode.
> >
> > So if I understand it right, this patch-set wants a per-device decision
> > about setting dma-mode to strict vs. non-strict.
> >
> > I think we should get to the reason why strict mode is used by default
> > first. Is iommu-strict mode the default on ARM platforms, and if so,
> > why?
> >
> > The x86 IOMMUs use non-strict mode by default (yes, it is a security
> > trade-off).
>
> It is certainly a good question. I will say that, as per usual, I'm
> fumbling around trying to solve problems in subsystems I'm not an
> expert at, so if something I'm saying sounds like nonsense it probably
> is. Please correct me.
>
> I guess I'd start out by thinking about what devices I think need to
> be in "strict" mode. Most of my thoughts on this are in the 3rd patch
> in the series. I think devices where it's important to be in strict
> mode fall into "Case 1" from that patch description, copied here:
>
> Case 1: IOMMUs prevent malicious code running on the peripheral (maybe
> a malicious peripheral or maybe someone exploited a benign peripheral)
> from turning into an exploit of the Linux kernel. This is particularly
> important if the peripheral has loadable / updatable firmware or if
> the peripheral has some type of general purpose processor and is
> processing untrusted inputs. It's also important if the device is
> something that can be easily plugged into the host and the device has
> direct DMA access itself, like a PCIe device.
>
>
> Using sc7180 as an example (searching for iommus in sc7180.dtsi), I'd
> expect these peripherals to be in strict mode:
>
> * WiFi / LTE - I'm almost certain we want this in "strict" mode. Both
> have loadable / updatable firmware and both do complex processing on
> untrusted inputs. Both have a history of being compromised over the
> air just by being near an attacker. Note that on sc7180 these are
> _not_ connected over PCI so we can't leverage any PCI mechanism for
> deciding strict / non-strict.
>
> * Video decode / encode - pretty sure we want this in strict. It's got
> loadable / updatable firmware and processing complex / untrusted
> inputs.
>
> * LPASS (low power audio subsystem) - I don't know a ton and I think
> we don't use this much on our designs, but I believe it meets the
> definitions for needing "strict".
>
> * The QUPs (handles UART, SPI, and i2c) - I'm not as sure here. These
> are much "smarter" than you'd expect. They have loadable / updatable
> firmware and certainly have a sort of general purpose processor in
> them. They also might be processing untrusted inputs, but presumably
> in a pretty simple way. At the moment we don't use a ton of DMA here
> anyway and these are pretty low speed, so I would tend to leave them
> as strict just to be on the safe side.
>
>
> I'd expect these to be non-strict:
>
> * SD/MMC - as described in this patch series.
>
> * USB - As far as I know firmware isn't updatable and has no history
> of being compromised.
>
>
> Special:
>
> * GPU - This already has a bunch of special cases, so we probably
> don't need to discuss here.
>
>
> As far as I can tell everything in sc7180.dtsi that has an "iommus"
> property is classified above. So, unless I'm wrong and it's totally
> fine to run LTE / WiFi / Video / LPASS in non-strict mode then:
>
> * We still need some way to pick strict vs. non-strict.
>
> * Since I've only identified two peripherals that I think should be
> non-strict, having "strict" the default seems like fewer special
> cases. It's also safer.
>
>
> In terms of thinking about x86 / AMD where the default is non-strict,
> I don't have any historical knowledge there. I guess the use of PCI
> for connecting WiFi is more common (so you can use the PCI special
> cases) and I'd sorta hope that WiFi is running in strict mode. For
> video encode / decode, perhaps x86 / AMD are just accepting the risk
> here because there was no kernel infrastructure for doing better? I'd
> also expect that x86/AMD don't have something quite as crazy as the
> QUPs for UART/I2C/SPI, but even if they do I wouldn't be terribly
> upset if they were in non-strict mode.
>
> ...so I guess maybe the super short answer to everything above is that
> I believe that at least WiFi ought to be in "strict" mode and it's not
> on PCI so we need to come up with some type of per-device solution.

I guess this thread has been silent for a bit of time now. Given that
my previous version generated a whole bunch of traffic, I guess I'm
assuming this:

a) Nothing is inherently broken with 
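
For illustration, the per-device decision being discussed above boils down
to something like the standalone sketch below. The enum, allowlist, and
helper are hypothetical, not existing kernel API; the existing global knob
is the iommu.strict= boot parameter:

/* Hypothetical sketch of a per-device strict/lazy TLB-invalidation
 * decision; the names and the allowlist are made up for illustration. */
#include <stdio.h>
#include <string.h>

enum tlb_inval_policy { IOMMU_INVAL_STRICT, IOMMU_INVAL_LAZY };

/* Devices that opt in to lazy (non-strict) invalidation; everything
 * else defaults to strict, the safer choice. */
static const char *const lazy_allowlist[] = {
        "qcom,sdhci-msm-v5",    /* SD/MMC */
        "qcom,dwc3",            /* USB */
};

static enum tlb_inval_policy device_policy(const char *compatible)
{
        size_t i;

        for (i = 0; i < sizeof(lazy_allowlist) / sizeof(lazy_allowlist[0]); i++)
                if (!strcmp(compatible, lazy_allowlist[i]))
                        return IOMMU_INVAL_LAZY;
        return IOMMU_INVAL_STRICT;      /* WiFi, video, LPASS, QUPs, ... */
}

int main(void)
{
        printf("sdhci: %s\n", device_policy("qcom,sdhci-msm-v5") ==
               IOMMU_INVAL_LAZY ? "lazy" : "strict");
        printf("wifi:  %s\n", device_policy("qcom,wcn3990-wifi") ==
               IOMMU_INVAL_LAZY ? "lazy" : "strict");
        return 0;
}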

Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling

2021-07-07 Thread Rob Clark
On Tue, Jul 6, 2021 at 10:12 PM John Stultz  wrote:
>
> On Sun, Jul 4, 2021 at 11:16 AM Rob Clark  wrote:
> >
> > I suspect you are getting a dpu fault, and need:
> >
> > https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=h...@mail.gmail.com/
> >
> > I suppose Bjorn was expecting me to send that patch
>
> If it's helpful, I applied that and it got the db845c booting mainline
> again for me (along with some reverts for a separate ext4 shrinker
> crash).
> Tested-by: John Stultz 
>

Thanks, I'll send a patch shortly

BR,
-R


Re: [Resend RFC PATCH V4 13/13] x86/HV: Not set memory decrypted/encrypted during kexec alloc/free page in IVM

2021-07-07 Thread Dave Hansen
On 7/7/21 8:46 AM, Tianyu Lan wrote:
> @@ -598,7 +599,7 @@ void arch_kexec_unprotect_crashkres(void)
>   */
>  int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, gfp_t gfp)
>  {
> - if (sev_active())
> + if (sev_active() || hv_is_isolation_supported())
>   return 0;
>  
>   /*
> @@ -611,7 +612,7 @@ int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, gfp_t gfp)
>  
>  void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
>  {
> - if (sev_active())
> + if (sev_active() || hv_is_isolation_supported())
>   return;

You might want to take a look through the "protected guest" patches.  I
think this series is touching a few of the same locations that TDX and
recent SEV work touch.

https://lore.kernel.org/lkml/20210618225755.662725-5-sathyanarayanan.kuppusw...@linux.intel.com/



Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> On 2021/7/7 4:55 PM, Stefan Hajnoczi wrote:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > On 2021/7/7 1:11 AM, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  
> > > > > wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/5 8:49 PM, Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/4 5:49 PM, Yongji Xie wrote:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says 
> > > > > > > > > > > > "Device
> > > > > > > > > > > > configuration space is generally used for 
> > > > > > > > > > > > rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the 
> > > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > > > > > > requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is 
> > > > > > > > > > > weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the 
> > > > > > > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, 
> > > > > > > > > > > imagine a field
> > > > > > > > > > > containing an error code if the device encounters a 
> > > > > > > > > > > problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field 
> > > > > > > > > > > resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space 
> > > > > > > > > > > write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to 
> > > > > > > > > > > implement those
> > > > > > > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner 
> > > > > > > > > > > case, but it
> > > > > > > > > > > makes me think that the interface does not allow full 
> > > > > > > > > > > VIRTIO semantics.
> > > > > > > > > Note that though you're correct, my understanding is that 
> > > > > > > > > config space is
> > > > > > > > > not suitable for this kind of error propagation. And it would 
> > > > > > > > > be very hard
> > > > > > > > > to implement such kind of semantic in some transports.  
> > > > > > > > > Virtqueue should be
> > > > > > > > > much better. As Yongji quoted, the config space is used for
> > > > > > > > > "rarely-changing or initialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next 
> > > > > > > > > > version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value 
> > > > > > > > > > to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the 
> > > > > > > > > > error can
> > > > > > > > > > be propagated to the virtio device driver. Then the 
> > > > > > > > > > virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > 
> > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return 
> > > > > > > > value?
> > > > > > > > 
> > > > > > > > The VIRTIO spec provides no way for the device to report errors 
> > > > > > > > from
> > > > > > > > config space accesses.
> > > > > > > > 
> > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > virtio_config_read*() and silently discards 
> > > > > > > > virtio_config_write*()
> > > > > > > > accesses.
> > > > > > > > 
> > > > > > > > VDUSE can take the same approach with
> > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > 
> > > > > > > > > I'd like to stick to the current assumption that get_config 
> > > > > > > > > won't fail.
> > > > > > > > > That is to say,
> > > > > > > > > 
> > > > > > > > > 1) maintain a config in the kernel, make sure the config 
> > > > > > > > > space read can
> > > > > > > > > always succeed
> > > > > > > > > 2) introduce an ioctl for the vduse userspace to update the 
> > > > > > > > > config space.
> > > > > > > > > 3) we can synchronize with the vduse userspace during 
> > > > > > > > > set_config
> > > > > > > > > 
> > > > > > > > > Does this work?
> > > > > > > > I noticed that caching is also allowed by the vhost-user 
> > > > > > > > protocol
> > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device 
> > > > > > > > doesn't
> > > > > > > > know whether or not caching is in effect. The interface you 
> > > > > > > > outlined
> > > > > > > > above requires caching.
> > > > > > > > 
> > > > > > > > Is there a reason why the host kernel vDPA code needs to cache 
> > > > > > > 

[Resend RFC PATCH V4 13/13] x86/HV: Not set memory decrypted/encrypted during kexec alloc/free page in IVM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

A Hyper-V Isolation VM reuses the set_memory_decrypted/encrypted functions
and does not need to decrypt/encrypt memory in
arch_kexec_post_alloc(pre_free)_pages, just like an AMD SEV VM. So skip them.

Signed-off-by: Tianyu Lan 
---
 arch/x86/kernel/machine_kexec_64.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c078b0d3ab0e..0cadc64b6873 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_ACPI
 /*
@@ -598,7 +599,7 @@ void arch_kexec_unprotect_crashkres(void)
  */
 int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, gfp_t gfp)
 {
-   if (sev_active())
+   if (sev_active() || hv_is_isolation_supported())
return 0;
 
/*
@@ -611,7 +612,7 @@ int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, gfp_t gfp)
 
 void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
 {
-   if (sev_active())
+   if (sev_active() || hv_is_isolation_supported())
return;
 
/*
-- 
2.25.1



[Resend RFC PATCH V4 12/13] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

In an Isolation VM, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already done
this for the storvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_mpb_desc() still need to be handled. Use the DMA API to
map/unmap this memory when sending/receiving packets; the Hyper-V DMA ops
callback will use the swiotlb code to allocate a bounce buffer and copy
data from/to it.

Signed-off-by: Tianyu Lan 
---
 drivers/scsi/storvsc_drv.c | 68 +++---
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 403753929320..cc9cb32f6621 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -427,6 +429,8 @@ struct storvsc_cmd_request {
u32 payload_sz;
 
struct vstor_packet vstor_packet;
+   u32 hvpg_count;
+   struct hv_dma_range *dma_range;
 };
 
 
@@ -509,6 +513,14 @@ struct storvsc_scan_work {
u8 tgt_id;
 };
 
+#define storvsc_dma_map(dev, page, offset, size, dir) \
+   dma_map_page(dev, page, offset, size, dir)
+
+#define storvsc_dma_unmap(dev, dma_range, dir) \
+   dma_unmap_page(dev, dma_range.dma,  \
+  dma_range.mapping_size,  \
+  dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)
+
 static void storvsc_device_scan(struct work_struct *work)
 {
struct storvsc_scan_work *wrk;
@@ -1267,6 +1279,7 @@ static void storvsc_on_channel_callback(void *context)
struct hv_device *device;
struct storvsc_device *stor_device;
struct Scsi_Host *shost;
+   int i;
 
if (channel->primary_channel != NULL)
device = channel->primary_channel->device_obj;
@@ -1321,6 +1334,15 @@ static void storvsc_on_channel_callback(void *context)
request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
}
 
+   if (request->dma_range) {
+   for (i = 0; i < request->hvpg_count; i++)
+   storvsc_dma_unmap(&device->device,
+   request->dma_range[i],
+   request->vstor_packet.vm_srb.data_in == READ_TYPE);
+
+   kfree(request->dma_range);
+   }
+
storvsc_on_receive(stor_device, packet, request);
continue;
}
@@ -1817,7 +1839,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
unsigned int hvpgoff, hvpfns_to_add;
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
+   dma_addr_t dma;
u64 hvpfn;
+   u32 size;
 
if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
 
@@ -1831,6 +1855,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
payload->range.len = length;
payload->range.offset = offset_in_hvpg;
 
+   cmd_request->dma_range = kcalloc(hvpg_count,
+sizeof(*cmd_request->dma_range),
+GFP_ATOMIC);
+   if (!cmd_request->dma_range) {
+   ret = -ENOMEM;
+   goto free_payload;
+   }
 
for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
/*
@@ -1854,9 +1885,29 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
 * last sgl should be reached at the same time that
 * the PFN array is filled.
 */
-   while (hvpfns_to_add--)
-   payload->range.pfn_array[i++] = hvpfn++;
+   while (hvpfns_to_add--) {
+   size = min(HV_HYP_PAGE_SIZE - offset_in_hvpg,
+  (unsigned long)length);
+   dma = storvsc_dma_map(&dev->device, pfn_to_page(hvpfn++),
+ offset_in_hvpg, size,
+ scmnd->sc_data_direction);
+   if (dma_mapping_error(&dev->device, dma)) {
+   ret = -ENOMEM;
+   goto free_dma_range;
+   }
+
+   if (offset_in_hvpg) {
+   payload->range.offset = dma & ~HV_HYP_PAGE_MASK;
+   
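
The per-page loop above follows the standard DMA-API pattern:
dma_map_page(), a dma_mapping_error() check, and an unwind of the pages
already mapped on failure. A minimal kernel-style sketch of just that
pattern (the helper and its arguments are hypothetical):

/* Minimal sketch of the DMA-API map/check/unwind pattern used above.
 * The helper itself is hypothetical, for illustration only. */
#include <linux/dma-mapping.h>

static int map_pages(struct device *dev, struct page **pages, int n,
                     dma_addr_t *dma, enum dma_data_direction dir)
{
        int i;

        for (i = 0; i < n; i++) {
                dma[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE, dir);
                if (dma_mapping_error(dev, dma[i]))
                        goto unwind;
        }
        return 0;

unwind:
        /* Unmap everything mapped so far, in reverse order. */
        while (--i >= 0)
                dma_unmap_page(dev, dma[i], PAGE_SIZE, dir);
        return -ENOMEM;
}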

[Resend RFC PATCH V4 11/13] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

In an Isolation VM, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already done
this for the netvsc rx/tx ring buffers. The page buffers used by
vmbus_sendpacket_pagebuffer() still need to be handled. Use the DMA API to
map/unmap this memory when sending/receiving packets; the Hyper-V DMA ops
callback will use the swiotlb code to allocate a bounce buffer and copy
data from/to it.

Signed-off-by: Tianyu Lan 
---
 drivers/net/hyperv/hyperv_net.h   |   6 ++
 drivers/net/hyperv/netvsc.c   | 144 +-
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/linux/hyperv.h|   5 ++
 4 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index b11aa68b44ec..c2fbb9d4df2c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@ struct hv_netvsc_packet {
u32 total_bytes;
u32 send_buf_index;
u32 total_data_buflen;
+   struct hv_dma_range *dma_range;
 };
 
 #define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,7 @@ struct netvsc_device {
 
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
+   void *recv_original_buf;
u32 recv_buf_size; /* allocated bytes */
u32 recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1082,6 +1084,8 @@ struct netvsc_device {
 
/* Send buffer allocated by us */
void *send_buf;
+   void *send_original_buf;
+   u32 send_buf_size;
u32 send_buf_gpadl_handle;
u32 send_section_cnt;
u32 send_section_size;
@@ -1729,4 +1733,6 @@ struct rndis_message {
 #define RETRY_US_HI1
 #define RETRY_MAX  2000/* >10 sec */
 
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet);
 #endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 7bd935412853..fc312e5db4d5 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
int i;
 
kfree(nvdev->extension);
-   vfree(nvdev->recv_buf);
-   vfree(nvdev->send_buf);
+
+   if (nvdev->recv_original_buf) {
+   vunmap(nvdev->recv_buf);
+   vfree(nvdev->recv_original_buf);
+   } else {
+   vfree(nvdev->recv_buf);
+   }
+
+   if (nvdev->send_original_buf) {
+   vunmap(nvdev->send_buf);
+   vfree(nvdev->send_original_buf);
+   } else {
+   vfree(nvdev->send_buf);
+   }
+
kfree(nvdev->send_section_map);
 
for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -330,6 +343,27 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
return nvchan->mrc.slots ? 0 : -ENOMEM;
 }
 
+static void *netvsc_remap_buf(void *buf, unsigned long size)
+{
+   unsigned long *pfns;
+   void *vaddr;
+   int i;
+
+   pfns = kcalloc(size / HV_HYP_PAGE_SIZE, sizeof(unsigned long),
+  GFP_KERNEL);
+   if (!pfns)
+   return NULL;
+
+   for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)
+   pfns[i] = virt_to_hvpfn(buf + i * HV_HYP_PAGE_SIZE)
+   + (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
+
+   vaddr = vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
+   kfree(pfns);
+
+   return vaddr;
+}
+
 static int netvsc_init_buf(struct hv_device *device,
   struct netvsc_device *net_device,
   const struct netvsc_device_info *device_info)
@@ -340,6 +374,7 @@ static int netvsc_init_buf(struct hv_device *device,
unsigned int buf_size;
size_t map_words;
int i, ret = 0;
+   void *vaddr;
 
/* Get receive buffer area. */
buf_size = device_info->recv_sections * device_info->recv_section_size;
@@ -375,6 +410,15 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
 
+   if (hv_isolation_type_snp()) {
+   vaddr = netvsc_remap_buf(net_device->recv_buf, buf_size);
+   if (!vaddr)
+   goto cleanup;
+
+   net_device->recv_original_buf = net_device->recv_buf;
+   net_device->recv_buf = vaddr;
+   }
+
/* Notify the NetVsp of the gpadl handle */
init_packet = &net_device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -477,6 +521,15 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
 
+   if (hv_isolation_type_snp()) {
+   vaddr = netvsc_remap_buf(net_device->send_buf, buf_size);
+   if (!vaddr)
+   goto cleanup;
+
+   net_device->send_original_buf = 

[Resend RFC PATCH V4 10/13] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V Isolation VM requires bounce buffer support to copy
data from/to encrypted memory and so enable swiotlb force
mode to use swiotlb bounce buffer for DMA transaction.

In an Isolation VM with AMD SEV, the bounce buffer needs to be
accessed via an extra address space which is above shared_gpa_boundary
(e.g., a 39-bit address line) reported by the Hyper-V CPUID
ISOLATION_CONFIG leaf. The access physical address will be the original
physical address + shared_gpa_boundary. The shared_gpa_boundary in the
AMD SEV SNP spec is called the virtual top of memory (vTOM). Memory
addresses below vTOM are automatically treated as private while memory
above vTOM is treated as shared.
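
To make the vTOM addressing concrete, a tiny standalone sketch with a
made-up 39-bit boundary (real code reads shared_gpa_boundary from the
Hyper-V CPUID leaf):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t shared_gpa_boundary = 1ULL << 39;      /* hypothetical vTOM */
        uint64_t pa = 0x12345000;                       /* private, below vTOM */
        uint64_t shared_alias = pa + shared_gpa_boundary;

        printf("private: %#llx, shared alias: %#llx\n",
               (unsigned long long)pa, (unsigned long long)shared_alias);
        return 0;
}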

Swiotlb bounce buffer code calls set_memory_decrypted_map()
to mark bounce buffer visible to host and map it in extra
address space.

Hyper-V initializes its own swiotlb bounce buffer, and the default
swiotlb needs to be disabled, but pci_swiotlb_detect_override() and
pci_swiotlb_detect_4gb() enable the default one. To override that
setting, hyperv_swiotlb_detect() needs to run before these detect
functions, which depend on pci_xen_swiotlb_init(). Make
pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect() to keep the
order.

The map function vmap_pfn() can't work in the early
hyperv_iommu_swiotlb_init() stage, so initialize the swiotlb bounce
buffer in hyperv_iommu_swiotlb_later_init().

Signed-off-by: Tianyu Lan 
---
 arch/x86/xen/pci-swiotlb-xen.c |  3 +-
 drivers/hv/vmbus_drv.c |  3 ++
 drivers/iommu/hyperv-iommu.c   | 62 ++
 include/linux/hyperv.h |  1 +
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 54f9aa7e8457..43bd031aa332 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
 EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
 
 IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
- NULL,
+ hyperv_swiotlb_detect,
  pci_xen_swiotlb_init,
  NULL);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 92cb3f7d21d9..5e3bb76d4dee 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -2080,6 +2081,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
return child_device_obj;
 }
 
+static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
 /*
  * vmbus_device_register - Register the child device
  */
@@ -2120,6 +2122,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
}
hv_debug_add_dev_dir(child_device_obj);
 
+   child_device_obj->device.dma_mask = &vmbus_dma_mask;
return 0;
 
 err_kset_unregister:
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..d7ea8e05b991 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -13,14 +13,22 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include "irq_remapping.h"
 
@@ -36,6 +44,8 @@
 static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
 static struct irq_domain *ioapic_ir_domain;
 
+static unsigned long hyperv_io_tlb_start, hyperv_io_tlb_size;
+
 static int hyperv_ir_set_affinity(struct irq_data *data,
const struct cpumask *mask, bool force)
 {
@@ -337,4 +347,56 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
.free = hyperv_root_irq_remapping_free,
 };
 
+void __init hyperv_iommu_swiotlb_init(void)
+{
+   unsigned long bytes;
+   void *vstart;
+
+   /*
+* Allocate Hyper-V swiotlb bounce buffer at early place
+* to reserve large contiguous memory.
+*/
+   hyperv_io_tlb_size = 200 * 1024 * 1024;
+   hyperv_io_tlb_start = memblock_alloc_low(PAGE_ALIGN(hyperv_io_tlb_size),
+HV_HYP_PAGE_SIZE);
+
+   if (!hyperv_io_tlb_start) {
+   pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
+   return;
+   }
+}
+
+int __init hyperv_swiotlb_detect(void)
+{
+   if (hypervisor_is_type(X86_HYPER_MS_HYPERV)
+   && hv_is_isolation_supported()) {
+   /*
+* Enable swiotlb force mode in Isolation VM to
+* use swiotlb bounce buffer for dma transaction.
+*/
+   swiotlb_force = SWIOTLB_FORCE;
+   return 1;
+   }
+
+   return 0;
+}
+
+void __init hyperv_iommu_swiotlb_later_init(void)
+{
+   void *hyperv_io_tlb_remap;
+   int ret;
+
+   /*
+* Swiotlb bounce buffer needs to be mapped in extra address
+* space. Map function 

[Resend RFC PATCH V4 09/13] x86/Swiotlb/HV: Add Swiotlb bounce buffer remap function for HV IVM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM with AMD SEV, bounce buffer needs to be accessed via
extra address space which is above shared_gpa_boundary
(e.g., a 39-bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.
The access physical address will be original physical address +
shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP
spec is called the virtual top of memory (vTOM). Memory addresses below
vTOM are automatically treated as private while memory above
vTOM is treated as shared.

Introduce a set_memory_decrypted_map() function to decrypt memory and
remap it with a platform callback. Use set_memory_decrypted_map() in the
swiotlb code, store the remap address returned by the new API, and use
the remap address to copy data from/to the swiotlb bounce buffer.

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/ivm.c | 30 ++
 arch/x86/include/asm/mshyperv.h   |  2 ++
 arch/x86/include/asm/set_memory.h |  2 ++
 arch/x86/mm/pat/set_memory.c  | 28 
 include/linux/swiotlb.h   |  4 
 kernel/dma/swiotlb.c  | 11 ---
 6 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 8a6f4e9e3d6c..ea33935e0c17 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -267,3 +267,33 @@ int hv_set_mem_enc(unsigned long addr, int numpages, bool enc)
enc ? VMBUS_PAGE_NOT_VISIBLE
: VMBUS_PAGE_VISIBLE_READ_WRITE);
 }
+
+/*
+ * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
+ */
+unsigned long hv_map_memory(unsigned long addr, unsigned long size)
+{
+   unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
+ sizeof(unsigned long),
+  GFP_KERNEL);
+   unsigned long vaddr;
+   int i;
+
+   if (!pfns)
+   return (unsigned long)NULL;
+
+   for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)
+   pfns[i] = virt_to_hvpfn((void *)addr + i * HV_HYP_PAGE_SIZE) +
+   (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
+
+   vaddr = (unsigned long)vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE,
+   PAGE_KERNEL_IO);
+   kfree(pfns);
+
+   return vaddr;
+}
+
+void hv_unmap_memory(unsigned long addr)
+{
+   vunmap((void *)addr);
+}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index fe03e3e833ac..ba3cb9e32fdb 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -253,6 +253,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
 int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
 int hv_mark_gpa_visibility(u16 count, const u64 pfn[], u32 visibility);
 int hv_set_mem_enc(unsigned long addr, int numpages, bool enc);
+unsigned long hv_map_memory(unsigned long addr, unsigned long size);
+void hv_unmap_memory(unsigned long addr);
 void hv_sint_wrmsrl_ghcb(u64 msr, u64 value);
 void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
 void hv_signal_eom_ghcb(void);
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 43fa081a1adb..7a2117931830 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,6 +49,8 @@ int set_memory_decrypted(unsigned long addr, int numpages);
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
+unsigned long set_memory_decrypted_map(unsigned long addr, unsigned long size);
+int set_memory_encrypted_unmap(unsigned long addr, unsigned long size);
 
 int set_pages_array_uc(struct page **pages, int addrinarray);
 int set_pages_array_wc(struct page **pages, int addrinarray);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6cc83c57383d..5d4d3963f4a2 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2039,6 +2039,34 @@ int set_memory_decrypted(unsigned long addr, int numpages)
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
+static unsigned long __map_memory(unsigned long addr, unsigned long size)
+{
+   if (hv_is_isolation_supported())
+   return hv_map_memory(addr, size);
+
+   return addr;
+}
+
+static void __unmap_memory(unsigned long addr)
+{
+   if (hv_is_isolation_supported())
+   hv_unmap_memory(addr);
+}
+
+unsigned long set_memory_decrypted_map(unsigned long addr, unsigned long size)
+{
+   if (__set_memory_enc_dec(addr, size / PAGE_SIZE, false))
+   return (unsigned long)NULL;
+
+   return __map_memory(addr, size);
+}
+
+int set_memory_encrypted_unmap(unsigned long addr, unsigned long size)
+{
+   __unmap_memory(addr);
+   return __set_memory_enc_dec(addr, size / PAGE_SIZE, true);
+}
+
 int 

[Resend RFC PATCH V4 08/13] HV/Vmbus: Initialize VMbus ring buffer for Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

VMbus ring buffers are shared with the host and need to be accessed
via the extra address space of an Isolation VM with SNP support. This
patch maps the ring buffer address into the extra address space via
ioremap(). The HV host visibility hvcall smears data in the ring
buffer, so reset the ring buffer memory to zero after calling the
visibility hvcall.

Signed-off-by: Tianyu Lan 
---
 drivers/hv/Kconfig|  1 +
 drivers/hv/channel.c  | 10 +
 drivers/hv/hyperv_vmbus.h |  2 +
 drivers/hv/ring_buffer.c  | 84 ++-
 4 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 66c794d92391..a8386998be40 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -7,6 +7,7 @@ config HYPERV
depends on X86 && ACPI && X86_LOCAL_APIC && HYPERVISOR_GUEST
select PARAVIRT
select X86_HV_CALLBACK_VECTOR
+   select VMAP_PFN
help
  Select this option to run Linux as a Hyper-V client operating
  system.
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 01048bb07082..7350da9dbe97 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -707,6 +707,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
if (err)
goto error_clean_ring;
 
+   err = hv_ringbuffer_post_init(&newchannel->outbound,
+ page, send_pages);
+   if (err)
+   goto error_free_gpadl;
+
+   err = hv_ringbuffer_post_init(&newchannel->inbound,
+ &page[send_pages], recv_pages);
+   if (err)
+   goto error_free_gpadl;
+
/* Create and init the channel open message */
open_info = kzalloc(sizeof(*open_info) +
   sizeof(struct vmbus_channel_open_channel),
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 40bc0eff6665..15cd23a561f3 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -172,6 +172,8 @@ extern int hv_synic_cleanup(unsigned int cpu);
 /* Interface */
 
 void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
+int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
+   struct page *pages, u32 page_cnt);
 
 int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
   struct page *pages, u32 pagecnt, u32 max_pkt_size);
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 2aee356840a2..d4f93fca1108 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "hyperv_vmbus.h"
 
@@ -179,43 +181,89 @@ void hv_ringbuffer_pre_init(struct vmbus_channel *channel)
mutex_init(&channel->outbound.ring_buffer_mutex);
 }
 
-/* Initialize the ring buffer. */
-int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
-  struct page *pages, u32 page_cnt, u32 max_pkt_size)
+int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
+  struct page *pages, u32 page_cnt)
 {
+   u64 physic_addr = page_to_pfn(pages) << PAGE_SHIFT;
+   unsigned long *pfns_wraparound;
+   void *vaddr;
int i;
-   struct page **pages_wraparound;
 
-   BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
+   if (!hv_isolation_type_snp())
+   return 0;
+
+   physic_addr += ms_hyperv.shared_gpa_boundary;
 
/*
 * First page holds struct hv_ring_buffer, do wraparound mapping for
 * the rest.
 */
-   pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
+   pfns_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(unsigned long),
   GFP_KERNEL);
-   if (!pages_wraparound)
+   if (!pfns_wraparound)
return -ENOMEM;
 
-   pages_wraparound[0] = pages;
+   pfns_wraparound[0] = physic_addr >> PAGE_SHIFT;
for (i = 0; i < 2 * (page_cnt - 1); i++)
-   pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];
-
-   ring_info->ring_buffer = (struct hv_ring_buffer *)
-   vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);
-
-   kfree(pages_wraparound);
+   pfns_wraparound[i + 1] = (physic_addr >> PAGE_SHIFT) +
+   i % (page_cnt - 1) + 1;
 
-
-   if (!ring_info->ring_buffer)
+   vaddr = vmap_pfn(pfns_wraparound, page_cnt * 2 - 1, PAGE_KERNEL_IO);
+   kfree(pfns_wraparound);
+   if (!vaddr)
return -ENOMEM;
 
-   ring_info->ring_buffer->read_index =
-   ring_info->ring_buffer->write_index = 0;
+   /* Clean memory after setting host visibility. */
+   memset((void *)vaddr, 0x00, page_cnt * PAGE_SIZE);
+
+   ring_info->ring_buffer = (struct hv_ring_buffer *)vaddr;
+   ring_info->ring_buffer->read_index = 0;
+   
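
For context on why the data pages are mapped twice (page_cnt * 2 - 1
entries): with a second, back-to-back mapping of pages 2..N, a packet that
crosses the end of the ring stays virtually contiguous and can be copied
with a single linear memcpy. A minimal sketch of the idea (names
hypothetical):

/* Sketch: the data pages are mapped a second time immediately after
 * the first mapping, so a read that crosses the wrap point needs no
 * split at the ring boundary. */
#include <string.h>

struct ring {
        char *data;       /* data pages, mapped twice back-to-back */
        size_t data_size; /* size of one copy of the data area */
};

void ring_read(struct ring *r, size_t offset, void *buf, size_t len)
{
        /* valid for offset < r->data_size and len <= r->data_size */
        memcpy(buf, r->data + offset, len);
}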

[Resend RFC PATCH V4 07/13] HV/Vmbus: Add SNP support for VMbus channel initiate message

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
with the host in an Isolation VM, so it's necessary to use an hvcall
to set them visible to the host. In an Isolation VM with AMD SEV SNP,
the access address should be in the extra space which is above the
shared gpa boundary. So remap these pages into the extra address
space (pa + shared_gpa_boundary).

Signed-off-by: Tianyu Lan 
---
 drivers/hv/connection.c   | 65 +++
 drivers/hv/hyperv_vmbus.h |  1 +
 2 files changed, 66 insertions(+)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 186fd4c8acd4..a32bde143e4c 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "hyperv_vmbus.h"
@@ -104,6 +105,12 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
 
msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
+
+   if (hv_is_isolation_supported()) {
+   msg->monitor_page1 += ms_hyperv.shared_gpa_boundary;
+   msg->monitor_page2 += ms_hyperv.shared_gpa_boundary;
+   }
+
msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
 
/*
@@ -148,6 +155,31 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, u32 version)
return -ECONNREFUSED;
}
 
+   if (hv_is_isolation_supported()) {
+   vmbus_connection.monitor_pages_va[0]
+   = vmbus_connection.monitor_pages[0];
+   vmbus_connection.monitor_pages[0]
+   = memremap(msg->monitor_page1, HV_HYP_PAGE_SIZE,
+  MEMREMAP_WB);
+   if (!vmbus_connection.monitor_pages[0])
+   return -ENOMEM;
+
+   vmbus_connection.monitor_pages_va[1]
+   = vmbus_connection.monitor_pages[1];
+   vmbus_connection.monitor_pages[1]
+   = memremap(msg->monitor_page2, HV_HYP_PAGE_SIZE,
+  MEMREMAP_WB);
+   if (!vmbus_connection.monitor_pages[1]) {
+   memunmap(vmbus_connection.monitor_pages[0]);
+   return -ENOMEM;
+   }
+
+   memset(vmbus_connection.monitor_pages[0], 0x00,
+  HV_HYP_PAGE_SIZE);
+   memset(vmbus_connection.monitor_pages[1], 0x00,
+  HV_HYP_PAGE_SIZE);
+   }
+
return ret;
 }
 
@@ -159,6 +191,7 @@ int vmbus_connect(void)
struct vmbus_channel_msginfo *msginfo = NULL;
int i, ret = 0;
__u32 version;
+   u64 pfn[2];
 
/* Initialize the vmbus connection */
vmbus_connection.conn_state = CONNECTING;
@@ -216,6 +249,16 @@ int vmbus_connect(void)
goto cleanup;
}
 
+   if (hv_is_isolation_supported()) {
+   pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
+   pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
+   if (hv_mark_gpa_visibility(2, pfn,
+   VMBUS_PAGE_VISIBLE_READ_WRITE)) {
+   ret = -EFAULT;
+   goto cleanup;
+   }
+   }
+
msginfo = kzalloc(sizeof(*msginfo) +
  sizeof(struct vmbus_channel_initiate_contact),
  GFP_KERNEL);
@@ -282,6 +325,8 @@ int vmbus_connect(void)
 
 void vmbus_disconnect(void)
 {
+   u64 pfn[2];
+
/*
 * First send the unload request to the host.
 */
@@ -301,6 +346,26 @@ void vmbus_disconnect(void)
vmbus_connection.int_page = NULL;
}
 
+   if (hv_is_isolation_supported()) {
+   if (vmbus_connection.monitor_pages_va[0]) {
+   memunmap(vmbus_connection.monitor_pages[0]);
+   vmbus_connection.monitor_pages[0]
+   = vmbus_connection.monitor_pages_va[0];
+   vmbus_connection.monitor_pages_va[0] = NULL;
+   }
+
+   if (vmbus_connection.monitor_pages_va[1]) {
+   memunmap(vmbus_connection.monitor_pages[1]);
+   vmbus_connection.monitor_pages[1]
+   = vmbus_connection.monitor_pages_va[1];
+   vmbus_connection.monitor_pages_va[1] = NULL;
+   }
+
+   pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
+   pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
+   hv_mark_gpa_visibility(2, pfn, VMBUS_PAGE_NOT_VISIBLE);
+   }
+
hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
   

[Resend RFC PATCH V4 06/13] HV: Add ghcb hvcall support for SNP VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides a ghcb hvcall to handle the VMBus
HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE
messages in an SNP Isolation VM. Add such support.

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/ivm.c   | 42 +
 arch/x86/include/asm/mshyperv.h |  1 +
 drivers/hv/connection.c |  6 -
 drivers/hv/hv.c |  8 ++-
 include/asm-generic/mshyperv.h  | 29 +++
 5 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index c7b54631ca0d..8a6f4e9e3d6c 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -15,6 +15,48 @@
 #include 
 #include 
 
+u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
+{
+   union hv_ghcb *hv_ghcb;
+   void **ghcb_base;
+   unsigned long flags;
+
+   if (!ms_hyperv.ghcb_base)
+   return -EFAULT;
+
+   WARN_ON(in_nmi());
+
+   local_irq_save(flags);
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   hv_ghcb = (union hv_ghcb *)*ghcb_base;
+   if (!hv_ghcb) {
+   local_irq_restore(flags);
+   return -EFAULT;
+   }
+
+   memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
+   hv_ghcb->ghcb.protocol_version = 1;
+   hv_ghcb->ghcb.ghcb_usage = 1;
+
+   hv_ghcb->hypercall.outputgpa = (u64)output;
+   hv_ghcb->hypercall.hypercallinput.asuint64 = 0;
+   hv_ghcb->hypercall.hypercallinput.callcode = control;
+
+   if (input_size)
+   memcpy(hv_ghcb->hypercall.hypercalldata, input, input_size);
+
+   VMGEXIT();
+
+   hv_ghcb->ghcb.ghcb_usage = 0xffffffff;
+   memset(hv_ghcb->ghcb.save.valid_bitmap, 0,
+  sizeof(hv_ghcb->ghcb.save.valid_bitmap));
+
+   local_irq_restore(flags);
+
+   return hv_ghcb->hypercall.hypercalloutput.callstatus;
+}
+EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
+
 void hv_ghcb_msr_write(u64 msr, u64 value)
 {
union hv_ghcb *hv_ghcb;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f9cc3753040a..fe03e3e833ac 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -258,6 +258,7 @@ void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
 void hv_signal_eom_ghcb(void);
 void hv_ghcb_msr_write(u64 msr, u64 value);
 void hv_ghcb_msr_read(u64 msr, u64 *value);
+u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size);
 
 #define hv_get_synint_state_ghcb(int_num, val) \
hv_sint_rdmsrl_ghcb(HV_X64_MSR_SINT0 + int_num, val)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 311cd005b3be..186fd4c8acd4 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -445,6 +445,10 @@ void vmbus_set_event(struct vmbus_channel *channel)
 
++channel->sig_events;
 
-   hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
+   if (hv_isolation_type_snp())
+   hv_ghcb_hypercall(HVCALL_SIGNAL_EVENT, &channel->sig_event,
+   NULL, sizeof(u64));
+   else
+   hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
 }
 EXPORT_SYMBOL_GPL(vmbus_set_event);
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 59f7173c4d9f..e5c9fc467893 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -98,7 +98,13 @@ int hv_post_message(union hv_connection_id connection_id,
aligned_msg->payload_size = payload_size;
memcpy((void *)aligned_msg->payload, payload, payload_size);
 
-   status = hv_do_hypercall(HVCALL_POST_MESSAGE, aligned_msg, NULL);
+   if (hv_isolation_type_snp())
+   status = hv_ghcb_hypercall(HVCALL_POST_MESSAGE,
+   (void *)aligned_msg, NULL,
+   sizeof(struct hv_input_post_message));
+   else
+   status = hv_do_hypercall(HVCALL_POST_MESSAGE,
+   aligned_msg, NULL);
 
/* Preemption must remain disabled until after the hypercall
 * so some other thread can't get scheduled onto this cpu and
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index e6d6886faed1..8f6f283fb5b5 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -30,6 +30,35 @@
 
 union hv_ghcb {
struct ghcb ghcb;
+   struct {
+   u64 hypercalldata[509];
+   u64 outputgpa;
+   union {
+   union {
+   struct {
+   u32 callcode: 16;
+   u32 isfast  : 1;
+   u32 reserved1   : 14;
+   u32 isnested: 1;
+   u32 countofelements : 12;
+   u32 reserved2   : 4;
+  

[Resend RFC PATCH V4 05/13] HV: Add Write/Read MSR registers via ghcb page

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides a GHCB protocol to write Synthetic Interrupt
Controller MSR registers in an Isolation VM with AMD SEV SNP,
and these registers are emulated by the hypervisor directly.
Hyper-V requires the SINTx MSR registers to be written twice:
first via the GHCB page to communicate with the hypervisor,
and then with the wrmsr instruction to talk with the paravisor,
which runs in VMPL0. The Guest OS ID MSR also needs to be set
via the GHCB.

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/hv_init.c   |  25 +--
 arch/x86/hyperv/ivm.c   | 115 ++
 arch/x86/include/asm/mshyperv.h |  78 +++-
 arch/x86/include/asm/sev-es.h   |   4 ++
 arch/x86/kernel/cpu/mshyperv.c  |   3 +
 arch/x86/kernel/sev-es-shared.c |  21 --
 drivers/hv/hv.c | 121 ++--
 include/asm-generic/mshyperv.h  |  12 +++-
 8 files changed, 308 insertions(+), 71 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e058f725..97d1c774cfce 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -445,7 +445,7 @@ void __init hyperv_init(void)
goto clean_guest_os_id;
 
if (hv_isolation_type_snp()) {
-   ms_hyperv.ghcb_base = alloc_percpu(void *);
+   ms_hyperv.ghcb_base = alloc_percpu(union hv_ghcb __percpu *);
if (!ms_hyperv.ghcb_base)
goto clean_guest_os_id;
 
@@ -539,6 +539,7 @@ void hyperv_cleanup(void)
 
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
+   hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
 
/*
 * Reset hypercall page reference before reset the page,
@@ -620,28 +621,6 @@ bool hv_is_hibernation_supported(void)
 }
 EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
 
-enum hv_isolation_type hv_get_isolation_type(void)
-{
-   if (!(ms_hyperv.priv_high & HV_ISOLATION))
-   return HV_ISOLATION_TYPE_NONE;
-   return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
-}
-EXPORT_SYMBOL_GPL(hv_get_isolation_type);
-
-bool hv_is_isolation_supported(void)
-{
-   return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
-}
-EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
-
-DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
-
-bool hv_isolation_type_snp(void)
-{
-   return static_branch_unlikely(&isolation_type_snp);
-}
-EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
-
 /* Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx */
 bool hv_query_ext_cap(u64 cap_query)
 {
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 24a58795abd8..c7b54631ca0d 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -6,6 +6,8 @@
  *  Tianyu Lan 
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -13,6 +15,119 @@
 #include 
 #include 
 
+void hv_ghcb_msr_write(u64 msr, u64 value)
+{
+   union hv_ghcb *hv_ghcb;
+   void **ghcb_base;
+   unsigned long flags;
+
+   if (!ms_hyperv.ghcb_base)
+   return;
+
+   WARN_ON(in_nmi());
+
+   local_irq_save(flags);
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   hv_ghcb = (union hv_ghcb *)*ghcb_base;
+   if (!hv_ghcb) {
+   local_irq_restore(flags);
+   return;
+   }
+
+   memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
+
+   ghcb_set_rcx(&hv_ghcb->ghcb, msr);
+   ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
+   ghcb_set_rdx(&hv_ghcb->ghcb, value >> 32);
+
+   if (sev_es_ghcb_hv_call(&hv_ghcb->ghcb, NULL, SVM_EXIT_MSR, 1, 0))
+   pr_warn("Fail to write msr via ghcb %llx.\n", msr);
+
+   local_irq_restore(flags);
+}
+
+void hv_ghcb_msr_read(u64 msr, u64 *value)
+{
+   union hv_ghcb *hv_ghcb;
+   void **ghcb_base;
+   unsigned long flags;
+
+   if (!ms_hyperv.ghcb_base)
+   return;
+
+   WARN_ON(in_nmi());
+
+   local_irq_save(flags);
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   hv_ghcb = (union hv_ghcb *)*ghcb_base;
+   if (!hv_ghcb) {
+   local_irq_restore(flags);
+   return;
+   }
+
+   memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
+
+   ghcb_set_rcx(&hv_ghcb->ghcb, msr);
+   if (sev_es_ghcb_hv_call(&hv_ghcb->ghcb, NULL, SVM_EXIT_MSR, 0, 0))
+   pr_warn("Fail to read msr via ghcb %llx.\n", msr);
+   else
+   *value = (u64)lower_32_bits(hv_ghcb->ghcb.save.rax)
+   | ((u64)lower_32_bits(hv_ghcb->ghcb.save.rdx) << 32);
+   local_irq_restore(flags);
+}
+
+void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value)
+{
+   hv_ghcb_msr_read(msr, value);
+}
+EXPORT_SYMBOL_GPL(hv_sint_rdmsrl_ghcb);
+
+void hv_sint_wrmsrl_ghcb(u64 msr, u64 value)
+{
+   hv_ghcb_msr_write(msr, value);
+
+   /* Write proxy bit via wrmsrl instruction. */
+   if (msr >= HV_X64_MSR_SINT0 && msr <= HV_X64_MSR_SINT15)
+   wrmsrl(msr, value | 1 

[Resend RFC PATCH V4 04/13] HV: Mark vmbus ring buffer visible to host in Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Mark the vmbus ring buffer visible with set_memory_decrypted() when
establishing the gpadl handle.

Signed-off-by: Tianyu Lan 
---
 drivers/hv/channel.c   | 38 --
 include/linux/hyperv.h | 10 ++
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index f3761c73b074..01048bb07082 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -465,7 +466,7 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
struct list_head *curr;
u32 next_gpadl_handle;
unsigned long flags;
-   int ret = 0;
+   int ret = 0, index;
 
next_gpadl_handle =
(atomic_inc_return(&vmbus_connection.next_gpadl_handle) - 1);
@@ -474,6 +475,13 @@ static int __vmbus_establish_gpadl(struct vmbus_channel *channel,
if (ret)
return ret;
 
+   ret = set_memory_decrypted((unsigned long)kbuffer,
+  HVPFN_UP(size));
+   if (ret) {
+   pr_warn("Failed to set host visibility.\n");
+   return ret;
+   }
+
init_completion(&msginfo->waitevent);
msginfo->waiting_channel = channel;
 
@@ -539,6 +547,15 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
*channel,
/* At this point, we received the gpadl created msg */
*gpadl_handle = gpadlmsg->gpadl;
 
+   if (type == HV_GPADL_BUFFER)
+   index = 0;
+   else
+   index = channel->gpadl_range[1].gpadlhandle ? 2 : 1;
+
+   channel->gpadl_range[index].size = size;
+   channel->gpadl_range[index].buffer = kbuffer;
+   channel->gpadl_range[index].gpadlhandle = *gpadl_handle;
+
 cleanup:
spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
list_del(&msginfo->msglistentry);
@@ -549,6 +566,11 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
*channel,
}
 
kfree(msginfo);
+
+   if (ret)
+   set_memory_encrypted((unsigned long)kbuffer,
+HVPFN_UP(size));
+
return ret;
 }
 
@@ -811,7 +833,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 
gpadl_handle)
struct vmbus_channel_gpadl_teardown *msg;
struct vmbus_channel_msginfo *info;
unsigned long flags;
-   int ret;
+   int ret, i;
 
info = kzalloc(sizeof(*info) +
   sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
@@ -859,6 +881,18 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, 
u32 gpadl_handle)
spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
 
kfree(info);
+
+   /* Find gpadl buffer virtual address and size. */
+   for (i = 0; i < VMBUS_GPADL_RANGE_COUNT; i++)
+   if (channel->gpadl_range[i].gpadlhandle == gpadl_handle)
+   break;
+
+   if (set_memory_encrypted((unsigned long)channel->gpadl_range[i].buffer,
+   HVPFN_UP(channel->gpadl_range[i].size)))
+   pr_warn("Fail to set mem host visibility.\n");
+
+   channel->gpadl_range[i].gpadlhandle = 0;
+
return ret;
 }
 EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 2e859d2f9609..06eccaba10c5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -809,6 +809,14 @@ struct vmbus_device {
 
 #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
 
+struct vmbus_gpadl_range {
+   u32 gpadlhandle;
+   u32 size;
+   void *buffer;
+};
+
+#define VMBUS_GPADL_RANGE_COUNT3
+
 struct vmbus_channel {
struct list_head listentry;
 
@@ -829,6 +837,8 @@ struct vmbus_channel {
struct completion rescind_event;
 
u32 ringbuffer_gpadlhandle;
+   /* GPADL_RING and Send/Receive GPADL_BUFFER. */
+   struct vmbus_gpadl_range gpadl_range[VMBUS_GPADL_RANGE_COUNT];
 
/* Allocated memory for ring buffer */
struct page *ringbuffer_page;
-- 
2.25.1



[Resend RFC PATCH V4 03/13] x86/HV: Add new hvcall guest address host visibility support

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Add new hvcall guest address host visibility support to mark
memory visible to the host. Call it inside set_memory_decrypted()
and set_memory_encrypted().
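
As a usage sketch (buf is a hypothetical page-aligned buffer; the
visibility constants are the ones this patch introduces):

static int expose_page_to_host(void *buf)
{
	u64 pfn = virt_to_hvpfn(buf);

	/* Grant the host read/write access to one guest page. */
	return hv_mark_gpa_visibility(1, &pfn, VMBUS_PAGE_VISIBLE_READ_WRITE);
}

static int hide_page_from_host(void *buf)
{
	u64 pfn = virt_to_hvpfn(buf);

	/* Make the page private to the guest again. */
	return hv_mark_gpa_visibility(1, &pfn, VMBUS_PAGE_NOT_VISIBLE);
}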

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/Makefile   |   2 +-
 arch/x86/hyperv/ivm.c  | 112 +
 arch/x86/include/asm/hyperv-tlfs.h |  18 +
 arch/x86/include/asm/mshyperv.h|   3 +-
 arch/x86/mm/pat/set_memory.c   |   6 +-
 include/asm-generic/hyperv-tlfs.h  |   1 +
 6 files changed, 139 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/hyperv/ivm.c

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 48e2c51464e8..5d2de10809ae 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y  := hv_init.o mmu.o nested.o irqdomain.o
+obj-y  := hv_init.o mmu.o nested.o irqdomain.o ivm.o
 obj-$(CONFIG_X86_64)   += hv_apic.o hv_proc.o
 
 ifdef CONFIG_X86_64
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
new file mode 100644
index ..24a58795abd8
--- /dev/null
+++ b/arch/x86/hyperv/ivm.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hyper-V Isolation VM interface with paravisor and hypervisor
+ *
+ * Author:
+ *  Tianyu Lan 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
+ *
+ * In Isolation VM, all guest memory is encrypted from the host and the
+ * guest needs to set memory visible to the host via hvcall before
+ * sharing memory with the host.
+ */
+int hv_mark_gpa_visibility(u16 count, const u64 pfn[], u32 visibility)
+{
+   struct hv_gpa_range_for_visibility **input_pcpu, *input;
+   u16 pages_processed;
+   u64 hv_status;
+   unsigned long flags;
+
+   /* no-op if partition isolation is not enabled */
+   if (!hv_is_isolation_supported())
+   return 0;
+
+   if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
+   pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
+   HV_MAX_MODIFY_GPA_REP_COUNT);
+   return -EINVAL;
+   }
+
+   local_irq_save(flags);
+   input_pcpu = (struct hv_gpa_range_for_visibility **)
+   this_cpu_ptr(hyperv_pcpu_input_arg);
+   input = *input_pcpu;
+   if (unlikely(!input)) {
+   local_irq_restore(flags);
+   return -EINVAL;
+   }
+
+   input->partition_id = HV_PARTITION_ID_SELF;
+   input->host_visibility = visibility;
+   input->reserved0 = 0;
+   input->reserved1 = 0;
+   memcpy((void *)input->gpa_page_list, pfn, count * sizeof(*pfn));
+   hv_status = hv_do_rep_hypercall(
+   HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
0, input, &pages_processed);
+   local_irq_restore(flags);
+
+   if (!(hv_status & HV_HYPERCALL_RESULT_MASK))
+   return 0;
+
+   return hv_status & HV_HYPERCALL_RESULT_MASK;
+}
+EXPORT_SYMBOL(hv_mark_gpa_visibility);
+
+/*
+ * hv_set_mem_host_visibility - Set specified memory visible to host.
+ *
+ * In Isolation VM, all guest memory is encrypted from the host and the
+ * guest needs to set memory visible to the host via hvcall before
+ * sharing memory with the host. This function is a wrapper around
+ * hv_mark_gpa_visibility() that takes a memory base address and size.
+ */
+static int hv_set_mem_host_visibility(void *kbuffer, size_t size, u32 
visibility)
+{
+   int pagecount = size >> HV_HYP_PAGE_SHIFT;
+   u64 *pfn_array;
+   int ret = 0;
+   int i, pfn;
+
+   if (!hv_is_isolation_supported() || !ms_hyperv.ghcb_base)
+   return 0;
+
+   pfn_array = kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
+   if (!pfn_array)
+   return -ENOMEM;
+
+   for (i = 0, pfn = 0; i < pagecount; i++) {
+   pfn_array[pfn] = virt_to_hvpfn(kbuffer + i * HV_HYP_PAGE_SIZE);
+   pfn++;
+
+   if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
+   ret |= hv_mark_gpa_visibility(pfn, pfn_array, 
visibility);
+   pfn = 0;
+
+   if (ret)
+   goto err_free_pfn_array;
+   }
+   }
+
+ err_free_pfn_array:
+   kfree(pfn_array);
+   return ret;
+}
+
+int hv_set_mem_enc(unsigned long addr, int numpages, bool enc)
+{
+   return hv_set_mem_host_visibility((void *)addr,
+   numpages * HV_HYP_PAGE_SIZE,
+   enc ? VMBUS_PAGE_NOT_VISIBLE
+   : VMBUS_PAGE_VISIBLE_READ_WRITE);
+}
diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
b/arch/x86/include/asm/hyperv-tlfs.h
index 606f5cc579b2..68826fbf92ca 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -262,6 +262,11 @@ enum hv_isolation_type {
 #define HV_X64_MSR_TIME_REF_COUNT   

[Resend RFC PATCH V4 02/13] x86/HV: Initialize shared memory boundary in the Isolation VM.

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V exposes the shared memory boundary via the cpuid leaf
HYPERV_CPUID_ISOLATION_CONFIG and stores it in the
shared_gpa_boundary field of the ms_hyperv struct. This prepares
for sharing memory with the host in SNP guests.
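
The boundary is two to the power of the reported bit count, so a
host-accessible alias of a guest physical address can be computed as in
this sketch (pa is a hypothetical guest physical address):

	u64 boundary = 1ULL << ms_hyperv.shared_gpa_boundary_bits;
	u64 shared_pa = pa + boundary;	/* lands above the boundary */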

Signed-off-by: Tianyu Lan 
---
 arch/x86/kernel/cpu/mshyperv.c |  2 ++
 include/asm-generic/mshyperv.h | 12 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 10b2a8c10cb6..8aed689db621 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -334,6 +334,8 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.priv_high & HV_ISOLATION) {
ms_hyperv.isolation_config_a = 
cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
ms_hyperv.isolation_config_b = 
cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
+   ms_hyperv.shared_gpa_boundary =
+   (u64)1 << ms_hyperv.shared_gpa_boundary_bits;
 
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 
0x%x\n",
ms_hyperv.isolation_config_a, 
ms_hyperv.isolation_config_b);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 3ae56a29594f..2914e27b0429 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -34,8 +34,18 @@ struct ms_hyperv_info {
u32 max_vp_index;
u32 max_lp_index;
u32 isolation_config_a;
-   u32 isolation_config_b;
+   union {
+   u32 isolation_config_b;
+   struct {
+   u32 cvm_type : 4;
+   u32 Reserved11 : 1;
+   u32 shared_gpa_boundary_active : 1;
+   u32 shared_gpa_boundary_bits : 6;
+   u32 Reserved12 : 20;
+   };
+   };
void  __percpu **ghcb_base;
+   u64 shared_gpa_boundary;
 };
 extern struct ms_hyperv_info ms_hyperv;
 
-- 
2.25.1



[Resend RFC PATCH V4 01/13] x86/HV: Initialize GHCB page in Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V exposes the GHCB page via the SEV-ES GHCB MSR for an SNP
guest to communicate with the hypervisor. Map the GHCB page for all
cpus to read/write MSR registers and submit hvcall requests
via the GHCB.

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/hv_init.c   | 64 ++---
 arch/x86/include/asm/mshyperv.h |  2 ++
 include/asm-generic/mshyperv.h  |  2 ++
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index b756b2866deb..e058f725 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -54,6 +55,26 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
 u32 hv_max_vp_index;
 EXPORT_SYMBOL_GPL(hv_max_vp_index);
 
+static int hyperv_init_ghcb(void)
+{
+   u64 ghcb_gpa;
+   void *ghcb_va;
+   void **ghcb_base;
+
+   if (!ms_hyperv.ghcb_base)
+   return -EINVAL;
+
+   rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
+   ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
+   if (!ghcb_va)
+   return -ENOMEM;
+
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   *ghcb_base = ghcb_va;
+
+   return 0;
+}
+
 static int hv_cpu_init(unsigned int cpu)
 {
u64 msr_vp_index;
@@ -106,6 +127,8 @@ static int hv_cpu_init(unsigned int cpu)
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, val);
}
 
+   hyperv_init_ghcb();
+
return 0;
 }
 
@@ -201,6 +224,7 @@ static int hv_cpu_die(unsigned int cpu)
unsigned long flags;
void **input_arg;
void *pg;
+   void **ghcb_va = NULL;
 
local_irq_save(flags);
input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -214,6 +238,13 @@ static int hv_cpu_die(unsigned int cpu)
*output_arg = NULL;
}
 
+   if (ms_hyperv.ghcb_base) {
+   ghcb_va = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   if (*ghcb_va)
+   memunmap(*ghcb_va);
+   *ghcb_va = NULL;
+   }
+
local_irq_restore(flags);
 
free_pages((unsigned long)pg, hv_root_partition ? 1 : 0);
@@ -410,9 +441,22 @@ void __init hyperv_init(void)
VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
__builtin_return_address(0));
-   if (hv_hypercall_pg == NULL) {
-   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
-   goto remove_cpuhp_state;
+   if (hv_hypercall_pg == NULL)
+   goto clean_guest_os_id;
+
+   if (hv_isolation_type_snp()) {
+   ms_hyperv.ghcb_base = alloc_percpu(void *);
+   if (!ms_hyperv.ghcb_base)
+   goto clean_guest_os_id;
+
+   if (hyperv_init_ghcb()) {
+   free_percpu(ms_hyperv.ghcb_base);
+   ms_hyperv.ghcb_base = NULL;
+   goto clean_guest_os_id;
+   }
+
+   /* Hyper-V requires the guest OS ID to be written via GHCB in SNP IVM. */
+   hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
}
 
rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
@@ -473,7 +517,8 @@ void __init hyperv_init(void)
hv_query_ext_cap(0);
return;
 
-remove_cpuhp_state:
+clean_guest_os_id:
+   wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
cpuhp_remove_state(cpuhp);
 free_vp_assist_page:
kfree(hv_vp_assist_page);
@@ -502,6 +547,9 @@ void hyperv_cleanup(void)
 */
hv_hypercall_pg = NULL;
 
+   if (ms_hyperv.ghcb_base)
+   free_percpu(ms_hyperv.ghcb_base);
+
/* Reset the hypercall page */
hypercall_msr.as_uint64 = 0;
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
@@ -586,6 +634,14 @@ bool hv_is_isolation_supported(void)
 }
 EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
 
+DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
+
+bool hv_isolation_type_snp(void)
+{
+   return static_branch_unlikely(&isolation_type_snp);
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
+
 /* Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx */
 bool hv_query_ext_cap(u64 cap_query)
 {
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 67ff0d637e55..aeacca7c4da8 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 
+DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
+
 typedef int (*hyperv_fill_flush_list_func)(
struct hv_guest_mapping_flush_list *flush,
void *data);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 9a000ba2bb75..3ae56a29594f 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -35,6 +35,7 @@ struct ms_hyperv_info {
u32 

[Resend RFC PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.

The memory of these VMs is encrypted and the host can't access guest
memory directly. Hyper-V provides a new host visibility hvcall, and
the guest needs to call it to mark memory visible to the host before
sharing that memory with the host. For security, network/storage
stack memory should not be shared with the host, so bounce buffers
are required.

The vmbus channel ring buffer already plays the bounce buffer role
because all data to/from the host is copied between the ring buffer
and IO stack memory. So mark the vmbus channel ring buffer visible.

There are two exceptions - packets sent by vmbus_sendpacket_pagebuffer()
and vmbus_sendpacket_mpb_desc(). These packets contain IO stack memory
addresses that the host will access directly. So add bounce buffer
allocation support in vmbus for these packets.

For SNP Isolation VM, the guest needs to access shared memory via an
extra address space which is specified by the Hyper-V CPUID leaf
HYPERV_CPUID_ISOLATION_CONFIG. The physical address used to access
the shared memory is the bounce buffer memory GPA plus the
shared_gpa_boundary reported by CPUID.
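
In sketch form (bounce_pa is a hypothetical bounce buffer GPA):

	/* GPA the guest must use so the host can access the data. */
	u64 access_pa = bounce_pa + ms_hyperv.shared_gpa_boundary;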

Change since v3:
   - Add interface set_memory_decrypted_map() to decrypt memory and
 map bounce buffer in extra address space 
   - Remove swiotlb remap function and store the remap address
 returned by set_memory_decrypted_map() in swiotlb mem data structure.
   - Introduce hv_set_mem_enc() to make code more readable in the 
__set_memory_enc_dec().

Change since v2:
   - Remove not UIO driver in Isolation VM patch
   - Use vmap_pfn() to replace the ioremap_page_range() function in
   order to avoid exposing the symbol ioremap_page_range()
   - Call hv set mem host visibility hvcall in 
set_memory_encrypted/decrypted()
   - Enable swiotlb force mode instead of adding Hyper-V dma map/unmap hook
   - Fix code style

Tianyu Lan (13):
  x86/HV: Initialize GHCB page in Isolation VM
  x86/HV: Initialize shared memory boundary in the Isolation VM.
  x86/HV: Add new hvcall guest address host visibility support
  HV: Mark vmbus ring buffer visible to host in Isolation VM
  HV: Add Write/Read MSR registers via ghcb page
  HV: Add ghcb hvcall support for SNP VM
  HV/Vmbus: Add SNP support for VMbus channel initiate message
  HV/Vmbus: Initialize VMbus ring buffer for Isolation VM
  x86/Swiotlb/HV: Add Swiotlb bounce buffer remap function for HV IVM
  HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM
  HV/Netvsc: Add Isolation VM support for netvsc driver
  HV/Storvsc: Add Isolation VM support for storvsc driver
  x86/HV: Not set memory decrypted/encrypted during kexec alloc/free
page in IVM

 arch/x86/hyperv/Makefile   |   2 +-
 arch/x86/hyperv/hv_init.c  |  25 +--
 arch/x86/hyperv/ivm.c  | 299 +
 arch/x86/include/asm/hyperv-tlfs.h |  18 ++
 arch/x86/include/asm/mshyperv.h|  84 +++-
 arch/x86/include/asm/set_memory.h  |   2 +
 arch/x86/include/asm/sev-es.h  |   4 +
 arch/x86/kernel/cpu/mshyperv.c |   5 +
 arch/x86/kernel/machine_kexec_64.c |   5 +-
 arch/x86/kernel/sev-es-shared.c|  21 +-
 arch/x86/mm/pat/set_memory.c   |  34 +++-
 arch/x86/xen/pci-swiotlb-xen.c |   3 +-
 drivers/hv/Kconfig |   1 +
 drivers/hv/channel.c   |  48 -
 drivers/hv/connection.c|  71 ++-
 drivers/hv/hv.c| 129 +
 drivers/hv/hyperv_vmbus.h  |   3 +
 drivers/hv/ring_buffer.c   |  84 ++--
 drivers/hv/vmbus_drv.c |   3 +
 drivers/iommu/hyperv-iommu.c   |  62 ++
 drivers/net/hyperv/hyperv_net.h|   6 +
 drivers/net/hyperv/netvsc.c| 144 +-
 drivers/net/hyperv/rndis_filter.c  |   2 +
 drivers/scsi/storvsc_drv.c |  68 ++-
 include/asm-generic/hyperv-tlfs.h  |   1 +
 include/asm-generic/mshyperv.h |  53 -
 include/linux/hyperv.h |  16 ++
 include/linux/swiotlb.h|   4 +
 kernel/dma/swiotlb.c   |  11 +-
 29 files changed, 1097 insertions(+), 111 deletions(-)
 create mode 100644 arch/x86/hyperv/ivm.c

-- 
2.25.1



[RFC PATCH V4 12/12] x86/HV: Not set memory decrypted/encrypted during kexec alloc/free page in IVM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V Isolation VM reuses the set_memory_decrypted/encrypted functions
and does not need to decrypt/encrypt memory in arch_kexec_post_alloc_pages()
and arch_kexec_pre_free_pages(), just like an AMD SEV VM. So skip them.

Signed-off-by: Tianyu Lan 
---
 arch/x86/kernel/machine_kexec_64.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c 
b/arch/x86/kernel/machine_kexec_64.c
index c078b0d3ab0e..0cadc64b6873 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_ACPI
 /*
@@ -598,7 +599,7 @@ void arch_kexec_unprotect_crashkres(void)
  */
 int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, gfp_t gfp)
 {
-   if (sev_active())
+   if (sev_active() || hv_is_isolation_supported())
return 0;
 
/*
@@ -611,7 +612,7 @@ int arch_kexec_post_alloc_pages(void *vaddr, unsigned int 
pages, gfp_t gfp)
 
 void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
 {
-   if (sev_active())
+   if (sev_active() || hv_is_isolation_supported())
return;
 
/*
-- 
2.25.1



[RFC PATCH V4 11/12] HV/Storvsc: Add Isolation VM support for storvsc driver

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already
done this for the storvsc rx/tx ring buffers. The page buffers used
by vmbus_sendpacket_mpb_desc() still need to be handled. Use the DMA
API to map/unmap this memory when sending/receiving packets; the
Hyper-V DMA ops callback will use swiotlb to allocate a bounce buffer
and copy data from/to the bounce buffer.
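
The per-Hyper-V-page mapping on the send side then looks roughly like
this sketch (dev, hvpfn, offset and len are hypothetical locals; the
hv_dma_range fields are the ones this patch adds):

	dma_addr_t dma;

	dma = dma_map_page(dev, pfn_to_page(hvpfn), offset, len,
			   DMA_TO_DEVICE);
	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* Record the handle so the completion path can unmap it. */
	cmd_request->dma_range[i].dma = dma;
	cmd_request->dma_range[i].mapping_size = len;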

Signed-off-by: Tianyu Lan 
---
 drivers/scsi/storvsc_drv.c | 68 +++---
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 403753929320..cc9cb32f6621 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -427,6 +429,8 @@ struct storvsc_cmd_request {
u32 payload_sz;
 
struct vstor_packet vstor_packet;
+   u32 hvpg_count;
+   struct hv_dma_range *dma_range;
 };
 
 
@@ -509,6 +513,14 @@ struct storvsc_scan_work {
u8 tgt_id;
 };
 
+#define storvsc_dma_map(dev, page, offset, size, dir) \
+   dma_map_page(dev, page, offset, size, dir)
+
+#define storvsc_dma_unmap(dev, dma_range, dir) \
+   dma_unmap_page(dev, dma_range.dma,  \
+  dma_range.mapping_size,  \
+  dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)
+
 static void storvsc_device_scan(struct work_struct *work)
 {
struct storvsc_scan_work *wrk;
@@ -1267,6 +1279,7 @@ static void storvsc_on_channel_callback(void *context)
struct hv_device *device;
struct storvsc_device *stor_device;
struct Scsi_Host *shost;
+   int i;
 
if (channel->primary_channel != NULL)
device = channel->primary_channel->device_obj;
@@ -1321,6 +1334,15 @@ static void storvsc_on_channel_callback(void *context)
request = (struct storvsc_cmd_request 
*)scsi_cmd_priv(scmnd);
}
 
+   if (request->dma_range) {
+   for (i = 0; i < request->hvpg_count; i++)
+   storvsc_dma_unmap(&device->device,
+   request->dma_range[i],
+   
request->vstor_packet.vm_srb.data_in == READ_TYPE);
+
+   kfree(request->dma_range);
+   }
+
storvsc_on_receive(stor_device, packet, request);
continue;
}
@@ -1817,7 +1839,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, 
struct scsi_cmnd *scmnd)
unsigned int hvpgoff, hvpfns_to_add;
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
+   dma_addr_t dma;
u64 hvpfn;
+   u32 size;
 
if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {
 
@@ -1831,6 +1855,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, 
struct scsi_cmnd *scmnd)
payload->range.len = length;
payload->range.offset = offset_in_hvpg;
 
+   cmd_request->dma_range = kcalloc(hvpg_count,
+sizeof(*cmd_request->dma_range),
+GFP_ATOMIC);
+   if (!cmd_request->dma_range) {
+   ret = -ENOMEM;
+   goto free_payload;
+   }
 
for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
/*
@@ -1854,9 +1885,29 @@ static int storvsc_queuecommand(struct Scsi_Host *host, 
struct scsi_cmnd *scmnd)
 * last sgl should be reached at the same time that
 * the PFN array is filled.
 */
-   while (hvpfns_to_add--)
-   payload->range.pfn_array[i++] = hvpfn++;
+   while (hvpfns_to_add--) {
+   size = min(HV_HYP_PAGE_SIZE - offset_in_hvpg,
+  (unsigned long)length);
+   dma = storvsc_dma_map(&device->device,
pfn_to_page(hvpfn++),
+ offset_in_hvpg, size,
+ scmnd->sc_data_direction);
+   if (dma_mapping_error(&device->device, dma)) {
+   ret = -ENOMEM;
+   goto free_dma_range;
+   }
+
+   if (offset_in_hvpg) {
+   payload->range.offset = dma & 
~HV_HYP_PAGE_MASK;
+   

[RFC PATCH V4 10/12] HV/Netvsc: Add Isolation VM support for netvsc driver

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM, all memory shared with the host needs to be marked
visible to the host via hvcall. vmbus_establish_gpadl() has already
done this for the netvsc rx/tx ring buffers. The page buffers used
by vmbus_sendpacket_pagebuffer() still need to be handled. Use the
DMA API to map/unmap this memory when sending/receiving packets; the
Hyper-V DMA ops callback will use swiotlb to allocate a bounce buffer
and copy data from/to the bounce buffer.
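
A plausible shape for the unmap helper declared in hyperv_net.h below
(illustrative only; the real body falls in the truncated part of this
patch):

void netvsc_dma_unmap(struct hv_device *hv_dev,
		      struct hv_netvsc_packet *packet)
{
	int i;

	if (!packet->dma_range)
		return;

	/* Release each per-page mapping created on the send path. */
	for (i = 0; i < packet->page_buf_cnt; i++)
		dma_unmap_single(&hv_dev->device, packet->dma_range[i].dma,
				 packet->dma_range[i].mapping_size,
				 DMA_TO_DEVICE);

	kfree(packet->dma_range);
}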

Signed-off-by: Tianyu Lan 
---
 drivers/net/hyperv/hyperv_net.h   |   6 ++
 drivers/net/hyperv/netvsc.c   | 144 +-
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/linux/hyperv.h|   5 ++
 4 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index b11aa68b44ec..c2fbb9d4df2c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -164,6 +164,7 @@ struct hv_netvsc_packet {
u32 total_bytes;
u32 send_buf_index;
u32 total_data_buflen;
+   struct hv_dma_range *dma_range;
 };
 
 #define NETVSC_HASH_KEYLEN 40
@@ -1074,6 +1075,7 @@ struct netvsc_device {
 
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
+   void *recv_original_buf;
u32 recv_buf_size; /* allocated bytes */
u32 recv_buf_gpadl_handle;
u32 recv_section_cnt;
@@ -1082,6 +1084,8 @@ struct netvsc_device {
 
/* Send buffer allocated by us */
void *send_buf;
+   void *send_original_buf;
+   u32 send_buf_size;
u32 send_buf_gpadl_handle;
u32 send_section_cnt;
u32 send_section_size;
@@ -1729,4 +1733,6 @@ struct rndis_message {
 #define RETRY_US_HI1
 #define RETRY_MAX  2000/* >10 sec */
 
+void netvsc_dma_unmap(struct hv_device *hv_dev,
+ struct hv_netvsc_packet *packet);
 #endif /* _HYPERV_NET_H */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 7bd935412853..fc312e5db4d5 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -153,8 +153,21 @@ static void free_netvsc_device(struct rcu_head *head)
int i;
 
kfree(nvdev->extension);
-   vfree(nvdev->recv_buf);
-   vfree(nvdev->send_buf);
+
+   if (nvdev->recv_original_buf) {
+   vunmap(nvdev->recv_buf);
+   vfree(nvdev->recv_original_buf);
+   } else {
+   vfree(nvdev->recv_buf);
+   }
+
+   if (nvdev->send_original_buf) {
+   vunmap(nvdev->send_buf);
+   vfree(nvdev->send_original_buf);
+   } else {
+   vfree(nvdev->send_buf);
+   }
+
kfree(nvdev->send_section_map);
 
for (i = 0; i < VRSS_CHANNEL_MAX; i++) {
@@ -330,6 +343,27 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device 
*net_device, u32 q_idx)
return nvchan->mrc.slots ? 0 : -ENOMEM;
 }
 
+static void *netvsc_remap_buf(void *buf, unsigned long size)
+{
+   unsigned long *pfns;
+   void *vaddr;
+   int i;
+
+   pfns = kcalloc(size / HV_HYP_PAGE_SIZE, sizeof(unsigned long),
+  GFP_KERNEL);
+   if (!pfns)
+   return NULL;
+
+   for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)
+   pfns[i] = virt_to_hvpfn(buf + i * HV_HYP_PAGE_SIZE)
+   + (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
+
+   vaddr = vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE, PAGE_KERNEL_IO);
+   kfree(pfns);
+
+   return vaddr;
+}
+
 static int netvsc_init_buf(struct hv_device *device,
   struct netvsc_device *net_device,
   const struct netvsc_device_info *device_info)
@@ -340,6 +374,7 @@ static int netvsc_init_buf(struct hv_device *device,
unsigned int buf_size;
size_t map_words;
int i, ret = 0;
+   void *vaddr;
 
/* Get receive buffer area. */
buf_size = device_info->recv_sections * device_info->recv_section_size;
@@ -375,6 +410,15 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
 
+   if (hv_isolation_type_snp()) {
+   vaddr = netvsc_remap_buf(net_device->recv_buf, buf_size);
+   if (!vaddr)
+   goto cleanup;
+
+   net_device->recv_original_buf = net_device->recv_buf;
+   net_device->recv_buf = vaddr;
+   }
+
/* Notify the NetVsp of the gpadl handle */
init_packet = _device->channel_init_pkt;
memset(init_packet, 0, sizeof(struct nvsp_message));
@@ -477,6 +521,15 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
}
 
+   if (hv_isolation_type_snp()) {
+   vaddr = netvsc_remap_buf(net_device->send_buf, buf_size);
+   if (!vaddr)
+   goto cleanup;
+
+   net_device->send_original_buf = 

[RFC PATCH V4 09/12] HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V Isolation VM requires bounce buffer support to copy
data from/to encrypted memory, so enable swiotlb force
mode to use the swiotlb bounce buffer for DMA transactions.

In Isolation VM with AMD SEV, the bounce buffer needs to be
accessed via an extra address space above shared_gpa_boundary
(e.g. the 39-bit address line) reported by Hyper-V CPUID
ISOLATION_CONFIG. The physical address used for access is the
original physical address plus shared_gpa_boundary. In the AMD
SEV-SNP spec, shared_gpa_boundary is called the virtual top of
memory (vTOM). Memory addresses below vTOM are automatically
treated as private while memory above vTOM is treated as shared.

The swiotlb bounce buffer code calls set_memory_decrypted_map()
to mark the bounce buffer visible to the host and map it in the
extra address space.

Hyper-V initializes its own swiotlb bounce buffer, so the default
swiotlb needs to be disabled. pci_swiotlb_detect_override() and
pci_swiotlb_detect_4gb() enable the default one. To override
the setting, hyperv_swiotlb_detect() needs to run before
these detect functions, which depend on pci_xen_swiotlb_init().
Make pci_xen_swiotlb_init() depend on hyperv_swiotlb_detect()
to keep the order.

The map function vmap_pfn() can't work in the early
hyperv_iommu_swiotlb_init(), so initialize the swiotlb bounce
buffer in hyperv_iommu_swiotlb_later_init().
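
The initialization is therefore split in two, roughly as in this sketch
(the later-init body is illustrative, since that hunk is truncated in
this archive; set_memory_decrypted_map() comes from the swiotlb remap
patch in this series):

	/* Early boot: reserve contiguous low memory while memblock works. */
	hyperv_io_tlb_start = memblock_alloc_low(PAGE_ALIGN(hyperv_io_tlb_size),
						 HV_HYP_PAGE_SIZE);

	/* Later, once vmap_pfn() works: decrypt/remap the buffer and
	 * register it with swiotlb.
	 */
	hyperv_io_tlb_remap = (void *)set_memory_decrypted_map(
					(unsigned long)hyperv_io_tlb_start,
					hyperv_io_tlb_size);
	swiotlb_late_init_with_tbl(hyperv_io_tlb_remap,
				   hyperv_io_tlb_size >> IO_TLB_SHIFT);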

Signed-off-by: Tianyu Lan 
---
 arch/x86/xen/pci-swiotlb-xen.c |  3 +-
 drivers/hv/vmbus_drv.c |  3 ++
 drivers/iommu/hyperv-iommu.c   | 62 ++
 include/linux/hyperv.h |  1 +
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 54f9aa7e8457..43bd031aa332 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)
 EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
 
 IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,
- NULL,
+ hyperv_swiotlb_detect,
  pci_xen_swiotlb_init,
  NULL);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 92cb3f7d21d9..5e3bb76d4dee 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -2080,6 +2081,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,
return child_device_obj;
 }
 
+static u64 vmbus_dma_mask = DMA_BIT_MASK(64);
 /*
  * vmbus_device_register - Register the child device
  */
@@ -2120,6 +2122,7 @@ int vmbus_device_register(struct hv_device 
*child_device_obj)
}
hv_debug_add_dev_dir(child_device_obj);
 
+   child_device_obj->device.dma_mask = &vmbus_dma_mask;
return 0;
 
 err_kset_unregister:
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e285a220c913..d7ea8e05b991 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -13,14 +13,22 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 #include "irq_remapping.h"
 
@@ -36,6 +44,8 @@
 static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
 static struct irq_domain *ioapic_ir_domain;
 
+static unsigned long hyperv_io_tlb_start, hyperv_io_tlb_size;
+
 static int hyperv_ir_set_affinity(struct irq_data *data,
const struct cpumask *mask, bool force)
 {
@@ -337,4 +347,56 @@ static const struct irq_domain_ops 
hyperv_root_ir_domain_ops = {
.free = hyperv_root_irq_remapping_free,
 };
 
+void __init hyperv_iommu_swiotlb_init(void)
+{
+   unsigned long bytes;
+   void *vstart;
+
+   /*
+* Allocate Hyper-V swiotlb bounce buffer at early place
+* to reserve large contiguous memory.
+*/
+   hyperv_io_tlb_size = 200 * 1024 * 1024;
+   hyperv_io_tlb_start = memblock_alloc_low(PAGE_ALIGN(hyperv_io_tlb_size),
+HV_HYP_PAGE_SIZE);
+
+   if (!hyperv_io_tlb_start) {
+   pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");
+   return;
+   }
+}
+
+int __init hyperv_swiotlb_detect(void)
+{
+   if (hypervisor_is_type(X86_HYPER_MS_HYPERV)
+   && hv_is_isolation_supported()) {
+   /*
+* Enable swiotlb force mode in Isolation VM to
+* use swiotlb bounce buffer for dma transaction.
+*/
+   swiotlb_force = SWIOTLB_FORCE;
+   return 1;
+   }
+
+   return 0;
+}
+
+void __init hyperv_iommu_swiotlb_later_init(void)
+{
+   void *hyperv_io_tlb_remap;
+   int ret;
+
+   /*
+* Swiotlb bounce buffer needs to be mapped in extra address
+* space. Map function 

[RFC PATCH V4 08/12] x86/Swiotlb/HV: Add Swiotlb bounce buffer remap function for HV IVM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

In Isolation VM with AMD SEV, the bounce buffer needs to be accessed
via an extra address space above shared_gpa_boundary
(e.g. the 39-bit address line) reported by Hyper-V CPUID
ISOLATION_CONFIG. The physical address used for access is the
original physical address plus shared_gpa_boundary. In the AMD
SEV-SNP spec, shared_gpa_boundary is called the virtual top of
memory (vTOM). Memory addresses below vTOM are automatically
treated as private while memory above vTOM is treated as shared.

Introduce a set_memory_decrypted_map() function to decrypt memory and
remap it via a platform callback. Use set_memory_decrypted_map() in
the swiotlb code, store the remap address returned by the new API,
and use that remap address to copy data from/to the swiotlb bounce
buffer.
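
Call sites then follow this shape (sketch; vstart and bytes are
hypothetical):

	unsigned long vstart_remap;

	/* Decrypt the pages and get their alias above shared_gpa_boundary. */
	vstart_remap = set_memory_decrypted_map((unsigned long)vstart, bytes);
	if (!vstart_remap)
		return -ENOMEM;

	/* ... bounce copies go through vstart_remap ... */

	/* Teardown: unmap the alias, then re-encrypt the pages. */
	set_memory_encrypted_unmap((unsigned long)vstart, bytes);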

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/ivm.c | 30 ++
 arch/x86/include/asm/mshyperv.h   |  2 ++
 arch/x86/include/asm/set_memory.h |  2 ++
 arch/x86/mm/pat/set_memory.c  | 28 
 include/linux/swiotlb.h   |  4 
 kernel/dma/swiotlb.c  | 11 ---
 6 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 8a6f4e9e3d6c..ea33935e0c17 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -267,3 +267,33 @@ int hv_set_mem_enc(unsigned long addr, int numpages, bool 
enc)
enc ? VMBUS_PAGE_NOT_VISIBLE
: VMBUS_PAGE_VISIBLE_READ_WRITE);
 }
+
+/*
+ * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.
+ */
+unsigned long hv_map_memory(unsigned long addr, unsigned long size)
+{
+   unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,
+ sizeof(unsigned long),
+  GFP_KERNEL);
+   unsigned long vaddr;
+   int i;
+
+   if (!pfns)
+   return (unsigned long)NULL;
+
+   for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)
+   pfns[i] = virt_to_hvpfn((void *)addr + i * HV_HYP_PAGE_SIZE) +
+   (ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);
+
+   vaddr = (unsigned long)vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE,
+   PAGE_KERNEL_IO);
+   kfree(pfns);
+
+   return vaddr;
+}
+
+void hv_unmap_memory(unsigned long addr)
+{
+   vunmap((void *)addr);
+}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index fe03e3e833ac..ba3cb9e32fdb 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -253,6 +253,8 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int 
vcpu, int vector,
 int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
 int hv_mark_gpa_visibility(u16 count, const u64 pfn[], u32 visibility);
 int hv_set_mem_enc(unsigned long addr, int numpages, bool enc);
+unsigned long hv_map_memory(unsigned long addr, unsigned long size);
+void hv_unmap_memory(unsigned long addr);
 void hv_sint_wrmsrl_ghcb(u64 msr, u64 value);
 void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
 void hv_signal_eom_ghcb(void);
diff --git a/arch/x86/include/asm/set_memory.h 
b/arch/x86/include/asm/set_memory.h
index 43fa081a1adb..7a2117931830 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,6 +49,8 @@ int set_memory_decrypted(unsigned long addr, int numpages);
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
+unsigned long set_memory_decrypted_map(unsigned long addr, unsigned long size);
+int set_memory_encrypted_unmap(unsigned long addr, unsigned long size);
 
 int set_pages_array_uc(struct page **pages, int addrinarray);
 int set_pages_array_wc(struct page **pages, int addrinarray);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 6cc83c57383d..5d4d3963f4a2 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2039,6 +2039,34 @@ int set_memory_decrypted(unsigned long addr, int 
numpages)
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
+static unsigned long __map_memory(unsigned long addr, unsigned long size)
+{
+   if (hv_is_isolation_supported())
+   return hv_map_memory(addr, size);
+
+   return addr;
+}
+
+static void __unmap_memory(unsigned long addr)
+{
+   if (hv_is_isolation_supported())
+   hv_unmap_memory(addr);
+}
+
+unsigned long set_memory_decrypted_map(unsigned long addr, unsigned long size)
+{
+   if (__set_memory_enc_dec(addr, size / PAGE_SIZE, false))
+   return (unsigned long)NULL;
+
+   return __map_memory(addr, size);
+}
+
+int set_memory_encrypted_unmap(unsigned long addr, unsigned long size)
+{
+   __unmap_memory(addr);
+   return __set_memory_enc_dec(addr, size / PAGE_SIZE, true);
+}
+
 int 

[RFC PATCH V4 07/12] HV/Vmbus: Initialize VMbus ring buffer for Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

VMbus ring buffers are shared with the host and need to
be accessed via the extra address space of an Isolation VM
with SNP support. This patch maps the ring buffer
addresses into the extra address space via vmap_pfn() (see
the code below). The HV host visibility hvcall smears data
in the ring buffer, so reset the ring buffer memory to zero
after calling the visibility hvcall.
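
The ordering matters: remap first, then scrub. In sketch form (mirroring
hv_ringbuffer_post_init() below):

	/* Map the ring pages at their alias above shared_gpa_boundary. */
	vaddr = vmap_pfn(pfns_wraparound, page_cnt * 2 - 1, PAGE_KERNEL_IO);

	/* Clear whatever the host visibility hvcall smeared into the ring. */
	memset(vaddr, 0x00, page_cnt * PAGE_SIZE);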

Signed-off-by: Tianyu Lan 
---
 drivers/hv/Kconfig|  1 +
 drivers/hv/channel.c  | 10 +
 drivers/hv/hyperv_vmbus.h |  2 +
 drivers/hv/ring_buffer.c  | 84 ++-
 4 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 66c794d92391..a8386998be40 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -7,6 +7,7 @@ config HYPERV
depends on X86 && ACPI && X86_LOCAL_APIC && HYPERVISOR_GUEST
select PARAVIRT
select X86_HV_CALLBACK_VECTOR
+   select VMAP_PFN
help
  Select this option to run Linux as a Hyper-V client operating
  system.
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 01048bb07082..7350da9dbe97 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -707,6 +707,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,
if (err)
goto error_clean_ring;
 
+   err = hv_ringbuffer_post_init(&newchannel->outbound,
+ page, send_pages);
+   if (err)
+   goto error_free_gpadl;
+
+   err = hv_ringbuffer_post_init(&newchannel->inbound,
+ &page[send_pages], recv_pages);
+   if (err)
+   goto error_free_gpadl;
+
/* Create and init the channel open message */
open_info = kzalloc(sizeof(*open_info) +
   sizeof(struct vmbus_channel_open_channel),
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 40bc0eff6665..15cd23a561f3 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -172,6 +172,8 @@ extern int hv_synic_cleanup(unsigned int cpu);
 /* Interface */
 
 void hv_ringbuffer_pre_init(struct vmbus_channel *channel);
+int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
+   struct page *pages, u32 page_cnt);
 
 int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
   struct page *pages, u32 pagecnt, u32 max_pkt_size);
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 2aee356840a2..d4f93fca1108 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "hyperv_vmbus.h"
 
@@ -179,43 +181,89 @@ void hv_ringbuffer_pre_init(struct vmbus_channel *channel)
mutex_init(&channel->outbound.ring_buffer_mutex);
 }
 
-/* Initialize the ring buffer. */
-int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
-  struct page *pages, u32 page_cnt, u32 max_pkt_size)
+int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,
+  struct page *pages, u32 page_cnt)
 {
+   u64 physic_addr = page_to_pfn(pages) << PAGE_SHIFT;
+   unsigned long *pfns_wraparound;
+   void *vaddr;
int i;
-   struct page **pages_wraparound;
 
-   BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
+   if (!hv_isolation_type_snp())
+   return 0;
+
+   physic_addr += ms_hyperv.shared_gpa_boundary;
 
/*
 * First page holds struct hv_ring_buffer, do wraparound mapping for
 * the rest.
 */
-   pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
+   pfns_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(unsigned long),
   GFP_KERNEL);
-   if (!pages_wraparound)
+   if (!pfns_wraparound)
return -ENOMEM;
 
-   pages_wraparound[0] = pages;
+   pfns_wraparound[0] = physic_addr >> PAGE_SHIFT;
for (i = 0; i < 2 * (page_cnt - 1); i++)
-   pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];
-
-   ring_info->ring_buffer = (struct hv_ring_buffer *)
-   vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);
-
-   kfree(pages_wraparound);
+   pfns_wraparound[i + 1] = (physic_addr >> PAGE_SHIFT) +
+   i % (page_cnt - 1) + 1;
 
-
-   if (!ring_info->ring_buffer)
+   vaddr = vmap_pfn(pfns_wraparound, page_cnt * 2 - 1, PAGE_KERNEL_IO);
+   kfree(pfns_wraparound);
+   if (!vaddr)
return -ENOMEM;
 
-   ring_info->ring_buffer->read_index =
-   ring_info->ring_buffer->write_index = 0;
+   /* Clean memory after setting host visibility. */
+   memset((void *)vaddr, 0x00, page_cnt * PAGE_SIZE);
+
+   ring_info->ring_buffer = (struct hv_ring_buffer *)vaddr;
+   ring_info->ring_buffer->read_index = 0;
+   

[RFC PATCH V4 06/12] HV/Vmbus: Add SNP support for VMbus channel initiate message

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

The monitor pages in the CHANNELMSG_INITIATE_CONTACT msg are shared
with the host in Isolation VM, so it's necessary to use an hvcall to
set them visible to the host. In Isolation VM with AMD SEV-SNP, the
access address should be in the extra address space above the shared
gpa boundary. So remap these pages to the extra address
(pa + shared_gpa_boundary).

Signed-off-by: Tianyu Lan 
---
 drivers/hv/connection.c   | 65 +++
 drivers/hv/hyperv_vmbus.h |  1 +
 2 files changed, 66 insertions(+)

diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 186fd4c8acd4..a32bde143e4c 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "hyperv_vmbus.h"
@@ -104,6 +105,12 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
*msginfo, u32 version)
 
msg->monitor_page1 = virt_to_phys(vmbus_connection.monitor_pages[0]);
msg->monitor_page2 = virt_to_phys(vmbus_connection.monitor_pages[1]);
+
+   if (hv_is_isolation_supported()) {
+   msg->monitor_page1 += ms_hyperv.shared_gpa_boundary;
+   msg->monitor_page2 += ms_hyperv.shared_gpa_boundary;
+   }
+
msg->target_vcpu = hv_cpu_number_to_vp_number(VMBUS_CONNECT_CPU);
 
/*
@@ -148,6 +155,31 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo 
*msginfo, u32 version)
return -ECONNREFUSED;
}
 
+   if (hv_is_isolation_supported()) {
+   vmbus_connection.monitor_pages_va[0]
+   = vmbus_connection.monitor_pages[0];
+   vmbus_connection.monitor_pages[0]
+   = memremap(msg->monitor_page1, HV_HYP_PAGE_SIZE,
+  MEMREMAP_WB);
+   if (!vmbus_connection.monitor_pages[0])
+   return -ENOMEM;
+
+   vmbus_connection.monitor_pages_va[1]
+   = vmbus_connection.monitor_pages[1];
+   vmbus_connection.monitor_pages[1]
+   = memremap(msg->monitor_page2, HV_HYP_PAGE_SIZE,
+  MEMREMAP_WB);
+   if (!vmbus_connection.monitor_pages[1]) {
+   memunmap(vmbus_connection.monitor_pages[0]);
+   return -ENOMEM;
+   }
+
+   memset(vmbus_connection.monitor_pages[0], 0x00,
+  HV_HYP_PAGE_SIZE);
+   memset(vmbus_connection.monitor_pages[1], 0x00,
+  HV_HYP_PAGE_SIZE);
+   }
+
return ret;
 }
 
@@ -159,6 +191,7 @@ int vmbus_connect(void)
struct vmbus_channel_msginfo *msginfo = NULL;
int i, ret = 0;
__u32 version;
+   u64 pfn[2];
 
/* Initialize the vmbus connection */
vmbus_connection.conn_state = CONNECTING;
@@ -216,6 +249,16 @@ int vmbus_connect(void)
goto cleanup;
}
 
+   if (hv_is_isolation_supported()) {
+   pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
+   pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
+   if (hv_mark_gpa_visibility(2, pfn,
+   VMBUS_PAGE_VISIBLE_READ_WRITE)) {
+   ret = -EFAULT;
+   goto cleanup;
+   }
+   }
+
msginfo = kzalloc(sizeof(*msginfo) +
  sizeof(struct vmbus_channel_initiate_contact),
  GFP_KERNEL);
@@ -282,6 +325,8 @@ int vmbus_connect(void)
 
 void vmbus_disconnect(void)
 {
+   u64 pfn[2];
+
/*
 * First send the unload request to the host.
 */
@@ -301,6 +346,26 @@ void vmbus_disconnect(void)
vmbus_connection.int_page = NULL;
}
 
+   if (hv_is_isolation_supported()) {
+   if (vmbus_connection.monitor_pages_va[0]) {
+   memunmap(vmbus_connection.monitor_pages[0]);
+   vmbus_connection.monitor_pages[0]
+   = vmbus_connection.monitor_pages_va[0];
+   vmbus_connection.monitor_pages_va[0] = NULL;
+   }
+
+   if (vmbus_connection.monitor_pages_va[1]) {
+   memunmap(vmbus_connection.monitor_pages[1]);
+   vmbus_connection.monitor_pages[1]
+   = vmbus_connection.monitor_pages_va[1];
+   vmbus_connection.monitor_pages_va[1] = NULL;
+   }
+
+   pfn[0] = virt_to_hvpfn(vmbus_connection.monitor_pages[0]);
+   pfn[1] = virt_to_hvpfn(vmbus_connection.monitor_pages[1]);
+   hv_mark_gpa_visibility(2, pfn, VMBUS_PAGE_NOT_VISIBLE);
+   }
+
hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[0]);
hv_free_hyperv_page((unsigned long)vmbus_connection.monitor_pages[1]);
   

[RFC PATCH V4 05/12] HV: Add ghcb hvcall support for SNP VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides a ghcb hvcall to handle the VMBus
HVCALL_SIGNAL_EVENT and HVCALL_POST_MESSAGE
messages in SNP Isolation VM. Add such support.

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/ivm.c   | 42 +
 arch/x86/include/asm/mshyperv.h |  1 +
 drivers/hv/connection.c |  6 -
 drivers/hv/hv.c |  8 ++-
 include/asm-generic/mshyperv.h  | 29 +++
 5 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index c7b54631ca0d..8a6f4e9e3d6c 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -15,6 +15,48 @@
 #include 
 #include 
 
+u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size)
+{
+   union hv_ghcb *hv_ghcb;
+   void **ghcb_base;
+   unsigned long flags;
+
+   if (!ms_hyperv.ghcb_base)
+   return -EFAULT;
+
+   WARN_ON(in_nmi());
+
+   local_irq_save(flags);
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   hv_ghcb = (union hv_ghcb *)*ghcb_base;
+   if (!hv_ghcb) {
+   local_irq_restore(flags);
+   return -EFAULT;
+   }
+
+   memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
+   hv_ghcb->ghcb.protocol_version = 1;
+   hv_ghcb->ghcb.ghcb_usage = 1;
+
+   hv_ghcb->hypercall.outputgpa = (u64)output;
+   hv_ghcb->hypercall.hypercallinput.asuint64 = 0;
+   hv_ghcb->hypercall.hypercallinput.callcode = control;
+
+   if (input_size)
+   memcpy(hv_ghcb->hypercall.hypercalldata, input, input_size);
+
+   VMGEXIT();
+
+   hv_ghcb->ghcb.ghcb_usage = 0xffffffff;
+   memset(hv_ghcb->ghcb.save.valid_bitmap, 0,
+  sizeof(hv_ghcb->ghcb.save.valid_bitmap));
+
+   local_irq_restore(flags);
+
+   return hv_ghcb->hypercall.hypercalloutput.callstatus;
+}
+EXPORT_SYMBOL_GPL(hv_ghcb_hypercall);
+
 void hv_ghcb_msr_write(u64 msr, u64 value)
 {
union hv_ghcb *hv_ghcb;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f9cc3753040a..fe03e3e833ac 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -258,6 +258,7 @@ void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);
 void hv_signal_eom_ghcb(void);
 void hv_ghcb_msr_write(u64 msr, u64 value);
 void hv_ghcb_msr_read(u64 msr, u64 *value);
+u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size);
 
 #define hv_get_synint_state_ghcb(int_num, val) \
hv_sint_rdmsrl_ghcb(HV_X64_MSR_SINT0 + int_num, val)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 311cd005b3be..186fd4c8acd4 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -445,6 +445,10 @@ void vmbus_set_event(struct vmbus_channel *channel)
 
++channel->sig_events;
 
-   hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
+   if (hv_isolation_type_snp())
+   hv_ghcb_hypercall(HVCALL_SIGNAL_EVENT, &channel->sig_event,
+   NULL, sizeof(u64));
+   else
+   hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
 }
 EXPORT_SYMBOL_GPL(vmbus_set_event);
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 59f7173c4d9f..e5c9fc467893 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -98,7 +98,13 @@ int hv_post_message(union hv_connection_id connection_id,
aligned_msg->payload_size = payload_size;
memcpy((void *)aligned_msg->payload, payload, payload_size);
 
-   status = hv_do_hypercall(HVCALL_POST_MESSAGE, aligned_msg, NULL);
+   if (hv_isolation_type_snp())
+   status = hv_ghcb_hypercall(HVCALL_POST_MESSAGE,
+   (void *)aligned_msg, NULL,
+   sizeof(struct hv_input_post_message));
+   else
+   status = hv_do_hypercall(HVCALL_POST_MESSAGE,
+   aligned_msg, NULL);
 
/* Preemption must remain disabled until after the hypercall
 * so some other thread can't get scheduled onto this cpu and
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index e6d6886faed1..8f6f283fb5b5 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -30,6 +30,35 @@
 
 union hv_ghcb {
struct ghcb ghcb;
+   struct {
+   u64 hypercalldata[509];
+   u64 outputgpa;
+   union {
+   union {
+   struct {
+   u32 callcode: 16;
+   u32 isfast  : 1;
+   u32 reserved1   : 14;
+   u32 isnested: 1;
+   u32 countofelements : 12;
+   u32 reserved2   : 4;
+  

[RFC PATCH V4 04/12] HV: Add Write/Read MSR registers via ghcb page

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides a GHCB protocol to write Synthetic Interrupt
Controller MSR registers in Isolation VM with AMD SEV-SNP,
and these registers are emulated by the hypervisor directly.
Hyper-V requires SINTx MSR registers to be written twice:
first via the GHCB page to communicate with the hypervisor,
and then via the wrmsr instruction to talk with the paravisor,
which runs in VMPL0. The guest OS ID MSR also needs to be set
via GHCB.
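
In sketch form, the double write for a SINTx MSR (proxy_bit stands in
for the SINT proxy flag; the exact constant is in the hunk below):

	/* 1) Tell the hypervisor through the GHCB page. */
	hv_ghcb_msr_write(msr, value);

	/* 2) Tell the VMPL0 paravisor through a normal, trapped wrmsrl. */
	if (msr >= HV_X64_MSR_SINT0 && msr <= HV_X64_MSR_SINT15)
		wrmsrl(msr, value | proxy_bit);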

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/hv_init.c   |  25 +--
 arch/x86/hyperv/ivm.c   | 115 ++
 arch/x86/include/asm/mshyperv.h |  78 +++-
 arch/x86/include/asm/sev-es.h   |   4 ++
 arch/x86/kernel/cpu/mshyperv.c  |   3 +
 arch/x86/kernel/sev-es-shared.c |  21 --
 drivers/hv/hv.c | 121 ++--
 include/asm-generic/mshyperv.h  |  12 +++-
 8 files changed, 308 insertions(+), 71 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e058f725..97d1c774cfce 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -445,7 +445,7 @@ void __init hyperv_init(void)
goto clean_guest_os_id;
 
if (hv_isolation_type_snp()) {
-   ms_hyperv.ghcb_base = alloc_percpu(void *);
+   ms_hyperv.ghcb_base = alloc_percpu(union hv_ghcb __percpu *);
if (!ms_hyperv.ghcb_base)
goto clean_guest_os_id;
 
@@ -539,6 +539,7 @@ void hyperv_cleanup(void)
 
/* Reset our OS id */
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
+   hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
 
/*
 * Reset hypercall page reference before reset the page,
@@ -620,28 +621,6 @@ bool hv_is_hibernation_supported(void)
 }
 EXPORT_SYMBOL_GPL(hv_is_hibernation_supported);
 
-enum hv_isolation_type hv_get_isolation_type(void)
-{
-   if (!(ms_hyperv.priv_high & HV_ISOLATION))
-   return HV_ISOLATION_TYPE_NONE;
-   return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);
-}
-EXPORT_SYMBOL_GPL(hv_get_isolation_type);
-
-bool hv_is_isolation_supported(void)
-{
-   return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
-}
-EXPORT_SYMBOL_GPL(hv_is_isolation_supported);
-
-DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
-
-bool hv_isolation_type_snp(void)
-{
-   return static_branch_unlikely(&isolation_type_snp);
-}
-EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
-
 /* Bit mask of the extended capability to query: see HV_EXT_CAPABILITY_xxx */
 bool hv_query_ext_cap(u64 cap_query)
 {
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 24a58795abd8..c7b54631ca0d 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -6,6 +6,8 @@
  *  Tianyu Lan 
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -13,6 +15,119 @@
 #include 
 #include 
 
+void hv_ghcb_msr_write(u64 msr, u64 value)
+{
+   union hv_ghcb *hv_ghcb;
+   void **ghcb_base;
+   unsigned long flags;
+
+   if (!ms_hyperv.ghcb_base)
+   return;
+
+   WARN_ON(in_nmi());
+
+   local_irq_save(flags);
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   hv_ghcb = (union hv_ghcb *)*ghcb_base;
+   if (!hv_ghcb) {
+   local_irq_restore(flags);
+   return;
+   }
+
+   memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
+
+   ghcb_set_rcx(&hv_ghcb->ghcb, msr);
+   ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));
+   ghcb_set_rdx(&hv_ghcb->ghcb, value >> 32);
+
+   if (sev_es_ghcb_hv_call(&hv_ghcb->ghcb, NULL, SVM_EXIT_MSR, 1, 0))
+   pr_warn("Failed to write msr via ghcb %llx.\n", msr);
+
+   local_irq_restore(flags);
+}
+
+void hv_ghcb_msr_read(u64 msr, u64 *value)
+{
+   union hv_ghcb *hv_ghcb;
+   void **ghcb_base;
+   unsigned long flags;
+
+   if (!ms_hyperv.ghcb_base)
+   return;
+
+   WARN_ON(in_nmi());
+
+   local_irq_save(flags);
+   ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
+   hv_ghcb = (union hv_ghcb *)*ghcb_base;
+   if (!hv_ghcb) {
+   local_irq_restore(flags);
+   return;
+   }
+
+   memset(hv_ghcb, 0x00, HV_HYP_PAGE_SIZE);
+
+   ghcb_set_rcx(&hv_ghcb->ghcb, msr);
+   if (sev_es_ghcb_hv_call(&hv_ghcb->ghcb, NULL, SVM_EXIT_MSR, 0, 0))
+   pr_warn("Failed to read msr via ghcb %llx.\n", msr);
+   else
+   *value = (u64)lower_32_bits(hv_ghcb->ghcb.save.rax)
+   | ((u64)lower_32_bits(hv_ghcb->ghcb.save.rdx) << 32);
+   local_irq_restore(flags);
+}
+
+void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value)
+{
+   hv_ghcb_msr_read(msr, value);
+}
+EXPORT_SYMBOL_GPL(hv_sint_rdmsrl_ghcb);
+
+void hv_sint_wrmsrl_ghcb(u64 msr, u64 value)
+{
+   hv_ghcb_msr_write(msr, value);
+
+   /* Write proxy bit via wrmsrl instruction. */
+   if (msr >= HV_X64_MSR_SINT0 && msr <= HV_X64_MSR_SINT15)
+   wrmsrl(msr, value | 1 

[RFC PATCH V4 03/12] HV: Mark vmbus ring buffer visible to host in Isolation VM

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Mark the vmbus ring buffer visible with set_memory_decrypted() when
establishing the GPADL handle.

Signed-off-by: Tianyu Lan 
---
 drivers/hv/channel.c   | 38 --
 include/linux/hyperv.h | 10 ++
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index f3761c73b074..01048bb07082 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -465,7 +466,7 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
*channel,
struct list_head *curr;
u32 next_gpadl_handle;
unsigned long flags;
-   int ret = 0;
+   int ret = 0, index;
 
next_gpadl_handle =
	(atomic_inc_return(&vmbus_connection.next_gpadl_handle) - 1);
@@ -474,6 +475,13 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
*channel,
if (ret)
return ret;
 
+   ret = set_memory_decrypted((unsigned long)kbuffer,
+  HVPFN_UP(size));
+   if (ret) {
+   pr_warn("Failed to set host visibility.\n");
+   return ret;
+   }
+
	init_completion(&msginfo->waitevent);
msginfo->waiting_channel = channel;
 
@@ -539,6 +547,15 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
*channel,
/* At this point, we received the gpadl created msg */
*gpadl_handle = gpadlmsg->gpadl;
 
+   if (type == HV_GPADL_BUFFER)
+   index = 0;
+   else
+   index = channel->gpadl_range[1].gpadlhandle ? 2 : 1;
+
+   channel->gpadl_range[index].size = size;
+   channel->gpadl_range[index].buffer = kbuffer;
+   channel->gpadl_range[index].gpadlhandle = *gpadl_handle;
+
 cleanup:
	spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
	list_del(&msginfo->msglistentry);
@@ -549,6 +566,11 @@ static int __vmbus_establish_gpadl(struct vmbus_channel 
*channel,
}
 
kfree(msginfo);
+
+   if (ret)
+   set_memory_encrypted((unsigned long)kbuffer,
+HVPFN_UP(size));
+
return ret;
 }
 
@@ -811,7 +833,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 
gpadl_handle)
struct vmbus_channel_gpadl_teardown *msg;
struct vmbus_channel_msginfo *info;
unsigned long flags;
-   int ret;
+   int ret, i;
 
info = kzalloc(sizeof(*info) +
   sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
@@ -859,6 +881,18 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, 
u32 gpadl_handle)
	spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
 
kfree(info);
+
+   /* Find gpadl buffer virtual address and size. */
+   for (i = 0; i < VMBUS_GPADL_RANGE_COUNT; i++)
+   if (channel->gpadl_range[i].gpadlhandle == gpadl_handle)
+   break;
+
+   if (set_memory_encrypted((unsigned long)channel->gpadl_range[i].buffer,
+   HVPFN_UP(channel->gpadl_range[i].size)))
+   pr_warn("Fail to set mem host visibility.\n");
+
+   channel->gpadl_range[i].gpadlhandle = 0;
+
return ret;
 }
 EXPORT_SYMBOL_GPL(vmbus_teardown_gpadl);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 2e859d2f9609..06eccaba10c5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -809,6 +809,14 @@ struct vmbus_device {
 
 #define VMBUS_DEFAULT_MAX_PKT_SIZE 4096
 
+struct vmbus_gpadl_range {
+   u32 gpadlhandle;
+   u32 size;
+   void *buffer;
+};
+
+#define VMBUS_GPADL_RANGE_COUNT3
+
 struct vmbus_channel {
struct list_head listentry;
 
@@ -829,6 +837,8 @@ struct vmbus_channel {
struct completion rescind_event;
 
u32 ringbuffer_gpadlhandle;
+   /* GPADL_RING and Send/Receive GPADL_BUFFER. */
+   struct vmbus_gpadl_range gpadl_range[VMBUS_GPADL_RANGE_COUNT];
 
/* Allocated memory for ring buffer */
struct page *ringbuffer_page;
-- 
2.25.1



[RFC PATCH V4 02/12] x86/HV: Add new hvcall guest address host visibility support

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Add support for a new hvcall to set guest address host visibility,
which marks memory visible to the host. Call it inside
set_memory_decrypted()/encrypted().

Signed-off-by: Tianyu Lan 
---
 arch/x86/hyperv/Makefile   |   2 +-
 arch/x86/hyperv/ivm.c  | 112 +
 arch/x86/include/asm/hyperv-tlfs.h |  18 +
 arch/x86/include/asm/mshyperv.h|   3 +-
 arch/x86/mm/pat/set_memory.c   |   6 +-
 include/asm-generic/hyperv-tlfs.h  |   1 +
 6 files changed, 139 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/hyperv/ivm.c

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 48e2c51464e8..5d2de10809ae 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-y  := hv_init.o mmu.o nested.o irqdomain.o
+obj-y  := hv_init.o mmu.o nested.o irqdomain.o ivm.o
 obj-$(CONFIG_X86_64)   += hv_apic.o hv_proc.o
 
 ifdef CONFIG_X86_64
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
new file mode 100644
index ..24a58795abd8
--- /dev/null
+++ b/arch/x86/hyperv/ivm.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hyper-V Isolation VM interface with paravisor and hypervisor
+ *
+ * Author:
+ *  Tianyu Lan 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
+ *
+ * In Isolation VM, all guest memory is encrypted from host and guest
+ * needs to set memory visible to host via hvcall before sharing memory
+ * with host.
+ */
+int hv_mark_gpa_visibility(u16 count, const u64 pfn[], u32 visibility)
+{
+   struct hv_gpa_range_for_visibility **input_pcpu, *input;
+   u16 pages_processed;
+   u64 hv_status;
+   unsigned long flags;
+
+   /* no-op if partition isolation is not enabled */
+   if (!hv_is_isolation_supported())
+   return 0;
+
+   if (count > HV_MAX_MODIFY_GPA_REP_COUNT) {
+   pr_err("Hyper-V: GPA count:%d exceeds supported:%lu\n", count,
+   HV_MAX_MODIFY_GPA_REP_COUNT);
+   return -EINVAL;
+   }
+
+   local_irq_save(flags);
+   input_pcpu = (struct hv_gpa_range_for_visibility **)
+   this_cpu_ptr(hyperv_pcpu_input_arg);
+   input = *input_pcpu;
+   if (unlikely(!input)) {
+   local_irq_restore(flags);
+   return -EINVAL;
+   }
+
+   input->partition_id = HV_PARTITION_ID_SELF;
+   input->host_visibility = visibility;
+   input->reserved0 = 0;
+   input->reserved1 = 0;
+   memcpy((void *)input->gpa_page_list, pfn, count * sizeof(*pfn));
+   hv_status = hv_do_rep_hypercall(
+   HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
+   0, input, &pages_processed);
+   local_irq_restore(flags);
+
+   if (!(hv_status & HV_HYPERCALL_RESULT_MASK))
+   return 0;
+
+   return hv_status & HV_HYPERCALL_RESULT_MASK;
+}
+EXPORT_SYMBOL(hv_mark_gpa_visibility);
+
+/*
+ * hv_set_mem_host_visibility - Set specified memory visible to host.
+ *
+ * In Isolation VM, all guest memory is encrypted from host and guest
+ * needs to set memory visible to host via hvcall before sharing memory
+ * with host. This function works as a wrapper around hv_mark_gpa_visibility()
+ * with memory base and size.
+ */
+static int hv_set_mem_host_visibility(void *kbuffer, size_t size, u32 
visibility)
+{
+   int pagecount = size >> HV_HYP_PAGE_SHIFT;
+   u64 *pfn_array;
+   int ret = 0;
+   int i, pfn;
+
+   if (!hv_is_isolation_supported() || !ms_hyperv.ghcb_base)
+   return 0;
+
+   pfn_array = kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
+   if (!pfn_array)
+   return -ENOMEM;
+
+   for (i = 0, pfn = 0; i < pagecount; i++) {
+   pfn_array[pfn] = virt_to_hvpfn(kbuffer + i * HV_HYP_PAGE_SIZE);
+   pfn++;
+
+   if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
+   ret |= hv_mark_gpa_visibility(pfn, pfn_array, 
visibility);
+   pfn = 0;
+
+   if (ret)
+   goto err_free_pfn_array;
+   }
+   }
+
+ err_free_pfn_array:
+   kfree(pfn_array);
+   return ret;
+}
+
+int hv_set_mem_enc(unsigned long addr, int numpages, bool enc)
+{
+   return hv_set_mem_host_visibility((void *)addr,
+   numpages * HV_HYP_PAGE_SIZE,
+   enc ? VMBUS_PAGE_NOT_VISIBLE
+   : VMBUS_PAGE_VISIBLE_READ_WRITE);
+}
diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
b/arch/x86/include/asm/hyperv-tlfs.h
index 606f5cc579b2..68826fbf92ca 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -262,6 +262,11 @@ enum hv_isolation_type {
 #define HV_X64_MSR_TIME_REF_COUNT   

[RFC PATCH V4 00/12] x86/Hyper-V: Add Hyper-V Isolation VM support

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.

The memory of these VMs is encrypted and the host can't access guest
memory directly. Hyper-V provides a new host-visibility hvcall, and
the guest needs to call it to mark memory visible to the host before
sharing memory with the host. For security, network/storage stack
memory should not be shared with the host, so bounce buffers are
required.

The vmbus channel ring buffer already plays the bounce buffer role,
because all data from/to the host needs to be copied between the ring
buffer and IO stack memory. So mark the vmbus channel ring buffer
visible.

There are two exceptions - packets sent by
vmbus_sendpacket_pagebuffer() and vmbus_sendpacket_mpb_desc(). These
packets contain IO stack memory addresses and the host will access
this memory. So add bounce buffer allocation support in vmbus for
these packets.

For an SNP Isolation VM, the guest needs to access the shared memory
via an extra address space which is specified by the Hyper-V CPUID
leaf HYPERV_CPUID_ISOLATION_CONFIG. The physical address used to
access the shared memory should be the bounce buffer memory GPA plus
the shared_gpa_boundary reported by CPUID.

Change since v3:
   - Add interface set_memory_decrypted_map() to decrypt memory and
 map bounce buffer in extra address space 
   - Remove swiotlb remap function and store the remap address
 returned by set_memory_decrypted_map() in swiotlb mem data structure.
   - Introduce hv_set_mem_enc() to make code more readable in the 
__set_memory_enc_dec().

Change since v2:
   - Remove not UIO driver in Isolation VM patch
   - Use vmap_pfn() to replace the ioremap_page_range() function in
   order to avoid exposing the symbol ioremap_page_range()
   - Call hv set mem host visibility hvcall in 
set_memory_encrypted/decrypted()
   - Enable swiotlb force mode instead of adding Hyper-V dma map/unmap hook
   - Fix code style

Tianyu Lan (12):
  x86/HV: Initialize shared memory boundary in the Isolation VM.
  x86/HV: Add new hvcall guest address host visibility support
  HV: Mark vmbus ring buffer visible to host in Isolation VM
  HV: Add Write/Read MSR registers via ghcb page
  HV: Add ghcb hvcall support for SNP VM
  HV/Vmbus: Add SNP support for VMbus channel initiate message
  HV/Vmbus: Initialize VMbus ring buffer for Isolation VM
  x86/Swiotlb/HV: Add Swiotlb bounce buffer remap function for HV IVM
  HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM
  HV/Netvsc: Add Isolation VM support for netvsc driver
  HV/Storvsc: Add Isolation VM support for storvsc driver
  x86/HV: Not set memory decrypted/encrypted during kexec alloc/free
page in IVM

 arch/x86/hyperv/Makefile   |   2 +-
 arch/x86/hyperv/hv_init.c  |  25 +--
 arch/x86/hyperv/ivm.c  | 299 +
 arch/x86/include/asm/hyperv-tlfs.h |  18 ++
 arch/x86/include/asm/mshyperv.h|  84 +++-
 arch/x86/include/asm/set_memory.h  |   2 +
 arch/x86/include/asm/sev-es.h  |   4 +
 arch/x86/kernel/cpu/mshyperv.c |   5 +
 arch/x86/kernel/machine_kexec_64.c |   5 +-
 arch/x86/kernel/sev-es-shared.c|  21 +-
 arch/x86/mm/pat/set_memory.c   |  34 +++-
 arch/x86/xen/pci-swiotlb-xen.c |   3 +-
 drivers/hv/Kconfig |   1 +
 drivers/hv/channel.c   |  48 -
 drivers/hv/connection.c|  71 ++-
 drivers/hv/hv.c| 129 +
 drivers/hv/hyperv_vmbus.h  |   3 +
 drivers/hv/ring_buffer.c   |  84 ++--
 drivers/hv/vmbus_drv.c |   3 +
 drivers/iommu/hyperv-iommu.c   |  62 ++
 drivers/net/hyperv/hyperv_net.h|   6 +
 drivers/net/hyperv/netvsc.c| 144 +-
 drivers/net/hyperv/rndis_filter.c  |   2 +
 drivers/scsi/storvsc_drv.c |  68 ++-
 include/asm-generic/hyperv-tlfs.h  |   1 +
 include/asm-generic/mshyperv.h |  53 -
 include/linux/hyperv.h |  16 ++
 include/linux/swiotlb.h|   4 +
 kernel/dma/swiotlb.c   |  11 +-
 29 files changed, 1097 insertions(+), 111 deletions(-)
 create mode 100644 arch/x86/hyperv/ivm.c

-- 
2.25.1



[RFC PATCH V4 01/12] x86/HV: Initialize shared memory boundary in the Isolation VM.

2021-07-07 Thread Tianyu Lan
From: Tianyu Lan 

Hyper-V exposes the shared memory boundary via the CPUID leaf
HYPERV_CPUID_ISOLATION_CONFIG and stores it in the
shared_gpa_boundary field of the ms_hyperv struct. This prepares
for sharing memory with the host for SNP guests.
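
As an illustrative example of the computation below (39 is a made-up
value, not necessarily what Hyper-V reports): if
shared_gpa_boundary_bits read as 39, shared_gpa_boundary would become
1ULL << 39, i.e. a 512GB boundary.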

Signed-off-by: Tianyu Lan 
---
 arch/x86/kernel/cpu/mshyperv.c |  2 ++
 include/asm-generic/mshyperv.h | 12 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 10b2a8c10cb6..8aed689db621 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -334,6 +334,8 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.priv_high & HV_ISOLATION) {
ms_hyperv.isolation_config_a = 
cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
ms_hyperv.isolation_config_b = 
cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
+   ms_hyperv.shared_gpa_boundary =
+   (u64)1 << ms_hyperv.shared_gpa_boundary_bits;
 
pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 
0x%x\n",
ms_hyperv.isolation_config_a, 
ms_hyperv.isolation_config_b);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 3ae56a29594f..2914e27b0429 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -34,8 +34,18 @@ struct ms_hyperv_info {
u32 max_vp_index;
u32 max_lp_index;
u32 isolation_config_a;
-   u32 isolation_config_b;
+   union {
+   u32 isolation_config_b;
+   struct {
+   u32 cvm_type : 4;
+   u32 Reserved11 : 1;
+   u32 shared_gpa_boundary_active : 1;
+   u32 shared_gpa_boundary_bits : 6;
+   u32 Reserved12 : 20;
+   };
+   };
void  __percpu **ghcb_base;
+   u64 shared_gpa_boundary;
 };
 extern struct ms_hyperv_info ms_hyperv;
 
-- 
2.25.1



Re: [RFC PATCH 1/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()

2021-07-07 Thread Robin Murphy

On 2021-07-06 20:12, Gerald Schaefer wrote:

On Tue, 6 Jul 2021 10:22:40 +0100
Robin Murphy  wrote:


On 2021-07-05 19:52, Gerald Schaefer wrote:

The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or 
rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 
check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg().
check_for_illegal_area() should check for overlay of sg elements with kernel
text or rodata. It is called with sg_dma_len(s) instead of s->length as
parameter. After the call to ->map_sg(), sg_dma_len() contains the length
of possibly combined sg elements in the DMA address space, and not the
individual sg element length, which would be s->length.

The check will then use the kernel start address of an sg element, and add
the DMA length for overlap check, which can result in the false-positive
warning because the DMA length can be larger than the actual single sg
element length in kernel address space.

In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if
any of them were combined in ->map_sg().

Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").

Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
Tested-by: Niklas Schnelle 
Signed-off-by: Gerald Schaefer 
---
   kernel/dma/debug.c | 10 ++
   1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 14de1271463f..d7d44b7fe7e2 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -1299,6 +1299,12 @@ void debug_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
if (unlikely(dma_debug_disabled()))
return;
   
+	for_each_sg(sg, s, nents, i) {

+   if (!PageHighMem(sg_page(s))) {
+   check_for_illegal_area(dev, sg_virt(s), s->length);
+   }
+   }
+
for_each_sg(sg, s, mapped_ents, i) {
entry = dma_entry_alloc();
if (!entry)
@@ -1316,10 +1322,6 @@ void debug_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
   
   		check_for_stack(dev, sg_page(s), s->offset);


Strictly this should probably be moved to the new loop as well, as it is
similarly concerned with validating the source segments rather than the
DMA mappings - I think with virtually-mapped stacks it might technically
be possible for a stack page to be physically adjacent to a "valid" page
such that it could get merged and overlooked if it were near the end of
the list, although in fairness that would probably be indicative of
something having gone far more fundamentally wrong. Otherwise, the
overall reasoning looks sound to me.


I see, good point. I think I can add this to my patch, and a different
subject like "dma-debug: fix sg checks in debug_dma_map_sg()".


TBH it's more of a conceptual cleanliness thing than a significant 
practical concern, but if we *are* breaking out a separate "validate the 
source elements" step then it does seem logical to capture everything 
relevant at once.
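
A rough sketch of that consolidated source-validation pass (an
illustration of the suggestion, not the submitted patch):

	for_each_sg(sg, s, nents, i) {
		/* Validate each source element before any merging. */
		check_for_stack(dev, sg_page(s), s->offset);
		if (!PageHighMem(sg_page(s)))
			check_for_illegal_area(dev, sg_virt(s), s->length);
	}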



However, I do not quite understand why check_for_stack() does not also
consider s->length. It seems to check only the first page of an sg
element.

So, shouldn't check_for_stack() behave similar to check_for_illegal_area(),
i.e. check all source sg elements for overlap with the task stack area?


Realistically, creating a scatterlist segment pointing to the stack at 
all would already be quite an audacious feat of brokenness, but getting 
a random stack page in the middle of a segment would seem to imply 
something having gone so catastrophically wrong that it's destined to 
end very badly whether or not dma-debug squawks about it - not to 
mention getting lucky enough for said random stack page to actually 
belong to the current task stack in the first place :)


Robin.


If yes, then this probably should be a separate patch, but I can try
to come up with something and send a new RFC with two patches. Maybe
check_for_stack() can also be integrated into check_for_illegal_area(),
they are both called at the same places. And mapping memory from the
stack also sounds rather illegal.




Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Jason Wang


在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:

On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:

在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:

On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:

On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi  wrote:

On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:

在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:

在 2021/7/4 下午5:49, Yongji Xie 写道:

OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagating. And it would be very hard
to implement such kind of semantic in some transports.  Virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or intialization-time parameters".



Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.


I'd like to stick to the current assumption that get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Because:

1) Kernel can not wait forever in get_config(), this is the major difference
with vhost-user.

virtio_cread() can sleep:

#define virtio_cread(vdev, structname, member, ptr) \
do {\
typeof(((structname*)0)->member) virtio_cread_v;\
\
might_sleep();  \
^^

Which code path cannot sleep?

Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work, an example is TUN/TAP:
netdevice is abstracted in the kernel and datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.
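
To make that concrete, here is a minimal sketch of the flow from the
userspace side - the ioctl name and struct layout are hypothetical,
purely for illustration:

	/* Hypothetical ioctl payload for updating the kernel's cached
	 * copy of the device config space. */
	struct vduse_config_update {
		__u32 offset;	/* offset into cached config space */
		__u32 length;	/* number of bytes in buffer */
		__u8  buffer[256];
	};

	/* Userspace pushes new config bytes whenever its device state
	 * changes; virtio_cread() is then served from the kernel cache
	 * and never blocks waiting for userspace. */
	ioctl(dev_fd, VDUSE_DEV_UPDATE_CONFIG, &update);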

For config space write, it requires more thought but fortunately it's
not commonly used. So VDUSE can choose to filter out the
device/features that depends on the config write.

This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.


Probably not, userspace device can tell the kernel about the device_features
and mandated_features during creation, and the feature negotiation could be
done purely in the kernel without bothering the userspace.



(For 

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-07 Thread Yongji Xie
On Wed, Jul 7, 2021 at 4:53 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> > +static bool vduse_validate_config(struct vduse_dev_config *config)
> > +{
>
> The name field needs to be NUL terminated?
>

I think so.

> > + case VDUSE_CREATE_DEV: {
> > + struct vduse_dev_config config;
> > + unsigned long size = offsetof(struct vduse_dev_config, 
> > config);
> > + void *buf;
> > +
> > + ret = -EFAULT;
> > + if (copy_from_user(&config, argp, size))
> > + break;
> > +
> > + ret = -EINVAL;
> > + if (vduse_validate_config(&config) == false)
> > + break;
> > +
> > + buf = vmemdup_user(argp + size, config.config_size);
> > + if (IS_ERR(buf)) {
> > + ret = PTR_ERR(buf);
> > + break;
> > + }
> > + ret = vduse_create_dev(&config, buf, control->api_version);
> > + break;
> > + }
> > + case VDUSE_DESTROY_DEV: {
> > + char name[VDUSE_NAME_MAX];
> > +
> > + ret = -EFAULT;
> > + if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> > + break;
>
> Is this missing a NUL terminator?

Oh, yes. Looks like I need to set '\0' at name[VDUSE_NAME_MAX - 1] here.
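
Sketched out, the fix is one extra line after the raw copy:

	char name[VDUSE_NAME_MAX];

	if (copy_from_user(name, argp, VDUSE_NAME_MAX))
		break;
	/* Userspace may pass an unterminated string; terminate it here. */
	name[VDUSE_NAME_MAX - 1] = '\0';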

Thanks,
Yongji


Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

2021-07-07 Thread Yongji Xie
On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi  wrote:
> > >
> > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > >
> > > > 在 2021/7/4 下午5:49, Yongji Xie 写道:
> > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > ioctl should not be called frequently.
> > > > > > The spec uses MUST and other terms to define the precise 
> > > > > > requirements.
> > > > > > Here the language (especially the word "generally") is weaker and 
> > > > > > means
> > > > > > there may be exceptions.
> > > > > >
> > > > > > Another type of access that doesn't work with the 
> > > > > > VDUSE_DEV_SET_CONFIG
> > > > > > approach is reads that have side-effects. For example, imagine a 
> > > > > > field
> > > > > > containing an error code if the device encounters a problem 
> > > > > > unrelated to
> > > > > > a specific virtqueue request. Reading from this field resets the 
> > > > > > error
> > > > > > code to 0, saving the driver an extra configuration space write 
> > > > > > access
> > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but 
> > > > > > it
> > > > > > makes me think that the interface does not allow full VIRTIO 
> > > > > > semantics.
> > > >
> > > >
> > > > Note that though you're correct, my understanding is that config space 
> > > > is
> > > > not suitable for this kind of error propagating. And it would be very 
> > > > hard
> > > > to implement such kind of semantic in some transports.  Virtqueue 
> > > > should be
> > > > much better. As Yong Ji quoted, the config space is used for
> > > > "rarely-changing or intialization-time parameters".
> > > >
> > > >
> > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > handle the message failure, I'm going to add a return value to
> > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > driver can be modified to handle that.
> > > > >
> > > > > Jason and Stefan, what do you think of this way?
> > >
> > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > >
> >
> > We add a timeout and return error in case userspace never replies to
> > the message.
> >
> > > The VIRTIO spec provides no way for the device to report errors from
> > > config space accesses.
> > >
> > > The QEMU virtio-pci implementation returns -1 from invalid
> > > virtio_config_read*() and silently discards virtio_config_write*()
> > > accesses.
> > >
> > > VDUSE can take the same approach with
> > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > >
> >
> > I noticed that virtio_config_read*() only returns -1 when we access an
> > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > when we access a valid field. Not sure if it's ok to silently ignore
> > this kind of error.
>
> That's a good point but it's a general VIRTIO issue. Any device
> implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> VIRTIO specification needs to provide a way for the driver to detect
> this.
>
> If userspace violates the contract then VDUSE needs to mark the device
> broken. QEMU's device emulation does something similar with the
> vdev->broken flag.
>
> The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> vDPA/VDUSE to indicate that the device is not operational and must be
> reset.
>

It might be a solution. But DEVICE_NEEDS_RESET is not implemented
currently. So I'm thinking whether it's ok to add a check of the
DEVICE_NEEDS_RESET status bit in the probe function of the virtio
device driver (e.g. the virtio-blk driver). Then VDUSE can make use
of it to fail device initialization when a configuration space access
fails.
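
A minimal sketch of such a probe-time check - illustrative only;
VIRTIO_CONFIG_S_NEEDS_RESET is the existing status bit, and where
exactly this would live in the virtio-blk probe path is the open
question:

	/* Bail out of probe early if the device has already flagged
	 * itself as needing a reset, e.g. because a VDUSE backend
	 * stopped servicing config space requests. */
	if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET)
		return -EIO;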

Thanks,
Yongji

Re: [PATCH] iommu: Fallback to default setting when def_domain_type() callback returns 0

2021-07-07 Thread Kai-Heng Feng
On Wed, Jul 7, 2021 at 2:03 AM Robin Murphy  wrote:
>
> On 2021-07-06 17:21, Kai-Heng Feng wrote:
> > On Tue, Jul 6, 2021 at 5:27 PM Robin Murphy  wrote:
> >>
> >> On 2021-07-06 07:51, Kai-Heng Feng wrote:
> >>> Commit 28b41e2c6aeb ("iommu: Move def_domain type check for untrusted
> >>> device into core") not only moved the check for untrusted device to
> >>> IOMMU core, it also introduced a behavioral change by returning
> >>> def_domain_type() directly without checking its return value. That makes
> >>> many devices no longer use the default IOMMU setting.
> >>>
> >>> So revert back to the old behavior which defaults to
> >>> iommu_def_domain_type when driver callback returns 0.
> >>>
> >>> Fixes: 28b41e2c6aeb ("iommu: Move def_domain type check for untrusted 
> >>> device into core")
> >>
> >> Are you sure about that? From that same commit:
> >>
> >> @@ -1507,7 +1509,7 @@ static int iommu_alloc_default_domain(struct
> >> iommu_group *group,
> >>   if (group->default_domain)
> >>   return 0;
> >>
> >> -   type = iommu_get_def_domain_type(dev);
> >> +   type = iommu_get_def_domain_type(dev) ? : iommu_def_domain_type;
> >>
> >>   return iommu_group_alloc_default_domain(dev->bus, group, type);
> >>}
> >>
> >> AFAICS the other two callers should also handle 0 correctly. Have you
> >> seen a problem in practice?
> >
> > Thanks for pointing out how the return value is being handled by the 
> > callers.
> > However, the same check is missing in probe_get_default_domain_type():
> > static int probe_get_default_domain_type(struct device *dev, void *data)
> > {
> >  struct __group_domain_type *gtype = data;
> >  unsigned int type = iommu_get_def_domain_type(dev);
> > ...
> > }
>
> I'm still not following - the next line right after that is "if (type)",
> which means it won't touch gtype, and if that happens for every
> iteration, probe_alloc_default_domain() subsequently hits its "if
> (!gtype.type)" condition and still ends up with iommu_def_domain_type.
> This *was* one of the other two callers I was talking about (the second
> being iommu_change_dev_def_domain()), and in fact on second look I think
> your proposed change will actually break this logic, since it's
> necessary to differentiate between a specific type being requested for
> the given device, and a "don't care" response which only implies to use
> the global default type if it's still standing after *all* the
> appropriate devices have been queried.
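
For reference, the aggregation described above is (condensed) what
probe_alloc_default_domain() already does, which is why the "don't
care" zero return has to survive until after all devices have been
queried:

	struct __group_domain_type gtype;

	memset(&gtype, 0, sizeof(gtype));
	/* Each device may set gtype.type; 0 means "don't care". */
	__iommu_group_for_each_dev(group, &gtype,
				   probe_get_default_domain_type);
	if (!gtype.type)
		gtype.type = iommu_def_domain_type;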

You are right. What I am seeing here is that the AMD GFX device is
using IOMMU_DOMAIN_IDENTITY, and that makes other devices in the same
group, including the Realtek WiFi, use IOMMU_DOMAIN_IDENTITY as well.

>
> > I personally prefer the old way instead of open coding with ternary
> > operator, so I'll do that in v2.
> >
> > In practice, this causes a kernel panic when probing the Realtek WiFi.
> > Because of the bug, dma_ops isn't set by probe_finalize(), so
> > dma_map_single() falls back to swiotlb, which isn't set up, and the
> > kernel panics.
>
> Hmm, but if that's the case, wouldn't it still be a problem anyway if
> the end result was IOMMU_DOMAIN_IDENTITY? I can't claim to fully
> understand the x86 swiotlb and no_iommu dance, but nothing really stands
> out to give me confidence that it handles the passthrough options correctly.

Yes, I think the AMD IOMMU driver needs a more thorough check on
static void __init amd_iommu_init_dma_ops(void)
{
swiotlb = (iommu_default_passthrough() || sme_me_mask) ? 1 : 0;
...
}

Since swiotlb gets disabled by that check, and the Realtek WiFi is only
capable of 32-bit DMA, the kernel panics because io_tlb_default_mem
is NULL.

Thanks again for pointing in the right direction; this patch is indeed
incorrect.

Kai-Heng

>
> Robin.
>
> > I didn't attach the panic log because the system simply is frozen at
> > that point so the message is not logged to the storage.
> > I'll see if I can find another way to collect the log and attach it in v2.
> >
> > Kai-Heng
> >
> >>
> >> Robin.
> >>
> >>> Signed-off-by: Kai-Heng Feng 
> >>> ---
> >>>drivers/iommu/iommu.c | 5 +++--
> >>>1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>> index 5419c4b9f27a..faac4f795025 100644
> >>> --- a/drivers/iommu/iommu.c
> >>> +++ b/drivers/iommu/iommu.c
> >>> @@ -1507,14 +1507,15 @@ EXPORT_SYMBOL_GPL(fsl_mc_device_group);
> >>>static int iommu_get_def_domain_type(struct device *dev)
> >>>{
> >>>const struct iommu_ops *ops = dev->bus->iommu_ops;
> >>> + unsigned int type = 0;
> >>>
> >>>if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
> >>>return IOMMU_DOMAIN_DMA;
> >>>
> >>>if (ops->def_domain_type)
> >>> - return ops->def_domain_type(dev);
> >>> + type = ops->def_domain_type(dev);
> >>>
> >>> - return 0;
> >>> + return (type == 0) ? iommu_def_domain_type : type;
> >>>}
> 

Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-07-07 Thread Stefan Hajnoczi
On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{

The name field needs to be NUL terminated?

> + case VDUSE_CREATE_DEV: {
> + struct vduse_dev_config config;
> + unsigned long size = offsetof(struct vduse_dev_config, config);
> + void *buf;
> +
> + ret = -EFAULT;
> + if (copy_from_user(&config, argp, size))
> + break;
> +
> + ret = -EINVAL;
> + if (vduse_validate_config(&config) == false)
> + break;
> +
> + buf = vmemdup_user(argp + size, config.config_size);
> + if (IS_ERR(buf)) {
> + ret = PTR_ERR(buf);
> + break;
> + }
> + ret = vduse_create_dev(&config, buf, control->api_version);
> + break;
> + }
> + case VDUSE_DESTROY_DEV: {
> + char name[VDUSE_NAME_MAX];
> +
> + ret = -EFAULT;
> + if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> + break;

Is this missing a NUL terminator?



[PATCH 4/4] dma-iommu: Add iommu bounce buffers to dma-iommu api

2021-07-07 Thread David Stevens
From: David Stevens 

Add support for per-domain dynamic pools of iommu bounce buffers to the
dma-iommu api. When enabled, all non-direct streaming mappings below a
configurable size will go through bounce buffers.

Each domain has its own buffer pool. Each buffer pool is split into
multiple power-of-2 size classes. Each class has a number of
preallocated slots that can hold bounce buffers. Bounce buffers are
allocated on demand, and unmapped bounce buffers are stored in a cache
with periodic eviction of unused cache entries. As the buffer pool is an
optimization, any failures simply result in falling back to the normal
dma-iommu handling.
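
A rough sketch of the data structures this describes - the field and
constant names are illustrative, the real definitions live in the new
io-buffer-pool.c/h below:

	/* One power-of-2 size class within a domain's pool. */
	struct io_buffer_class {
		size_t buffer_size;		/* e.g. 4K, 8K, ... */
		struct io_buffer_slot *slots;	/* preallocated slots */
		struct list_head cached;	/* unmapped, kept for reuse */
	};

	/* Per-domain pool: one class per supported size, plus periodic
	 * eviction of cache entries that stay unused. */
	struct io_buffer_pool {
		struct io_buffer_class classes[IO_BUFFER_NUM_CLASSES];
		struct delayed_work evict_work;
	};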

Signed-off-by: David Stevens 
---
 drivers/iommu/Kconfig  |  10 +
 drivers/iommu/Makefile |   1 +
 drivers/iommu/dma-iommu.c  |  75 +++-
 drivers/iommu/io-buffer-pool.c | 656 +
 drivers/iommu/io-buffer-pool.h |  91 +
 5 files changed, 826 insertions(+), 7 deletions(-)
 create mode 100644 drivers/iommu/io-buffer-pool.c
 create mode 100644 drivers/iommu/io-buffer-pool.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 1f111b399bca..6eee57b03ff9 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -420,4 +420,14 @@ config SPRD_IOMMU
 
  Say Y here if you want to use the multimedia devices listed above.
 
+config IOMMU_IO_BUFFER
+   bool "Use IOMMU bounce buffers"
+   depends on IOMMU_DMA
+   help
+ Use bounce buffers for small, streaming DMA operations. This may
+ have performance benefits on systems where establishing IOMMU mappings
+ is particularly expensive, such as when running as a guest.
+
+ If unsure, say N here.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..2287b2e3d92d 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
 obj-$(CONFIG_IOMMU_DEBUGFS) += iommu-debugfs.o
 obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
+obj-$(CONFIG_IOMMU_IO_BUFFER) += io-buffer-pool.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 48267d9f5152..1d2cfbbe03c1 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 
+#include "io-buffer-pool.h"
+
 struct iommu_dma_msi_page {
struct list_headlist;
dma_addr_t  iova;
@@ -46,6 +48,7 @@ struct iommu_dma_cookie {
dma_addr_t  msi_iova;
};
struct list_headmsi_page_list;
+   struct io_buffer_pool   *bounce_buffers;
 
/* Domain for flush queue callback; NULL if flush queue not in use */
struct iommu_domain *fq_domain;
@@ -83,6 +86,14 @@ static inline size_t cookie_msi_granule(struct 
iommu_dma_cookie *cookie)
return PAGE_SIZE;
 }
 
+static inline struct io_buffer_pool *dev_to_io_buffer_pool(struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_dma_domain(dev);
+   struct iommu_dma_cookie *cookie = domain->iova_cookie;
+
+   return cookie->bounce_buffers;
+}
+
 static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
 {
struct iommu_dma_cookie *cookie;
@@ -162,6 +173,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
if (!cookie)
return;
 
+   if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER))
+   io_buffer_pool_destroy(cookie->bounce_buffers);
+
if (cookie->type == IOMMU_DMA_IOVA_COOKIE && cookie->iovad.granule)
	put_iova_domain(&cookie->iovad);
 
@@ -334,6 +348,7 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
struct iommu_dma_cookie *cookie = domain->iova_cookie;
unsigned long order, base_pfn;
struct iova_domain *iovad;
+   int ret;
 
if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)
return -EINVAL;
@@ -381,7 +396,13 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
if (!dev)
return 0;
 
-   return iova_reserve_iommu_regions(dev, domain);
+   ret = iova_reserve_iommu_regions(dev, domain);
+
+   if (ret == 0 && IS_ENABLED(CONFIG_IOMMU_IO_BUFFER))
+   ret = io_buffer_pool_init(dev, domain, iovad,
+ &cookie->bounce_buffers);
+
+   return ret;
 }
 
 /**
@@ -537,11 +558,10 @@ static dma_addr_t __iommu_dma_map(struct device *dev, 
phys_addr_t phys,
 }
 
 static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
-   size_t org_size, dma_addr_t dma_mask, bool coherent,
+   size_t org_size, dma_addr_t dma_mask, int 

[PATCH 3/4] dma-iommu: expose a few helper functions to module

2021-07-07 Thread David Stevens
From: David Stevens 

Expose a few helper functions from dma-iommu to the rest of the module.

Signed-off-by: David Stevens 
---
 drivers/iommu/dma-iommu.c | 27 ++-
 include/linux/dma-iommu.h | 12 
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 98a5c566a303..48267d9f5152 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -413,7 +413,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, 
bool coherent,
}
 }
 
-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
size_t size, u64 dma_limit, struct device *dev)
 {
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -453,7 +453,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
return (dma_addr_t)iova << shift;
 }
 
-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
dma_addr_t iova, size_t size, struct page *freelist)
 {
	struct iova_domain *iovad = &cookie->iovad;
@@ -489,7 +489,7 @@ static void __iommu_dma_unmap(struct device *dev, 
dma_addr_t dma_addr,
 
if (!cookie->fq_domain)
	iommu_iotlb_sync(domain, &iotlb_gather);
-   iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
+   __iommu_dma_free_iova(cookie, dma_addr, size, iotlb_gather.freelist);
 }
 
 static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr,
@@ -525,12 +525,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, 
phys_addr_t phys,
 
size = iova_align(iovad, size + iova_off);
 
-   iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+   iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
if (!iova)
return DMA_MAPPING_ERROR;
 
if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
-   iommu_dma_free_iova(cookie, iova, size, NULL);
+   __iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
return iova + iova_off;
@@ -585,14 +585,14 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
return iova;
 }
 
-static void __iommu_dma_free_pages(struct page **pages, int count)
+void __iommu_dma_free_pages(struct page **pages, int count)
 {
while (count--)
__free_page(pages[count]);
kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(
+struct page **__iommu_dma_alloc_pages(
unsigned int count, unsigned long order_mask,
unsigned int nid, gfp_t page_gfp, gfp_t kalloc_gfp)
 {
@@ -686,7 +686,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
return NULL;
 
size = iova_align(iovad, size);
-   iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+   iova = __iommu_dma_alloc_iova(domain, size,
+ dev->coherent_dma_mask, dev);
if (!iova)
goto out_free_pages;
 
@@ -712,7 +713,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
 out_free_sg:
sg_free_table(sgt);
 out_free_iova:
-   iommu_dma_free_iova(cookie, iova, size, NULL);
+   __iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_pages:
__iommu_dma_free_pages(pages, count);
return NULL;
@@ -1063,7 +1064,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
-   iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+   iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova)
goto out_restore_sg;
 
@@ -1077,7 +1078,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
return __finalise_sg(dev, sg, nents, iova);
 
 out_free_iova:
-   iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+   __iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
__invalidate_sg(sg, nents);
return 0;
@@ -1370,7 +1371,7 @@ static struct iommu_dma_msi_page 
*iommu_dma_get_msi_page(struct device *dev,
if (!msi_page)
return NULL;
 
-   iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+   iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
if (!iova)
goto out_free_page;
 
@@ -1384,7 +1385,7 @@ static struct iommu_dma_msi_page 
*iommu_dma_get_msi_page(struct device *dev,
return msi_page;
 
 out_free_iova:
-   iommu_dma_free_iova(cookie, iova, size, NULL);
+   __iommu_dma_free_iova(cookie, iova, size, NULL);
 out_free_page:
kfree(msi_page);
return NULL;
diff 

[PATCH 2/4] dma-iommu: replace device arguments

2021-07-07 Thread David Stevens
From: David Stevens 

Replace the struct device argument with the device's nid in
__iommu_dma_alloc_pages, since it doesn't need the whole struct. This
allows it to be called from places which don't have access to the
device.

Signed-off-by: David Stevens 
---
 drivers/iommu/dma-iommu.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 00993b56c977..98a5c566a303 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -592,12 +592,12 @@ static void __iommu_dma_free_pages(struct page **pages, 
int count)
kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(struct device *dev,
+static struct page **__iommu_dma_alloc_pages(
unsigned int count, unsigned long order_mask,
-   gfp_t page_gfp, gfp_t kalloc_gfp)
+   unsigned int nid, gfp_t page_gfp, gfp_t kalloc_gfp)
 {
struct page **pages;
-   unsigned int i = 0, nid = dev_to_node(dev);
+   unsigned int i = 0;
 
order_mask &= (2U << MAX_ORDER) - 1;
if (!order_mask)
@@ -680,8 +680,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
alloc_sizes = min_size;
 
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-   pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
-   gfp, GFP_KERNEL);
+   pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT,
+   dev_to_node(dev), gfp, GFP_KERNEL);
if (!pages)
return NULL;
 
-- 
2.32.0.93.g670b81a890-goog



[PATCH 1/4] dma-iommu: add kalloc gfp flag to alloc helper

2021-07-07 Thread David Stevens
From: David Stevens 

Add a gfp flag for the kalloc calls within __iommu_dma_alloc_pages, so
that the function can be called from atomic contexts.
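
With the split flags, a (hypothetical) atomic-context caller could
pass GFP_ATOMIC for both the page allocations and the pages[] array
bookkeeping:

	pages = __iommu_dma_alloc_pages(dev, count, order_mask,
					GFP_ATOMIC, GFP_ATOMIC);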

Signed-off-by: David Stevens 
---
 drivers/iommu/dma-iommu.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 614f0dd86b08..00993b56c977 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -593,7 +593,8 @@ static void __iommu_dma_free_pages(struct page **pages, int 
count)
 }
 
 static struct page **__iommu_dma_alloc_pages(struct device *dev,
-   unsigned int count, unsigned long order_mask, gfp_t gfp)
+   unsigned int count, unsigned long order_mask,
+   gfp_t page_gfp, gfp_t kalloc_gfp)
 {
struct page **pages;
unsigned int i = 0, nid = dev_to_node(dev);
@@ -602,15 +603,15 @@ static struct page **__iommu_dma_alloc_pages(struct 
device *dev,
if (!order_mask)
return NULL;
 
-   pages = kvzalloc(count * sizeof(*pages), GFP_KERNEL);
+   pages = kvzalloc(count * sizeof(*pages), kalloc_gfp);
if (!pages)
return NULL;
 
/* IOMMU can map any pages, so himem can also be used here */
-   gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
+   page_gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
 
/* It makes no sense to muck about with huge pages */
-   gfp &= ~__GFP_COMP;
+   page_gfp &= ~__GFP_COMP;
 
while (count) {
struct page *page = NULL;
@@ -624,7 +625,7 @@ static struct page **__iommu_dma_alloc_pages(struct device 
*dev,
for (order_mask &= (2U << __fls(count)) - 1;
 order_mask; order_mask &= ~order_size) {
unsigned int order = __fls(order_mask);
-   gfp_t alloc_flags = gfp;
+   gfp_t alloc_flags = page_gfp;
 
order_size = 1U << order;
if (order_mask > order_size)
@@ -680,7 +681,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
 
count = PAGE_ALIGN(size) >> PAGE_SHIFT;
pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
-   gfp);
+   gfp, GFP_KERNEL);
if (!pages)
return NULL;
 
-- 
2.32.0.93.g670b81a890-goog



[PATCH 0/4] Add dynamic iommu backed bounce buffers

2021-07-07 Thread David Stevens
Add support for per-domain dynamic pools of iommu bounce buffers to the 
dma-iommu API. This allows iommu mappings to be reused while still
maintaining strict iommu protection. Allocating buffers dynamically
instead of using swiotlb carveouts makes per-domain pools more practical
on systems with large numbers of devices or where devices are unknown.

When enabled, all non-direct streaming mappings below a configurable
size will go through bounce buffers. Note that this means drivers which
don't properly use the DMA API (e.g. i915) cannot use an iommu when this
feature is enabled. However, all drivers which work with swiotlb=force
should work.
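
In rough terms, the map-path decision the series introduces looks like
this sketch - bounce_or_map(), io_buffer_pool_max_size() and
io_buffer_pool_map() are illustrative names, while
dev_to_io_buffer_pool() and __iommu_dma_map() appear in the patches:

	static dma_addr_t bounce_or_map(struct device *dev, phys_addr_t phys,
					size_t size, int prot, u64 dma_mask)
	{
		struct io_buffer_pool *pool = dev_to_io_buffer_pool(dev);
		dma_addr_t iova;

		/* Small streaming mappings first try a reusable bounce buffer. */
		if (IS_ENABLED(CONFIG_IOMMU_IO_BUFFER) &&
		    size <= io_buffer_pool_max_size(pool)) {
			iova = io_buffer_pool_map(pool, phys, size, prot);
			if (iova != DMA_MAPPING_ERROR)
				return iova;
		}
		/* Oversize, or the pool failed: normal dma-iommu path. */
		return __iommu_dma_map(dev, phys, size, prot, dma_mask);
	}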

Bounce buffers serve as an optimization in situations where interactions
with the iommu are very costly. For example, virtio-iommu operations in
a guest on a Linux host require a vmexit, involvement of the VMM, and a
VFIO syscall. For relatively small DMA operations, memcpy can be
significantly faster.

As a performance comparison, on a device with an i5-10210U, I ran fio
with a VFIO passthrough NVMe drive with '--direct=1 --rw=read
--ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k, and
128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
by >99%, as bounce buffers don't require syncing here in the read case.
Running with multiple jobs doesn't serve as a useful performance
comparison because virtio-iommu and vfio_iommu_type1 both have big
locks that significantly limit multithreaded DMA performance.

This patch set is based on v5.13-rc7 plus the patches at [1].

David Stevens (4):
  dma-iommu: add kalloc gfp flag to alloc helper
  dma-iommu: replace device arguments
  dma-iommu: expose a few helper functions to module
  dma-iommu: Add iommu bounce buffers to dma-iommu api

 drivers/iommu/Kconfig  |  10 +
 drivers/iommu/Makefile |   1 +
 drivers/iommu/dma-iommu.c  | 119 --
 drivers/iommu/io-buffer-pool.c | 656 +
 drivers/iommu/io-buffer-pool.h |  91 +
 include/linux/dma-iommu.h  |  12 +
 6 files changed, 861 insertions(+), 28 deletions(-)
 create mode 100644 drivers/iommu/io-buffer-pool.c
 create mode 100644 drivers/iommu/io-buffer-pool.h

-- 
2.32.0.93.g670b81a890-goog
