Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-04-21 Thread Dave Hansen
On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> determine if SME is active.
> 
> A new directory will be created:
>   /sys/kernel/mm/sme/
> 
> And two entries within the new directory:
>   /sys/kernel/mm/sme/active
>   /sys/kernel/mm/sme/encryption_mask

Why do they care, and what will they be doing with this information?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 09/32] x86/mm: Provide general kernel support for memory encryption

2017-04-21 Thread Dave Hansen
On 04/18/2017 02:17 PM, Tom Lendacky wrote:
> @@ -55,7 +57,7 @@ static inline void copy_user_page(void *to, void *from, 
> unsigned long vaddr,
>   __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
>  
>  #ifndef __va
> -#define __va(x)  ((void *)((unsigned 
> long)(x)+PAGE_OFFSET))
> +#define __va(x)  ((void *)(__sme_clr(x) + PAGE_OFFSET))
>  #endif

It seems wrong to be modifying __va().  It currently takes a physical
address, and this modifies it to take a physical address plus the SME bits.

How does that end up ever happening?  If we are pulling physical
addresses out of the page tables, we use p??_phys().  I'd expect *those*
to be masking off the SME bits.

Is it these cases?

pgd_t *base = __va(read_cr3());

For those, it seems like we really want to create two modes of reading
cr3.  One that truly reads CR3 and another that reads the pgd's physical
address out of CR3.  Then you only do the SME masking on the one
fetching a physical address, and the SME bits never leak into __va().
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 07/32] x86/mm: Add support to enable SME in early boot processing

2017-04-21 Thread Tom Lendacky

On 4/21/2017 9:55 AM, Borislav Petkov wrote:

On Tue, Apr 18, 2017 at 04:17:35PM -0500, Tom Lendacky wrote:

Add support to the early boot code to use Secure Memory Encryption (SME).
Since the kernel has been loaded into memory in a decrypted state, support
is added to encrypt the kernel in place and update the early pagetables


s/support is added to //


Done.

Thanks,
Tom




with the memory encryption mask so that new pagetable entries will use
memory encryption.




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 32/32] x86/mm: Add support to make use of Secure Memory Encryption

2017-04-21 Thread Tom Lendacky

On 4/18/2017 4:22 PM, Tom Lendacky wrote:

Add support to check if SME has been enabled and if memory encryption
should be activated (checking of command line option based on the
configuration of the default state).  If memory encryption is to be
activated, then the encryption mask is set and the kernel is encrypted
"in place."

Signed-off-by: Tom Lendacky 
---
 arch/x86/kernel/head_64.S |1 +
 arch/x86/mm/mem_encrypt.c |   83 +++--
 2 files changed, 80 insertions(+), 4 deletions(-)



...



-unsigned long __init sme_enable(void)
+unsigned long __init sme_enable(struct boot_params *bp)
 {
+   const char *cmdline_ptr, *cmdline_arg, *cmdline_on, *cmdline_off;
+   unsigned int eax, ebx, ecx, edx;
+   unsigned long me_mask;
+   bool active_by_default;
+   char buffer[16];


So it turns out that when KASLR is enabled (CONFIG_RAMDOMIZE_BASE=y)
the stack-protector support causes issues with this function because
it is called so early. I can get past it by adding:

CFLAGS_mem_encrypt.o := $(nostackp)

in the arch/x86/mm/Makefile, but that obviously eliminates the support
for the whole file.  Would it be better to split out the sme_enable()
and other boot routines into a separate file or just apply the
$(nostackp) to the whole file?

Thanks,
Tom


+   u64 msr;
+
+   /* Check for the SME support leaf */
+   eax = 0x8000;
+   ecx = 0;
+   native_cpuid(, , , );
+   if (eax < 0x801f)
+   goto out;
+
+   /*
+* Check for the SME feature:
+*   CPUID Fn8000_001F[EAX] - Bit 0
+* Secure Memory Encryption support
+*   CPUID Fn8000_001F[EBX] - Bits 5:0
+* Pagetable bit position used to indicate encryption
+*/
+   eax = 0x801f;
+   ecx = 0;
+   native_cpuid(, , , );
+   if (!(eax & 1))
+   goto out;
+   me_mask = 1UL << (ebx & 0x3f);
+
+   /* Check if SME is enabled */
+   msr = __rdmsr(MSR_K8_SYSCFG);
+   if (!(msr & MSR_K8_SYSCFG_MEM_ENCRYPT))
+   goto out;
+
+   /*
+* Fixups have not been applied to phys_base yet, so we must obtain
+* the address to the SME command line option data in the following
+* way.
+*/
+   asm ("lea sme_cmdline_arg(%%rip), %0"
+: "=r" (cmdline_arg)
+: "p" (sme_cmdline_arg));
+   asm ("lea sme_cmdline_on(%%rip), %0"
+: "=r" (cmdline_on)
+: "p" (sme_cmdline_on));
+   asm ("lea sme_cmdline_off(%%rip), %0"
+: "=r" (cmdline_off)
+: "p" (sme_cmdline_off));
+
+   if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT))
+   active_by_default = true;
+   else
+   active_by_default = false;
+
+   cmdline_ptr = (const char *)((u64)bp->hdr.cmd_line_ptr |
+((u64)bp->ext_cmd_line_ptr << 32));
+
+   cmdline_find_option(cmdline_ptr, cmdline_arg, buffer, sizeof(buffer));
+
+   if (strncmp(buffer, cmdline_on, sizeof(buffer)) == 0)
+   sme_me_mask = me_mask;
+   else if (strncmp(buffer, cmdline_off, sizeof(buffer)) == 0)
+   sme_me_mask = 0;
+   else
+   sme_me_mask = active_by_default ? me_mask : 0;
+
+out:
return sme_me_mask;
 }

@@ -543,9 +618,9 @@ unsigned long sme_get_me_mask(void)

 #else  /* !CONFIG_AMD_MEM_ENCRYPT */

-void __init sme_encrypt_kernel(void)   { }
-unsigned long __init sme_enable(void)  { return 0; }
+void __init sme_encrypt_kernel(void)   { }
+unsigned long __init sme_enable(struct boot_params *bp){ return 0; }

-unsigned long sme_get_me_mask(void){ return 0; }
+unsigned long sme_get_me_mask(void){ return 0; }

 #endif /* CONFIG_AMD_MEM_ENCRYPT */


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 2/2] PCI: quirks: Fix ThunderX2 dma alias handling

2017-04-21 Thread Bjorn Helgaas
On Fri, Apr 21, 2017 at 05:05:41PM +, Jayachandran C wrote:
> On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
> >  wrote:
> > > On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:

> > >> Could you collect "lspci -vv" output from this system?  I'd like to
> > >> archive that as background for this IOMMU issue and the ASPM tweaks I
> > >> suspect we'll have to do.  I *wish* we had more information about that
> > >> VIA thing, because I suspect we could get rid of it if we had more
> > >> details.
> > >
> > > The full logs are slightly large, so I have kept them at:
> > > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> > > The lspci -vv output is lspci-vv.txt and lspci -tvn output is 
> > > lspci-tvn.txt
> > >
> > > The output is from 2 socket system, the cards are not on the first slot
> > > like the example above, so the bus and device numbers are different.
> > 
> > Can somebody with this system collect the "lspci -" output as well?
> > 
> > I'm making some lspci changes to handle the PCI-to-PCIe bridge
> > correctly, and I can use the "lspci -" output to create an lspci
> > test case.
> 
> [Sorry was AFK for a few days]
> 
> I have updated the above directory with the log. Also tested your next branch
> and it works fine on ThunderX2.

Thanks!

With regard to my lspci changes, they add "Slot-" here:

   01:0a.0 PCI bridge: Broadcom Corporation Device 9039
   ...
  -   Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge, MSI 00
  +   Capabilities: [40] Express (v2) PCI/PCI-X to PCI-Express Bridge (Slot-), 
MSI 00

for all your PCI-to-PCIe bridges.  I assume the "Slot-" is correct, i.e.,
the link is not connected to a slot, right?  This comes from the "Slot
Implemented" bit in the PCIe Capabilities Register.

I did notice that all the Root Port devices claim to *not* be connected to
slots, which doesn't seem right.  For example,

  12:00.0 PCI bridge: Broadcom Corporation Device 9084
  Bus: primary=12, secondary=13, subordinate=14, sec-latency=0
  Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00

  13:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ 
Network Connection

It seems strange because the 12:00.0 Root Port looks like it probably
*does* lead to a slot where the NIC is plugged in.  Or is that NIC really
soldered down?

But I assume there are *some* PCIe slots, so at some of those Root Ports
should advertise "Slot+" (which by itself does not imply hotplug support,
if that's the concern).

Bjorn
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] arm64/dma-mapping: fix DMA_ATTR_FORCE_CONTIGUOUS mmaping code

2017-04-21 Thread Geert Uytterhoeven
Hi Catalin,

On Fri, Apr 21, 2017 at 7:32 PM, Catalin Marinas
 wrote:
> On Fri, Mar 31, 2017 at 01:02:51PM +0200, Andrzej Hajda wrote:
>> In case of DMA_ATTR_FORCE_CONTIGUOUS allocations vm_area->pages
>> is invalid. __iommu_mmap_attrs and __iommu_get_sgtable should not
>> use it and take advantage of contiguous nature of the allocation.
>
> As I replied to your original patch, I think
> dma_common_contiguous_remap() should be fixed not to leave a dangling
> area->pages pointer.
>
>>  arch/arm64/mm/dma-mapping.c | 17 -
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
>> index f7b5401..41c7c36 100644
>> --- a/arch/arm64/mm/dma-mapping.c
>> +++ b/arch/arm64/mm/dma-mapping.c
>> @@ -703,6 +703,10 @@ static int __iommu_mmap_attrs(struct device *dev, 
>> struct vm_area_struct *vma,
>>   if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, ))
>>   return ret;
>>
>> + if (attrs & DMA_ATTR_FORCE_CONTIGUOUS)
>> + return vm_iomap_memory(vma,
>> + page_to_phys(vmalloc_to_page(cpu_addr)), size);
>
> I replied to Geert's patch on whether we actually need to call
> dma_common_contiguous_remap() in __iommu_alloc_attrs() if the device is
> coherent. We don't do this for the swiotlb allocation. If we are going
> to change this, cpu_addr may or may not be in the vmalloc range and
> vmalloc_to_page() would fail (I guess we could add an is_vmalloc_addr()
> check).
>
> Alternatively, just open-code dma_common_contiguous_remap() in
> __iommu_alloc_attrs() to keep a valid area->pages around (I think Geert
> already suggested this). AFAICT, arch/arm does this already in its own
> __iommu_alloc_buffer().
>
> Yet another option would be for iommu_dma_alloc() to understand
> DMA_ATTR_FORCE_CONTIGUOUS and remove the arm64-specific handling.

That was actually the approach I took in my v1.
V2 changed that to provide standalone iommu_dma_{alloc,free}_contiguous()
functions.
V3 changed that to call everything directly from the arm64 code.
...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] arm64/dma-mapping: fix DMA_ATTR_FORCE_CONTIGUOUS mmaping code

2017-04-21 Thread Catalin Marinas
On Fri, Mar 31, 2017 at 01:02:51PM +0200, Andrzej Hajda wrote:
> In case of DMA_ATTR_FORCE_CONTIGUOUS allocations vm_area->pages
> is invalid. __iommu_mmap_attrs and __iommu_get_sgtable should not
> use it and take advantage of contiguous nature of the allocation.

As I replied to your original patch, I think
dma_common_contiguous_remap() should be fixed not to leave a dangling
area->pages pointer.

>  arch/arm64/mm/dma-mapping.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index f7b5401..41c7c36 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -703,6 +703,10 @@ static int __iommu_mmap_attrs(struct device *dev, struct 
> vm_area_struct *vma,
>   if (dma_mmap_from_coherent(dev, vma, cpu_addr, size, ))
>   return ret;
>  
> + if (attrs & DMA_ATTR_FORCE_CONTIGUOUS)
> + return vm_iomap_memory(vma,
> + page_to_phys(vmalloc_to_page(cpu_addr)), size);

I replied to Geert's patch on whether we actually need to call
dma_common_contiguous_remap() in __iommu_alloc_attrs() if the device is
coherent. We don't do this for the swiotlb allocation. If we are going
to change this, cpu_addr may or may not be in the vmalloc range and
vmalloc_to_page() would fail (I guess we could add an is_vmalloc_addr()
check).

Alternatively, just open-code dma_common_contiguous_remap() in
__iommu_alloc_attrs() to keep a valid area->pages around (I think Geert
already suggested this). AFAICT, arch/arm does this already in its own
__iommu_alloc_buffer().

Yet another option would be for iommu_dma_alloc() to understand
DMA_ATTR_FORCE_CONTIGUOUS and remove the arm64-specific handling.

> @@ -715,8 +719,19 @@ static int __iommu_get_sgtable(struct device *dev, 
> struct sg_table *sgt,
>  size_t size, unsigned long attrs)
>  {
>   unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> - struct vm_struct *area = find_vm_area(cpu_addr);
> + struct vm_struct *area;
> +
> + if (attrs & DMA_ATTR_FORCE_CONTIGUOUS) {
> + int ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
> +
> + if (!ret)
> + sg_set_page(sgt->sgl, vmalloc_to_page(cpu_addr),
> + PAGE_ALIGN(size), 0);
>  
> + return ret;
> + }
> +
> + area = find_vm_area(cpu_addr);

As Russell already stated, this function needs a "fix" regardless of the
DMA_ATTR_FORCE_CONTIGUOUS as we may not have page structures
(*_from_coherent allocations). But I agree with you that
dma_get_sgtable() issues should be addressed separately (and I'm happy
to disable such API on arm64 until sorted).

-- 
Catalin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 2/2] PCI: quirks: Fix ThunderX2 dma alias handling

2017-04-21 Thread Jayachandran C
On Fri, Apr 21, 2017 at 10:48:15AM -0500, Bjorn Helgaas wrote:
> On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
>  wrote:
> > On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:
> >> On Fri, Apr 14, 2017 at 4:06 PM, Jayachandran C
> >>  wrote:
> >> > On Thu, Apr 13, 2017 at 07:19:11PM -0500, Bjorn Helgaas wrote:
> >> >> I tentatively applied both patches to pci/host-thunder for v4.12.
> >> >>
> >> >> However, I am concerned about the topology here:
> >> >>
> >> >> On Thu, Apr 13, 2017 at 08:30:45PM +, Jayachandran C wrote:
> >> >> > On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
> >> >> > PCI topology is slightly unusual.  For a multi-node system, it looks
> >> >> > like:
> >> >> >
> >> >> > 00:00.0 [PCI] bridge to [bus 01-1e]
> >> >> > 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
> >> >> > 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] 
> >> >> > (XLATE_ROOT)
> >> >> > 03:00.0 PCIe Endpoint
> >> >>
> >> >> A root port normally has a single PCIe link leading downstream.
> >> >> According to this, 02:00.0 is a root port that has the usual
> >> >> downstream link leading to 03:00.0, but it also has an upstream link
> >> >> to 01:0a.0.
> >> >
> >> > The PCI topology is a bit broken due to the way that the PCIe IP block
> >> > was integrated into SoC PCI bridges and devices. The current mechanism
> >> > of adding a PCI-PCIe bridge to glue these together is not ideal.
> >>
> >> Yeah, that's definitely broken.
> >>
> >> >> Maybe this example is omitting details that are not relevant to DMA
> >> >> aliases?  The PCIe capability only contains one set of link-related
> >> >> registers, so I don't know how we could manage a single device that
> >> >> has two links.
> >> >
> >> > The root port is standard and has just one link to the EP (or whatever
> >> > is on the external PCIe slot). The fallout of the current hw design is
> >> > that the PCI-PCIe bridge has a link that does not follow standard and
> >> > does not have a counterpart (as you noted).
> >> >
> >> >> A device with two links would break things like ASPM.  In
> >> >> set_pcie_port_type(), for example, we have this comment:
> >> >>
> >> >>* A Root Port or a PCI-to-PCIe bridge is always the upstream end
> >> >>* of a Link.  No PCIe component has two Links.  Two Links are
> >> >>* connected by a Switch that has a Port on each Link and internal
> >> >>* logic to connect the two Ports.
> >> >>
> >> >> The topology above breaks these assumptions, which will make
> >> >> pdev->has_secondary_link incorrect, which means ASPM won't work
> >> >> correctly.
> >> >
> >> > Given the current hardware, the pcieport driver seems to work reasonably
> >> > for the root port at 02:00.0, with AER support. I will take a look at the
> >> > ASPM part.
> >>
> >> I don't think pcieport itself cares much about links.  ASPM does, but
> >> it looks like set_pcie_port_type() actually is smart enough to know
> >> that PCI-to-PCIe bridges and Root Ports always have links on their
> >> secondary sides.  So has_secondary_link probably does get set
> >> correctly.
> >>
> >> But I think you'll hit the VIA "strange chipset" thing in
> >> pcie_aspm_init_link_state(), which will probably prevent us from doing
> >> ASPM on the link from 02:00.0 to 03:00.0.
> >>
> >> Could you collect "lspci -vv" output from this system?  I'd like to
> >> archive that as background for this IOMMU issue and the ASPM tweaks I
> >> suspect we'll have to do.  I *wish* we had more information about that
> >> VIA thing, because I suspect we could get rid of it if we had more
> >> details.
> >
> > The full logs are slightly large, so I have kept them at:
> > https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> > The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
> >
> > The output is from 2 socket system, the cards are not on the first slot
> > like the example above, so the bus and device numbers are different.
> 
> Can somebody with this system collect the "lspci -" output as well?
> 
> I'm making some lspci changes to handle the PCI-to-PCIe bridge
> correctly, and I can use the "lspci -" output to create an lspci
> test case.

[Sorry was AFK for a few days]

I have updated the above directory with the log. Also tested your next branch
and it works fine on ThunderX2.

JC.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] arm64/dma-mapping: fix DMA_ATTR_FORCE_CONTIGUOUS mmaping code

2017-04-21 Thread Catalin Marinas
On Wed, Mar 29, 2017 at 01:55:32PM +0100, Robin Murphy wrote:
> On 29/03/17 11:05, Andrzej Hajda wrote:
> > In case of DMA_ATTR_FORCE_CONTIGUOUS allocations vm_area->pages
> > is invalid. __iommu_mmap_attrs and __iommu_get_sgtable cannot use
> > it. In first case temporary pages array is passed to iommu_dma_mmap,
> > in 2nd case single entry sg table is created directly instead
> > of calling helper.
> > 
> > Fixes: 44176bb ("arm64: Add support for DMA_ATTR_FORCE_CONTIGUOUS to IOMMU")
> > Signed-off-by: Andrzej Hajda 
> > ---
> > Hi,
> > 
> > I am not familiar with this framework so please don't be too cruel ;)
> > Alternative solution I see is to always create vm_area->pages,
> > I do not know which approach is preferred.
> > 
> > Regards
> > Andrzej
> > ---
> >  arch/arm64/mm/dma-mapping.c | 40 ++--
> >  1 file changed, 38 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> > index f7b5401..bba2bc8 100644
> > --- a/arch/arm64/mm/dma-mapping.c
> > +++ b/arch/arm64/mm/dma-mapping.c
> > @@ -704,7 +704,30 @@ static int __iommu_mmap_attrs(struct device *dev, 
> > struct vm_area_struct *vma,
> > return ret;
> >  
> > area = find_vm_area(cpu_addr);
> > -   if (WARN_ON(!area || !area->pages))
> > +   if (WARN_ON(!area))
> 
> From the look of things, it doesn't seem strictly necessary to change
> this, but whether that's a good thing is another matter. I'm not sure
> that dma_common_contiguous_remap() should really be leaving a dangling
> pointer in area->pages as it apparently does... :/

On this specific point, I don't think area->pages should be set either
(cc'ing Laura). As in the vmalloc vs vmap case, area->pages when pages
need to be freed (via vfree), which is not the case here. The
dma_common_pages_remap() would need to set area->pages when called
directly from the iommu DMA ops. Proposal below, not tested with the
iommu ops. I assume the patch would cause __iommu_mmap_attrs() to return
-ENXIO if DMA_ATTR_FORCE_CONTIGUOUS is set.

8<-
diff --git a/drivers/base/dma-mapping.c b/drivers/base/dma-mapping.c
index efd71cf4fdea..ab7071041141 100644
--- a/drivers/base/dma-mapping.c
+++ b/drivers/base/dma-mapping.c
@@ -277,8 +277,8 @@ EXPORT_SYMBOL(dma_common_mmap);
  * remaps an array of PAGE_SIZE pages into another vm_area
  * Cannot be used in non-sleeping contexts
  */
-void *dma_common_pages_remap(struct page **pages, size_t size,
-   unsigned long vm_flags, pgprot_t prot,
+static struct vm_struct *__dma_common_pages_remap(struct page **pages,
+   size_t size, unsigned long vm_flags, pgprot_t prot,
const void *caller)
 {
struct vm_struct *area;
@@ -287,13 +287,26 @@ void *dma_common_pages_remap(struct page **pages, size_t 
size,
if (!area)
return NULL;
 
-   area->pages = pages;
-
if (map_vm_area(area, prot, pages)) {
vunmap(area->addr);
return NULL;
}
 
+   return area;
+}
+
+void *dma_common_pages_remap(struct page **pages, size_t size,
+   unsigned long vm_flags, pgprot_t prot,
+   const void *caller)
+{
+   struct vm_struct *area;
+
+   area = __dma_common_pages_remap(pages, size, vm_flags, prot, caller);
+   if (!area)
+   return NULL;
+
+   area->pages = pages;
+
return area->addr;
 }
 
@@ -308,7 +321,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t 
size,
 {
int i;
struct page **pages;
-   void *ptr;
+   struct vm_struct *area;
unsigned long pfn;
 
pages = kmalloc(sizeof(struct page *) << get_order(size), GFP_KERNEL);
@@ -318,11 +331,13 @@ void *dma_common_contiguous_remap(struct page *page, 
size_t size,
for (i = 0, pfn = page_to_pfn(page); i < (size >> PAGE_SHIFT); i++)
pages[i] = pfn_to_page(pfn + i);
 
-   ptr = dma_common_pages_remap(pages, size, vm_flags, prot, caller);
+   area = __dma_common_pages_remap(pages, size, vm_flags, prot, caller);
 
kfree(pages);
 
-   return ptr;
+   if (!area)
+   return NULL;
+   return area->addr;
 }
 
 /*

-- 
Catalin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 2/2] PCI: quirks: Fix ThunderX2 dma alias handling

2017-04-21 Thread Bjorn Helgaas via iommu
On Mon, Apr 17, 2017 at 12:47 PM, Jayachandran C
 wrote:
> On Fri, Apr 14, 2017 at 09:00:06PM -0500, Bjorn Helgaas wrote:
>> On Fri, Apr 14, 2017 at 4:06 PM, Jayachandran C
>>  wrote:
>> > On Thu, Apr 13, 2017 at 07:19:11PM -0500, Bjorn Helgaas wrote:
>> >> I tentatively applied both patches to pci/host-thunder for v4.12.
>> >>
>> >> However, I am concerned about the topology here:
>> >>
>> >> On Thu, Apr 13, 2017 at 08:30:45PM +, Jayachandran C wrote:
>> >> > On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
>> >> > PCI topology is slightly unusual.  For a multi-node system, it looks
>> >> > like:
>> >> >
>> >> > 00:00.0 [PCI] bridge to [bus 01-1e]
>> >> > 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04]
>> >> > 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT)
>> >> > 03:00.0 PCIe Endpoint
>> >>
>> >> A root port normally has a single PCIe link leading downstream.
>> >> According to this, 02:00.0 is a root port that has the usual
>> >> downstream link leading to 03:00.0, but it also has an upstream link
>> >> to 01:0a.0.
>> >
>> > The PCI topology is a bit broken due to the way that the PCIe IP block
>> > was integrated into SoC PCI bridges and devices. The current mechanism
>> > of adding a PCI-PCIe bridge to glue these together is not ideal.
>>
>> Yeah, that's definitely broken.
>>
>> >> Maybe this example is omitting details that are not relevant to DMA
>> >> aliases?  The PCIe capability only contains one set of link-related
>> >> registers, so I don't know how we could manage a single device that
>> >> has two links.
>> >
>> > The root port is standard and has just one link to the EP (or whatever
>> > is on the external PCIe slot). The fallout of the current hw design is
>> > that the PCI-PCIe bridge has a link that does not follow standard and
>> > does not have a counterpart (as you noted).
>> >
>> >> A device with two links would break things like ASPM.  In
>> >> set_pcie_port_type(), for example, we have this comment:
>> >>
>> >>* A Root Port or a PCI-to-PCIe bridge is always the upstream end
>> >>* of a Link.  No PCIe component has two Links.  Two Links are
>> >>* connected by a Switch that has a Port on each Link and internal
>> >>* logic to connect the two Ports.
>> >>
>> >> The topology above breaks these assumptions, which will make
>> >> pdev->has_secondary_link incorrect, which means ASPM won't work
>> >> correctly.
>> >
>> > Given the current hardware, the pcieport driver seems to work reasonably
>> > for the root port at 02:00.0, with AER support. I will take a look at the
>> > ASPM part.
>>
>> I don't think pcieport itself cares much about links.  ASPM does, but
>> it looks like set_pcie_port_type() actually is smart enough to know
>> that PCI-to-PCIe bridges and Root Ports always have links on their
>> secondary sides.  So has_secondary_link probably does get set
>> correctly.
>>
>> But I think you'll hit the VIA "strange chipset" thing in
>> pcie_aspm_init_link_state(), which will probably prevent us from doing
>> ASPM on the link from 02:00.0 to 03:00.0.
>>
>> Could you collect "lspci -vv" output from this system?  I'd like to
>> archive that as background for this IOMMU issue and the ASPM tweaks I
>> suspect we'll have to do.  I *wish* we had more information about that
>> VIA thing, because I suspect we could get rid of it if we had more
>> details.
>
> The full logs are slightly large, so I have kept them at:
> https://github.com/jchandra-cavm/thunderx2/blob/master/logs/
> The lspci -vv output is lspci-vv.txt and lspci -tvn output is lspci-tvn.txt
>
> The output is from 2 socket system, the cards are not on the first slot
> like the example above, so the bus and device numbers are different.

Can somebody with this system collect the "lspci -" output as well?

I'm making some lspci changes to handle the PCI-to-PCIe bridge
correctly, and I can use the "lspci -" output to create an lspci
test case.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 07/32] x86/mm: Add support to enable SME in early boot processing

2017-04-21 Thread Borislav Petkov
On Tue, Apr 18, 2017 at 04:17:35PM -0500, Tom Lendacky wrote:
> Add support to the early boot code to use Secure Memory Encryption (SME).
> Since the kernel has been loaded into memory in a decrypted state, support
> is added to encrypt the kernel in place and update the early pagetables

s/support is added to //

> with the memory encryption mask so that new pagetable entries will use
> memory encryption.
> 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu: arm-smmu: correct sid to mask

2017-04-21 Thread Peng Fan
>From code "SMR mask 0x%x out of range for SMMU",
so, we need to use mask, not sid.

Signed-off-by: Peng Fan 
Cc: Will Deacon 
Cc: Robin Murphy 
---
 drivers/iommu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index b493c99..5a06de2 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1467,7 +1467,7 @@ static int arm_smmu_add_device(struct device *dev)
}
if (mask & ~smmu->smr_mask_mask) {
dev_err(dev, "SMR mask 0x%x out of range for SMMU 
(0x%x)\n",
-   sid, smmu->smr_mask_mask);
+   mask, smmu->smr_mask_mask);
goto out_free;
}
}
-- 
2.6.6

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, April 19, 2017 2:46 AM
> 
> On 18/04/17 11:26, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> > [...]
> >>   II. Feature bits
> >>   
> >>
> >> VIRTIO_IOMMU_F_INPUT_RANGE (0)
> >>  Available range of virtual addresses is described in input_range
> >
> > Usually only the maximum supported address bits are important.
> > Curious do you see such situation where low end of the address
> > space is not usable (since you have both start/end defined later)?
> 
> A start address would allow to provide something resembling a GART to the
> guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
> aperture. I'm not sure how useful that would be in practice.

Intel VT-d has no such limitation, which I can tell. :-)

> 
> On a related note, the virtio-iommu itself doesn't provide a
> per-address-space aperture as it stands. For example, attaching a device
> to an address space might restrict the available IOVA range for the whole
> AS if that device cannot write to high memory (above 32-bit). If the guest
> attempts to map an IOVA outside this window into the device's address
> space, it should expect the MAP request to fail. And when attaching, if
> the address space already has mappings outside this window, then ATTACH
> should fail.
> 
> This too seems to be something that ought to be communicated by firmware,
> but bits are missing (I can't find anything equivalent to DT's dma-ranges
> for PCI root bridges in ACPI tables, for example). In addition VFIO
> doesn't communicate any DMA mask for devices, and doesn't check them
> itself. I guess that the host could find out the DMA mask of devices one
> way or another, but it is tricky to enforce, so I didn't make this a hard
> requirement. Although I should probably add a few words about it.

If there is no such communication on bare metal, then same for pvIOMMU.

> 
> > [...]
> >>   1. Attach device
> >>   
> >>
> >> struct virtio_iommu_req_attach {
> >>le32address_space;
> >>le32device;
> >>le32flags/reserved;
> >> };
> >>
> >> Attach a device to an address space. 'address_space' is an identifier
> >> unique to the guest. If the address space doesn't exist in the IOMMU
> >
> > Based on your description this address space ID is per operation right?
> > MAP/UNMAP and page-table sharing should have different ID spaces...
> 
> I think it's simpler if we keep a single IOASID space per virtio-iommu
> device, because the maximum number of address spaces (described by
> ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
> you still need to define which devices will share a page directory using
> ATTACH requests, though that interface is not set in stone.

got you. yes VM is supposed to consume less IOASIDs than physically
available. It doesn’t hurt to have one IOASID space for both IOVA
map/unmap usages (one IOASID per device) and SVM usages (multiple
IOASIDs per device). The former is digested by software and the latter
will be bound to hardware.

Thanks
Kevin

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, April 19, 2017 2:41 AM
> 
> On 18/04/17 10:51, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> >> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> >> it is linked to other virtual or assigned devices. So before jumping into
> >> device operations, we need to define a way for the guest to discover the
> >> virtual IOMMU and the devices it translates.
> >>
> >> The host must describe the relation between IOMMU and devices to the
> >> guest
> >> using either device-tree or ACPI. The virtual IOMMU identifies each
> >
> > Do you plan to support both device tree and ACPI?
> 
> Yes, with ACPI the topology would be described using IORT nodes. I didn't
> include an example in my driver because DT is sufficient for a prototype
> and is readily available (both in Linux and kvmtool), whereas IORT would
> be quite easy to reuse in Linux, but isn't present in kvmtool at the
> moment. However, both interfaces have to be supported for the virtio-
> iommu
> to be portable.

'portable' means whether guest enables ACPI?

> 
> >> virtual device with a 32-bit ID, that we will call "Device ID" in this
> >> document. Device IDs are not necessarily unique system-wide, but they
> may
> >> not overlap within a single virtual IOMMU. Device ID of passed-through
> >> devices do not need to match IDs seen by the physical IOMMU.
> >>
> >> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> >> because with PCI the IOMMU interface would itself be an endpoint, and
> >> existing firmware interfaces don't allow to describe IOMMU<->master
> >> relations between PCI endpoints.
> >
> > I'm not familiar with virtio-mmio mechanism. Curious how devices in
> > virtio-mmio are enumerated today? Could we use that mechanism to
> > identify vIOMMUs and then invent a purely para-virtualized method to
> > enumerate devices behind each vIOMMU?
> 
> Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
> node, and with ACPI they use _HID LNRO0005. Since the host already
> describes available devices to a guest using a firmware interface, I think
> we should reuse the tools provided by that interface for describing
> relations between DMA masters and IOMMU.

OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.

> 
> > Asking this is because each vendor has its own enumeration methods.
> > ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device
> > tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your
> > current proposal looks following ARM definitions which I'm not sure
> > extensible enough to cover features defined only in other vendors'
> > structures.
> 
> ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
> regardless of the underlying architecture. It isn't defined solely for the
> ARM SMMU, but serves a more general purpose of describing a map of
> device
> identifiers communicated from one components to another. Both DMAR and
> IVRS have such description (respectively DRHD and IVHD), but they are
> designed for a specific IOMMU, whereas IORT could host other kinds.

I'll take a look at IORT definition. DRHD includes information more
than device mapping.

> 
> It seems that all we really need is an interface that says "there is a
> virtio-iommu at address X, here are the devices it translates and their
> corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
> 
> > Since the purpose of this series is to go para-virtualize, why not also
> > para-virtualize and simplify the enumeration method? For example,
> > we may define a query interface through vIOMMU registers to allow
> > guest query whether a device belonging to that vIOMMU. Then we
> > can even remove use of any enumeration structure completely...
> > Just a quick example which I may not think through all the pros and
> > cons. :-)
> 
> I don't think adding a brand new topology description mechanism is worth
> the effort, we're better off reusing what already exists and is
> implemented by operating systems. Adding a query interface inside the
> vIOMMU may work (though might be very painful to integrate with fwspec in
> Linux), but would be redundant since the host has to provide a firmware
> description of the system anyway.
> 
> >> The following diagram describes a situation where two virtual IOMMUs
> >> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> >> domains, in which each function has a 16-bits requester ID. In order for
> >> the vIOMMU to differentiate guest requests targeted at devices in each
> >> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two
> PCI
> >> domains and a collection of platform devices.
> >>
> >>Device IDRequester ID
> >>   /   0x0   0x0  \
> >>  /   

RE: [RFC 3/3] virtio-iommu: future work

2017-04-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.

[...]
> 
>   II. Page table sharing
>   ==
> 
>   1. Sharing IOMMU page tables
>   
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature
> VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> be using the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
> VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
> create page tables for use with the MAP/UNMAP API. The driver intends
> to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
> pg_format array.
> 
>   VIRTIO_IOMMU_T_PROBE_TABLE
> 
>   struct virtio_iommu_req_probe_table {
>   le32address_space;
>   le32flags;
>   le32len;
> 
>   le32nr_contexts;
>   struct {
>   le32model;
>   u8  format[64];
>   } pg_format[len];
>   };
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds success with all page table formats implemented by the
> physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
> initialize the array to 0 and deduce from there which entries have
> been filled by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and
> the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implement by io-pgtable for example.)

So essentially you need modify all existing IOMMU drivers to support page 
table sharing in pvIOMMU. After abstraction is done the core pvIOMMU files 
can be kept vendor agnostic. But if we talk about the whole pvIOMMU 
module, it actually includes vendor specific logic thus unlike typical 
para-virtualized virtio drivers being completely vendor agnostic. Is this 
understanding accurate?

It also means in the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to Qemu vIOMMU, meaning
such format definitions need be consistently agreed across all those 
components.

[...]

> 
>   2. Sharing MMU page tables
>   --
> 
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
> 
> F_FAULT means that the host communicates page requests from device to
> the
> guest, and the guest can handle them by mapping virtual address in the
> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
> below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as guest
> MMU
> pgtable format.
> 
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>64  2 1 0
>   table > +-+
>   |   pgd   |0|1|<--- context 0
>   |   ---   |0|0|<--- context 1
>   |