Re: [PATCH v4 0/9] mpt3sas and dmapool scalability

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:40:57AM -0500, Tony Battersby wrote:
> I posted v3 on August 7.  Nobody acked or merged the patches, and then
> I got too busy with other stuff to repost until now.

Thanks for resending.  They were in my pile of things to look at, but
that's an ever-growing pile.

> I believe these patches are ready for merging.

I agree.

> cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
> (manually cleaned up column alignment; the columns are: in-use blocks,
> total blocks, block size in bytes, pages)
> poolinfo - 0.1
> reply_post_free_array pool       1      21     192      1
> reply_free pool                  1       1   41728      1
> reply pool                       1       1 1335296      1
> sense pool                       1       1  970272      1
> chain pool                  373959  386048     128  12064
> reply_post_free pool            12      12  166528     12

That reply pool ... 1 object of 1.3MB?  That's a lot of strain to put
on the page allocator (1335296 bytes rounds up to an order-9, i.e. 2MB,
allocation).  I wonder if anything can be done about that.

(I'm equally non-thrilled about the sense pool, the reply_post_free pool
and the reply_free pool, but they seem a little less stressful than the
reply pool)


Re: [PATCH v4 9/9] dmapool: debug: prevent endless loop in case of corruption

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:46:35AM -0500, Tony Battersby wrote:
> Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy
> driver corrupts DMA pool memory.
> 
> Signed-off-by: Tony Battersby 

I like it!  Also, here you're using blks_per_alloc in a way which isn't
normally in the performance path, but might be with the right config
options.  With that, I withdraw my objection to the previous patch and

Acked-by: Matthew Wilcox 

Andrew, can you funnel these in through your tree?  If you'd rather not,
I don't mind stuffing them into a git tree and asking Linus to pull
for 4.21.
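
As an aside, for anyone reading along without the patch in front of them,
the shape of the fix is roughly this (a sketch of the idea only, not Tony's
exact code; page_vaddr and offset stand for the page base address and the
offset of the block being freed in dma_pool_free()):

	/*
	 * With DMAPOOL_DEBUG, dma_pool_free() walks the page's free list
	 * to detect double frees.  Bounding the walk by blks_per_alloc
	 * means a corrupted chain of next-free offsets can no longer make
	 * the walk spin forever.
	 */
	unsigned int chain = page->dma_free_off;
	unsigned int i;

	for (i = 0; i < pool->blks_per_alloc; i++) {
		if (chain >= pool->allocation)
			break;			/* reached end of free list */
		if (chain == offset) {
			/* block is already on the free list: double free */
			break;
		}
		chain = *(int *)(page_vaddr + chain);
	}
	/* exiting with i == pool->blks_per_alloc implies a corrupt cycle */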


Re: [PATCH] iommu: arm-smmu: Set SCTLR.HUPCF bit

2018-11-12 Thread Will Deacon
On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> On Mon, Oct 29, 2018 at 3:09 PM Will Deacon  wrote:
> > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > faults trigger problems with other in-flight transactions from the
> > > GPU causing CP errors, etc.
> > >
> > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > described as:
> > >
> > >  '0' - Stall or terminate subsequent transactions in the presence
> > >of an outstanding context fault
> > >  '1' - Process all subsequent transactions independently of any
> > >outstanding context fault.
> > >
> > > Since we don't enable CFCFG (stall), we get the behavior of
> > > terminating other transactions.  And that is probably not what we
> > > want (and definitely not what we want for the GPU).
> > >
> > > Signed-off-by: Rob Clark 
> > > ---
> > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > because of concern that it would cause problems on other non-qcom
> > > arm smmu implementations.  And I think I forgot to send this version
> > > of the solution.
> > >
> > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > SMMU implementations, I think I can come up with a variant of this
> > > patch which conditionally enables it for snapdragon.
> > >
> > > Either way, I'd really like to get some variant of this fix merged
> > > (and probably it would be a good idea for stable kernel branches
> > > too), since current behaviour with the GPU means faults turn into
> > > a fantastic cascade of fail.
> >
> > Can you describe how this fantastic cascade of fail improves with this
> > patch, please? If you're getting context faults then something has already
> > gone horribly wrong, so I'm trying to work out how this improves things.
> >
> 
> There are plenty of cases where getting iommu faults with a GPU is
> "normal", or at least not something the kernel or even GL driver can
> control.

Such as? All the mainline driver does is print a diagnostic and clear the
fault, which doesn't seem generally useful.

> With this patch, you still get the iommu fault, but it doesn't cause
> the gpu to crash.  But without it, other memory accesses in flight
> while the fault occurs, like the GPU command-processor reading further
> ahead in the cmdstream to set up the next draw, would return zeroes,
> causing the GPU to crash or get into a bad state.

I get that part, but I don't understand why we're seeing faults in the first
place and I worry that this patch is just the tip of the iceberg. It's also
not clear that processing subsequent transactions is always the right thing
to do in a world where we actually want to report (and handle) synchronous
faults from devices.

Will


Re: [PATCH v4 8/9] dmapool: improve accuracy of debug statistics

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:45:58AM -0500, Tony Battersby wrote:
> +++ linux/mm/dmapool.c	2018-08-06 17:52:53.000000000 -0400
> @@ -61,6 +61,7 @@ struct dma_pool {   /* the pool */
>   struct device *dev;
>   unsigned int allocation;
>   unsigned int boundary;
> + unsigned int blks_per_alloc;
>   char name[32];
>   struct list_head pools;
>  };

This one I'm not totally happy with.  You're storing this value when
it could be easily calculated each time through the show_pools() code.
I appreciate this is a topic where reasonable people might have different
opinions about which solution is preferable.

> @@ -182,6 +182,9 @@ struct dma_pool *dma_pool_create(const c
>   retval->size = size;
>   retval->boundary = boundary;
>   retval->allocation = allocation;
> + retval->blks_per_alloc =
> + (allocation / boundary) * (boundary / size) +
> + (allocation % boundary) / size;
>  
>   INIT_LIST_HEAD(&retval->pools);
>  
> 
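
For reference, the stored value is a closed form over fields the pool
already keeps, so the equivalent on-the-fly computation in show_pools()
would be:

	unsigned int blks_per_alloc =
		(pool->allocation / pool->boundary) *
		(pool->boundary / pool->size) +
		(pool->allocation % pool->boundary) / pool->size;

Plugging in the mpt3sas "chain pool" from the dump earlier in this thread
(size 128, and assuming allocation = boundary = 4096, since boundary
defaults to the allocation size when a caller passes 0):

	(4096 / 4096) * (4096 / 128) + (4096 % 4096) / 128 = 32

which matches the 386048 total blocks / 12064 pages reported there.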


Re: [PATCH v4 7/9] dmapool: cleanup integer types

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:45:21AM -0500, Tony Battersby wrote:
> To represent the size of a single allocation, dmapool currently uses
> 'unsigned int' in some places and 'size_t' in other places.  Standardize
> on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
> the blocks in the entire pool.
> 
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 


Re: [PATCH v4 6/9] dmapool: improve scalability of dma_pool_free

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:44:40AM -0500, Tony Battersby wrote:
> dma_pool_free() scales poorly when the pool contains many pages because
> pool_find_page() does a linear scan of all allocated pages.  Improve its
> scalability by replacing the linear scan with virt_to_page() and storing
> dmapool private data directly in 'struct page', thereby eliminating
> 'struct dma_page'.  In big O notation, this improves the algorithm from
> O(n^2) to O(n) while also reducing memory usage.
> 
> Thanks to Matthew Wilcox for the suggestion to use struct page.
> 
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 


Re: [PATCH v4 5/9] dmapool: rename fields in dma_page

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:44:02AM -0500, Tony Battersby wrote:
> Rename fields in 'struct dma_page' in preparation for moving them into
> 'struct page'.  No functional changes.
> 
> in_use -> dma_in_use
> offset -> dma_free_off
> 
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 


Re: [PATCH v4 4/9] dmapool: improve scalability of dma_pool_alloc

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:43:25AM -0500, Tony Battersby wrote:
> dma_pool_alloc() scales poorly when allocating a large number of pages
> because it does a linear scan of all previously-allocated pages before
> allocating a new one.  Improve its scalability by maintaining a separate
> list of pages that have free blocks ready to (re)allocate.  In big O
> notation, this improves the algorithm from O(n^2) to O(n).
> 
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 
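
For context, the bookkeeping the description refers to looks roughly like
this (a sketch using the page_list[]/POOL_MAX_IDX naming visible in patch
6/9 later in this digest, with assumed POOL_AVAIL_IDX/POOL_FULL_IDX index
names and the field names as they end up after patches 5/9 and 6/9; not the
literal patch text):

	/* dma_pool_alloc(): grab a page with a free block in O(1) */
	page = list_first_entry_or_null(&pool->page_list[POOL_AVAIL_IDX],
					struct page, dma_list);
	if (!page)
		page = pool_alloc_page(pool, mem_flags);

	/* ... hand out the block at page->dma_free_off ... */

	/* if that was the page's last free block, park the page on the
	 * 'full' list so later allocations never scan it */
	if (page->dma_free_off >= pool->allocation)
		list_move(&page->dma_list,
			  &pool->page_list[POOL_FULL_IDX]);

dma_pool_free() does the inverse move back to the available list when it
frees a block on a previously-full page.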


Re: [PATCH v4 3/9] dmapool: cleanup dma_pool_destroy

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:42:48AM -0500, Tony Battersby wrote:
> Remove a small amount of code duplication between dma_pool_destroy() and
> pool_free_page() in preparation for adding more code without having to
> duplicate it.  No functional changes.
> 
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 


Re: [PATCH v4 2/9] dmapool: remove checks for dev == NULL

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:42:12AM -0500, Tony Battersby wrote:
> dmapool originally tried to support pools without a device because
> dma_alloc_coherent() supports allocations without a device.  But nobody
> ended up using dma pools without a device, so the current checks in
> dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
> Remove them.
> 
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 


Re: [PATCH v4 1/9] dmapool: fix boundary comparison

2018-11-12 Thread Matthew Wilcox
On Mon, Nov 12, 2018 at 10:41:34AM -0500, Tony Battersby wrote:
> Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries")
> Signed-off-by: Tony Battersby 

Acked-by: Matthew Wilcox 


Re: [PATCH 06/10] swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs

2018-11-12 Thread John Stultz
On Mon, Nov 12, 2018 at 4:07 PM, John Stultz  wrote:
> On Thu, Nov 8, 2018 at 11:49 PM, Christoph Hellwig  wrote:
>> On Tue, Nov 06, 2018 at 05:27:14PM -0800, John Stultz wrote:
>>> But at that point if I just re-apply "swiotlb: use swiotlb_map_page in
>>> swiotlb_map_sg_attrs", I reproduce the hangs.
>>>
>>> Any suggestions for how to further debug what might be going wrong
>>> would be appreciated!
>>
>> Very odd.  In the end map_sg and map_page are defined to do the same
>> things to start with.  The only real issue we had in this area was:
>>
>> "[PATCH v2] of/device: Really only set bus DMA mask when appropriate"
>>
>> so with current mainline + that you still see a problem, and if you
>> revert the commit we are replying to it still goes away?
>
> Just to confirm, as of 4.20-rc2 (which includes the of/device patch
> above), I'm still seeing this issue, but it isn't as reliable to
> reproduce as before.
>
> With "swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs" reverted
> (along with the dependent swiotlb patches) it doesn't seem to trigger
> (no matter what I try).  But re-applying that patch it tends to
> trigger by itself at boot up, but sometimes I have to run "find /" to
> trigger the io hang/stall.

Also, I wanted to mention I've not seen this issue on the original
hikey board. So far only on the hikey960, which might mean this is
tied to something in the hisi UFS driver?

thanks
-john


Re: [PATCH 06/10] swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs

2018-11-12 Thread John Stultz
On Thu, Nov 8, 2018 at 11:49 PM, Christoph Hellwig  wrote:
> On Tue, Nov 06, 2018 at 05:27:14PM -0800, John Stultz wrote:
>> But at that point if I just re-apply "swiotlb: use swiotlb_map_page in
>> swiotlb_map_sg_attrs", I reproduce the hangs.
>>
>> Any suggestions for how to further debug what might be going wrong
>> would be appreciated!
>
> Very odd.  In the end map_sg and map_page are defined to do the same
> things to start with.  The only real issue we had in this area was:
>
> "[PATCH v2] of/device: Really only set bus DMA mask when appropriate"
>
> so with current mainline + that you still see a problem, and if you
> revert the commit we are replying to it still goes away?

Just to confirm, as of 4.20-rc2 (which includes the of/device patch
above), I'm still seeing this issue, but it isn't as reliable to
reproduce as before.

With "swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs" reverted
(along with the dependent swiotlb patches) it doesn't seem to trigger
(no matter what I try).  But re-applying that patch it tends to
trigger by itself at boot up, but sometimes I have to run "find /" to
trigger the io hang/stall.

thanks
-john


Re: [PATCH 2/4] iommu/vt-d: Force IOMMU on for platform opt in hint

2018-11-12 Thread Raj, Ashok
On Mon, Nov 12, 2018 at 11:09:00AM -0700, Alex Williamson wrote:
> On Mon, 12 Nov 2018 19:06:26 +0300
> Mika Westerberg  wrote:
> 
> > From: Lu Baolu 
> > 
> > Intel VT-d spec added a new DMA_CTRL_PLATFORM_OPT_IN_FLAG flag
> > in DMAR ACPI table for BIOS to report compliance about platform
> > initiated DMA restricted to RMRR ranges when transferring control
> > to the OS. The OS treats this as a hint that the IOMMU should be
> > enabled to prevent DMA attacks from possible malicious devices.
> 
> Does this in any way suggest that there are additional recommended use
> cases from Intel for RMRRs?  My concern here is the incompatibility we
> have with RMRRs and device assignment as we currently cannot assign
> devices where the IOVA address space is encumbered by RMRR
> requirements.  Unfortunately RMRRs do not indicate any sort of
> lifespan, so firmware enabling an RMRR simply to support some boot-time
> DMA encumbers the device with that RMRR for the life of that boot,
> unless we have VT-d code that decides it knows better.  Thanks,

IMO any new platform that requires RMRRs should be considered a bug.  The
mechanism was designed originally for things like legacy keyboard emulation.

The best behavior is to continue to not allow devices with RMRRs to be
directly assigned.

Technically, ignoring RMRRs and continuing to assign those devices is risky.
The problem is: if BIOS/SMM initiates some IO in the RMRR range and that
range happens to be mapped by the directly-assigned GPA, it's going to be an
ugly failure.


Re: [PATCH 0/4] PCI / iommu / thunderbolt: IOMMU based DMA protection

2018-11-12 Thread Yehezkel Bernat
On Mon, Nov 12, 2018 at 8:12 PM Lukas Wunner  wrote:
>
> On Mon, Nov 12, 2018 at 07:06:24PM +0300, Mika Westerberg wrote:
> > Recent systems shipping with Windows 10 version 1803 or newer may be
> > utilizing IOMMU to prevent DMA attacks via Thunderbolt ports. This is
> > different from the previous security level based scheme because the
> > connected device cannot access system memory outside of the regions
> > allocated for it by the driver.
> >
> > When enabled the BIOS makes sure no device can do DMA outside of RMRR
> > (Reserved Memory Region Record) regions. This means that during OS boot,
> > before it enables IOMMU, none of the connected devices can bypass DMA
> > protection for instance by overwriting the data structures used by the
> > IOMMU. The BIOS communicates support for this to the OS by setting a new
> > bit in ACPI DMAR table [1].
> >
> > Because these systems utilize an IOMMU to block possible DMA attacks,
> > typically (but not always) the Thunderbolt security level is set to "none"
> > which means that all PCIe devices are immediately usable. This also means
> > that Linux needs to follow Windows 10 and enable IOMMU automatically when
> > running on such a system, otherwise connected devices can read/write system
> > memory pretty much without any restrictions.
>
> What if the system is booted from a Thunderbolt-attached disk?
> Won't this suddenly break with these patches? That would seem like a
> pretty significant regression.

My assumption is that either it isn't supported on such platforms (at least
with this security configuration active), so nothing breaks because it never
worked there, or the BIOS configures the IOMMU in a way that allows the disk
to work until the OS takes control and reconfigures the IOMMU according to
its own decisions.
In the latter case, the kernel+initrd will be loaded before the IOMMU
configuration is changed, and then the kernel should be able to configure it
correctly.  (Unless I really don't understand the mechanism and workflow of
using the IOMMU, which is possible.)


Re: [PATCH 0/4] PCI / iommu / thunderbolt: IOMMU based DMA protection

2018-11-12 Thread Lukas Wunner
On Mon, Nov 12, 2018 at 07:06:24PM +0300, Mika Westerberg wrote:
> Recent systems shipping with Windows 10 version 1803 or newer may be
> utilizing IOMMU to prevent DMA attacks via Thunderbolt ports. This is
> different from the previous security level based scheme because the
> connected device cannot access system memory outside of the regions
> allocated for it by the driver.
> 
> When enabled the BIOS makes sure no device can do DMA outside of RMRR
> (Reserved Memory Region Record) regions. This means that during OS boot,
> before it enables IOMMU, none of the connected devices can bypass DMA
> protection for instance by overwriting the data structures used by the
> IOMMU. The BIOS communicates support for this to the OS by setting a new
> bit in ACPI DMAR table [1].
> 
> Because these systems utilize an IOMMU to block possible DMA attacks,
> typically (but not always) the Thunderbolt security level is set to "none"
> which means that all PCIe devices are immediately usable. This also means
> that Linux needs to follow Windows 10 and enable IOMMU automatically when
> running on such a system, otherwise connected devices can read/write system
> memory pretty much without any restrictions.

What if the system is booted from a Thunderbolt-attached disk?
Won't this suddenly break with these patches?  That would seem like a
pretty significant regression.  What if the only GPU in the system is
Thunderbolt-attached?  Is it possible to recognize such scenarios and
automatically exempt affected devices from IOMMU blocking?

Thanks,

Lukas


Re: [PATCH 2/4] iommu/vt-d: Force IOMMU on for platform opt in hint

2018-11-12 Thread Alex Williamson
On Mon, 12 Nov 2018 19:06:26 +0300
Mika Westerberg  wrote:

> From: Lu Baolu 
> 
> Intel VT-d spec added a new DMA_CTRL_PLATFORM_OPT_IN_FLAG flag
> in DMAR ACPI table for BIOS to report compliance about platform
> initiated DMA restricted to RMRR ranges when transferring control
> to the OS. The OS treats this as a hint that the IOMMU should be
> enabled to prevent DMA attacks from possible malicious devices.

Does this in any way suggest that there are additional recommended use
cases from Intel for RMRRs?  My concern here is the incompatibility we
have with RMRRs and device assignment as we currently cannot assign
devices where the IOVA address space is encumbered by RMRR
requirements.  Unfortunately RMRRs do not indicate any sort of
lifespan, so firmware enabling an RMRR simply to support some boot-time
DMA encumbers the device with that RMRR for the life of that boot,
unless we have VT-d code that decides it knows better.  Thanks,

Alex


Re: [PATCH 1/4] PCI / ACPI: Identify external PCI devices

2018-11-12 Thread Lukas Wunner
On Mon, Nov 12, 2018 at 07:06:25PM +0300, Mika Westerberg wrote:
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1378,6 +1378,27 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>   }
>  }
>  
> +static void set_pcie_external(struct pci_dev *dev)
> +{
> + struct pci_dev *parent;
> +
> + /*
> +  * Walk up the device hierarchy and check for any upstream
> +  * bridge that has is_external set to true. This means
> +  * the hierarchy is below PCIe port that is exposed externally
> +  * (such as Thunderbolt).
> +  */
> + parent = pci_upstream_bridge(dev);
> + while (parent) {
> + if (parent->is_external) {
> + dev->is_external = true;
> + break;
> + }
> +
> + parent = pci_upstream_bridge(parent);
> + }
> +}
> +

This looks pretty much like a duplication of the is_thunderbolt bit
in struct pci_dev and the pci_is_thunderbolt_attached() helper.

Why constrain the functionality to ports with the _DSD property
instead of making it available for *any* Thunderbolt port?
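
For reference, the existing helper (include/linux/pci.h, as of v4.20) does
essentially the same upward walk, keyed on the is_thunderbolt bit:

	static inline bool pci_is_thunderbolt_attached(struct pci_dev *pdev)
	{
		struct pci_dev *parent = pdev;

		if (pdev->is_thunderbolt)
			return true;

		while ((parent = pci_upstream_bridge(parent)))
			if (parent->is_thunderbolt)
				return true;

		return false;
	}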

Thanks,

Lukas


Re: [PATCH 3/4] iommu/vt-d: Do not enable ATS for external devices

2018-11-12 Thread Raj, Ashok
On Mon, Nov 12, 2018 at 07:06:27PM +0300, Mika Westerberg wrote:
> Currently Linux automatically enables ATS (Address Translation Service)
> for any device that supports it (and IOMMU is turned on). ATS is used to
> accelerate DMA access as the device can cache translations locally so
> there is no need to do full translation on IOMMU side. However, as
> pointed out in [1] ATS can be used to bypass IOMMU based security
> completely by simply sending PCIe read/write transaction with AT
> (Address Translation) field set to "translated".
> 
> To mitigate this modify the Intel IOMMU code so that it does not enable
> ATS for any device that is marked as being external. In case this turns
> out to cause performance issues we may selectively allow ATS based on
> user decision but currently use big hammer and disable it completely to
> be on the safe side.
> 
> [1] https://www.repository.cam.ac.uk/handle/1810/274352
> 
> Signed-off-by: Mika Westerberg 

Reviewed-by: Ashok Raj 

> ---
>  drivers/iommu/intel-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index ada786b05a59..b79788da6971 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -1473,7 +1473,8 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
>   if (info->pri_supported && !pci_reset_pri(pdev) && !pci_enable_pri(pdev, 32))
>   info->pri_enabled = 1;
>  #endif
> - if (info->ats_supported && !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
> + if (!pdev->is_external && info->ats_supported &&
> + !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
>   info->ats_enabled = 1;
>   domain_update_iotlb(info->domain);
>   info->ats_qdep = pci_ats_queue_depth(pdev);
> -- 
> 2.19.1
> 


Re: [PATCH 2/4] iommu/vt-d: Force IOMMU on for platform opt in hint

2018-11-12 Thread Raj, Ashok
On Mon, Nov 12, 2018 at 07:06:26PM +0300, Mika Westerberg wrote:
> From: Lu Baolu 
> 
> Intel VT-d spec added a new DMA_CTRL_PLATFORM_OPT_IN_FLAG flag
> in DMAR ACPI table for BIOS to report compliance about platform
> initiated DMA restricted to RMRR ranges when transferring control
> to the OS. The OS treats this as a hint that the IOMMU should be
> enabled to prevent DMA attacks from possible malicious devices.
> 
> A use of this flag is Kernel DMA protection for Thunderbolt[1]
> which in practice means that IOMMU should be enabled for PCIe
> devices connected to the Thunderbolt ports. With IOMMU enabled
> for these devices, all DMA operations are limited in the range
> reserved for it, thus the DMA attacks are prevented. All these
> devices are enumerated in the PCI/PCIe module and marked with
> an is_external flag.
> 
> This forces IOMMU to be enabled if DMA_CTRL_PLATFORM_OPT_IN_FLAG
> is set in DMAR ACPI table and there are PCIe devices marked as
> is_external in the system. This can be turned off by adding
> "intel_iommu=off" in the kernel command line, if any problems are
> found.
> 
> [1] https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt
> 
> Cc: Ashok Raj 
> Cc: Jacob Pan 
> Cc: Sohil Mehta 
> Signed-off-by: Lu Baolu 
> Signed-off-by: Mika Westerberg 

Looks good to me

Reviewed-by: Ashok Raj 

> ---
>  drivers/iommu/dmar.c| 25 +
>  drivers/iommu/intel-iommu.c | 55 +++--
>  include/linux/dmar.h|  8 ++
>  3 files changed, 86 insertions(+), 2 deletions(-)
 


Re: [PATCH 4/4] thunderbolt: Export IOMMU based DMA protection support to userspace

2018-11-12 Thread Yehezkel Bernat
On Mon, Nov 12, 2018 at 6:06 PM Mika Westerberg  wrote:
>
> Recent systems shipping with Windows 10 version 1803 or later may
> support a feature called Kernel DMA protection [1]. In practice this
> means that Thunderbolt connected devices are placed behind an IOMMU
> during the whole time it is connected (including during boot) making
> Thunderbolt security levels redundant. Some of these systems still have
> Thunderbolt security level set to "user" in order to support OS
> downgrade (the older version of the OS might not support IOMMU based DMA
> protection so connecting a device still relies on user approval then).
>
> Export this information to userspace by introducing a new sysfs
> attribute (iommu_dma_protection). Based on it userspace tools can make a
> more accurate decision whether or not to authorize the connected device.
>
> In addition update Thunderbolt documentation regarding IOMMU based DMA
> protection.
>
> [1] https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt
>
> Signed-off-by: Mika Westerberg 
> ---

I can't comment on the IOMMU side, but the Thunderbolt side looks good to me.

Just one point:
Have you considered the option to add this property per (TBT?) device?
If the kernel may decide to enable/disable the IOMMU or ATS per device, maybe
it should be on this level.
Or maybe the IOMMU decision isn't going to change (it's system-wide) and the
ATS decision will be communicated per device by a new sysfs attribute anyway,
if needed?


Re: [PATCH v4 2/9] dmapool: remove checks for dev == NULL

2018-11-12 Thread Tony Battersby
On 11/12/18 11:32 AM, John Garry wrote:
> On 12/11/2018 15:42, Tony Battersby wrote:
>> dmapool originally tried to support pools without a device because
>> dma_alloc_coherent() supports allocations without a device.  But nobody
>> ended up using dma pools without a device, so the current checks in
>> dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
>> Remove them.
>>
> As an aside, is it right that dma_pool_create() does not actually reject 
> dev==NULL and would crash from a NULL-pointer dereference?
>
> Thanks,
> John
>
When passed a NULL dev, dma_pool_create() will already crash with a
NULL-pointer dereference before this patch series, because it checks for
dev == NULL in some places but not others.  Specifically, it will crash
in one of these two places in dma_pool_create():

retval = kmalloc_node(sizeof(*retval), GFP_KERNEL, dev_to_node(dev));
-or-
if (list_empty(&dev->dma_pools))

So removing the checks for dev == NULL will not make previously-working
code to start crashing suddenly.  And since passing dev == NULL would be
an API misuse error and not a runtime error, I would rather not add a
new check to reject it.

Tony

RE: [PATCH 4/4] thunderbolt: Export IOMMU based DMA protection support to userspace

2018-11-12 Thread Mario.Limonciello
> -Original Message-
> From: Mika Westerberg 
> Sent: Monday, November 12, 2018 10:06 AM
> To: iommu@lists.linux-foundation.org
> Cc: Joerg Roedel; David Woodhouse; Lu Baolu; Ashok Raj; Bjorn Helgaas; Rafael 
> J.
> Wysocki; Jacob jun Pan; Andreas Noever; Michael Jamet; Yehezkel Bernat; Lukas
> Wunner; Christian Kellner; Limonciello, Mario; Anthony Wong; Mika Westerberg;
> linux-a...@vger.kernel.org; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: [PATCH 4/4] thunderbolt: Export IOMMU based DMA protection support
> to userspace
> 
> 
> 
> 
> Recent systems shipping with Windows 10 version 1803 or later may
> support a feature called Kernel DMA protection [1]. In practice this
> means that Thunderbolt connected devices are placed behind an IOMMU
> during the whole time it is connected (including during boot) making
> Thunderbolt security levels redundant. Some of these systems still have
> Thunderbolt security level set to "user" in order to support OS
> downgrade (the older version of the OS might not support IOMMU based DMA
> protection so connecting a device still relies on user approval then).
> 
> Export this information to userspace by introducing a new sysfs
> attribute (iommu_dma_protection). Based on it userspace tools can make a
> more accurate decision whether or not to authorize the connected device.
> 
> In addition update Thunderbolt documentation regarding IOMMU based DMA
> protection.
> 
> [1] https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt
> 
> Signed-off-by: Mika Westerberg 
> ---
>  .../ABI/testing/sysfs-bus-thunderbolt |  9 
>  Documentation/admin-guide/thunderbolt.rst | 23 +++
>  drivers/thunderbolt/domain.c  | 17 ++
>  3 files changed, 49 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-thunderbolt b/Documentation/ABI/testing/sysfs-bus-thunderbolt
> index 151584a1f950..b21fba14689b 100644
> --- a/Documentation/ABI/testing/sysfs-bus-thunderbolt
> +++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt
> @@ -21,6 +21,15 @@ Description:   Holds a comma separated list of device unique_ids that
>   If a device is authorized automatically during boot its
>   boot attribute is set to 1.
> 
> +What: /sys/bus/thunderbolt/devices/.../domainX/iommu_dma_protection
> +Date:Mar 2019
> +KernelVersion:   4.21
> +Contact: thunderbolt-softw...@lists.01.org
> +Description: This attribute tells whether the system uses IOMMU
> + for DMA protection. Value of 1 means IOMMU is used, 0 means
> + it is not (DMA protection is solely based on Thunderbolt
> + security levels).
> +
>  What: /sys/bus/thunderbolt/devices/.../domainX/security
>  Date:Sep 2017
>  KernelVersion:   4.13
> diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst
> index 35fccba6a9a6..ccac2596a49f 100644
> --- a/Documentation/admin-guide/thunderbolt.rst
> +++ b/Documentation/admin-guide/thunderbolt.rst
> @@ -133,6 +133,29 @@ If the user still wants to connect the device they can either approve
>  the device without a key or write a new key and write 1 to the
>  ``authorized`` file to get the new key stored on the device NVM.
> 
> +DMA protection utilizing IOMMU
> +--
> +Recent systems shipping with Windows 10 version 1803 or later may support a
> +feature called `Kernel DMA Protection for Thunderbolt 3`_.  This means that

Keep in mind there will be systems that ship with Linux that enable this
feature too ;)

It might be better to make it time-frame and platform-firmware oriented, as
it's entirely possible for an OEM to ship a field firmware upgrade that
enables this functionality from the platform and allows an end user to
upgrade to a sufficiently protected kernel or Windows OS to take advantage
of it.

> +Thunderbolt security is handled by an IOMMU so connected devices cannot
> +access memory regions outside of what is allocated for them by drivers.
> +When Linux is running on such system it automatically enables IOMMU if not
> +enabled by the user already. These systems can be identified by reading
> +``1`` from ``/sys/bus/thunderbolt/devices/domainX/iommu_dma_protection``
> +attribute.
> +
> +The driver does not do anything special in this case but because DMA
> +protection is handled by the IOMMU, security levels (if set) are
> +redundant. For this reason some systems ship with security level set to
> +``none``. Other systems have security level set to ``user`` in order to
> +support downgrade to older Windows, so users who want to automatically
> +authorize devices when IOMMU DMA protection is enabled can use the
> +following ``udev`` rule::
> +
> +  ACTION=="add", SUBSYSTEM=="thunderbolt", ATTRS{iommu_dma_protection}=="1", ATTR{authorized}=="0", ATTR{authorized}="1"
> +

Re: [PATCH v4 2/9] dmapool: remove checks for dev == NULL

2018-11-12 Thread John Garry

On 12/11/2018 15:42, Tony Battersby wrote:

dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device.  But nobody
ended up using dma pools without a device, so the current checks in
dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
Remove them.



As an aside, is it right that dma_pool_create() does not actually reject 
dev==NULL and would crash from a NULL-pointer dereference?


Thanks,
John


Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig	2018-08-03 16:12:23.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-03 16:13:44.000000000 -0400
@@ -277,7 +277,7 @@ void dma_pool_destroy(struct dma_pool *p
 	mutex_lock(&pools_reg_lock);
 	mutex_lock(&pools_lock);
 	list_del(&pool->pools);
-	if (pool->dev && list_empty(&pool->dev->dma_pools))
+	if (list_empty(&pool->dev->dma_pools))
 		empty = true;
 	mutex_unlock(&pools_lock);
 	if (empty)
@@ -289,13 +289,9 @@ void dma_pool_destroy(struct dma_pool *p
 		page = list_entry(pool->page_list.next,
 				  struct dma_page, page_list);
 		if (is_page_busy(page)) {
-			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_destroy %s, %p busy\n",
-					pool->name, page->vaddr);
-			else
-				pr_err("dma_pool_destroy %s, %p busy\n",
-				       pool->name, page->vaddr);
+			dev_err(pool->dev,
+				"dma_pool_destroy %s, %p busy\n",
+				pool->name, page->vaddr);
 			/* leak the still-in-use consistent memory */
 			list_del(&page->page_list);
 			kfree(page);
@@ -357,13 +353,9 @@ void *dma_pool_alloc(struct dma_pool *po
 		for (i = sizeof(page->offset); i < pool->size; i++) {
 			if (data[i] == POOL_POISON_FREED)
 				continue;
-			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
-			else
-				pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-				       pool->name, retval);
+			dev_err(pool->dev,
+				"dma_pool_alloc %s, %p (corrupted)\n",
+				pool->name, retval);

 			/*
 			 * Dump the first 4 bytes even if they are not
@@ -418,13 +410,9 @@ void dma_pool_free(struct dma_pool *pool
 	page = pool_find_page(pool, dma);
 	if (!page) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		if (pool->dev)
-			dev_err(pool->dev,
-				"dma_pool_free %s, %p/%lx (bad dma)\n",
-				pool->name, vaddr, (unsigned long)dma);
-		else
-			pr_err("dma_pool_free %s, %p/%lx (bad dma)\n",
-			       pool->name, vaddr, (unsigned long)dma);
+		dev_err(pool->dev,
+			"dma_pool_free %s, %p/%lx (bad dma)\n",
+			pool->name, vaddr, (unsigned long)dma);
 		return;
 	}

@@ -432,13 +420,9 @@ void dma_pool_free(struct dma_pool *pool
 #ifdef DMAPOOL_DEBUG
 	if ((dma - page->dma) != offset) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		if (pool->dev)
-			dev_err(pool->dev,
-				"dma_pool_free %s, %p (bad vaddr)/%pad\n",
-				pool->name, vaddr, &dma);
-		else
-			pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n",
-			       pool->name, vaddr, &dma);
+		dev_err(pool->dev,
+			"dma_pool_free %s, %p (bad vaddr)/%pad\n",
+			pool->name, vaddr, &dma);
 		return;
 	}
 	{
@@ -449,12 +433,9 @@ void dma_pool_free(struct dma_pool *pool
 				continue;
 			}
 			spin_unlock_irqrestore(&pool->lock, flags);
-			if (pool->dev)
-				dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n",
-					pool->name, &dma);
-			else
-				pr_err("dma_pool_free %s, dma %pad already free\n",
-				       pool->name, &dma);
+			dev_err(pool->dev,
+				"dma_pool_free %s, dma %pad already free\n",
+				pool->name, &dma);

[PATCH 2/4] iommu/vt-d: Force IOMMU on for platform opt in hint

2018-11-12 Thread Mika Westerberg
From: Lu Baolu 

Intel VT-d spec added a new DMA_CTRL_PLATFORM_OPT_IN_FLAG flag
in DMAR ACPI table for BIOS to report compliance about platform
initiated DMA restricted to RMRR ranges when transferring control
to the OS. The OS treats this as a hint that the IOMMU should be
enabled to prevent DMA attacks from possible malicious devices.

A use of this flag is Kernel DMA protection for Thunderbolt[1]
which in practice means that IOMMU should be enabled for PCIe
devices connected to the Thunderbolt ports. With IOMMU enabled
for these devices, all DMA operations are limited to the range
reserved for them, and thus DMA attacks are prevented. All these
devices are enumerated in the PCI/PCIe module and marked with
an is_external flag.

This forces IOMMU to be enabled if DMA_CTRL_PLATFORM_OPT_IN_FLAG
is set in DMAR ACPI table and there are PCIe devices marked as
is_external in the system. This can be turned off by adding
"intel_iommu=off" in the kernel command line, if any problems are
found.

[1] https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt

Cc: Ashok Raj 
Cc: Jacob Pan 
Cc: Sohil Mehta 
Signed-off-by: Lu Baolu 
Signed-off-by: Mika Westerberg 
---
 drivers/iommu/dmar.c| 25 +
 drivers/iommu/intel-iommu.c | 55 +++--
 include/linux/dmar.h|  8 ++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index d9c748b6f9e4..1edf2a251336 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -2042,3 +2042,28 @@ int dmar_device_remove(acpi_handle handle)
 {
return dmar_device_hotplug(handle, false);
 }
+
+/*
+ * dmar_platform_optin - Is %DMA_CTRL_PLATFORM_OPT_IN_FLAG set in DMAR table
+ *
+ * Returns true if the platform has %DMA_CTRL_PLATFORM_OPT_IN_FLAG set in
+ * the ACPI DMAR table. This means that the platform boot firmware has made
+ * sure no device can issue DMA outside of RMRR regions.
+ */
+bool dmar_platform_optin(void)
+{
+   struct acpi_table_dmar *dmar;
+   acpi_status status;
+   bool ret;
+
+   status = acpi_get_table(ACPI_SIG_DMAR, 0,
+   (struct acpi_table_header **)&dmar);
+   if (ACPI_FAILURE(status))
+   return false;
+
+   ret = !!(dmar->flags & DMAR_PLATFORM_OPT_IN);
+   acpi_put_table((struct acpi_table_header *)dmar);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(dmar_platform_optin);
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f3ccf025108b..ada786b05a59 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -184,6 +184,7 @@ static int rwbf_quirk;
  */
 static int force_on = 0;
 int intel_iommu_tboot_noforce;
+static int no_platform_optin;
 
 #define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
 
@@ -503,6 +504,7 @@ static int __init intel_iommu_setup(char *str)
pr_info("IOMMU enabled\n");
} else if (!strncmp(str, "off", 3)) {
dmar_disabled = 1;
+   no_platform_optin = 1;
pr_info("IOMMU disabled\n");
} else if (!strncmp(str, "igfx_off", 8)) {
dmar_map_gfx = 0;
@@ -2895,6 +2897,14 @@ static int iommu_should_identity_map(struct device *dev, int startup)
if (device_is_rmrr_locked(dev))
return 0;
 
+   /*
+* Prevent any device marked as an external one from getting
+* placed into the static identity mapping domain. This
+* is done because external devices are always untrusted.
+*/
+   if (pdev->is_external)
+   return 0;
+
if ((iommu_identity_mapping & IDENTMAP_AZALIA) && IS_AZALIA(pdev))
return 1;
 
@@ -4728,14 +4738,55 @@ const struct attribute_group *intel_iommu_groups[] = {
NULL,
 };
 
+static int __init platform_optin_force_iommu(void)
+{
+   struct pci_dev *pdev = NULL;
+   bool has_ext_dev = false;
+
+   if (!dmar_platform_optin() || no_platform_optin)
+   return 0;
+
+   for_each_pci_dev(pdev) {
+   if (pdev->is_external) {
+   has_ext_dev = true;
+   break;
+   }
+   }
+
+   if (!has_ext_dev)
+   return 0;
+
+   if (no_iommu || dmar_disabled)
+   pr_info("Intel-IOMMU force enabled due to platform opt in\n");
+
+   /*
+*  If Intel-IOMMU is disabled by default, we will apply
+*  identity map for all devices except those marked as
+*  unsafe external devices.
+*/
+   if (dmar_disabled)
+   iommu_identity_mapping |= IDENTMAP_ALL;
+
+   dmar_disabled = 0;
+#if defined(CONFIG_X86) && defined(CONFIG_SWIOTLB)
+   swiotlb = 0;
+#endif
+   

[PATCH 4/4] thunderbolt: Export IOMMU based DMA protection support to userspace

2018-11-12 Thread Mika Westerberg
Recent systems shipping with Windows 10 version 1803 or later may
support a feature called Kernel DMA protection [1]. In practice this
means that Thunderbolt connected devices are placed behind an IOMMU
during the whole time it is connected (including during boot) making
Thunderbolt security levels redundant. Some of these systems still have
Thunderbolt security level set to "user" in order to support OS
downgrade (the older version of the OS might not support IOMMU based DMA
protection so connecting a device still relies on user approval then).

Export this information to userspace by introducing a new sysfs
attribute (iommu_dma_protection). Based on it userspace tools can make a
more accurate decision whether or not to authorize the connected device.

In addition update Thunderbolt documentation regarding IOMMU based DMA
protection.

[1] https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt

Signed-off-by: Mika Westerberg 
---
 .../ABI/testing/sysfs-bus-thunderbolt |  9 
 Documentation/admin-guide/thunderbolt.rst | 23 +++
 drivers/thunderbolt/domain.c  | 17 ++
 3 files changed, 49 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-thunderbolt b/Documentation/ABI/testing/sysfs-bus-thunderbolt
index 151584a1f950..b21fba14689b 100644
--- a/Documentation/ABI/testing/sysfs-bus-thunderbolt
+++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt
@@ -21,6 +21,15 @@ Description: Holds a comma separated list of device unique_ids that
If a device is authorized automatically during boot its
boot attribute is set to 1.
 
+What: /sys/bus/thunderbolt/devices/.../domainX/iommu_dma_protection
+Date:  Mar 2019
+KernelVersion: 4.21
+Contact:   thunderbolt-softw...@lists.01.org
+Description:   This attribute tells whether the system uses IOMMU
+   for DMA protection. Value of 1 means IOMMU is used, 0 means
+   it is not (DMA protection is solely based on Thunderbolt
+   security levels).
+
 What: /sys/bus/thunderbolt/devices/.../domainX/security
 Date:  Sep 2017
 KernelVersion: 4.13
diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst
index 35fccba6a9a6..ccac2596a49f 100644
--- a/Documentation/admin-guide/thunderbolt.rst
+++ b/Documentation/admin-guide/thunderbolt.rst
@@ -133,6 +133,29 @@ If the user still wants to connect the device they can either approve
 the device without a key or write a new key and write 1 to the
 ``authorized`` file to get the new key stored on the device NVM.
 
+DMA protection utilizing IOMMU
+--
+Recent systems shipping with Windows 10 version 1803 or later may support a
+feature called `Kernel DMA Protection for Thunderbolt 3`_.  This means that
+Thunderbolt security is handled by an IOMMU so connected devices cannot
+access memory regions outside of what is allocated for them by drivers.
+When Linux is running on such system it automatically enables IOMMU if not
+enabled by the user already. These systems can be identified by reading
+``1`` from ``/sys/bus/thunderbolt/devices/domainX/iommu_dma_protection``
+attribute.
+
+The driver does not do anything special in this case but because DMA
+protection is handled by the IOMMU, security levels (if set) are
+redundant. For this reason some systems ship with security level set to
+``none``. Other systems have security level set to ``user`` in order to
+support downgrade to older Windows, so users who want to automatically
+authorize devices when IOMMU DMA protection is enabled can use the
+following ``udev`` rule::
+
+  ACTION=="add", SUBSYSTEM=="thunderbolt", ATTRS{iommu_dma_protection}=="1", ATTR{authorized}=="0", ATTR{authorized}="1"
+
+.. _Kernel DMA Protection for Thunderbolt 3: https://docs.microsoft.com/en-us/windows/security/information-protection/kernel-dma-protection-for-thunderbolt
+
 Upgrading NVM on Thunderbolt device or host
 ---
 Since most of the functionality is handled in firmware running on a
diff --git a/drivers/thunderbolt/domain.c b/drivers/thunderbolt/domain.c
index 93e562f18d40..7416bdbd8576 100644
--- a/drivers/thunderbolt/domain.c
+++ b/drivers/thunderbolt/domain.c
@@ -7,7 +7,9 @@
  */
 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -236,6 +238,20 @@ static ssize_t boot_acl_store(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RW(boot_acl);
 
+static ssize_t iommu_dma_protection_show(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   /*
+* Kernel DMA protection is a feature where Thunderbolt security is
+* handled natively using IOMMU. It is enabled when IOMMU is
+* enabled and ACPI DMAR table has 

[PATCH 0/4] PCI / iommu / thunderbolt: IOMMU based DMA protection

2018-11-12 Thread Mika Westerberg
Hi all,

Recent systems shipping with Windows 10 version 1803 or newer may be
utilizing IOMMU to prevent DMA attacks via Thunderbolt ports. This is
different from the previous security level based scheme because the
connected device cannot access system memory outside of the regions
allocated for it by the driver.

When enabled the BIOS makes sure no device can do DMA outside of RMRR
(Reserved Memory Region Record) regions. This means that during OS boot,
before it enables IOMMU, none of the connected devices can bypass DMA
protection for instance by overwriting the data structures used by the
IOMMU. The BIOS communicates support for this to the OS by setting a new
bit in ACPI DMAR table [1].

Because these systems utilize an IOMMU to block possible DMA attacks,
typically (but not always) the Thunderbolt security level is set to "none"
which means that all PCIe devices are immediately usable. This also means
that Linux needs to follow Windows 10 and enable IOMMU automatically when
running on such a system, otherwise connected devices can read/write system
memory pretty much without any restrictions.

Since there is a way to identify PCIe root ports that are "external facing"
we can put all internal devices into pass-through (identity mapping) mode and
only external devices need to go through full IOMMU mappings.

We also make sure PCIe ATS (Address Translation Service) is not enabled for
external devices because it could be used to bypass IOMMU completely as
explained in the changelog of patch 3/4.

Finally we expose this information to userspace so tools such as bolt can
make a more accurate decision whether or not to authorize the connected device.

[1] https://software.intel.com/sites/default/files/managed/c5/15/vt-directed-io-spec.pdf
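
For example, a userspace tool can check the new attribute like this
(illustrative output; the domain number depends on the system):

  # cat /sys/bus/thunderbolt/devices/domain0/iommu_dma_protection
  1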

Lu Baolu (1):
  iommu/vt-d: Force IOMMU on for platform opt in hint

Mika Westerberg (3):
  PCI / ACPI: Identify external PCI devices
  iommu/vt-d: Do not enable ATS for external devices
  thunderbolt: Export IOMMU based DMA protection support to userspace

 .../ABI/testing/sysfs-bus-thunderbolt |  9 +++
 Documentation/admin-guide/thunderbolt.rst | 23 
 drivers/acpi/property.c   |  3 +
 drivers/iommu/dmar.c  | 25 
 drivers/iommu/intel-iommu.c   | 58 ++-
 drivers/pci/pci-acpi.c| 13 +
 drivers/pci/probe.c   | 23 
 drivers/thunderbolt/domain.c  | 17 ++
 include/linux/dmar.h  |  8 +++
 include/linux/pci.h   |  1 +
 10 files changed, 177 insertions(+), 3 deletions(-)

-- 
2.19.1



[PATCH 1/4] PCI / ACPI: Identify external PCI devices

2018-11-12 Thread Mika Westerberg
Recent systems shipping Windows 10 version 1803 or later may support
IOMMU based DMA protection. This means that the platform utilizes IOMMU
to prevent DMA attacks over externally exposed PCIe root ports
(typically Thunderbolt ports).

The system BIOS marks these PCIe root ports as being externally facing
ports by implementing following ACPI _DSD under the root port in
question:

  Name (_DSD, Package () {
  ToUUID ("efcc06cc-73ac-4bc3-bff0-76143807c389"),
  Package () {
  Package () {"ExternalFacingPort", 1},
  Package () {"UID", 0 }
  }
  })

This is documented [1].

To make it possible for IOMMU code to identify these external devices,
we look up this property and mark all child devices (including
the root port itself) with a new flag (->is_external).

[1] https://docs.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-externally-exposed-pcie-root-ports

Signed-off-by: Mika Westerberg 
---
 drivers/acpi/property.c |  3 +++
 drivers/pci/pci-acpi.c  | 13 +
 drivers/pci/probe.c | 23 +++
 include/linux/pci.h |  1 +
 4 files changed, 40 insertions(+)

diff --git a/drivers/acpi/property.c b/drivers/acpi/property.c
index 8c7c4583b52d..4bdad32f62c8 100644
--- a/drivers/acpi/property.c
+++ b/drivers/acpi/property.c
@@ -31,6 +31,9 @@ static const guid_t prp_guids[] = {
/* Hotplug in D3 GUID: 6211e2c0-58a3-4af3-90e1-927a4e0c55a4 */
GUID_INIT(0x6211e2c0, 0x58a3, 0x4af3,
  0x90, 0xe1, 0x92, 0x7a, 0x4e, 0x0c, 0x55, 0xa4),
+   /* External facing port GUID: efcc06cc-73ac-4bc3-bff0-76143807c389 */
+   GUID_INIT(0xefcc06cc, 0x73ac, 0x4bc3,
+ 0xbf, 0xf0, 0x76, 0x14, 0x38, 0x07, 0xc3, 0x89),
 };
 
 static const guid_t ads_guid =
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 2a4aa6468579..fc193279d24d 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -789,6 +789,18 @@ static void pci_acpi_optimize_delay(struct pci_dev *pdev,
ACPI_FREE(obj);
 }
 
+static void pci_acpi_set_external(struct pci_dev *dev)
+{
+   u8 val;
+
+   if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
+   return;
+   if (device_property_read_u8(&dev->dev, "ExternalFacingPort", &val))
+   return;
+
+   dev->is_external = val == 1;
+}
+
 static void pci_acpi_setup(struct device *dev)
 {
struct pci_dev *pci_dev = to_pci_dev(dev);
@@ -803,6 +815,7 @@ static void pci_acpi_setup(struct device *dev)
set_dev_node(dev, node);
 
pci_acpi_optimize_delay(pci_dev, adev->handle);
+   pci_acpi_set_external(pci_dev);
 
pci_acpi_add_pm_notifier(adev, pci_dev);
if (!adev->wakeup.flags.valid)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b1c05b5054a0..f1db195a4a90 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1378,6 +1378,27 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
}
 }
 
+static void set_pcie_external(struct pci_dev *dev)
+{
+   struct pci_dev *parent;
+
+   /*
+* Walk up the device hierarchy and check for any upstream
+* bridge that has is_external set to true. This means
+* the hierarchy is below PCIe port that is exposed externally
+* (such as Thunderbolt).
+*/
+   parent = pci_upstream_bridge(dev);
+   while (parent) {
+   if (parent->is_external) {
+   dev->is_external = true;
+   break;
+   }
+
+   parent = pci_upstream_bridge(parent);
+   }
+}
+
 /**
  * pci_ext_cfg_is_aliased - Is ext config space just an alias of std config?
  * @dev: PCI device
@@ -1638,6 +1659,8 @@ int pci_setup_device(struct pci_dev *dev)
/* Need to have dev->cfg_size ready */
set_pcie_thunderbolt(dev);
 
+   set_pcie_external(dev);
+
/* "Unknown power state" */
dev->current_state = PCI_UNKNOWN;
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 11c71c4ecf75..e1c0e032da55 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -396,6 +396,7 @@ struct pci_dev {
unsigned intis_hotplug_bridge:1;
unsigned intshpc_managed:1; /* SHPC owned by shpchp */
unsigned intis_thunderbolt:1;   /* Thunderbolt controller */
+   unsigned intis_external:1;  /* External PCIe device */
unsigned int__aer_firmware_first_valid:1;
unsigned int__aer_firmware_first:1;
unsigned intbroken_intx_masking:1;  /* INTx masking can't be used */
-- 
2.19.1



[PATCH 3/4] iommu/vt-d: Do not enable ATS for external devices

2018-11-12 Thread Mika Westerberg
Currently Linux automatically enables ATS (Address Translation Service)
for any device that supports it (and IOMMU is turned on). ATS is used to
accelerate DMA access as the device can cache translations locally so
there is no need to do full translation on IOMMU side. However, as
pointed out in [1] ATS can be used to bypass IOMMU based security
completely by simply sending PCIe read/write transaction with AT
(Address Translation) field set to "translated".

To mitigate this modify the Intel IOMMU code so that it does not enable
ATS for any device that is marked as being external. In case this turns
out to cause performance issues we may selectively allow ATS based on
user decision but currently use big hammer and disable it completely to
be on the safe side.

[1] https://www.repository.cam.ac.uk/handle/1810/274352

Signed-off-by: Mika Westerberg 
---
 drivers/iommu/intel-iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index ada786b05a59..b79788da6971 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1473,7 +1473,8 @@ static void iommu_enable_dev_iotlb(struct device_domain_info *info)
 	if (info->pri_supported && !pci_reset_pri(pdev) && !pci_enable_pri(pdev, 32))
info->pri_enabled = 1;
 #endif
-   if (info->ats_supported && !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
+   if (!pdev->is_external && info->ats_supported &&
+   !pci_enable_ats(pdev, VTD_PAGE_SHIFT)) {
info->ats_enabled = 1;
domain_update_iotlb(info->domain);
info->ats_qdep = pci_ats_queue_depth(pdev);
-- 
2.19.1



[PATCH v4 6/9] dmapool: improve scalability of dma_pool_free

2018-11-12 Thread Tony Battersby
dma_pool_free() scales poorly when the pool contains many pages because
pool_find_page() does a linear scan of all allocated pages.  Improve its
scalability by replacing the linear scan with virt_to_page() and storing
dmapool private data directly in 'struct page', thereby eliminating
'struct dma_page'.  In big O notation, this improves the algorithm from
O(n^2) to O(n) while also reducing memory usage.

Thanks to Matthew Wilcox for the suggestion to use struct page.

Signed-off-by: Tony Battersby 
---
--- linux/include/linux/mm_types.h.orig	2018-08-01 17:59:46.000000000 -0400
+++ linux/include/linux/mm_types.h	2018-08-01 17:59:56.000000000 -0400
@@ -156,6 +156,12 @@ struct page {
unsigned long _zd_pad_1;/* uses mapping */
};
 
+   struct {/* dma_pool pages */
+   struct list_head dma_list;
+   dma_addr_t dma;
+   unsigned int dma_free_off;
+   };
+
/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
};
@@ -177,6 +183,8 @@ struct page {
 
unsigned int active;/* SLAB */
int units;  /* SLOB */
+
+   unsigned int dma_in_use;/* dma_pool pages */
};
 
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
--- linux/mm/dmapool.c.orig	2018-08-03 17:47:03.000000000 -0400
+++ linux/mm/dmapool.c	2018-08-03 17:47:22.000000000 -0400
@@ -25,6 +25,10 @@
  * list tracks pages that have one or more free blocks, and the 'full' list
  * tracks pages that have no free blocks.  Pages are moved from one list to
  * the other as their blocks are allocated and freed.
+ *
+ * When allocating DMA pages, we use some available space in 'struct page' to
+ * store data private to dmapool; search 'dma_pool' in the definition of
+ * 'struct page' for details.
  */
 
 #include 
@@ -61,14 +65,6 @@ struct dma_pool {/* the pool */
struct list_head pools;
 };
 
-struct dma_page {  /* cacheable header for 'allocation' bytes */
-   struct list_head dma_list;
-   void *vaddr;
-   dma_addr_t dma;
-   unsigned int dma_in_use;
-   unsigned int dma_free_off;
-};
-
 static DEFINE_MUTEX(pools_lock);
 static DEFINE_MUTEX(pools_reg_lock);
 
@@ -95,7 +91,7 @@ show_pools(struct device *dev, struct de
 
	spin_lock_irq(&pool->lock);
for (list_idx = 0; list_idx < POOL_MAX_IDX; list_idx++) {
-   struct dma_page *page;
+   struct page *page;
 
list_for_each_entry(page,
			    &pool->page_list[list_idx],
@@ -218,7 +214,7 @@ struct dma_pool *dma_pool_create(const c
 }
 EXPORT_SYMBOL(dma_pool_create);
 
-static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_initialize_free_block_list(struct dma_pool *pool, void *vaddr)
 {
unsigned int offset = 0;
unsigned int next_boundary = pool->boundary;
@@ -229,47 +225,57 @@ static void pool_initialise_page(struct 
next = next_boundary;
next_boundary += pool->boundary;
}
-   *(int *)(page->vaddr + offset) = next;
+   *(int *)(vaddr + offset) = next;
offset = next;
} while (offset < pool->allocation);
 }
 
-static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
+static struct page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
 {
-   struct dma_page *page;
+   struct page *page;
+   dma_addr_t dma;
+   void *vaddr;
 
-   page = kmalloc(sizeof(*page), mem_flags);
-   if (!page)
+	vaddr = dma_alloc_coherent(pool->dev, pool->allocation, &dma,
+  mem_flags);
+   if (!vaddr)
return NULL;
-	page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
-					 &page->dma, mem_flags);
-   if (page->vaddr) {
+
 #ifdef DMAPOOL_DEBUG
-   memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+   memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-   pool_initialise_page(pool, page);
-   page->dma_in_use = 0;
-   page->dma_free_off = 0;
-   } else {
-   kfree(page);
-   page = NULL;
-   }
+   pool_initialize_free_block_list(pool, vaddr);
+
+   page = virt_to_page(vaddr);
+   page->dma = dma;
+   page->dma_free_off = 0;
+   page->dma_in_use = 0;
+
return page;
 }
 
-static inline bool is_page_busy(struct dma_page *page)
+static inline bool is_page_busy(struct page *page)
 {
return page->dma_in_use != 0;
 }
 
-static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_free_page(struct dma_pool *pool, struct page *page)

[PATCH v4 4/9] dmapool: improve scalability of dma_pool_alloc

2018-11-12 Thread Tony Battersby
dma_pool_alloc() scales poorly when allocating a large number of pages
because it does a linear scan of all previously-allocated pages before
allocating a new one.  Improve its scalability by maintaining a separate
list of pages that have free blocks ready to (re)allocate.  In big O
notation, this improves the algorithm from O(n^2) to O(n).
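
A minimal sketch of the resulting allocation fast path (names from the
patch below; not the verbatim patched code):

	/* take the first page that still has a free block, and fall
	 * back to allocating a new page only when none is available */
	page = list_first_entry_or_null(&pool->page_list[POOL_AVAIL_IDX],
					struct dma_page, dma_list);
	if (!page)
		page = pool_alloc_page(pool, mem_flags);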

Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig 2018-08-03 16:16:49.0 -0400
+++ linux/mm/dmapool.c  2018-08-03 16:45:33.0 -0400
@@ -15,11 +15,16 @@
  * Many older drivers still have their own code to do this.
  *
  * The current design of this allocator is fairly simple.  The pool is
- * represented by the 'struct dma_pool' which keeps a doubly-linked list of
- * allocated pages.  Each page in the page_list is split into blocks of at
- * least 'size' bytes.  Free blocks are tracked in an unsorted singly-linked
- * list of free blocks within the page.  Used blocks aren't tracked, but we
- * keep a count of how many are currently allocated from each page.
+ * represented by the 'struct dma_pool'.  Each allocated page is split into
+ * blocks of at least 'size' bytes.  Free blocks are tracked in an unsorted
+ * singly-linked list of free blocks within the page.  Used blocks aren't
+ * tracked, but we keep a count of how many are currently allocated from each
+ * page.
+ *
+ * The pool keeps two doubly-linked lists of allocated pages.  The 'available'
+ * list tracks pages that have one or more free blocks, and the 'full' list
+ * tracks pages that have no free blocks.  Pages are moved from one list to
+ * the other as their blocks are allocated and freed.
  */
 
 #include 
@@ -43,7 +48,10 @@
 #endif
 
 struct dma_pool {  /* the pool */
-   struct list_head page_list;
+#define POOL_FULL_IDX   0
+#define POOL_AVAIL_IDX  1
+#define POOL_MAX_IDX2
+   struct list_head page_list[POOL_MAX_IDX];
spinlock_t lock;
size_t size;
struct device *dev;
@@ -54,7 +62,7 @@ struct dma_pool { /* the pool */
 };
 
 struct dma_page {  /* cacheable header for 'allocation' bytes */
-   struct list_head page_list;
+   struct list_head dma_list;
void *vaddr;
dma_addr_t dma;
unsigned int in_use;
@@ -70,7 +78,6 @@ show_pools(struct device *dev, struct de
unsigned temp;
unsigned size;
char *next;
-   struct dma_page *page;
struct dma_pool *pool;
 
next = buf;
@@ -84,11 +91,18 @@ show_pools(struct device *dev, struct de
list_for_each_entry(pool, >dma_pools, pools) {
unsigned pages = 0;
unsigned blocks = 0;
+   int list_idx;
 
	spin_lock_irq(&pool->lock);
-	list_for_each_entry(page, &pool->page_list, page_list) {
-   pages++;
-   blocks += page->in_use;
+   for (list_idx = 0; list_idx < POOL_MAX_IDX; list_idx++) {
+   struct dma_page *page;
+
+   list_for_each_entry(page,
+				    &pool->page_list[list_idx],
+   dma_list) {
+   pages++;
+   blocks += page->in_use;
+   }
}
	spin_unlock_irq(&pool->lock);
 
@@ -163,7 +177,8 @@ struct dma_pool *dma_pool_create(const c
 
retval->dev = dev;
 
-	INIT_LIST_HEAD(&retval->page_list);
+	INIT_LIST_HEAD(&retval->page_list[POOL_FULL_IDX]);
+	INIT_LIST_HEAD(&retval->page_list[POOL_AVAIL_IDX]);
	spin_lock_init(&retval->lock);
retval->size = size;
retval->boundary = boundary;
@@ -252,7 +267,7 @@ static void pool_free_page(struct dma_po
void *vaddr = page->vaddr;
dma_addr_t dma = page->dma;
 
-	list_del(&page->page_list);
+	list_del(&page->dma_list);
 
if (is_page_busy(page)) {
dev_err(pool->dev,
@@ -278,8 +293,8 @@ static void pool_free_page(struct dma_po
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
-   struct dma_page *page;
bool empty = false;
+   int list_idx;
 
if (unlikely(!pool))
return;
@@ -294,10 +309,15 @@ void dma_pool_destroy(struct dma_pool *p
	device_remove_file(pool->dev, &dev_attr_pools);
	mutex_unlock(&pools_reg_lock);
 
-	while ((page = list_first_entry_or_null(&pool->page_list,
-   struct dma_page,
-   page_list))) {
-   pool_free_page(pool, page);
+   for (list_idx = 0; list_idx < POOL_MAX_IDX; list_idx++) {
+   struct dma_page *page;
+
+   while ((page = list_first_entry_or_null(
+				&pool->page_list[list_idx],
+   struct dma_page,
+   dma_list))) {
+   pool_free_page(pool, page);
+

[PATCH v4 7/9] dmapool: cleanup integer types

2018-11-12 Thread Tony Battersby
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places.  Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.

Signed-off-by: Tony Battersby 
---

This puts an upper bound on 'size' of INT_MAX to avoid overflowing the
following comparison in pool_initialize_free_block_list():

unsigned int offset = 0;
unsigned int next = offset + pool->size;
if (unlikely((next + pool->size) > ...

The actual maximum allocation size is probably lower anyway, probably
KMALLOC_MAX_SIZE, but that gets into the implementation details of other
subsystems which don't export a predefined maximum, so I didn't want to
hardcode it here.  The purpose of the added bounds check is to avoid
overflowing integers, not to check the actual
(platform/device/config-specific?) maximum allocation size.
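
To spell out the arithmetic (assuming a 32-bit unsigned int): capping
'size' at INT_MAX guarantees that 'next + pool->size' is at most
2 * INT_MAX, which still fits in an unsigned int, so the comparison
above cannot be defeated by wraparound.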

'boundary' is passed in as a size_t but gets stored as an unsigned int. 
'boundary' values >= 'allocation' do not have any effect, so clipping
'boundary' to 'allocation' keeps it within the range of unsigned int
without affecting anything else.  A few lines above (not in the diff)
you can see that if 'boundary' is passed in as 0 then it is set to
'allocation', so it is nothing new.  For reference, here is the
relevant code after being patched:

if (!boundary)
boundary = allocation;
else if ((boundary < size) || (boundary & (boundary - 1)))
return NULL;

boundary = min(boundary, allocation);

--- linux/mm/dmapool.c.orig 2018-08-06 17:48:19.0 -0400
+++ linux/mm/dmapool.c  2018-08-06 17:48:54.0 -0400
@@ -57,10 +57,10 @@ struct dma_pool {   /* the pool */
 #define POOL_MAX_IDX2
struct list_head page_list[POOL_MAX_IDX];
spinlock_t lock;
-   size_t size;
+   unsigned int size;
struct device *dev;
-   size_t allocation;
-   size_t boundary;
+   unsigned int allocation;
+   unsigned int boundary;
char name[32];
struct list_head pools;
 };
@@ -86,7 +86,7 @@ show_pools(struct device *dev, struct de
	mutex_lock(&pools_lock);
	list_for_each_entry(pool, &dev->dma_pools, pools) {
unsigned pages = 0;
-   unsigned blocks = 0;
+   size_t blocks = 0;
int list_idx;
 
		spin_lock_irq(&pool->lock);
@@ -103,9 +103,10 @@ show_pools(struct device *dev, struct de
		spin_unlock_irq(&pool->lock);
 
/* per-pool info, no real statistics yet */
-   temp = scnprintf(next, size, "%-16s %4u %4zu %4zu %2u\n",
+   temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 pool->name, blocks,
-pages * (pool->allocation / pool->size),
+(size_t) pages *
+(pool->allocation / pool->size),
 pool->size, pages);
size -= temp;
next += temp;
@@ -150,7 +151,7 @@ struct dma_pool *dma_pool_create(const c
else if (align & (align - 1))
return NULL;
 
-   if (size == 0)
+   if (size == 0 || size > INT_MAX)
return NULL;
else if (size < 4)
size = 4;
@@ -165,6 +166,8 @@ struct dma_pool *dma_pool_create(const c
else if ((boundary < size) || (boundary & (boundary - 1)))
return NULL;
 
+   boundary = min(boundary, allocation);
+
retval = kmalloc_node(sizeof(*retval), GFP_KERNEL, dev_to_node(dev));
if (!retval)
return retval;
@@ -344,7 +347,7 @@ void *dma_pool_alloc(struct dma_pool *po
 {
unsigned long flags;
struct page *page;
-   size_t offset;
+   unsigned int offset;
void *retval;
void *vaddr;
 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 5/9] dmapool: rename fields in dma_page

2018-11-12 Thread Tony Battersby
Rename fields in 'struct dma_page' in preparation for moving them into
'struct page'.  No functional changes.

in_use -> dma_in_use
offset -> dma_free_off

Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig 2018-08-03 17:46:13.0 -0400
+++ linux/mm/dmapool.c  2018-08-03 17:46:24.0 -0400
@@ -65,8 +65,8 @@ struct dma_page { /* cacheable header f
struct list_head dma_list;
void *vaddr;
dma_addr_t dma;
-   unsigned int in_use;
-   unsigned int offset;
+   unsigned int dma_in_use;
+   unsigned int dma_free_off;
 };
 
 static DEFINE_MUTEX(pools_lock);
@@ -101,7 +101,7 @@ show_pools(struct device *dev, struct de
				    &pool->page_list[list_idx],
dma_list) {
pages++;
-   blocks += page->in_use;
+   blocks += page->dma_in_use;
}
}
		spin_unlock_irq(&pool->lock);
@@ -248,8 +248,8 @@ static struct dma_page *pool_alloc_page(
memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
pool_initialise_page(pool, page);
-   page->in_use = 0;
-   page->offset = 0;
+   page->dma_in_use = 0;
+   page->dma_free_off = 0;
} else {
kfree(page);
page = NULL;
@@ -259,7 +259,7 @@ static struct dma_page *pool_alloc_page(
 
 static inline bool is_page_busy(struct dma_page *page)
 {
-   return page->in_use != 0;
+   return page->dma_in_use != 0;
 }
 
 static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
@@ -362,10 +362,10 @@ void *dma_pool_alloc(struct dma_pool *po
 
	list_add(&page->dma_list, &pool->page_list[POOL_AVAIL_IDX]);
  ready:
-   page->in_use++;
-   offset = page->offset;
-   page->offset = *(int *)(page->vaddr + offset);
-   if (page->offset >= pool->allocation)
+   page->dma_in_use++;
+   offset = page->dma_free_off;
+   page->dma_free_off = *(int *)(page->vaddr + offset);
+   if (page->dma_free_off >= pool->allocation)
/* Move page from the "available" list to the "full" list. */
		list_move_tail(&page->dma_list,
			       &pool->page_list[POOL_FULL_IDX]);
@@ -375,8 +375,8 @@ void *dma_pool_alloc(struct dma_pool *po
{
int i;
u8 *data = retval;
-   /* page->offset is stored in first 4 bytes */
-   for (i = sizeof(page->offset); i < pool->size; i++) {
+   /* page->dma_free_off is stored in first 4 bytes */
+   for (i = sizeof(page->dma_free_off); i < pool->size; i++) {
if (data[i] == POOL_POISON_FREED)
continue;
dev_err(pool->dev,
@@ -458,7 +458,7 @@ void dma_pool_free(struct dma_pool *pool
return;
}
{
-   unsigned int chain = page->offset;
+   unsigned int chain = page->dma_free_off;
while (chain < pool->allocation) {
if (chain != offset) {
chain = *(int *)(page->vaddr + chain);
@@ -474,12 +474,12 @@ void dma_pool_free(struct dma_pool *pool
memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
 
-   page->in_use--;
-   if (page->offset >= pool->allocation)
+   page->dma_in_use--;
+   if (page->dma_free_off >= pool->allocation)
/* Move page from the "full" list to the "available" list. */
		list_move(&page->dma_list, &pool->page_list[POOL_AVAIL_IDX]);
-   *(int *)vaddr = page->offset;
-   page->offset = offset;
+   *(int *)vaddr = page->dma_free_off;
+   page->dma_free_off = offset;
/*
 * Resist a temptation to do
 *if (!is_page_busy(page)) pool_free_page(pool, page);

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 2/9] dmapool: remove checks for dev == NULL

2018-11-12 Thread Tony Battersby
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device.  But nobody
ended up using dma pools without a device, so the current checks in
dmapool.c for pool->dev == NULL are both insufficient and causing bloat.
Remove them.

Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig 2018-08-03 16:12:23.0 -0400
+++ linux/mm/dmapool.c  2018-08-03 16:13:44.0 -0400
@@ -277,7 +277,7 @@ void dma_pool_destroy(struct dma_pool *p
	mutex_lock(&pools_reg_lock);
	mutex_lock(&pools_lock);
	list_del(&pool->pools);
-	if (pool->dev && list_empty(&pool->dev->dma_pools))
+	if (list_empty(&pool->dev->dma_pools))
		empty = true;
	mutex_unlock(&pools_lock);
if (empty)
@@ -289,13 +289,9 @@ void dma_pool_destroy(struct dma_pool *p
page = list_entry(pool->page_list.next,
  struct dma_page, page_list);
if (is_page_busy(page)) {
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_destroy %s, %p busy\n",
-   pool->name, page->vaddr);
-   else
-   pr_err("dma_pool_destroy %s, %p busy\n",
-  pool->name, page->vaddr);
+   dev_err(pool->dev,
+   "dma_pool_destroy %s, %p busy\n",
+   pool->name, page->vaddr);
/* leak the still-in-use consistent memory */
		list_del(&page->page_list);
kfree(page);
@@ -357,13 +353,9 @@ void *dma_pool_alloc(struct dma_pool *po
for (i = sizeof(page->offset); i < pool->size; i++) {
if (data[i] == POOL_POISON_FREED)
continue;
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_alloc %s, %p (corrupted)\n",
-   pool->name, retval);
-   else
-   pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-   pool->name, retval);
+   dev_err(pool->dev,
+   "dma_pool_alloc %s, %p (corrupted)\n",
+   pool->name, retval);
 
/*
 * Dump the first 4 bytes even if they are not
@@ -418,13 +410,9 @@ void dma_pool_free(struct dma_pool *pool
page = pool_find_page(pool, dma);
if (!page) {
		spin_unlock_irqrestore(&pool->lock, flags);
-   if (pool->dev)
-   dev_err(pool->dev,
-   "dma_pool_free %s, %p/%lx (bad dma)\n",
-   pool->name, vaddr, (unsigned long)dma);
-   else
-   pr_err("dma_pool_free %s, %p/%lx (bad dma)\n",
-  pool->name, vaddr, (unsigned long)dma);
+   dev_err(pool->dev,
+   "dma_pool_free %s, %p/%lx (bad dma)\n",
+   pool->name, vaddr, (unsigned long)dma);
return;
}
 
@@ -432,13 +420,9 @@ void dma_pool_free(struct dma_pool *pool
 #ifdef DMAPOOL_DEBUG
if ((dma - page->dma) != offset) {
		spin_unlock_irqrestore(&pool->lock, flags);
-		if (pool->dev)
-			dev_err(pool->dev,
-				"dma_pool_free %s, %p (bad vaddr)/%pad\n",
-				pool->name, vaddr, &dma);
-		else
-			pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n",
-			       pool->name, vaddr, &dma);
+		dev_err(pool->dev,
+			"dma_pool_free %s, %p (bad vaddr)/%pad\n",
+			pool->name, vaddr, &dma);
return;
}
{
@@ -449,12 +433,9 @@ void dma_pool_free(struct dma_pool *pool
continue;
}
			spin_unlock_irqrestore(&pool->lock, flags);
-			if (pool->dev)
-				dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n",
-					pool->name, &dma);
-			else
-				pr_err("dma_pool_free %s, dma %pad already free\n",
-				       pool->name, &dma);
+			dev_err(pool->dev,
+				"dma_pool_free %s, dma %pad already free\n",
+				pool->name, &dma);
return;
}
}

___
iommu mailing list
iommu@lists.linux-foundation.org

[PATCH v4 3/9] dmapool: cleanup dma_pool_destroy

2018-11-12 Thread Tony Battersby
Remove a small amount of code duplication between dma_pool_destroy() and
pool_free_page() in preparation for adding more code without having to
duplicate it.  No functional changes.

Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig 2018-08-02 09:59:15.0 -0400
+++ linux/mm/dmapool.c  2018-08-02 10:01:26.0 -0400
@@ -249,13 +249,22 @@ static inline bool is_page_busy(struct d
 
 static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
 {
+   void *vaddr = page->vaddr;
dma_addr_t dma = page->dma;
 
+	list_del(&page->page_list);
+
+   if (is_page_busy(page)) {
+   dev_err(pool->dev,
+   "dma_pool_destroy %s, %p busy\n",
+   pool->name, vaddr);
+   /* leak the still-in-use consistent memory */
+   } else {
 #ifdef DMAPOOL_DEBUG
-   memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+   memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-   dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
-   list_del(>page_list);
+   dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
+   }
kfree(page);
 }
 
@@ -269,6 +278,7 @@ static void pool_free_page(struct dma_po
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
+   struct dma_page *page;
bool empty = false;
 
if (unlikely(!pool))
@@ -284,19 +294,10 @@ void dma_pool_destroy(struct dma_pool *p
	device_remove_file(pool->dev, &dev_attr_pools);
	mutex_unlock(&pools_reg_lock);
 
-	while (!list_empty(&pool->page_list)) {
-   struct dma_page *page;
-   page = list_entry(pool->page_list.next,
- struct dma_page, page_list);
-   if (is_page_busy(page)) {
-   dev_err(pool->dev,
-   "dma_pool_destroy %s, %p busy\n",
-   pool->name, page->vaddr);
-   /* leak the still-in-use consistent memory */
-   list_del(>page_list);
-   kfree(page);
-   } else
-   pool_free_page(pool, page);
+	while ((page = list_first_entry_or_null(&pool->page_list,
+   struct dma_page,
+   page_list))) {
+   pool_free_page(pool, page);
}
 
kfree(pool);

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 1/9] dmapool: fix boundary comparison

2018-11-12 Thread Tony Battersby
Fix the boundary comparison when constructing the list of free blocks
for the case that 'size' is a power of two.  Since 'boundary' is also a
power of two, that would make 'boundary' a multiple of 'size', in which
case a single block would never cross the boundary.  This bug would
cause some of the allocated memory to be wasted (but not leaked).

Example:

size   = 512
boundary   = 2048
allocation = 4096

Address range
   0 -  511
 512 - 1023
1024 - 1535
1536 - 2047 *
2048 - 2559
2560 - 3071
3072 - 3583
3584 - 4095 *

Prior to this fix, the address ranges marked with "*" would not have
been used even though they didn't cross the given boundary.
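
Tracing the old comparison with the values above: at offset = 1024,
next = 1536 and next + size = 2048, so "(next + size) >= 2048" fired
and jumped straight to 2048, skipping 1536-2047 even though that block
ends exactly at the boundary without crossing it.  With ">" the jump
happens only for a genuine crossing.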

Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries")
Signed-off-by: Tony Battersby 
---

Even though I described this as a "fix", it does not seem important
enough to Cc: stable from a strict reading of the stable kernel rules. 
IOW, it is not "bothering" anyone.

--- linux/mm/dmapool.c.orig 2018-08-01 17:57:04.0 -0400
+++ linux/mm/dmapool.c  2018-08-01 17:57:16.0 -0400
@@ -210,7 +210,7 @@ static void pool_initialise_page(struct 
 
do {
unsigned int next = offset + pool->size;
-   if (unlikely((next + pool->size) >= next_boundary)) {
+   if (unlikely((next + pool->size) > next_boundary)) {
next = next_boundary;
next_boundary += pool->boundary;
}


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 0/9] mpt3sas and dmapool scalability

2018-11-12 Thread Tony Battersby
I posted v3 on August 7.  Nobody acked or merged the patches, and then
I got too busy with other stuff to repost until now.

The only change since v3:
*) Dropped patch #10 (the mpt3sas patch) since the mpt3sas maintainers
didn't show any interest.

I believe these patches are ready for merging.

---

drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation.  With a LSI/Broadcom SAS 9300-8i
12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the
equivalent of:

chain_dma_pool = dma_pool_create(size = 128);
for (i = 0; i < 373959; i++)
{
dma_addr[i] = dma_pool_alloc(chain_dma_pool);
}

And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:

for (i = 0; i < 373959; i++)
{
dma_pool_free(chain_dma_pool, dma_addr[i]);
}
dma_pool_destroy(chain_dma_pool);
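
(The loops above are the conceptual equivalent; the actual kernel API is
void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
dma_addr_t *handle) and void dma_pool_free(struct dma_pool *pool,
void *vaddr, dma_addr_t dma), so mpt3sas also keeps the returned
virtual addresses alongside the DMA handles.)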

With this usage, both dma_pool_alloc() and dma_pool_free() exhibit
O(n^2) complexity, although dma_pool_free() is much worse due to
implementation details.  On my system, the dma_pool_free() loop above
takes about 9 seconds to run.  Note that the problem was even worse
before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory
consumption in driver."), where the dma_pool_free() loop could take ~30
seconds.

mpt3sas also has some other DMA pools, but chain_dma_pool is the only
one with so many allocations:

cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment)
poolinfo - 0.1
reply_post_free_array pool  1  21 192 1
reply_free pool 1  1  41728   1
reply pool  1  1  1335296 1
sense pool  1  1  970272  1
chain pool  373959 386048 128 12064
reply_post_free pool12 12 166528  12

The patches in this series improve the scalability of the DMA pool
implementation, which significantly reduces the running time of the
DMA alloc/free loops.  With the patches applied, "modprobe mpt3sas",
"rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are
significantly faster.  Here are some benchmarks (of DMA alloc/free
only, not the entire modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:350 ms ( 1x)
  dmapool patches:  17 ms (21x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:8901 ms (   1x)
  dmapool patches:   15 ms ( 618x)


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 8/9] dmapool: improve accuracy of debug statistics

2018-11-12 Thread Tony Battersby
The "total number of blocks in pool" debug statistic currently does not
take the boundary value into account, so it diverges from the "total
number of blocks in use" statistic when a boundary is in effect.  Add a
calculation for the number of blocks per allocation that takes the
boundary into account, and use it to replace the inaccurate calculation.
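
As a worked example, using the values from the boundary fix in patch #1:
allocation = 4096, boundary = 2048, size = 512 gives
(4096 / 2048) * (2048 / 512) + (4096 % 2048) / 512 = 2 * 4 + 0 = 8
blocks per allocation, matching the eight usable ranges listed there.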

Signed-off-by: Tony Battersby 
---

This depends on patch #1 "dmapool: fix boundary comparison" for the
calculated blks_per_alloc value to be correct.

The added blks_per_alloc value will also be used in the next patch.

--- linux/mm/dmapool.c.orig 2018-08-06 17:48:54.0 -0400
+++ linux/mm/dmapool.c  2018-08-06 17:52:53.0 -0400
@@ -61,6 +61,7 @@ struct dma_pool { /* the pool */
struct device *dev;
unsigned int allocation;
unsigned int boundary;
+   unsigned int blks_per_alloc;
char name[32];
struct list_head pools;
 };
@@ -105,8 +106,7 @@ show_pools(struct device *dev, struct de
/* per-pool info, no real statistics yet */
temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 pool->name, blocks,
-(size_t) pages *
-(pool->allocation / pool->size),
+(size_t) pages * pool->blks_per_alloc,
 pool->size, pages);
size -= temp;
next += temp;
@@ -182,6 +182,9 @@ struct dma_pool *dma_pool_create(const c
retval->size = size;
retval->boundary = boundary;
retval->allocation = allocation;
+   retval->blks_per_alloc =
+   (allocation / boundary) * (boundary / size) +
+   (allocation % boundary) / size;
 
	INIT_LIST_HEAD(&retval->pools);
 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v4 9/9] dmapool: debug: prevent endless loop in case of corruption

2018-11-12 Thread Tony Battersby
Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy
driver corrupts DMA pool memory.
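
The check rests on a simple invariant: every block in a page is either
allocated or on the freelist, so while walking the freelist the running
count of free blocks plus page->dma_in_use can never exceed
pool->blks_per_alloc.  Exceeding it (or finishing the walk with the sum
short of it) proves the freelist is corrupt, possibly circular, so the
walk can bail out instead of looping forever.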

Signed-off-by: Tony Battersby 
---
--- linux/mm/dmapool.c.orig 2018-08-06 17:52:53.0 -0400
+++ linux/mm/dmapool.c  2018-08-06 17:53:31.0 -0400
@@ -454,17 +454,39 @@ void dma_pool_free(struct dma_pool *pool
{
void *page_vaddr = vaddr - offset;
unsigned int chain = page->dma_free_off;
+   unsigned int free_blks = 0;
+
while (chain < pool->allocation) {
-   if (chain != offset) {
-   chain = *(int *)(page_vaddr + chain);
-   continue;
+   if (unlikely(chain == offset)) {
+			spin_unlock_irqrestore(&pool->lock, flags);
+			dev_err(pool->dev,
+				"dma_pool_free %s, dma %pad already free\n",
+				pool->name, &dma);
+   return;
+   }
+
+   /*
+* A buggy driver could corrupt the freelist by
+* use-after-free, buffer overflow, etc.  Besides
+* checking for corruption, this also prevents an
+* endless loop in case corruption causes a circular
+* loop in the freelist.
+*/
+   if (unlikely(++free_blks + page->dma_in_use >
+			     pool->blks_per_alloc)) {
+ freelist_corrupt:
+			spin_unlock_irqrestore(&pool->lock, flags);
+			dev_err(pool->dev,
+				"dma_pool_free %s, freelist corrupted\n",
+				pool->name);
+   return;
}
-		spin_unlock_irqrestore(&pool->lock, flags);
-		dev_err(pool->dev,
-			"dma_pool_free %s, dma %pad already free\n",
-			pool->name, &dma);
-   return;
+
+   chain = *(int *)(page_vaddr + chain);
}
+   if (unlikely(free_blks + page->dma_in_use !=
+		     pool->blks_per_alloc))
+   goto freelist_corrupt;
}
memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/6] Auxiliary IOMMU domains and Arm SMMUv3

2018-11-12 Thread j...@8bytes.org
Hi Jean-Philippe,

On Thu, Nov 08, 2018 at 06:29:42PM +, Jean-Philippe Brucker wrote:
> (1) My initial approach would have been to use the same page tables for
> the default_domain and this new domain, but it might be precisely what
> you were trying to avoid with this new proposal: a device attached to
> two domains at the same time.

My request was about the initial assumptions a device driver can make
about a device. The assumption is that the DMA-API is set up and
initialized for it; for everything else (like SVA) the device driver
needs to take special action, like allocating an SVA domain and attaching
the device to it.

This should of course not break any IRQ setup for the device, and
enforcing an ordering is not a good, maintainable solution either.

So what you could do here is to either:

1) Save the needed IRQ mappings in some extra datastructure and
   duplicate it in the SVA domain. This also makes it easier to
   use the same SVA domain with multiple devices.

2) Just re-use the 'page-tables' from the default domain in the
   sva-domain. This needs to happen at attach-time, because at
   allocation time you don't know the device yet.

I think 1) is the best solution, what do you think?

Btw, things would be different if we could expose SVA through the
DMA-API to drivers. In this case we could just make the default domain
of type SVA and be done, but we are not there yet.

Regards,

Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 2/6] drivers core: Add I/O ASID allocator

2018-11-12 Thread Joerg Roedel
Hi Jean-Philippe,

On Fri, Oct 19, 2018 at 07:11:54PM +0100, Jean-Philippe Brucker wrote:
> The allocator doesn't really belong in drivers/iommu because some
> drivers would like to allocate PASIDs for devices that aren't managed by
> an IOMMU, using the same ID space as IOMMU. It doesn't really belong in
> drivers/pci either since platform device also support PASID. Add the
> allocator in drivers/base.

I don't really buy this argument; in the end it is the IOMMU that
enforces the PASID mappings for a device. Why should a device that is
not behind an IOMMU need to allocate a PASID? Do you have examples of
such devices which also don't have their own IOMMU built in?

> +int ioasid_for_each(struct ioasid_set *set, ioasid_iter_t func, void *data)
> +{
> + int ret;
> + struct ioasid_iter_data iter_data = {
> + .set= set,
> + .func   = func,
> + .data   = data,
> + };
> +
> +	idr_lock(&ioasid_idr);
> +	ret = idr_for_each(&ioasid_idr, ioasid_iter, &iter_data);
> +	idr_unlock(&ioasid_idr);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_for_each);

What is the intended use-case for this function?

> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + void *priv = NULL;
> + struct ioasid_data *ioasid_data;
> +
> +	idr_lock(&ioasid_idr);
> +	ioasid_data = idr_find(&ioasid_idr, ioasid);
> +	if (ioasid_data && ioasid_data->set == set)
> +		priv = ioasid_data->private;
> +	idr_unlock(&ioasid_idr);
> +
> + return priv;
> +}

I think using the idr_lock here will become a performance problem, as we
need this function in the SVA page-fault path to retrieve the mm_struct
belonging to a PASID.

The head comment in lib/idr.c states that it should be safe to use
rcu_read_lock() on read-only iterations. If so, this should be used here
so that concurrent lookups are possible.
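
A minimal sketch of that variant (assuming ioasid_data is freed with
kfree_rcu() so a concurrent reader never dereferences freed memory):

	rcu_read_lock();
	ioasid_data = idr_find(&ioasid_idr, ioasid);
	if (ioasid_data && ioasid_data->set == set)
		priv = ioasid_data->private;
	rcu_read_unlock();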

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] amd/iommu: Fix Guest Virtual APIC Log Tail Address Register

2018-11-12 Thread j...@8bytes.org
On Mon, Nov 12, 2018 at 12:26:30PM +, Suthikulpanit, Suravee wrote:
> From: Filippo Sironi 
> 
> This register should have been programmed with the physical address
> of the memory location containing the shadow tail pointer for
> the guest virtual APIC log instead of the base address.
> 
> Fixes: 8bda0cfbdc1a  ('iommu/amd: Detect and initialize guest vAPIC log')
> Signed-off-by: Filippo Sironi 
> Signed-off-by: Wei Wang 
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  drivers/iommu/amd_iommu_init.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Applied to iommu/fixes, thanks.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 1/1] iommu/vtd: Cleanup dma_remapping.h header

2018-11-12 Thread Joerg Roedel
On Mon, Nov 12, 2018 at 02:40:08PM +0800, Lu Baolu wrote:
>  arch/x86/kernel/tboot.c                    |  2 +-
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  2 +-
>  drivers/gpu/drm/i915/intel_display.c       |  2 +-
>  drivers/gpu/drm/vmwgfx/vmwgfx_drv.c        |  2 +-
>  drivers/misc/mic/scif/scif_rma.c           |  2 +-
>  drivers/misc/mic/scif/scif_rma.h           |  2 +-
>  include/linux/dma_remapping.h              | 58 --
>  include/linux/intel-iommu.h                | 49 +-
>  8 files changed, 53 insertions(+), 66 deletions(-)
>  delete mode 100644 include/linux/dma_remapping.h

Applied, thanks.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] amd/iommu: Fix Guest Virtual APIC Log Tail Address Register

2018-11-12 Thread Suthikulpanit, Suravee
From: Filippo Sironi 

This register should have been programmed with the physical address
of the memory location containing the shadow tail pointer for
the guest virtual APIC log instead of the base address.

Fixes: 8bda0cfbdc1a  ('iommu/amd: Detect and initialize guest vAPIC log')
Signed-off-by: Filippo Sironi 
Signed-off-by: Wei Wang 
Signed-off-by: Suravee Suthikulpanit 
---
 drivers/iommu/amd_iommu_init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c
index 9c3d610e1e19..e777fa90b2c2 100644
--- a/drivers/iommu/amd_iommu_init.c
+++ b/drivers/iommu/amd_iommu_init.c
@@ -797,7 +797,8 @@ static int iommu_init_ga_log(struct amd_iommu *iommu)
entry = iommu_virt_to_phys(iommu->ga_log) | GA_LOG_SIZE_512;
	memcpy_toio(iommu->mmio_base + MMIO_GA_LOG_BASE_OFFSET,
		    &entry, sizeof(entry));
-	entry = (iommu_virt_to_phys(iommu->ga_log) & 0xFFFFFFFFFFFFFULL) & ~7ULL;
+	entry = (iommu_virt_to_phys(iommu->ga_log_tail) &
+		 (BIT_ULL(52)-1)) & ~7ULL;
	memcpy_toio(iommu->mmio_base + MMIO_GA_LOG_TAIL_OFFSET,
		    &entry, sizeof(entry));
writel(0x00, iommu->mmio_base + MMIO_GA_HEAD_OFFSET);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] mm: Replace all open encodings for NUMA_NO_NODE

2018-11-12 Thread Anshuman Khandual
On 11/12/2018 02:13 PM, Hans Verkuil wrote:
> On 11/12/2018 03:41 AM, Anshuman Khandual wrote:
>> At present there are multiple places where invalid node number is encoded
>> as -1. Even though implicitly understood it is always better to have macros
>> in there. Replace these open encodings for an invalid node number with the
>> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
>> 'invalid node' from various places redirecting them to a common definition.
>>
>> Signed-off-by: Anshuman Khandual 
>> ---
>> Build tested this with multiple cross compiler options like alpha, sparc,
>> arm64, x86, powerpc64le etc with their default config which might not have
>> compiled tested all driver related changes. I will appreciate folks giving
>> this a test in their respective build environment.
>>
>> All these places for replacement were found by running the following grep
>> patterns on the entire kernel code. Please let me know if this might have
>> missed some instances. This might also have replaced some false positives.
>> I will appreciate suggestions, inputs and review.
> The 'node' in the drivers/media and the drivers/video sources has nothing to
> do with numa. It's an index for a framebuffer instead (i.e. the X in 
> /dev/fbX).

Thanks for the input. Will drop the changes there.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] iommu/vt-d: Group and domain relationship

2018-11-12 Thread James Sewart via iommu
Hey Jacob,

> On 9 Nov 2018, at 19:09, Jacob Pan  wrote:
> 
> On Thu, 8 Nov 2018 11:30:04 +
> James Sewart <jamessew...@arista.com> wrote:
> 
>> Hey,
>> 
>>> On 8 Nov 2018, at 01:42, Lu Baolu  wrote:
>>> 
>>> Hi,
>>> 
>>> On 11/8/18 1:55 AM, James Sewart wrote:  
 Hey,  
> On 7 Nov 2018, at 02:10, Lu Baolu 
> wrote:
> 
> Hi,
> 
> On 11/6/18 6:40 PM, James Sewart wrote:  
>> Hey Lu,
>> Would you be able to go into more detail about the issues with
>> allowing IOMMU_DOMAIN_DMA to be allocated via domain_alloc?  
> 
> This door is closed because intel iommu driver does everything for
> IOMMU_DOMAIN_DMA: allocating a domain and setup the context
> entries for the domain.  
 As far as I can tell, attach_device in the intel driver will handle
 cleaning up any old domain context mapping and ensure the new
 domain is mapped with calls to dmar_remove_one_dev_info and
 domain_add_dev_info.  
>>> 
>>> That's only for domains of IOMMU_DOMAIN_UNMANAGED, right?  
>> 
>> attach_device has logic for cleaning up domains that are allocated by
>> the intel driver.
>> 
>>  old_domain = find_domain(dev);
>>  if (old_domain) {
>>  rcu_read_lock();
>>  dmar_remove_one_dev_info(old_domain, dev);
>>  rcu_read_unlock();
>> 
>>  if (!domain_type_is_vm_or_si(old_domain) &&
>>      list_empty(&old_domain->devices))
>>  domain_exit(old_domain);
>>  }
>> 
>> This is checking the type of the old domain only, freeing it if it is
>> not attached to any devices.  Looking at this now, maybe the solution
>> would be to distinguish between internally allocated dma domains and
>> dma domains allocated via the external api, so we avoid freeing a
>> domain a driver still has a reference to.
>> 
> I think this can also be solved by adopting default domain and moving
> internal domain to the generic code at group level. i.e. when device is
> added to a group, the group gets a default domain (also set as current
> domain) for DMA APIs. When device is attached to an unmanaged domain
> (VFIO container and IOMMU API), it will switch the current domain to
> the new domain. Old domain will be inactive but not exited.
> 
> Why do we want to open this door? Probably we want the generic
> iommu layer to handle these things (it's called default domain).  
 I’d like to allocate a domain and attach it to multiple devices in
 a group/multiple groups so that they share address translation,
 but still allow drivers for devices in those groups to use the
 dma_map_ops api.  
>>> 
>>> Just out of curiosity, why do you want to share a single domain
>>> across multiple groups? By default, the groups and DMA domains are
>>> normally 1-1 mapped, right?  
>> 
>> Currently we see each device in a group with its own domain.
>> find_or_alloc_domain looks at dma aliases to determine who shares
>> domains whereas pci_device_group in the generic iommu code determines
>> groups using a few other checks. We have observed internally that
>> devices under a pcie switch will be put in the same group but they do
>> not share a domain within that group.
>> 
>> Getting every device within a group to share a domain would get us
>> 90% of the way to what we want. But we have some configurations where
>> there exist devices put in other groups that we want to share
>> translations with.
>> 
> This is indeed a huge disconnect between the IOMMU API and VT-d's
> handling of the DMA API. The IOMMU API operates at group granularity
> (per-group domain) based on ACS, whereas the DMA API in VT-d uses
> per-device domains and sharing is not based on ACS. So I agree you
> could have multiple domains under the same group for DMA API use.
> 
> I am working on a patch to align the two, e.g. use per group
> default domain for DMA APIs. My idea is as follows:
> - support DOMAIN_DMA type in intel_iommu_domain_alloc()
> - adding a device to a group will attach it to the default DMA domain
>   - have to inherit any RMRR or identity mapping holes set up
> prior to IOMMU group setup.
>   - Or perhaps implement apply_resv_region() in VT-d driver and
> move RMRR setup here.
> - DMA map API, will use per group default domain instead of per device
>  private domain
> 
> The change is rather pervasive, so I am trying to strike a balance
> between short-term functionality and a complete cleanup. Any ideas
> welcome.
> 

Sounds good, thanks. I am eager to test the patch.

>>> 
> So we can't just open the door but not cleanup the things right?  
 A user of domain_alloc and attach_device is responsible for
 detaching a domain if it is no longer needed and calling
 domain_free.  
>>> 
>>> Currently DMA API calls get_valid_domain_for_dev() to retrieve a DMA
>>> domain. If the domain has already been allocated, return directly.
>>> Otherwise, allocate and 

Re: [RFC] mm: Replace all open encodings for NUMA_NO_NODE

2018-11-12 Thread Hans Verkuil
On 11/12/2018 03:41 AM, Anshuman Khandual wrote:
> At present there are multiple places where invalid node number is encoded
> as -1. Even though implicitly understood it is always better to have macros
> in there. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
> 'invalid node' from various places redirecting them to a common definition.
> 
> Signed-off-by: Anshuman Khandual 
> ---
> Build tested this with multiple cross compiler options like alpha, sparc,
> arm64, x86, powerpc64le etc with their default config which might not have
> compiled tested all driver related changes. I will appreciate folks giving
> this a test in their respective build environment.
> 
> All these places for replacement were found by running the following grep
> patterns on the entire kernel code. Please let me know if this might have
> missed some instances. This might also have replaced some false positives.
> I will appreciate suggestions, inputs and review.

The 'node' in the drivers/media and the drivers/video sources has nothing to
do with numa. It's an index for a framebuffer instead (i.e. the X in /dev/fbX).

Regards,

Hans

> 
> 1. git grep "nid == -1"
> 2. git grep "node == -1"
> 3. git grep "nid = -1"
> 4. git grep "node = -1"
> 
>  arch/alpha/include/asm/topology.h |  2 +-
>  arch/ia64/kernel/numa.c   |  2 +-
>  arch/ia64/mm/discontig.c  |  6 +++---
>  arch/ia64/sn/kernel/io_common.c   |  2 +-
>  arch/powerpc/include/asm/pci-bridge.h |  2 +-
>  arch/powerpc/kernel/paca.c|  2 +-
>  arch/powerpc/kernel/pci-common.c  |  2 +-
>  arch/powerpc/mm/numa.c| 14 +++---
>  arch/powerpc/platforms/powernv/memtrace.c |  4 ++--
>  arch/sparc/kernel/auxio_32.c  |  2 +-
>  arch/sparc/kernel/pci_fire.c  |  2 +-
>  arch/sparc/kernel/pci_schizo.c|  2 +-
>  arch/sparc/kernel/pcic.c  |  6 +++---
>  arch/sparc/kernel/psycho_common.c |  2 +-
>  arch/sparc/kernel/sbus.c  |  2 +-
>  arch/sparc/mm/init_64.c   |  6 +++---
>  arch/sparc/prom/init_32.c |  2 +-
>  arch/sparc/prom/init_64.c |  4 ++--
>  arch/sparc/prom/tree_32.c | 12 ++--
>  arch/sparc/prom/tree_64.c | 18 +-
>  arch/x86/include/asm/pci.h|  2 +-
>  arch/x86/kernel/apic/x2apic_uv_x.c|  6 +++---
>  arch/x86/kernel/smpboot.c |  2 +-
>  arch/x86/platform/olpc/olpc_dt.c  | 16 
>  drivers/block/mtip32xx/mtip32xx.c |  4 ++--
>  drivers/dma/dmaengine.c   |  3 ++-
>  drivers/infiniband/hw/hfi1/affinity.c |  2 +-
>  drivers/infiniband/hw/hfi1/init.c |  2 +-
>  drivers/iommu/dmar.c  |  4 ++--
>  drivers/iommu/intel-iommu.c   |  2 +-
>  drivers/media/pci/ivtv/ivtvfb.c   |  2 +-
>  drivers/media/platform/vivid/vivid-osd.c  |  2 +-
>  drivers/misc/sgi-xp/xpc_uv.c  |  2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  4 ++--
>  drivers/video/fbdev/mmp/fb/mmpfb.c|  2 +-
>  drivers/video/fbdev/pxa168fb.c|  2 +-
>  drivers/video/fbdev/w100fb.c  |  2 +-
>  fs/ocfs2/dlm/dlmcommon.h  |  2 +-
>  fs/ocfs2/dlm/dlmdomain.c  | 10 +-
>  fs/ocfs2/dlm/dlmmaster.c  |  2 +-
>  fs/ocfs2/dlm/dlmrecovery.c|  2 +-
>  fs/ocfs2/stack_user.c |  6 +++---
>  init/init_task.c  |  2 +-
>  kernel/kthread.c  |  2 +-
>  kernel/sched/fair.c   | 15 ---
>  lib/cpumask.c |  2 +-
>  mm/huge_memory.c  | 12 ++--
>  mm/hugetlb.c  |  2 +-
>  mm/ksm.c  |  2 +-
>  mm/memory.c   |  6 +++---
>  mm/memory_hotplug.c   | 12 ++--
>  mm/mempolicy.c|  2 +-
>  mm/page_alloc.c   |  4 ++--
>  mm/page_ext.c |  2 +-
>  net/core/pktgen.c |  2 +-
>  net/qrtr/qrtr.c   |  2 +-
>  tools/perf/bench/numa.c   |  6 +++---
>  57 files changed, 125 insertions(+), 123 deletions(-)
> 
> diff --git a/arch/alpha/include/asm/topology.h 
> b/arch/alpha/include/asm/topology.h
> index e6e13a8..f6dc89c 100644
> --- a/arch/alpha/include/asm/topology.h
> +++ b/arch/alpha/include/asm/topology.h
> @@ -29,7