Re: [PATCH 1/3] x86/Hyper-V: Set x2apic destination mode to physical when x2apic is available

2019-01-31 Thread Dan Carpenter
On Thu, Jan 31, 2019 at 06:17:31PM +0800, lantianyu1...@gmail.com wrote:
>
>

This comment needs to be indented one tab or it looks like we're outside
the function.

> +/*
> + * Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic,
> + * set x2apic destination mode to physcial mode when x2apic is available
> + * and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs
> + * have 8-bit APIC id.
> + */
> +# if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> + if (x2apic_supported())
> + x2apic_phys = 1;
> +# endif

The IS_ENABLED() macro is really magical.  You could write this like so:

if (IS_ENABLED(CONFIG_HYPERV_IOMMU) && x2apic_supported())
x2apic_phys = 1;

It works the same and is slightly more pleasant to look at.

regards,
dan carpenter



Re: fix a layering violation in videobuf2 and improve dma_map_resource v2

2019-01-31 Thread Marek Szyprowski
Hi All,

On 2019-01-18 12:37, Christoph Hellwig wrote:
> Hi all,
>
> this series fixes a rather gross layering violation in videobuf2, which
> pokes into arm DMA mapping internals to get a DMA address for memory that
> does not have a page structure, and to do so fixes up the dma_map_resource
> implementation to not provide a somewhat dangerous default and improve
> the error handling.
>
> Changes since v1:
>  - don't apply bus offsets in dma_direct_map_resource

Works fine on older Exynos based boards with IOMMU and CMA disabled.

Tested-by: Marek Szyprowski 

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



Re: [PATCH 2/3] HYPERV/IOMMU: Add Hyper-V stub IOMMU driver

2019-01-31 Thread Tianyu Lan
Hi Vitaly:
Thanks for your review.

On Thu, Jan 31, 2019 at 10:04 PM Vitaly Kuznetsov  wrote:
>
> lantianyu1...@gmail.com writes:
>
> > From: Lan Tianyu 
> >
> > On the bare metal, enabling X2APIC mode requires interrupt remapping
> > function which helps to deliver irq to cpu with 32-bit APIC ID.
> > Hyper-V doesn't provide interrupt remapping function so far and Hyper-V
> > MSI protocol already supports to deliver interrupt to the CPU whose
> > virtual processor index is more than 255. IO-APIC interrupt still has
> > 8-bit APIC ID limitation.
> >
> > This patch is to add Hyper-V stub IOMMU driver in order to enable
> > X2APIC mode successfully in Hyper-V Linux guest. The driver returns X2APIC
> > interrupt remapping capability when X2APIC mode is available. Otherwise,
> > it creates a Hyper-V irq domain to limit IO-APIC interrupts' affinity
> > and make sure cpus assigned with IO-APIC interrupt have 8-bit APIC ID.
> >
> > Signed-off-by: Lan Tianyu 
> > ---
> >  drivers/iommu/Kconfig |   7 ++
> >  drivers/iommu/Makefile|   1 +
> >  drivers/iommu/hyperv-iommu.c  | 189 
> > ++
> >  drivers/iommu/irq_remapping.c |   3 +
> >  drivers/iommu/irq_remapping.h |   1 +
> >  5 files changed, 201 insertions(+)
> >  create mode 100644 drivers/iommu/hyperv-iommu.c
> >
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 45d7021..5c397c0 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -437,4 +437,11 @@ config QCOM_IOMMU
> >   help
> > Support for IOMMU on certain Qualcomm SoCs.
> >
> > +config HYPERV_IOMMU
> > + bool "Hyper-V stub IOMMU support"
> > + depends on HYPERV
> > + help
> > + Hyper-V stub IOMMU driver provides capability to run
> > + Linux guest with X2APIC mode enabled.
> > +
> >  endif # IOMMU_SUPPORT
> > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > index a158a68..8c71a15 100644
> > --- a/drivers/iommu/Makefile
> > +++ b/drivers/iommu/Makefile
> > @@ -32,3 +32,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> >  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> >  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> >  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
> > +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> > diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> > new file mode 100644
> > index 000..a64b747
> > --- /dev/null
> > +++ b/drivers/iommu/hyperv-iommu.c
> > @@ -0,0 +1,189 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#define pr_fmt(fmt) "HYPERV-IR: " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "irq_remapping.h"
> > +
> > +/*
> > + * According IO-APIC spec, IO APIC has a 24-entry Interrupt
> > + * Redirection Table.
> > + */
> > +#define IOAPIC_REMAPPING_ENTRY 24
>
> KVM already defines KVM_IOAPIC_NUM_PINS - is this the same thing?

It serves the same purpose, but the IOMMU driver is out of KVM's scope, so a
new define is added here. It may also be changed in the future when the
interrupt remapping function is added.

>
> > +
> > +static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
> > +struct irq_domain *ioapic_ir_domain;
> > +
> > +static int hyperv_ir_set_affinity(struct irq_data *data,
> > + const struct cpumask *mask, bool force)
> > +{
> > + struct irq_data *parent = data->parent_data;
> > + struct irq_cfg *cfg = irqd_cfg(data);
> > + struct IO_APIC_route_entry *entry;
> > + cpumask_t cpumask;
> > + int ret;
> > +
> > + cpumask_andnot(&cpumask, mask, &ioapic_max_cpumask);
> > +
> > + /* Return error If new irq affinity is out of ioapic_max_cpumask. */
> > + if (!cpumask_empty(&cpumask))
> > + return -EINVAL;
> > +
> > + ret = parent->chip->irq_set_affinity(parent, mask, force);
> > + if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
> > + return ret;
> > +
> > + entry = data->chip_data;
> > + entry->dest = cfg->dest_apicid;
> > + entry->vector = cfg->vector;
> > + send_cleanup_vector(cfg);
> > +
> > + return 0;
> > +}
> > +
> > +static struct irq_chip hyperv_ir_chip = {
> > + .name   = "HYPERV-IR",
> > + .irq_ack= apic_ack_irq,
> > + .irq_set_affinity   = hyperv_ir_set_affinity,
> > +};
> > +
> > +static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
> > +  unsigned int virq, unsigned int nr_irqs,
> > +  void *arg)
> > +{
> > + struct irq_alloc_info *info = arg;
> > + struct IO_APIC_route_entry *entry;
> > + struct irq_data *irq_data;
> > + struct irq_desc *desc;
> > + struct irq_cfg *cfg;
> > + int ret = 0;
> > +
> > + if (!info || info->type != X86_IRQ_ALLOC_TYPE_IOAPIC || nr_irqs > 1)
> > + return -EINVAL;
> > +
> > + ret = 

Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier

2019-01-31 Thread Peter Xu
On Thu, Jan 31, 2019 at 12:25:40PM +, Jean-Philippe Brucker wrote:
> On 31/01/2019 07:59, Peter Xu wrote:
> > On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote:
> >> Hi Peter,
> > 
> > Hi, Jean,
> > 
> >>
> >> On 30/01/2019 05:57, Peter Xu wrote:
> >>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> >>> but that's actually already covered by invalidate_range().  Remove the
> >>> extra notifier and the chunks.
> >>
> >> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> >> I understood correctly, when doing reclaim the kernel clears the young
> >> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> >> new accesses from devices will go through the IOMMU, set the young bit
> >> again and the kernel can accurately track page use. I didn't see
> >> invalidate_range() being called by rmap or vmscan in this case, but
> >> might just be missing it.
> >>
> >> Two MMU notifiers are used for the young bit, clear_flush_young() and
> >> clear_young(). I believe the former should invalidate ATCs so that DMA
> >> accesses participate in PTE aging. Otherwise the kernel can't know that
> >> the device is still using entries marked 'old'. The latter,
> >> clear_young() is exempted from invalidating the secondary TLBs by
> >> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
> > 
> > Yes I agree most of your analysis, but IMHO the problem is that the
> > AMD IOMMU is not really participating in the PTE aging after all.
> > Please have a look at mn_clear_flush_young() below at [1] - it's
> > always returning zero, no matter whether the page has been accessed by
> > device or not.  A real user of it could be someone like KVM (please
> > see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
> > shadow PTEs and do test-and-clear on that, then return the result to
> > the core mm.  That's why I think currently the AMD driver was only
> > trying to use that as a way to do an extra flush however IMHO it's
> > redundant.
> 
> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
> flush, since the level-1 page table is shared with the CPU. But removing
> the flush gets you in the following situation:
> 
>   (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
> young and the entry is cached in the ATC.
> 
>   (2) The CPU does ptep_clear_flush_young_notify(), clears young,
> notices that the page is being used.
> 
>   (3) Device accesses $addr again. Given that we didn't invalidate the
> ATC in (2) it accesses the page directly, without going through the IOMMU.
> 
>   (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
> have the young bit, which means the page isn't being used and can be
> reclaimed.
> 
> That's not accurate since the page is being used by the device. At step
> (2) we should invalidate the ATC, so that (3) fetches the PTE again and
> marks it young.
> 
> I can agree that the clear_flush_young() notifier is too brutal for this
> purpose, since we send spurious ATC invalidation even when the PTE
> wasn't young (and ATC inv is expensive). There should be another MMU
> notifier "flush_young()" that is called by
> ptep_clear_flush_young_notify() only when the page was actually young.
> But for the moment it's the best we have to avoid the situation above.
> 
> I don't know enough about mm to understand exactly how the
> page_referenced() information is used, but I believe the problem is only
> about accuracy and not correctness. So applying this patch might not
> break anything (after all, intel-svm.c never implemented the notifier)
> but I think I'll keep the notifier in my SVA rework [1].

I see your point.  I'm not an mm expert either, but I'd say AFAIU
this happens in CPU TLB too.  Please have a look at
ptep_clear_flush_young():

int ptep_clear_flush_young(struct vm_area_struct *vma,
   unsigned long address, pte_t *ptep)
{
/*
 * On x86 CPUs, clearing the accessed bit without a TLB flush
 * doesn't cause data corruption. [ It could cause incorrect
 * page aging and the (mistaken) reclaim of hot pages, but the
 * chance of that should be relatively low. ]
 *
 * So as a performance optimization don't flush the TLB when
 * clearing the accessed bit, it will eventually be flushed by
 * a context switch or a VM operation anyway. [ In the rare
 * event of it not getting flushed for a long time the delay
 * shouldn't really matter because there's no real memory
 * pressure for swapout to react to. ]
 */
return ptep_test_and_clear_young(vma, address, ptep);
}

So maybe it is a tradeoff between performance and accuracy.  IMHO the
IOMMU cache flushing might affect the performance even more than CPU
TLB flushing if the invalidation command takes a long time to run
(e.g., amd_iommu_flush_page is far slower than a 

Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 4:48 p.m., Dave Jiang wrote:
> 
> On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>>> maybe take a look at the code that starts from there and see if it would
>>> have any impact on your stuff.
>> Ok, well on my system I can write to the smp_affinity all day and the
>> MSI interrupts still work fine.
> 
> Maybe your code is ok then. If the stats show up in /proc/interrupts 
> then you can see it moving to different cores.

Yes, I did check that the stats change CPU in proc interrupts.

> Yeah I'm not sure what to do about it either as I'm not super familiar 
> with that area either. Just making note of what I encountered. And you 
> are right, the updated info has to go over NTB for the other side to 
> write to the updated place. So there's a lot of latency involved.

Ok, well I'll implement the callback anyway for v2. Better safe than
sorry. We can operate on the assumption that someone thought of the race
condition and if we ever see reports of lost interrupts we'll know where
to look.

Logan


Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:


On 2019-01-31 3:46 p.m., Dave Jiang wrote:

I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
maybe take a look at the code that starts from there and see if it would
have any impact on your stuff.

Ok, well on my system I can write to the smp_affinity all day and the
MSI interrupts still work fine.


Maybe your code is ok then. If the stats show up in /proc/interrupts 
then you can see it moving to different cores.



The MSI code is a bit difficult to trace and audit with all the
different chips and the parent chips which I don't have a good
understanding of. But I can definitely see that it could be possible for
some chips to change the address as smp_affinity will eventually
sometimes call msi_domain_set_affinity() which does seem to recompose
the message and write it back to the chip.

So, I could relatively easily add a callback to msi_desc to catch this
and resend the MSI address/data. However, I'm not sure how this is ever
done atomically. It seems like there would be a race while the device
updates its address where old interrupts could be triggered. This race
would be much longer for us when sending this information over the NTB
link. Though, I guess if the only change is that it encodes CPU
information in the address then that would not be an issue. However, I'm
not sure I can say that for certain without a comprehensive
understanding of all the IRQ chips.

Any thoughts on this?


Yeah I'm not sure what to do about it either as I'm not super familiar 
with that area either. Just making note of what I encountered. And you 
are right, the updated info has to go over NTB for the other side to 
write to the updated place. So there's a lot of latency involved.






Logan



Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 3:46 p.m., Dave Jiang wrote:
> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
> maybe take a look at the code that starts from there and see if it would 
> have any impact on your stuff.

Ok, well on my system I can write to the smp_affinity all day and the
MSI interrupts still work fine.

The MSI code is a bit difficult to trace and audit with all the
different chips and the parent chips which I don't have a good
understanding of. But I can definitely see that it could be possible for
some chips to change the address as smp_affinity will eventually
sometimes call msi_domain_set_affinity() which does seem to recompose
the message and write it back to the chip.

So, I could relatively easily add a callback to msi_desc to catch this
and resend the MSI address/data. However, I'm not sure how this is ever
done atomically. It seems like there would be a race while the device
updates its address where old interrupts could be triggered. This race
would be much longer for us when sending this information over the NTB
link. Though, I guess if the only change is that it encodes CPU
information in the address then that would not be an issue. However, I'm
not sure I can say that for certain without a comprehensive
understanding of all the IRQ chips.
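
[Illustrative sketch, not part of the original mail: one way to catch such a
change is the existing irq_set_affinity_notifier() interface from
<linux/interrupt.h>, which runs a notifier after an interrupt's affinity (and
hence possibly its MSI message) has been rewritten. ntb_resend_msi_desc() is
a hypothetical helper that would push the updated address/data across the
NTB link.]

#include <linux/interrupt.h>
#include <linux/irq.h>
#include <linux/msi.h>

/* Hypothetical helper: push desc->msg (address/data) across the NTB link. */
static void ntb_resend_msi_desc(struct msi_desc *desc)
{
}

static void ntb_msi_affinity_changed(struct irq_affinity_notify *notify,
                                     const cpumask_t *mask)
{
        /* Runs from a workqueue after the affinity of notify->irq changed. */
        struct msi_desc *desc = irq_get_msi_desc(notify->irq);

        if (desc)
                ntb_resend_msi_desc(desc);
}

static void ntb_msi_affinity_release(struct kref *ref)
{
        /* Nothing dynamically allocated in this sketch. */
}

static struct irq_affinity_notify ntb_msi_notify = {
        .notify  = ntb_msi_affinity_changed,
        .release = ntb_msi_affinity_release,
};

/* Registered once per allocated vector:
 *      irq_set_affinity_notifier(irq, &ntb_msi_notify);
 */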

Any thoughts on this?

Logan


Re: [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 3:39 p.m., Bjorn Helgaas wrote:
> I assume you'll merge this along with the rest of the series, so:
> 
> Acked-by: Bjorn Helgaas 

Thanks!

>> diff --git a/include/linux/msi.h b/include/linux/msi.h
>> index 784fb52b9900..6458ab049852 100644
>> --- a/include/linux/msi.h
>> +++ b/include/linux/msi.h
>> @@ -88,6 +88,7 @@ struct msi_desc {
>>  __u8multi_cap   : 3;
>>  __u8maskbit : 1;
>>  __u8is_64   : 1;
>> +__u8is_virtual  : 1;
> 
> You did the right thing by using the same style as what's already
> here, but does anybody know why are we using __u8 and __u16 here?
> 
> Those typedefs are in include/uapi/asm-generic/int-l64.h, which
> suggests they're for things exported to user space, but I don't think
> that's the case here, so I'm wondering if we could someday replace
> these with u8 and u16.  Obviously that wouldn't be part of *this*
> series.

Yes, I was also confused by this. But I always follow the "when-in-rome"
rule. My understanding is the same as yours: that __u8 should be used
for userspace compatibility, which doesn't apply here. If there is
consensus on this being wrong, I'd be happy to write a cleanup patch
that fixes it separately from this series.
>> +/*
>> + * Virtual interrupts allow for more interrupts to be allocated
>> + * than the device has interrupts for. These are not programmed
>> + * into the devices MSI-X table and must be handled by some
> 
> s/devices/device's/

Fixed for when I send v2.

Logan


Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 3:39 PM, Logan Gunthorpe wrote:


On 2019-01-31 1:58 p.m., Dave Jiang wrote:

On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:

On 2019-01-31 1:20 p.m., Dave Jiang wrote:

Does this work when the system moves the MSI vector either via software
(irqbalance) or BIOS APIC programming (some modes cause round robin
behavior)?

I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming, however I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts, it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).

Logan

Right. I did that as a hack a while back for some silicon errata
workaround. When the vector moves, the address for the LAPIC changes. So
unless it gets updated, you end up writing to the old location and lose
all the new interrupts. irqbalance is a user daemon that rotates the
system interrupts around to ensure that not all interrupts are pinned on
a single core.

Yes, that would be a problem if something changes the MSI vectors out
from under us. Seems like that would be a bit difficult to do even with
regular hardware. So far I haven't seen anything that would do that. If
you know of where in the kernel this happens I'd be interested in
getting a pointer to the flow in the code. If that is the case this MSI
stuff will need to get much more complicated...


I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
maybe take a look at the code that starts from there and see if it would 
have any impact on your stuff.






I think it's enabled by default on several distros.
Although MSI-X has nothing to do with the IOAPIC, the mode in which the APIC
is programmed can have an influence on how the interrupts are delivered.
There are certain Intel platforms (I don't know if AMD does anything
like that) that put the IOAPIC in a certain configuration that causes the
interrupts to be moved in a round-robin fashion. I think it's physical
flat mode? I don't quite recall. Normally on the low-end Xeons. It's
probably worth doing a test run with the irqbalance daemon running to
make sure your traffic stream doesn't all of a sudden stop.

I've tested with irqbalance running and haven't found any noticeable
difference.

Logan



Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 1:58 p.m., Dave Jiang wrote:
> 
> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>> Does this work when the system moves the MSI vector either via software
>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>> behavior)?
>>
>> I don't know how irqbalance works, and I'm not sure what you are
>> referring to by BIOS APIC programming, however I would expect these
>> things would not be a problem.
>>
>> The MSI code I'm presenting here doesn't do anything crazy with the
>> interrupts, it allocates and uses them just as any PCI driver would. The
>> only real difference here is that instead of a piece of hardware sending
>> the IRQ TLP, it will be sent through the memory window (which, from the
>> OS's perspective, is just coming from an NTB hardware proxy alias).
>>
>> Logan
> Right. I did that as a hack a while back for some silicon errata 
> workaround. When the vector moves, the address for the LAPIC changes. So 
> unless it gets updated, you end up writing to the old location and lose 
> all the new interrupts. irqbalance is a user daemon that rotates the 
> system interrupts around to ensure that not all interrupts are pinned on 
> a single core. 

Yes, that would be a problem if something changes the MSI vectors out
from under us. Seems like that would be a bit difficult to do even with
regular hardware. So far I haven't seen anything that would do that. If
you know of where in the kernel this happens I'd be interested in
getting a pointer to the flow in the code. If that is the case this MSI
stuff will need to get much more complicated...

> I think it's enabled by default on several distros. 

> Although MSIX has nothing to do with the IOAPIC, the mode that the APIC 
> is programmed can have an influence on how the interrupts are delivered. 
> There are certain Intel platforms (I don't know if AMD does anything 
> like that) puts the IOAPIC in a certain configuration that causes the 
> interrupts to be moved in a round robin fashion. I think it's physical 
> flat mode? I don't quite recall. Normally on the low end Xeons. It's 
> probably worth doing a test run with the irqbalance daemon running and 
> make sure you traffic stream doesn't all of sudden stop.

I've tested with irqbalance running and haven't found any noticeable
difference.

Logan


Re: [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts

2019-01-31 Thread Bjorn Helgaas
[+cc Thomas, Marc]

On Thu, Jan 31, 2019 at 11:56:49AM -0700, Logan Gunthorpe wrote:
> For NTB devices, we want to be able to trigger MSI interrupts
> through a memory window. In these cases we may want to use
> more interrupts than the NTB PCI device has available in its MSI-X
> table.
> 
> We allow for this by creating a new 'virtual' interrupt. These
> interrupts are allocated as usual but are not programmed into the
> MSI-X table (as there may not be space for them).
> 
> The MSI address and data will then handled through an NTB MSI library
> introduced later in this series.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Bjorn Helgaas 

I assume you'll merge this along with the rest of the series, so:

Acked-by: Bjorn Helgaas 

Minor question and typo below.

> ---
>  drivers/pci/msi.c   | 51 +
>  include/linux/msi.h |  1 +
>  include/linux/pci.h |  9 
>  3 files changed, 52 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 4c0b47867258..145587da686c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -192,6 +192,9 @@ static void msi_mask_irq(struct msi_desc *desc, u32 mask, 
> u32 flag)
>  
>  static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
>  {
> + if (desc->msi_attrib.is_virtual)
> + return NULL;
> +
>   return desc->mask_base +
>   desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
>  }
> @@ -206,14 +209,19 @@ static void __iomem *pci_msix_desc_addr(struct msi_desc 
> *desc)
>  u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
>  {
>   u32 mask_bits = desc->masked;
> + void __iomem *desc_addr;
>  
>   if (pci_msi_ignore_mask)
>   return 0;
> + desc_addr = pci_msix_desc_addr(desc);
> + if (!desc_addr)
> + return 0;
>  
>   mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
>   if (flag)
>   mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
> - writel(mask_bits, pci_msix_desc_addr(desc) + 
> PCI_MSIX_ENTRY_VECTOR_CTRL);
> +
> + writel(mask_bits, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
>  
>   return mask_bits;
>  }
> @@ -273,6 +281,11 @@ void __pci_read_msi_msg(struct msi_desc *entry, struct 
> msi_msg *msg)
>   if (entry->msi_attrib.is_msix) {
>   void __iomem *base = pci_msix_desc_addr(entry);
>  
> + if (!base) {
> + WARN_ON(1);
> + return;
> + }
> +
>   msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR);
>   msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR);
>   msg->data = readl(base + PCI_MSIX_ENTRY_DATA);
> @@ -303,6 +316,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct 
> msi_msg *msg)
>   } else if (entry->msi_attrib.is_msix) {
>   void __iomem *base = pci_msix_desc_addr(entry);
>  
> + if (!base)
> + goto skip;
> +
>   writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
>   writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
>   writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> @@ -327,6 +343,8 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct 
> msi_msg *msg)
> msg->data);
>   }
>   }
> +
> +skip:
>   entry->msg = *msg;
>  }
>  
> @@ -550,6 +568,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, const 
> struct irq_affinity *affd)
>  
>   entry->msi_attrib.is_msix   = 0;
>   entry->msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT);
> + entry->msi_attrib.is_virtual= 0;
>   entry->msi_attrib.entry_nr  = 0;
>   entry->msi_attrib.maskbit   = !!(control & PCI_MSI_FLAGS_MASKBIT);
>   entry->msi_attrib.default_irq   = dev->irq; /* Save IOAPIC IRQ */
> @@ -674,6 +693,7 @@ static int msix_setup_entries(struct pci_dev *dev, void 
> __iomem *base,
>   struct irq_affinity_desc *curmsk, *masks = NULL;
>   struct msi_desc *entry;
>   int ret, i;
> + int vec_count = pci_msix_vec_count(dev);
>  
>   if (affd)
>   masks = irq_create_affinity_masks(nvec, affd);
> @@ -696,6 +716,10 @@ static int msix_setup_entries(struct pci_dev *dev, void 
> __iomem *base,
>   entry->msi_attrib.entry_nr = entries[i].entry;
>   else
>   entry->msi_attrib.entry_nr = i;
> +
> + entry->msi_attrib.is_virtual =
> + entry->msi_attrib.entry_nr >= vec_count;
> +
>   entry->msi_attrib.default_irq   = dev->irq;
>   entry->mask_base= base;
>  
> @@ -714,12 +738,19 @@ static void msix_program_entries(struct pci_dev *dev,
>  {
>   struct msi_desc *entry;
>   int i = 0;
> + void __iomem *desc_addr;
>  
>   for_each_pci_msi_entry(entry, dev) {
>   if (entries)
>  

Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:


On 2019-01-31 1:20 p.m., Dave Jiang wrote:

Does this work when the system moves the MSI vector either via software
(irqbalance) or BIOS APIC programming (some modes cause round robin
behavior)?


I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming, however I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts, it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).

Logan
Right. I did that as a hack a while back for some silicon errata 
workaround. When the vector moves, the address for the LAPIC changes. So 
unless it gets updated, you end up writing to the old location and lose 
all the new interrupts. irqbalance is a user daemon that rotates the 
system interrupts around to ensure that not all interrupts are pinned on 
a single core. I think it's enabled by default on several distros. 
Although MSI-X has nothing to do with the IOAPIC, the mode in which the APIC
is programmed can have an influence on how the interrupts are delivered.
There are certain Intel platforms (I don't know if AMD does anything
like that) that put the IOAPIC in a certain configuration that causes the
interrupts to be moved in a round-robin fashion. I think it's physical
flat mode? I don't quite recall. Normally on the low-end Xeons. It's
probably worth doing a test run with the irqbalance daemon running to
make sure your traffic stream doesn't all of a sudden stop.





Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 1:20 p.m., Dave Jiang wrote:
> Does this work when the system moves the MSI vector either via software 
> (irqbalance) or BIOS APIC programming (some modes cause round robin 
> behavior)?


I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming, however I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts, it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).

Logan


Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 11:56 AM, Logan Gunthorpe wrote:

Hi,

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
current hardware are quite slow and therefore switching to MSI interrupts
provides a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must know to accept MSI requests from aliased bus numbers
seeing NTB hardware typically sends proxied request IDs through
additional requester IDs. The first patch in this series adds support
for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, seeing
these interrupts will not be triggered by the hardware but through an
NTB memory window, the hardware does not actually need support or need
to know about them. Therefore we add the concept of Virtual MSI
interrupts which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3.


Logan,

Does this work when the system moves the MSI vector either via software 
(irqbalance) or BIOS APIC programming (some modes cause round robin 
behavior)?






The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.

The series is based off of v5.0-rc4 and I've tested it on top of some
of the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1

Thanks,

Logan

--

Logan Gunthorpe (9):
   iommu/vt-d: Allow interrupts from the entire bus for aliased devices
   PCI/MSI: Support allocating virtual MSI interrupts
   PCI/switchtec: Add module parameter to request more interrupts
   NTB: Introduce functions to calculate multi-port resource index
   NTB: Rename ntb.c to support multiple source files in the module
   NTB: Introduce MSI library
   NTB: Introduce NTB MSI Test Client
   NTB: Add ntb_msi_test support to ntb_test
   NTB: Add MSI interrupt support to ntb_transport

  drivers/iommu/intel_irq_remapping.c |  12 +
  drivers/ntb/Kconfig |  10 +
  drivers/ntb/Makefile|   3 +
  drivers/ntb/{ntb.c => core.c}   |   0
  drivers/ntb/msi.c   | 313 ++
  drivers/ntb/ntb_transport.c | 134 +++-
  drivers/ntb/test/Kconfig|   9 +
  drivers/ntb/test/Makefile   |   1 +
  drivers/ntb/test/ntb_msi_test.c | 416 
  drivers/pci/msi.c   |  51 ++-
  drivers/pci/switch/switchtec.c  |  12 +-
  include/linux/msi.h |   1 +
  include/linux/ntb.h | 139 
  include/linux/pci.h |   9 +
  tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
  15 files changed, 1150 insertions(+), 14 deletions(-)
  rename drivers/ntb/{ntb.c => core.c} (100%)
  create mode 100644 drivers/ntb/msi.c
  create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0




Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jason Gunthorpe
On Thu, Jan 31, 2019 at 02:35:14PM -0500, Jerome Glisse wrote:

> > Basically invert the API flow - the DMA map would be done close to
> > GUP, not buried in the driver. This absolutely doesn't work for every
> > flow we have, but it does enable the ones that people seem to care
> > about when talking about P2P.
> 
> This does not work for GPU really i do not want to have to rewrite GPU
> driver for this. Struct page is a burden and it does not bring anything
> to the table. I rather provide an all in one stop for driver to use
> this without having to worry between regular vma and special vma.

I'm talking about almost exactly what you've done in here - make a
'sgl' that is dma addresses only. 

In these VMA patches you used a simple array of physical addresses -
I'm only talking about moving that array into a 'dma sgl'.

The flow is still basically the same - the driver directly gets DMA
physical addresses with no possibility to get a struct page or CPU
memory.

And then we can build more stuff around the 'dma sgl', including
the in-kernel users Logan is worrying about.
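
[Illustrative sketch, not part of the original mail: the kind of DMA-only
scatterlist type being discussed here. The names are made up for
illustration; no such type exists in the kernel today.]

#include <linux/types.h>

/* One segment: a bus address only, with no way back to a struct page. */
struct dma_sg {
        dma_addr_t      addr;   /* device-visible (IOVA/bus) address */
        unsigned int    len;    /* segment length in bytes */
};

/* A table of such segments, produced close to GUP/dma_map time and consumed
 * by drivers that only ever hand the addresses to hardware. */
struct dma_sg_table {
        unsigned int    nents;
        struct dma_sg   ents[];
};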

Jason


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 12:35 p.m., Jerome Glisse wrote:
> So what is this O_DIRECT thing that keep coming again and again here :)
> What is the use case ? Note that bio will always have valid struct page
> of regular memory as using PCIE BAR for filesystem is crazy (you do not
> have atomic or cache coherence and many CPU instruction have _undefined_
> effect so what ever the userspace would do might do nothing.

The point is to be able to use a BAR as the source of data to write/read
from a file system. So as a simple example, if an NVMe drive had a CMB,
and you could map that CMB to userspace, you could do an O_DIRECT read
to the BAR on one drive and an O_DIRECT write from the BAR on another
drive. Thus you could bypass the upstream port of a switch (and
therefore all CPU resources) altogether.
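
[Illustrative sketch, not part of the original mail: what the use case above
would look like from userspace once supported. The device paths, the idea of
mapping the CMB BAR through its sysfs resource file, and the transfer size
are assumptions for illustration; as discussed in this thread it does not
work today without further kernel support.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 1 << 20;   /* 1 MiB, assumed to fit within the CMB */

        /* Map the (hypothetical) CMB BAR of drive A into userspace. */
        int bar_fd = open("/sys/bus/pci/devices/0000:03:00.0/resource2",
                          O_RDWR);
        void *cmb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         bar_fd, 0);
        if (bar_fd < 0 || cmb == MAP_FAILED) {
                perror("map CMB");
                return 1;
        }

        /* O_DIRECT read from a file on drive B straight into drive A's BAR:
         * the data would flow drive-to-drive through the switch without
         * ever landing in host memory. */
        int file_fd = open("/mnt/drive-b/data.bin", O_RDONLY | O_DIRECT);
        if (file_fd < 0 || pread(file_fd, cmb, len, 0) < 0)
                perror("O_DIRECT read into BAR");

        munmap(cmb, len);
        return 0;
}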

For the most part nobody would want to put a filesystem on a BAR.
(Though there have been some crazy ideas to put persistent memory behind
a CMB...)

> Now if you want to use BAR address as destination or source of directIO
> then let just update the directIO code to handle this. There is no need
> to go hack every single place in the kernel that might deal with struct
> page or sgl. Just update the place that need to understand this. We can
> even update directIO to work on weird platform. The change to directIO
> will be small, couple hundred line of code at best.

Well if you want to figure out how to remove struct page from the entire
block layer that would help everybody. But until then, it's pretty much
impossible to use the block layer (and therefore O_DIRECT) without
struct page.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 07:02:15PM +, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > > *shrug* so what if the special GUP called a VMA op instead of
> > > > traversing the VMA PTEs today? Why does it really matter? It could
> > > > easily change to a struct page flow tomorrow..
> > > 
> > > Well it's so that it's composable. We want the SGL->DMA side to work for
> > > APIs from kernel space and not have to run a completely different flow
> > > for kernel drivers than from userspace memory.
> > 
> > Yes, I think that is the important point.
> > 
> > All the other struct page discussion is not about anyone of us wanting
> > struct page - heck it is a pain to deal with, but then again it is
> > there for a reason.
> > 
> > In the typical GUP flows we have three uses of a struct page:
> > 
> >  (1) to carry a physical address.  This is mostly through
> >  struct scatterlist and struct bio_vec.  We could just store
> >  a magic PFN-like value that encodes the physical address
> >  and allow looking up a page if it exists, and we had at least
> >  two attempts at it.  In some way I think that would actually
> >  make the interfaces cleaner, but Linus has NACKed it in the
> >  past, so we'll have to convince him first that this is the
> >  way forward
> 
> Something like this (and more) has always been the roadblock with
> trying to mix BAR memory into SGL. I think it is such a big problem as
> to be unsolvable in one step.. 
> 
> Struct page doesn't even really help anything beyond dma_map as we
> still can't pretend that __iomem is normal memory for general SGL
> users.
> 
> >  (2) to keep a reference to the memory so that it doesn't go away
> >  under us due to swapping, process exit, unmapping, etc.
> >  No idea how we want to solve this, but I guess you have
> >  some smart ideas?
> 
> Jerome, how does this work anyhow? Did you do something to make the
> VMA lifetime match the p2p_map/unmap? Or can we get into a situation
> were the VMA is destroyed and the importing driver can't call the
> unmap anymore?
> 
> I know in the case of notifiers the VMA liftime should be strictly
> longer than the map/unmap - but does this mean we can never support
> non-notifier users via this scheme?

So in this version the requirement is that the importer also has an mmu
notifier registered, and that's what all GPU drivers do already. Any
driver that maps some range of a vma to a device should register itself
as an mmu notifier listener to do something when the vma goes away. I
posted a patchset a while ago to allow listeners to differentiate a vma
going away from other types of invalidation [1]

With that in place you can easily handle the pin case. Drivers really
need to do something when the vma goes away, GUP or not, as the device
is then writing/reading to/from something that does not match anything
in the process address space.

So a user that wants pinning would register a notifier, call p2p_map
with the pin flag, and ignore all notifier callbacks except the unmap
one. When the unmap one happens they have the vma, and they should call
p2p_unmap from their invalidate callback and update their device to
either some dummy memory or program it in a way that the userspace
application will notice.

This can all be handled by some helper so that drivers do not have to
write more than 5 lines of code and a function to update their device
mapping to something of their choosing.


> 
> >  (3) to make the PTEs dirty after writing to them.  Again no sure
> >  what our preferred interface here would be
> 
> This need doesn't really apply to BAR memory..
> 
> > If we solve all of the above problems I'd be more than happy to
> > go with a non-struct page based interface for BAR P2P.  But we'll
> > have to solve these issues in a generic way first.
> 
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.
> 
> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.

This does not work for GPUs; I really do not want to have to rewrite the
GPU drivers for this. Struct page is a burden and it does not bring
anything to the table. I would rather provide a one-stop shop for drivers
to use this without having to worry about regular vma versus special vma.

Note that in this patchset I reuse chunks of Logan's work and the
intention is to also allow PCI struct page to work too. But it should not
be the only mechanism.

> 
> To get to where we are today we'd need a few new IB APIs, and some
> nvme change to work with DMA-only 

Re: [PATCH v6 06/20] iommu/io-pgtable-arm-v7s: Extend MediaTek 4GB Mode

2019-01-31 Thread Evan Green
On Wed, Jan 30, 2019 at 10:59 PM Yong Wu  wrote:
>
> On Wed, 2019-01-30 at 10:28 -0800, Evan Green wrote:
> > On Mon, Dec 31, 2018 at 7:57 PM Yong Wu  wrote:
> > >
> > > MediaTek extend the arm v7s descriptor to support the dram over 4GB.
> > >
> > > In the mt2712 and mt8173, it's called "4GB mode", the physical address
> > > is from 0x4000_ to 0x1_3fff_, but from EMI point of view, it
> > > is remapped to high address from 0x1__ to 0x1__, the
> > > bit32 is always enabled. thus, in the M4U, we always enable the bit9
> > > for all PTEs which means to enable bit32 of physical address.
> >
> > I got a little lost here. I get that you're trying to explain why you
> > always used to set bit32 of the physical address. But I don't totally
> > get the part about physical addresses being from 0x4000_0000 -
> > 0x1_3fff_ffff, but also from 0x1_0000_0000-0x1_ffff_ffff. Are you
> > saying that the physical addresses from the iommu's perspective were
> > always >0x1_0000_0000?
>
> Yes. From the IOMMU's perspective, the Physical address is from
> 0x1_0000_0000 to 0x1_ffff_ffff.
>
> > But then from whose perspective is it 0x4000_0000? ...
>
> I guess from the SW point of view it is from 0x4000_0000 to 0x1_3fff_ffff.
>
> If 4GB mode is enabled, the memory property in dts like this:
>
> memory@40000000 {
> device_type = "memory";
> reg = <0 0x40000000 0x00000001 0x00000000>;
> };
>
> > oh, or you're saying there was some sort of remapping
> > facility that moved the physical addresses around?
> >
> > >
> > > but in mt8183, M4U support the dram from 0x4000_0000 to 0x3_ffff_ffff
> > > which isn't remaped. We extend the PTEs: the bit9 represent bit32 of
> > > PA and the bit4 represent bit33 of PA. Meanwhile the iova still is
> > > 32bits.
> > >
> > > In order to unify code, in the "4GB mode", we add the bit32 for the
> > > physical address manually in our driver.
> > >
> > > Correspondingly, Adding bit32 and bit33 for the PA in the iova_to_phys
> > > has to been moved into v7s.
> > >
> > > Regarding whether the pagetable address could be over 4GB, the mt8183
> > > support it while the previous mt8173 don't. thus keep it as is.
> > >
> > > Signed-off-by: Yong Wu 
> > > Reviewed-by: Robin Murphy 
> > > ---
> > >  drivers/iommu/io-pgtable-arm-v7s.c | 31 ---
> > >  drivers/iommu/io-pgtable.h |  7 +++
> > >  drivers/iommu/mtk_iommu.c  | 14 --
> > >  drivers/iommu/mtk_iommu.h  |  1 +
> > >  4 files changed, 36 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/drivers/iommu/io-pgtable-arm-v7s.c 
> > > b/drivers/iommu/io-pgtable-arm-v7s.c
> > > index 11d8505..8803a35 100644
> > > --- a/drivers/iommu/io-pgtable-arm-v7s.c
> > > +++ b/drivers/iommu/io-pgtable-arm-v7s.c
> > > @@ -124,7 +124,9 @@
> > >  #define ARM_V7S_TEX_MASK   0x7
> > >  #define ARM_V7S_ATTR_TEX(val)  (((val) & ARM_V7S_TEX_MASK) << 
> > > ARM_V7S_TEX_SHIFT)
> > >
> > > -#define ARM_V7S_ATTR_MTK_4GB   BIT(9) /* MTK extend it for 4GB 
> > > mode */
> > > +/* MediaTek extend the two bits below for over 4GB mode */
> > > +#define ARM_V7S_ATTR_MTK_PA_BIT32  BIT(9)
> > > +#define ARM_V7S_ATTR_MTK_PA_BIT33  BIT(4)
> >
> > If other vendors start doing stuff like this we'll need a more generic
> > way to handle this... but I guess until we see a pattern this is okay.
> >
> > >
> > >  /* *well, except for TEX on level 2 large pages, of course :( */
> > >  #define ARM_V7S_CONT_PAGE_TEX_SHIFT6
> > > @@ -183,13 +185,22 @@ static dma_addr_t __arm_v7s_dma_addr(void *pages)
> > >  static arm_v7s_iopte paddr_to_iopte(phys_addr_t paddr, int lvl,
> > > struct io_pgtable_cfg *cfg)
> > >  {
> > > -   return paddr & ARM_V7S_LVL_MASK(lvl);
> > > +   arm_v7s_iopte pte = paddr & ARM_V7S_LVL_MASK(lvl);
> > > +
> > > +   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
> > > +   if (paddr & BIT_ULL(32))
> > > +   pte |= ARM_V7S_ATTR_MTK_PA_BIT32;
> > > +   if (paddr & BIT_ULL(33))
> > > +   pte |= ARM_V7S_ATTR_MTK_PA_BIT33;
> > > +   }
> > > +   return pte;
> > >  }
> > >
> > >  static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, int lvl,
> > >   struct io_pgtable_cfg *cfg)
> > >  {
> > > arm_v7s_iopte mask;
> > > +   phys_addr_t paddr;
> > >
> > > if (ARM_V7S_PTE_IS_TABLE(pte, lvl))
> > > mask = ARM_V7S_TABLE_MASK;
> > > @@ -198,7 +209,14 @@ static phys_addr_t iopte_to_paddr(arm_v7s_iopte pte, 
> > > int lvl,
> > > else
> > > mask = ARM_V7S_LVL_MASK(lvl);
> > >
> > > -   return pte & mask;
> > > +   paddr = pte & mask;
> > > +   if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_MTK_4GB) {
> > > +   if (pte & ARM_V7S_ATTR_MTK_PA_BIT32)
> > > +   paddr |= BIT_ULL(32);
> > > + 

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Logan Gunthorpe



On 2019-01-31 12:02 p.m., Jason Gunthorpe wrote:
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.

> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.
> It also does present a path to solve some cases of the O_DIRECT
> problems if the block stack can develop some way to know if an IO will
> go down a DMA-only IO path or not... This seems less challenging that
> auditing every SGL user for iomem safety??


The DMA-only SGL will work for some use cases, but I think it's going to
be a challenge for others. We care most about NVMe and, therefore, the
block layer.

Given my understanding of the block layer, and it's queuing
infrastructure, I don't think having a DMA-only IO path makes sense. I
think it has to be the same path, but with a special DMA-only bio; and
endpoints would have to indicate support for that bio. I can't say I
have a deep enough understanding of the block layer to know how possible
that would be.

Logan


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jason Gunthorpe
On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > *shrug* so what if the special GUP called a VMA op instead of
> > > traversing the VMA PTEs today? Why does it really matter? It could
> > > easily change to a struct page flow tomorrow..
> > 
> > Well it's so that it's composable. We want the SGL->DMA side to work for
> > APIs from kernel space and not have to run a completely different flow
> > for kernel drivers than from userspace memory.
> 
> Yes, I think that is the important point.
> 
> All the other struct page discussion is not about anyone of us wanting
> struct page - heck it is a pain to deal with, but then again it is
> there for a reason.
> 
> In the typical GUP flows we have three uses of a struct page:
> 
>  (1) to carry a physical address.  This is mostly through
>  struct scatterlist and struct bio_vec.  We could just store
>  a magic PFN-like value that encodes the physical address
>  and allow looking up a page if it exists, and we had at least
>  two attempts at it.  In some way I think that would actually
>  make the interfaces cleaner, but Linus has NACKed it in the
>  past, so we'll have to convince him first that this is the
>  way forward

Something like this (and more) has always been the roadblock with
trying to mix BAR memory into SGL. I think it is such a big problem as
to be unsolvable in one step.. 

Struct page doesn't even really help anything beyond dma_map as we
still can't pretend that __iomem is normal memory for general SGL
users.

>  (2) to keep a reference to the memory so that it doesn't go away
>  under us due to swapping, process exit, unmapping, etc.
>  No idea how we want to solve this, but I guess you have
>  some smart ideas?

Jerome, how does this work anyhow? Did you do something to make the
VMA lifetime match the p2p_map/unmap? Or can we get into a situation
were the VMA is destroyed and the importing driver can't call the
unmap anymore?

I know in the case of notifiers the VMA liftime should be strictly
longer than the map/unmap - but does this mean we can never support
non-notifier users via this scheme?

>  (3) to make the PTEs dirty after writing to them.  Again no sure
>  what our preferred interface here would be

This need doesn't really apply to BAR memory..

> If we solve all of the above problems I'd be more than happy to
> go with a non-struct page based interface for BAR P2P.  But we'll
> have to solve these issues in a generic way first.

I still think the right direction is to build on what Logan has done -
realize that he created a DMA-only SGL - make that a formal type of
the kernel and provide the right set of APIs to work with this type,
without being forced to expose struct page.

Basically invert the API flow - the DMA map would be done close to
GUP, not buried in the driver. This absolutely doesn't work for every
flow we have, but it does enable the ones that people seem to care
about when talking about P2P.

To get to where we are today we'd need a few new IB APIs, and some
nvme change to work with DMA-only SGL's and so forth, but that doesn't
seem so bad. The API also seems much more safe and understandable than
todays version that is trying to hope that the SGL is never touched by
the CPU.

It also does present a path to solve some cases of the O_DIRECT
problems if the block stack can develop some way to know if an IO will
go down a DMA-only IO path or not... This seems less challenging that
auditing every SGL user for iomem safety??

Yes we end up with a duality, but we already basically have that with
the p2p flow today..

Jason


[PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices

2019-01-31 Thread Logan Gunthorpe
When a device has multiple aliases that all are from the same bus,
we program the IRTE to accept requests from any matching device on the
bus.

This is so NTB devices which can have requests from multiple bus-devfns
can pass MSI interrupts through across the bridge.

Signed-off-by: Logan Gunthorpe 
Cc: David Woodhouse 
---
 drivers/iommu/intel_irq_remapping.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 24d45b07f425..8d3107a6b2b4 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -356,6 +356,8 @@ static int set_hpet_sid(struct irte *irte, u8 id)
 struct set_msi_sid_data {
struct pci_dev *pdev;
u16 alias;
+   int count;
+   int busmatch_count;
 };
 
 static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
@@ -364,6 +366,10 @@ static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, 
void *opaque)
 
data->pdev = pdev;
data->alias = alias;
+   data->count++;
+
+   if (PCI_BUS_NUM(alias) == pdev->bus->number)
+   data->busmatch_count++;
 
return 0;
 }
@@ -375,6 +381,8 @@ static int set_msi_sid(struct irte *irte, struct pci_dev 
*dev)
if (!irte || !dev)
return -1;
 
+   data.count = 0;
+   data.busmatch_count = 0;
pci_for_each_dma_alias(dev, set_msi_sid_cb, &data);
 
/*
@@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte, struct pci_dev 
*dev)
set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
 PCI_DEVID(PCI_BUS_NUM(data.alias),
   dev->bus->number));
+   else if (data.count >= 2 && data.busmatch_count == data.count)
+   set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
+PCI_DEVID(dev->bus->number,
+  dev->bus->number));
else if (data.pdev->bus->number != dev->bus->number)
set_irte_sid(irte, SVT_VERIFY_SID_SQ, SQ_ALL_16, data.alias);
else
-- 
2.19.0



[PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Logan Gunthorpe
Hi,

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
current hardware are quite slow and therefore switching to MSI interrupts
provides a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must know to accept MSI requests from aliased bus numbers
seeing NTB hardware typically sends proxied request IDs through
additional requester IDs. The first patch in this series adds support
for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, seeing
these interrupts will not be triggered by the hardware but through an
NTB memory window, the hardware does not actually need support or need
to know about them. Therefore we add the concept of Virtual MSI
interrupts which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3.

The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.

The series is based off of v5.0-rc4 and I've tested it on top of some
of the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1

Thanks,

Logan

--

Logan Gunthorpe (9):
  iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  PCI/MSI: Support allocating virtual MSI interrupts
  PCI/switchtec: Add module parameter to request more interrupts
  NTB: Introduce functions to calculate multi-port resource index
  NTB: Rename ntb.c to support multiple source files in the module
  NTB: Introduce MSI library
  NTB: Introduce NTB MSI Test Client
  NTB: Add ntb_msi_test support to ntb_test
  NTB: Add MSI interrupt support to ntb_transport

 drivers/iommu/intel_irq_remapping.c |  12 +
 drivers/ntb/Kconfig |  10 +
 drivers/ntb/Makefile|   3 +
 drivers/ntb/{ntb.c => core.c}   |   0
 drivers/ntb/msi.c   | 313 ++
 drivers/ntb/ntb_transport.c | 134 +++-
 drivers/ntb/test/Kconfig|   9 +
 drivers/ntb/test/Makefile   |   1 +
 drivers/ntb/test/ntb_msi_test.c | 416 
 drivers/pci/msi.c   |  51 ++-
 drivers/pci/switch/switchtec.c  |  12 +-
 include/linux/msi.h |   1 +
 include/linux/ntb.h | 139 
 include/linux/pci.h |   9 +
 tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
 15 files changed, 1150 insertions(+), 14 deletions(-)
 rename drivers/ntb/{ntb.c => core.c} (100%)
 create mode 100644 drivers/ntb/msi.c
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 5/9] NTB: Rename ntb.c to support multiple source files in the module

2019-01-31 Thread Logan Gunthorpe
The kbuild system does not support having multiple source files in
a module if one of those source files has the same name as the module.

Therefore, we must rename ntb.c to core.c, while the module remains
ntb.ko.

This is similar to the way the nvme modules are structured.

Signed-off-by: Logan Gunthorpe 
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: Allen Hubbe 
---
 drivers/ntb/Makefile  | 2 ++
 drivers/ntb/{ntb.c => core.c} | 0
 2 files changed, 2 insertions(+)
 rename drivers/ntb/{ntb.c => core.c} (100%)

diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 1921dec1949d..537226f8e78d 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
+
+ntb-y := core.o
diff --git a/drivers/ntb/ntb.c b/drivers/ntb/core.c
similarity index 100%
rename from drivers/ntb/ntb.c
rename to drivers/ntb/core.c
-- 
2.19.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 6/9] NTB: Introduce MSI library

2019-01-31 Thread Logan Gunthorpe
The NTB MSI library allows passing MSI interrupts across a memory
window. This offers similar functionality to doorbells or messages,
except it will often have much better latency, and the client can
potentially use significantly more remote interrupts than typical hardware
provides for doorbells (which can be important in high-multiport
setups).

The library utilizes one memory window per peer and uses the highest
index memory windows. Before any ntb_msi function may be used, the user
must call ntb_msi_init(). It may then set up and tear down the memory
windows when the link state changes using ntb_msi_setup_mws() and
ntb_msi_clear_mws().

The peer which receives the interrupt must call ntbm_msi_request_irq()
to assign the interrupt handler (this function is functionally
similar to devm_request_irq()), and the returned descriptor must be
transferred to the peer, which can use it to trigger the interrupt.
The triggering peer, once having received the descriptor, can
trigger the interrupt by calling ntb_msi_peer_trigger().
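
As a rough sketch of the intended client flow (based only on the
description above and on how the test client and ntb_transport use the
library later in this series; the exact ntb_msi_peer_trigger() arguments
are assumed):

	static irqreturn_t my_isr(int irq, void *dev)
	{
		/* handle the notification from the peer */
		return IRQ_HANDLED;
	}

	static int my_client_setup(struct ntb_dev *ntb)
	{
		struct ntb_msi_desc desc;
		int ret, irq;

		ret = ntb_msi_init(ntb);
		if (ret)
			return ret;

		/* normally done from the link-up work function */
		ret = ntb_msi_setup_mws(ntb);
		if (ret)
			return ret;

		irq = ntbm_msi_request_irq(ntb, my_isr, KBUILD_MODNAME,
					   NULL, &desc);
		if (irq < 0)
			return irq;

		/*
		 * 'desc' is then transferred to the peer (e.g. via
		 * scratchpads); that peer fires the interrupt with
		 * something like ntb_msi_peer_trigger(peer_ntb, peer_idx,
		 * &desc).
		 */
		return 0;
	}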

Signed-off-by: Logan Gunthorpe 
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: Allen Hubbe 
---
 drivers/ntb/Kconfig  |  10 ++
 drivers/ntb/Makefile |   3 +-
 drivers/ntb/msi.c| 313 +++
 include/linux/ntb.h  |  69 ++
 4 files changed, 394 insertions(+), 1 deletion(-)
 create mode 100644 drivers/ntb/msi.c

diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index 95944e52fa36..b60811d91e3d 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -12,6 +12,16 @@ menuconfig NTB
 
 if NTB
 
+config NTB_MSI
+   bool "MSI Interrupt Support"
+   help
+Support using MSI interrupt forwarding instead of (or in addition to)
+hardware doorbells. MSI interrupts typically offer lower latency
+than doorbells and more MSI interrupts can be made available to
+clients. However this requires an extra memory window and support
+in the hardware driver for creating the MSI interrupts.
+
+If unsure, say N.
 source "drivers/ntb/hw/Kconfig"
 
 source "drivers/ntb/test/Kconfig"
diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 537226f8e78d..cc27ad2ef150 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
 
-ntb-y := core.o
+ntb-y  := core.o
+ntb-$(CONFIG_NTB_MSI)  += msi.o
diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
new file mode 100644
index ..b636956b3772
--- /dev/null
+++ b/drivers/ntb/msi.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe ");
+MODULE_DESCRIPTION("NTB MSI Interrupt Library");
+
+struct ntb_msi {
+   u64 base_addr;
+   u64 end_addr;
+   u32 *peer_mws[];
+};
+
+/**
+ * ntb_msi_init() - Initialize the MSI context
+ * @ntb:   NTB device context.
+ *
+ * This function must be called before any other ntb_msi function.
+ * It initializes the context for MSI operations and maps
+ * the peer memory windows.
+ *
+ * This function reserves the last N outbound memory windows (where N
+ * is the number of peers).
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_init(struct ntb_dev *ntb)
+{
+   phys_addr_t mw_phys_addr;
+   resource_size_t mw_size;
+   size_t struct_size;
+   int peer_widx;
+   int peers;
+   int ret;
+   int i;
+
+   peers = ntb_peer_port_count(ntb);
+   if (peers <= 0)
+   return -EINVAL;
+
+   struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
+
+   ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+   if (!ntb->msi)
+   return -ENOMEM;
+
+   for (i = 0; i < peers; i++) {
+   peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
+
+   ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
+  &mw_size);
+   if (ret)
+   goto unroll;
+
+   ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
+mw_size);
+   if (!ntb->msi->peer_mws[i]) {
+   ret = -EFAULT;
+   goto unroll;
+   }
+   }
+
+   return 0;
+
+unroll:
+   for (i = 0; i < peers; i++)
+   if (ntb->msi->peer_mws[i])
+   devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
+
+   devm_kfree(&ntb->dev, ntb->msi);
+   ntb->msi = NULL;
+   return ret;
+}
+EXPORT_SYMBOL(ntb_msi_init);
+
+/**
+ * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
+ * @ntb:   NTB device context.
+ *
+ * This function sets up the required inbound memory windows. It should be
+ * called from a work function after a link 

[PATCH 9/9] NTB: Add MSI interrupt support to ntb_transport

2019-01-31 Thread Logan Gunthorpe
Introduce the module parameter 'use_msi' which, when set, uses
MSI interrupts instead of doorbells for each queue pair (QP). The
parameter is only available if NTB MSI support is configured into
the kernel. We also require there to be more than one memory window
(MW) so that an extra one is available to forward the APIC region.

To use MSIs, we request one interrupt per QP and forward the MSI address
and data to the peer using scratch pad registers (SPADS) above the MW
spads. (If there are not enough SPADS the MSI interrupt will not be used.)

Once registered, we simply use ntb_msi_peer_trigger and the receiving
ISR queues up the rxc_db_work for the queue.

This addition can significantly improve performance of ntb_transport.
In a simple, untuned, apples-to-apples comparison using ntb_netdev
and iperf with switchtec hardware, I see 3.88Gb/s without MSI
interrupts and 14.1Gb/s with them, which is more than a 3x improvement.

Signed-off-by: Logan Gunthorpe 
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: Allen Hubbe 
---
 drivers/ntb/ntb_transport.c | 134 +++-
 1 file changed, 133 insertions(+), 1 deletion(-)

diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 526b65afc16a..d9e61bf28830 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -93,6 +93,12 @@ static bool use_dma;
 module_param(use_dma, bool, 0644);
 MODULE_PARM_DESC(use_dma, "Use DMA engine to perform large data copy");
 
+static bool use_msi;
+#ifdef CONFIG_NTB_MSI
+module_param(use_msi, bool, 0644);
+MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
+#endif
+
 static struct dentry *nt_debugfs_dir;
 
 /* Only two-ports NTB devices are supported */
@@ -188,6 +194,11 @@ struct ntb_transport_qp {
u64 tx_err_no_buf;
u64 tx_memcpy;
u64 tx_async;
+
+   bool use_msi;
+   int msi_irq;
+   struct ntb_msi_desc msi_desc;
+   struct ntb_msi_desc peer_msi_desc;
 };
 
 struct ntb_transport_mw {
@@ -221,6 +232,9 @@ struct ntb_transport_ctx {
u64 qp_bitmap;
u64 qp_bitmap_free;
 
+   bool use_msi;
+   unsigned int msi_spad_offset;
+
bool link_is_up;
struct delayed_work link_work;
struct work_struct link_cleanup;
@@ -667,6 +681,91 @@ static int ntb_transport_setup_qp_mw(struct 
ntb_transport_ctx *nt,
return 0;
 }
 
+static irqreturn_t ntb_transport_isr(int irq, void *dev)
+{
+   struct ntb_transport_qp *qp = dev;
+
+   tasklet_schedule(&qp->rxc_db_work);
+
+   return IRQ_HANDLED;
+}
+
+static void ntb_transport_setup_qp_peer_msi(struct ntb_transport_ctx *nt,
+   unsigned int qp_num)
+{
+   struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+   int spad = qp_num * 2 + nt->msi_spad_offset;
+
+   if (!nt->use_msi)
+   return;
+
+   if (spad >= ntb_spad_count(nt->ndev))
+   return;
+
+   qp->peer_msi_desc.addr_offset =
+   ntb_peer_spad_read(qp->ndev, PIDX, spad);
+   qp->peer_msi_desc.data =
+   ntb_peer_spad_read(qp->ndev, PIDX, spad + 1);
+
+   dev_dbg(>ndev->pdev->dev, "QP%d Peer MSI addr=%x data=%x\n",
+   qp_num, qp->peer_msi_desc.addr_offset, qp->peer_msi_desc.data);
+
+   if (qp->peer_msi_desc.addr_offset) {
+   qp->use_msi = true;
+   dev_info(&qp->ndev->pdev->dev,
+"Using MSI interrupts for QP%d\n", qp_num);
+   }
+}
+
+static void ntb_transport_setup_qp_msi(struct ntb_transport_ctx *nt,
+  unsigned int qp_num)
+{
+   struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+   int spad = qp_num * 2 + nt->msi_spad_offset;
+   int rc;
+
+   if (!nt->use_msi)
+   return;
+
+   if (spad >= ntb_spad_count(nt->ndev)) {
+   dev_warn_once(&qp->ndev->pdev->dev,
+ "Not enough SPADS to use MSI interrupts\n");
+   return;
+   }
+
+   ntb_spad_write(qp->ndev, spad, 0);
+   ntb_spad_write(qp->ndev, spad + 1, 0);
+
+   if (!qp->msi_irq) {
+   qp->msi_irq = ntbm_msi_request_irq(qp->ndev, ntb_transport_isr,
+  KBUILD_MODNAME, qp,
+  &qp->msi_desc);
+   if (qp->msi_irq < 0) {
+   dev_warn(&qp->ndev->pdev->dev,
+"Unable to allocate MSI interrupt for qp%d\n",
+qp_num);
+   return;
+   }
+   }
+
+   rc = ntb_spad_write(qp->ndev, spad, qp->msi_desc.addr_offset);
+   if (rc)
+   goto err_free_interrupt;
+
+   rc = ntb_spad_write(qp->ndev, spad + 1, qp->msi_desc.data);
+   if (rc)
+   goto err_free_interrupt;
+
+   dev_dbg(>ndev->pdev->dev, "QP%d MSI %d addr=%x data=%x\n",
+   qp_num, qp->msi_irq, 

[PATCH 7/9] NTB: Introduce NTB MSI Test Client

2019-01-31 Thread Logan Gunthorpe
Introduce a tool to test NTB MSI interrupts similar to the other
NTB test tools. This tool creates a debugfs directory for each
NTB device with the following files:

port
irqX_occurrences
peerX/port
peerX/count
peerX/trigger

The 'port' file tells the user the local port number and the
'occurrences' files tell the number of local interrupts that
have been received for each interrupt.

For each peer, the 'port' file and the 'count' file tell you the
peer's port number and number of interrupts respectively. Writing
the interrupt number to the 'trigger' file triggers the interrupt
handler for the peer, which should increment its corresponding
'occurrences' file. The 'ready' file indicates if a peer is ready;
writing to this file blocks until it is ready.

The module parameter num_irqs can be used to set the number of
local interrupts. By default this is 4. This is only limited by
the number of unused MSI interrupts registered by the hardware
(this will require support from the hardware driver) and there must
be at least 2*num_irqs + 1 SPAD registers available.

Signed-off-by: Logan Gunthorpe 
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: Allen Hubbe 
---
 drivers/ntb/test/Kconfig|   9 +
 drivers/ntb/test/Makefile   |   1 +
 drivers/ntb/test/ntb_msi_test.c | 416 
 3 files changed, 426 insertions(+)
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

diff --git a/drivers/ntb/test/Kconfig b/drivers/ntb/test/Kconfig
index a5d0eda44438..a3f3e2638935 100644
--- a/drivers/ntb/test/Kconfig
+++ b/drivers/ntb/test/Kconfig
@@ -25,3 +25,12 @@ config NTB_PERF
 to and from the window without additional software interaction.
 
 If unsure, say N.
+
+config NTB_MSI_TEST
+   tristate "NTB MSI Test Client"
+   depends on NTB_MSI
+   help
+ This tool demonstrates the use of the NTB MSI library to
+ send MSI interrupts between peers.
+
+ If unsure, say N.
diff --git a/drivers/ntb/test/Makefile b/drivers/ntb/test/Makefile
index 9e77e0b761c2..d2895ca995e4 100644
--- a/drivers/ntb/test/Makefile
+++ b/drivers/ntb/test/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_NTB_PINGPONG) += ntb_pingpong.o
 obj-$(CONFIG_NTB_TOOL) += ntb_tool.o
 obj-$(CONFIG_NTB_PERF) += ntb_perf.o
+obj-$(CONFIG_NTB_MSI_TEST) += ntb_msi_test.o
diff --git a/drivers/ntb/test/ntb_msi_test.c b/drivers/ntb/test/ntb_msi_test.c
new file mode 100644
index ..2d7435591a65
--- /dev/null
+++ b/drivers/ntb/test/ntb_msi_test.c
@@ -0,0 +1,416 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe ");
+MODULE_DESCRIPTION("Test for sending MSI interrupts over an NTB memory 
window");
+
+static int num_irqs = 4;
+module_param(num_irqs, int, 0644);
+MODULE_PARM_DESC(num_irqs, "number of irqs to use");
+
+struct ntb_msit_ctx {
+   struct ntb_dev *ntb;
+   struct dentry *dbgfs_dir;
+   struct work_struct setup_work;
+
+   struct ntb_msit_isr_ctx {
+   int irq_idx;
+   int irq_num;
+   int occurrences;
+   struct ntb_msit_ctx *nm;
+   struct ntb_msi_desc desc;
+   } *isr_ctx;
+
+   struct ntb_msit_peer {
+   struct ntb_msit_ctx *nm;
+   int pidx;
+   int num_irqs;
+   struct completion init_comp;
+   struct ntb_msi_desc *msi_desc;
+   } peers[];
+};
+
+static struct dentry *ntb_msit_dbgfs_topdir;
+
+static irqreturn_t ntb_msit_isr(int irq, void *dev)
+{
+   struct ntb_msit_isr_ctx *isr_ctx = dev;
+   struct ntb_msit_ctx *nm = isr_ctx->nm;
+
+   dev_dbg(&nm->ntb->dev, "Interrupt Occurred: %d",
+   isr_ctx->irq_idx);
+
+   isr_ctx->occurrences++;
+
+   return IRQ_HANDLED;
+}
+
+static void ntb_msit_setup_work(struct work_struct *work)
+{
+   struct ntb_msit_ctx *nm = container_of(work, struct ntb_msit_ctx,
+  setup_work);
+   int irq_count = 0;
+   int irq;
+   int ret;
+   uintptr_t i;
+
+   ret = ntb_msi_setup_mws(nm->ntb);
+   if (ret) {
+   dev_err(&nm->ntb->dev, "Unable to setup MSI windows: %d\n",
+   ret);
+   return;
+   }
+
+   for (i = 0; i < num_irqs; i++) {
+   nm->isr_ctx[i].irq_idx = i;
+   nm->isr_ctx[i].nm = nm;
+
+   if (!nm->isr_ctx[i].irq_num) {
+   irq = ntbm_msi_request_irq(nm->ntb, ntb_msit_isr,
+  KBUILD_MODNAME,
+  &nm->isr_ctx[i],
+  &nm->isr_ctx[i].desc);
+   if (irq < 0)
+   break;
+
+   nm->isr_ctx[i].irq_num = irq;
+   }
+
+   ret = 

[PATCH 8/9] NTB: Add ntb_msi_test support to ntb_test

2019-01-31 Thread Logan Gunthorpe
When the ntb_msi_test module is available, the test code will trigger
each of the interrupts and ensure the corresponding occurrences file
gets incremented.

Signed-off-by: Logan Gunthorpe 
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: Allen Hubbe 
---
 tools/testing/selftests/ntb/ntb_test.sh | 54 -
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ntb/ntb_test.sh 
b/tools/testing/selftests/ntb/ntb_test.sh
index 17ca36403d04..1a10b8f67727 100755
--- a/tools/testing/selftests/ntb/ntb_test.sh
+++ b/tools/testing/selftests/ntb/ntb_test.sh
@@ -87,10 +87,10 @@ set -e
 
 function _modprobe()
 {
-   modprobe "$@"
+   modprobe "$@" || return 1
 
if [[ "$REMOTE_HOST" != "" ]]; then
-   ssh "$REMOTE_HOST" modprobe "$@"
+   ssh "$REMOTE_HOST" modprobe "$@" || return 1
fi
 }
 
@@ -451,6 +451,30 @@ function pingpong_test()
echo "  Passed"
 }
 
+function msi_test()
+{
+   LOC=$1
+   REM=$2
+
+   write_file 1 $LOC/ready
+
+   echo "Running MSI interrupt tests on: $(subdirname $LOC) / $(subdirname 
$REM)"
+
+   CNT=$(read_file "$LOC/count")
+   for ((i = 0; i < $CNT; i++)); do
+   START=$(read_file $REM/../irq${i}_occurrences)
+   write_file $i $LOC/trigger
+   END=$(read_file $REM/../irq${i}_occurrences)
+
+   if [[ $(($END - $START)) != 1 ]]; then
+   echo "MSI did not trigger the interrupt on the remote 
side!" >&2
+   exit 1
+   fi
+   done
+
+   echo "  Passed"
+}
+
 function perf_test()
 {
USE_DMA=$1
@@ -529,6 +553,29 @@ function ntb_pingpong_tests()
_modprobe -r ntb_pingpong
 }
 
+function ntb_msi_tests()
+{
+   LOCAL_MSI="$DEBUGFS/ntb_msi_test/$LOCAL_DEV"
+   REMOTE_MSI="$REMOTE_HOST:$DEBUGFS/ntb_msi_test/$REMOTE_DEV"
+
+   echo "Starting ntb_msi_test tests..."
+
+   if ! _modprobe ntb_msi_test 2> /dev/null; then
+   echo "  Not doing MSI tests seeing the module is not available."
+   return
+   fi
+
+   port_test $LOCAL_MSI $REMOTE_MSI
+
+   LOCAL_PEER="$LOCAL_MSI/peer$LOCAL_PIDX"
+   REMOTE_PEER="$REMOTE_MSI/peer$REMOTE_PIDX"
+
+   msi_test $LOCAL_PEER $REMOTE_PEER
+   msi_test $REMOTE_PEER $LOCAL_PEER
+
+   _modprobe -r ntb_msi_test
+}
+
 function ntb_perf_tests()
 {
LOCAL_PERF="$DEBUGFS/ntb_perf/$LOCAL_DEV"
@@ -550,6 +597,7 @@ function cleanup()
_modprobe -r ntb_perf 2> /dev/null
_modprobe -r ntb_pingpong 2> /dev/null
_modprobe -r ntb_transport 2> /dev/null
+   _modprobe -r ntb_msi_test 2> /dev/null
set -e
 }
 
@@ -586,5 +634,7 @@ ntb_tool_tests
 echo
 ntb_pingpong_tests
 echo
+ntb_msi_tests
+echo
 ntb_perf_tests
 echo
-- 
2.19.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 3/9] PCI/switchtec: Add module parameter to request more interrupts

2019-01-31 Thread Logan Gunthorpe
Seeing that we want to use more interrupts in the NTB MSI code,
we need to be able to allocate more (sometimes virtual) interrupts
in the switchtec driver. Therefore, add a module parameter to
request that additional interrupts be allocated.

This puts virtually no limit on the number of MSI interrupts available
to NTB clients.

Signed-off-by: Logan Gunthorpe 
Cc: Bjorn Helgaas 
---
 drivers/pci/switch/switchtec.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index e22766c79fe9..8b1db78197d9 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -30,6 +30,10 @@ module_param(use_dma_mrpc, bool, 0644);
 MODULE_PARM_DESC(use_dma_mrpc,
 "Enable the use of the DMA MRPC feature");
 
+static int nirqs = 32;
+module_param(nirqs, int, 0644);
+MODULE_PARM_DESC(nirqs, "number of interrupts to allocate (more may be useful 
for NTB applications)");
+
 static dev_t switchtec_devt;
 static DEFINE_IDA(switchtec_minor_ida);
 
@@ -1247,8 +1251,12 @@ static int switchtec_init_isr(struct switchtec_dev 
*stdev)
int dma_mrpc_irq;
int rc;
 
-   nvecs = pci_alloc_irq_vectors(stdev->pdev, 1, 4,
- PCI_IRQ_MSIX | PCI_IRQ_MSI);
+   if (nirqs < 4)
+   nirqs = 4;
+
+   nvecs = pci_alloc_irq_vectors(stdev->pdev, 1, nirqs,
+ PCI_IRQ_MSIX | PCI_IRQ_MSI |
+ PCI_IRQ_VIRTUAL);
if (nvecs < 0)
return nvecs;
 
-- 
2.19.0

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 17/20] memory: mtk-smi: Get rid of need_larbid

2019-01-31 Thread Evan Green
On Wed, Jan 30, 2019 at 7:22 PM Yong Wu  wrote:
>
> On Wed, 2019-01-30 at 11:11 -0800, Evan Green wrote:
> > On Mon, Dec 31, 2018 at 7:59 PM Yong Wu  wrote:
> > >
> > > The "mediatek,larb-id" has already been parsed in MTK IOMMU driver.
> > > It's no need to parse it again in SMI driver. Only clean some codes.
> > > This patch is fit for all the current mt2701, mt2712, mt7623, mt8173
> > > and mt8183.
> > >
> > > After this patch, the "mediatek,larb-id" only be needed for mt2712
> > > which have 2 M4Us. In the other SoCs, we can get the larb-id from M4U
> > > in which the larbs in the "mediatek,larbs" always are ordered.
> > >
> > > CC: Matthias Brugger 
> > > Signed-off-by: Yong Wu 
> > > ---
> > >  drivers/memory/mtk-smi.c | 26 ++
> > >  1 file changed, 2 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/drivers/memory/mtk-smi.c b/drivers/memory/mtk-smi.c
> > > index 08cf40d..10e6493 100644
> > > --- a/drivers/memory/mtk-smi.c
> > > +++ b/drivers/memory/mtk-smi.c
> > > @@ -67,7 +67,6 @@ struct mtk_smi_common_plat {
> > >  };
> > >
> > >  struct mtk_smi_larb_gen {
> > > -   bool need_larbid;
> > > int port_in_larb[MTK_LARB_NR_MAX + 1];
> > > void (*config_port)(struct device *);
> > > unsigned int larb_direct_to_common_mask;
> > > @@ -153,18 +152,9 @@ void mtk_smi_larb_put(struct device *larbdev)
> > > struct mtk_smi_iommu *smi_iommu = data;
> > > unsigned int i;
> > >
> > > -   if (larb->larb_gen->need_larbid) {
> > > -   larb->mmu = &smi_iommu->larb_imu[larb->larbid].mmu;
> > > -   return 0;
> > > -   }
> > > -
> > > -   /*
> > > -* If there is no larbid property, Loop to find the corresponding
> > > -* iommu information.
> > > -*/
> > > -   for (i = 0; i < smi_iommu->larb_nr; i++) {
> > > +   for (i = 0; i < MTK_LARB_NR_MAX; i++) {
> >
> > Looks like this was the only use of mtk_smi_iommu.larb_nr. Should we
> > remove that now?
>
> This is necessary since the mt2712 which has two M4U HW.
>
> From its dtsi[1], take iommu1 as a example, its larb_nr only is 3, but
> we need scan all the larb.
>
> [1]
> http://lists.infradead.org/pipermail/linux-mediatek/2018-December/016119.html

I'm not sure I follow. My point was that this structure member is only
ever set but never read:
$ git grep '[.>]larb_nr'
drivers/iommu/mtk_iommu.c:  data->smi_imu.larb_nr = larb_nr;
drivers/iommu/mtk_iommu_v1.c:   data->smi_imu.larb_nr = larb_nr;

Maybe I've applied the patches to the wrong tree, and there is a use
of this member I'm not seeing?
-Evan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 11/20] iommu/mediatek: Move vld_pa_rng into plat_data

2019-01-31 Thread Evan Green
On Wed, Jan 30, 2019 at 7:20 PM Yong Wu  wrote:
>
> On Wed, 2019-01-30 at 10:30 -0800, Evan Green wrote:
> > On Mon, Dec 31, 2018 at 7:58 PM Yong Wu  wrote:
> > >
> > > Both mt8173 and mt8183 don't have this vld_pa_rng(valid physical address
> > > range) register while mt2712 have. Move it into the plat_data.
> > >
> > > Signed-off-by: Yong Wu 
> > > ---
> > >  drivers/iommu/mtk_iommu.c | 3 ++-
> > >  drivers/iommu/mtk_iommu.h | 1 +
> > >  2 files changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> > > index 8d8ab21..2913ddb 100644
> > > --- a/drivers/iommu/mtk_iommu.c
> > > +++ b/drivers/iommu/mtk_iommu.c
> > > @@ -548,7 +548,7 @@ static int mtk_iommu_hw_init(const struct 
> > > mtk_iommu_data *data)
> > >  upper_32_bits(data->protect_base);
> > > writel_relaxed(regval, data->base + REG_MMU_IVRP_PADDR);
> > >
> > > -   if (data->enable_4GB && data->plat_data->m4u_plat != M4U_MT8173) {
> > > +   if (data->enable_4GB && data->plat_data->vld_pa_rng) {
> > > /*
> > >  * If 4GB mode is enabled, the validate PA range is from
> > >  * 0x1__ to 0x1__. here record bit[32:30].
> > > @@ -741,6 +741,7 @@ static int __maybe_unused mtk_iommu_resume(struct 
> > > device *dev)
> > > .m4u_plat = M4U_MT2712,
> > > .has_4gb_mode = true,
> > > .has_bclk = true,
> > > +   .vld_pa_rng   = true,
> > > .larbid_remap = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
> > >  };
> > >
> > > diff --git a/drivers/iommu/mtk_iommu.h b/drivers/iommu/mtk_iommu.h
> > > index b46aeaa..a8c5d1e 100644
> > > --- a/drivers/iommu/mtk_iommu.h
> > > +++ b/drivers/iommu/mtk_iommu.h
> > > @@ -48,6 +48,7 @@ struct mtk_iommu_plat_data {
> > > /* HW will use the EMI clock if there isn't the "bclk". */
> > > boolhas_bclk;
> > > boolreset_axi;
> > > +   boolvld_pa_rng;
> >
> > I agree with Nicolas that valid_pa_range would be much clearer...
> > although, even now that I know what it's supposed to mean, I don't get
> > what it represents. What is this saying?
>
> This register in the coda is called "vld_pa_rng".
>
> How about I change it to "has_vld_pa_rng"?. In the comment above, I have
> explained the meaning(valid physical address range).
>

Ok, that sounds fine.
-Evan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 2/5] swiotlb: Add is_swiotlb_active() function

2019-01-31 Thread Joerg Roedel
From: Joerg Roedel 

This function will be used from dma_direct code to determine
the maximum segment size of a dma mapping.

Reviewed-by: Konrad Rzeszutek Wilk 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Joerg Roedel 
---
 include/linux/swiotlb.h | 6 ++
 kernel/dma/swiotlb.c| 9 +
 2 files changed, 15 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 1c22d96e1742..e9e786b4b598 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -63,6 +63,7 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev,
 extern int
 swiotlb_dma_supported(struct device *hwdev, u64 mask);
 size_t swiotlb_max_mapping_size(struct device *dev);
+bool is_swiotlb_active(void);
 
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
@@ -100,6 +101,11 @@ static inline size_t swiotlb_max_mapping_size(struct 
device *dev)
 {
return SIZE_MAX;
 }
+
+static inline bool is_swiotlb_active(void)
+{
+   return false;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 9cb21259cb0b..c873f9cc2146 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -667,3 +667,12 @@ size_t swiotlb_max_mapping_size(struct device *dev)
 {
return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE;
 }
+
+bool is_swiotlb_active(void)
+{
+   /*
+* When SWIOTLB is initialized, even if io_tlb_start points to physical
+* address zero, io_tlb_end surely doesn't.
+*/
+   return io_tlb_end != 0;
+}
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 0/5 v6] Fix virtio-blk issue with SWIOTLB

2019-01-31 Thread Joerg Roedel
Hi,

here is the next version of this patch-set. Previous
versions can be found here:

V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/

V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/

V3: https://lore.kernel.org/lkml/20190123163049.24863-1-j...@8bytes.org/

V4: https://lore.kernel.org/lkml/20190129084342.26030-1-j...@8bytes.org/

V5: https://lore.kernel.org/lkml/20190130164007.26497-1-j...@8bytes.org/

The problem solved here is a limitation of the SWIOTLB implementation,
which does not support allocations larger than 256kb.  When the
virtio-blk driver tries to read/write a block larger than that, the
allocation of the dma-handle fails and an IO error is reported.
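
For reference, the 256kb figure follows from the SWIOTLB slab parameters
(assuming the usual in-tree values IO_TLB_SHIFT = 11 and
IO_TLB_SEGSIZE = 128), which is exactly what swiotlb_max_mapping_size()
in patch 1 reports:

	((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE
		= 2048 * 128
		= 262144 bytes (256kb)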

Changes to v5 are:

- Changed patch 3 to uninline dma_max_mapping_size()

Please review.

Thanks,

Joerg

Joerg Roedel (5):
  swiotlb: Introduce swiotlb_max_mapping_size()
  swiotlb: Add is_swiotlb_active() function
  dma: Introduce dma_max_mapping_size()
  virtio: Introduce virtio_max_dma_size()
  virtio-blk: Consider virtio_max_dma_size() for maximum segment size

 Documentation/DMA-API.txt|  8 
 drivers/block/virtio_blk.c   | 10 ++
 drivers/virtio/virtio_ring.c | 11 +++
 include/linux/dma-mapping.h  |  8 
 include/linux/swiotlb.h  | 11 +++
 include/linux/virtio.h   |  2 ++
 kernel/dma/direct.c  | 11 +++
 kernel/dma/mapping.c | 14 ++
 kernel/dma/swiotlb.c | 14 ++
 9 files changed, 85 insertions(+), 4 deletions(-)

-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 3/5] dma: Introduce dma_max_mapping_size()

2019-01-31 Thread Joerg Roedel
From: Joerg Roedel 

The function returns the maximum size that can be mapped
using DMA-API functions. The patch also adds the
implementation for direct DMA and a new dma_map_ops pointer
so that other implementations can expose their limit.
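
A non-direct implementation would then expose its own limit by filling
in the new callback, roughly like this (sketch only; the names and the
returned value are made up):

	static size_t foo_max_mapping_size(struct device *dev)
	{
		return SZ_1G;	/* whatever limit the implementation has */
	}

	static const struct dma_map_ops foo_dma_ops = {
		/* ... other callbacks ... */
		.max_mapping_size	= foo_max_mapping_size,
	};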

Reviewed-by: Konrad Rzeszutek Wilk 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Joerg Roedel 
---
 Documentation/DMA-API.txt   |  8 
 include/linux/dma-mapping.h |  8 
 kernel/dma/direct.c | 11 +++
 kernel/dma/mapping.c| 14 ++
 4 files changed, 41 insertions(+)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index e133ccd60228..acfe3d0f78d1 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -195,6 +195,14 @@ Requesting the required mask does not alter the current 
mask.  If you
 wish to take advantage of it, you should issue a dma_set_mask()
 call to set the mask to the value returned.
 
+::
+
+   size_t
+   dma_direct_max_mapping_size(struct device *dev);
+
+Returns the maximum size of a mapping for the device. The size parameter
+of the mapping functions like dma_map_single(), dma_map_page() and
+others should not be larger than the returned value.
 
 Part Id - Streaming DMA mappings
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f6ded992c183..5b21f14802e1 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -130,6 +130,7 @@ struct dma_map_ops {
enum dma_data_direction direction);
int (*dma_supported)(struct device *dev, u64 mask);
u64 (*get_required_mask)(struct device *dev);
+   size_t (*max_mapping_size)(struct device *dev);
 };
 
 #define DMA_MAPPING_ERROR  (~(dma_addr_t)0)
@@ -257,6 +258,8 @@ static inline void dma_direct_sync_sg_for_cpu(struct device 
*dev,
 }
 #endif
 
+size_t dma_direct_max_mapping_size(struct device *dev);
+
 #ifdef CONFIG_HAS_DMA
 #include 
 
@@ -460,6 +463,7 @@ int dma_supported(struct device *dev, u64 mask);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
+size_t dma_max_mapping_size(struct device *dev);
 #else /* CONFIG_HAS_DMA */
 static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
@@ -561,6 +565,10 @@ static inline u64 dma_get_required_mask(struct device *dev)
 {
return 0;
 }
+static inline size_t dma_max_mapping_size(struct device *dev)
+{
+   return 0;
+}
 #endif /* CONFIG_HAS_DMA */
 
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 355d16acee6d..6310ad01f915 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -380,3 +380,14 @@ int dma_direct_supported(struct device *dev, u64 mask)
 */
return mask >= __phys_to_dma(dev, min_mask);
 }
+
+size_t dma_direct_max_mapping_size(struct device *dev)
+{
+   size_t size = SIZE_MAX;
+
+   /* If SWIOTLB is active, use its maximum mapping size */
+   if (is_swiotlb_active())
+   size = swiotlb_max_mapping_size(dev);
+
+   return size;
+}
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index a11006b6d8e8..5753008ab286 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -357,3 +357,17 @@ void dma_cache_sync(struct device *dev, void *vaddr, 
size_t size,
ops->cache_sync(dev, vaddr, size, dir);
 }
 EXPORT_SYMBOL(dma_cache_sync);
+
+size_t dma_max_mapping_size(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+   size_t size = SIZE_MAX;
+
+   if (dma_is_direct(ops))
+   size = dma_direct_max_mapping_size(dev);
+   else if (ops && ops->max_mapping_size)
+   size = ops->max_mapping_size(dev);
+
+   return size;
+}
+EXPORT_SYMBOL_GPL(dma_max_mapping_size);
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] dma: Uninline dma_max_mapping_size()

2019-01-31 Thread Joerg Roedel
On Thu, Jan 31, 2019 at 09:43:51AM -0500, Michael S. Tsirkin wrote:
> OK. Joerg can you repost the series with this squashed
> and all acks applied?

Sure, sent out now as v6.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 5/5] virtio-blk: Consider virtio_max_dma_size() for maximum segment size

2019-01-31 Thread Joerg Roedel
From: Joerg Roedel 

Segments can't be larger than the maximum DMA mapping size
supported on the platform. Take that into account when
setting the maximum segment size for a block device.

Reviewed-by: Konrad Rzeszutek Wilk 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Joerg Roedel 
---
 drivers/block/virtio_blk.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index b16a887bbd02..4bc083b7c9b5 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -723,7 +723,7 @@ static int virtblk_probe(struct virtio_device *vdev)
struct request_queue *q;
int err, index;
 
-   u32 v, blk_size, sg_elems, opt_io_size;
+   u32 v, blk_size, max_size, sg_elems, opt_io_size;
u16 min_io_size;
u8 physical_block_exp, alignment_offset;
 
@@ -826,14 +826,16 @@ static int virtblk_probe(struct virtio_device *vdev)
/* No real sector limit. */
blk_queue_max_hw_sectors(q, -1U);
 
+   max_size = virtio_max_dma_size(vdev);
+
/* Host can optionally specify maximum segment size and number of
 * segments. */
err = virtio_cread_feature(vdev, VIRTIO_BLK_F_SIZE_MAX,
   struct virtio_blk_config, size_max, &v);
if (!err)
-   blk_queue_max_segment_size(q, v);
-   else
-   blk_queue_max_segment_size(q, -1U);
+   max_size = min(max_size, v);
+
+   blk_queue_max_segment_size(q, max_size);
 
/* Host can optionally specify the block size of the device */
err = virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE,
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 4/5] virtio: Introduce virtio_max_dma_size()

2019-01-31 Thread Joerg Roedel
From: Joerg Roedel 

This function returns the maximum segment size for a single
dma transaction of a virtio device. The possible limit comes
from the SWIOTLB implementation in the Linux kernel, which
has an upper limit of (currently) 256kb of contiguous
memory it can map. Other DMA-API implementations might also
have limits.

Use the new dma_max_mapping_size() function to determine the
maximum mapping size when DMA-API is in use for virtio.

Reviewed-by: Konrad Rzeszutek Wilk 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Joerg Roedel 
---
 drivers/virtio/virtio_ring.c | 11 +++
 include/linux/virtio.h   |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cd7e755484e3..8a31c6862b2b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -266,6 +266,17 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
return false;
 }
 
+size_t virtio_max_dma_size(struct virtio_device *vdev)
+{
+   size_t max_segment_size = SIZE_MAX;
+
+   if (vring_use_dma_api(vdev))
+   max_segment_size = dma_max_mapping_size(&vdev->dev);
+
+   return max_segment_size;
+}
+EXPORT_SYMBOL_GPL(virtio_max_dma_size);
+
 static void *vring_alloc_queue(struct virtio_device *vdev, size_t size,
  dma_addr_t *dma_handle, gfp_t flag)
 {
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index fa1b5da2804e..673fe3ef3607 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -157,6 +157,8 @@ int virtio_device_freeze(struct virtio_device *dev);
 int virtio_device_restore(struct virtio_device *dev);
 #endif
 
+size_t virtio_max_dma_size(struct virtio_device *vdev);
+
 #define virtio_device_for_each_vq(vdev, vq) \
	list_for_each_entry(vq, &vdev->vqs, list)
 
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/5] swiotlb: Introduce swiotlb_max_mapping_size()

2019-01-31 Thread Joerg Roedel
From: Joerg Roedel 

The function returns the maximum size that can be remapped
by the SWIOTLB implementation. This function will be later
exposed to users through the DMA-API.

Reviewed-by: Konrad Rzeszutek Wilk 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Joerg Roedel 
---
 include/linux/swiotlb.h | 5 +
 kernel/dma/swiotlb.c| 5 +
 2 files changed, 10 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7c007ed7505f..1c22d96e1742 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -62,6 +62,7 @@ extern void swiotlb_tbl_sync_single(struct device *hwdev,
 
 extern int
 swiotlb_dma_supported(struct device *hwdev, u64 mask);
+size_t swiotlb_max_mapping_size(struct device *dev);
 
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
@@ -95,6 +96,10 @@ static inline unsigned int swiotlb_max_segment(void)
 {
return 0;
 }
+static inline size_t swiotlb_max_mapping_size(struct device *dev)
+{
+   return SIZE_MAX;
+}
 #endif /* CONFIG_SWIOTLB */
 
 extern void swiotlb_print_info(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1fb6fd68b9c7..9cb21259cb0b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -662,3 +662,8 @@ swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
return __phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
 }
+
+size_t swiotlb_max_mapping_size(struct device *dev)
+{
+   return ((size_t)1 << IO_TLB_SHIFT) * IO_TLB_SEGSIZE;
+}
-- 
2.17.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] swiotlb: Return error from swiotlb_init_with_tbl()

2019-01-31 Thread Joerg Roedel
From: Joerg Roedel 

The only reason why swiotlb_init_with_tbl() can fail is an
allocation failure in the memblock_alloc() function. But
this function just calls panic() in case it can't fulfill
the request and never returns an error, therefore
swiotlb_init_with_tbl() also never actually returns an
error.

There are three call-sites of this function in the kernel.
All of them check for an error being returned, and two of
them call panic() itself in case the function returns an
error. The third call-site handles the error case and
doesn't crash the system. This is in the swiotlb_init()
function.

Since there is a call-site which handles an error and does
not crash the system, change the function to return
allocation failures to the caller.

Signed-off-by: Joerg Roedel 
---
 kernel/dma/swiotlb.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c873f9cc2146..4619d078c67c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -203,12 +203,15 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 * between io_tlb_start and io_tlb_end.
 */
-   io_tlb_list = memblock_alloc(
+   io_tlb_list = memblock_alloc_nopanic(
PAGE_ALIGN(io_tlb_nslabs * sizeof(int)),
PAGE_SIZE);
-   io_tlb_orig_addr = memblock_alloc(
+   io_tlb_orig_addr = memblock_alloc_nopanic(
PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)),
PAGE_SIZE);
+   if (io_tlb_list == NULL || io_tlb_orig_addr == NULL)
+   goto out_fail;
+
for (i = 0; i < io_tlb_nslabs; i++) {
io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
@@ -219,7 +222,26 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
swiotlb_print_info();
 
swiotlb_set_max_segment(io_tlb_nslabs << IO_TLB_SHIFT);
+
return 0;
+
+out_fail:
+   if (io_tlb_list)
+   memblock_free(io_tlb_list,
+ PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
+
+   if (io_tlb_orig_addr)
+   memblock_free(io_tlb_orig_addr,
+ PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
+
+   io_tlb_list  = NULL;
+   io_tlb_orig_addr = NULL;
+   io_tlb_end   = 0;
+   io_tlb_start = 0;
+   io_tlb_nslabs= 0;
+   max_segment  = 0;
+
+   return -ENOMEM;
 }
 
 /*
-- 
2.16.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/3] HYPERV/IOMMU: Add Hyper-V stub IOMMU driver

2019-01-31 Thread Vitaly Kuznetsov
lantianyu1...@gmail.com writes:

> From: Lan Tianyu 
>
> On the bare metal, enabling X2APIC mode requires interrupt remapping
> function which helps to deliver irq to cpu with 32-bit APIC ID.
> Hyper-V doesn't provide interrupt remapping function so far and Hyper-V
> MSI protocol already supports to deliver interrupt to the CPU whose
> virtual processor index is more than 255. IO-APIC interrupt still has
> 8-bit APIC ID limitation.
>
> This patch is to add Hyper-V stub IOMMU driver in order to enable
> X2APIC mode successfully in Hyper-V Linux guest. The driver returns X2APIC
> interrupt remapping capability when X2APIC mode is available. Otherwise,
> it creates a Hyper-V irq domain to limit IO-APIC interrupts' affinity
> and make sure cpus assigned with IO-APIC interrupt have 8-bit APIC ID.
>
> Signed-off-by: Lan Tianyu 
> ---
>  drivers/iommu/Kconfig |   7 ++
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/hyperv-iommu.c  | 189 
> ++
>  drivers/iommu/irq_remapping.c |   3 +
>  drivers/iommu/irq_remapping.h |   1 +
>  5 files changed, 201 insertions(+)
>  create mode 100644 drivers/iommu/hyperv-iommu.c
>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 45d7021..5c397c0 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -437,4 +437,11 @@ config QCOM_IOMMU
>   help
> Support for IOMMU on certain Qualcomm SoCs.
>  
> +config HYPERV_IOMMU
> + bool "Hyper-V stub IOMMU support"
> + depends on HYPERV
> + help
> + Hyper-V stub IOMMU driver provides capability to run
> + Linux guest with X2APIC mode enabled.
> +
>  endif # IOMMU_SUPPORT
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index a158a68..8c71a15 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -32,3 +32,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> new file mode 100644
> index 000..a64b747
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define pr_fmt(fmt) "HYPERV-IR: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "irq_remapping.h"
> +
> +/*
> + * According IO-APIC spec, IO APIC has a 24-entry Interrupt
> + * Redirection Table.
> + */
> +#define IOAPIC_REMAPPING_ENTRY 24

KVM already defines KVM_IOAPIC_NUM_PINS - is this the same thing?

> +
> +static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
> +struct irq_domain *ioapic_ir_domain;
> +
> +static int hyperv_ir_set_affinity(struct irq_data *data,
> + const struct cpumask *mask, bool force)
> +{
> + struct irq_data *parent = data->parent_data;
> + struct irq_cfg *cfg = irqd_cfg(data);
> + struct IO_APIC_route_entry *entry;
> + cpumask_t cpumask;
> + int ret;
> +
> + cpumask_andnot(&cpumask, mask, &ioapic_max_cpumask);
> +
> + /* Return error If new irq affinity is out of ioapic_max_cpumask. */
> + if (!cpumask_empty(&cpumask))
> + return -EINVAL;
> +
> + ret = parent->chip->irq_set_affinity(parent, mask, force);
> + if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
> + return ret;
> +
> + entry = data->chip_data;
> + entry->dest = cfg->dest_apicid;
> + entry->vector = cfg->vector;
> + send_cleanup_vector(cfg);
> +
> + return 0;
> +}
> +
> +static struct irq_chip hyperv_ir_chip = {
> + .name   = "HYPERV-IR",
> + .irq_ack= apic_ack_irq,
> + .irq_set_affinity   = hyperv_ir_set_affinity,
> +};
> +
> +static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
> +  unsigned int virq, unsigned int nr_irqs,
> +  void *arg)
> +{
> + struct irq_alloc_info *info = arg;
> + struct IO_APIC_route_entry *entry;
> + struct irq_data *irq_data;
> + struct irq_desc *desc;
> + struct irq_cfg *cfg;
> + int ret = 0;
> +
> + if (!info || info->type != X86_IRQ_ALLOC_TYPE_IOAPIC || nr_irqs > 1)
> + return -EINVAL;
> +
> + ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
> + if (ret < 0)
> + goto fail;
> +
> + irq_data = irq_domain_get_irq_data(domain, virq);
> + cfg = irqd_cfg(irq_data);
> + if (!irq_data || !cfg) {
> + ret = -EINVAL;
> + goto fail;
> + }

fail: label doesn't do anything, you can just return (and you actually
do on the first failure path).

> +
> + irq_data->chip = &hyperv_ir_chip;
> +
> + /*
> +  * Save IOAPIC entry pointer here in order to set vector and
> +  * and dest_apicid 

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > *shrug* so what if the special GUP called a VMA op instead of
> > > traversing the VMA PTEs today? Why does it really matter? It could
> > > easily change to a struct page flow tomorrow..
> > 
> > Well it's so that it's composable. We want the SGL->DMA side to work for
> > APIs from kernel space and not have to run a completely different flow
> > for kernel drivers than from userspace memory.
> 
> Yes, I think that is the important point.
> 
> All the other struct page discussion is not about anyone of us wanting
> struct page - heck it is a pain to deal with, but then again it is
> there for a reason.
> 
> In the typical GUP flows we have three uses of a struct page:

We do not want GUP. Yes, some RDMA drivers and others use GUP, but they
should only use GUP on regular vmas, not on special vmas (ie mmap of a
device file). Allowing GUP on those is insane. It is better to special
case the peer to peer mapping because _it is_ special: nothing inside
those is managed by core mm, and drivers can deal with them in weird
ways (GPUs certainly do, and for very good reasons without which they
would perform badly).

> 
>  (1) to carry a physical address.  This is mostly through
>  struct scatterlist and struct bio_vec.  We could just store
>  a magic PFN-like value that encodes the physical address
>  and allow looking up a page if it exists, and we had at least
>  two attempts at it.  In some way I think that would actually
>  make the interfaces cleaner, but Linus has NACKed it in the
>  past, so we'll have to convince him first that this is the
>  way forward

Wasting 64 bytes just to carry an address is a waste for everyone.

>  (2) to keep a reference to the memory so that it doesn't go away
>  under us due to swapping, process exit, unmapping, etc.
>  No idea how we want to solve this, but I guess you have
>  some smart ideas?

The DMA API has _never_ dealt with page refcounts, and it has always
been up to the user of the DMA API to ascertain that it is safe for
them to map/unmap the page/resource they are providing to the DMA API.

The lifetime management of a page or resource provided to the DMA API
should remain the problem of the caller and not be something the DMA
API cares one bit about.

>  (3) to make the PTEs dirty after writing to them.  Again no sure
>  what our preferred interface here would be

Again, the DMA API has never dealt with that, nor should it. What does
a dirty pte mean for a special mapping (mmap of a device file)? There is
no single common definition for that; most drivers do not care about it,
and it gets fully ignored.

> 
> If we solve all of the above problems I'd be more than happy to
> go with a non-struct page based interface for BAR P2P.  But we'll
> have to solve these issues in a generic way first.

None of the above are problems the DMA API needs to solve. The DMA API
is about mapping some memory resource to a device. For regular main
memory it is easy on most architectures (anything with a sane IOMMU).
For IO resources it is not as straightforward, as it was often left
undefined in the architecture platform documentation or the inter-
connect standard. AFAIK mapping a BAR from one PCIe device to another
through the IOMMU works well on recent Intel and AMD platforms. We will
probably need to use some whitelist, as I am not sure this is something
Intel or AMD guarantee; I believe they want to start guaranteeing it.

So having one DMA API for regular memory and one for IO memory aka
resource (dma_map_resource()) sounds like the only sane approach here.
It is fundamentally different memory and we should not try to muddy
the waters by having it go through a single common API. There is no
benefit to that besides saving a couple hundred lines of code in some
drivers, and those couple hundred lines of code can be moved to common
helpers.
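
For reference, the resource path mentioned above can already be used
roughly like this (a minimal sketch; the importer_dev/bar_phys/bar_len
names and the error handling are placeholders):

	dma_addr_t dma;

	dma = dma_map_resource(importer_dev, bar_phys, bar_len,
			       DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(importer_dev, dma))
		return -EIO;

	/* ... hand 'dma' to the importing device ... */

	dma_unmap_resource(importer_dev, dma, bar_len, DMA_BIDIRECTIONAL, 0);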

So to me it is a lot saner to provide a helper that would deal with
the different vma types on behalf of the device than to force struct
page down. Something like:

vma_dma_map_range(vma, device, start, end, flags, pa[])
vma_dma_unmap_range(vma, device, start, end, flags, pa[])

VMA_DMA_MAP_FLAG_WRITE
VMA_DMA_MAP_FLAG_PIN

These would use GUP or special vma handling on behalf of the calling
device, or a special p2p code path for special vmas. Devices that
need pinning set the flag, and it is up to the exporting device to
accept or not. Pinning when using GUP is obvious.
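
Spelled out as prototypes, the proposed helpers might look like this
(names and argument types only illustrate the proposal above; this is
not an existing API):

	long vma_dma_map_range(struct vm_area_struct *vma, struct device *dev,
			       unsigned long start, unsigned long end,
			       unsigned long flags, dma_addr_t *pa);
	long vma_dma_unmap_range(struct vm_area_struct *vma, struct device *dev,
				 unsigned long start, unsigned long end,
				 unsigned long flags, dma_addr_t *pa);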

When the vma goes away the importing device must update its device
page table to some dummy page or do something sane, because keeping
things mapped after that point does not make sense anymore. The device is
no longer operating on a range of virtual addresses that makes sense.

So instead of pushing p2p handling into GUP so as not to disrupt existing
driver workflows, it is better to provide a helper that handles all
the gory details for the device driver. It 

Re: [PATCH 3/3] videobuf2: replace a layering violation with dma_map_resource

2019-01-31 Thread Christoph Hellwig
Hi Marek,

any chance you could retest the v2 version?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 09:05:01AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 08:44:20PM +, Jason Gunthorpe wrote:
> > Not really, for MRs most drivers care about DMA addresses only. The
> > only reason struct page ever gets involved is because it is part of
> > the GUP, SGL and dma_map family of APIs.
> 
> And the only way you get the DMA address is through the dma mapping
> APIs.  Which except for the little oddball dma_map_resource expect
> a struct page in some form.  And dma_map_resource isn't really up
> to speed for full blown P2P.
> 
> Now we could and maybe eventually should change all this.  But that
> is a pre-requisitive for doing anything more fancy, and not something
> to be hacked around.
> 
> > O_DIRECT seems to be the justification for struct page, but nobody is
> > signing up to make O_DIRECT have the required special GUP/SGL/P2P flow
> > that would be needed to *actually* make that work - so it really isn't
> > a justification today.
> 
> O_DIRECT is just the messenger.  Anything using GUP will need a struct
> page, which is all our interfaces that do I/O directly to user pages.

I do not want to allow GUP to pin I/O space; this would open a pandora's
box that we do not want to open at all. Many drivers manage their IO
space, and if they get random pinning because some other kernel bit
they never heard of starts to do GUP on their stuff, it is going to cause
havoc.

So far mmap of a device file has always been special, and it has been
reflected to userspace in all the instances I know of (media and GPU).
Pretending we can handle them like any other vma is a lie because
they were never designed that way in the first place, and it would be
disruptive to all those drivers.

Minimum disruption with minimum changes is what we should aim for, and
is what I am trying to do with this patchset. Using struct page and
allowing GUP would mean rewriting huge chunks of GPU drivers (pretty
much rewriting their whole memory management) with no benefit in the
end.

When something is special it is better to leave it that way.

Cheers,
Jérôme
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 09:02:03AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> > I do not see how VMA changes are any different than using struct page
> > in respect to userspace exposure. Those vma callback do not need to be
> > set by everyone, in fact expectation is that only handful of driver
> > will set those.
> > 
> > How can we do p2p between RDMA and GPU for instance, without exposure
> > to userspace ? At some point you need to tell userspace hey this kernel
> > does allow you to do that :)
> 
> To do RDMA on a memory region you need struct page backіng to start
> with..

No, you do not with this patchset, and there is no reason to tie RDMA to
struct page; it does not provide a single feature we would need. So as
it can be done without, and there is no benefit to using one, I do not
see why we should use one.

Cheers,
Jérôme
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH RFC 1/1] iommu: set the default iommu-dma mode as non-strict

2019-01-31 Thread Jean-Philippe Brucker
Hi,

On 31/01/2019 13:52, Zhen Lei wrote:
> Currently, many peripherals are faster than before. For example, the top
> speed of the older netcard is 10Gb/s, and now it's more than 25Gb/s. But
> when iommu page-table mapping enabled, it's hard to reach the top speed
> in strict mode, because of frequently map and unmap operations. In order
> to keep abreast of the times, I think it's better to set non-strict as
> default.

Most users won't be aware of this relaxation and will have their system
vulnerable to e.g. thunderbolt hotplug. See for example 4.3 Deferred
Invalidation in
http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/2018/MSC/MSC-2018-21.pdf

Why not keep the policy secure by default, as we do for
iommu.passthrough? And maybe add something similar to
CONFIG_IOMMU_DEFAULT_PASSTHROUGH? It's easy enough for experts to pass a
command-line argument or change the default config.
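
A minimal sketch of what that could look like in iommu.c, following the
existing pattern used for iommu_def_domain_type (the
CONFIG_IOMMU_DEFAULT_LAZY symbol here is hypothetical and would need a
matching Kconfig entry):

	#ifdef CONFIG_IOMMU_DEFAULT_LAZY
	static bool iommu_dma_strict __read_mostly;
	#else
	static bool iommu_dma_strict __read_mostly = true;
	#endif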

Thanks,
Jean

> 
> Below it's our iperf performance data of 25Gb netcard:
> strict mode: 18-20 Gb/s
> non-strict mode: 23.5 Gb/s
> 
> Signed-off-by: Zhen Lei 
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 4 ++--
>  drivers/iommu/iommu.c   | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index b799bcf..667221f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1779,13 +1779,13 @@
> 
>   iommu.strict=   [ARM64] Configure TLB invalidation behaviour
>   Format: { "0" | "1" }
> - 0 - Lazy mode.
> + 0 - Lazy mode (default).
> Request that DMA unmap operations use deferred
> invalidation of hardware TLBs, for increased
> throughput at the cost of reduced device isolation.
> Will fall back to strict mode if not supported by
> the relevant IOMMU driver.
> - 1 - Strict mode (default).
> + 1 - Strict mode.
> DMA unmap operations invalidate IOMMU hardware TLBs
> synchronously.
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3ed4db3..10e0b49 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -43,7 +43,7 @@
>  #else
>  static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
>  #endif
> -static bool iommu_dma_strict __read_mostly = true;
> +static bool iommu_dma_strict __read_mostly;
> 
>  struct iommu_callback_data {
>   const struct iommu_ops *ops;
> --
> 1.8.3
> 
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] dma: Uninline dma_max_mapping_size()

2019-01-31 Thread Michael S. Tsirkin
On Thu, Jan 31, 2019 at 03:37:23PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 31, 2019 at 02:01:27PM +0100, Joerg Roedel wrote:
> > On Thu, Jan 31, 2019 at 11:41:29AM +0100, Christoph Hellwig wrote:
> > > Sorry for not noticing last time, but since 5.0 we keep all non-fast
> > > path DMA mapping interfaces out of line, so this should move to
> > > kernel/dma/mapping.c.
> > 
> > Okay, attached patch does that. It applies on-top of this patch-set.
> > 
> > Michael, feel free to either apply this on-top of the patch-set or merge
> > the diff into patch 3, whatever you prefer.
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig 
> 
> I'd prefer it to be squashed if possible.

OK. Joerg can you repost the series with this squashed
and all acks applied?

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] dma: Uninline dma_max_mapping_size()

2019-01-31 Thread Christoph Hellwig
On Thu, Jan 31, 2019 at 02:01:27PM +0100, Joerg Roedel wrote:
> On Thu, Jan 31, 2019 at 11:41:29AM +0100, Christoph Hellwig wrote:
> > Sorry for not noticing last time, but since 5.0 we keep all non-fast
> > path DMA mapping interfaces out of line, so this should move to
> > kernel/dma/mapping.c.
> 
> Okay, attached patch does that. It applies on-top of this patch-set.
> 
> Michael, feel free to either apply this on-top of the patch-set or merge
> the diff into patch 3, whatever you prefer.

Looks good:

Reviewed-by: Christoph Hellwig 

I'd prefer it to be squashed if possible.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH RFC 1/1] iommu: set the default iommu-dma mode as non-strict

2019-01-31 Thread Zhen Lei
Currently, many peripherals are faster than before. For example, the top
speed of an older network card is 10Gb/s, while now it's more than 25Gb/s. But
when iommu page-table mapping is enabled, it's hard to reach the top speed
in strict mode because of the frequent map and unmap operations. In order
to keep up with the times, I think it's better to make non-strict the
default.

Below is our iperf performance data for the 25Gb network card:
strict mode: 18-20 Gb/s
non-strict mode: 23.5 Gb/s
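
Note that with the default flipped, anyone who wants the stronger
isolation back can still request it explicitly on the kernel command
line (example only, the parameter itself is unchanged by this patch):

	iommu.strict=1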

Signed-off-by: Zhen Lei 
---
 Documentation/admin-guide/kernel-parameters.txt | 4 ++--
 drivers/iommu/iommu.c   | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b799bcf..667221f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1779,13 +1779,13 @@

iommu.strict=   [ARM64] Configure TLB invalidation behaviour
Format: { "0" | "1" }
-   0 - Lazy mode.
+   0 - Lazy mode (default).
  Request that DMA unmap operations use deferred
  invalidation of hardware TLBs, for increased
  throughput at the cost of reduced device isolation.
  Will fall back to strict mode if not supported by
  the relevant IOMMU driver.
-   1 - Strict mode (default).
+   1 - Strict mode.
  DMA unmap operations invalidate IOMMU hardware TLBs
  synchronously.

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3ed4db3..10e0b49 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -43,7 +43,7 @@
 #else
 static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
 #endif
-static bool iommu_dma_strict __read_mostly = true;
+static bool iommu_dma_strict __read_mostly;

 struct iommu_callback_data {
const struct iommu_ops *ops;
--
1.8.3


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] dma: Uninline dma_max_mapping_size()

2019-01-31 Thread Joerg Roedel
On Thu, Jan 31, 2019 at 11:41:29AM +0100, Christoph Hellwig wrote:
> Sorry for not noticing last time, but since 5.0 we keep all non-fast
> path DMA mapping interfaces out of line, so this should move to
> kernel/dma/mapping.c.

Okay, attached patch does that. It applies on-top of this patch-set.

Michael, feel free to either apply this on-top of the patch-set or merge
the diff into patch 3, whatever you prefer.
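
For context, a typical caller would use the new helper to clamp its
transfer limits, roughly like this (illustrative only; the request queue
and the 4M cap are made up, not part of this patch):

	size_t max = min_t(size_t, dma_max_mapping_size(dev), SZ_4M);

	blk_queue_max_hw_sectors(q, max >> SECTOR_SHIFT);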

From 2bb95d2136280c79de9553852ee3370f6d42d7b3 Mon Sep 17 00:00:00 2001
From: Joerg Roedel 
Date: Thu, 31 Jan 2019 13:55:27 +0100
Subject: [PATCH] dma: Uninline dma_max_mapping_size()

The function is not performance sensitive and doesn't need
to be inlined at every call-site. Move it out of the header
into the appropriate C source file.

Signed-off-by: Joerg Roedel 
---
 include/linux/dma-mapping.h | 18 +-
 kernel/dma/direct.c |  1 -
 kernel/dma/mapping.c| 14 ++
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index a3ca8a71a704..5b21f14802e1 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -443,19 +443,6 @@ static inline int dma_mapping_error(struct device *dev, 
dma_addr_t dma_addr)
return 0;
 }
 
-static inline size_t dma_max_mapping_size(struct device *dev)
-{
-   const struct dma_map_ops *ops = get_dma_ops(dev);
-   size_t size = SIZE_MAX;
-
-   if (dma_is_direct(ops))
-   size = dma_direct_max_mapping_size(dev);
-   else if (ops && ops->max_mapping_size)
-   size = ops->max_mapping_size(dev);
-
-   return size;
-}
-
 void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t flag, unsigned long attrs);
 void dma_free_attrs(struct device *dev, size_t size, void *cpu_addr,
@@ -476,6 +463,7 @@ int dma_supported(struct device *dev, u64 mask);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
+size_t dma_max_mapping_size(struct device *dev);
 #else /* CONFIG_HAS_DMA */
 static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
@@ -577,6 +565,10 @@ static inline u64 dma_get_required_mask(struct device *dev)
 {
return 0;
 }
+static inline size_t dma_max_mapping_size(struct device *dev)
+{
+   return 0;
+}
 #endif /* CONFIG_HAS_DMA */
 
 static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 81ca8170b928..6310ad01f915 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -391,4 +391,3 @@ size_t dma_direct_max_mapping_size(struct device *dev)
 
return size;
 }
-EXPORT_SYMBOL(dma_direct_max_mapping_size);
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index a11006b6d8e8..5753008ab286 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -357,3 +357,17 @@ void dma_cache_sync(struct device *dev, void *vaddr, 
size_t size,
ops->cache_sync(dev, vaddr, size, dir);
 }
 EXPORT_SYMBOL(dma_cache_sync);
+
+size_t dma_max_mapping_size(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+   size_t size = SIZE_MAX;
+
+   if (dma_is_direct(ops))
+   size = dma_direct_max_mapping_size(dev);
+   else if (ops && ops->max_mapping_size)
+   size = ops->max_mapping_size(dev);
+
+   return size;
+}
+EXPORT_SYMBOL_GPL(dma_max_mapping_size);
-- 
2.16.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: use generic DMA mapping code in powerpc V4

2019-01-31 Thread Christian Zigotzky

Hi Christoph,

I compiled kernels for the X5000 and X1000 from your branch 
'powerpc-dma.6' today.


Gitweb: 
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6


git clone git://git.infradead.org/users/hch/misc.git -b powerpc-dma.6 a

The X1000 and X5000 boot but unfortunately the P.A. Semi Ethernet 
doesn't work.


Error messages (X1000):

[   17.371736] pci :00:1a.0: overflow 0x0002691bf802+1646 of DMA 
mask  bus mask 0
[   17.371760] WARNING: CPU: 0 PID: 2496 at kernel/dma/direct.c:43 
.dma_direct_map_page+0x11c/0x200

[   17.371762] Modules linked in:
[   17.371769] CPU: 0 PID: 2496 Comm: NetworkManager Not tainted 
5.0.0-rc4-3_A-EON_AmigaOne_X1000_Nemo-54580-g8d7a724-dirty #2
[   17.371772] NIP:  c010395c LR: c0103a30 CTR: 
c0726f70
[   17.371775] REGS: c0026900e9a0 TRAP: 0700   Not tainted 
(5.0.0-rc4-3_A-EON_AmigaOne_X1000_Nemo-54580-g8d7a724-dirty)
[   17.371777] MSR:  90029032  CR: 
2400  XER: 2000

[   17.371786] IRQMASK: 0
   GPR00: c0103a30 c0026900ec30 
c1923f00 0052
   GPR04: c0026f206778 c0026f20d458 
 0346
   GPR08: 0007  
 0010
   GPR12: 22002444 c1b1 
 
   GPR16: 10382410  
 c0026bd9d820
   GPR20:  c0026919c000 
 
   GPR24: 0800 c0026919 
c002692a4180 c0026919
   GPR28: c00277ada1c8 066e 
c0026d3c68b0 0802

[   17.371823] NIP [c010395c] .dma_direct_map_page+0x11c/0x200
[   17.371827] LR [c0103a30] .dma_direct_map_page+0x1f0/0x200
[   17.371829] Call Trace:
[   17.371833] [c0026900ec30] [c0103a30] 
.dma_direct_map_page+0x1f0/0x200 (unreliable)
[   17.371840] [c0026900ecd0] [c099b7ec] 
.pasemi_mac_replenish_rx_ring+0x12c/0x2a0
[   17.371846] [c0026900eda0] [c099dc64] 
.pasemi_mac_open+0x384/0x7c0

[   17.371853] [c0026900ee40] [c0c6f484] .__dev_open+0x134/0x1e0
[   17.371858] [c0026900eee0] [c0c6f9ec] 
.__dev_change_flags+0x1bc/0x210
[   17.371863] [c0026900ef90] [c0c6fa88] 
.dev_change_flags+0x48/0xa0

[   17.371869] [c0026900f030] [c0c8c88c] .do_setlink+0x3dc/0xf60
[   17.371875] [c0026900f1b0] [c0c8dd84] 
.__rtnl_newlink+0x5e4/0x900

[   17.371880] [c0026900f5f0] [c0c8e10c] .rtnl_newlink+0x6c/0xb0
[   17.371885] [c0026900f680] [c0c89838] 
.rtnetlink_rcv_msg+0x2e8/0x3d0
[   17.371891] [c0026900f760] [c0cc0f90] 
.netlink_rcv_skb+0x120/0x170
[   17.371896] [c0026900f820] [c0c87318] 
.rtnetlink_rcv+0x28/0x40
[   17.371901] [c0026900f8a0] [c0cc03f8] 
.netlink_unicast+0x208/0x2f0
[   17.371906] [c0026900f950] [c0cc09a8] 
.netlink_sendmsg+0x348/0x460

[   17.371911] [c0026900fa30] [c0c38774] .sock_sendmsg+0x44/0x70
[   17.371915] [c0026900fab0] [c0c3a79c] 
.___sys_sendmsg+0x30c/0x320
[   17.371920] [c0026900fca0] [c0c3c3b4] 
.__sys_sendmsg+0x74/0xf0
[   17.371926] [c0026900fd90] [c0cb4da0] 
.__se_compat_sys_sendmsg+0x40/0x60

[   17.371932] [c0026900fe20] [c000a21c] system_call+0x5c/0x70
[   17.371934] Instruction dump:
[   17.371937] 6000 f8610070 3d20 6129fffe 79290020 e8e7 
7fa74840 409d00b8
[   17.371946] 3d420001 892acb59 2f89 419e00b8 <0fe0> 382100a0 
3860 e8010010

[   17.371954] ---[ end trace a81f3c344f625f76 ]---
[   17.396654] IPv6: ADDRCONF(NETDEV_UP): enp0s20f3: link is not ready



Additionally, Xorg doesn't start on a virtual e5500 QEMU machine 
anymore. I tested with the following QEMU command:


./qemu-system-ppc64 -M ppce500 -cpu e5500 -m 2048 -kernel 
/home/christian/Downloads/vmlinux-5.0-rc4-3-AmigaOne_X1000_X5000/X5000_and_QEMU_e5500/uImage-5.0 
-drive 
format=raw,file=/home/christian/Downloads/Fienix-Beta120418.img,index=0,if=virtio 
-nic user,model=e1000 -append "rw root=/dev/vda" -device virtio-vga 
-device virtio-mouse-pci -device virtio-keyboard-pci -usb -soundhw 
es1370 -smp 4


Cheers,
Christian


On 30 January 2019 at 05:40AM, Christian Zigotzky wrote:

Hi Christoph,

Thanks a lot for the updates. I will test the full branch tomorrow.

Cheers,
Christian

Sent from my iPhone


On 29. Jan 2019, at 17:34, Christoph Hellwig  wrote:


On Tue, Jan 29, 2019 at 05:14:11PM +0100, Christoph Hellwig wrote:

On Tue, Jan 29, 2019 at 04:03:32PM +0100, Christian Zigotzky wrote:
Hi Christoph,

I compiled kernels for the X5000 and X1000 from your new branch
'powerpc-dma.6-debug.2' today. The kernels boot and the P.A. Semi Ethernet
works!

Thanks for testing!  I'll prepare a new series that adds the 

Re: [PATCHv2 1/9] mm: Introduce new vm_insert_range and vm_insert_range_buggy API

2019-01-31 Thread Heiko Stuebner
Am Donnerstag, 31. Januar 2019, 13:31:52 CET schrieb Souptick Joarder:
> On Thu, Jan 31, 2019 at 5:37 PM Heiko Stuebner  wrote:
> >
> > Am Donnerstag, 31. Januar 2019, 04:08:12 CET schrieb Souptick Joarder:
> > > Previouly drivers have their own way of mapping range of
> > > kernel pages/memory into user vma and this was done by
> > > invoking vm_insert_page() within a loop.
> > >
> > > As this pattern is common across different drivers, it can
> > > be generalized by creating new functions and use it across
> > > the drivers.
> > >
> > > vm_insert_range() is the API which could be used to mapped
> > > kernel memory/pages in drivers which has considered vm_pgoff
> > >
> > > vm_insert_range_buggy() is the API which could be used to map
> > > range of kernel memory/pages in drivers which has not considered
> > > vm_pgoff. vm_pgoff is passed default as 0 for those drivers.
> > >
> > > We _could_ then at a later "fix" these drivers which are using
> > > vm_insert_range_buggy() to behave according to the normal vm_pgoff
> > > offsetting simply by removing the _buggy suffix on the function
> > > name and if that causes regressions, it gives us an easy way to revert.
> > >
> > > Signed-off-by: Souptick Joarder 
> > > Suggested-by: Russell King 
> > > Suggested-by: Matthew Wilcox 
> >
> > hmm, I'm missing a changelog here between v1 and v2.
> > Nevertheless I managed to test v1 on Rockchip hardware
> > and display is still working, including talking to Lima via prime.
> >
> > So if there aren't any big changes for v2, on Rockchip
> > Tested-by: Heiko Stuebner 
> 
> Change log is available in [0/9].
> Patch [1/9] & [4/9] have no changes between v1 -> v2.

I never seem to get your cover-letters, so didn't see that, sorry.

But great that there weren't changes then :-)

Heiko


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv2 1/9] mm: Introduce new vm_insert_range and vm_insert_range_buggy API

2019-01-31 Thread Souptick Joarder
On Thu, Jan 31, 2019 at 5:37 PM Heiko Stuebner  wrote:
>
> Am Donnerstag, 31. Januar 2019, 04:08:12 CET schrieb Souptick Joarder:
> > Previouly drivers have their own way of mapping range of
> > kernel pages/memory into user vma and this was done by
> > invoking vm_insert_page() within a loop.
> >
> > As this pattern is common across different drivers, it can
> > be generalized by creating new functions and use it across
> > the drivers.
> >
> > vm_insert_range() is the API which could be used to mapped
> > kernel memory/pages in drivers which has considered vm_pgoff
> >
> > vm_insert_range_buggy() is the API which could be used to map
> > range of kernel memory/pages in drivers which has not considered
> > vm_pgoff. vm_pgoff is passed default as 0 for those drivers.
> >
> > We _could_ then at a later "fix" these drivers which are using
> > vm_insert_range_buggy() to behave according to the normal vm_pgoff
> > offsetting simply by removing the _buggy suffix on the function
> > name and if that causes regressions, it gives us an easy way to revert.
> >
> > Signed-off-by: Souptick Joarder 
> > Suggested-by: Russell King 
> > Suggested-by: Matthew Wilcox 
>
> hmm, I'm missing a changelog here between v1 and v2.
> Nevertheless I managed to test v1 on Rockchip hardware
> and display is still working, including talking to Lima via prime.
>
> So if there aren't any big changes for v2, on Rockchip
> Tested-by: Heiko Stuebner 

Change log is available in [0/9].
Patch [1/9] & [4/9] have no changes between v1 -> v2.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/3] x86/Hyper-V: Set x2apic destination mode to physical when x2apic is available

2019-01-31 Thread Tianyu Lan
Hi Greg:
 Thanks for your review.

On Thu, Jan 31, 2019 at 7:57 PM Greg KH  wrote:
>
> On Thu, Jan 31, 2019 at 06:17:31PM +0800, lantianyu1...@gmail.com wrote:
> > From: Lan Tianyu 
> >
> > Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic,
> > set x2apic destination mode to physical mode when x2apic is available
> > and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs have
> > 8-bit APIC id.
> >
> > Signed-off-by: Lan Tianyu 
> > ---
> >  arch/x86/kernel/cpu/mshyperv.c | 14 ++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index e81a2db..9d62f33 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -36,6 +36,8 @@
> >  struct ms_hyperv_info ms_hyperv;
> >  EXPORT_SYMBOL_GPL(ms_hyperv);
> >
> > +extern int x2apic_phys;
>
> Shouldn't this be in a .h file somewhere instead?

You are right. I should use  here. Thanks.

> thanks,
>
> greg k-h



-- 
Best regards
Tianyu Lan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 2/3] HYPERV/IOMMU: Add Hyper-V stub IOMMU driver

2019-01-31 Thread lantianyu1986
From: Lan Tianyu 

On bare metal, enabling X2APIC mode requires an interrupt remapping
function which helps to deliver irqs to cpus with 32-bit APIC IDs.
Hyper-V doesn't provide an interrupt remapping function so far, and the
Hyper-V MSI protocol already supports delivering interrupts to CPUs whose
virtual processor index is more than 255. IO-APIC interrupts still have
the 8-bit APIC ID limitation.

This patch adds a Hyper-V stub IOMMU driver in order to enable
X2APIC mode successfully in a Hyper-V Linux guest. The driver returns the
X2APIC interrupt remapping capability when X2APIC mode is available. Otherwise,
it creates a Hyper-V irq domain to limit IO-APIC interrupts' affinity
and make sure cpus assigned an IO-APIC interrupt have an 8-bit APIC ID.

Signed-off-by: Lan Tianyu 
---
 drivers/iommu/Kconfig |   7 ++
 drivers/iommu/Makefile|   1 +
 drivers/iommu/hyperv-iommu.c  | 189 ++
 drivers/iommu/irq_remapping.c |   3 +
 drivers/iommu/irq_remapping.h |   1 +
 5 files changed, 201 insertions(+)
 create mode 100644 drivers/iommu/hyperv-iommu.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 45d7021..5c397c0 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -437,4 +437,11 @@ config QCOM_IOMMU
help
  Support for IOMMU on certain Qualcomm SoCs.
 
+config HYPERV_IOMMU
+   bool "Hyper-V stub IOMMU support"
+   depends on HYPERV
+   help
+   The Hyper-V stub IOMMU driver provides the capability to run
+   a Linux guest with X2APIC mode enabled.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index a158a68..8c71a15 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -32,3 +32,4 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
 obj-$(CONFIG_QCOM_IOMMU) += qcom_iommu.o
+obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
new file mode 100644
index 000..a64b747
--- /dev/null
+++ b/drivers/iommu/hyperv-iommu.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "HYPERV-IR: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include "irq_remapping.h"
+
+/*
+ * According to the IO-APIC spec, the IO APIC has a 24-entry Interrupt
+ * Redirection Table.
+ */
+#define IOAPIC_REMAPPING_ENTRY 24
+
+static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
+struct irq_domain *ioapic_ir_domain;
+
+static int hyperv_ir_set_affinity(struct irq_data *data,
+   const struct cpumask *mask, bool force)
+{
+   struct irq_data *parent = data->parent_data;
+   struct irq_cfg *cfg = irqd_cfg(data);
+   struct IO_APIC_route_entry *entry;
+   cpumask_t cpumask;
+   int ret;
+
+   cpumask_andnot(&cpumask, mask, &ioapic_max_cpumask);
+
+   /* Return error If new irq affinity is out of ioapic_max_cpumask. */
+   if (!cpumask_empty(&cpumask))
+   return -EINVAL;
+
+   ret = parent->chip->irq_set_affinity(parent, mask, force);
+   if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
+   return ret;
+
+   entry = data->chip_data;
+   entry->dest = cfg->dest_apicid;
+   entry->vector = cfg->vector;
+   send_cleanup_vector(cfg);
+
+   return 0;
+}
+
+static struct irq_chip hyperv_ir_chip = {
+   .name   = "HYPERV-IR",
+   .irq_ack= apic_ack_irq,
+   .irq_set_affinity   = hyperv_ir_set_affinity,
+};
+
+static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
+unsigned int virq, unsigned int nr_irqs,
+void *arg)
+{
+   struct irq_alloc_info *info = arg;
+   struct IO_APIC_route_entry *entry;
+   struct irq_data *irq_data;
+   struct irq_desc *desc;
+   struct irq_cfg *cfg;
+   int ret = 0;
+
+   if (!info || info->type != X86_IRQ_ALLOC_TYPE_IOAPIC || nr_irqs > 1)
+   return -EINVAL;
+
+   ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
+   if (ret < 0)
+   goto fail;
+
+   irq_data = irq_domain_get_irq_data(domain, virq);
+   cfg = irqd_cfg(irq_data);
+   if (!irq_data || !cfg) {
+   ret = -EINVAL;
+   goto fail;
+   }
+
+   irq_data->chip = &hyperv_ir_chip;
+
+   /*
+* Save IOAPIC entry pointer here in order to set vector and
+* dest_apicid in the hyperv_irq_remapping_activate()
+* and hyperv_ir_set_affinity(). IOAPIC driver ignores
+* cfg->dest_apicid and cfg->vector when irq remapping
+* mode is enabled. Detail see ioapic_configure_entry().
+*/
+   irq_data->chip_data = entry = info->ioapic_entry;
+
+   /*
+* Hyper-V IO APIC irq affinity should be in the scope of
+* 

[PATCH 1/3] x86/Hyper-V: Set x2apic destination mode to physical when x2apic is available

2019-01-31 Thread lantianyu1986
From: Lan Tianyu 

Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic,
set x2apic destination mode to physical mode when x2apic is available
and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs have
8-bit APIC id.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kernel/cpu/mshyperv.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e81a2db..9d62f33 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -36,6 +36,8 @@
 struct ms_hyperv_info ms_hyperv;
 EXPORT_SYMBOL_GPL(ms_hyperv);
 
+extern int x2apic_phys;
+
 #if IS_ENABLED(CONFIG_HYPERV)
 static void (*vmbus_handler)(void);
 static void (*hv_stimer0_handler)(void);
@@ -328,6 +330,18 @@ static void __init ms_hyperv_init_platform(void)
 # ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = hv_smp_prepare_boot_cpu;
 # endif
+
+/*
+ * Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic,
+ * set x2apic destination mode to physical mode when x2apic is available
+ * and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs
+ * have 8-bit APIC id.
+ */
+# if IS_ENABLED(CONFIG_HYPERV_IOMMU)
+   if (x2apic_supported())
+   x2apic_phys = 1;
+# endif
+
 #endif
 }
 
-- 
2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 0/3] x86/Hyper-V/IOMMU: Add Hyper-V IOMMU driver to support x2apic mode

2019-01-31 Thread lantianyu1986
From: Lan Tianyu 

On bare metal, enabling X2APIC mode requires an interrupt remapping
function which helps to deliver irqs to cpus with 32-bit APIC IDs.
Hyper-V doesn't provide an interrupt remapping function so far, and the
Hyper-V MSI protocol already supports delivering interrupts to CPUs whose
virtual processor index is more than 255. IO-APIC interrupts still have
the 8-bit APIC ID limitation.

This patchset adds a Hyper-V stub IOMMU driver in order to enable
X2APIC mode successfully in a Hyper-V Linux guest. The driver returns the
X2APIC interrupt remapping capability when X2APIC mode is available. X2APIC
destination mode is set to physical by PATCH 1 when X2APIC is available.
The Hyper-V IOMMU driver scans cpus 0~255 and adds a cpu to the IO-APIC MAX cpu
affinity cpumask if its APIC ID is 8-bit. The driver creates a Hyper-V irq domain
to limit IO-APIC interrupts' affinity and make sure cpus assigned an IO-APIC
interrupt are within the IO-APIC MAX cpu affinity.

Lan Tianyu (3):
  x86/Hyper-V: Set x2apic destination mode to physical when x2apic is   
 available
  HYPERV/IOMMU: Add Hyper-V stub IOMMU driver
  MAINTAINERS: Add Hyper-V IOMMU driver into Hyper-V CORE AND DRIVERS
scope

 MAINTAINERS|   1 +
 arch/x86/kernel/cpu/mshyperv.c |  14 +++
 drivers/iommu/Kconfig  |   7 ++
 drivers/iommu/Makefile |   1 +
 drivers/iommu/hyperv-iommu.c   | 189 +
 drivers/iommu/irq_remapping.c  |   3 +
 drivers/iommu/irq_remapping.h  |   1 +
 7 files changed, 216 insertions(+)
 create mode 100644 drivers/iommu/hyperv-iommu.c

-- 
2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/3] HYPERV/IOMMU: Add Hyper-V stub IOMMU driver

2019-01-31 Thread Tianyu Lan
On Thu, Jan 31, 2019 at 7:59 PM Greg KH  wrote:
>
> On Thu, Jan 31, 2019 at 06:17:32PM +0800, lantianyu1...@gmail.com wrote:
> > --- /dev/null
> > +++ b/drivers/iommu/hyperv-iommu.c
> > @@ -0,0 +1,189 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#define pr_fmt(fmt) "HYPERV-IR: " fmt
>
> Minor nit, you never do any pr_*() calls, so this isn't needed, right?

Yes, you are right. I will remove it. Sorry, I used pr_info() during the
development stage and removed the calls before sending the patch out.
Thanks.

>
> > +static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
> > +struct irq_domain *ioapic_ir_domain;
>
> Global?  Why?

It should be "static" here.

-- 
Best regards
Tianyu Lan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv2 1/9] mm: Introduce new vm_insert_range and vm_insert_range_buggy API

2019-01-31 Thread Heiko Stuebner
Am Donnerstag, 31. Januar 2019, 04:08:12 CET schrieb Souptick Joarder:
> Previously drivers have had their own way of mapping a range of
> kernel pages/memory into a user vma, and this was done by
> invoking vm_insert_page() within a loop.
> 
> As this pattern is common across different drivers, it can
> be generalized by creating new functions and use it across
> the drivers.
> 
> vm_insert_range() is the API which could be used to map
> kernel memory/pages in drivers which have considered vm_pgoff
> 
> vm_insert_range_buggy() is the API which could be used to map a
> range of kernel memory/pages in drivers which have not considered
> vm_pgoff. vm_pgoff is passed as 0 by default for those drivers.
> 
> We _could_ then at a later "fix" these drivers which are using
> vm_insert_range_buggy() to behave according to the normal vm_pgoff
> offsetting simply by removing the _buggy suffix on the function
> name and if that causes regressions, it gives us an easy way to revert.
> 
> Signed-off-by: Souptick Joarder 
> Suggested-by: Russell King 
> Suggested-by: Matthew Wilcox 

hmm, I'm missing a changelog here between v1 and v2.
Nevertheless I managed to test v1 on Rockchip hardware
and display is still working, including talking to Lima via prime.

So if there aren't any big changes for v2, on Rockchip
Tested-by: Heiko Stuebner 

Heiko


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/3] HYPERV/IOMMU: Add Hyper-V stub IOMMU driver

2019-01-31 Thread Greg KH
On Thu, Jan 31, 2019 at 06:17:32PM +0800, lantianyu1...@gmail.com wrote:
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -0,0 +1,189 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define pr_fmt(fmt) "HYPERV-IR: " fmt

Minor nit, you never do any pr_*() calls, so this isn't needed, right?

> +static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
> +struct irq_domain *ioapic_ir_domain;

Global?  Why?

thanks,

greg k-h
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/3] x86/Hyper-V: Set x2apic destination mode to physical when x2apic is available

2019-01-31 Thread Greg KH
On Thu, Jan 31, 2019 at 06:17:31PM +0800, lantianyu1...@gmail.com wrote:
> From: Lan Tianyu 
> 
> Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic,
> set x2apic destination mode to physical mode when x2apic is available
> and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs have
> 8-bit APIC id.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index e81a2db..9d62f33 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -36,6 +36,8 @@
>  struct ms_hyperv_info ms_hyperv;
>  EXPORT_SYMBOL_GPL(ms_hyperv);
>  
> +extern int x2apic_phys;

Shouldn't this be in a .h file somewhere instead?

thanks,

greg k-h
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv2 1/9] mm: Introduce new vm_insert_range and vm_insert_range_buggy API

2019-01-31 Thread Mike Rapoport
On Thu, Jan 31, 2019 at 03:43:39PM +0530, Souptick Joarder wrote:
> On Thu, Jan 31, 2019 at 2:09 PM Mike Rapoport  wrote:
> >
> > On Thu, Jan 31, 2019 at 08:38:12AM +0530, Souptick Joarder wrote:
> > > Previouly drivers have their own way of mapping range of
> > > kernel pages/memory into user vma and this was done by
> > > invoking vm_insert_page() within a loop.
> > >
> > > As this pattern is common across different drivers, it can
> > > be generalized by creating new functions and use it across
> > > the drivers.
> > >
> > > vm_insert_range() is the API which could be used to mapped
> > > kernel memory/pages in drivers which has considered vm_pgoff
> > >
> > > vm_insert_range_buggy() is the API which could be used to map
> > > range of kernel memory/pages in drivers which has not considered
> > > vm_pgoff. vm_pgoff is passed default as 0 for those drivers.
> > >
> > > We _could_ then at a later "fix" these drivers which are using
> > > vm_insert_range_buggy() to behave according to the normal vm_pgoff
> > > offsetting simply by removing the _buggy suffix on the function
> > > name and if that causes regressions, it gives us an easy way to revert.
> > >
> > > Signed-off-by: Souptick Joarder 
> > > Suggested-by: Russell King 
> > > Suggested-by: Matthew Wilcox 
> > > ---
> > >  include/linux/mm.h |  4 +++
> > >  mm/memory.c| 81 
> > > ++
> > >  mm/nommu.c | 14 ++
> > >  3 files changed, 99 insertions(+)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 80bb640..25752b0 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2565,6 +2565,10 @@ unsigned long change_prot_numa(struct 
> > > vm_area_struct *vma,
> > >  int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> > >   unsigned long pfn, unsigned long size, pgprot_t);
> > >  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct 
> > > page *);
> > > +int vm_insert_range(struct vm_area_struct *vma, struct page **pages,
> > > + unsigned long num);
> > > +int vm_insert_range_buggy(struct vm_area_struct *vma, struct page 
> > > **pages,
> > > + unsigned long num);
> > >  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> > >   unsigned long pfn);
> > >  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long 
> > > addr,
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index e11ca9d..0a4bf57 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1520,6 +1520,87 @@ int vm_insert_page(struct vm_area_struct *vma, 
> > > unsigned long addr,
> > >  }
> > >  EXPORT_SYMBOL(vm_insert_page);
> > >
> > > +/**
> > > + * __vm_insert_range - insert range of kernel pages into user vma
> > > + * @vma: user vma to map to
> > > + * @pages: pointer to array of source kernel pages
> > > + * @num: number of pages in page array
> > > + * @offset: user's requested vm_pgoff
> > > + *
> > > + * This allows drivers to insert range of kernel pages they've allocated
> > > + * into a user vma.
> > > + *
> > > + * If we fail to insert any page into the vma, the function will return
> > > + * immediately leaving any previously inserted pages present.  Callers
> > > + * from the mmap handler may immediately return the error as their caller
> > > + * will destroy the vma, removing any successfully inserted pages. Other
> > > + * callers should make their own arrangements for calling unmap_region().
> > > + *
> > > + * Context: Process context.
> > > + * Return: 0 on success and error code otherwise.
> > > + */
> > > +static int __vm_insert_range(struct vm_area_struct *vma, struct page 
> > > **pages,
> > > + unsigned long num, unsigned long offset)
> > > +{
> > > + unsigned long count = vma_pages(vma);
> > > + unsigned long uaddr = vma->vm_start;
> > > + int ret, i;
> > > +
> > > + /* Fail if the user requested offset is beyond the end of the 
> > > object */
> > > + if (offset > num)
> > > + return -ENXIO;
> > > +
> > > + /* Fail if the user requested size exceeds available object size */
> > > + if (count > num - offset)
> > > + return -ENXIO;
> > > +
> > > + for (i = 0; i < count; i++) {
> > > + ret = vm_insert_page(vma, uaddr, pages[offset + i]);
> > > + if (ret < 0)
> > > + return ret;
> > > + uaddr += PAGE_SIZE;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * vm_insert_range - insert range of kernel pages starts with non zero 
> > > offset
> > > + * @vma: user vma to map to
> > > + * @pages: pointer to array of source kernel pages
> > > + * @num: number of pages in page array
> > > + *
> > > + * Maps an object consisting of `num' `pages', catering for the user's
> > > + * requested vm_pgoff
> > > + *
> >
> > 

Re: [PATCH 3/5] dma: Introduce dma_max_mapping_size()

2019-01-31 Thread Christoph Hellwig
> +static inline size_t dma_max_mapping_size(struct device *dev)
> +{
> + const struct dma_map_ops *ops = get_dma_ops(dev);
> + size_t size = SIZE_MAX;
> +
> + if (dma_is_direct(ops))
> + size = dma_direct_max_mapping_size(dev);
> + else if (ops && ops->max_mapping_size)
> + size = ops->max_mapping_size(dev);
> +
> + return size;
> +}

Sorry for not noticing last time, but since 5.0 we keep all non-fast
path DMA mapping interfaces out of line, so this should move to
kernel/dma/mapping.c.

> +EXPORT_SYMBOL(dma_direct_max_mapping_size);

And then there is no need to export this one.

The dma_max_mapping_size export should be EXPORT_SYMBOL_GPL like all
new dma-mapping interfaces.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCHv2 1/9] mm: Introduce new vm_insert_range and vm_insert_range_buggy API

2019-01-31 Thread Souptick Joarder
On Thu, Jan 31, 2019 at 2:09 PM Mike Rapoport  wrote:
>
> On Thu, Jan 31, 2019 at 08:38:12AM +0530, Souptick Joarder wrote:
> > Previouly drivers have their own way of mapping range of
> > kernel pages/memory into user vma and this was done by
> > invoking vm_insert_page() within a loop.
> >
> > As this pattern is common across different drivers, it can
> > be generalized by creating new functions and use it across
> > the drivers.
> >
> > vm_insert_range() is the API which could be used to mapped
> > kernel memory/pages in drivers which has considered vm_pgoff
> >
> > vm_insert_range_buggy() is the API which could be used to map
> > range of kernel memory/pages in drivers which has not considered
> > vm_pgoff. vm_pgoff is passed default as 0 for those drivers.
> >
> > We _could_ then at a later "fix" these drivers which are using
> > vm_insert_range_buggy() to behave according to the normal vm_pgoff
> > offsetting simply by removing the _buggy suffix on the function
> > name and if that causes regressions, it gives us an easy way to revert.
> >
> > Signed-off-by: Souptick Joarder 
> > Suggested-by: Russell King 
> > Suggested-by: Matthew Wilcox 
> > ---
> >  include/linux/mm.h |  4 +++
> >  mm/memory.c| 81 
> > ++
> >  mm/nommu.c | 14 ++
> >  3 files changed, 99 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 80bb640..25752b0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2565,6 +2565,10 @@ unsigned long change_prot_numa(struct vm_area_struct 
> > *vma,
> >  int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> >   unsigned long pfn, unsigned long size, pgprot_t);
> >  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct 
> > page *);
> > +int vm_insert_range(struct vm_area_struct *vma, struct page **pages,
> > + unsigned long num);
> > +int vm_insert_range_buggy(struct vm_area_struct *vma, struct page **pages,
> > + unsigned long num);
> >  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> >   unsigned long pfn);
> >  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long 
> > addr,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e11ca9d..0a4bf57 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1520,6 +1520,87 @@ int vm_insert_page(struct vm_area_struct *vma, 
> > unsigned long addr,
> >  }
> >  EXPORT_SYMBOL(vm_insert_page);
> >
> > +/**
> > + * __vm_insert_range - insert range of kernel pages into user vma
> > + * @vma: user vma to map to
> > + * @pages: pointer to array of source kernel pages
> > + * @num: number of pages in page array
> > + * @offset: user's requested vm_pgoff
> > + *
> > + * This allows drivers to insert range of kernel pages they've allocated
> > + * into a user vma.
> > + *
> > + * If we fail to insert any page into the vma, the function will return
> > + * immediately leaving any previously inserted pages present.  Callers
> > + * from the mmap handler may immediately return the error as their caller
> > + * will destroy the vma, removing any successfully inserted pages. Other
> > + * callers should make their own arrangements for calling unmap_region().
> > + *
> > + * Context: Process context.
> > + * Return: 0 on success and error code otherwise.
> > + */
> > +static int __vm_insert_range(struct vm_area_struct *vma, struct page 
> > **pages,
> > + unsigned long num, unsigned long offset)
> > +{
> > + unsigned long count = vma_pages(vma);
> > + unsigned long uaddr = vma->vm_start;
> > + int ret, i;
> > +
> > + /* Fail if the user requested offset is beyond the end of the object 
> > */
> > + if (offset > num)
> > + return -ENXIO;
> > +
> > + /* Fail if the user requested size exceeds available object size */
> > + if (count > num - offset)
> > + return -ENXIO;
> > +
> > + for (i = 0; i < count; i++) {
> > + ret = vm_insert_page(vma, uaddr, pages[offset + i]);
> > + if (ret < 0)
> > + return ret;
> > + uaddr += PAGE_SIZE;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * vm_insert_range - insert range of kernel pages starts with non zero 
> > offset
> > + * @vma: user vma to map to
> > + * @pages: pointer to array of source kernel pages
> > + * @num: number of pages in page array
> > + *
> > + * Maps an object consisting of `num' `pages', catering for the user's
> > + * requested vm_pgoff
> > + *
>
> The elaborate description you've added to __vm_insert_range() is better put
> here, as this is the "public" function.

Ok, will add it in v3. Which means __vm_insert_range() still needs a short
description?
>
> > + * Context: Process context. Called by mmap handlers.
> > + * Return: 0 on success and error 

Re: [PATCHv2 1/9] mm: Introduce new vm_insert_range and vm_insert_range_buggy API

2019-01-31 Thread Mike Rapoport
On Thu, Jan 31, 2019 at 08:38:12AM +0530, Souptick Joarder wrote:
> Previously drivers have had their own way of mapping a range of
> kernel pages/memory into a user vma, and this was done by
> invoking vm_insert_page() within a loop.
> 
> As this pattern is common across different drivers, it can
> be generalized by creating new functions and use it across
> the drivers.
> 
> vm_insert_range() is the API which could be used to map
> kernel memory/pages in drivers which have considered vm_pgoff
> 
> vm_insert_range_buggy() is the API which could be used to map a
> range of kernel memory/pages in drivers which have not considered
> vm_pgoff. vm_pgoff is passed as 0 by default for those drivers.
> 
> We _could_ then at a later "fix" these drivers which are using
> vm_insert_range_buggy() to behave according to the normal vm_pgoff
> offsetting simply by removing the _buggy suffix on the function
> name and if that causes regressions, it gives us an easy way to revert.
> 
> Signed-off-by: Souptick Joarder 
> Suggested-by: Russell King 
> Suggested-by: Matthew Wilcox 
> ---
>  include/linux/mm.h |  4 +++
>  mm/memory.c| 81 
> ++
>  mm/nommu.c | 14 ++
>  3 files changed, 99 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb640..25752b0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2565,6 +2565,10 @@ unsigned long change_prot_numa(struct vm_area_struct 
> *vma,
>  int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
>   unsigned long pfn, unsigned long size, pgprot_t);
>  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page 
> *);
> +int vm_insert_range(struct vm_area_struct *vma, struct page **pages,
> + unsigned long num);
> +int vm_insert_range_buggy(struct vm_area_struct *vma, struct page **pages,
> + unsigned long num);
>  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn);
>  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long 
> addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9d..0a4bf57 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1520,6 +1520,87 @@ int vm_insert_page(struct vm_area_struct *vma, 
> unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_page);
> 
> +/**
> + * __vm_insert_range - insert range of kernel pages into user vma
> + * @vma: user vma to map to
> + * @pages: pointer to array of source kernel pages
> + * @num: number of pages in page array
> + * @offset: user's requested vm_pgoff
> + *
> + * This allows drivers to insert range of kernel pages they've allocated
> + * into a user vma.
> + *
> + * If we fail to insert any page into the vma, the function will return
> + * immediately leaving any previously inserted pages present.  Callers
> + * from the mmap handler may immediately return the error as their caller
> + * will destroy the vma, removing any successfully inserted pages. Other
> + * callers should make their own arrangements for calling unmap_region().
> + *
> + * Context: Process context.
> + * Return: 0 on success and error code otherwise.
> + */
> +static int __vm_insert_range(struct vm_area_struct *vma, struct page **pages,
> + unsigned long num, unsigned long offset)
> +{
> + unsigned long count = vma_pages(vma);
> + unsigned long uaddr = vma->vm_start;
> + int ret, i;
> +
> + /* Fail if the user requested offset is beyond the end of the object */
> + if (offset > num)
> + return -ENXIO;
> +
> + /* Fail if the user requested size exceeds available object size */
> + if (count > num - offset)
> + return -ENXIO;
> +
> + for (i = 0; i < count; i++) {
> + ret = vm_insert_page(vma, uaddr, pages[offset + i]);
> + if (ret < 0)
> + return ret;
> + uaddr += PAGE_SIZE;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * vm_insert_range - insert range of kernel pages starts with non zero offset
> + * @vma: user vma to map to
> + * @pages: pointer to array of source kernel pages
> + * @num: number of pages in page array
> + *
> + * Maps an object consisting of `num' `pages', catering for the user's
> + * requested vm_pgoff
> + *

The elaborate description you've added to __vm_insert_range() is better put
here, as this is the "public" function.
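
(As a side note for readers of the thread, a typical mmap handler would
end up using the public helper roughly like below; the driver structure
and field names are made up, this is not part of the patch:

	static int my_drv_mmap(struct file *file, struct vm_area_struct *vma)
	{
		struct my_drv_buffer *buf = file->private_data;

		/* buf->pages[] was filled when the buffer was allocated */
		return vm_insert_range(vma, buf->pages, buf->page_count);
	}

vm_insert_range() then applies vma->vm_pgoff and the size checks
internally, which is the duplicated logic this series removes from the
individual drivers.)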

> + * Context: Process context. Called by mmap handlers.
> + * Return: 0 on success and error code otherwise.
> + */
> +int vm_insert_range(struct vm_area_struct *vma, struct page **pages,
> + unsigned long num)
> +{
> + return __vm_insert_range(vma, pages, num, vma->vm_pgoff);
> +}
> +EXPORT_SYMBOL(vm_insert_range);
> +
> +/**
> + * vm_insert_range_buggy - insert range of kernel pages starts with zero 
> offset
> + * @vma: user vma to map to
> 

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Christoph Hellwig
On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > *shrug* so what if the special GUP called a VMA op instead of
> > traversing the VMA PTEs today? Why does it really matter? It could
> > easily change to a struct page flow tomorrow..
> 
> Well it's so that it's composable. We want the SGL->DMA side to work for
> APIs from kernel space and not have to run a completely different flow
> for kernel drivers than from userspace memory.

Yes, I think that is the important point.

All the other struct page discussion is not about anyone of us wanting
struct page - heck it is a pain to deal with, but then again it is
there for a reason.

In the typical GUP flows we have three uses of a struct page:

 (1) to carry a physical address.  This is mostly through
 struct scatterlist and struct bio_vec.  We could just store
 a magic PFN-like value that encodes the physical address
 and allow looking up a page if it exists, and we had at least
 two attempts at it.  In some way I think that would actually
 make the interfaces cleaner, but Linus has NACKed it in the
 past, so we'll have to convince him first that this is the
 way forward
 (2) to keep a reference to the memory so that it doesn't go away
 under us due to swapping, process exit, unmapping, etc.
 No idea how we want to solve this, but I guess you have
 some smart ideas?
 (3) to make the PTEs dirty after writing to them.  Again no sure
 what our preferred interface here would be

If we solve all of the above problems I'd be more than happy to
go with a non-struct page based interface for BAR P2P.  But we'll
have to solve these issues in a generic way first.
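
To make (1) a bit more concrete, the "magic PFN-like value" is basically
a tagged physical address.  A sketch of such a type could look like the
following (purely illustrative - the name and layout are made up here;
pfn_t already plays a similar trick by stashing flags in its top bits):

	/* carries a physical address plus a "has struct page" flag */
	typedef struct {
		u64 val;
	} phys_vec_t;

	#define PHYS_VEC_HAS_PAGE	(1ULL << 63)

	static inline phys_vec_t phys_vec_from_page(struct page *page)
	{
		return (phys_vec_t){ page_to_phys(page) | PHYS_VEC_HAS_PAGE };
	}

	static inline struct page *phys_vec_to_page(phys_vec_t pv)
	{
		if (!(pv.val & PHYS_VEC_HAS_PAGE))
			return NULL;	/* e.g. P2P BAR space without a page */
		return pfn_to_page(PHYS_PFN(pv.val & ~PHYS_VEC_HAS_PAGE));
	}

The hard part is not the encoding but points (2) and (3), which is why
this needs to be solved generically first.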
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Christoph Hellwig
On Wed, Jan 30, 2019 at 08:44:20PM +, Jason Gunthorpe wrote:
> Not really, for MRs most drivers care about DMA addresses only. The
> only reason struct page ever gets involved is because it is part of
> the GUP, SGL and dma_map family of APIs.

And the only way you get the DMA address is through the dma mapping
APIs.  Which except for the little oddball dma_map_resource expect
a struct page in some form.  And dma_map_resource isn't really up
to speed for full blown P2P.

Now we could and maybe eventually should change all this.  But that
is a prerequisite for doing anything more fancy, and not something
to be hacked around.

> O_DIRECT seems to be the justification for struct page, but nobody is
> signing up to make O_DIRECT have the required special GUP/SGL/P2P flow
> that would be needed to *actually* make that work - so it really isn't
> a justification today.

O_DIRECT is just the messenger.  Anything using GUP will need a struct
page, which is all our interfaces that do I/O directly to user pages.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Christoph Hellwig
On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> I do not see how VMA changes are any different than using struct page
> in respect to userspace exposure. Those vma callback do not need to be
> set by everyone, in fact expectation is that only handful of driver
> will set those.
> 
> How can we do p2p between RDMA and GPU for instance, without exposure
> to userspace ? At some point you need to tell userspace hey this kernel
> does allow you to do that :)

To do RDMA on a memory region you need struct page backing to start
with..
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier

2019-01-31 Thread Peter Xu
On Wed, Jan 30, 2019 at 12:27:40PM +, Jean-Philippe Brucker wrote:
> Hi Peter,

Hi, Jean,

> 
> On 30/01/2019 05:57, Peter Xu wrote:
> > AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> > but that's actually already covered by invalidate_range().  Remove the
> > extra notifier and the chunks.
> 
> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> I understood correctly, when doing reclaim the kernel clears the young
> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> new accesses from devices will go through the IOMMU, set the young bit
> again and the kernel can accurately track page use. I didn't see
> invalidate_range() being called by rmap or vmscan in this case, but
> might just be missing it.
> 
> Two MMU notifiers are used for the young bit, clear_flush_young() and
> clear_young(). I believe the former should invalidate ATCs so that DMA
> accesses participate in PTE aging. Otherwise the kernel can't know that
> the device is still using entries marked 'old'. The latter,
> clear_young() is exempted from invalidating the secondary TLBs by
> mmu_notifier.h and the IOMMU driver doesn't need to implement it.

Yes, I agree with most of your analysis, but IMHO the problem is that the
AMD IOMMU is not really participating in the PTE aging after all.
Please have a look at mn_clear_flush_young() below at [1] - it always
returns zero, no matter whether the page has been accessed by the
device or not.  A real user of it would be someone like KVM (please
see kvm_mmu_notifier_clear_flush_young), where we look up the
shadow PTEs and do test-and-clear on them, then return the result to
the core mm.  That's why I think the AMD driver is currently only
using it as a way to do an extra flush, which IMHO is redundant.
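
Just to make that concrete: a notifier that really participated in aging
would have to return the accessed state it collected from the IOMMU,
roughly like the sketch below.  iommu_test_and_clear_young() is a
made-up helper standing in for whatever accessed-bit tracking the
hardware/driver would need to provide - the point is only that the
current callback has nothing like it and can only ever return zero:

	static int mn_clear_flush_young(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
	{
		int young = 0;

		for (; start < end; start += PAGE_SIZE)
			/* hypothetical helper, not an existing AMD IOMMU API */
			young |= iommu_test_and_clear_young(mn, start);

		return young;
	}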

Thanks,

> > -static int mn_clear_flush_young(struct mmu_notifier *mn,
> > -   struct mm_struct *mm,
> > -   unsigned long start,
> > -   unsigned long end)
> > -{
> > -   for (; start < end; start += PAGE_SIZE)
> > -   __mn_flush_page(mn, start);
> > -
> > -   return 0;

[1]

-- 
Peter Xu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu