Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711

2020-10-08 Thread Ard Biesheuvel
(+ Lorenzo)

On Thu, 8 Oct 2020 at 12:14, Catalin Marinas  wrote:
>
> On Thu, Oct 08, 2020 at 12:05:25PM +0200, Nicolas Saenz Julienne wrote:
> > On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote:
> > > On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote:
> > > > On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote:
> > > > > On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote:
> > > > > > On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne 
> > > > > > wrote:
> > > > > > > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> > > > > > > index 4602e467ca8b..cd0d115ef329 100644
> > > > > > > --- a/drivers/of/fdt.c
> > > > > > > +++ b/drivers/of/fdt.c
> > > > > > > @@ -25,6 +25,7 @@
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > > +#include <linux/dma-direct.h>   /* for zone_dma_bits */
> > > > > > >
> > > > > > >  #include <asm/setup.h>   /* for COMMAND_LINE_SIZE */
> > > > > > >  #include 
> > > > > > > @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void)
> > > > > > > of_scan_flat_dt(early_init_dt_scan_memory, NULL);
> > > > > > >  }
> > > > > > >
> > > > > > > +void __init early_init_dt_update_zone_dma_bits(void)
> > > > > > > +{
> > > > > > > +   unsigned long dt_root = of_get_flat_dt_root();
> > > > > > > +
> > > > > > > +   if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711"))
> > > > > > > +   zone_dma_bits = 30;
> > > > > > > +}
> > > > > >
> > > > > > I think we could keep this entirely in the arm64 
> > > > > > setup_machine_fdt() and
> > > > > > not pollute the core code with RPi4-specific code.
> > > > >
> > > > > Actually, even better, could we not move the check to
> > > > > arm64_memblock_init() when we initialise zone_dma_bits?
> > > >
> > > > I did it this way as I vaguely remembered Rob saying he wanted to 
> > > > centralise
> > > > all early boot fdt code in one place. But I'll be happy to move it 
> > > > there.
> > >
> > > I can see Rob replied and I'm fine if that's his preference. However,
> > > what I don't particularly like is that in the arm64 code, if
> > > zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by
> > > the early_init_dt_update_zone_dma_bits(). What if at some point we'll
> > > get a platform that actually needs 24 here (I truly hope not, but just
> > > the principle of relying on magic values)?
> > >
> > > So rather than guessing, I'd prefer if the arch code can override
> > > ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no
> > > need to explicitly touch the zone_dma_bits variable.
> >
> > Yes, sounds like the way to go. TBH I wasn't happy with that solution either,
> > but couldn't think of a nicer alternative.
> >
> > Sadly I just realised that the series is incomplete: we have RPi4 users that
> > want to boot using ACPI, and this series would break things for them. I'll
> > have a word with them to see what we can do for their use-case.
>
> Is there a way to get some SoC information from ACPI?
>

This is unfortunate. We used ACPI _DMA methods as they were designed
to communicate the DMA limit of the XHCI controller to the OS.

It shouldn't be too hard to match the OEM id field in the DSDT, and
switch to the smaller mask. But it sucks to have to add a quirk like
that.
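
For reference, the ZONE_DMA_BITS_DEFAULT override Catalin suggests above could
look roughly like the sketch below. This is only an illustration: the header
locations and the #ifndef pattern are assumptions, not the series as posted.

/* common code, e.g. include/linux/dma-direct.h (assumed location): */
#ifndef ZONE_DMA_BITS_DEFAULT
#define ZONE_DMA_BITS_DEFAULT	24
#endif

/* arm64 override, in an asm header pulled in before the generic one: */
#define ZONE_DMA_BITS_DEFAULT	32

With an override like that, arm64_memblock_init() no longer has to compare
zone_dma_bits against the magic value 24 to guess whether the FDT code
already lowered it.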


Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

2020-10-08 Thread Thomas Gleixner
On Thu, Oct 08 2020 at 12:10, David Woodhouse wrote:
> On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
>> The overall conclusion for this is:
>> 
>>  1) X2APIC support on bare metal w/o irq remapping is not going to
>> happen unless you:
>> 
>>   - added support in multi-queue devices which utilize managed
>> interrupts
>> 
>>   - audited the whole tree for other assumptions related to the
>> reachability of possible CPUs.
>> 
>> I'm not expecting you to be done with that before I retire so for
>> me it's just not going to happen :)
>
> Makes sense. It probably does mean we should add a BUG_ON for the case
> where IRQ remapping *is* enabled but any device is found which isn't
> behind it. But that's OK.

We can kinda gracefully handle that. See the completely untested and
incomplete patch below.

>>  2) X2APIC support on VIRT is possible if the extended ID magic is
>> supported by the hypervisor because that does not make any CPU
>> unreachable for MSI and therefore the multi-queue muck and
>> everything else just works.
>> 
>> This requires to have either the domain affinity limitation for HPET
>> in place or just to force disable HPET or at least HPET-MSI which is
>> a reasonable tradeoff.
>> 
>> HPET is not required for guests which have kvmclock and
>> APIC/deadline timer and known (hypervisor provided) frequencies.
>
> HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
> *actual* MSI domain. The address/data in the MSI message are completely
> opaque to it, and if the parent domain happens to put meaningful
> information into bits 11-5 of the MSI address, the HPET won't even
> notice.
>
> The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
> address; it's not doing bit-swizzling like the IOAPIC does, which might
> potentially *not* have been able to set certain bits in the MSI.

Indeed. I thought it was crippled in some way, but you're right it has
all the bits.

Thanks,

tglx
---
Subject: x86/iommu: Make interrupt remapping more robust
From: Thomas Gleixner 
Date: Thu, 08 Oct 2020 14:09:44 +0200

Needs to be split into pieces and cover PCI proper. Right now PCI gets a
NULL pointer assigned which makes it explode at the wrong place
later. Also hyperv iommu wants some love.

NOT-Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/apic/io_apic.c  |4 +++-
 arch/x86/kernel/apic/msi.c  |   24 ++--
 drivers/iommu/amd/iommu.c   |6 +++---
 drivers/iommu/intel/irq_remapping.c |4 ++--
 4 files changed, 22 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
info.devid = mpc_ioapic_id(ioapic);
parent = irq_remapping_get_irq_domain();
-   if (!parent)
+   if (IS_ERR(parent))
+   return PTR_ERR(parent);
+   else if (!parent)
parent = x86_vector_domain;
else
name = "IO-APIC-IR";
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
 struct irq_domain *hpet_create_irq_domain(int hpet_id)
 {
struct msi_domain_info *domain_info;
+   struct fwnode_handle *fn = NULL;
struct irq_domain *parent, *d;
struct irq_alloc_info info;
-   struct fwnode_handle *fn;
 
if (x86_vector_domain == NULL)
return NULL;
@@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
	init_irq_alloc_info(&info, NULL);
info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
info.devid = hpet_id;
+
parent = irq_remapping_get_irq_domain();
-   if (parent == NULL)
+   if (IS_ERR(parent))
+   goto fail;
+   else if (!parent)
parent = x86_vector_domain;
else
hpet_msi_controller.name = "IR-HPET-MSI";
 
fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name,
  hpet_id);
-   if (!fn) {
-   kfree(domain_info);
-   return NULL;
-   }
+   if (!fn)
+   goto fail;
 
d = msi_create_irq_domain(fn, domain_info, parent);
-   if (!d) {
-   irq_domain_free_fwnode(fn);
-   kfree(domain_info);
-   }
+   if (!d)
+   goto fail;
return d;
+
+fail:
+   irq_domain_free_fwnode(fn);
+   kfree(domain_info);
+   return NULL;
 }
 
 int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3557,7 +3557,7 @@ static struct irq_domain *get_irq_domain
struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
 
if (!iommu)
-   return NULL;
+   

Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-08 Thread Nicolin Chen
On Thu, Oct 08, 2020 at 11:53:43AM +0200, Thierry Reding wrote:
> On Mon, Oct 05, 2020 at 06:05:46PM -0700, Nicolin Chen wrote:
> > On Mon, Oct 05, 2020 at 11:57:54AM +0200, Thierry Reding wrote:
> > > On Fri, Oct 02, 2020 at 11:58:29AM -0700, Nicolin Chen wrote:
> > > > On Fri, Oct 02, 2020 at 06:02:18PM +0300, Dmitry Osipenko wrote:
> > > > > 02.10.2020 09:08, Nicolin Chen wrote:
> > > > > >  static int tegra_smmu_of_xlate(struct device *dev,
> > > > > >struct of_phandle_args *args)
> > > > > >  {
> > > > > > +   struct platform_device *iommu_pdev = 
> > > > > > of_find_device_by_node(args->np);
> > > > > > +   struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> > > > > > u32 id = args->args[0];
> > > > > >  
> > > > > > +   of_node_put(args->np);
> > > > > 
> > > > > of_find_device_by_node() takes device reference and not the np
> > > > > reference. This is a bug, please remove of_node_put().
> > > > 
> > > > Looks like so. Replacing it with put_device(&iommu_pdev->dev);
> > > 
> > > Putting the put_device() here is wrong, though. You need to make sure
> > > you keep a reference to it as long as you keep accessing the data that
> > > is owned by it.
> > 
> > I am confused. You said in the other reply (to Dmitry) that we do
> > need to put_device(mc->dev), where mc->dev should be the same as
> > iommu_pdev->dev. But here your comments sounds that we should not
> > put_device at all since ->probe_device/group_device/attach_dev()
> > will use it later.
> 
> You need to call put_device() at some point to release the reference
> that you acquired by calling of_find_device_by_node(). If you don't
> release it, you're leaking the reference and the kernel isn't going to
> know when it's safe to delete the device.
> 
> So what I'm saying is that we either release it here, which isn't quite
> right because we do reference data relating to the device later on. And

I see. A small question here by the way: looking at other IOMMU
drivers that call driver_find_device_by_fwnode(), I found that most of
them call put_device() right after the function call, and then
dev_get_drvdata() after putting the device.

Feels like they are doing it wrongly?

> because it isn't quite right there should be a reason to justify it,
> which is that the SMMU parent device is the same as the MC, so the
> reference count isn't strictly necessary. But that's not quite obvious,
> so highlighting it in a comment makes sense.
> 
> The other alternative is to not call put_device() here and hold on to
> the reference as long as you keep using "mc". This might be difficult to
> implement because it may not be obvious where to release it. I think
> this is the better alternative, but if it's too complicated to implement
> it might not be worth it.

I feel so too. The device reference is taken in of_xlate(), which does
not have an obvious counterpart function where it could be released.
So I'll just remove the put_device() and add a comment, as you suggested.
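
A rough sketch of that shape, for illustration only (everything past the
quoted hunk above is reconstructed and may not match the posted patch):

static int tegra_smmu_of_xlate(struct device *dev,
			       struct of_phandle_args *args)
{
	struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
	struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
	u32 id = args->args[0];

	/*
	 * Deliberately no put_device() here: the SMMU parent device is the
	 * MC itself, so keeping the reference taken by
	 * of_find_device_by_node() is harmless and keeps "mc" valid for the
	 * later ->probe_device()/->attach_dev() calls.
	 */
	dev_iommu_priv_set(dev, mc->smmu);

	return iommu_fwspec_add_ids(dev, &id, 1);
}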

> > > Like I said earlier, this is a bit weird in this case because we're
> > > self-referencing, so iommu_pdev->dev is going to stay around as long as
> > > the SMMU is. However, it might be worth to properly track the lifetime
> > > anyway just so that the code can serve as a good example of how to do
> > > things.
> > 
> > What's this "track-the-lifetime"?
> 
> This basically just means that SMMU needs to ensure that MC stays alive
> (by holding a reference to it) as long as SMMU uses it. If the last
> reference to MC is dropped, then the mc pointer and potentially anything
> that it points to will become dangling. If you were to drop the last
> reference at this point, then on the next line the mc pointer could
> already be invalid.
> 
> That's how it generally works, anyway. What's special about this use-
> case is that the SMMU and MC are the same device, so it should be safe
> to omit this additional tracking because the IOMMU tracking should take
> care of that already.

Okay.

> > > If you decide to go for the shortcut and not track this reference
> > > properly, then at least you need to add a comment as to why it is safe
> > > to do in this case. This ensures that readers are aware of the
> > > circumstances and don't copy this bad code into a context where the
> > > circumstances are different.
> > 
> > I don't quite get this "shortcut" here either...mind elaborating?
> 
> The shortcut is taking advantage of the knowledge that the SMMU and the
> MC are the same device and therefore not properly track the MC object.
> Given that their code is located in different locations, this isn't
> obvious to the casual reader of the code, so they may assume that this
> is the normal way to do things. To avoid that, the code should have a
> comment explaining why that is.

Got it. Thanks!

Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711

2020-10-08 Thread Jeremy Linton

On 10/8/20 2:43 PM, Ard Biesheuvel wrote:

(+ Lorenzo)

On Thu, 8 Oct 2020 at 12:14, Catalin Marinas  wrote:


On Thu, Oct 08, 2020 at 12:05:25PM +0200, Nicolas Saenz Julienne wrote:

On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote:

On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote:

On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote:

On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote:

On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne wrote:

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 4602e467ca8b..cd0d115ef329 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  #include 
+#include <linux/dma-direct.h>   /* for zone_dma_bits */

  #include <asm/setup.h>   /* for COMMAND_LINE_SIZE */
  #include 
@@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void)
 of_scan_flat_dt(early_init_dt_scan_memory, NULL);
  }

+void __init early_init_dt_update_zone_dma_bits(void)
+{
+   unsigned long dt_root = of_get_flat_dt_root();
+
+   if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711"))
+   zone_dma_bits = 30;
+}


I think we could keep this entirely in the arm64 setup_machine_fdt() and
not pollute the core code with RPi4-specific code.


Actually, even better, could we not move the check to
arm64_memblock_init() when we initialise zone_dma_bits?


I did it this way as I vaguely remembered Rob saying he wanted to centralise
all early boot fdt code in one place. But I'll be happy to move it there.


I can see Rob replied and I'm fine if that's his preference. However,
what I don't particularly like is that in the arm64 code, if
zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by
the early_init_dt_update_zone_dma_bits(). What if at some point we'll
get a platform that actually needs 24 here (I truly hope not, but just
the principle of relying on magic values)?

So rather than guessing, I'd prefer if the arch code can override
ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no
need to explicitly touch the zone_dma_bits variable.


Yes, sounds like the way to go. TBH I wasn't happy with that solution either,
but couldn't think of a nicer alternative.

Sadly I just realised that the series is incomplete: we have RPi4 users that
want to boot using ACPI, and this series would break things for them. I'll
have a word with them to see what we can do for their use-case.


Is there a way to get some SoC information from ACPI?



This is unfortunate. We used ACPI _DMA methods as they were designed
to communicate the DMA limit of the XHCI controller to the OS.

It shouldn't be too hard to match the OEM id field in the DSDT, and
switch to the smaller mask. But it sucks to have to add a quirk like
that.
It also requires delaying setting the arm64_dma_phys_limit a bit, but
that doesn't appear to be causing a problem. I've been boot/compiling 
with a patch like:


diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c
index cada0b816c8a..9dfe776c1c75 100644
--- a/arch/arm64/kernel/acpi.c
+++ b/arch/arm64/kernel/acpi.c
@@ -14,6 +14,7 @@

 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -168,6 +169,11 @@ static int __init acpi_fadt_sanity_check(void)
ret = -EINVAL;
}

+   if (!strncmp(table->oem_id, "RPIFDN", ACPI_OEM_ID_SIZE) &&
+   !strncmp(table->oem_table_id, "RPI4", ACPI_OEM_TABLE_ID_SIZE) &&
+   table->oem_revision <= 0x200)  {
+   zone_dma_bits = 30;
+   }
 out:
/*
 * acpi_get_table() creates FADT table mapping that
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index cd5caca8a929..6c8aaf1570ce 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -195,6 +195,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};

 #ifdef CONFIG_ZONE_DMA
+   arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
@@ -393,7 +394,6 @@ void __init arm64_memblock_init(void)
 */
if (zone_dma_bits == ZONE_DMA_BITS_DEFAULT)
zone_dma_bits = 32;
-   arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
}

if (IS_ENABLED(CONFIG_ZONE_DMA32))




Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

2020-10-08 Thread David Woodhouse
On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
> TLDR & HTF;
> 
> Multiqueue devices want to have at max 1 queue per CPU or if the device
> has less queues than CPUs they want the queues to have a fixed
> association to a set of CPUs.
> 
> At setup time this is established considering possible CPUs to handle
> 'physical' hotplug correctly.
> 
> If a queue has no online CPUs it cannot be started. If it's active and
> the last CPU goes down then it's quiesced and stopped and the core code
> shuts down the interrupt and does not move it to a still online CPU.
> 
> So with your hackery, we end up in a situation where we have a large
> possible mask, but not all CPUs in that mask can be reached, which means
> in a 1 queue per CPU scenario all unreachable CPUs would have
> dysfunctional queues.
> 
> So that spreading algorithm needs to know about this limitation.

OK, thanks. So the queue exists, with an MSI assigned to point to an
offline CPU(s), but it cannot actually be used until/unless at least
one CPU in its mask comes online.

So when I said I wanted to try treating "reachable" the same way as
"online", that would mean the queue can't start until/unless at least
one *reachable* CPU in its mask comes online.


The underlying problem here is that until a CPU comes online, we don't
actually *know* if it's reachable or not.

So if we want to carefully create the affinity masks at setup time so that
they don't include any unreachable CPUs... that basically means we
don't include any non-present CPUs at all (unless they've been added
once and then removed).

That's because — at least on x86 — we don't assign CPU numbers to CPUs
which haven't yet been added. Theoretically we *could* but if there are
more than NR_CPUS listed in (e.g.) the MADT then we could run out of
CPU numbers. Then if the admin hotplugs one of the CPUs we *didn't*
have space for, we'd not be able to handle it.

I suppose there might be options like pre-assigning CPU numbers only to
non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU
numbers, so some not-yet-assigned CPU#s are for low APICIDs and some
are for high APICIDs but it doesn't actually have to be a 1:1
predetermined mapping.

But those really do seem like hacks which might only apply on x86,
while the generic approach of treating "reachable" like "online" seems
like it would work in other cases too.

Fundamentally, there are three sets of CPUs. There are those known to
be reachable, those known not to be, and those which are not yet known.

So another approach we could use is to work with a cpumask of those
*known* not to be reachable, and to filter those *out* of the prebuilt
affinities. That gives us basically the right behaviour without
hotplug, but does include absent CPUs in a mask that *if* they are ever
added, wouldn't be able to receive the IRQ. Which does mean we'd have
to refrain from bringing up the corresponding queue. 
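
For what it's worth, a minimal sketch of that filtering step could look like
the following; the mask name and where it gets populated are made up for
illustration, nothing like this exists in the tree:

/* CPUs positively known to be unreachable for MSI (hypothetical symbol) */
static struct cpumask msi_unreachable_cpus;

static void msi_filter_prebuilt_affinity(struct cpumask *mask)
{
	/* drop only the CPUs we already know can never be targeted */
	cpumask_andnot(mask, mask, &msi_unreachable_cpus);
}

Absent CPUs stay in the mask, which is exactly the case described above where
the corresponding queue must not be brought up if such a CPU is ever added.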




Re: [Bug 209321] DMAR: [DMA Read] Request device [03:00.0] PASID ffffffff fault addr fffd3000 [fault reason 06] PTE Read access is not set

2020-10-08 Thread Joerg Roedel
Hi Bjorn,

On Wed, Oct 07, 2020 at 10:43:14AM -0500, Bjorn Helgaas wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=209321
> 
> Not much detail in the bugzilla yet, but apparently this started in
> v5.8.0-rc1:
> 
>   DMAR: [DMA Read] Request device [03:00.0] PASID ffffffff fault addr
> fffd3000 [fault reason 06] PTE Read access is not set
> 
> Currently assigned to Driver/PCI, but not clear to me yet whether PCI
> is the culprit or the victim.

Thanks for the heads-up, looks like Intel is already on it.

Regards,

Joerg


Re: [PATCH 2/3] drm/msm: add DRM_MSM_GEM_SYNC_CACHE for non-coherent cache maintenance

2020-10-08 Thread Joerg Roedel
On Tue, Oct 06, 2020 at 08:23:06AM +0100, Christoph Hellwig wrote:
> If people want to use the "raw" IOMMU API with non-cache-coherent
> devices we'll need a cache maintenance API that goes along with it.
> It could either be formally part of the IOMMU API or be separate.

The IOMMU-API does not care about the caching effects of DMA; it manages
IO address spaces for devices. I also don't see how this would be
implemented: the IOMMU-API has no concept of handles for mapped ranges
and does not care about the CPU virtual addresses (which are needed for
cache flushes) of the memory it maps into IO page-tables.

So I think a cache management API should be separate from the IOMMU-API.

Regards,

Joerg
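
To illustrate the mismatch Joerg points out: iommu_map() only ever sees a
physical address, so for a non-coherent device the cache maintenance has to
happen next to the mapping call, driven by the caller that still knows the
kernel virtual address. A hedged sketch; noncoherent_cache_clean() is a
made-up placeholder, not an existing kernel function:

static int map_noncoherent_buffer(struct iommu_domain *domain,
				  unsigned long iova, void *buf, size_t size)
{
	int ret;

	/* the IOMMU-API only takes the physical address ... */
	ret = iommu_map(domain, iova, virt_to_phys(buf), size,
			IOMMU_READ | IOMMU_WRITE);
	if (ret)
		return ret;

	/* ... so the CPU cache maintenance has to be done here, where the
	 * caller still has the virtual address. */
	noncoherent_cache_clean(buf, size);
	return 0;
}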


Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711

2020-10-08 Thread Nicolas Saenz Julienne
Hi Catalin, sorry for the late reply.

On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote:
> On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote:
> > On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote:
> > > On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote:
> > > > On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne wrote:
> > > > > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> > > > > index 4602e467ca8b..cd0d115ef329 100644
> > > > > --- a/drivers/of/fdt.c
> > > > > +++ b/drivers/of/fdt.c
> > > > > @@ -25,6 +25,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > +#include <linux/dma-direct.h> /* for zone_dma_bits */
> > > > >  
> > > > >  #include <asm/setup.h>   /* for COMMAND_LINE_SIZE */
> > > > >  #include 
> > > > > @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void)
> > > > >   of_scan_flat_dt(early_init_dt_scan_memory, NULL);
> > > > >  }
> > > > >  
> > > > > +void __init early_init_dt_update_zone_dma_bits(void)
> > > > > +{
> > > > > + unsigned long dt_root = of_get_flat_dt_root();
> > > > > +
> > > > > + if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711"))
> > > > > + zone_dma_bits = 30;
> > > > > +}
> > > > 
> > > > I think we could keep this entirely in the arm64 setup_machine_fdt() and
> > > > not pollute the core code with RPi4-specific code.
> > > 
> > > Actually, even better, could we not move the check to
> > > arm64_memblock_init() when we initialise zone_dma_bits?
> > 
> > I did it this way as I vaguely remembered Rob saying he wanted to centralise
> > all early boot fdt code in one place. But I'll be happy to move it there.
> 
> I can see Rob replied and I'm fine if that's his preference. However,
> what I don't particularly like is that in the arm64 code, if
> zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by
> the early_init_dt_update_zone_dma_bits(). What if at some point we'll
> get a platform that actually needs 24 here (I truly hope not, but just
> the principle of relying on magic values)?
> 
> So rather than guessing, I'd prefer if the arch code can override
> ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no
> need to explicitly touch the zone_dma_bits variable.

Yes, sounds like the way to go. TBH I wasn't happy with that solution either,
but couldn't think of a nicer alternative.

Sadly I just realised that the series is incomplete: we have RPi4 users that
want to boot using ACPI, and this series would break things for them. I'll
have a word with them to see what we can do for their use-case.

Regards,
Nicolas




Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711

2020-10-08 Thread Catalin Marinas
On Thu, Oct 08, 2020 at 12:05:25PM +0200, Nicolas Saenz Julienne wrote:
> On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote:
> > On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote:
> > > On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote:
> > > > On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote:
> > > > > On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne 
> > > > > wrote:
> > > > > > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> > > > > > index 4602e467ca8b..cd0d115ef329 100644
> > > > > > --- a/drivers/of/fdt.c
> > > > > > +++ b/drivers/of/fdt.c
> > > > > > @@ -25,6 +25,7 @@
> > > > > >  #include 
> > > > > >  #include 
> > > > > >  #include 
> > > > > > +#include <linux/dma-direct.h>   /* for zone_dma_bits */
> > > > > >  
> > > > > >  #include <asm/setup.h>   /* for COMMAND_LINE_SIZE */
> > > > > >  #include 
> > > > > > @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void)
> > > > > > of_scan_flat_dt(early_init_dt_scan_memory, NULL);
> > > > > >  }
> > > > > >  
> > > > > > +void __init early_init_dt_update_zone_dma_bits(void)
> > > > > > +{
> > > > > > +   unsigned long dt_root = of_get_flat_dt_root();
> > > > > > +
> > > > > > +   if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711"))
> > > > > > +   zone_dma_bits = 30;
> > > > > > +}
> > > > > 
> > > > > I think we could keep this entirely in the arm64 setup_machine_fdt() 
> > > > > and
> > > > > not pollute the core code with RPi4-specific code.
> > > > 
> > > > Actually, even better, could we not move the check to
> > > > arm64_memblock_init() when we initialise zone_dma_bits?
> > > 
> > > I did it this way as I vaguely remembered Rob saying he wanted to 
> > > centralise
> > > all early boot fdt code in one place. But I'll be happy to move it there.
> > 
> > I can see Rob replied and I'm fine if that's his preference. However,
> > what I don't particularly like is that in the arm64 code, if
> > zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by
> > the early_init_dt_update_zone_dma_bits(). What if at some point we'll
> > get a platform that actually needs 24 here (I truly hope not, but just
> > the principle of relying on magic values)?
> > 
> > So rather than guessing, I'd prefer if the arch code can override
> > ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no
> > need to explicitly touch the zone_dma_bits variable.
> 
> Yes, sounds like the way to go. TBH I wasn't happy with that solution either,
> but couldn't think of a nicer alternative.
> 
> Sadly I just realised that the series is incomplete: we have RPi4 users that
> want to boot using ACPI, and this series would break things for them. I'll
> have a word with them to see what we can do for their use-case.

Is there a way to get some SoC information from ACPI?

-- 
Catalin


Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

2020-10-08 Thread Thomas Gleixner
On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at max 1 queue per CPU or if the device
>> has less queues than CPUs they want the queues to have a fixed
>> association to a set of CPUs.
>> 
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>> 
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>> 
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> dysfunctional queues.
>> 
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point to an
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before online, i.e. when the CPU is registered which is
either at boot time for present CPUs or at 'physical' hotplug.

> So if we want to carefully create the affinity masks at setup time so that
> they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet
> known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that *if* they are ever
> added, wouldn't be able to receive the IRQ. Which does mean we'd have
> to refrain from bringing up the corresponding queue. 

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queue's assigned affinity mask. As
of today it also guarantees that these CPUs are reachable once they come
online.

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
interrupt is functional. If not it has to redirect any requests to
some actually functional queue.

 2) On unplug it needs to be able to figure out whether the interrupt
will shutdown because the outgoing CPU is the last reachable in the
group and if there are still online but unreachable CPUs then use
the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
happen unless you:

  - added support in multi-queue devices which utilize managed
interrupts

  - audited the whole tree for other assumptions related to the
reachability of possible CPUs.

I'm not expecting you to be done with that before I retire so for
me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
supported by the hypervisor because that does not make any CPU
unreachable for MSI and therefore the multi-queue muck and
everything else just works.

This requires to have either the domain affinity limitation for HPET
in place or just to force disable HPET or at least HPET-MSI which is
a reasonable tradeoff.

HPET is not required for guests which have kvmclock and
APIC/deadline timer and known (hypervisor provided) frequencies.

Anything else is just wishful thinking, really.

Thanks,

tglx

Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-08 Thread Thierry Reding
On Mon, Oct 05, 2020 at 06:05:46PM -0700, Nicolin Chen wrote:
> On Mon, Oct 05, 2020 at 11:57:54AM +0200, Thierry Reding wrote:
> > On Fri, Oct 02, 2020 at 11:58:29AM -0700, Nicolin Chen wrote:
> > > On Fri, Oct 02, 2020 at 06:02:18PM +0300, Dmitry Osipenko wrote:
> > > > 02.10.2020 09:08, Nicolin Chen wrote:
> > > > >  static int tegra_smmu_of_xlate(struct device *dev,
> > > > >  struct of_phandle_args *args)
> > > > >  {
> > > > > + struct platform_device *iommu_pdev = 
> > > > > of_find_device_by_node(args->np);
> > > > > + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> > > > >   u32 id = args->args[0];
> > > > >  
> > > > > + of_node_put(args->np);
> > > > 
> > > > of_find_device_by_node() takes device reference and not the np
> > > > reference. This is a bug, please remove of_node_put().
> > > 
> > > Looks like so. Replacing it with put_device(&iommu_pdev->dev);
> > 
> > Putting the put_device() here is wrong, though. You need to make sure
> > you keep a reference to it as long as you keep accessing the data that
> > is owned by it.
> 
> I am confused. You said in the other reply (to Dmitry) that we do
> need to put_device(mc->dev), where mc->dev should be the same as
> iommu_pdev->dev. But here your comments sounds that we should not
> put_device at all since ->probe_device/group_device/attach_dev()
> will use it later.

You need to call put_device() at some point to release the reference
that you acquired by calling of_find_device_by_node(). If you don't
release it, you're leaking the reference and the kernel isn't going to
know when it's safe to delete the device.

So what I'm saying is that we either release it here, which isn't quite
right because we do reference data relating to the device later on. And
because it isn't quite right there should be a reason to justify it,
which is that the SMMU parent device is the same as the MC, so the
reference count isn't strictly necessary. But that's not quite obvious,
so highlighting it in a comment makes sense.

The other alternative is to not call put_device() here and hold on to
the reference as long as you keep using "mc". This might be difficult to
implement because it may not be obvious where to release it. I think
this is the better alternative, but if it's too complicated to implement
it might not be worth it.

> > Like I said earlier, this is a bit weird in this case because we're
> > self-referencing, so iommu_pdev->dev is going to stay around as long as
> > the SMMU is. However, it might be worth to properly track the lifetime
> > anyway just so that the code can serve as a good example of how to do
> > things.
> 
> What's this "track-the-lifetime"?

This basically just means that SMMU needs to ensure that MC stays alive
(by holding a reference to it) as long as SMMU uses it. If the last
reference to MC is dropped, then the mc pointer and potentially anything
that it points to will become dangling. If you were to drop the last
reference at this point, then on the next line the mc pointer could
already be invalid.

That's how it generally works, anyway. What's special about this use-
case is that the SMMU and MC are the same device, so it should be safe
to omit this additional tracking because the IOMMU tracking should take
care of that already.

> > If you decide to go for the shortcut and not track this reference
> > properly, then at least you need to add a comment as to why it is safe
> > to do in this case. This ensures that readers are aware of the
> > circumstances and don't copy this bad code into a context where the
> > circumstances are different.
> 
> I don't quite get this "shortcut" here either...mind elaborating?

The shortcut is taking advantage of the knowledge that the SMMU and the
MC are the same device and therefore not properly track the MC object.
Given that their code is located in different locations, this isn't
obvious to the casual reader of the code, so they may assume that this
is the normal way to do things. To avoid that, the code should have a
comment explaining why that is.

Thierry
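
The reference-counting rule described here boils down to the usual pattern
below; a generic sketch, not Tegra-specific, with "struct my_data" and the
function name as placeholders:

static int example_use_provider(struct device_node *np)
{
	/* takes a reference on the producing device */
	struct platform_device *pdev = of_find_device_by_node(np);
	struct my_data *data;

	if (!pdev)
		return -ENODEV;

	data = platform_get_drvdata(pdev);
	/* ... use "data" only while the reference is held ... */

	/* drop it once nothing owned by pdev is accessed any more */
	put_device(&pdev->dev);
	return 0;
}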



Re: [PATCH v4 2/2] iommu/tegra-smmu: Expand mutex protection range

2020-10-08 Thread Thierry Reding
On Mon, Sep 28, 2020 at 11:13:25PM -0700, Nicolin Chen wrote:
> This is used to protect against a potential race condition at use_count,
> since probes of client drivers, calling attach_dev(), may run
> concurrently.
> 
> Signed-off-by: Nicolin Chen 
> ---
> 
> Changelog
> v3->v4:
>  * Fixed typo "Expend" => "Expand"
> v2->v3:
>  * Renamed label "err_unlock" to "unlock"
> v1->v2:
>  * N/A
> 
>  drivers/iommu/tegra-smmu.c | 34 +-
>  1 file changed, 21 insertions(+), 13 deletions(-)

Acked-by: Thierry Reding 



Re: [PATCH v4 1/2] iommu/tegra-smmu: Unwrap tegra_smmu_group_get

2020-10-08 Thread Thierry Reding
On Mon, Sep 28, 2020 at 11:13:24PM -0700, Nicolin Chen wrote:
> The tegra_smmu_group_get was added to group devices in different
> SWGROUPs and it'd return a NULL group pointer upon a mismatch at
> tegra_smmu_find_group(), so for most of clients/devices, it very
> likely would mismatch and need a fallback generic_device_group().
> 
> But now tegra_smmu_group_get handles devices in the same SWGROUP too,
> which means that it would allocate a group for every new SWGROUP
> or would directly return an existing one upon matching a SWGROUP,
> i.e. any device will go through this function.
> 
> So possibility of having a NULL group pointer in device_group()
> is upon failure of either devm_kzalloc() or iommu_group_alloc().
> In either case, calling generic_device_group() no longer makes
> sense. Especially in the devm_kzalloc() failure case, it'd cause a
> problem if it fails at devm_kzalloc() yet succeeds at a fallback
> generic_device_group(), because it does not create a group->list
> for other devices to match.
> 
> This patch simply unwraps the function to clean it up.
> 
> Signed-off-by: Nicolin Chen 
> ---
> 
> Changelog
> v2->v4:
>  * N/A
> v1->v2:
>  * Changed type of swgroup to "unsigned int", following Thierry's
>    comments.
> 
>  drivers/iommu/tegra-smmu.c | 19 ---
>  1 file changed, 4 insertions(+), 15 deletions(-)

Acked-by: Thierry Reding 



Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

2020-10-08 Thread David Woodhouse
On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
> The overall conclusion for this is:
> 
>  1) X2APIC support on bare metal w/o irq remapping is not going to
> happen unless you:
> 
>   - added support in multi-queue devices which utilize managed
> interrupts
> 
>   - audited the whole tree for other assumptions related to the
> reachability of possible CPUs.
> 
> I'm not expecting you to be done with that before I retire so for
> me it's just not going to happen :)

Makes sense. It probably does mean we should add a BUG_ON for the case
where IRQ remapping *is* enabled but any device is found which isn't
behind it. But that's OK.

>  2) X2APIC support on VIRT is possible if the extended ID magic is
> supported by the hypervisor because that does not make any CPU
> unreachable for MSI and therefore the multi-queue muck and
> everything else just works.
> 
> This requires to have either the domain affinity limitation for HPET
> in place or just to force disable HPET or at least HPET-MSI which is
> a reasonable tradeoff.
> 
> HPET is not required for guests which have kvmclock and
> APIC/deadline timer and known (hypervisor provided) frequencies.

HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
*actual* MSI domain. The address/data in the MSI message are completely
opaque to it, and if the parent domain happens to put meaningful
information into bits 11-5 of the MSI address, the HPET won't even
notice.

The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
address; it's not doing bit-swizzling like the IOAPIC does, which might
potentially *not* have been able to set certain bits in the MSI.

