Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711
(+ Lorenzo) On Thu, 8 Oct 2020 at 12:14, Catalin Marinas wrote: > > On Thu, Oct 08, 2020 at 12:05:25PM +0200, Nicolas Saenz Julienne wrote: > > On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote: > > > On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote: > > > > On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote: > > > > > On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote: > > > > > > On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne > > > > > > wrote: > > > > > > > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c > > > > > > > index 4602e467ca8b..cd0d115ef329 100644 > > > > > > > --- a/drivers/of/fdt.c > > > > > > > +++ b/drivers/of/fdt.c > > > > > > > @@ -25,6 +25,7 @@ > > > > > > > #include > > > > > > > #include > > > > > > > #include > > > > > > > +#include /* for zone_dma_bits */ > > > > > > > > > > > > > > #include /* for COMMAND_LINE_SIZE */ > > > > > > > #include > > > > > > > @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void) > > > > > > > of_scan_flat_dt(early_init_dt_scan_memory, NULL); > > > > > > > } > > > > > > > > > > > > > > +void __init early_init_dt_update_zone_dma_bits(void) > > > > > > > +{ > > > > > > > + unsigned long dt_root = of_get_flat_dt_root(); > > > > > > > + > > > > > > > + if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711")) > > > > > > > + zone_dma_bits = 30; > > > > > > > +} > > > > > > > > > > > > I think we could keep this entirely in the arm64 > > > > > > setup_machine_fdt() and > > > > > > not pollute the core code with RPi4-specific code. > > > > > > > > > > Actually, even better, could we not move the check to > > > > > arm64_memblock_init() when we initialise zone_dma_bits? > > > > > > > > I did it this way as I vaguely remembered Rob saying he wanted to > > > > centralise > > > > all early boot fdt code in one place. But I'll be happy to move it > > > > there. > > > > > > I can see Rob replied and I'm fine if that's his preference. However, > > > what I don't particularly like is that in the arm64 code, if > > > zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by > > > the early_init_dt_update_zone_dma_bits(). What if at some point we'll > > > get a platform that actually needs 24 here (I truly hope not, but just > > > the principle of relying on magic values)? > > > > > > So rather than guessing, I'd prefer if the arch code can override > > > ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no > > > need to explicitly touch the zone_dma_bits variable. > > > > Yes, sonds like the way to go. TBH I wasn't happy with that solution either, > > but couldn't think of a nicer alternative. > > > > Sadly I just realised that the series is incomplete, we have RPi4 users that > > want to boot unsing ACPI, and this series would break things for them. I'll > > have a word with them to see what we can do for their use-case. > > Is there a way to get some SoC information from ACPI? > This is unfortunate. We used ACPI _DMA methods as they were designed to communicate the DMA limit of the XHCI controller to the OS. It shouldn't be too hard to match the OEM id field in the DSDT, and switch to the smaller mask. But it sucks to have to add a quirk like that. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
On Thu, Oct 08 2020 at 12:10, David Woodhouse wrote: > On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote: >> The overall conclusion for this is: >> >> 1) X2APIC support on bare metal w/o irq remapping is not going to >> happen unless you: >> >> - added support in multi-queue devices which utilize managed >> interrupts >> >> - audited the whole tree for other assumptions related to the >> reachability of possible CPUs. >> >> I'm not expecting you to be done with that before I retire so for >> me it's just not going to happen :) > > Makes sense. It probably does mean we should a BUG_ON for the case > where IRQ remapping *is* enabled but any device is found which isn't > behind it. But that's OK. We can kinda gracefully handle that. See the completely untested and incomplete patch below. >> 2) X2APIC support on VIRT is possible if the extended ID magic is >> supported by the hypervisor because that does not make any CPU >> unreachable for MSI and therefore the multi-queue muck and >> everything else just works. >> >> This requires to have either the domain affinity limitation for HPET >> in place or just to force disable HPET or at least HPET-MSI which is >> a reasonable tradeoff. >> >> HPET is not required for guests which have kvmclock and >> APIC/deadline timer and known (hypervisor provided) frequencies. > > HPET-MSI should work fine. Like the IOAPIC, it's just a child of the > *actual* MSI domain. The address/data in the MSI message are completely > opaque to it, and if the parent domain happens to put meaningful > information into bits 11-5 of the MSI address, the HPET won't even > notice. > > The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI > address; it's not doing bit-swizzling like the IOAPIC does, which might > potentially *not* have been able to set certain bits in the MSI. Indeed. I thought it was crippled in some way, but you're right it has all the bits. Thanks, tglx --- Subject: x86/iommu: Make interrupt remapping more robust From: Thomas Gleixner Date: Thu, 08 Oct 2020 14:09:44 +0200 Needs to be split into pieces and cover PCI proper. Right now PCI gets a NULL pointer assigned which makes it explode at the wrong place later. Also hyperv iommu wants some love. NOT-Signed-off-by: Thomas Gleixner --- arch/x86/kernel/apic/io_apic.c |4 +++- arch/x86/kernel/apic/msi.c | 24 ++-- drivers/iommu/amd/iommu.c |6 +++--- drivers/iommu/intel/irq_remapping.c |4 ++-- 4 files changed, 22 insertions(+), 16 deletions(-) --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT; info.devid = mpc_ioapic_id(ioapic); parent = irq_remapping_get_irq_domain(); - if (!parent) + if (IS_ERR(parent)) + return PTR_ERR(parent); + else if (!parent) parent = x86_vector_domain; else name = "IO-APIC-IR"; --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d struct irq_domain *hpet_create_irq_domain(int hpet_id) { struct msi_domain_info *domain_info; + struct fwnode_handle *fn = NULL; struct irq_domain *parent, *d; struct irq_alloc_info info; - struct fwnode_handle *fn; if (x86_vector_domain == NULL) return NULL; @@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai init_irq_alloc_info(, NULL); info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT; info.devid = hpet_id; + parent = irq_remapping_get_irq_domain(); - if (parent == NULL) + if (IS_ERR(parent)) + goto fail; + else if (!parent) parent = x86_vector_domain; else hpet_msi_controller.name = "IR-HPET-MSI"; fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name, hpet_id); - if (!fn) { - kfree(domain_info); - return NULL; - } + if (!fn) + goto fail; d = msi_create_irq_domain(fn, domain_info, parent); - if (!d) { - irq_domain_free_fwnode(fn); - kfree(domain_info); - } + if (!d) + goto fail; return d; + +fail: + irq_domain_free_fwnode(fn); + kfree(domain_info); + return NULL; } int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc, --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -3557,7 +3557,7 @@ static struct irq_domain *get_irq_domain struct amd_iommu *iommu = amd_iommu_rlookup_table[devid]; if (!iommu) - return NULL; +
Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()
On Thu, Oct 08, 2020 at 11:53:43AM +0200, Thierry Reding wrote: > On Mon, Oct 05, 2020 at 06:05:46PM -0700, Nicolin Chen wrote: > > On Mon, Oct 05, 2020 at 11:57:54AM +0200, Thierry Reding wrote: > > > On Fri, Oct 02, 2020 at 11:58:29AM -0700, Nicolin Chen wrote: > > > > On Fri, Oct 02, 2020 at 06:02:18PM +0300, Dmitry Osipenko wrote: > > > > > 02.10.2020 09:08, Nicolin Chen пишет: > > > > > > static int tegra_smmu_of_xlate(struct device *dev, > > > > > >struct of_phandle_args *args) > > > > > > { > > > > > > + struct platform_device *iommu_pdev = > > > > > > of_find_device_by_node(args->np); > > > > > > + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev); > > > > > > u32 id = args->args[0]; > > > > > > > > > > > > + of_node_put(args->np); > > > > > > > > > > of_find_device_by_node() takes device reference and not the np > > > > > reference. This is a bug, please remove of_node_put(). > > > > > > > > Looks like so. Replacing it with put_device(_pdev->dev); > > > > > > Putting the put_device() here is wrong, though. You need to make sure > > > you keep a reference to it as long as you keep accessing the data that > > > is owned by it. > > > > I am confused. You said in the other reply (to Dmitry) that we do > > need to put_device(mc->dev), where mc->dev should be the same as > > iommu_pdev->dev. But here your comments sounds that we should not > > put_device at all since ->probe_device/group_device/attach_dev() > > will use it later. > > You need to call put_device() at some point to release the reference > that you acquired by calling of_find_device_by_node(). If you don't > release it, you're leaking the reference and the kernel isn't going to > know when it's safe to delete the device. > > So what I'm saying is that we either release it here, which isn't quite > right because we do reference data relating to the device later on. And I see. A small question here by the way: By looking at other IOMMU drivers that are calling driver_find_device_by_fwnode() function, I found that most of them put_device right after the function call, and dev_get_drvdata() after putting the device.. Feels like they are doing it wrongly? > because it isn't quite right there should be a reason to justify it, > which is that the SMMU parent device is the same as the MC, so the > reference count isn't strictly necessary. But that's not quite obvious, > so highlighting it in a comment makes sense. > > The other alternative is to not call put_device() here and keep on to > the reference as long as you keep using "mc". This might be difficult to > implement because it may not be obvious where to release it. I think > this is the better alternative, but if it's too complicated to implement > it might not be worth it. I feel so too. The dev is got at of_xlate() that does not have an obvious counterpart function. So I'll just remove put_device() and put a line of comments, as you suggested. > > > Like I said earlier, this is a bit weird in this case because we're > > > self-referencing, so iommu_pdev->dev is going to stay around as long as > > > the SMMU is. However, it might be worth to properly track the lifetime > > > anyway just so that the code can serve as a good example of how to do > > > things. > > > > What's this "track-the-lifetime"? > > This basically just means that SMMU needs to ensure that MC stays alive > (by holding a reference to it) as long as SMMU uses it. If the last > reference to MC is dropped, then the mc pointer and potentially anything > that it points to will become dangling. If you were to drop the last > reference at this point, then on the next line the mc pointer could > already be invalid. > > That's how it generally works, anyway. What's special about this use- > case is that the SMMU and MC are the same device, so it should be safe > to omit this additional tracking because the IOMMU tracking should take > care of that already. Okay. > > > If you decide to go for the shortcut and not track this reference > > > properly, then at least you need to add a comment as to why it is safe > > > to do in this case. This ensures that readers are away of the > > > circumstances and don't copy this bad code into a context where the > > > circumstances are different. > > > > I don't quite get this "shortcut" here either...mind elaborating? > > The shortcut is taking advantage of the knowledge that the SMMU and the > MC are the same device and therefore not properly track the MC object. > Given that their code is located in different locations, this isn't > obvious to the casual reader of the code, so they may assume that this > is the normal way to do things. To avoid that, the code should have a > comment explaining why that is. Got it. Thanks! ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711
On 10/8/20 2:43 PM, Ard Biesheuvel wrote: (+ Lorenzo) On Thu, 8 Oct 2020 at 12:14, Catalin Marinas wrote: On Thu, Oct 08, 2020 at 12:05:25PM +0200, Nicolas Saenz Julienne wrote: On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote: On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote: On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote: On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote: On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne wrote: diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index 4602e467ca8b..cd0d115ef329 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -25,6 +25,7 @@ #include #include #include +#include /* for zone_dma_bits */ #include /* for COMMAND_LINE_SIZE */ #include @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void) of_scan_flat_dt(early_init_dt_scan_memory, NULL); } +void __init early_init_dt_update_zone_dma_bits(void) +{ + unsigned long dt_root = of_get_flat_dt_root(); + + if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711")) + zone_dma_bits = 30; +} I think we could keep this entirely in the arm64 setup_machine_fdt() and not pollute the core code with RPi4-specific code. Actually, even better, could we not move the check to arm64_memblock_init() when we initialise zone_dma_bits? I did it this way as I vaguely remembered Rob saying he wanted to centralise all early boot fdt code in one place. But I'll be happy to move it there. I can see Rob replied and I'm fine if that's his preference. However, what I don't particularly like is that in the arm64 code, if zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by the early_init_dt_update_zone_dma_bits(). What if at some point we'll get a platform that actually needs 24 here (I truly hope not, but just the principle of relying on magic values)? So rather than guessing, I'd prefer if the arch code can override ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no need to explicitly touch the zone_dma_bits variable. Yes, sonds like the way to go. TBH I wasn't happy with that solution either, but couldn't think of a nicer alternative. Sadly I just realised that the series is incomplete, we have RPi4 users that want to boot unsing ACPI, and this series would break things for them. I'll have a word with them to see what we can do for their use-case. Is there a way to get some SoC information from ACPI? This is unfortunate. We used ACPI _DMA methods as they were designed to communicate the DMA limit of the XHCI controller to the OS. It shouldn't be too hard to match the OEM id field in the DSDT, and switch to the smaller mask. But it sucks to have to add a quirk like that. It also requires delaying setting the arm64_dma_phy_limit a bit, but that doesn't appear to be causing a problem. I've been boot/compiling with a patch like: diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c index cada0b816c8a..9dfe776c1c75 100644 --- a/arch/arm64/kernel/acpi.c +++ b/arch/arm64/kernel/acpi.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -168,6 +169,11 @@ static int __init acpi_fadt_sanity_check(void) ret = -EINVAL; } + if (!strncmp(table->oem_id, "RPIFDN", ACPI_OEM_ID_SIZE) && + !strncmp(table->oem_table_id, "RPI4", ACPI_OEM_TABLE_ID_SIZE) && + table->oem_revision <= 0x200) { + zone_dma_bits = 30; + } out: /* * acpi_get_table() creates FADT table mapping that diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index cd5caca8a929..6c8aaf1570ce 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -195,6 +195,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) unsigned long max_zone_pfns[MAX_NR_ZONES] = {0}; #ifdef CONFIG_ZONE_DMA + arm64_dma_phys_limit = max_zone_phys(zone_dma_bits); max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit); #endif #ifdef CONFIG_ZONE_DMA32 @@ -393,7 +394,6 @@ void __init arm64_memblock_init(void) */ if (zone_dma_bits == ZONE_DMA_BITS_DEFAULT) zone_dma_bits = 32; - arm64_dma_phys_limit = max_zone_phys(zone_dma_bits); } if (IS_ENABLED(CONFIG_ZONE_DMA32)) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote: > TLDR & HTF; > > Multiqueue devices want to have at max 1 queue per CPU or if the device > has less queues than CPUs they want the queues to have a fixed > association to a set of CPUs. > > At setup time this is established considering possible CPUs to handle > 'physical' hotplug correctly. > > If a queue has no online CPUs it cannot be started. If it's active and > the last CPU goes down then it's quiesced and stopped and the core code > shuts down the interrupt and does not move it to a still online CPU. > > So with your hackery, we end up in a situation where we have a large > possible mask, but not all CPUs in that mask can be reached, which means > in a 1 queue per CPU scenario all unreachable CPUs would have > disfunctional queues. > > So that spreading algorithm needs to know about this limitation. OK, thanks. So the queue exists, with an MSI assigned to point to an offline CPU(s), but it cannot actually be used until/unless at least one CPU in its mask comes online. So when I said I wanted to try treating "reachable" the same way as "online", that would mean the queue can't start until/unless at least one *reachable* CPU in its mask comes online. The underlying problem here is that until a CPU comes online, we don't actually *know* if it's reachable or not. So if we want carefully create the affinity masks at setup time so that they don't include any unreachable CPUs... that basically means we don't include any non-present CPUs at all (unless they've been added once and then removed). That's because — at least on x86 — we don't assign CPU numbers to CPUs which haven't yet been added. Theoretically we *could* but if there are more than NR_CPUS listed in (e.g.) the MADT then we could run out of CPU numbers. Then if the admin hotplugs one of the CPUs we *didn't* have space for, we'd not be able to handle it. I suppose there might be options like pre-assigning CPU numbers only to non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU numbers, so some not-yet-assigned CPU#s are for low APICIDs and some are for high APICIDs but it doesn't actually have to be a 1:1 predetermined mapping. But those really do seem like hacks which might only apply on x86, while the generic approach of treating "reachable" like "online" seems like it would work in other cases too. Fundamentally, there are three sets of CPUs. There are those known to be reachable, those known not to be, and those which are not yet known. So another approach we could use is to work with a cpumask of those *known* not to be reachable, and to filter those *out* of the prebuilt affinities. That gives us basically the right behaviour without hotplug, but does include absent CPUs in a mask that *if* they are ever added, wouldn't be able to receive the IRQ. Which does mean we'd have to refrain from bringing up the corresponding queue. smime.p7s Description: S/MIME cryptographic signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [Bug 209321] DMAR: [DMA Read] Request device [03:00.0] PASID ffffffff fault addr fffd3000 [fault reason 06] PTE Read access is not set
Hi Bjorn, On Wed, Oct 07, 2020 at 10:43:14AM -0500, Bjorn Helgaas wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=209321 > > Not much detail in the bugzilla yet, but apparently this started in > v5.8.0-rc1: > > DMAR: [DMA Read] Request device [03:00.0] PASID fault addr > fffd3000 [fault reason 06] PTE Read access is not set > > Currently assigned to Driver/PCI, but not clear to me yet whether PCI > is the culprit or the victim. Thanks for the heads-up, looks like Intel is already on it. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/3] drm/msm: add DRM_MSM_GEM_SYNC_CACHE for non-coherent cache maintenance
On Tue, Oct 06, 2020 at 08:23:06AM +0100, Christoph Hellwig wrote: > If people want to use the "raw" IOMMU API with not cache coherent > devices we'll need a cache maintainance API that goes along with it. > It could either be formally part of the IOMMU API or be separate. The IOMMU-API does not care about the caching effects of DMA, is manages IO address spaces for devices. I also don't know how this would be going to be implemented, the IOMMU-API does not have the concept of handles for mapped ranges and does not care about CPU virtual addresses (which are needed for cache flushes) of the memory it maps into IO page-tables. So I think a cache management API should be separate from the IOMMU-API. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711
Hi Catalin, sorry for the late reply. On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote: > On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote: > > On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote: > > > On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote: > > > > On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne wrote: > > > > > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c > > > > > index 4602e467ca8b..cd0d115ef329 100644 > > > > > --- a/drivers/of/fdt.c > > > > > +++ b/drivers/of/fdt.c > > > > > @@ -25,6 +25,7 @@ > > > > > #include > > > > > #include > > > > > #include > > > > > +#include /* for zone_dma_bits */ > > > > > > > > > > #include /* for COMMAND_LINE_SIZE */ > > > > > #include > > > > > @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void) > > > > > of_scan_flat_dt(early_init_dt_scan_memory, NULL); > > > > > } > > > > > > > > > > +void __init early_init_dt_update_zone_dma_bits(void) > > > > > +{ > > > > > + unsigned long dt_root = of_get_flat_dt_root(); > > > > > + > > > > > + if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711")) > > > > > + zone_dma_bits = 30; > > > > > +} > > > > > > > > I think we could keep this entirely in the arm64 setup_machine_fdt() and > > > > not pollute the core code with RPi4-specific code. > > > > > > Actually, even better, could we not move the check to > > > arm64_memblock_init() when we initialise zone_dma_bits? > > > > I did it this way as I vaguely remembered Rob saying he wanted to centralise > > all early boot fdt code in one place. But I'll be happy to move it there. > > I can see Rob replied and I'm fine if that's his preference. However, > what I don't particularly like is that in the arm64 code, if > zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by > the early_init_dt_update_zone_dma_bits(). What if at some point we'll > get a platform that actually needs 24 here (I truly hope not, but just > the principle of relying on magic values)? > > So rather than guessing, I'd prefer if the arch code can override > ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no > need to explicitly touch the zone_dma_bits variable. Yes, sonds like the way to go. TBH I wasn't happy with that solution either, but couldn't think of a nicer alternative. Sadly I just realised that the series is incomplete, we have RPi4 users that want to boot unsing ACPI, and this series would break things for them. I'll have a word with them to see what we can do for their use-case. Regards, Nicolas signature.asc Description: This is a digitally signed message part ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/4] of/fdt: Update zone_dma_bits when running in bcm2711
On Thu, Oct 08, 2020 at 12:05:25PM +0200, Nicolas Saenz Julienne wrote: > On Fri, 2020-10-02 at 12:55 +0100, Catalin Marinas wrote: > > On Thu, Oct 01, 2020 at 07:31:19PM +0200, Nicolas Saenz Julienne wrote: > > > On Thu, 2020-10-01 at 18:23 +0100, Catalin Marinas wrote: > > > > On Thu, Oct 01, 2020 at 06:15:01PM +0100, Catalin Marinas wrote: > > > > > On Thu, Oct 01, 2020 at 06:17:37PM +0200, Nicolas Saenz Julienne > > > > > wrote: > > > > > > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c > > > > > > index 4602e467ca8b..cd0d115ef329 100644 > > > > > > --- a/drivers/of/fdt.c > > > > > > +++ b/drivers/of/fdt.c > > > > > > @@ -25,6 +25,7 @@ > > > > > > #include > > > > > > #include > > > > > > #include > > > > > > +#include /* for zone_dma_bits */ > > > > > > > > > > > > #include /* for COMMAND_LINE_SIZE */ > > > > > > #include > > > > > > @@ -1198,6 +1199,14 @@ void __init early_init_dt_scan_nodes(void) > > > > > > of_scan_flat_dt(early_init_dt_scan_memory, NULL); > > > > > > } > > > > > > > > > > > > +void __init early_init_dt_update_zone_dma_bits(void) > > > > > > +{ > > > > > > + unsigned long dt_root = of_get_flat_dt_root(); > > > > > > + > > > > > > + if (of_flat_dt_is_compatible(dt_root, "brcm,bcm2711")) > > > > > > + zone_dma_bits = 30; > > > > > > +} > > > > > > > > > > I think we could keep this entirely in the arm64 setup_machine_fdt() > > > > > and > > > > > not pollute the core code with RPi4-specific code. > > > > > > > > Actually, even better, could we not move the check to > > > > arm64_memblock_init() when we initialise zone_dma_bits? > > > > > > I did it this way as I vaguely remembered Rob saying he wanted to > > > centralise > > > all early boot fdt code in one place. But I'll be happy to move it there. > > > > I can see Rob replied and I'm fine if that's his preference. However, > > what I don't particularly like is that in the arm64 code, if > > zone_dma_bits == 24, we set it to 32 assuming that it wasn't touched by > > the early_init_dt_update_zone_dma_bits(). What if at some point we'll > > get a platform that actually needs 24 here (I truly hope not, but just > > the principle of relying on magic values)? > > > > So rather than guessing, I'd prefer if the arch code can override > > ZONE_DMA_BITS_DEFAULT. Then, in arm64, we'll just set it to 32 and no > > need to explicitly touch the zone_dma_bits variable. > > Yes, sonds like the way to go. TBH I wasn't happy with that solution either, > but couldn't think of a nicer alternative. > > Sadly I just realised that the series is incomplete, we have RPi4 users that > want to boot unsing ACPI, and this series would break things for them. I'll > have a word with them to see what we can do for their use-case. Is there a way to get some SoC information from ACPI? -- Catalin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote: > On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote: >> Multiqueue devices want to have at max 1 queue per CPU or if the device >> has less queues than CPUs they want the queues to have a fixed >> association to a set of CPUs. >> >> At setup time this is established considering possible CPUs to handle >> 'physical' hotplug correctly. >> >> If a queue has no online CPUs it cannot be started. If it's active and >> the last CPU goes down then it's quiesced and stopped and the core code >> shuts down the interrupt and does not move it to a still online CPU. >> >> So with your hackery, we end up in a situation where we have a large >> possible mask, but not all CPUs in that mask can be reached, which means >> in a 1 queue per CPU scenario all unreachable CPUs would have >> disfunctional queues. >> >> So that spreading algorithm needs to know about this limitation. > > OK, thanks. So the queue exists, with an MSI assigned to point to an > offline CPU(s), but it cannot actually be used until/unless at least > one CPU in its mask comes online. The MSI entry in that case is actually directed to an online CPU's MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised by the device after shutdown. > So when I said I wanted to try treating "reachable" the same way as > "online", that would mean the queue can't start until/unless at least > one *reachable* CPU in its mask comes online. > > The underlying problem here is that until a CPU comes online, we don't > actually *know* if it's reachable or not. It's known before online, i.e. when the CPU is registered which is either at boot time for present CPUs or at 'physical' hotplug. > So if we want carefully create the affinity masks at setup time so that > they don't include any unreachable CPUs... that basically means we > don't include any non-present CPUs at all (unless they've been added > once and then removed). That breaks _all_ multi-queue assumptions in one go. :) > But those really do seem like hacks which might only apply on x86, > while the generic approach of treating "reachable" like "online" seems > like it would work in other cases too. > > Fundamentally, there are three sets of CPUs. There are those known to > be reachable, those known not to be, and those which are not yet > known. Unfortunately there are lots of assumptions all over the place that possible CPUs are reachable. Multi-queue using managed interrupts is just the tip of the iceberg. > So another approach we could use is to work with a cpumask of those > *known* not to be reachable, and to filter those *out* of the prebuilt > affinities. That gives us basically the right behaviour without > hotplug, but does include absent CPUs in a mask that *if* they are ever > added, wouldn't be able to receive the IRQ. Which does mean we'd have > to refrain from bringing up the corresponding queue. The multi-queue drivers rely on the interrupt setup to create their queues and the fundamental assumption is that this setup works. The managed interrupt mechanism guarantees that the queue has a vector available on all CPUs which are in the queues assigned affinity mask. As of today it also guarantees that these CPUs are reachable once they come online. So in order to make that work you'd need to teach the multi-queue stuff about this new world order: 1) On hotplug the queue needs to be able to figure out whether the interrupt is functional. If not it has to redirect any requests to some actually functional queue. 2) On unplug it needs to be able to figure out whether the interrupt will shutdown because the outgoing CPU is the last reachable in the group and if there are still online but unreachable CPUs then use the redirect mechanism. I'm sure that the multi-queue people will be enthusiastic to add all of this and deal with all the nasty corner cases coming out of it. The overall conclusion for this is: 1) X2APIC support on bare metal w/o irq remapping is not going to happen unless you: - added support in multi-queue devices which utilize managed interrupts - audited the whole tree for other assumptions related to the reachability of possible CPUs. I'm not expecting you to be done with that before I retire so for me it's just not going to happen :) 2) X2APIC support on VIRT is possible if the extended ID magic is supported by the hypervisor because that does not make any CPU unreachable for MSI and therefore the multi-queue muck and everything else just works. This requires to have either the domain affinity limitation for HPET in place or just to force disable HPET or at least HPET-MSI which is a reasonable tradeoff. HPET is not required for guests which have kvmclock and APIC/deadline timer and known (hypervisor provided) frequencies. Anything else is just wishful thinking, really. Thanks, tglx
Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()
On Mon, Oct 05, 2020 at 06:05:46PM -0700, Nicolin Chen wrote: > On Mon, Oct 05, 2020 at 11:57:54AM +0200, Thierry Reding wrote: > > On Fri, Oct 02, 2020 at 11:58:29AM -0700, Nicolin Chen wrote: > > > On Fri, Oct 02, 2020 at 06:02:18PM +0300, Dmitry Osipenko wrote: > > > > 02.10.2020 09:08, Nicolin Chen пишет: > > > > > static int tegra_smmu_of_xlate(struct device *dev, > > > > > struct of_phandle_args *args) > > > > > { > > > > > + struct platform_device *iommu_pdev = > > > > > of_find_device_by_node(args->np); > > > > > + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev); > > > > > u32 id = args->args[0]; > > > > > > > > > > + of_node_put(args->np); > > > > > > > > of_find_device_by_node() takes device reference and not the np > > > > reference. This is a bug, please remove of_node_put(). > > > > > > Looks like so. Replacing it with put_device(_pdev->dev); > > > > Putting the put_device() here is wrong, though. You need to make sure > > you keep a reference to it as long as you keep accessing the data that > > is owned by it. > > I am confused. You said in the other reply (to Dmitry) that we do > need to put_device(mc->dev), where mc->dev should be the same as > iommu_pdev->dev. But here your comments sounds that we should not > put_device at all since ->probe_device/group_device/attach_dev() > will use it later. You need to call put_device() at some point to release the reference that you acquired by calling of_find_device_by_node(). If you don't release it, you're leaking the reference and the kernel isn't going to know when it's safe to delete the device. So what I'm saying is that we either release it here, which isn't quite right because we do reference data relating to the device later on. And because it isn't quite right there should be a reason to justify it, which is that the SMMU parent device is the same as the MC, so the reference count isn't strictly necessary. But that's not quite obvious, so highlighting it in a comment makes sense. The other alternative is to not call put_device() here and keep on to the reference as long as you keep using "mc". This might be difficult to implement because it may not be obvious where to release it. I think this is the better alternative, but if it's too complicated to implement it might not be worth it. > > Like I said earlier, this is a bit weird in this case because we're > > self-referencing, so iommu_pdev->dev is going to stay around as long as > > the SMMU is. However, it might be worth to properly track the lifetime > > anyway just so that the code can serve as a good example of how to do > > things. > > What's this "track-the-lifetime"? This basically just means that SMMU needs to ensure that MC stays alive (by holding a reference to it) as long as SMMU uses it. If the last reference to MC is dropped, then the mc pointer and potentially anything that it points to will become dangling. If you were to drop the last reference at this point, then on the next line the mc pointer could already be invalid. That's how it generally works, anyway. What's special about this use- case is that the SMMU and MC are the same device, so it should be safe to omit this additional tracking because the IOMMU tracking should take care of that already. > > If you decide to go for the shortcut and not track this reference > > properly, then at least you need to add a comment as to why it is safe > > to do in this case. This ensures that readers are away of the > > circumstances and don't copy this bad code into a context where the > > circumstances are different. > > I don't quite get this "shortcut" here either...mind elaborating? The shortcut is taking advantage of the knowledge that the SMMU and the MC are the same device and therefore not properly track the MC object. Given that their code is located in different locations, this isn't obvious to the casual reader of the code, so they may assume that this is the normal way to do things. To avoid that, the code should have a comment explaining why that is. Thierry signature.asc Description: PGP signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 2/2] iommu/tegra-smmu: Expand mutex protection range
On Mon, Sep 28, 2020 at 11:13:25PM -0700, Nicolin Chen wrote: > This is used to protect potential race condition at use_count. > since probes of client drivers, calling attach_dev(), may run > concurrently. > > Signed-off-by: Nicolin Chen > --- > > Changelog > v3->v4: > * Fixed typo "Expend" => "Expand" > v2->v3: > * Renamed label "err_unlock" to "unlock" > v1->v2: > * N/A > > drivers/iommu/tegra-smmu.c | 34 +- > 1 file changed, 21 insertions(+), 13 deletions(-) Acked-by: Thierry Reding signature.asc Description: PGP signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/2] iommu/tegra-smmu: Unwrap tegra_smmu_group_get
On Mon, Sep 28, 2020 at 11:13:24PM -0700, Nicolin Chen wrote: > The tegra_smmu_group_get was added to group devices in different > SWGROUPs and it'd return a NULL group pointer upon a mismatch at > tegra_smmu_find_group(), so for most of clients/devices, it very > likely would mismatch and need a fallback generic_device_group(). > > But now tegra_smmu_group_get handles devices in same SWGROUP too, > which means that it would allocate a group for every new SWGROUP > or would directly return an existing one upon matching a SWGROUP, > i.e. any device will go through this function. > > So possibility of having a NULL group pointer in device_group() > is upon failure of either devm_kzalloc() or iommu_group_alloc(). > In either case, calling generic_device_group() no longer makes a > sense. Especially for devm_kzalloc() failing case, it'd cause a > problem if it fails at devm_kzalloc() yet succeeds at a fallback > generic_device_group(), because it does not create a group->list > for other devices to match. > > This patch simply unwraps the function to clean it up. > > Signed-off-by: Nicolin Chen > --- > > Changelog > v2->v4: > * N/A > v1->v2: > * Changed type of swgroup to "unsigned int", following Thierry's >commnets. > > drivers/iommu/tegra-smmu.c | 19 --- > 1 file changed, 4 insertions(+), 15 deletions(-) Acked-by: Thierry Reding signature.asc Description: PGP signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote: > The overall conclusion for this is: > > 1) X2APIC support on bare metal w/o irq remapping is not going to > happen unless you: > > - added support in multi-queue devices which utilize managed > interrupts > > - audited the whole tree for other assumptions related to the > reachability of possible CPUs. > > I'm not expecting you to be done with that before I retire so for > me it's just not going to happen :) Makes sense. It probably does mean we should a BUG_ON for the case where IRQ remapping *is* enabled but any device is found which isn't behind it. But that's OK. > 2) X2APIC support on VIRT is possible if the extended ID magic is > supported by the hypervisor because that does not make any CPU > unreachable for MSI and therefore the multi-queue muck and > everything else just works. > > This requires to have either the domain affinity limitation for HPET > in place or just to force disable HPET or at least HPET-MSI which is > a reasonable tradeoff. > > HPET is not required for guests which have kvmclock and > APIC/deadline timer and known (hypervisor provided) frequencies. HPET-MSI should work fine. Like the IOAPIC, it's just a child of the *actual* MSI domain. The address/data in the MSI message are completely opaque to it, and if the parent domain happens to put meaningful information into bits 11-5 of the MSI address, the HPET won't even notice. The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI address; it's not doing bit-swizzling like the IOAPIC does, which might potentially *not* have been able to set certain bits in the MSI. smime.p7s Description: S/MIME cryptographic signature ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu