Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Thu, 7 Oct 2021 12:59:32 +0200
Karsten Graul wrote:

[...]
> > > >>> BTW, there is already a WARN in the add_dma_entry() path, related
> > > >>> to cacheline overlap and -EEXIST:
> > > >>>
> > > >>> add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
> > > >>> active_cacheline_inc_overlap()
> > > >>>
> > > >>> That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
> > > >>> Not familiar with that code, but it seems that there are now two
> > > >>> warnings for more or less the same, and the new warning is much more
> > > >>> prone to false-positives.
> > > >>>
> > > >>> How do these 2 warnings relate, are they both really necessary?
> > > >>> I think the new warning was only introduced because of some old
> > > >>> TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
> > > >>> ("dma-debug: report -EEXIST errors in add_dma_entry").
> >
> > AFAICS they are different things. I believe the new warning is supposed to
> > be for the fundamentally incorrect API usage (as above) of mapping
> > different regions overlapping within the same cacheline. The existing one
> > is about dma-debug losing internal consistency when tracking the *same*
> > region being mapped multiple times, which is a legal thing to do - e.g.
> > buffer sharing between devices - but if anyone's doing it to excess that's
> > almost certainly a bug (i.e. they probably intended to unmap it in between
> > but missed that out).
>
> Thanks for the explanation Robin.
>
> In our case it's really that a buffer is mapped twice for 2 different
> devices, which we use in SMC to provide failover capabilities. We see that
> -EEXIST is returned when a buffer is mapped for the second device. Since
> there is a maximum of 2 parallel mappings, we never see the warning shown
> by active_cacheline_inc_overlap() because we don't exceed
> ACTIVE_CACHELINE_MAX_OVERLAP.
>
> So how to deal with this kind of "legal thing"? It looks like there is no
> way to suppress the newly introduced -EEXIST warning for that case?

Thanks Karsten, very interesting.
We assumed so far that we hit the same case as Ioana, i.e. having multiple
sg elements in one cacheline. With debug output it now seems that we hit a
completely different case, not at all related to any cacheline or coherency
issues.

So it really seems that the new warning is basically the same as the already
present one, with the difference that it already triggers on the first
occurrence. Looking at the code again, it also seems rather obvious now...

IIUC, from what Robin described, this means that the "legal thing to do -
e.g. buffer sharing between devices" will now immediately trigger the new
warning? Not sure if I missed something (again), because then I would expect
many more reports on this, and of course it would then obviously be a
false-positive.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] dma-debug: fix sg checks in debug_dma_map_sg()
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg().

check_for_illegal_area() checks for overlap of sg elements with kernel text
or rodata. It is called with sg_dma_len(s) instead of s->length as
parameter. After the call to ->map_sg(), sg_dma_len() will contain the
length of possibly combined sg elements in the DMA address space, and not
the individual sg element length, which would be s->length. The check will
then use the physical start address of an sg element, and add the DMA
length for the overlap check, which could result in the false warning,
because the DMA length can be larger than the actual single sg element
length.

In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if any
of them were combined in ->map_sg().

Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").

While at it, as suggested by Robin Murphy, also move check_for_stack()
inside the new loop, as it is similarly concerned with validating the
individual sg elements.
Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schae...@linux.ibm.com
Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
Signed-off-by: Gerald Schaefer
---
 kernel/dma/debug.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 95445bd6eb72..d968a429f0d1 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -1289,6 +1289,13 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (unlikely(dma_debug_disabled()))
 		return;
 
+	for_each_sg(sg, s, nents, i) {
+		check_for_stack(dev, sg_page(s), s->offset);
+		if (!PageHighMem(sg_page(s))) {
+			check_for_illegal_area(dev, sg_virt(s), s->length);
+		}
+	}
+
 	for_each_sg(sg, s, mapped_ents, i) {
 		entry = dma_entry_alloc();
 		if (!entry)
@@ -1304,12 +1311,6 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		entry->sg_call_ents = nents;
 		entry->sg_mapped_ents = mapped_ents;
 
-		check_for_stack(dev, sg_page(s), s->offset);
-
-		if (!PageHighMem(sg_page(s))) {
-			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
-		}
-
 		check_sg_segment(dev, s);
 
 		add_dma_entry(entry);
--
2.25.1
Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Wed, 6 Oct 2021 15:23:36 +0100
Robin Murphy wrote:

> On 2021-10-06 14:10, Gerald Schaefer wrote:
> > On Fri, 1 Oct 2021 14:52:56 +0200
> > Gerald Schaefer wrote:
> >
> >> On Thu, 30 Sep 2021 15:37:33 +0200
> >> Karsten Graul wrote:
> >>
> >>> On 14/09/2021 17:45, Ioana Ciornei wrote:
> >>>> On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> >>>>> +DPAA2, netdev maintainers
> >>>>> Hi,
> >>>>>
> >>>>> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> >>>>>> Since, overlapping mappings are not supported by the DMA API we should
> >>>>>> report an error if active_cacheline_insert returns -EEXIST.
> >>>>>
> >>>>> It seems this patch found a victim. I was trying to run iperf3 on a
> >>>>> honeycomb (5.14.0, fedora 35) and the console is blasting this error
> >>>>> message at 100% cpu. So, I changed it to a WARN_ONCE() to get the
> >>>>> call trace, which is attached below.
> >>>>>
> >>>>
> >>>> These frags are allocated by the stack, transformed into a scatterlist
> >>>> by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> >>>> dpaa2-eth's decision to use two fragments from the same page (that will
> >>>> also end up in the same cacheline) in two different in-flight skbs.
> >>>>
> >>>> Is this behavior normal?
> >>>>
> >>>
> >>> We see the same problem here and it started with 5.15-rc2 in our
> >>> nightly CI runs. The CI has panic_on_warn enabled so we see the panic
> >>> every day now.
> >>
> >> Adding a WARN for a case that can be detected as false-positive seems
> >> not acceptable, exactly for this reason (kernel panic on unaffected
> >> systems).
> >>
> >> So I guess it boils down to the question if the behavior that Ioana
> >> described is legit behavior, on a system that is dma coherent. We
> >> are apparently hitting the same scenario, although it could not yet be
> >> reproduced with debug printks for some reason.
> >>
> >> If the answer is yes, then please remove at least the WARN, so that
> >> it will not make systems crash that behave valid, and have
> >> panic_on_warn set. Even a normal printk feels wrong to me in that
> >> case, it really sounds rather like you want to fix / better refine
> >> the overlap check, if you want to report anything here.
> >
> > Dan, Christoph, any opinion?
> >
> > So far it all looks a lot like a false positive, so could you please
> > see that those patches get reverted? I do wonder a bit why this is
> > not an issue for others, we surely cannot be the only ones running
> > CI with panic_on_warn.
>
> What convinces you it's a false-positive? I'm hardly familiar with most
> of that callstack, but it appears to be related to mlx5, and I know that
> exists on expansion cards which could be plugged into a system with
> non-coherent PCIe where partial cacheline overlap *would* be a real
> issue. Of course it's dubious that there are many real use-cases for
> plugging a NIC with a 4-figure price tag into a little i.MX8 or
> whatever, but the point is that it *should* still work correctly.

I would assume that a *proper* warning would check if we see the
"non-coherent" case, e.g. by using dev_is_dma_coherent(), and only report
with a potentially fatal WARN on systems where it is appropriate. However,
I am certainly even less familiar with all that, and might just have gotten
the wrong impression here.

Also not sure about the mlx5 relation here, it does not really show in the
call trace, only in the err_printk() output, probably from
dev_driver_string(dev) or dev_name(dev). But I do not see where mlx5 code
would be involved here.

[...]

> According to the streaming DMA API documentation, it is *not* valid:
>
> ".. warning::
>
>    Memory coherency operates at a granularity called the cache
>    line width. In order for memory mapped by this API to operate
>    correctly, the mapped region must begin exactly on a cache line
>    boundary and end exactly on one (to prevent two separately mapped
>    regions from sharing a single cache line). Since the cache line size
>    may not be known at compile time, the API will not enforce this
>    requirement. Therefore, it is recommended that driver writers who
>    don't take special care to determine the cache line size at run time
Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Wed, 6 Oct 2021 15:10:43 +0200
Gerald Schaefer wrote:

> On Fri, 1 Oct 2021 14:52:56 +0200
> Gerald Schaefer wrote:
>
> > On Thu, 30 Sep 2021 15:37:33 +0200
> > Karsten Graul wrote:
> >
> > > On 14/09/2021 17:45, Ioana Ciornei wrote:
> > > > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> > > >> +DPAA2, netdev maintainers
> > > >> Hi,
> > > >>
> > > >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> > > >>> Since, overlapping mappings are not supported by the DMA API we
> > > >>> should report an error if active_cacheline_insert returns -EEXIST.
> > > >>
> > > >> It seems this patch found a victim. I was trying to run iperf3 on a
> > > >> honeycomb (5.14.0, fedora 35) and the console is blasting this error
> > > >> message at 100% cpu. So, I changed it to a WARN_ONCE() to get the
> > > >> call trace, which is attached below.
> > > >>
> > > >
> > > > These frags are allocated by the stack, transformed into a scatterlist
> > > > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > > > dpaa2-eth's decision to use two fragments from the same page (that
> > > > will also end up in the same cacheline) in two different in-flight
> > > > skbs.
> > > >
> > > > Is this behavior normal?
> > > >
> > >
> > > We see the same problem here and it started with 5.15-rc2 in our
> > > nightly CI runs. The CI has panic_on_warn enabled so we see the panic
> > > every day now.
> >
> > Adding a WARN for a case that can be detected as false-positive seems
> > not acceptable, exactly for this reason (kernel panic on unaffected
> > systems).
> >
> > So I guess it boils down to the question if the behavior that Ioana
> > described is legit behavior, on a system that is dma coherent. We
> > are apparently hitting the same scenario, although it could not yet be
> > reproduced with debug printks for some reason.
> >
> > If the answer is yes, then please remove at least the WARN, so that
> > it will not make systems crash that behave valid, and have
> > panic_on_warn set. Even a normal printk feels wrong to me in that
> > case, it really sounds rather like you want to fix / better refine
> > the overlap check, if you want to report anything here.
>
> Dan, Christoph, any opinion?
>
> So far it all looks a lot like a false positive, so could you please
> see that those patches get reverted? I do wonder a bit why this is
> not an issue for others, we surely cannot be the only ones running
> CI with panic_on_warn.

For reference, we are talking about these commits:

2b4bbc6231d7 ("dma-debug: report -EEXIST errors in add_dma_entry")
510e1a724ab1 ("dma-debug: prevent an error message from causing runtime problems")

The latter introduced the WARN (through err_printk usage), and should be
reverted if it can be false-positive, but both seem wrong in that case.
Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Fri, 1 Oct 2021 14:52:56 +0200
Gerald Schaefer wrote:

> On Thu, 30 Sep 2021 15:37:33 +0200
> Karsten Graul wrote:
>
> > On 14/09/2021 17:45, Ioana Ciornei wrote:
> > > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> > >> +DPAA2, netdev maintainers
> > >> Hi,
> > >>
> > >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> > >>> Since, overlapping mappings are not supported by the DMA API we should
> > >>> report an error if active_cacheline_insert returns -EEXIST.
> > >>
> > >> It seems this patch found a victim. I was trying to run iperf3 on a
> > >> honeycomb (5.14.0, fedora 35) and the console is blasting this error
> > >> message at 100% cpu. So, I changed it to a WARN_ONCE() to get the
> > >> call trace, which is attached below.
> > >>
> > >
> > > These frags are allocated by the stack, transformed into a scatterlist
> > > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > > dpaa2-eth's decision to use two fragments from the same page (that will
> > > also end up in the same cacheline) in two different in-flight skbs.
> > >
> > > Is this behavior normal?
> > >
> >
> > We see the same problem here and it started with 5.15-rc2 in our nightly
> > CI runs. The CI has panic_on_warn enabled so we see the panic every day
> > now.
>
> Adding a WARN for a case that can be detected as false-positive seems not
> acceptable, exactly for this reason (kernel panic on unaffected
> systems).
>
> So I guess it boils down to the question if the behavior that Ioana
> described is legit behavior, on a system that is dma coherent. We
> are apparently hitting the same scenario, although it could not yet be
> reproduced with debug printks for some reason.
>
> If the answer is yes, then please remove at least the WARN, so that
> it will not make systems crash that behave valid, and have
> panic_on_warn set. Even a normal printk feels wrong to me in that
> case, it really sounds rather like you want to fix / better refine
> the overlap check, if you want to report anything here.

Dan, Christoph, any opinion?

So far it all looks a lot like a false positive, so could you please
see that those patches get reverted? I do wonder a bit why this is
not an issue for others, we surely cannot be the only ones running
CI with panic_on_warn. We would need to disable DEBUG_DMA if this
WARN stays in, which would be a shame.

Of course, in theory, this might also indicate some real bug, but
there really is no sign of that so far. Having multiple sg elements
in the same page (or cacheline) is valid, correct? And this is also
not a decision of the driver IIUC, so if it was a bug, it should be
addressed in common code, correct?

> BTW, there is already a WARN in the add_dma_entry() path, related
> to cacheline overlap and -EEXIST:
>
> add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
> active_cacheline_inc_overlap()
>
> That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
> Not familiar with that code, but it seems that there are now two
> warnings for more or less the same, and the new warning is much more
> prone to false-positives.
>
> How do these 2 warnings relate, are they both really necessary?
> I think the new warning was only introduced because of some old
> TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
> ("dma-debug: report -EEXIST errors in add_dma_entry").
>
> That comment was initially added by Dan a long time ago, and he
> added several fix-ups for overlap detection after that, including
> the "overlap > ACTIVE_CACHELINE_MAX_OVERLAP" stuff in
> active_cacheline_inc_overlap(). So could it be that the TODO
> comment was simply not valid any more, and better be removed
> instead of adding new / double warnings, that also generate
> false-positives and kernel crashes?
Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Thu, 30 Sep 2021 15:37:33 +0200
Karsten Graul wrote:

> On 14/09/2021 17:45, Ioana Ciornei wrote:
> > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> >> +DPAA2, netdev maintainers
> >> Hi,
> >>
> >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> >>> Since, overlapping mappings are not supported by the DMA API we should
> >>> report an error if active_cacheline_insert returns -EEXIST.
> >>
> >> It seems this patch found a victim. I was trying to run iperf3 on a
> >> honeycomb (5.14.0, fedora 35) and the console is blasting this error
> >> message at 100% cpu. So, I changed it to a WARN_ONCE() to get the call
> >> trace, which is attached below.
> >>
> >
> > These frags are allocated by the stack, transformed into a scatterlist
> > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > dpaa2-eth's decision to use two fragments from the same page (that will
> > also end up in the same cacheline) in two different in-flight skbs.
> >
> > Is this behavior normal?
> >
>
> We see the same problem here and it started with 5.15-rc2 in our nightly
> CI runs. The CI has panic_on_warn enabled so we see the panic every day
> now.

Adding a WARN for a case that can be detected as false-positive seems not
acceptable, exactly for this reason (kernel panic on unaffected
systems).

So I guess it boils down to the question if the behavior that Ioana
described is legit behavior, on a system that is dma coherent. We
are apparently hitting the same scenario, although it could not yet be
reproduced with debug printks for some reason.

If the answer is yes, then please remove at least the WARN, so that
it will not make systems crash that behave valid, and have
panic_on_warn set. Even a normal printk feels wrong to me in that
case, it really sounds rather like you want to fix / better refine
the overlap check, if you want to report anything here.

BTW, there is already a WARN in the add_dma_entry() path, related
to cacheline overlap and -EEXIST:

add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
active_cacheline_inc_overlap()

That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
Not familiar with that code, but it seems that there are now two
warnings for more or less the same, and the new warning is much more
prone to false-positives.

How do these 2 warnings relate, are they both really necessary?
I think the new warning was only introduced because of some old
TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
("dma-debug: report -EEXIST errors in add_dma_entry").

That comment was initially added by Dan a long time ago, and he
added several fix-ups for overlap detection after that, including
the "overlap > ACTIVE_CACHELINE_MAX_OVERLAP" stuff in
active_cacheline_inc_overlap(). So could it be that the TODO
comment was simply not valid any more, and better be removed
instead of adding new / double warnings, that also generate
false-positives and kernel crashes?
Re: [RFC PATCH 1/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()
On Tue, 6 Jul 2021 10:22:40 +0100
Robin Murphy wrote:

> On 2021-07-05 19:52, Gerald Schaefer wrote:
> > The following warning occurred sporadically on s390:
> > DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text
> > or rodata [addr=48cc5e2f] [len=131072]
> > WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083
> > check_for_illegal_area+0xa8/0x138
> >
> > It is a false-positive warning, due to broken logic in
> > debug_dma_map_sg(). check_for_illegal_area() should check for overlap
> > of sg elements with kernel text or rodata. It is called with
> > sg_dma_len(s) instead of s->length as parameter. After the call to
> > ->map_sg(), sg_dma_len() contains the length of possibly combined sg
> > elements in the DMA address space, and not the individual sg element
> > length, which would be s->length.
> >
> > The check will then use the kernel start address of an sg element, and
> > add the DMA length for the overlap check, which can result in the
> > false-positive warning because the DMA length can be larger than the
> > actual single sg element length in kernel address space.
> >
> > In addition, the call to check_for_illegal_area() happens in the
> > iteration over mapped_ents, which will not include all individual sg
> > elements if any of them were combined in ->map_sg().
> >
> > Fix this by using s->length instead of sg_dma_len(s). Also put the call
> > to check_for_illegal_area() in a separate loop, iterating over all the
> > individual sg elements ("nents" instead of "mapped_ents").
> >
> > Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
> > Tested-by: Niklas Schnelle
> > Signed-off-by: Gerald Schaefer
> > ---
> >  kernel/dma/debug.c | 10 ++++++----
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
> > index 14de1271463f..d7d44b7fe7e2 100644
> > --- a/kernel/dma/debug.c
> > +++ b/kernel/dma/debug.c
> > @@ -1299,6 +1299,12 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >  	if (unlikely(dma_debug_disabled()))
> >  		return;
> >
> > +	for_each_sg(sg, s, nents, i) {
> > +		if (!PageHighMem(sg_page(s))) {
> > +			check_for_illegal_area(dev, sg_virt(s), s->length);
> > +		}
> > +	}
> > +
> >  	for_each_sg(sg, s, mapped_ents, i) {
> >  		entry = dma_entry_alloc();
> >  		if (!entry)
> > @@ -1316,10 +1322,6 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
> >
> >  	check_for_stack(dev, sg_page(s), s->offset);
>
> Strictly this should probably be moved to the new loop as well, as it is
> similarly concerned with validating the source segments rather than the
> DMA mappings - I think with virtually-mapped stacks it might technically
> be possible for a stack page to be physically adjacent to a "valid" page
> such that it could get merged and overlooked if it were near the end of
> the list, although in fairness that would probably be indicative of
> something having gone far more fundamentally wrong. Otherwise, the
> overall reasoning looks sound to me.

I see, good point. I think I can add this to my patch, with a different
subject like "dma-debug: fix sg checks in debug_dma_map_sg()".

However, I do not quite understand why check_for_stack() does not also
consider s->length. It seems to check only the first page of an sg
element. So, shouldn't check_for_stack() behave similar to
check_for_illegal_area(), i.e. check all source sg elements for overlap
with the task stack area?

If yes, then this probably should be a separate patch, but I can try to
come up with something and send a new RFC with two patches.

Maybe check_for_stack() can also be integrated into
check_for_illegal_area(), they are both called at the same places. And
mapping memory from the stack also sounds rather illegal.
[RFC PATCH 1/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg().
check_for_illegal_area() should check for overlap of sg elements with
kernel text or rodata. It is called with sg_dma_len(s) instead of
s->length as parameter. After the call to ->map_sg(), sg_dma_len()
contains the length of possibly combined sg elements in the DMA address
space, and not the individual sg element length, which would be s->length.

The check will then use the kernel start address of an sg element, and add
the DMA length for the overlap check, which can result in the
false-positive warning because the DMA length can be larger than the
actual single sg element length in kernel address space.

In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if
any of them were combined in ->map_sg().

Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").
Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
Tested-by: Niklas Schnelle
Signed-off-by: Gerald Schaefer
---
 kernel/dma/debug.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 14de1271463f..d7d44b7fe7e2 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -1299,6 +1299,12 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (unlikely(dma_debug_disabled()))
 		return;
 
+	for_each_sg(sg, s, nents, i) {
+		if (!PageHighMem(sg_page(s))) {
+			check_for_illegal_area(dev, sg_virt(s), s->length);
+		}
+	}
+
 	for_each_sg(sg, s, mapped_ents, i) {
 		entry = dma_entry_alloc();
 		if (!entry)
@@ -1316,10 +1322,6 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 
 	check_for_stack(dev, sg_page(s), s->offset);
 
-	if (!PageHighMem(sg_page(s))) {
-		check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
-	}
-
 	check_sg_segment(dev, s);
 
 	add_dma_entry(entry);
--
2.25.1
[RFC PATCH 0/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg(),
see the patch description. In short, the check is mixing up the kernel
start address of sg elements with the length of possibly combined sg
elements in the DMA address space.

I am a bit confused by the whole logic, and not sure what would be the
best way to fix this. The false-positives should have been possible since
commit 884d05970bfb ("dma-debug: use sg_dma_len accessor"), which is
included since 2.6.31.

Also, it seems to me that even before that commit, the check would have
been wrong, or at least incomplete, because it is located in a loop that
iterates over mapped_ents instead of nents. So it would not check all
physical sg elements if any were combined in DMA address space.

Gerald Schaefer (1):
  dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()

 kernel/dma/debug.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--
2.25.1
[PATCH] iommu/intel-iommu: fix memory leak in intel_iommu_put_resv_regions()
Commit 9d3a4de4cb8d ("iommu: Disambiguate MSI region types") changed the
reserved region type in intel_iommu_get_resv_regions() from
IOMMU_RESV_RESERVED to IOMMU_RESV_MSI, but it forgot to also change the
type in intel_iommu_put_resv_regions().

This leads to a memory leak, because now the check in
intel_iommu_put_resv_regions() for IOMMU_RESV_RESERVED will never be true,
and no allocated regions will be freed.

Fix this by changing the region type in intel_iommu_put_resv_regions() to
IOMMU_RESV_MSI, matching the type of the allocated regions.

Fixes: 9d3a4de4cb8d ("iommu: Disambiguate MSI region types")
Cc: # v4.11+
Signed-off-by: Gerald Schaefer
---
 drivers/iommu/intel-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 048b5ab36a02..b83e0f2025bb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5295,7 +5295,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
 	struct iommu_resv_region *entry, *next;
 
 	list_for_each_entry_safe(entry, next, head, list) {
-		if (entry->type == IOMMU_RESV_RESERVED)
+		if (entry->type == IOMMU_RESV_MSI)
 			kfree(entry);
 	}
 }
--
2.16.4
Re: [PATCH 3/3] iommu: s390: constify iommu_ops
On Mon, 28 Aug 2017 17:42:50 +0530
Arvind Yadav <arvind.yadav...@gmail.com> wrote:

> iommu_ops are not supposed to change at runtime.
> Functions 'bus_set_iommu' working with const iommu_ops provided
> by . So mark the non-const structs as const.
>
> Signed-off-by: Arvind Yadav <arvind.yadav...@gmail.com>
> ---
>  drivers/iommu/s390-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Gerald Schaefer <gerald.schae...@de.ibm.com>

>
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 8788640..400d75f 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -327,7 +327,7 @@ static size_t s390_iommu_unmap(struct iommu_domain *domain,
>  	return size;
>  }
>
> -static struct iommu_ops s390_iommu_ops = {
> +static const struct iommu_ops s390_iommu_ops = {
>  	.capable = s390_iommu_capable,
>  	.domain_alloc = s390_domain_alloc,
>  	.domain_free = s390_domain_free,
Re: [PATCH 2/2] iommu/s390: Add support for iommu_device handling
On Tue, 27 Jun 2017 17:28:06 +0200 Joerg Roedel <j...@8bytes.org> wrote: > On Mon, Jun 19, 2017 at 05:02:19PM +0200, Gerald Schaefer wrote: > > On Thu, 15 Jun 2017 15:11:52 +0200 > > Joerg Roedel <j...@8bytes.org> wrote: > > > + rc = zpci_init_iommu(zdev); > > > + if (rc) > > > + goto out_free; > > > + > > > > After this point, there are two options for failure (zpci_enable_device + > > zpci_scan_bus), but missing error handling with an appropriate call to > > zpci_destroy_iommu(). > > Right, I'll fix that in the next version. > > > I must admit that I do not understand what these sysfs attributes are > > used for, and by whom and when. Is it really necessary/correct to register > > them (including the s390_iommu_ops) _before_ the zpci_dev is registered > > to the bus (zpci_scan_bus -> pci_scan_root_bus/pci_bus_add_devices)? > > > > What is the expected life cycle for the attributes, and are they already > > needed when a device is still under DMA API access, or even before it is > > added to the PCI bus? > > The sysfs attributes are used by user-space for topology detection. A > user-space program using VFIO needs to know which PCI-devices it needs > to assign to the VFIO driver in order to gain access. > > On s390 this probably doesn't matter much, as the real PCI topology is > not exposed, but it matters on other architectures. The purpose of this > patch is not so much the sysfs attributes. Its a step to convert all > IOMMU drivers to use the iommu_device_register() interface. Goal is that > the iommu core code knows about individual hardware iommus and the > devices behind it. > > When this is implemented in all iommu drivers, we can start moving more > code out of the drivers into the iommu core. This also probably doesn't > matter much for s390, but all iommu drivers need to use this interface > before we can move on. The sysfs-stuff is just a by-product of that. 
> > So to move on with that, it would be good to get an Acked-by on this > patch from you guys once you think I fixed everything :) Ok, thanks for the explanation. So it should not be a problem if the attributes are registered in zpci_create_device() before the device is registered to the bus, especially since they do not provide any manipulation function but only topology information. BTW, I get the following new attributes with this patch: /sys/devices/pci:00/:00:00.0/iommu /sys/devices/virtual/iommu /sys/devices/virtual/iommu/s390-iommu.0012 /sys/class/iommu/s390-iommu.0012 Regards, Gerald
Re: [PATCH 1/3] iommu: Return ERR_PTR() values from device_group call-backs
On Wed, 28 Jun 2017 14:00:56 +0200 Joerg Roedel <j...@8bytes.org> wrote: > From: Joerg Roedel <jroe...@suse.de> > > The generic device_group call-backs in iommu.c return NULL > in case of error. Since they are getting ERR_PTR values from > iommu_group_alloc(), just pass them up instead. > > Reported-by: Gerald Schaefer <gerald.schae...@de.ibm.com> > Signed-off-by: Joerg Roedel <jroe...@suse.de> > --- Looks good, Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com> > drivers/iommu/iommu.c | 14 ++ > 1 file changed, 2 insertions(+), 12 deletions(-) > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index cf7ca7e..de09e1e 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -915,13 +915,7 @@ static int get_pci_alias_or_group(struct pci_dev *pdev, > u16 alias, void *opaque) > */ > struct iommu_group *generic_device_group(struct device *dev) > { > - struct iommu_group *group; > - > - group = iommu_group_alloc(); > - if (IS_ERR(group)) > - return NULL; > - > - return group; > + return iommu_group_alloc(); > } > > /* > @@ -988,11 +982,7 @@ struct iommu_group *pci_device_group(struct device *dev) > return group; > > /* No shared group found, allocate new */ > - group = iommu_group_alloc(); > - if (IS_ERR(group)) > - return NULL; > - > - return group; > + return iommu_group_alloc(); > } > > /** ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/s390: Add support for iommu_device handling
On Thu, 15 Jun 2017 15:11:52 +0200 Joerg Roedelwrote: > From: Joerg Roedel > > Add support for the iommu_device_register interface to make > the s390 hardware iommus visible to the iommu core and in > sysfs. > > Signed-off-by: Joerg Roedel > --- > arch/s390/include/asm/pci.h | 7 +++ > arch/s390/pci/pci.c | 5 + > drivers/iommu/s390-iommu.c | 35 +++ > 3 files changed, 47 insertions(+) > > diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h > index 4e31866..a3f697e 100644 > --- a/arch/s390/include/asm/pci.h > +++ b/arch/s390/include/asm/pci.h > @@ -8,6 +8,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -123,6 +124,8 @@ struct zpci_dev { > unsigned long iommu_pages; > unsigned intnext_bit; > > + struct iommu_device iommu_dev; /* IOMMU core handle */ > + > char res_name[16]; > struct zpci_bar_struct bars[PCI_BAR_COUNT]; > > @@ -173,6 +176,10 @@ int clp_add_pci_device(u32, u32, int); > int clp_enable_fh(struct zpci_dev *, u8); > int clp_disable_fh(struct zpci_dev *); > > +/* IOMMU Interface */ > +int zpci_init_iommu(struct zpci_dev *zdev); > +void zpci_destroy_iommu(struct zpci_dev *zdev); > + > #ifdef CONFIG_PCI > /* Error handling and recovery */ > void zpci_event_error(void *); > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c > index 8051df1..4c13da6 100644 > --- a/arch/s390/pci/pci.c > +++ b/arch/s390/pci/pci.c > @@ -749,6 +749,7 @@ void pcibios_remove_bus(struct pci_bus *bus) > > zpci_exit_slot(zdev); > zpci_cleanup_bus_resources(zdev); > + zpci_destroy_iommu(zdev); > zpci_free_domain(zdev); > > spin_lock(_list_lock); > @@ -820,6 +821,10 @@ int zpci_create_device(struct zpci_dev *zdev) > if (rc) > goto out; > > + rc = zpci_init_iommu(zdev); > + if (rc) > + goto out_free; > + After this point, there are two options for failure (zpci_enable_device + zpci_scan_bus), but missing error handling with an appropriate call to zpci_destroy_iommu(). 
I must admit that I do not understand what these sysfs attributes are used for, and by whom and when. Is it really necessary/correct to register them (including the s390_iommu_ops) _before_ the zpci_dev is registered to the bus (zpci_scan_bus -> pci_scan_root_bus/pci_bus_add_devices)? What is the expected life cycle for the attributes, and are they already needed when a device is still under DMA API access, or even before it is added to the PCI bus? > mutex_init(>lock); > if (zdev->state == ZPCI_FN_STATE_CONFIGURED) { > rc = zpci_enable_device(zdev); > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c > index 8788640..85f3bc5 100644 > --- a/drivers/iommu/s390-iommu.c > +++ b/drivers/iommu/s390-iommu.c > @@ -18,6 +18,8 @@ > */ > #define S390_IOMMU_PGSIZES (~0xFFFUL) > > +static struct iommu_ops s390_iommu_ops; > + > struct s390_domain { > struct iommu_domain domain; > struct list_headdevices; > @@ -166,11 +168,13 @@ static void s390_iommu_detach_device(struct > iommu_domain *domain, > static int s390_iommu_add_device(struct device *dev) > { > struct iommu_group *group = iommu_group_get_for_dev(dev); > + struct zpci_dev *zdev = to_pci_dev(dev)->sysdata; > > if (IS_ERR(group)) > return PTR_ERR(group); > > iommu_group_put(group); > + iommu_device_link(>iommu_dev, dev); > > return 0; > } > @@ -197,6 +201,7 @@ static void s390_iommu_remove_device(struct device *dev) > s390_iommu_detach_device(domain, dev); > } > > + iommu_device_unlink(>iommu_dev, dev); > iommu_group_remove_device(dev); > } > > @@ -327,6 +332,36 @@ static size_t s390_iommu_unmap(struct iommu_domain > *domain, > return size; > } > > +int zpci_init_iommu(struct zpci_dev *zdev) > +{ > + int rc = 0; > + > + rc = iommu_device_sysfs_add(>iommu_dev, NULL, NULL, > + "s390-iommu.%08x", zdev->fid); > + if (rc) > + goto out_err; > + > + iommu_device_set_ops(>iommu_dev, _iommu_ops); > + > + rc = iommu_device_register(>iommu_dev); > + if (rc) > + goto out_sysfs; > + > + return 0; > + > +out_sysfs: 
> + iommu_device_sysfs_remove(&zdev->iommu_dev); > + > +out_err: > + return rc; > +} > + > +void zpci_destroy_iommu(struct zpci_dev *zdev) > +{ > + iommu_device_unregister(&zdev->iommu_dev); > + iommu_device_sysfs_remove(&zdev->iommu_dev); > +} > + > static struct iommu_ops s390_iommu_ops = { > .capable = s390_iommu_capable, > .domain_alloc = s390_domain_alloc,
Re: [PATCH 1/2] iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()
On Thu, 15 Jun 2017 15:11:51 +0200 Joerg Roedel <j...@8bytes.org> wrote: > From: Joerg Roedel <jroe...@suse.de> > > The iommu_group_get_for_dev() function also attaches the > device to its group, so this code doesn't need to be in the > iommu driver. > > Further by using this function the driver can make use of > default domains in the future. > > Signed-off-by: Joerg Roedel <jroe...@suse.de> Seems pretty straightforward, so Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com> However, looking at iommu_group_get_for_dev(), I wonder if the generic_device_group() always returns the right thing, but that would be independent from this patch. With generic_device_group() returning NULL in case the allocation failed, this part of iommu_group_get_for_dev() would then happily dereference the NULL pointer, because IS_ERR(group) would be false: if (ops && ops->device_group) group = ops->device_group(dev); if (IS_ERR(group)) return group; /* * Try to allocate a default domain - needs support from the * IOMMU driver. */ if (!group->default_domain) { The same is true for pci_device_group(), which also returns NULL in case of allocation failure. I guess both functions should just return the group pointer from iommu_group_alloc() directly, which already would contain an appropriate ERR_PTR, but never NULL. What do you think? 
> --- > drivers/iommu/s390-iommu.c | 15 +-- > 1 file changed, 5 insertions(+), 10 deletions(-) > > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c > index 179e636..8788640 100644 > --- a/drivers/iommu/s390-iommu.c > +++ b/drivers/iommu/s390-iommu.c > @@ -165,20 +165,14 @@ static void s390_iommu_detach_device(struct > iommu_domain *domain, > > static int s390_iommu_add_device(struct device *dev) > { > - struct iommu_group *group; > - int rc; > + struct iommu_group *group = iommu_group_get_for_dev(dev); > > - group = iommu_group_get(dev); > - if (!group) { > - group = iommu_group_alloc(); > - if (IS_ERR(group)) > - return PTR_ERR(group); > - } > + if (IS_ERR(group)) > + return PTR_ERR(group); > > - rc = iommu_group_add_device(group, dev); > iommu_group_put(group); > > - return rc; > + return 0; > } > > static void s390_iommu_remove_device(struct device *dev) > @@ -344,6 +338,7 @@ static struct iommu_ops s390_iommu_ops = { > .iova_to_phys = s390_iommu_iova_to_phys, > .add_device = s390_iommu_add_device, > .remove_device = s390_iommu_remove_device, > + .device_group = generic_device_group, > .pgsize_bitmap = S390_IOMMU_PGSIZES, > }; > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 19/44] s390: implement ->mapping_error
On Thu, 8 Jun 2017 15:25:44 +0200 Christoph Hellwig <h...@lst.de> wrote: > s390 can also use noop_dma_ops, and while that currently does not return > errors it will so in the future. Implementing the mapping_error method > is the proper way to have per-ops error conditions. > > Signed-off-by: Christoph Hellwig <h...@lst.de> Acked-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
Re: [RFC PATCH 0/2] iommu/s390: Fix iommu-groups and add sysfs support
On Fri, 28 Apr 2017 16:55:13 +0200 Joerg Roedelwrote: > > I am however a bit confused now, about how we would have allowed group > > sharing with the current s390 IOMMU code, or IOW in which scenario would > > iommu_group_get() in the add_device callback find a shareable iommu-group? > > The usual way to do this is to use the iommu_group_get_for_dev() > function, which invokes the iommu_ops->device_group call-back of the > driver to find a matching group or allocating a new one. > > There are ready-to-use functions for this call-back already: > > 1) generic_device_group() - which just allocates a new group for > the device. This is usually used outside of PCI > > 2) pci_device_group() - Which walks the PCI hierarchy to find > devices that are not isolated and uses the matching group for > its isolation domain. > > A few drivers have their own versions of this call-back, but those are > IOMMU drivers supporting multiple bus-types and need to find the right > way to determine the group first. > > > So, I guess we may have an issue with not sharing iommu-groups when > > it could make sense to do so. But your patch would not fix this, as > > we still would allocate separate iommu-groups for all functions. > > Yes, but the above approach won't help when each function ends up on a > seperate bus because the code looks for different functions that are > enumerated as such. Anyway, some more insight into how this enumeration > works on s390 would be great :) Since Sebastian confirmed this, it looks like we do not really have any enumeration when there is a separate bus for each function. Also, IIRC, add_device will get called before attach_dev. Currently we allow to attach more than one device (apparently from different buses) to one domain (one shared DMA table) in attach_dev. But then it would be too late to also add all devices to the same iommu-group. 
That would have had to be done earlier in add_device, but there we don't know yet if a shared DMA table would be set up later in attach_dev. So, it looks to me that we cannot provide correct iommu-group sharing on s390, even though we allow iommu-domain sharing, which sounds odd. Since this "shared domain / DMA table" option in attach_dev was only added because at that time I thought that was a hard requirement for any arch-specific IOMMU API implementation, maybe there was some misunderstanding. It would make the code easier (and more consistent with the s390 hardware) if I would just remove that option from attach_dev, and allow only one device/function per iommu-domain. What do you think, could this be removed for s390, or is there any common code requirement for providing that option (and is it OK that we have separate iommu-groups in this case)? Regards, Gerald
Re: [PATCH 1/2] iommu/s390: Fix IOMMU groups
On Thu, 27 Apr 2017 23:12:32 +0200 Joerg Roedel <j...@8bytes.org> wrote: > On Thu, Apr 27, 2017 at 08:11:42PM +0200, Gerald Schaefer wrote: > > > +void zpci_destroy_iommu(struct zpci_dev *zdev) > > > +{ > > > + iommu_group_put(zdev->group); > > > + zdev->group = NULL; > > > +} > > > > While the rest of this patch doesn't seem to make much of a difference to > > the current behavior, I'm wondering where this extra iommu_group_put() > > comes from. It either was erroneously missing before this patch, or it > > is erroneously introduced by this patch. > > This is the way to free an iommu-group. It was missing before probably > because it was unclear whether the add_device function allocated a group > or not. So there was no way to know if it needs to be put again in the > remove_device function. Hmm, for the reference count it should not matter whether a new group was allocated or an existing group found with iommu_group_get(). Our add_device callback always gets one reference either from iommu_group_get or _alloc, and then another one from iommu_group_add_device(), after which the first reference is put again. So there should always be one reference more after a successful add_device. Now I'm wondering where this one reference is put again, and I thought that happened in the remove_device callback, where we call iommu_group_remove_device(). Is this not correct? Just want to make sure that we don't have a refcount issue in the current code. Regards, Gerald
Re: [RFC PATCH 0/2] iommu/s390: Fix iommu-groups and add sysfs support
Hi Joerg, I guess we are a bit special on s390 (again), see below. Sebastian is more familiar with the base s390 PCI code, he may correct me if I'm wrong. On Thu, 27 Apr 2017 23:03:25 +0200 Joerg Roedelwrote: > > Well, there is a separate zpci_dev for each pci_dev on s390, > > and each of those has its own separate dma-table (thus not shared). > > Is that true for all functions of a PCIe card, so does every function of > a device has its own zpci_dev structure and thus its own DMA-table? Yes, clp_add_pci_device() is called for every function, which in turn calls zpci_create_device() with a freshly allocated zdev. zpci_enable_device() then sets up a new DMA address space for each function. > My assumption came from the fact that the zpci_dev is read from > pci_dev->sysdata, which is propagated there from the pci_bridge > through the pci_root_bus structures. The zdev gets there via zpci_create_device() -> zpci_scan_bus() -> pci_scan_root_bus(), which is done for every single function. Not sure if I understand this right, but it looks like we set up a new PCI bus for each function. > > Given this "separate zpci_dev for each pci_dev" situation, I don't > > see what this update actually changes, compared to the previous code, > > see also my comments to that patch. > > The add_device call-back is invoked for every function of a pci-device, > because each function gets its own pci_dev structure. Also we usually > group all functions of a PCI-device together into one iommu-group, > because we don't trust that the device isolates its functions from each > other. OK, but similar to the add_device callback, zpci_create_device() is also invoked for every function. So, allocating a new iommu-group in zpci_create_device() will never lead to any group sharing. I am however a bit confused now, about how we would have allowed group sharing with the current s390 IOMMU code, or IOW in which scenario would iommu_group_get() in the add_device callback find a shareable iommu-group? 
In the attach_dev callback, we provide the option to "force" multiple functions using the same iommu-domain / DMA address space, by de-registering the per-function DMA address space and registering a common space. But such functions would only be in the same iommu "domain" and not "group", if I get this right. So, I guess we may have an issue with not sharing iommu-groups when it could make sense to do so. But your patch would not fix this, as we still would allocate separate iommu-groups for all functions. Regards, Gerald
Re: [PATCH 1/2] iommu/s390: Fix IOMMU groups
On Thu, 27 Apr 2017 17:28:24 +0200 Joerg Roedelwrote: > From: Joerg Roedel > > Currently the s390 iommu driver allocates an iommu-group for > every device that is added. But that is wrong, as there is > only one dma-table per pci-root-bus. Make all devices behind > one dma-table share one iommu-group. On s390, each PCI device has its own zpci_dev structure, and also its own DMA address space. Even with this patch, you'll still allocate an iommu_group for every device that is added, see below, so I do not really see the benefit of this (but my knowledge got a little rusty, so I may be missing something). The only reason why we allow the theoretical option to attach more than one device to the same IOMMU domain (and thus address space), is because it was a requirement by you at that time (IIRC). We have no use-case for this, and even in this theoretical scenario you would still have separate zpci_dev structures for the affected devices that share the same IOMMU domain (DMA address space), and thus also separate IOMMU groups, at least after this patch. Before this patch, you could in theory have different PCI devices in the same IOMMU group, by having iommu_group_get() in s390_iommu_add_device() return a group != NULL. With this patch, you will enforce that a new iommu_group is allocated for every device, which would be the contrary of what the description says. 
> > Signed-off-by: Joerg Roedel > --- > arch/s390/include/asm/pci.h | 7 +++ > arch/s390/pci/pci.c | 10 +- > drivers/iommu/s390-iommu.c | 43 --- > 3 files changed, 48 insertions(+), 12 deletions(-) > > diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h > index 4e31866..045665d 100644 > --- a/arch/s390/include/asm/pci.h > +++ b/arch/s390/include/asm/pci.h > @@ -8,6 +8,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -123,6 +124,8 @@ struct zpci_dev { > unsigned long iommu_pages; > unsigned intnext_bit; > > + struct iommu_group *group; /* IOMMU group for all devices behind > this zdev */ There is always only one device behind a zpci_dev, so this comment makes no sense. > + > char res_name[16]; > struct zpci_bar_struct bars[PCI_BAR_COUNT]; > > @@ -173,6 +176,10 @@ static inline bool zdev_enabled(struct zpci_dev *zdev) > int clp_enable_fh(struct zpci_dev *, u8); > int clp_disable_fh(struct zpci_dev *); > > +/* IOMMU Interface */ > +int zpci_init_iommu(struct zpci_dev *zdev); > +void zpci_destroy_iommu(struct zpci_dev *zdev); > + > #ifdef CONFIG_PCI > /* Error handling and recovery */ > void zpci_event_error(void *); > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c > index 364b9d8..3178e4d 100644 > --- a/arch/s390/pci/pci.c > +++ b/arch/s390/pci/pci.c > @@ -754,6 +754,7 @@ void pcibios_remove_bus(struct pci_bus *bus) > > zpci_exit_slot(zdev); > zpci_cleanup_bus_resources(zdev); > + zpci_destroy_iommu(zdev); > zpci_free_domain(zdev); > > spin_lock(_list_lock); > @@ -825,11 +826,15 @@ int zpci_create_device(struct zpci_dev *zdev) > if (rc) > goto out; > > + rc = zpci_init_iommu(zdev); > + if (rc) > + goto out_free; This will get called for every device that is added, and every device will get its own zpci_dev, so this will still result in allocating an IOMMU group for every device. 
> + > mutex_init(>lock); > if (zdev->state == ZPCI_FN_STATE_CONFIGURED) { > rc = zpci_enable_device(zdev); > if (rc) > - goto out_free; > + goto out_iommu; > } > rc = zpci_scan_bus(zdev); > if (rc) > @@ -846,6 +851,9 @@ int zpci_create_device(struct zpci_dev *zdev) > out_disable: > if (zdev->state == ZPCI_FN_STATE_ONLINE) > zpci_disable_device(zdev); > +out_iommu: > + zpci_destroy_iommu(zdev); > + > out_free: > zpci_free_domain(zdev); > out: > diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c > index 179e636..cad3ad0 100644 > --- a/drivers/iommu/s390-iommu.c > +++ b/drivers/iommu/s390-iommu.c > @@ -163,22 +163,22 @@ static void s390_iommu_detach_device(struct > iommu_domain *domain, > } > } > > -static int s390_iommu_add_device(struct device *dev) > +static struct iommu_group *s390_iommu_device_group(struct device *dev) > { > - struct iommu_group *group; > - int rc; > + struct zpci_dev *zdev = to_pci_dev(dev)->sysdata; > > - group = iommu_group_get(dev); > - if (!group) { > - group = iommu_group_alloc(); > - if (IS_ERR(group)) > - return PTR_ERR(group); > - } > + return zdev->group; > +} > + > +static int s390_iommu_add_device(struct device *dev) > +{ > + struct iommu_group *group = iommu_group_get_for_dev(dev); > + if
Re: [RFC PATCH 0/2] iommu/s390: Fix iommu-groups and add sysfs support
On Thu, 27 Apr 2017 17:28:23 +0200 Joerg Roedel wrote: > Hey, > > here are two patches for the s390 PCI and IOMMU code. It is > based on the assumption that every pci_dev that points to > the same zpci_dev shares a single dma-table (and thus a > single address space). Well, there is a separate zpci_dev for each pci_dev on s390, and each of those has its own separate dma-table (thus not shared). > > If this assupmtion is true (as it looks to me from reading > the code) then the iommu-group setup code in the s390 iommu > driver needs to be updated. Given this "separate zpci_dev for each pci_dev" situation, I don't see what this update actually changes, compared to the previous code, see also my comments to that patch. > > These patches do this and also add support for the > iommu_device_register interface to the s390 iommu driver. > > Any comments and testing appreciated. > > Thanks, > > Joerg > > Joerg Roedel (2): > iommu/s390: Fix IOMMU groups > iommu/s390: Add support for iommu_device handling > > arch/s390/include/asm/pci.h | 8 + > arch/s390/pci/pci.c | 10 ++- > drivers/iommu/s390-iommu.c | 71 > ++--- > 3 files changed, 78 insertions(+), 11 deletions(-) >
Re: [PATCH] iommu/s390: add iommu api for s390 pci devices
On Tue, 29 Sep 2015 14:40:30 +0200 Joerg Roedel <j...@8bytes.org> wrote: > Hi Gerald, > > thanks for your patch. It looks pretty good and addresses my previous > review comments. I have a few questions, first one is how this > operates with DMA-API on s390. Is there a seperate DMA-API > implementation besides the IOMMU-API one for PCI devices? Yes, the DMA API is already implemented in arch/s390/pci/pci_dma.c. I thought about moving it over to the new location in drivers/iommu/, but I don't see any benefit from it. Also, the two APIs are quite different on s390 and must not be mixed-up. For example, we have optimizations in the DMA API to reduce TLB flushes based on iommu bitmap wrap-around, which is not possible for the map/unmap logic in the IOMMU API. There is also the requirement that each device has its own DMA page table (not shared), which is important for DMA API device recovery and map/unmap on s390. > > My other question is inline: > > On Thu, Aug 27, 2015 at 03:33:03PM +0200, Gerald Schaefer wrote: > > +struct s390_domain_device { > > + struct list_head list; > > + struct zpci_dev *zdev; > > +}; > > Instead of using your own struct here, have you considered using the > struct iommu_group instead? The struct devices contains a pointer to an > iommu_group and the struct itself contains pointers to the domain it is > currently bound to. Hmm, not sure how this can replace my own struct. I need the struct to maintain a list of all devices that share a dma page table. And the devices need to be added and removed to/from that list in attach/detach_dev. I also need that list during map/unmap, in order to do a TLB flush for all affected devices, and this happens under a spin lock. So I guess I cannot use the iommu_group->devices list, which is managed in add/remove_device and under a mutex, if that was on your mind. Regards, Gerald
[PATCH] iommu/s390: add iommu api for s390 pci devices
This adds an IOMMU API implementation for s390 PCI devices. Reviewed-by: Sebastian Ott seb...@linux.vnet.ibm.com Signed-off-by: Gerald Schaefer gerald.schae...@de.ibm.com --- MAINTAINERS | 7 + arch/s390/Kconfig | 1 + arch/s390/include/asm/pci.h | 4 + arch/s390/include/asm/pci_dma.h | 5 +- arch/s390/pci/pci_dma.c | 37 +++-- drivers/iommu/Kconfig | 7 + drivers/iommu/Makefile | 1 + drivers/iommu/s390-iommu.c | 337 8 files changed, 386 insertions(+), 13 deletions(-) create mode 100644 drivers/iommu/s390-iommu.c diff --git a/MAINTAINERS b/MAINTAINERS index 23da5da..8845410 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8937,6 +8937,13 @@ F: drivers/s390/net/*iucv* F: include/net/iucv/ F: net/iucv/ +S390 IOMMU (PCI) +M: Gerald Schaefer gerald.schae...@de.ibm.com +L: linux-s...@vger.kernel.org +W: http://www.ibm.com/developerworks/linux/linux390/ +S: Supported +F: drivers/iommu/s390-iommu.c + S3C24XX SD/MMC Driver M: Ben Dooks ben-li...@fluff.org L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 1d57000..74db332 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -582,6 +582,7 @@ menuconfig PCI bool PCI support select HAVE_DMA_ATTRS select PCI_MSI + select IOMMU_SUPPORT help Enable PCI support. 
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h index 34d9603..c873e68 100644 --- a/arch/s390/include/asm/pci.h +++ b/arch/s390/include/asm/pci.h @@ -62,6 +62,8 @@ struct zpci_bar_struct { u8 size; /* order 2 exponent */ }; +struct s390_domain; + /* Private data per function */ struct zpci_dev { struct pci_dev *pdev; @@ -118,6 +120,8 @@ struct zpci_dev { struct dentry *debugfs_dev; struct dentry *debugfs_perf; + + struct s390_domain *s390_domain; /* s390 IOMMU domain data */ }; static inline bool zdev_enabled(struct zpci_dev *zdev) diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h index 30b4c17..7a7abf1 100644 --- a/arch/s390/include/asm/pci_dma.h +++ b/arch/s390/include/asm/pci_dma.h @@ -192,5 +192,8 @@ static inline unsigned long *get_st_pto(unsigned long entry) /* Prototypes */ int zpci_dma_init_device(struct zpci_dev *); void zpci_dma_exit_device(struct zpci_dev *); - +void dma_free_seg_table(unsigned long); +unsigned long *dma_alloc_cpu_table(void); +void dma_cleanup_tables(unsigned long *); +void dma_update_cpu_trans(unsigned long *, void *, dma_addr_t, int); #endif diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c index 37505b8..37d10f7 100644 --- a/arch/s390/pci/pci_dma.c +++ b/arch/s390/pci/pci_dma.c @@ -24,7 +24,7 @@ static int zpci_refresh_global(struct zpci_dev *zdev) zdev-iommu_pages * PAGE_SIZE); } -static unsigned long *dma_alloc_cpu_table(void) +unsigned long *dma_alloc_cpu_table(void) { unsigned long *table, *entry; @@ -114,12 +114,12 @@ static unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr return pto[px]; } -static void dma_update_cpu_trans(struct zpci_dev *zdev, void *page_addr, -dma_addr_t dma_addr, int flags) +void dma_update_cpu_trans(unsigned long *dma_table, void *page_addr, + dma_addr_t dma_addr, int flags) { unsigned long *entry; - entry = dma_walk_cpu_trans(zdev-dma_table, dma_addr); + entry = dma_walk_cpu_trans(dma_table, dma_addr); if (!entry) { 
WARN_ON_ONCE(1); return; @@ -156,7 +156,8 @@ static int dma_update_trans(struct zpci_dev *zdev, unsigned long pa, goto no_refresh; for (i = 0; i nr_pages; i++) { - dma_update_cpu_trans(zdev, page_addr, dma_addr, flags); + dma_update_cpu_trans(zdev-dma_table, page_addr, dma_addr, +flags); page_addr += PAGE_SIZE; dma_addr += PAGE_SIZE; } @@ -181,7 +182,7 @@ no_refresh: return rc; } -static void dma_free_seg_table(unsigned long entry) +void dma_free_seg_table(unsigned long entry) { unsigned long *sto = get_rt_sto(entry); int sx; @@ -193,21 +194,18 @@ static void dma_free_seg_table(unsigned long entry) dma_free_cpu_table(sto); } -static void dma_cleanup_tables(struct zpci_dev *zdev) +void dma_cleanup_tables(unsigned long *table) { - unsigned long *table; int rtx; - if (!zdev || !zdev-dma_table) + if (!table) return; - table = zdev-dma_table; for (rtx = 0; rtx ZPCI_TABLE_ENTRIES; rtx++) if (reg_entry_isvalid(table[rtx
Re: [RFC PATCH 0/1] iommu: Detach device from domain when removed from group
On Mon, 3 Aug 2015 17:48:55 +0200 Joerg Roedel <j...@8bytes.org> wrote: > On Tue, Jul 28, 2015 at 07:55:55PM +0200, Gerald Schaefer wrote: > > On s390, this eventually leads to a kernel panic when binding the device again to its non-vfio PCI driver, because of the missing arch-specific cleanup in detach_dev. On x86, the detach_dev callback will also not be called directly, but there is a notifier that will catch BUS_NOTIFY_REMOVED_DEVICE and eventually do the cleanup. Other architectures w/o the notifier probably have at least some kind of memory leak in this scenario, so a general fix would be nice. > This notifier is not arch-specific, but registered against the bus the iommu-ops are set for. Why does it not run on s390? Adding the notifier would of course also work on s390 (and all other affected architectures). However, it seems that the missing detach_dev issue in this scenario is not fundamentally fixed by using this notifier, it just seems to hide the symptom by chance. Adding the otherwise unneeded notifier just to work around this issue somehow doesn't seem right, also given that x86 is so far the only user of it. At least I thought it would be cleaner to fix it in common code and for all architectures. Not sure what's wrong with fixing the asymmetry as suggested in my patch, but I guess there are good reasons for having this asymmetry. For now, I'll just add the notifier to my s390 implementation and post it soon.
Re: [RFC PATCH 0/1] iommu: Detach device from domain when removed from group
On Tue, 28 Jul 2015 19:55:55 +0200 Gerald Schaefer <gerald.schae...@de.ibm.com> wrote: Hi, during IOMMU API function testing on s390 I hit the following scenario: After binding a device to vfio-pci, the user completes the VFIO_SET_IOMMU ioctl and stops, see the sample C program below. Now the device is manually removed via echo 1 > /sys/bus/pci/devices/.../remove. Although the SET_IOMMU ioctl triggered the attach_dev callback in the underlying IOMMU API, removing the device in this way won't trigger the detach_dev callback, neither during remove nor when the user program continues with closing group/container. On s390, this eventually leads to a kernel panic when binding the device again to its non-vfio PCI driver, because of the missing arch-specific cleanup in detach_dev. On x86, the detach_dev callback will also not be called directly, but there is a notifier that will catch BUS_NOTIFY_REMOVED_DEVICE and eventually do the cleanup. Other architectures w/o the notifier probably have at least some kind of memory leak in this scenario, so a general fix would be nice. My first approach was to try and fix this in VFIO code, but Alex Williamson pointed me to some asymmetry in the IOMMU code: iommu_group_add_device() will invoke the attach_dev callback, but iommu_group_remove_device() won't trigger detach_dev. Fixing this asymmetry would fix the issue for me, but is this the correct fix? Any thoughts? Ping. The suggested fix may be completely wrong, but not having detach_dev called seems like a serious issue, any feedback would be greatly appreciated.
Regards,
Gerald

Here is the sample C program to trigger the ioctl:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	int container, group, rc;

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0) {
		perror("open /dev/vfio/vfio");
		return -1;
	}

	group = open("/dev/vfio/0", O_RDWR);
	if (group < 0) {
		perror("open /dev/vfio/0");
		return -1;
	}

	rc = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	if (rc) {
		perror("ioctl VFIO_GROUP_SET_CONTAINER");
		return -1;
	}

	rc = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
	if (rc) {
		perror("ioctl VFIO_SET_IOMMU");
		return -1;
	}

	printf("Try device remove...\n");
	getchar();

	close(group);
	close(container);
	return 0;
}

Gerald Schaefer (1):
  iommu: Detach device from domain when removed from group

 drivers/iommu/iommu.c | 5 +++++
 1 file changed, 5 insertions(+)
[RFC PATCH 1/1] iommu: Detach device from domain when removed from group
This patch adds a call to __iommu_detach_device() to the
iommu_group_remove_device() function, which will trigger a missing
detach_dev callback in (at least) the following scenario:

When a user completes the VFIO_SET_IOMMU ioctl for a vfio-pci device,
and the corresponding device is removed thereafter (before any other
ioctl like VFIO_GROUP_GET_DEVICE_FD), then the detach_dev callback of
the underlying IOMMU API is never called.

This also fixes an asymmetry between iommu_group_add_device() and
iommu_group_remove_device(), where the former did an attach_dev but the
latter did no detach_dev.

Signed-off-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
---
 drivers/iommu/iommu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f286090..82ac8b3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -447,6 +447,9 @@ rename:
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
+static void __iommu_detach_device(struct iommu_domain *domain,
+				  struct device *dev);
+
 /**
  * iommu_group_remove_device - remove a device from it's current group
  * @dev: device to be removed
@@ -466,6 +469,8 @@ void iommu_group_remove_device(struct device *dev)
 				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
 
 	mutex_lock(&group->mutex);
+	if (group->domain)
+		__iommu_detach_device(group->domain, dev);
 	list_for_each_entry(tmp_device, &group->devices, list) {
 		if (tmp_device->dev == dev) {
 			device = tmp_device;
-- 
2.3.8
[RFC PATCH 0/1] iommu: Detach device from domain when removed from group
Hi,

during IOMMU API function testing on s390 I hit the following scenario:
After binding a device to vfio-pci, the user completes the
VFIO_SET_IOMMU ioctl and stops, see the sample C program below. Now the
device is manually removed via "echo 1 > /sys/bus/pci/devices/.../remove".
Although the SET_IOMMU ioctl triggered the attach_dev callback in the
underlying IOMMU API, removing the device in this way won't trigger the
detach_dev callback, neither during remove nor when the user program
continues with closing group/container.

On s390, this eventually leads to a kernel panic when binding the device
again to its non-vfio PCI driver, because of the missing arch-specific
cleanup in detach_dev. On x86, the detach_dev callback will also not be
called directly, but there is a notifier that will catch
BUS_NOTIFY_REMOVED_DEVICE and eventually do the cleanup. Other
architectures w/o the notifier probably have at least some kind of
memory leak in this scenario, so a general fix would be nice.

My first approach was to try and fix this in VFIO code, but Alex
Williamson pointed me to some asymmetry in the IOMMU code:
iommu_group_add_device() will invoke the attach_dev callback, but
iommu_group_remove_device() won't trigger detach_dev. Fixing this
asymmetry would fix the issue for me, but is this the correct fix?

Any thoughts?
Regards,
Gerald

Here is the sample C program to trigger the ioctl:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	int container, group, rc;

	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0) {
		perror("open /dev/vfio/vfio");
		return -1;
	}

	group = open("/dev/vfio/0", O_RDWR);
	if (group < 0) {
		perror("open /dev/vfio/0");
		return -1;
	}

	rc = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	if (rc) {
		perror("ioctl VFIO_GROUP_SET_CONTAINER");
		return -1;
	}

	rc = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
	if (rc) {
		perror("ioctl VFIO_SET_IOMMU");
		return -1;
	}

	printf("Try device remove...\n");
	getchar();

	close(group);
	close(container);
	return 0;
}

Gerald Schaefer (1):
  iommu: Detach device from domain when removed from group

 drivers/iommu/iommu.c | 5 +++++
 1 file changed, 5 insertions(+)

-- 
2.3.8
Re: [RFC PATCH 1/1] vfio-pci/iommu: Detach iommu group on remove path
On Wed, 22 Jul 2015 10:54:35 -0600 Alex Williamson <alex.william...@redhat.com> wrote:

> On Tue, 2015-07-21 at 19:44 +0200, Gerald Schaefer wrote:
> > When a user completes the VFIO_SET_IOMMU ioctl and the vfio-pci
> > device is removed thereafter (before any other ioctl like
> > VFIO_GROUP_GET_DEVICE_FD), then the detach_dev callback of the
> > underlying IOMMU API is never called. This patch adds a call to
> > vfio_group_try_dissolve_container() to the remove path, which will
> > trigger the missing detach_dev callback in this scenario.
> >
> > Signed-off-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
> > ---
> >  drivers/vfio/vfio.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 2fb29df..9c5c784 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -711,6 +711,8 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
> >  	return true;
> >  }
> >  
> > +static void vfio_group_try_dissolve_container(struct vfio_group *group);
> > +
> >  /*
> >   * Decrement the device reference count and wait for the device to be
> >   * removed.  Open file descriptors for the device... */
> > @@ -785,6 +787,7 @@ void *vfio_del_group_dev(struct device *dev)
> >  		}
> >  	} while (ret <= 0);
> >  
> > +	vfio_group_try_dissolve_container(group);
> >  	vfio_group_put(group);
> >  
> >  	return device_data;
>
> This won't work, vfio_group_try_dissolve_container() decrements
> container_users, which an unused device is not. Imagine if we had more
> than one device in the iommu group, one device is removed and the
> container is dissolved despite the user holding a reference and other
> viable devices remaining. Additionally, from an isolation perspective,
> an unbind from vfio-pci should not pull the device out of the iommu
> domain, it's part of the domain because it's not isolated and that
> continues even after unbind.
>
> I think what you want to do is detach a device from the iommu domain
> only when it's being removed from the iommu group, such as through
> iommu_group_remove_device().
> We already have a bit of an asymmetry there as iommu_group_add_device()
> will add devices to the currently active iommu domain for the group,
> but iommu_group_remove_device() does not appear to do the reverse.
> Thanks,

Interesting, I haven't noticed this asymmetry so far. Do you mean
something like this:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f286090..82ac8b3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -447,6 +447,9 @@ rename:
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
+static void __iommu_detach_device(struct iommu_domain *domain,
+				  struct device *dev);
+
 /**
  * iommu_group_remove_device - remove a device from it's current group
  * @dev: device to be removed
@@ -466,6 +469,8 @@ void iommu_group_remove_device(struct device *dev)
 				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
 
 	mutex_lock(&group->mutex);
+	if (group->domain)
+		__iommu_detach_device(group->domain, dev);
 	list_for_each_entry(tmp_device, &group->devices, list) {
 		if (tmp_device->dev == dev) {
 			device = tmp_device;

This would also fix the issue in my scenario, but as before, that
doesn't necessarily mean it is the correct fix. Adding the iommu list
and maintainer to cc.

Joerg, what do you think? (see https://lkml.org/lkml/2015/7/21/635 for
the problem description)
Re: [RFC PATCH 1/1] vfio-pci/iommu: Detach iommu group on remove path
On Wed, 22 Jul 2015 11:10:57 -0600 Alex Williamson <alex.william...@redhat.com> wrote:

> On Wed, 2015-07-22 at 10:54 -0600, Alex Williamson wrote:
> > On Tue, 2015-07-21 at 19:44 +0200, Gerald Schaefer wrote:
> > > When a user completes the VFIO_SET_IOMMU ioctl and the vfio-pci
> > > device is removed thereafter (before any other ioctl like
> > > VFIO_GROUP_GET_DEVICE_FD), then the detach_dev callback of the
> > > underlying IOMMU API is never called. This patch adds a call to
> > > vfio_group_try_dissolve_container() to the remove path, which will
> > > trigger the missing detach_dev callback in this scenario.
> > >
> > > Signed-off-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
> > > ---
> > >  drivers/vfio/vfio.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index 2fb29df..9c5c784 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -711,6 +711,8 @@ static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
> > >  	return true;
> > >  }
> > >  
> > > +static void vfio_group_try_dissolve_container(struct vfio_group *group);
> > > +
> > >  /*
> > >   * Decrement the device reference count and wait for the device to be
> > >   * removed.  Open file descriptors for the device... */
> > > @@ -785,6 +787,7 @@ void *vfio_del_group_dev(struct device *dev)
> > >  		}
> > >  	} while (ret <= 0);
> > >  
> > > +	vfio_group_try_dissolve_container(group);
> > >  	vfio_group_put(group);
> > >  
> > >  	return device_data;
> >
> > This won't work, vfio_group_try_dissolve_container() decrements
> > container_users, which an unused device is not. Imagine if we had
> > more than one device in the iommu group, one device is removed and
> > the container is dissolved despite the user holding a reference and
> > other viable devices remaining. Additionally, from an isolation
> > perspective, an unbind from vfio-pci should not pull the device out
> > of the iommu domain, it's part of the domain because it's not
> > isolated and that continues even after unbind.
> >
> > I think what you want to do is detach a device from the iommu domain
> > only when it's being removed from the iommu group, such as through
> > iommu_group_remove_device().
> > We already have a bit of an asymmetry there as
> > iommu_group_add_device() will add devices to the currently active
> > iommu domain for the group, but iommu_group_remove_device() does not
> > appear to do the reverse. Thanks,
>
> BTW, VT-d on x86 avoids a leak using its own notifier_block,
> drivers/iommu/intel-iommu.c:device_notifier() catches
> BUS_NOTIFY_REMOVED_DEVICE and removes the device from the domain (the
> domain_exit() there is only used for non-IOMMU-API domains). It's
> possible that's the only IOMMU driver that avoids a leak due to the
> scenario you describe. Thanks,

Thanks, that's good to know, so as a last resort I could also use the
notifier to work around the issue. But x86 seems to be the only arch
using this notifier so far, so a general fix would be nice.
Re: [PATCH linux-next] iommu: add iommu for s390 platform
On Thu, 23 Oct 2014 16:04:37 +0200 Frank Blaschka <blasc...@linux.vnet.ibm.com> wrote:

> On Thu, Oct 23, 2014 at 02:41:15PM +0200, Joerg Roedel wrote:
> > On Wed, Oct 22, 2014 at 05:43:20PM +0200, Frank Blaschka wrote:
> > > Basically there are no limitations. Depending on the s390 machine
> > > generation a device starts its IOVA at a specific address (announced
> > > by the HW). But as I already told each device starts at the same
> > > address. I think this prevents having multiple devices on the same
> > > IOMMU domain.
> >
> > Why, each device has its own IOVA address space, so IOVA A could map
> > to physical address X for one device and to Y for another, no? And if
> > you point multiple devices to the same dma_table they share the
> > mappings (and thus the address space). Or am I getting something
> > wrong?
>
> yes, you are absolutely right. There is a per-device dma_table. There
> is no general IOMMU device but each pci device has its own IOMMU
> translation capability.
>
> > I see, in this way it is similar to ARM where there is often also one
> > IOMMU per master device.
>
> Is there a possibility the IOMMU domain can support e.g. something like
>
>   IOVA 0x1 -> pci device 1
>   IOVA 0x1 -> pci device 2
>
> > A domain is basically an abstraction for a DMA page table (or a
> > dma_table, as you call it on s390). So you can easily create similar
> > mappings for more than one device with it.
>
> ok, maybe I was too close to the existing s390 dma implementation or
> simply wrong, maybe Sebastian or Gerald can give more background

Not sure if I understood the concept of IOMMU domains right. But if this
is about having multiple devices in the same domain, so that
iommu_ops->map will establish the _same_ DMA mapping on _all_ registered
devices, then this should be possible. We cannot have shared DMA tables
because each device gets its own DMA table allocated during device
initialization. But we could just keep all devices from one domain in a
list and then call dma_update_trans() for all devices during
iommu_ops->map/unmap.
Gerald
Re: [PATCH linux-next] iommu: add iommu for s390 platform
On Mon, 27 Oct 2014 17:25:02 +0100 Joerg Roedel <j...@8bytes.org> wrote:

> On Mon, Oct 27, 2014 at 03:32:01PM +0100, Gerald Schaefer wrote:
> > Not sure if I understood the concept of IOMMU domains right. But if
> > this is about having multiple devices in the same domain, so that
> > iommu_ops->map will establish the _same_ DMA mapping on _all_
> > registered devices, then this should be possible.
>
> Yes, this is what domains are about. A domain describes a set of DMA
> mappings which can be assigned to multiple devices in parallel.
>
> > We cannot have shared DMA tables because each device gets its own
> > DMA table allocated during device initialization.
>
> Is there some hardware reason for this, or is that just an
> implementation detail that can be changed? In other words, does the
> hardware allow to use the same DMA table for multiple devices?

Yes, the HW would allow shared DMA tables, but the implementation would
need some non-trivial changes. For example, we have a per-device
spin_lock for DMA table manipulations, and the code in
arch/s390/pci/pci_dma.c knows nothing about IOMMU domains or shared DMA
tables, it just implements a set of dma_map_ops.

Of course this would also go horribly wrong if a device was already in
use (via the current dma_map_ops), but I guess using devices through
the IOMMU_API prevents using them otherwise?

> > But we could just keep all devices from one domain in a list and
> > then call dma_update_trans() for all devices during
> > iommu_ops->map/unmap.
>
> This sounds complicated. Note that a device can be assigned to a
> domain that already has existing mappings. In this case you need to
> make sure that the new device inherits these mappings (and destroy all
> old mappings for the device that possibly exist). I think it is much
> easier to use the same DMA table for all devices in a domain, if the
> hardware allows that.

Yes, in this case, having one DMA table per domain and sharing it
between all devices in that domain sounds like a good idea.
However, I can't think of any use case for this, and Frank probably had
a very special use case in mind where this scenario doesn't appear,
hence the one-device-per-domain restriction.

So, if having multiple devices per domain is a must, then we probably
need a thorough rewrite of the arch/s390/pci/pci_dma.c code.

Gerald
Re: [PATCH linux-next] iommu: add iommu for s390 platform
On Mon, 27 Oct 2014 18:58:35 +0100 Joerg Roedel <j...@8bytes.org> wrote:

On Mon, Oct 27, 2014 at 06:02:19PM +0100, Gerald Schaefer wrote:
On Mon, 27 Oct 2014 17:25:02 +0100 Joerg Roedel <j...@8bytes.org> wrote:

Is there some hardware reason for this, or is that just an
implementation detail that can be changed? In other words, does the
hardware allow to use the same DMA table for multiple devices?

Yes, the HW would allow shared DMA tables, but the implementation would
need some non-trivial changes. For example, we have a per-device
spin_lock for DMA table manipulations, and the code in
arch/s390/pci/pci_dma.c knows nothing about IOMMU domains or shared DMA
tables, it just implements a set of dma_map_ops.

I think it would make sense to move the DMA table handling code and the
dma_map_ops implementation to the IOMMU driver too. This is also how
some other IOMMU drivers implement it.

Yes, I feared that this would come up, but I agree that it looks like
the best solution, at least if we really want/need the IOMMU API for
s390 now. I'll need to discuss this with Frank, he seems to be on
vacation this week. Thanks for your feedback and explanations!

The plan is to consolidate the dma_ops implementations someday and have
a common implementation that works with all IOMMU drivers across
architectures. This would benefit s390 as well and obsoletes the
driver-specific dma_ops implementation.

Of course this would also go horribly wrong if a device was already in
use (via the current dma_map_ops), but I guess using devices through
the IOMMU_API prevents using them otherwise?

This is taken care of by the device drivers. A driver for a device
either uses the DMA-API or does its own management of DMA mappings
using the IOMMU-API. VFIO is an example for the latter case.

I think it is much easier to use the same DMA table for all devices in
a domain, if the hardware allows that.
Yes, in this case, having one DMA table per domain and sharing it
between all devices in that domain sounds like a good idea. However, I
can't think of any use case for this, and Frank probably had a very
special use case in mind where this scenario doesn't appear, hence the
one-device-per-domain restriction.

One use case is device access from user-space via VFIO. A userspace
process might want to access multiple devices at the same time, and
VFIO would implement this by assigning all of these devices to the same
IOMMU domain.

This requirement also comes from the IOMMU-API itself. The intention of
the API is to make different IOMMUs look the same through the API, and
this is violated when drivers implement a 1-1 domain-device mapping.

So, if having multiple devices per domain is a must, then we probably
need a thorough rewrite of the arch/s390/pci/pci_dma.c code.

Yes, this is a requirement for new IOMMU drivers. We already have
drivers implementing the same 1-1 relation and we are about to fix
them. But I don't want to add new drivers doing the same.

Joerg