Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-07 Thread Gerald Schaefer
On Thu, 7 Oct 2021 12:59:32 +0200
Karsten Graul  wrote:

[...]
> > 
> >>> BTW, there is already a WARN in the add_dma_entry() path, related
> >>> to cacheline overlap and -EEXIST:
> >>>
> >>> add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
> >>> active_cacheline_inc_overlap()
> >>>
> >>> That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
> >>> Not familiar with that code, but it seems that there are now two
> >>> warnings for more or less the same thing, and the new warning is much more
> >>> prone to false-positives.
> >>>
> >>> How do these 2 warnings relate, are they both really necessary?
> >>> I think the new warning was only introduced because of some old
> >>> TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
> >>> ("dma-debug: report -EEXIST errors in add_dma_entry").
> > 
> > AFAICS they are different things. I believe the new warning is supposed to 
> > be for the fundamentally incorrect API usage (as above) of mapping 
> > different regions overlapping within the same cacheline. The existing one 
> > is about dma-debug losing internal consistency when tracking the *same* 
> > region being mapped multiple times, which is a legal thing to do - e.g. 
> > buffer sharing between devices - but if anyone's doing it to excess that's 
> > almost certainly a bug (i.e. they probably intended to unmap it in between 
> > but missed that out).
> 
> Thanks for the explanation Robin. 
> 
> In our case it's really that a buffer is mapped twice for 2 different devices 
> which we use in SMC to provide failover capabilities. We see that -EEXIST is 
> returned when a buffer is mapped for the second device. Since there is a 
> maximum of 2 parallel mappings we never see the warning shown by 
> active_cacheline_inc_overlap() because we don't exceed 
> ACTIVE_CACHELINE_MAX_OVERLAP.
> 
> So how do we deal with this kind of "legal thing"? It looks like there is no way to 
> suppress the newly introduced EEXIST warning for that case?

Thanks Karsten, very interesting. We assumed so far that we hit the
same case as Ioana, i.e. having multiple sg elements in one cacheline.
With debug output it now seems that we hit a completely different
case, not at all related to any cacheline or coherency issues.

So it really seems that the new warning is basically the same
as the already present one, with the difference that it already
triggers on the first occurrence. Looking at the code again, it
also seems rather obvious now...

IIUC, from what Robin described, this means that the "legal thing
to do - e.g. buffer sharing between devices" will now immediately
trigger the new warning? Not sure if I missed something (again),
because then I would expect many more reports on this, and of
course it would then obviously be false-positive.
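
To make that concrete, the legal double mapping looks roughly like this (a
sketch with made-up variables and no error handling; not the SMC code itself):

```c
/*
 * Sketch only: the same buffer is mapped for two different devices,
 * which the DMA API allows.  dma-debug tracks active mappings in one
 * global radix tree keyed by the starting cacheline, so the second
 * dma_map_single() hits -EEXIST in active_cacheline_insert() and
 * immediately triggers the new report.
 */
void *buf = kmalloc(SZ_4K, GFP_KERNEL);
dma_addr_t dma1, dma2;

dma1 = dma_map_single(dev1, buf, SZ_4K, DMA_TO_DEVICE);
dma2 = dma_map_single(dev2, buf, SZ_4K, DMA_TO_DEVICE); /* -EEXIST inside dma-debug */
```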


[PATCH] dma-debug: fix sg checks in debug_dma_map_sg()

2021-10-06 Thread Gerald Schaefer
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg().
check_for_illegal_area() checks for overlap of sg elements with kernel text
or rodata. It is called with sg_dma_len(s) instead of s->length as
parameter. After the call to ->map_sg(), sg_dma_len() will contain the
length of possibly combined sg elements in the DMA address space, and not
the individual sg element length, which would be s->length.

The check will then use the physical start address of an sg element, and
add the DMA length for the overlap check, which could result in the false
warning, because the DMA length can be larger than the actual single sg
element length.

In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if
any of them were combined in ->map_sg().

Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").

While at it, as suggested by Robin Murphy, also move check_for_stack()
inside the new loop, as it is similarly concerned with validating the
individual sg elements.

Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schae...@linux.ibm.com
Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
Signed-off-by: Gerald Schaefer 
---
 kernel/dma/debug.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 95445bd6eb72..d968a429f0d1 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -1289,6 +1289,13 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (unlikely(dma_debug_disabled()))
 		return;
 
+	for_each_sg(sg, s, nents, i) {
+		check_for_stack(dev, sg_page(s), s->offset);
+		if (!PageHighMem(sg_page(s))) {
+			check_for_illegal_area(dev, sg_virt(s), s->length);
+		}
+	}
+
 	for_each_sg(sg, s, mapped_ents, i) {
 		entry = dma_entry_alloc();
 		if (!entry)
@@ -1304,12 +1311,6 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		entry->sg_call_ents   = nents;
 		entry->sg_mapped_ents = mapped_ents;
 
-		check_for_stack(dev, sg_page(s), s->offset);
-
-		if (!PageHighMem(sg_page(s))) {
-			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
-		}
-
 		check_sg_segment(dev, s);
 
 		add_dma_entry(entry);
-- 
2.25.1
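
For readers less familiar with scatterlists, the distinction the patch relies
on can be illustrated like this (a sketch only, not part of the patch; the
helper and the prints are made up):

```c
static void dump_sg_views(struct device *dev, struct scatterlist *sgl, int nents)
{
	struct scatterlist *s;
	int mapped_ents, i;

	mapped_ents = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
	if (!mapped_ents)
		return;

	/* CPU-side view: every original element, each with its own s->length */
	for_each_sg(sgl, s, nents, i)
		pr_info("elem %d: virt=%p len=%u\n", i, sg_virt(s), s->length);

	/* DMA-side view: possibly fewer, possibly larger (merged) segments */
	for_each_sg(sgl, s, mapped_ents, i)
		pr_info("seg  %d: dma=%pad len=%u\n",
			i, &sg_dma_address(s), sg_dma_len(s));

	dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
}
```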



Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-06 Thread Gerald Schaefer
On Wed, 6 Oct 2021 15:23:36 +0100
Robin Murphy  wrote:

> On 2021-10-06 14:10, Gerald Schaefer wrote:
> > On Fri, 1 Oct 2021 14:52:56 +0200
> > Gerald Schaefer  wrote:
> > 
> >> On Thu, 30 Sep 2021 15:37:33 +0200
> >> Karsten Graul  wrote:
> >>
> >>> On 14/09/2021 17:45, Ioana Ciornei wrote:
> >>>> On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> >>>>> +DPAA2, netdev maintainers
> >>>>> Hi,
> >>>>>
> >>>>> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> >>>>>> Since, overlapping mappings are not supported by the DMA API we should
> >>>>>> report an error if active_cacheline_insert returns -EEXIST.
> >>>>>
> >>>>> It seems this patch found a victim. I was trying to run iperf3 on a
> >>>>> honeycomb (5.14.0, fedora 35) and the console is blasting this error 
> >>>>> message
> >>>>> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, 
> >>>>> which
> >>>>> is attached below.
> >>>>>
> >>>>
> >>>> These frags are allocated by the stack, transformed into a scatterlist
> >>>> by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> >>>> dpaa2-eth's decision to use two fragments from the same page (that will
> >>>> also end up in the same cacheline) in two different in-flight skbs.
> >>>>
> >>>> Is this behavior normal?
> >>>>
> >>>
> >>> We see the same problem here and it started with 5.15-rc2 in our nightly 
> >>> CI runs.
> >>> The CI has panic_on_warn enabled so we see the panic every day now.
> >>
> >> Adding a WARN for a case that can be a false positive seems not
> >> acceptable, exactly for this reason (kernel panic on unaffected
> >> systems).
> >>
> >> So I guess it boils down to the question whether the behavior that Ioana
> >> described is legit behavior, on a system that is dma coherent. We
> >> are apparently hitting the same scenario, although it could not yet be
> >> reproduced with debug printks for some reason.
> >>
> >> If the answer is yes, then please remove at least the WARN, so that
> >> it will not make systems crash that behave correctly and have
> >> panic_on_warn set. Even a normal printk feels wrong to me in that
> >> case; it really sounds rather like you want to fix / better refine
> >> the overlap check, if you want to report anything here.
> > 
> > Dan, Christoph, any opinion?
> > 
> > So far it all looks a lot like a false positive, so could you please
> > see that those patches get reverted? I do wonder a bit why this is
> > not an issue for others, we surely cannot be the only ones running
> > CI with panic_on_warn.
> 
> What convinces you it's a false-positive? I'm hardly familiar with most 
> of that callstack, but it appears to be related to mlx5, and I know that 
> exists on expansion cards which could be plugged into a system with 
> non-coherent PCIe where partial cacheline overlap *would* be a real 
> issue. Of course it's dubious that there are many real use-cases for 
> plugging a NIC with a 4-figure price tag into a little i.MX8 or 
> whatever, but the point is that it *should* still work correctly.

I would assume that a *proper* warning would check if we see the
"non-coherent" case, e.g. by using dev_is_dma_coherent() and only
report with a potentially fatal WARN on systems where it is appropriate.

However, I am certainly even less familiar with all that, and might
just have gotten the wrong impression here.
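
Purely as a sketch of that idea, such a gate could look roughly like this
(dev_is_dma_coherent() comes from <linux/dma-map-ops.h>; the helper below is
made up and is not the upstream dma-debug code):

```c
/*
 * Sketch only: treat a shared cacheline as a real problem only when the
 * device is not DMA-coherent, i.e. where partial-cacheline overlap can
 * actually corrupt data.
 */
static void report_cacheline_overlap(struct device *dev)
{
	if (dev_is_dma_coherent(dev)) {
		pr_debug("%s: cacheline shared by overlapping mappings (coherent device)\n",
			 dev_name(dev));
		return;
	}
	pr_warn("%s: overlapping mappings within a single cacheline\n",
		dev_name(dev));
}
```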

Also not sure about the mlx5 relation here; it does not really show
in the call trace, only in the err_printk() output, probably
from dev_driver_string(dev) or dev_name(dev). But I do not see
where mlx5 code would be involved here.

[...]
> According to the streaming DMA API documentation, it is *not* valid:
> 
> ".. warning::
> 
>Memory coherency operates at a granularity called the cache
>line width.  In order for memory mapped by this API to operate
>correctly, the mapped region must begin exactly on a cache line
>boundary and end exactly on one (to prevent two separately mapped
>regions from sharing a single cache line).  Since the cache line size
>may not be known at compile time, the API will not enforce this
>requirement.  Therefore, it is recommended that driver writers who
>don't take special care to determine the cache line size at run time
>   
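
For reference, the requirement quoted above translates, on the driver side,
into something like the following sketch (made-up variables; error paths
trimmed; dma_get_cache_alignment() is the helper the documentation points at):

```c
/*
 * Sketch only: size the streaming buffer so it cannot share a cacheline
 * with unrelated data, as required by the quoted documentation.
 * kmalloc() buffers are at least ARCH_KMALLOC_MINALIGN aligned.
 */
size_t align = dma_get_cache_alignment();
size_t size  = ALIGN(len, align);
void *buf    = kmalloc(size, GFP_KERNEL);
dma_addr_t handle;

if (!buf)
	return -ENOMEM;
handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, handle)) {
	kfree(buf);
	return -EIO;
}
```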

Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-06 Thread Gerald Schaefer
On Wed, 6 Oct 2021 15:10:43 +0200
Gerald Schaefer  wrote:

> On Fri, 1 Oct 2021 14:52:56 +0200
> Gerald Schaefer  wrote:
> 
> > On Thu, 30 Sep 2021 15:37:33 +0200
> > Karsten Graul  wrote:
> > 
> > > On 14/09/2021 17:45, Ioana Ciornei wrote:
> > > > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> > > >> +DPAA2, netdev maintainers
> > > >> Hi,
> > > >>
> > > >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> > > >>> Since, overlapping mappings are not supported by the DMA API we should
> > > >>> report an error if active_cacheline_insert returns -EEXIST.
> > > >>
> > > >> It seems this patch found a victim. I was trying to run iperf3 on a
> > > >> honeycomb (5.14.0, fedora 35) and the console is blasting this error 
> > > >> message
> > > >> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, 
> > > >> which
> > > >> is attached below.
> > > >>
> > > > 
> > > > These frags are allocated by the stack, transformed into a scatterlist
> > > > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > > > dpaa2-eth's decision to use two fragments from the same page (that will
> > > > also end up in the same cacheline) in two different in-flight skbs.
> > > > 
> > > > Is this behavior normal?
> > > > 
> > > 
> > > We see the same problem here and it started with 5.15-rc2 in our nightly 
> > > CI runs.
> > > The CI has panic_on_warn enabled so we see the panic every day now.
> > 
> > Adding a WARN for a case that can be a false positive seems not
> > acceptable, exactly for this reason (kernel panic on unaffected
> > systems).
> > 
> > So I guess it boils down to the question whether the behavior that Ioana
> > described is legit behavior, on a system that is dma coherent. We
> > are apparently hitting the same scenario, although it could not yet be
> > reproduced with debug printks for some reason.
> > 
> > If the answer is yes, then please remove at least the WARN, so that
> > it will not make systems crash that behave correctly and have
> > panic_on_warn set. Even a normal printk feels wrong to me in that
> > case; it really sounds rather like you want to fix / better refine
> > the overlap check, if you want to report anything here.
> 
> Dan, Christoph, any opinion?
> 
> So far it all looks a lot like a false positive, so could you please
> see that those patches get reverted? I do wonder a bit why this is
> not an issue for others, we surely cannot be the only ones running
> CI with panic_on_warn.

For reference, we are talking about these commits:

2b4bbc6231d7 ("dma-debug: report -EEXIST errors in add_dma_entry")
510e1a724ab1 ("dma-debug: prevent an error message from causing runtime problems")

The latter introduced the WARN (through err_printk usage), and should
be reverted if it can be false-positive, but both seem wrong in that
case.


Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-06 Thread Gerald Schaefer
On Fri, 1 Oct 2021 14:52:56 +0200
Gerald Schaefer  wrote:

> On Thu, 30 Sep 2021 15:37:33 +0200
> Karsten Graul  wrote:
> 
> > On 14/09/2021 17:45, Ioana Ciornei wrote:
> > > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> > >> +DPAA2, netdev maintainers
> > >> Hi,
> > >>
> > >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> > >>> Since, overlapping mappings are not supported by the DMA API we should
> > >>> report an error if active_cacheline_insert returns -EEXIST.
> > >>
> > >> It seems this patch found a victim. I was trying to run iperf3 on a
> > >> honeycomb (5.14.0, fedora 35) and the console is blasting this error 
> > >> message
> > >> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, 
> > >> which
> > >> is attached below.
> > >>
> > > 
> > > These frags are allocated by the stack, transformed into a scatterlist
> > > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > > dpaa2-eth's decision to use two fragments from the same page (that will
> > > also end up in the same cacheline) in two different in-flight skbs.
> > > 
> > > Is this behavior normal?
> > > 
> > 
> > We see the same problem here and it started with 5.15-rc2 in our nightly CI 
> > runs.
> > The CI has panic_on_warn enabled so we see the panic every day now.
> 
> Adding a WARN for a case that can be a false positive seems not
> acceptable, exactly for this reason (kernel panic on unaffected
> systems).
> 
> So I guess it boils down to the question whether the behavior that Ioana
> described is legit behavior, on a system that is dma coherent. We
> are apparently hitting the same scenario, although it could not yet be
> reproduced with debug printks for some reason.
> 
> If the answer is yes, then please remove at least the WARN, so that
> it will not make systems crash that behave correctly and have
> panic_on_warn set. Even a normal printk feels wrong to me in that
> case; it really sounds rather like you want to fix / better refine
> the overlap check, if you want to report anything here.

Dan, Christoph, any opinion?

So far it all looks a lot like a false positive, so could you please
see that those patches get reverted? I do wonder a bit why this is
not an issue for others, we surely cannot be the only ones running
CI with panic_on_warn.

We would need to disable DEBUG_DMA if this WARN stays in, which
would be a shame. Of course, in theory, this might also indicate
some real bug, but there really is no sign of that so far.

Having multiple sg elements in the same page (or cacheline) is
valid, correct? And this is also not a decision of the driver
IIUC, so if it were a bug, it should be addressed in common code,
correct?
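
For illustration, the scenario in question boils down to something like this
(a sketch with made-up sizes; not dpaa2-eth or networking code):

```c
/*
 * Two independently mapped buffers carved from the same page.  dma-debug
 * keys each mapping by the cacheline containing its start address, so
 * with 64-byte cachelines both mappings below collide on cacheline 0 and
 * the second active_cacheline_insert() returns -EEXIST, even though
 * nothing is wrong on a cache-coherent system.
 */
void *page_buf = (void *)__get_free_page(GFP_KERNEL);
dma_addr_t a, b;

a = dma_map_single(dev, page_buf, 32, DMA_TO_DEVICE);       /* bytes  0..31 */
b = dma_map_single(dev, page_buf + 32, 32, DMA_TO_DEVICE);  /* bytes 32..63 */
```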

> 
> BTW, there is already a WARN in the add_dma_entry() path, related
> to cacheline overlap and -EEXIST:
> 
> add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
> active_cacheline_inc_overlap()
> 
> That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
> Not familiar with that code, but it seems that there are now two
> warnings for more or less the same thing, and the new warning is much more
> prone to false-positives.
> 
> How do these 2 warnings relate, are they both really necessary?
> I think the new warning was only introduced because of some old
> TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
> ("dma-debug: report -EEXIST errors in add_dma_entry").
> 
> That comment was initially added by Dan a long time ago, and he
> added several fix-ups for overlap detection after that, including
> the "overlap > ACTIVE_CACHELINE_MAX_OVERLAP" stuff in
> active_cacheline_inc_overlap(). So could it be that the TODO
> comment was simply not valid any more, and would better be removed
> instead of adding new / duplicate warnings that also generate
> false-positives and kernel crashes?



Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry

2021-10-01 Thread Gerald Schaefer
On Thu, 30 Sep 2021 15:37:33 +0200
Karsten Graul  wrote:

> On 14/09/2021 17:45, Ioana Ciornei wrote:
> > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> >> +DPAA2, netdev maintainers
> >> Hi,
> >>
> >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> >>> Since, overlapping mappings are not supported by the DMA API we should
> >>> report an error if active_cacheline_insert returns -EEXIST.
> >>
> >> It seems this patch found a victim. I was trying to run iperf3 on a
> >> honeycomb (5.14.0, fedora 35) and the console is blasting this error 
> >> message
> >> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, which
> >> is attached below.
> >>
> > 
> > These frags are allocated by the stack, transformed into a scatterlist
> > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > dpaa2-eth's decision to use two fragments from the same page (that will
> > also end up in the same cacheline) in two different in-flight skbs.
> > 
> > Is this behavior normal?
> > 
> 
> We see the same problem here and it started with 5.15-rc2 in our nightly CI 
> runs.
> The CI has panic_on_warn enabled so we see the panic every day now.

Adding a WARN for a case that can be a false positive seems not
acceptable, exactly for this reason (kernel panic on unaffected
systems).

So I guess it boils down to the question whether the behavior that Ioana
described is legit behavior, on a system that is dma coherent. We
are apparently hitting the same scenario, although it could not yet be
reproduced with debug printks for some reason.

If the answer is yes, then please remove at least the WARN, so that
it will not make systems crash that behave correctly and have
panic_on_warn set. Even a normal printk feels wrong to me in that
case; it really sounds rather like you want to fix / better refine
the overlap check, if you want to report anything here.

BTW, there is already a WARN in the add_dma_entry() path, related
to cacheline overlap and -EEXIST:

add_dma_entry() -> active_cacheline_insert() -> -EEXIST ->
active_cacheline_inc_overlap()

That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP".
Not familiar with that code, but it seems that there are now two
warnings for more or less the same thing, and the new warning is much more
prone to false-positives.

How do these 2 warnings relate, are they both really necessary?
I think the new warning was only introduced because of some old
TODO comment in add_dma_entry(), see commit 2b4bbc6231d78
("dma-debug: report -EEXIST errors in add_dma_entry").

That comment was initially added by Dan a long time ago, and he
added several fix-ups for overlap detection after that, including
the "overlap > ACTIVE_CACHELINE_MAX_OVERLAP" stuff in
active_cacheline_inc_overlap(). So could it be that the TODO
comment was simply not valid any more, and would better be removed
instead of adding new / duplicate warnings that also generate
false-positives and kernel crashes?


Re: [RFC PATCH 1/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()

2021-07-06 Thread Gerald Schaefer
On Tue, 6 Jul 2021 10:22:40 +0100
Robin Murphy  wrote:

> On 2021-07-05 19:52, Gerald Schaefer wrote:
> > The following warning occurred sporadically on s390:
> > DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or 
> > rodata [addr=48cc5e2f] [len=131072]
> > WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 
> > check_for_illegal_area+0xa8/0x138
> > 
> > It is a false-positive warning, due to broken logic in debug_dma_map_sg().
> > check_for_illegal_area() should check for overlap of sg elements with kernel
> > text or rodata. It is called with sg_dma_len(s) instead of s->length as
> > parameter. After the call to ->map_sg(), sg_dma_len() contains the length
> > of possibly combined sg elements in the DMA address space, and not the
> > individual sg element length, which would be s->length.
> > 
> > The check will then use the kernel start address of an sg element, and add
> > the DMA length for overlap check, which can result in the false-positive
> > warning because the DMA length can be larger than the actual single sg
> > element length in kernel address space.
> > 
> > In addition, the call to check_for_illegal_area() happens in the iteration
> > over mapped_ents, which will not include all individual sg elements if
> > any of them were combined in ->map_sg().
> > 
> > Fix this by using s->length instead of sg_dma_len(s). Also put the call to
> > check_for_illegal_area() in a separate loop, iterating over all the
> > individual sg elements ("nents" instead of "mapped_ents").
> > 
> > Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
> > Tested-by: Niklas Schnelle 
> > Signed-off-by: Gerald Schaefer 
> > ---
> >   kernel/dma/debug.c | 10 ++
> >   1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
> > index 14de1271463f..d7d44b7fe7e2 100644
> > --- a/kernel/dma/debug.c
> > +++ b/kernel/dma/debug.c
> > @@ -1299,6 +1299,12 @@ void debug_dma_map_sg(struct device *dev, struct 
> > scatterlist *sg,
> > if (unlikely(dma_debug_disabled()))
> > return;
> >   
> > +   for_each_sg(sg, s, nents, i) {
> > +   if (!PageHighMem(sg_page(s))) {
> > +   check_for_illegal_area(dev, sg_virt(s), s->length);
> > +   }
> > +   }
> > +
> > for_each_sg(sg, s, mapped_ents, i) {
> > entry = dma_entry_alloc();
> > if (!entry)
> > @@ -1316,10 +1322,6 @@ void debug_dma_map_sg(struct device *dev, struct 
> > scatterlist *sg,
> >   
> > check_for_stack(dev, sg_page(s), s->offset);
> 
> Strictly this should probably be moved to the new loop as well, as it is 
> similarly concerned with validating the source segments rather than the 
> DMA mappings - I think with virtually-mapped stacks it might technically 
> be possible for a stack page to be physically adjacent to a "valid" page 
> such that it could get merged and overlooked if it were near the end of 
> the list, although in fairness that would probably be indicative of 
> something having gone far more fundamentally wrong. Otherwise, the 
> overall reasoning looks sound to me.

I see, good point. I think I can add this to my patch, under a different
subject like "dma-debug: fix sg checks in debug_dma_map_sg()".

However, I do not quite understand why check_for_stack() does not also
consider s->length. It seems to check only the first page of an sg
element.

So, shouldn't check_for_stack() behave similar to check_for_illegal_area(),
i.e. check all source sg elements for overlap with the task stack area?

If yes, then this probably should be a separate patch, but I can try
to come up with something and send a new RFC with two patches. Maybe
check_for_stack() can also be integrated into check_for_illegal_area(),
they are both called at the same places. And mapping memory from the
stack also sounds rather illegal.
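
To make the idea concrete, one possible shape of such a check (a rough sketch
only; the helper name is made up, and check_for_stack() keeps its existing
(dev, page, offset) arguments):

```c
/*
 * Sketch: walk every page covered by the sg element instead of only the
 * first one, and run the existing per-page stack check on each of them.
 */
static void check_sg_for_stack(struct device *dev, struct scatterlist *s)
{
	struct page *page = sg_page(s);
	unsigned int offset = s->offset;
	unsigned int left = s->length;

	while (left) {
		unsigned int chunk = min_t(unsigned int, left, PAGE_SIZE - offset);

		check_for_stack(dev, page, offset);	/* existing per-page check */
		page++;
		offset = 0;
		left -= chunk;
	}
}
```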


[RFC PATCH 1/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()

2021-07-05 Thread Gerald Schaefer
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg().
check_for_illegal_area() should check for overlap of sg elements with kernel
text or rodata. It is called with sg_dma_len(s) instead of s->length as
parameter. After the call to ->map_sg(), sg_dma_len() contains the length
of possibly combined sg elements in the DMA address space, and not the
individual sg element length, which would be s->length.

The check will then use the kernel start address of an sg element, and add
the DMA length for overlap check, which can result in the false-positive
warning because the DMA length can be larger than the actual single sg
element length in kernel address space.

In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if
any of them were combined in ->map_sg().

Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").

Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
Tested-by: Niklas Schnelle 
Signed-off-by: Gerald Schaefer 
---
 kernel/dma/debug.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 14de1271463f..d7d44b7fe7e2 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -1299,6 +1299,12 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (unlikely(dma_debug_disabled()))
 		return;
 
+	for_each_sg(sg, s, nents, i) {
+		if (!PageHighMem(sg_page(s))) {
+			check_for_illegal_area(dev, sg_virt(s), s->length);
+		}
+	}
+
 	for_each_sg(sg, s, mapped_ents, i) {
 		entry = dma_entry_alloc();
 		if (!entry)
@@ -1316,10 +1322,6 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
 
 		check_for_stack(dev, sg_page(s), s->offset);
 
-		if (!PageHighMem(sg_page(s))) {
-			check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s));
-		}
-
 		check_sg_segment(dev, s);
 
 		add_dma_entry(entry);
-- 
2.25.1



[RFC PATCH 0/1] dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()

2021-07-05 Thread Gerald Schaefer
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=48cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138

It is a false-positive warning, due to broken logic in debug_dma_map_sg(),
see patch description. In short, the check is mixing up kernel start address
for sg elements with the length of possibly combined sg elements in the DMA
address space.

I am a bit confused by the whole logic, and not sure what would be the best
way to fix this. The false-positives should have been possible since commit
884d05970bfb ("dma-debug: use sg_dma_len accessor"), which is included since
2.6.31. Also, it seems to me that even before that commit, the check would
have been wrong, or at least incomplete, because it is located in a loop
that iterates over mapped_ents instead of nents. So it would not check all
physical sg elements if any were combined in DMA address space.

Gerald Schaefer (1):
  dma-debug: fix check_for_illegal_area() in debug_dma_map_sg()

 kernel/dma/debug.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

-- 
2.25.1



[PATCH] iommu/intel-iommu: fix memory leak in intel_iommu_put_resv_regions()

2019-01-16 Thread Gerald Schaefer
Commit 9d3a4de4cb8d ("iommu: Disambiguate MSI region types") changed
the reserved region type in intel_iommu_get_resv_regions() from
IOMMU_RESV_RESERVED to IOMMU_RESV_MSI, but it forgot to also change
the type in intel_iommu_put_resv_regions().

This leads to a memory leak, because now the check in
intel_iommu_put_resv_regions() for IOMMU_RESV_RESERVED will never
be true, and no allocated regions will be freed.

Fix this by changing the region type in intel_iommu_put_resv_regions()
to IOMMU_RESV_MSI, matching the type of the allocated regions.

Fixes: 9d3a4de4cb8d ("iommu: Disambiguate MSI region types")
Cc:  # v4.11+
Signed-off-by: Gerald Schaefer 
---
 drivers/iommu/intel-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 048b5ab36a02..b83e0f2025bb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5295,7 +5295,7 @@ static void intel_iommu_put_resv_regions(struct device *dev,
 	struct iommu_resv_region *entry, *next;
 
 	list_for_each_entry_safe(entry, next, head, list) {
-		if (entry->type == IOMMU_RESV_RESERVED)
+		if (entry->type == IOMMU_RESV_MSI)
 			kfree(entry);
 	}
 }
-- 
2.16.4



Re: [PATCH 3/3] iommu: s390: constify iommu_ops

2017-08-28 Thread Gerald Schaefer
On Mon, 28 Aug 2017 17:42:50 +0530
Arvind Yadav <arvind.yadav...@gmail.com> wrote:

> iommu_ops are not supposed to change at runtime.
> Functions 'bus_set_iommu' working with const iommu_ops provided
> by . So mark the non-const structs as const.
> 
> Signed-off-by: Arvind Yadav <arvind.yadav...@gmail.com>
> ---
>  drivers/iommu/s390-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Gerald Schaefer <gerald.schae...@de.ibm.com>

> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 8788640..400d75f 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -327,7 +327,7 @@ static size_t s390_iommu_unmap(struct iommu_domain 
> *domain,
>   return size;
>  }
> 
> -static struct iommu_ops s390_iommu_ops = {
> +static const struct iommu_ops s390_iommu_ops = {
>   .capable = s390_iommu_capable,
>   .domain_alloc = s390_domain_alloc,
>   .domain_free = s390_domain_free,



Re: [PATCH 2/2] iommu/s390: Add support for iommu_device handling

2017-06-29 Thread Gerald Schaefer
On Tue, 27 Jun 2017 17:28:06 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> On Mon, Jun 19, 2017 at 05:02:19PM +0200, Gerald Schaefer wrote:
> > On Thu, 15 Jun 2017 15:11:52 +0200
> > Joerg Roedel <j...@8bytes.org> wrote:
> > > + rc = zpci_init_iommu(zdev);
> > > + if (rc)
> > > + goto out_free;
> > > +
> > 
> > After this point, there are two options for failure (zpci_enable_device +
> > zpci_scan_bus), but missing error handling with an appropriate call to
> > zpci_destroy_iommu().
> 
> Right, I'll fix that in the next version.
> 
> > I must admit that I do not understand what these sysfs attributes are
> > used for, and by whom and when. Is it really necessary/correct to register
> > them (including the s390_iommu_ops) _before_ the zpci_dev is registered
> > to the bus (zpci_scan_bus -> pci_scan_root_bus/pci_bus_add_devices)?
> > 
> > What is the expected life cycle for the attributes, and are they already
> > needed when a device is still under DMA API access, or even before it is
> > added to the PCI bus?
> 
> The sysfs attributes are used by user-space for topology detection. A
> user-space program using VFIO needs to know which PCI-devices it needs
> to assign to the VFIO driver in order to gain access.
> 
> On s390 this probably doesn't matter much, as the real PCI topology is
> not exposed, but it matters on other architectures. The purpose of this
> patch is not so much the sysfs attributes. Its a step to convert all
> IOMMU drivers to use the iommu_device_register() interface. Goal is that
> the iommu core code knows about individual hardware iommus and the
> devices behind it.
> 
> When this is implemented in all iommu drivers, we can start moving more
> code out of the drivers into the iommu core. This also probably doesn't
> matter much for s390, but all iommu drivers need to use this interface
> before we can move on. The sysfs-stuff is just a by-product of that.
> 
> So to move on with that, it would be good to get an Acked-by on this
> patch from you guys once you think I fixed everything :)

Ok, thanks for the explanation. So it should not be a problem if the
attributes are registered in zpci_create_device() before the device is
registered to the bus, especially since they do not provide any manipulation
function but only topology information.

BTW, I get the following new attributes with this patch:
/sys/devices/pci:00/:00:00.0/iommu
/sys/devices/virtual/iommu
/sys/devices/virtual/iommu/s390-iommu.0012
/sys/class/iommu/s390-iommu.0012

Regards,
Gerald



Re: [PATCH 1/3] iommu: Return ERR_PTR() values from device_group call-backs

2017-06-28 Thread Gerald Schaefer
On Wed, 28 Jun 2017 14:00:56 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> From: Joerg Roedel <jroe...@suse.de>
> 
> The generic device_group call-backs in iommu.c return NULL
> in case of error. Since they are getting ERR_PTR values from
> iommu_group_alloc(), just pass them up instead.
> 
> Reported-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
> Signed-off-by: Joerg Roedel <jroe...@suse.de>
> ---

Looks good,
Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com>


>  drivers/iommu/iommu.c | 14 ++
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index cf7ca7e..de09e1e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -915,13 +915,7 @@ static int get_pci_alias_or_group(struct pci_dev *pdev, 
> u16 alias, void *opaque)
>   */
>  struct iommu_group *generic_device_group(struct device *dev)
>  {
> - struct iommu_group *group;
> -
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return NULL;
> -
> - return group;
> + return iommu_group_alloc();
>  }
> 
>  /*
> @@ -988,11 +982,7 @@ struct iommu_group *pci_device_group(struct device *dev)
>   return group;
> 
>   /* No shared group found, allocate new */
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return NULL;
> -
> - return group;
> + return iommu_group_alloc();
>  }
> 
>  /**



Re: [PATCH 2/2] iommu/s390: Add support for iommu_device handling

2017-06-19 Thread Gerald Schaefer
On Thu, 15 Jun 2017 15:11:52 +0200
Joerg Roedel  wrote:

> From: Joerg Roedel 
> 
> Add support for the iommu_device_register interface to make
> the s390 hardware iommus visible to the iommu core and in
> sysfs.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/s390/include/asm/pci.h |  7 +++
>  arch/s390/pci/pci.c |  5 +
>  drivers/iommu/s390-iommu.c  | 35 +++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 4e31866..a3f697e 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -8,6 +8,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -123,6 +124,8 @@ struct zpci_dev {
>   unsigned long   iommu_pages;
>   unsigned int next_bit;
> 
> + struct iommu_device iommu_dev;  /* IOMMU core handle */
> +
>   char res_name[16];
>   struct zpci_bar_struct bars[PCI_BAR_COUNT];
> 
> @@ -173,6 +176,10 @@ int clp_add_pci_device(u32, u32, int);
>  int clp_enable_fh(struct zpci_dev *, u8);
>  int clp_disable_fh(struct zpci_dev *);
> 
> +/* IOMMU Interface */
> +int zpci_init_iommu(struct zpci_dev *zdev);
> +void zpci_destroy_iommu(struct zpci_dev *zdev);
> +
>  #ifdef CONFIG_PCI
>  /* Error handling and recovery */
>  void zpci_event_error(void *);
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 8051df1..4c13da6 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -749,6 +749,7 @@ void pcibios_remove_bus(struct pci_bus *bus)
> 
>   zpci_exit_slot(zdev);
>   zpci_cleanup_bus_resources(zdev);
> + zpci_destroy_iommu(zdev);
>   zpci_free_domain(zdev);
> 
>   spin_lock(&zpci_list_lock);
> @@ -820,6 +821,10 @@ int zpci_create_device(struct zpci_dev *zdev)
>   if (rc)
>   goto out;
> 
> + rc = zpci_init_iommu(zdev);
> + if (rc)
> + goto out_free;
> +

After this point, there are two options for failure (zpci_enable_device +
zpci_scan_bus), but missing error handling with an appropriate call to
zpci_destroy_iommu().

I must admit that I do not understand what these sysfs attributes are
used for, and by whom and when. Is it really necessary/correct to register
them (including the s390_iommu_ops) _before_ the zpci_dev is registered
to the bus (zpci_scan_bus -> pci_scan_root_bus/pci_bus_add_devices)?

What is the expected life cycle for the attributes, and are they already
needed when a device is still under DMA API access, or even before it is
added to the PCI bus?

>   mutex_init(&zdev->lock);
>   if (zdev->state == ZPCI_FN_STATE_CONFIGURED) {
>   rc = zpci_enable_device(zdev);
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 8788640..85f3bc5 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -18,6 +18,8 @@
>   */
>  #define S390_IOMMU_PGSIZES   (~0xFFFUL)
> 
> +static struct iommu_ops s390_iommu_ops;
> +
>  struct s390_domain {
>   struct iommu_domain domain;
>   struct list_head devices;
> @@ -166,11 +168,13 @@ static void s390_iommu_detach_device(struct 
> iommu_domain *domain,
>  static int s390_iommu_add_device(struct device *dev)
>  {
>   struct iommu_group *group = iommu_group_get_for_dev(dev);
> + struct zpci_dev *zdev = to_pci_dev(dev)->sysdata;
> 
>   if (IS_ERR(group))
>   return PTR_ERR(group);
> 
>   iommu_group_put(group);
> + iommu_device_link(&zdev->iommu_dev, dev);
> 
>   return 0;
>  }
> @@ -197,6 +201,7 @@ static void s390_iommu_remove_device(struct device *dev)
>   s390_iommu_detach_device(domain, dev);
>   }
> 
> + iommu_device_unlink(&zdev->iommu_dev, dev);
>   iommu_group_remove_device(dev);
>  }
> 
> @@ -327,6 +332,36 @@ static size_t s390_iommu_unmap(struct iommu_domain 
> *domain,
>   return size;
>  }
> 
> +int zpci_init_iommu(struct zpci_dev *zdev)
> +{
> + int rc = 0;
> +
> + rc = iommu_device_sysfs_add(&zdev->iommu_dev, NULL, NULL,
> + "s390-iommu.%08x", zdev->fid);
> + if (rc)
> + goto out_err;
> +
> + iommu_device_set_ops(&zdev->iommu_dev, &s390_iommu_ops);
> +
> + rc = iommu_device_register(&zdev->iommu_dev);
> + if (rc)
> + goto out_sysfs;
> +
> + return 0;
> +
> +out_sysfs:
> + iommu_device_sysfs_remove(&zdev->iommu_dev);
> +
> +out_err:
> + return rc;
> +}
> +
> +void zpci_destroy_iommu(struct zpci_dev *zdev)
> +{
> + iommu_device_unregister(&zdev->iommu_dev);
> + iommu_device_sysfs_remove(&zdev->iommu_dev);
> +}
> +
>  static struct iommu_ops s390_iommu_ops = {
>   .capable = s390_iommu_capable,
>   .domain_alloc = s390_domain_alloc,



Re: [PATCH 1/2] iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()

2017-06-16 Thread Gerald Schaefer
On Thu, 15 Jun 2017 15:11:51 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> From: Joerg Roedel <jroe...@suse.de>
> 
> The iommu_group_get_for_dev() function also attaches the
> device to its group, so this code doesn't need to be in the
> iommu driver.
> 
> Further by using this function the driver can make use of
> default domains in the future.
> 
> Signed-off-by: Joerg Roedel <jroe...@suse.de>

Seems pretty straightforward, so
Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com>

However, looking at iommu_group_get_for_dev(), I wonder if the
generic_device_group() always returns the right thing, but that
would be independent from this patch.

With generic_device_group() returning NULL in case the allocation failed,
this part of iommu_group_get_for_dev() would then happily dereference the
NULL pointer, because IS_ERR(group) would be false:

if (ops && ops->device_group)
group = ops->device_group(dev);

if (IS_ERR(group))
return group;

/*
 * Try to allocate a default domain - needs support from the
 * IOMMU driver.
 */
if (!group->default_domain) {

The same is true for pci_device_group(), which also returns NULL in case
of allocation failure. I guess both functions should just return the
group pointer from iommu_group_alloc() directly, which already would
contain an appropriate ERR_PTR, but never NULL.

What do you think?

> ---
>  drivers/iommu/s390-iommu.c | 15 +--
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 179e636..8788640 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -165,20 +165,14 @@ static void s390_iommu_detach_device(struct 
> iommu_domain *domain,
> 
>  static int s390_iommu_add_device(struct device *dev)
>  {
> - struct iommu_group *group;
> - int rc;
> + struct iommu_group *group = iommu_group_get_for_dev(dev);
> 
> - group = iommu_group_get(dev);
> - if (!group) {
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return PTR_ERR(group);
> - }
> + if (IS_ERR(group))
> + return PTR_ERR(group);
> 
> - rc = iommu_group_add_device(group, dev);
>   iommu_group_put(group);
> 
> - return rc;
> + return 0;
>  }
> 
>  static void s390_iommu_remove_device(struct device *dev)
> @@ -344,6 +338,7 @@ static struct iommu_ops s390_iommu_ops = {
>   .iova_to_phys = s390_iommu_iova_to_phys,
>   .add_device = s390_iommu_add_device,
>   .remove_device = s390_iommu_remove_device,
> + .device_group = generic_device_group,
>   .pgsize_bitmap = S390_IOMMU_PGSIZES,
>  };
> 



Re: [PATCH 19/44] s390: implement ->mapping_error

2017-06-08 Thread Gerald Schaefer
On Thu,  8 Jun 2017 15:25:44 +0200
Christoph Hellwig <h...@lst.de> wrote:

> s390 can also use noop_dma_ops, and while that currently does not return
> errors it will so in the future.  Implementing the mapping_error method
> is the proper way to have per-ops error conditions.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Gerald Schaefer <gerald.schae...@de.ibm.com>



Re: [RFC PATCH 0/2] iommu/s390: Fix iommu-groups and add sysfs support

2017-04-28 Thread Gerald Schaefer
On Fri, 28 Apr 2017 16:55:13 +0200
Joerg Roedel  wrote:

> > I am however a bit confused now, about how we would have allowed group
> > sharing with the current s390 IOMMU code, or IOW in which scenario would
> > iommu_group_get() in the add_device callback find a shareable iommu-group?  
> 
> The usual way to do this is to use the iommu_group_get_for_dev()
> function, which invokes the iommu_ops->device_group call-back of the
> driver to find a matching group or allocating a new one.
> 
> There are ready-to-use functions for this call-back already:
> 
>   1) generic_device_group() - which just allocates a new group for
>  the device. This is usually used outside of PCI
> 
>   2) pci_device_group() - Which walks the PCI hierarchy to find
>  devices that are not isolated and uses the matching group for
>  its isolation domain.
> 
> A few drivers have their own versions of this call-back, but those are
> IOMMU drivers supporting multiple bus-types and need to find the right
> way to determine the group first.
> 
> > So, I guess we may have an issue with not sharing iommu-groups when
> > it could make sense to do so. But your patch would not fix this, as
> > we still would allocate separate iommu-groups for all functions.  
> 
> Yes, but the above approach won't help when each function ends up on a
> seperate bus because the code looks for different functions that are
> enumerated as such. Anyway, some more insight into how this enumeration
> works on s390 would be great :)

Since Sebastian confirmed this, it looks like we do not really have any
enumeration when there is a separate bus for each function.

Also, IIRC, add_device will get called before attach_dev. Currently we
allow to attach more than one device (apparently from different buses) to
one domain (one shared DMA table) in attach_dev. But then it would be too
late to also add all devices to the same iommu-group. That would have had
to be done earlier in add_device, but there we don't know yet if a shared
DMA table would be set up later in attach_dev.

So, it looks to me that we cannot provide correct iommu-group sharing
on s390, even though we allow iommu-domain sharing, which sounds odd. Since
this "shared domain / DMA table" option in attach_dev was only added
because at that time I thought that was a hard requirement for any arch-
specific IOMMU API implementation, maybe there was some misunderstanding.

It would make the code easier (and more consistent with the s390 hardware)
if I would just remove that option from attach_dev, and allow only one
device/function per iommu-domain. What do you think, could this be removed
for s390, or is there any common code requirement for providing that option
(and is it OK that we have separate iommu-groups in this case)?

Regards,
Gerald



Re: [PATCH 1/2] iommu/s390: Fix IOMMU groups

2017-04-28 Thread Gerald Schaefer
On Thu, 27 Apr 2017 23:12:32 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> On Thu, Apr 27, 2017 at 08:11:42PM +0200, Gerald Schaefer wrote:
> > > +void zpci_destroy_iommu(struct zpci_dev *zdev)
> > > +{
> > > + iommu_group_put(zdev->group);
> > > + zdev->group = NULL;
> > > +}  
> > 
> > While the rest of this patch doesn't seem to make much of a difference to
> > the current behavior, I'm wondering where this extra iommu_group_put()
> > comes from. It either was erroneously missing before this patch, or it
> > is erroneously introduced by this patch.  
> 
> This is the way to free an iommu-group. It was missing before probably
> because it was unclear whether the add_device function allocated a group
> or not. So there was no way to know if it needs to be put again in the
> remove_device function.

Hmm, for the reference count it should not matter whether a new group was
allocated or an existing group found with iommu_group_get(). Our add_device
callback always gets one reference either from iommu_group_get or _alloc,
and then another one from iommu_group_add_device(), after which the first
reference is put again. So there should always be one reference more after
a successful add_device.

Now I'm wondering where this one reference is put again, and I thought
that happened in the remove_device callback, where we call
iommu_group_remove_device(). Is this not correct? Just want to make sure
that we don't have a refcount issue in the current code.

Regards,
Gerald



Re: [RFC PATCH 0/2] iommu/s390: Fix iommu-groups and add sysfs support

2017-04-28 Thread Gerald Schaefer
Hi Joerg,

I guess we are a bit special on s390 (again), see below. Sebastian is more
familiar with the base s390 PCI code, he may correct me if I'm wrong.

On Thu, 27 Apr 2017 23:03:25 +0200
Joerg Roedel  wrote:

> > Well, there is a separate zpci_dev for each pci_dev on s390,
> > and each of those has its own separate dma-table (thus not shared).  
> 
> Is that true for all functions of a PCIe card, so does every function of
> a device has its own zpci_dev structure and thus its own DMA-table?

Yes, clp_add_pci_device() is called for every function, which in turn calls
zpci_create_device() with a freshly allocated zdev. zpci_enable_device()
then sets up a new DMA address space for each function.

> My assumption came from the fact that the zpci_dev is read from
> pci_dev->sysdata, which is propagated there from the pci_bridge
> through the pci_root_bus structures.

The zdev gets there via zpci_create_device() -> zpci_scan_bus() ->
pci_scan_root_bus(), which is done for every single function.

Not sure if I understand this right, but it looks like we set up a new PCI
bus for each function.

> > Given this "separate zpci_dev for each pci_dev" situation, I don't
> > see what this update actually changes, compared to the previous code,
> > see also my comments to that patch.  
> 
> The add_device call-back is invoked for every function of a pci-device,
> because each function gets its own pci_dev structure. Also we usually
> group all functions of a PCI-device together into one iommu-group,
> because we don't trust that the device isolates its functions from each
> other.

OK, but similar to the add_device callback, zpci_create_device() is
also invoked for every function. So, allocating a new iommu-group in
zpci_create_device() will never lead to any group sharing.

I am however a bit confused now, about how we would have allowed group
sharing with the current s390 IOMMU code, or IOW in which scenario would
iommu_group_get() in the add_device callback find a shareable iommu-group?

In the attach_dev callback, we provide the option to "force" multiple
functions using the same iommu-domain / DMA address space, by de-registering
the per-function DMA address space and registering a common space. But
such functions would only be in the same iommu "domain" and not "group",
if I get this right.

So, I guess we may have an issue with not sharing iommu-groups when
it could make sense to do so. But your patch would not fix this, as
we still would allocate separate iommu-groups for all functions.

Regards,
Gerald



Re: [PATCH 1/2] iommu/s390: Fix IOMMU groups

2017-04-27 Thread Gerald Schaefer
On Thu, 27 Apr 2017 17:28:24 +0200
Joerg Roedel  wrote:

> From: Joerg Roedel 
> 
> Currently the s390 iommu driver allocates an iommu-group for
> every device that is added. But that is wrong, as there is
> only one dma-table per pci-root-bus. Make all devices behind
> one dma-table share one iommu-group.

On s390, each PCI device has its own zpci_dev structure, and also its own
DMA address space. Even with this patch, you'll still allocate an
iommu_group for every device that is added, see below, so I do not really
see the benefit of this (but my knowledge got a little rusty, so I may be
missing something).

The only reason why we allow the theoretical option to attach more than
one device to the same IOMMU domain (and thus address space), is because
it was a requirement by you at that time (IIRC). We have no use-case for
this, and even in this theoretical scenario you would still have separate
zpci_dev structures for the affected devices that share the same IOMMU
domain (DMA address space), and thus also separate IOMMU groups, at least
after this patch.

Before this patch, you could in theory have different PCI devices in the
same IOMMU group, by having iommu_group_get() in s390_iommu_add_device()
return a group != NULL. With this patch, you will enforce that a new
iommu_group is allocated for every device, which would be the contrary
of what the description says.

> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/s390/include/asm/pci.h |  7 +++
>  arch/s390/pci/pci.c | 10 +-
>  drivers/iommu/s390-iommu.c  | 43 ---
>  3 files changed, 48 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 4e31866..045665d 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -8,6 +8,7 @@
> 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -123,6 +124,8 @@ struct zpci_dev {
>   unsigned long   iommu_pages;
>   unsigned int next_bit;
> 
> + struct iommu_group *group;  /* IOMMU group for all devices behind 
> this zdev */

There is always only one device behind a zpci_dev, so this comment makes
no sense.

> +
>   char res_name[16];
>   struct zpci_bar_struct bars[PCI_BAR_COUNT];
> 
> @@ -173,6 +176,10 @@ static inline bool zdev_enabled(struct zpci_dev *zdev)
>  int clp_enable_fh(struct zpci_dev *, u8);
>  int clp_disable_fh(struct zpci_dev *);
> 
> +/* IOMMU Interface */
> +int zpci_init_iommu(struct zpci_dev *zdev);
> +void zpci_destroy_iommu(struct zpci_dev *zdev);
> +
>  #ifdef CONFIG_PCI
>  /* Error handling and recovery */
>  void zpci_event_error(void *);
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 364b9d8..3178e4d 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -754,6 +754,7 @@ void pcibios_remove_bus(struct pci_bus *bus)
> 
>   zpci_exit_slot(zdev);
>   zpci_cleanup_bus_resources(zdev);
> + zpci_destroy_iommu(zdev);
>   zpci_free_domain(zdev);
> 
>   spin_lock(&zpci_list_lock);
> @@ -825,11 +826,15 @@ int zpci_create_device(struct zpci_dev *zdev)
>   if (rc)
>   goto out;
> 
> + rc = zpci_init_iommu(zdev);
> + if (rc)
> + goto out_free;

This will get called for every device that is added, and every device
will get its own zpci_dev, so this will still result in allocating
an IOMMU group for every device.

> +
>   mutex_init(&zdev->lock);
>   if (zdev->state == ZPCI_FN_STATE_CONFIGURED) {
>   rc = zpci_enable_device(zdev);
>   if (rc)
> - goto out_free;
> + goto out_iommu;
>   }
>   rc = zpci_scan_bus(zdev);
>   if (rc)
> @@ -846,6 +851,9 @@ int zpci_create_device(struct zpci_dev *zdev)
>  out_disable:
>   if (zdev->state == ZPCI_FN_STATE_ONLINE)
>   zpci_disable_device(zdev);
> +out_iommu:
> + zpci_destroy_iommu(zdev);
> +
>  out_free:
>   zpci_free_domain(zdev);
>  out:
> diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
> index 179e636..cad3ad0 100644
> --- a/drivers/iommu/s390-iommu.c
> +++ b/drivers/iommu/s390-iommu.c
> @@ -163,22 +163,22 @@ static void s390_iommu_detach_device(struct 
> iommu_domain *domain,
>   }
>  }
> 
> -static int s390_iommu_add_device(struct device *dev)
> +static struct iommu_group *s390_iommu_device_group(struct device *dev)
>  {
> - struct iommu_group *group;
> - int rc;
> + struct zpci_dev *zdev = to_pci_dev(dev)->sysdata;
> 
> - group = iommu_group_get(dev);
> - if (!group) {
> - group = iommu_group_alloc();
> - if (IS_ERR(group))
> - return PTR_ERR(group);
> - }
> + return zdev->group;
> +}
> +
> +static int s390_iommu_add_device(struct device *dev)
> +{
> + struct iommu_group *group = iommu_group_get_for_dev(dev);
> + if 

Re: [RFC PATCH 0/2] iommu/s390: Fix iommu-groups and add sysfs support

2017-04-27 Thread Gerald Schaefer
On Thu, 27 Apr 2017 17:28:23 +0200
Joerg Roedel  wrote:

> Hey,
> 
> here are two patches for the s390 PCI and IOMMU code. It is
> based on the assumption that every pci_dev that points to
> the same zpci_dev shares a single dma-table (and thus a
> single address space).

Well, there is a separate zpci_dev for each pci_dev on s390,
and each of those has its own separate dma-table (thus not shared).

> 
> If this assupmtion is true (as it looks to me from reading
> the code) then the iommu-group setup code in the s390 iommu
> driver needs to be updated.

Given this "separate zpci_dev for each pci_dev" situation, I don't
see what this update actually changes, compared to the previous code,
see also my comments to that patch.

> 
> These patches do this and also add support for the
> iommu_device_register interface to the s390 iommu driver.
> 
> Any comments and testing appreciated.
> 
> Thanks,
> 
>   Joerg
> 
> Joerg Roedel (2):
>   iommu/s390: Fix IOMMU groups
>   iommu/s390: Add support for iommu_device handling
> 
>  arch/s390/include/asm/pci.h |  8 +
>  arch/s390/pci/pci.c | 10 ++-
>  drivers/iommu/s390-iommu.c  | 71 
> ++---
>  3 files changed, 78 insertions(+), 11 deletions(-)
> 



Re: [PATCH] iommu/s390: add iommu api for s390 pci devices

2015-10-01 Thread Gerald Schaefer
On Tue, 29 Sep 2015 14:40:30 +0200
Joerg Roedel <j...@8bytes.org> wrote:

> Hi Gerald,
> 
> thanks for your patch. It looks pretty good and addresses my previous
> review comments. I have a few questions, first one is how this
> operates with DMA-API on s390. Is there a separate DMA-API
> implementation besides the IOMMU-API one for PCI devices?

Yes, the DMA API is already implemented in arch/s390/pci/pci_dma.c.
I thought about moving it over to the new location in drivers/iommu/,
but I don't see any benefit from it.

Also, the two APIs are quite different on s390 and must not be mixed up.
For example, we have optimizations in the DMA API to reduce TLB flushes
based on iommu bitmap wrap-around, which is not possible for the map/unmap
logic in the IOMMU API. There is also the requirement that each device has
its own DMA page table (not shared), which is important for DMA API device
recovery and map/unmap on s390.
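
To illustrate what I mean by the wrap-around optimization: unmap only marks
the IOVA as free and defers the expensive device TLB flush until the next-fit
allocation cursor wraps around the bitmap. Here is a simplified, stand-alone
sketch of that idea (not the actual arch/s390/pci/pci_dma.c code, all names
made up for illustration):

#include <stdio.h>

#define IOMMU_PAGES 64				/* tiny bitmap, just for illustration */

struct dev_iommu {
	unsigned char bitmap[IOMMU_PAGES];	/* 1 = IOVA page in use */
	int next;				/* next-fit allocation cursor */
	int need_flush;				/* lazily unmapped entries exist */
};

static void dev_tlb_flush(struct dev_iommu *d)
{
	/* stands in for the global TLB refresh of the device */
	printf("global TLB flush\n");
	d->need_flush = 0;
}

static int iova_alloc(struct dev_iommu *d)
{
	int i;

	for (i = d->next; i < IOMMU_PAGES; i++)
		if (!d->bitmap[i])
			goto found;
	/*
	 * Cursor wrapped around: only now do stale translations of lazily
	 * unmapped pages have to be flushed, before their IOVAs are reused.
	 */
	if (d->need_flush)
		dev_tlb_flush(d);
	d->next = 0;
	for (i = 0; i < IOMMU_PAGES; i++)
		if (!d->bitmap[i])
			goto found;
	return -1;				/* address space exhausted */
found:
	d->bitmap[i] = 1;
	d->next = i + 1;
	return i;
}

static void iova_free(struct dev_iommu *d, int iova)
{
	d->bitmap[iova] = 0;
	d->need_flush = 1;			/* defer the flush until wrap-around */
}

With map/unmap through the IOMMU API there is no such allocation cursor, so
every unmap has to flush right away, which is why this optimization cannot
be carried over.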

> 
> My other question is inline:
> 
> On Thu, Aug 27, 2015 at 03:33:03PM +0200, Gerald Schaefer wrote:
> > +struct s390_domain_device {
> > +   struct list_headlist;
> > +   struct zpci_dev *zdev;
> > +};
> 
> Instead of using your own struct here, have you considered using the
> struct iommu_group instead? The struct devices contains a pointer to an
> iommu_group and the struct itself contains pointers to the domain it is
> currently bound to.

Hmm, not sure how this can replace my own struct. I need the struct to
maintain a list of all devices that share a dma page table. And the
devices need to be added and removed to/from that list in attach/detach_dev.

I also need that list during map/unmap, in order to do a TLB flush for
all affected devices, and this happens under a spin lock.

So I guess I cannot use the iommu_group->devices list, which is managed
in add/remove_device and under a mutex, if that was on your mind.
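
To make it more concrete, here is roughly how the domain struct and the
map/unmap path look with that list. This is a simplified sketch, not the
final patch; s390_domain_refresh and the exact locking are only illustrative
(zpci_refresh_trans() and zdev->fh are the existing s390 interfaces):

#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/pci.h>
#include <asm/pci_insn.h>

struct s390_domain {
	struct iommu_domain	domain;
	unsigned long		*dma_table;	/* DMA page table shared by the domain */
	spinlock_t		list_lock;
	struct list_head	devices;	/* list of struct s390_domain_device */
};

struct s390_domain_device {
	struct list_head	list;
	struct zpci_dev		*zdev;
};

/* called from iommu_ops->map/unmap after the DMA table was updated */
static int s390_domain_refresh(struct s390_domain *s390_domain,
			       dma_addr_t dma_addr, size_t size)
{
	struct s390_domain_device *domain_device;
	unsigned long flags;
	int rc = 0;

	spin_lock_irqsave(&s390_domain->list_lock, flags);
	list_for_each_entry(domain_device, &s390_domain->devices, list) {
		/* flush the translations of every device attached to the domain */
		rc = zpci_refresh_trans((u64) domain_device->zdev->fh << 32,
					dma_addr, size);
		if (rc)
			break;
	}
	spin_unlock_irqrestore(&s390_domain->list_lock, flags);

	return rc;
}

The devices would be added to / removed from that list in attach_dev and
detach_dev, which is why the list is kept in the domain and protected by a
spin lock rather than reusing the mutex-protected iommu_group list.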

Regards,
Gerald

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu/s390: add iommu api for s390 pci devices

2015-08-27 Thread Gerald Schaefer
This adds an IOMMU API implementation for s390 PCI devices.

Reviewed-by: Sebastian Ott seb...@linux.vnet.ibm.com
Signed-off-by: Gerald Schaefer gerald.schae...@de.ibm.com
---
 MAINTAINERS |   7 +
 arch/s390/Kconfig   |   1 +
 arch/s390/include/asm/pci.h |   4 +
 arch/s390/include/asm/pci_dma.h |   5 +-
 arch/s390/pci/pci_dma.c |  37 +++--
 drivers/iommu/Kconfig   |   7 +
 drivers/iommu/Makefile  |   1 +
 drivers/iommu/s390-iommu.c  | 337 
 8 files changed, 386 insertions(+), 13 deletions(-)
 create mode 100644 drivers/iommu/s390-iommu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 23da5da..8845410 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8937,6 +8937,13 @@ F:   drivers/s390/net/*iucv*
 F: include/net/iucv/
 F: net/iucv/
 
+S390 IOMMU (PCI)
+M: Gerald Schaefer gerald.schae...@de.ibm.com
+L: linux-s...@vger.kernel.org
+W: http://www.ibm.com/developerworks/linux/linux390/
+S: Supported
+F: drivers/iommu/s390-iommu.c
+
 S3C24XX SD/MMC Driver
 M: Ben Dooks ben-li...@fluff.org
 L: linux-arm-ker...@lists.infradead.org (moderated for non-subscribers)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 1d57000..74db332 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -582,6 +582,7 @@ menuconfig PCI
	bool "PCI support"
select HAVE_DMA_ATTRS
select PCI_MSI
+   select IOMMU_SUPPORT
help
  Enable PCI support.
 
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 34d9603..c873e68 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -62,6 +62,8 @@ struct zpci_bar_struct {
u8  size;   /* order 2 exponent */
 };
 
+struct s390_domain;
+
 /* Private data per function */
 struct zpci_dev {
struct pci_dev  *pdev;
@@ -118,6 +120,8 @@ struct zpci_dev {
 
struct dentry   *debugfs_dev;
struct dentry   *debugfs_perf;
+
+   struct s390_domain *s390_domain; /* s390 IOMMU domain data */
 };
 
 static inline bool zdev_enabled(struct zpci_dev *zdev)
diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 30b4c17..7a7abf1 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -192,5 +192,8 @@ static inline unsigned long *get_st_pto(unsigned long entry)
 /* Prototypes */
 int zpci_dma_init_device(struct zpci_dev *);
 void zpci_dma_exit_device(struct zpci_dev *);
-
+void dma_free_seg_table(unsigned long);
+unsigned long *dma_alloc_cpu_table(void);
+void dma_cleanup_tables(unsigned long *);
+void dma_update_cpu_trans(unsigned long *, void *, dma_addr_t, int);
 #endif
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index 37505b8..37d10f7 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -24,7 +24,7 @@ static int zpci_refresh_global(struct zpci_dev *zdev)
   zdev->iommu_pages * PAGE_SIZE);
 }
 
-static unsigned long *dma_alloc_cpu_table(void)
+unsigned long *dma_alloc_cpu_table(void)
 {
unsigned long *table, *entry;
 
@@ -114,12 +114,12 @@ static unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr
return pto[px];
 }
 
-static void dma_update_cpu_trans(struct zpci_dev *zdev, void *page_addr,
-dma_addr_t dma_addr, int flags)
+void dma_update_cpu_trans(unsigned long *dma_table, void *page_addr,
+ dma_addr_t dma_addr, int flags)
 {
unsigned long *entry;
 
-   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
+   entry = dma_walk_cpu_trans(dma_table, dma_addr);
if (!entry) {
WARN_ON_ONCE(1);
return;
@@ -156,7 +156,8 @@ static int dma_update_trans(struct zpci_dev *zdev, unsigned long pa,
goto no_refresh;
 
	for (i = 0; i < nr_pages; i++) {
-   dma_update_cpu_trans(zdev, page_addr, dma_addr, flags);
+   dma_update_cpu_trans(zdev->dma_table, page_addr, dma_addr,
+flags);
page_addr += PAGE_SIZE;
dma_addr += PAGE_SIZE;
}
@@ -181,7 +182,7 @@ no_refresh:
return rc;
 }
 
-static void dma_free_seg_table(unsigned long entry)
+void dma_free_seg_table(unsigned long entry)
 {
unsigned long *sto = get_rt_sto(entry);
int sx;
@@ -193,21 +194,18 @@ static void dma_free_seg_table(unsigned long entry)
dma_free_cpu_table(sto);
 }
 
-static void dma_cleanup_tables(struct zpci_dev *zdev)
+void dma_cleanup_tables(unsigned long *table)
 {
-   unsigned long *table;
int rtx;
 
-   if (!zdev || !zdev->dma_table)
+   if (!table)
return;
 
-   table = zdev->dma_table;
	for (rtx = 0; rtx < ZPCI_TABLE_ENTRIES; rtx++)
if (reg_entry_isvalid(table[rtx

Re: [RFC PATCH 0/1] iommu: Detach device from domain when removed from group

2015-08-03 Thread Gerald Schaefer
On Mon, 3 Aug 2015 17:48:55 +0200
Joerg Roedel j...@8bytes.org wrote:

 On Tue, Jul 28, 2015 at 07:55:55PM +0200, Gerald Schaefer wrote:
  On s390, this eventually leads to a kernel panic when binding the device
  again to its non-vfio PCI driver, because of the missing arch-specific
  cleanup in detach_dev. On x86, the detach_dev callback will also not be
  called directly, but there is a notifier that will catch
  BUS_NOTIFY_REMOVED_DEVICE and eventually do the cleanup. Other
  architectures w/o the notifier probably have at least some kind of memory
  leak in this scenario, so a general fix would be nice.
 
 This notifier is not arch-specific, but registered against the bus the
 iommu-ops are set for. Why does it not run on s390?

Adding the notifier would of course also work on s390 (and all other affected
architectures). However, it seems that the missing detach_dev issue in this
scenario is not fundamentally fixed by using this notifier, it just seems to
hide the symptom by chance.

Adding the otherwise unneeded notifier just to work around this issue somehow
doesn't seem right, also given that x86 is so far the only user of it. At
least I thought it would be cleaner to fix it in common code and for all
architectures. Not sure what's wrong with fixing the asymmetry as suggested
in my patch, but I guess there are good reasons for having this asymmetry.

For now, I'll just add the notifier to my s390 implementation and post it
soon.

 
 
   Joerg
 
 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/1] iommu: Detach device from domain when removed from group

2015-08-03 Thread Gerald Schaefer
On Tue, 28 Jul 2015 19:55:55 +0200
Gerald Schaefer gerald.schae...@de.ibm.com wrote:

 Hi,
 
 during IOMMU API function testing on s390 I hit the following scenario:
 
 After binding a device to vfio-pci, the user completes the VFIO_SET_IOMMU
 ioctl and stops, see the sample C program below. Now the device is manually
 removed via echo 1 > /sys/bus/pci/devices/.../remove.
 
 Although the SET_IOMMU ioctl triggered the attach_dev callback in the
 underlying IOMMU API, removing the device in this way won't trigger the
 detach_dev callback, neither during remove nor when the user program
 continues with closing group/container.
 
 On s390, this eventually leads to a kernel panic when binding the device
 again to its non-vfio PCI driver, because of the missing arch-specific
 cleanup in detach_dev. On x86, the detach_dev callback will also not be
 called directly, but there is a notifier that will catch
 BUS_NOTIFY_REMOVED_DEVICE and eventually do the cleanup. Other
 architectures w/o the notifier probably have at least some kind of memory
 leak in this scenario, so a general fix would be nice.
 
 My first approach was to try and fix this in VFIO code, but Alex Williamson
 pointed me to some asymmetry in the IOMMU code: iommu_group_add_device()
 will invoke the attach_dev callback, but iommu_group_remove_device() won't
 trigger detach_dev. Fixing this asymmetry would fix the issue for me, but
 is this the correct fix? Any thoughts?

Ping.

The suggested fix may be completely wrong, but not having detach_dev called
seems like a serious issue; any feedback would be greatly appreciated.


 
 Regards,
 Gerald
 
 
 Here is the sample C program to trigger the ioctl:
 
 #include <stdio.h>
 #include <fcntl.h>
 #include <linux/vfio.h>
 
 int main(void)
 {
 int container, group, rc;
 
 container = open("/dev/vfio/vfio", O_RDWR);
 if (container < 0) {
 perror("open /dev/vfio/vfio\n");
 return -1;
 }
 
 group = open("/dev/vfio/0", O_RDWR);
 if (group < 0) {
 perror("open /dev/vfio/0\n");
 return -1;
 }
 
 rc = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
 if (rc) {
 perror("ioctl VFIO_GROUP_SET_CONTAINER\n");
 return -1;
 }
 
 rc = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
 if (rc) {
 perror("ioctl VFIO_SET_IOMMU\n");
 return -1;
 }
 
 printf("Try device remove...\n");
 getchar();
 
 close(group);
 close(container);
 return 0;
 }
 
 Gerald Schaefer (1):
   iommu: Detach device from domain when removed from group
 
  drivers/iommu/iommu.c | 5 +
  1 file changed, 5 insertions(+)
 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH 1/1] iommu: Detach device from domain when removed from group

2015-07-28 Thread Gerald Schaefer
This patch adds a call to __iommu_detach_device() to the
iommu_group_remove_device() function, which will trigger a missing
detach_dev callback in (at least) the following scenario:

When a user completes the VFIO_SET_IOMMU ioctl for a vfio-pci device,
and the corresponding device is removed thereafter (before any other
ioctl like VFIO_GROUP_GET_DEVICE_FD), then the detach_dev callback of
the underlying IOMMU API is never called.

This also fixes an asymmetry with iommu_group_add_device() and
iommu_group_remove_device(), where the former did an attach_dev but
the latter did no detach_dev.

Signed-off-by: Gerald Schaefer gerald.schae...@de.ibm.com
---
 drivers/iommu/iommu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f286090..82ac8b3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -447,6 +447,9 @@ rename:
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
+static void __iommu_detach_device(struct iommu_domain *domain,
+ struct device *dev);
+
 /**
  * iommu_group_remove_device - remove a device from it's current group
  * @dev: device to be removed
@@ -466,6 +469,8 @@ void iommu_group_remove_device(struct device *dev)
 IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
 
	mutex_lock(&group->mutex);
+	if (group->domain)
+		__iommu_detach_device(group->domain, dev);
	list_for_each_entry(tmp_device, &group->devices, list) {
		if (tmp_device->dev == dev) {
device = tmp_device;
-- 
2.3.8

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[RFC PATCH 0/1] iommu: Detach device from domain when removed from group

2015-07-28 Thread Gerald Schaefer
Hi,

during IOMMU API function testing on s390 I hit the following scenario:

After binding a device to vfio-pci, the user completes the VFIO_SET_IOMMU
ioctl and stops, see the sample C program below. Now the device is manually
removed via echo 1 > /sys/bus/pci/devices/.../remove.

Although the SET_IOMMU ioctl triggered the attach_dev callback in the
underlying IOMMU API, removing the device in this way won't trigger the
detach_dev callback, neither during remove nor when the user program
continues with closing group/container.

On s390, this eventually leads to a kernel panic when binding the device
again to its non-vfio PCI driver, because of the missing arch-specific
cleanup in detach_dev. On x86, the detach_dev callback will also not be
called directly, but there is a notifier that will catch
BUS_NOTIFY_REMOVED_DEVICE and eventually do the cleanup. Other
architectures w/o the notifier probably have at least some kind of memory
leak in this scenario, so a general fix would be nice.

My first approach was to try and fix this in VFIO code, but Alex Williamson
pointed me to some asymmetry in the IOMMU code: iommu_group_add_device()
will invoke the attach_dev callback, but iommu_group_remove_device() won't
trigger detach_dev. Fixing this asymmetry would fix the issue for me, but
is this the correct fix? Any thoughts?

Regards,
Gerald


Here is the sample C program to trigger the ioctl:

#include <stdio.h>
#include <fcntl.h>
#include <linux/vfio.h>

int main(void)
{
int container, group, rc;

container = open("/dev/vfio/vfio", O_RDWR);
if (container < 0) {
perror("open /dev/vfio/vfio\n");
return -1;
}

group = open("/dev/vfio/0", O_RDWR);
if (group < 0) {
perror("open /dev/vfio/0\n");
return -1;
}

rc = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
if (rc) {
perror("ioctl VFIO_GROUP_SET_CONTAINER\n");
return -1;
}

rc = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
if (rc) {
perror("ioctl VFIO_SET_IOMMU\n");
return -1;
}

printf("Try device remove...\n");
getchar();

close(group);
close(container);
return 0;
}

Gerald Schaefer (1):
  iommu: Detach device from domain when removed from group

 drivers/iommu/iommu.c | 5 +
 1 file changed, 5 insertions(+)

-- 
2.3.8

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/1] vfio-pci/iommu: Detach iommu group on remove path

2015-07-23 Thread Gerald Schaefer
On Wed, 22 Jul 2015 10:54:35 -0600
Alex Williamson alex.william...@redhat.com wrote:

 On Tue, 2015-07-21 at 19:44 +0200, Gerald Schaefer wrote:
  When a user completes the VFIO_SET_IOMMU ioctl and the vfio-pci
  device is removed thereafter (before any other ioctl like
  VFIO_GROUP_GET_DEVICE_FD), then the detach_dev callback of the
  underlying IOMMU API is never called.
  
  This patch adds a call to vfio_group_try_dissolve_container() to
  the remove path, which will trigger the missing detach_dev callback
  in this scenario.
  
  Signed-off-by: Gerald Schaefer gerald.schae...@de.ibm.com
  ---
   drivers/vfio/vfio.c | 3 +++
   1 file changed, 3 insertions(+)
  
  diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
  index 2fb29df..9c5c784 100644
  --- a/drivers/vfio/vfio.c
  +++ b/drivers/vfio/vfio.c
  @@ -711,6 +711,8 @@ static bool vfio_dev_present(struct vfio_group
  *group, struct device *dev) return true;
   }
   
  +static void vfio_group_try_dissolve_container(struct vfio_group
  *group); +
   /*
* Decrement the device reference count and wait for the device to
  be
* removed.  Open file descriptors for the device... */
  @@ -785,6 +787,7 @@ void *vfio_del_group_dev(struct device *dev)
  }
   } while (ret <= 0);
   
  +   vfio_group_try_dissolve_container(group);
  vfio_group_put(group);
   
  return device_data;
 
 
 This won't work, vfio_group_try_dissolve_container() decrements
 container_users, which an unused device is not.  Imagine if we had
 more than one device in the iommu group, one device is removed and the
 container is dissolved despite the user holding a reference and other
 viable devices remaining.  Additionally, from an isolation
 perspective, an unbind from vfio-pci should not pull the device out
 of the iommu domain, it's part of the domain because it's not
 isolated and that continues even after unbind.
 
 I think what you want to do is detach a device from the iommu domain
 only when it's being removed from iommu group, such as through
 iommu_group_remove_device().  We already have a bit of an asymmetry
 there as iommu_group_add_device() will add devices to the currently
 active iommu domain for the group, but iommu_group_remove_device()
 does not appear to do the reverse.  Thanks,

Interesting, I haven't noticed this asymmetry so far, do you mean
something like this:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f286090..82ac8b3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -447,6 +447,9 @@ rename:
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
+static void __iommu_detach_device(struct iommu_domain *domain,
+ struct device *dev);
+
 /**
  * iommu_group_remove_device - remove a device from it's current group
  * @dev: device to be removed
@@ -466,6 +469,8 @@ void iommu_group_remove_device(struct device *dev)
 				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
 
	mutex_lock(&group->mutex);
+	if (group->domain)
+		__iommu_detach_device(group->domain, dev);
	list_for_each_entry(tmp_device, &group->devices, list) {
		if (tmp_device->dev == dev) {
device = tmp_device;

This would also fix the issue in my scenario, but as before, that
doesn't necessarily mean it is the correct fix. Adding the iommu list and
maintainer to cc.

Joerg, what do you think? (see https://lkml.org/lkml/2015/7/21/635 for
the problem description)

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 1/1] vfio-pci/iommu: Detach iommu group on remove path

2015-07-23 Thread Gerald Schaefer
On Wed, 22 Jul 2015 11:10:57 -0600
Alex Williamson alex.william...@redhat.com wrote:

 On Wed, 2015-07-22 at 10:54 -0600, Alex Williamson wrote:
  On Tue, 2015-07-21 at 19:44 +0200, Gerald Schaefer wrote:
   When a user completes the VFIO_SET_IOMMU ioctl and the vfio-pci
   device is removed thereafter (before any other ioctl like
   VFIO_GROUP_GET_DEVICE_FD), then the detach_dev callback of the
   underlying IOMMU API is never called.
   
   This patch adds a call to vfio_group_try_dissolve_container() to
   the remove path, which will trigger the missing detach_dev
   callback in this scenario.
   
   Signed-off-by: Gerald Schaefer gerald.schae...@de.ibm.com
   ---
drivers/vfio/vfio.c | 3 +++
1 file changed, 3 insertions(+)
   
   diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
   index 2fb29df..9c5c784 100644
   --- a/drivers/vfio/vfio.c
   +++ b/drivers/vfio/vfio.c
   @@ -711,6 +711,8 @@ static bool vfio_dev_present(struct
   vfio_group *group, struct device *dev) return true;
}

   +static void vfio_group_try_dissolve_container(struct vfio_group
   *group); +
/*
 * Decrement the device reference count and wait for the device
   to be
 * removed.  Open file descriptors for the device... */
   @@ -785,6 +787,7 @@ void *vfio_del_group_dev(struct device *dev)
 }
  } while (ret <= 0);

   + vfio_group_try_dissolve_container(group);
 vfio_group_put(group);

 return device_data;
  
  
  This won't work, vfio_group_try_dissolve_container() decrements
  container_users, which an unused device is not.  Imagine if we had
  more than one device in the iommu group, one device is removed and
  the container is dissolved despite the user holding a reference and
  other viable devices remaining.  Additionally, from an isolation
  perspective, an unbind from vfio-pci should not pull the device out
  of the iommu domain, it's part of the domain because it's not
  isolated and that continues even after unbind.
  
  I think what you want to do is detach a device from the iommu domain
  only when it's being removed from iommu group, such as through
  iommu_group_remove_device().  We already have a bit of an asymmetry
  there as iommu_group_add_device() will add devices to the currently
  active iommu domain for the group, but iommu_group_remove_device()
  does not appear to do the reverse.  Thanks,
 
 BTW, VT-d on x86 avoids a leak using its own notifier_block,
 drivers/iommu/intel-iommu.c:device_notifier() catches
 BUS_NOTIFY_REMOVED_DEVICE and removes the device from the domain (the
 domain_exit() there is only used for non-IOMMU-API domains).  It's
 possible that's the only IOMMU driver that avoids a leak due to the
 scenario you describe.  Thanks,

Thanks, that's good to know, so as a last resort I could also use the
notifier to work around the issue. But x86 seems to be the only arch
using this notifier so far, so a general fix would be nice.
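
For completeness, the workaround I have in mind would be a bus notifier along
these lines (sketch only, modelled on what intel-iommu does; the actual
arch-specific cleanup is left out, and s390_iommu_* are just placeholder
names):

#include <linux/device.h>
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/pci.h>

static int s390_iommu_bus_notifier(struct notifier_block *nb,
				   unsigned long action, void *data)
{
	struct device *dev = data;

	if (action != BUS_NOTIFY_REMOVED_DEVICE)
		return NOTIFY_DONE;

	/*
	 * Do the arch-specific cleanup here that detach_dev would normally
	 * do (unregister the IOAT, re-initialize the DMA table, ...),
	 * since detach_dev is not called on this path.
	 */
	dev_info(dev, "cleaning up after device removal\n");

	return NOTIFY_OK;
}

static struct notifier_block s390_iommu_nb = {
	.notifier_call = s390_iommu_bus_notifier,
};

static int __init s390_iommu_notifier_init(void)
{
	return bus_register_notifier(&pci_bus_type, &s390_iommu_nb);
}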

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH linux-next] iommu: add iommu for s390 platform

2014-10-27 Thread Gerald Schaefer
On Thu, 23 Oct 2014 16:04:37 +0200
Frank Blaschka blasc...@linux.vnet.ibm.com wrote:

 On Thu, Oct 23, 2014 at 02:41:15PM +0200, Joerg Roedel wrote:
  On Wed, Oct 22, 2014 at 05:43:20PM +0200, Frank Blaschka wrote:
   Basically there are no limitations. Depending on the s390 machine
   generation a device starts its IOVA at a specific address
   (announced by the HW). But as I already told each device starts
   at the same address. I think this prevents having multiple
   devices on the same IOMMU domain.
  
  Why, each device has its own IOVA address space, so IOVA A could
  map to physical address X for one device and to Y for another, no?
  And if you point multiple devices to the same dma_table they share
  the mappings (and thus the address space). Or am I getting
  something wrong?
  
   yes, you are absolutely right. There is a per-device dma_table.
   There is no general IOMMU device but each pci device has its own
   IOMMU translation capability.
  
  I see, in this way it is similar to ARM where there is often also
  one IOMMU per master device.
  
   Is there a possibility the IOMMU domain can support e.g.
   something like
   
   VIOA 0x1 - pci device 1
   VIOA 0x1 - pci device 2
  
  A domain is basically an abstraction for a DMA page table (or a
  dma_table, as you call it on s390). So you can easily create similar
  mappings for more than one device with it.
 
 ok, maybe I was too close to the existing s390 dma implementation or
 simply wrong, maybe Sebastian or Gerald can give more background

Not sure if I understood the concept of IOMMU domains right. But if this
is about having multiple devices in the same domain, so that iommu_ops->map
will establish the _same_ DMA mapping on _all_ registered devices, then
this should be possible.

We cannot have shared DMA tables because each device gets its own DMA table
allocated during device initialization. But we could just keep all devices
from one domain in a list and then call dma_update_trans() for all devices
during iommu_ops->map/unmap.

Gerald

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH linux-next] iommu: add iommu for s390 platform

2014-10-27 Thread Gerald Schaefer
On Mon, 27 Oct 2014 17:25:02 +0100
Joerg Roedel j...@8bytes.org wrote:

 On Mon, Oct 27, 2014 at 03:32:01PM +0100, Gerald Schaefer wrote:
  Not sure if I understood the concept of IOMMU domains right. But if
  this is about having multiple devices in the same domain, so that
  iommu_ops->map will establish the _same_ DMA mapping on _all_
  registered devices, then this should be possible.
 
 Yes, this is what domains are about. A domain describes a set of DMA
 mappings which can be assigned to multiple devices in parallel.
 
  We cannot have shared DMA tables because each device gets its own
  DMA table allocated during device initialization.
 
 Is there some hardware reason for this or is that just an
 implementation detail that can be changed. In other words, does the
 hardware allow to use the same DMA table for multiple devices?

Yes, the HW would allow shared DMA tables, but the implementation would
need some non-trivial changes. For example, we have a per-device spin_lock
for DMA table manipulations and the code in arch/s390/pci/pci_dma.c knows
nothing about IOMMU domains or shared DMA tables, it just implements a set
of dma_map_ops.

Of course this would also go horribly wrong if a device was already
in use (via the current dma_map_ops), but I guess using devices through
the IOMMU_API prevents using them otherwise?

 
  But we could just keep all devices from one domain in a list and
  then call dma_update_trans() for all devices during
  iommu_ops-map/unmap.
 
 This sounds complicated. Note that a device can be assigned to a
 domain that already has existing mappings. In this case you need to
 make sure that the new device inherits these mappings (and destroy
 all old mappings for the device that possibly exist).
 
 I think it is much easier to use the same DMA table for all devices
 in a domain, if the hardware allows that.

Yes, in this case, having one DMA table per domain and sharing it
between all devices in that domain sounds like a good idea. However,
I can't think of any use case for this, and Frank probably had a very
special use case in mind where this scenario doesn't appear, hence the
one device per domain restriction.

So, if having multiple devices per domain is a must, then we probably
need a thorough rewrite of the arch/s390/pci/pci_dma.c code.
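
Just to sketch what such a rewrite would mean for attach_dev: instead of the
per-device table set up by zpci_dma_init_device(), the device would be
pointed at a table owned by the domain, roughly like this. This is
hypothetical and untested; the s390_domain struct is assumed to hold the
shared dma_table, and the base/limit handling for zpci_register_ioat() is
simplified:

static int s390_iommu_attach_device(struct iommu_domain *domain,
				    struct device *dev)
{
	/* assumed domain struct that owns the shared DMA table */
	struct s390_domain *s390_domain = container_of(domain,
						struct s390_domain, domain);
	struct zpci_dev *zdev = to_pci_dev(dev)->sysdata;
	int rc;

	/* drop the private table that zpci_dma_init_device() set up */
	zpci_dma_exit_device(zdev);

	/* register the domain's table as this device's translation table */
	rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
				(u64) s390_domain->dma_table);
	if (!rc)
		zdev->s390_domain = s390_domain;

	return rc;
}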

Gerald

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH linux-next] iommu: add iommu for s390 platform

2014-10-27 Thread Gerald Schaefer
On Mon, 27 Oct 2014 18:58:35 +0100
Joerg Roedel j...@8bytes.org wrote:

 On Mon, Oct 27, 2014 at 06:02:19PM +0100, Gerald Schaefer wrote:
  On Mon, 27 Oct 2014 17:25:02 +0100
  Joerg Roedel j...@8bytes.org wrote:
   Is there some hardware reason for this or is that just an
   implementation detail that can be changed. In other words, does
   the hardware allow to use the same DMA table for multiple devices?
  
  Yes, the HW would allow shared DMA tables, but the implementation
  would need some non-trivial changes. For example, we have a
  per-device spin_lock for DMA table manipulations and the code in
  arch/s390/pci/pci_dma.c knows nothing about IOMMU domains or shared
  DMA tables, it just implements a set of dma_map_ops.
 
 I think it would make sense to move the DMA table handling code and
 the dma_map_ops implementation to the IOMMU driver too. This is also
 how some other IOMMU drivers implement it.

Yes, I feared that this would come up, but I agree that it looks like the
best solution, at least if we really want/need the IOMMU API for s390 now.
I'll need to discuss this with Frank; he seems to be on vacation this week.

Thanks for your feedback and explanations!

 The plan is to consolidate the dma_ops implementations someday and
 have a common implementation that works with all IOMMU drivers across
 architectures. This would benefit s390 as well and obsoletes the
 driver specific dma_ops implementation.
 
  Of course this would also go horribly wrong if a device was already
  in use (via the current dma_map_ops), but I guess using devices
  through the IOMMU_API prevents using them otherwise?
 
 This is taken care of by the device drivers. A driver for a device
 either uses the DMA-API or does its own management of DMA mappings
 using the IOMMU-API. VFIO is an example for the later case.
 
   I think it is much easier to use the same DMA table for all
   devices in a domain, if the hardware allows that.
  
  Yes, in this case, having one DMA table per domain and sharing it
  between all devices in that domain sounds like a good idea. However,
  I can't think of any use case for this, and Frank probably had a
  very special use case in mind where this scenario doesn't appear,
  hence the one device per domain restriction.
 
 One usecase is device access from user-space via VFIO. A userspace
 process might want to access multiple devices at the same time and
 VFIO would implement this by assigning all of these devices to the
 same IOMMU domain.
 
 This requirement also comes also from the IOMMU-API itself. The
 intention of the API is to make different IOMMUs look the same through
 the API, and this is violated when drivers implement a 1-1
 domain-device mapping.
 
  So, if having multiple devices per domain is a must, then we
  probably need a thorough rewrite of the arch/s390/pci/pci_dma.c
  code.
 
 Yes, this is a requirement for new IOMMU drivers. We already have
 drivers implementing the same 1-1 relation and we are about to fix
 them. But I don't want to add new drivers doing the same.
 
 
   Joerg
 
 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu