Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-11-02 Thread Christian König

Am 02.11.23 um 12:19 schrieb Lucas Stach:

Am Donnerstag, dem 02.11.2023 um 11:48 +0100 schrieb Christian König:
[...]

I was considering to change the dma_fence semantics so that
dma_fence_signal() could only be called from the interrupt contexts of
devices and then put a big fat WARN_ON(!in_interrupt()) in there.

It's a sledgehammer, but as far as I can see the only thing which might
help. Opinions?

That's not going to fly. As soon as you are dealing with device drivers
that use IRQ threads, either voluntarily or even involuntarily on RT
kernels, dma_fence_signal() will be called from process context.


Ah shit, yeah of course. We use IRQ threads in amdgpu for the second 
interrupt ring as well.


Ok, nail that coffin. Any other ideas how we could enforce this?

Thanks,
Christian.



Regards,
Lucas




Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-11-02 Thread Lucas Stach
Am Donnerstag, dem 02.11.2023 um 11:48 +0100 schrieb Christian König:
[...]
> I was considering to change the dma_fence semantics so that 
> dma_fence_signal() could only be called from the interrupt contexts of 
> devices and then put a big fat WARN_ON(!in_interrupt()) in there.
> 
> It's a sledgehammer, but as far as I can see the only thing which might 
> help. Opinions?

That's not going to fly. As soon as you are dealing with device drivers
that use IRQ threads, either voluntarily or even involuntarily on RT
kernels, dma_fence_signal() will be called from process context.
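
A concrete (hypothetical and simplified) sketch of what I mean -- any
driver wired up like this signals from a kernel thread, and on
PREEMPT_RT even a plain request_irq() handler gets force-threaded:

#include <linux/interrupt.h>
#include <linux/dma-fence.h>

struct example_gpu {
        struct dma_fence *done_fence;   /* fence for the job in flight */
};

static irqreturn_t example_gpu_irq_thread(int irq, void *data)
{
        struct example_gpu *gpu = data;

        /* Runs in an IRQ thread, i.e. process context:
         * in_interrupt() is false here. */
        dma_fence_signal(gpu->done_fence);
        return IRQ_HANDLED;
}

static int example_gpu_request_irq(struct example_gpu *gpu, int irq)
{
        /* No hardirq handler at all, only the threaded one above. */
        return request_threaded_irq(irq, NULL, example_gpu_irq_thread,
                                    IRQF_ONESHOT, "example-gpu", gpu);
}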

Regards,
Lucas


Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-11-02 Thread Christian König

Am 01.11.23 um 09:13 schrieb Daniel Vetter:

On Wed, 1 Nov 2023 at 07:59, Dave Airlie  wrote:

Well, to make it clear once more: Signaling a dma_fence from the
destructor of a reference counted object is very problematic! This will
be rejected no matter if you do that in C or in Rust.

What we can do is to make it safe in the sense that you don't access
freed up memory by using the scheduler fences even more as a wrapper
around the hardware fence as we do now. But this is quite a change and
requires a bit more than just hacking around
drm_sched_fence_get_timeline_name().

I really think this needs to be documented if nothing else out of this thread.

Clearly nobody is going to get it right, and hidden here in this
thread, this info isn't useful.

Can we have some sort of design document for the dma-fence/scheduler
interactions written and we can try and refine it with solutions on
the list, because I'm tired of people proposing things and NAK's
getting thrown around without anything to point people at.

The next NAK I see on the list will mean I block all patches from the
sender until they write a documentation patch, because seriously this
stuff is too hard for someone to just keep it in their head and expect
everyone else to understand from reading the code.

I very much like the idea that NAK replies are counted as "you've just
volunteered yourself for some documentation patches so that next time
around you can reply with a link to the docs instead of just a NAK".


Yeah, that sounds like a great idea to me as well :)

Especially when I can use it to convince managers that we need to have
more people working on writing documentation.



I don't think we'll get out of these discussions otherwise, since
currently we have undocumented, but very tricky semantics of the
drm/sched codebase for ringbuffer scheduling which is extended to fw
scheduling in also very tricky ways, with not entirely clear impacts
on semantics of all the drm/sched things. And as a result we just pile
up enormous amounts of threads where I think the only thing assured is
that people talk past each other.


The scheduler is certainly the ugliest part, but it's unfortunately 
still only the tip of the iceberg.


I have seen at least half a dozen approaches in the last two years where 
people tried to signal a dma_fence from userspace or similar.


Fortunately it was mostly prototyping and I could jump in early enough 
to stop that, but basically this is a fight against windmills.


I was considering to change the dma_fence semantics so that 
dma_fence_signal() could only be called from the interrupt contexts of 
devices and then put a big fat WARN_ON(!in_interrupt()) in there.


It's a sledgehammer, but as far as I can see the only thing which might 
help. Opinions?
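
Roughly what I had in mind, as an untested sketch (the body only
approximates the real dma_fence_signal() in drivers/dma-buf/dma-fence.c,
the WARN_ON would be the only new line):

int dma_fence_signal(struct dma_fence *fence)
{
        unsigned long flags;
        int ret;

        if (!fence)
                return -EINVAL;

        /* The proposed sledgehammer: complain whenever a fence is
         * signaled from process context instead of from the device's
         * interrupt context. */
        WARN_ON(!in_interrupt());

        spin_lock_irqsave(fence->lock, flags);
        ret = dma_fence_signal_locked(fence);
        spin_unlock_irqrestore(fence->lock, flags);

        return ret;
}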


Thanks,
Christian.



Converting NAKs into doc patches should at least eventually get rid of
the worst confusions we're dealing with here.

Cheers, Sima




Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-11-01 Thread Daniel Vetter
On Wed, 1 Nov 2023 at 07:59, Dave Airlie  wrote:
>
> >
> > Well, to make it clear once more: Signaling a dma_fence from the
> > destructor of a reference counted object is very problematic! This will
> > be rejected no matter if you do that in C or in Rust.
> >
> > What we can do is to make it safe in the sense that you don't access
> > freed up memory by using the scheduler fences even more as a wrapper
> > around the hardware fence as we do now. But this is quite a change and
> > requires a bit more than just hacking around
> > drm_sched_fence_get_timeline_name().
>
> I really think this needs to be documented if nothing else out of this thread.
>
> Clearly nobody is going to get it right, and hidden here in this
> thread, this info isn't useful.
>
> Can we have some sort of design document for the dma-fence/scheduler
> interactions written and we can try and refine it with solutions on
> the list, because I'm tired of people proposing things and NAK's
> getting thrown around without anything to point people at.
>
> The next NAK I see on the list will mean I block all patches from the
> sender until they write a documentation patch, because seriously this
> stuff is too hard for someone to just keep it in their head and expect
> everyone else to understand from reading the code.

I very much like the idea that NAK replies are counted as "you've just
volunteered yourself for some documentation patches so that next time
around you can reply with a link to the docs instead of just a NAK".

I don't think we'll get out of these discussions otherwise, since
currently we have undocumented, but very tricky semantics of the
drm/sched codebase for ringbuffer scheduling which is extended to fw
scheduling in also very tricky ways, with not entirely clear impacts
on semantics of all the drm/sched things. And as a result we just pile
up enormous amounts of threads where I think the only thing assured is
that people talk past each other.

Converting NAKs into doc patches should at least eventually get rid of
the worst confusions we're dealing with here.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-11-01 Thread Dave Airlie
>
> Well, to make it clear once more: Signaling a dma_fence from the
> destructor of a reference counted object is very problematic! This will
> be rejected no matter if you do that in C or in Rust.
>
> What we can do is to make it safe in the sense that you don't access
> freed up memory by using the scheduler fences even more as a wrapper
> around the hardware fence as we do now. But this is quite a change and
> requires a bit more than just hacking around
> drm_sched_fence_get_timeline_name().

I really think this needs to be documented if nothing else out of this thread.

Clearly nobody is going to get it right, and hidden here in this
thread, this info isn't useful.

Can we have some sort of design document for the dma-fence/scheduler
interactions written and we can try and refine it with solutions on
the list, because I'm tired of people proposing things and NAK's
getting thrown around without anything to point people at.

The next NAK I see on the list will mean I block all patches from the
sender until they write a documentation patch, because seriously this
stuff is too hard for someone to just keep it in their head and expect
everyone else to understand from reading the code.

Dave.


Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-31 Thread Christian König

Am 21.07.23 um 12:33 schrieb Asahi Lina:

[SNIP]


I've already tried to explain the issue. The DRM scheduler design, as 
it stands today, makes it impractical to write a safe Rust abstraction 
for it. This is a fact. Christian has repeatedly NAKed my attempts at 
changing it to make such a safe abstraction possible, and is clearly 
opposed to the fundamental lifetime requirements change I am trying to 
make here. Therefore, we are at an impasse.


It's unfortunate, but given this situation, the DRM scheduler will not 
be available to Rust DRM drivers. I thought the goal of the DRM 
subsystem common code was to cater to multiple drivers and usage 
approaches, with an emphasis on doing things "right" and avoiding 
design issues that are common mistakes in driver design. Clearly, this 
is not the case for all of DRM, at least not the DRM scheduler.


In software engineering, complex lifetime requirements are an 
anti-pattern, which is one reason why Rust draws a line between safe 
and unsafe APIs. For a safe API, it is up to the API developer to 
design it such that it cannot be misused in a way that causes memory 
safety issues, and the only lifetime requirements it can impose are 
those that can be expressed in the Rust type system and statically 
checked at compile time. The DRM scheduler's complex chain of lifetime 
requirements cannot, and wrapping it in enough glue to remove those 
lifetime requirements would require about as much code as just 
rewriting it, as well as add unacceptable duplication and overhead.


In kernel Rust, we strive to only have safe APIs for components which 
have no fundamental reason for being unsafe (things like direct memory 
mapping and raw hardware access). The DRM scheduler does not fall into 
one of those "inherently unsafe" categories, so it doesn't make sense 
to provide a raw unsafe API for it.


This is not completely correct. The DRM scheduler provides a dma_fence 
interface as a wrapper around hardware dma_fence interfaces.


This interface is made to allow core Linux memory management to query 
the progress of hardware operations.


So you are working with something very low level here and have to follow 
restrictions which Rust can't enforce at the language level because it 
simply can't know about them at compile time.
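
To make the wrapper relationship concrete, this is roughly what a
scheduler fence looks like (simplified sketch, the real struct
drm_sched_fence lives in include/drm/gpu_scheduler.h):

struct drm_sched_fence {
        /* Signaled once the job has been picked up by the scheduler. */
        struct dma_fence scheduled;
        /* Signaled once the job has completed on the hardware. */
        struct dma_fence finished;
        /* The underlying hardware fence this scheduler fence wraps. */
        struct dma_fence *parent;
        /* Plain, non-refcounted back pointer to the scheduler; this is
         * what get_timeline_name() dereferences and what the
         * use-after-free under discussion is about. */
        struct drm_gpu_scheduler *sched;
        spinlock_t lock;
        void *owner;
};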


Doing so would just expose Rust code to the kind of subtle safety 
implications that cause memory problems every day in C. If I were to 
use drm_sched's unsafe API "as is" in my driver, it would *by far* be 
the least auditable, most complex usage of unsafe code in the entire 
driver, and I have no confidence that I would be able to get it right 
and keep it right as the driver changes.


I don't see a reason why this cannot be simply fixed in drm_sched, but 
Christian disagrees, and has repeatedly (and strongly) NAKed my 
attempts at doing so, and indeed NAKed the entire premise of the 
change in lifetime requirements itself. So here we are. There is no 
solution that will work for everyone that involves drm_sched.


So I don't have a choice other than to just not use drm_sched and roll 
my own. I will happily move that Rust implementation to common code if 
another Rust driver comes along and wants to use it. I'm not sure if 
that kind of thing can be easily made available to C drivers in 
reverse, but I guess we'll cross that bridge when a C driver expresses 
interest in using it.


Well, to make it clear once more: Signaling a dma_fence from the 
destructor of a reference counted object is very problematic! This will 
be rejected no matter if you do that in C or in Rust.


What we can do is to make it safe in the sense that you don't access 
freed up memory by using the scheduler fences even more as a wrapper 
around the hardware fence as we do now. But this is quite a change and 
requires a bit more than just hacking around 
drm_sched_fence_get_timeline_name().




So far it seems existing C drivers are happy with drm_sched's design 
and complex lifetime requirements, even though they aren't even 
documented. Perhaps they managed to reverse engineer them from the 
source, or someone told the authors about it (certainly nobody told me 
when I started using drm_sched). Or maybe a bunch of the drm_scheduler 
users are actually broken and have memory safety issues in corner 
cases. I don't know, though if I had to bet, I'd bet on the second 
option.


Actually, let's do a quick search and find out!

panfrost_remove() -> panfrost_device_fini() -> panfrost_job_fini() -> 
drm_sched_fini()


There is a direct, synchronous path between device removal and 
destroying the DRM scheduler. At no point does it wait for any jobs to 
complete. That's what patch #3 in this series tries to fix.
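
Schematically, the teardown path above looks like this (hypothetical
driver code, not the actual panfrost source, it only illustrates the
ordering problem):

struct example_device {
        struct drm_device drm;
        struct drm_gpu_scheduler sched;
};

static void example_remove(struct platform_device *pdev)
{
        struct example_device *edev = platform_get_drvdata(pdev);

        /* Synchronous teardown: nothing here waits for in-flight jobs
         * or for outstanding scheduler fences to be released. */
        drm_sched_fini(&edev->sched);

        /* Once edev is freed, any signaled scheduler fence still held
         * by a shared dma-buf points at freed memory. */
}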


In fact, it doesn't even keep the entities alive! It calls 
drm_dev_unregister() before everything else, but as the docs for the 
DRM driver lifetimes clearly say (see, docs!), objects visible to 
userspace must survive that and only be released from the release 
callback. 

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-28 Thread Christian König

Am 18.07.23 um 04:35 schrieb Asahi Lina:

On 18/07/2023 00.55, Christian König wrote:

[SNIP]


I give up. You are ignoring everything we say, and rejecting 
everything we suggest. We've already explained why drm_sched doesn't 
work for us. I'm tired of repeating the same explanation over and over 
again only to be ignored and told I'm wrong.


I'm not telling you that you are wrong in any way. What I'm pointing out 
is that your solution won't work upstream and you need to take a step 
back and think about why this won't work.




I'll start working on a new, much simpler Rust-native scheduler based 
on the workqueue Rust abstractions which are in review.


Please note that when you are implementing a dma_fence interface you 
also need my acknowledgement to get this upstream.


In addition to that, Dave and Daniel might have objections to this as well.

Regards,
Christian.



~~ Lina





Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-21 Thread Asahi Lina

On 18/07/2023 14.45, Luben Tuikov wrote:

On 2023-07-17 22:35, Asahi Lina wrote:

On 18/07/2023 00.55, Christian König wrote:

Am 15.07.23 um 16:14 schrieb aly...@rosenzweig.io:

15 July 2023 at 00:03, "Luben Tuikov"  wrote:

On 2023-07-14 05:57, Christian König wrote:


Am 14.07.23 um 11:49 schrieb Asahi Lina:


On 14/07/2023 17.43, Christian König wrote:


Am 14.07.23 um 10:21 schrieb Asahi Lina:
A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
   include/drm/gpu_scheduler.h  | 5 +
   3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
      /*
    * Fence is from the same scheduler, only need to wait for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and reallocated
+ * as another scheduler. This false positive case is okay,
as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.

This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That
means *this* entity certainly exists for *this* scheduler, but the
*dependency* fence might have come from a past scheduler that was
already destroyed along with all of its entities, and its address reused.


Well this function is not about fences, this function is a callback

for the entity.



Christian, I'm really getting tired of your tone. I don't appreciate
being told my comments are "outright nonsense" when you don't even
take the time to understand what the issue is and what I'm trying to
do/document. If you aren't interested in working with me, I'm just
going to give up on drm_sched, wait until Rust gets workqueue support,
and reimplement it in Rust. You can keep your broken fence lifetime
semantics and I'll do my own thing.


I'm certainly trying to help here, but you seem to have unrealistic

expectations.

I perfectly understand what you are trying to do, but you don't seem to

understand that this functionality here isn't made for your use case.

We can adjust the functionality to better match your requirements, but

you can't say it is broken because it doesn't work when you use it not
in the way it is intended to be used.


I believe "adjusting" functionality to fit some external requirements,
may have unintended consequences, requiring yet more and more "adjustments".
(Or may allow (new) drivers to do wild things which may lead to wild results. 
:-) )

We need to be extra careful and wary of this.

Either drm/scheduler is common code that we should use for our driver, in which case we 
need to "adjust" it to fit the requirements of a safe Rust abstraction usable 
for AGX.


Well this is the fundamental disagreement we have. As far as I can see
you don't need to adjust anything in the common drm/scheduler code.

That code works with quite a bunch of different drivers, including the
Intel XE which has similar requirements to your work here.

We can talk about gradually improving the common code, but as Luben
already wrote as well this needs to be done very carefully.


Or, drm/scheduler is not common code intended for drivers with our 
requirements, and then we need to be able to write our own scheduler.

AMD has NAK'd both options, effectively NAK'ing the driver.

I will ask a simple yes/no question: Should we use drm/sched?


Well, yes.



If yes, it will need patches like these,


No, you don't.

First of all you need to try to adjust your driver to match the
requirements of drm/scheduler and *not* the other way around.


and AMD needs to be ok with that and stop NAK'ing them on sight because 
they don't match the existing requirements.

If no, we will write our own scheduler in Rust, and AMD needs to be ok with 
that and not NAK it on sight because it's not drm/sched.

Which is it?

Note if we write a Rust scheduler, drm/sched and 

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-18 Thread Pekka Paalanen
On Mon, 17 Jul 2023 17:55:04 +0200
Christian König  wrote:

> Am 15.07.23 um 16:14 schrieb aly...@rosenzweig.io:

...

> > Lina has been polite and accommodating while AMD calls her code
> > "outright nonsense" and gets "outright NAK"s, and puts her into an
> > impossible catch-22 where no matter what she does it's NAK'd.  
> 
> Well as far as I can see I'm totally polite as well.

Christian,

politeness is in the eye of the beholder. You do not get to decide how
other people feel.

I consider myself a very blunt and difficult reviewer on my own area
(which I consider mostly as a negative trait), and while I have
alienated some people over the years, I try hard to not intentionally
hurt anyone. Sometimes it means that writing one email takes an hour or
two. It can take a tremendous amount of energy. Like this email here.

If people have the courage to repeatedly tell someone that the someone
comes out as off-putting, it cannot be dismissed. It really means
coming out as off-putting. There does not need to be anything malicious
related to it from either side, it could as well be a cultural
difference that one cannot know in advance, or it could be a personal
hurt inside the offending person lashing out.

When told, it is time to reflect.

> Pointing out that approaches don't seem to make sense and NAKing 
> patches is a perfectly normal part of the review process.

Yes. You don't have to change your message.

One only needs to make an effort to try to change their tone. Otherwise
they lose and alienate developers by choosing to hurt them. It was an
accident before one knew about it, but now it is known, so how one
communicates is a decision. It's no longer an accident.

> What you need to do is to take a step back and ask yourself why this 
> here is facing so much rejection from our side. I have to admit that I 
> don't seem to be good at explaining that, because we are obviously talking 
> past each other, but you don't seem to try hard to understand what I'm 
> pointing out either.

Maybe try using a softer tone for a start? Lina has reiterated the
restrictions imposed by the hardware, the firmware they cannot change,
and Rust design principles. How do *you* fit those with unchanged
drm/sched?

> > That's not ok.  
> 
> As far as I can see it is.

Hurting people is not ok.

Not even if the kernel community culture traditionally does so.

> As maintainer of a commonly used component my first duty is to preserve 
> the status quo and prevent modifications which are not well thought 
> through.

Of course.

Accidentally hurting someone is eventually unavoidable. Defending the
communication style that hurt someone in order to keep on doing that
just makes one look like a d...


Thanks,
pq




Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-17 Thread Luben Tuikov
On 2023-07-17 22:35, Asahi Lina wrote:
> On 18/07/2023 00.55, Christian König wrote:
>> Am 15.07.23 um 16:14 schrieb aly...@rosenzweig.io:
>>> 15 July 2023 at 00:03, "Luben Tuikov"  wrote:
 On 2023-07-14 05:57, Christian König wrote:

> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>
>> On 14/07/2023 17.43, Christian König wrote:
>>
>Am 14.07.23 um 10:21 schrieb Asahi Lina:
>A signaled scheduler fence can outlive its scheduler, since fences are
>independently reference counted. Therefore, we can't reference the
>scheduler in the get_timeline_name() implementation.
>
>Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>dma-bufs reference fences from GPU schedulers that no longer exist.
>
>Signed-off-by: Asahi Lina 
>---
>   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
>   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>   include/drm/gpu_scheduler.h  | 5 +
>   3 files changed, 14 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>b/drivers/gpu/drm/scheduler/sched_entity.c
>index b2bbc8a68b30..17f35b0b005a 100644
>--- a/drivers/gpu/drm/scheduler/sched_entity.c
>+++ b/drivers/gpu/drm/scheduler/sched_entity.c
>@@ -389,7 +389,12 @@ static bool
>drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>      /*
>    * Fence is from the same scheduler, only need to wait for
>- * it to be scheduled
>+ * it to be scheduled.
>+ *
>+ * Note: s_fence->sched could have been freed and reallocated
>+ * as another scheduler. This false positive case is okay,
>as if
>+ * the old scheduler was freed all of its jobs must have
>+ * signaled their completion fences.
>
>This is outright nonsense. As long as an entity for a scheduler exists
>it is not allowed to free up this scheduler.
>
>So this function can't be called like this.
>
>> As I already explained, the fences can outlive their scheduler. That
>>means *this* entity certainly exists for *this* scheduler, but the
>>*dependency* fence might have come from a past scheduler that was
>>already destroyed along with all of its entities, and its address 
>> reused.
>>
>
>Well this function is not about fences, this function is a callback
>for the entity.
>
>
>> Christian, I'm really getting tired of your tone. I don't appreciate
>>being told my comments are "outright nonsense" when you don't even
>>take the time to understand what the issue is and what I'm trying to
>>do/document. If you aren't interested in working with me, I'm just
>>going to give up on drm_sched, wait until Rust gets workqueue support,
>>and reimplement it in Rust. You can keep your broken fence lifetime
>>semantics and I'll do my own thing.
>>
>
>I'm certainly trying to help here, but you seem to have unrealistic
>expectations.
>
>I perfectly understand what you are trying to do, but you don't seem to
>understand that this functionality here isn't made for your use case.
>
>We can adjust the functionality to better match your requirements, but
>you can't say it is broken because it doesn't work when you use it not
>in the way it is intended to be used.
>
 I believe "adjusting" functionality to fit some external requirements,
 may have unintended consequences, requiring yet more and more 
 "adjustments".
 (Or may allow (new) drivers to do wild things which may lead to wild 
 results. :-) )

 We need to be extra careful and wary of this.
>>> Either drm/scheduler is common code that we should use for our driver, in 
>>> which case we need to "adjust" it to fit the requirements of a safe Rust 
>>> abstraction usable for AGX.
>>
>> Well this is the fundamental disagreement we have. As far as I can see
>> you don't need to adjust anything in the common drm/scheduler code.
>>
>> That code works with quite a bunch of different drivers, including the
>> Intel XE which has similar requirements to your work here.
>>
>> We can talk about gradually improving the common code, but as Luben
>> already wrote as well this needs to be done very carefully.
>>
>>>Or, drm/scheduler is not common code intended for drivers with our 
>>> requirements, and then we need to be able to write our own scheduler.
>>>
>>> AMD has NAK'd both options, effectively NAK'ing the driver.
>>>
>>> I will ask a simple yes/no question: Should we use drm/sched?
>>
>> Well, yes.
>>
>>>
>>> If yes, it will need patches like these,
>>
>> No, you don't.
>>

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-17 Thread Asahi Lina

On 18/07/2023 00.55, Christian König wrote:

Am 15.07.23 um 16:14 schrieb aly...@rosenzweig.io:

15 July 2023 at 00:03, "Luben Tuikov"  wrote:

On 2023-07-14 05:57, Christian König wrote:


Am 14.07.23 um 11:49 schrieb Asahi Lina:


On 14/07/2023 17.43, Christian König wrote:


   Am 14.07.23 um 10:21 schrieb Asahi Lina:
   A signaled scheduler fence can outlive its scheduler, since fences are
   independently reference counted. Therefore, we can't reference the
   scheduler in the get_timeline_name() implementation.

   Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
   dma-bufs reference fences from GPU schedulers that no longer exist.

   Signed-off-by: Asahi Lina 
   ---
      drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
      drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
      include/drm/gpu_scheduler.h  | 5 +
      3 files changed, 14 insertions(+), 2 deletions(-)

   diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
   b/drivers/gpu/drm/scheduler/sched_entity.c
   index b2bbc8a68b30..17f35b0b005a 100644
   --- a/drivers/gpu/drm/scheduler/sched_entity.c
   +++ b/drivers/gpu/drm/scheduler/sched_entity.c
   @@ -389,7 +389,12 @@ static bool
   drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
         /*
       * Fence is from the same scheduler, only need to wait for
   - * it to be scheduled
   + * it to be scheduled.
   + *
   + * Note: s_fence->sched could have been freed and reallocated
   + * as another scheduler. This false positive case is okay,
   as if
   + * the old scheduler was freed all of its jobs must have
   + * signaled their completion fences.

   This is outright nonsense. As long as an entity for a scheduler exists
   it is not allowed to free up this scheduler.

   So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That
   means *this* entity certainly exists for *this* scheduler, but the
   *dependency* fence might have come from a past scheduler that was
   already destroyed along with all of its entities, and its address reused.

   
   Well this function is not about fences, this function is a callback

   for the entity.
   


Christian, I'm really getting tired of your tone. I don't appreciate
   being told my comments are "outright nonsense" when you don't even
   take the time to understand what the issue is and what I'm trying to
   do/document. If you aren't interested in working with me, I'm just
   going to give up on drm_sched, wait until Rust gets workqueue support,
   and reimplement it in Rust. You can keep your broken fence lifetime
   semantics and I'll do my own thing.

   
   I'm certainly trying to help here, but you seem to have unrealistic

   expectations.
   
   I perfectly understand what you are trying to do, but you don't seem to

   understand that this functionality here isn't made for your use case.
   
   We can adjust the functionality to better match your requirements, but

   you can't say it is broken because it doesn't work when you use it not
   in the way it is intended to be used.


I believe "adjusting" functionality to fit some external requirements,
may have unintended consequences, requiring yet more and more "adjustments".
(Or may allow (new) drivers to do wild things which may lead to wild results. 
:-) )

We need to be extra careful and wary of this.

Either drm/scheduler is common code that we should use for our driver, in which case we 
need to "adjust" it to fit the requirements of a safe Rust abstraction usable 
for AGX.


Well this is the fundamental disagreement we have. As far as I can see
you don't need to adjust anything in the common drm/scheduler code.

That code works with quite a bunch of different drivers, including the
Intel XE which has similar requirements to your work here.

We can talk about gradually improving the common code, but as Luben
already wrote as well this needs to be done very carefully.


   Or, drm/scheduler is not common code intended for drivers with our 
requirements, and then we need to be able to write our own scheduler.

AMD has NAK'd both options, effectively NAK'ing the driver.

I will ask a simple yes/no question: Should we use drm/sched?


Well, yes.



If yes, it will need patches like these,


No, you don't.

First of all you need to try to adjust your driver to match the
requirements of drm/scheduler and *not* the other way around.


   and AMD needs to be ok with that and stop NAK'ing them on sight because they 
don't match the existing requirements.

If no, we will write our own scheduler in Rust, and AMD needs to be ok with 
that and not NAK it on sight because it's not drm/sched.

Which is it?

Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that 
and AMD comes back and NAKs it -- as said in this thread would "probably" 
happen -- then it is 

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-17 Thread Christian König

Am 15.07.23 um 16:14 schrieb aly...@rosenzweig.io:

15 July 2023 at 00:03, "Luben Tuikov"  wrote:

On 2023-07-14 05:57, Christian König wrote:


Am 14.07.23 um 11:49 schrieb Asahi Lina:


On 14/07/2023 17.43, Christian König wrote:


  Am 14.07.23 um 10:21 schrieb Asahi Lina:
  A signaled scheduler fence can outlive its scheduler, since fences are
  independently reference counted. Therefore, we can't reference the
  scheduler in the get_timeline_name() implementation.

  Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
  dma-bufs reference fences from GPU schedulers that no longer exist.

  Signed-off-by: Asahi Lina 
  ---
     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
     include/drm/gpu_scheduler.h  | 5 +
     3 files changed, 14 insertions(+), 2 deletions(-)

  diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
  b/drivers/gpu/drm/scheduler/sched_entity.c
  index b2bbc8a68b30..17f35b0b005a 100644
  --- a/drivers/gpu/drm/scheduler/sched_entity.c
  +++ b/drivers/gpu/drm/scheduler/sched_entity.c
  @@ -389,7 +389,12 @@ static bool
  drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
        /*
      * Fence is from the same scheduler, only need to wait for
  - * it to be scheduled
  + * it to be scheduled.
  + *
  + * Note: s_fence->sched could have been freed and reallocated
  + * as another scheduler. This false positive case is okay,
  as if
  + * the old scheduler was freed all of its jobs must have
  + * signaled their completion fences.

  This is outright nonsense. As long as an entity for a scheduler exists
  it is not allowed to free up this scheduler.

  So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That
  means *this* entity certainly exists for *this* scheduler, but the
  *dependency* fence might have come from a past scheduler that was
  already destroyed along with all of its entities, and its address reused.

  
  Well this function is not about fences, this function is a callback

  for the entity.
  


Christian, I'm really getting tired of your tone. I don't appreciate
  being told my comments are "outright nonsense" when you don't even
  take the time to understand what the issue is and what I'm trying to
  do/document. If you aren't interested in working with me, I'm just
  going to give up on drm_sched, wait until Rust gets workqueue support,
  and reimplement it in Rust. You can keep your broken fence lifetime
  semantics and I'll do my own thing.

  
  I'm certainly trying to help here, but you seem to have unrealistic

  expectations.
  
  I perfectly understand what you are trying to do, but you don't seem to

  understand that this functionality here isn't made for your use case.
  
  We can adjust the functionality to better match your requirements, but

  you can't say it is broken because it doesn't work when you use it not
  in the way it is intended to be used.


I believe "adjusting" functionality to fit some external requirements,
may have unintended consequences, requiring yet more and more "adjustments".
(Or may allow (new) drivers to do wild things which may lead to wild results. 
:-) )

We need to be extra careful and wary of this.

Either drm/scheduler is common code that we should use for our driver, in which case we 
need to "adjust" it to fit the requirements of a safe Rust abstraction usable 
for AGX.


Well this is the fundamental disagreement we have. As far as I can see 
you don't need to adjust anything in the common drm/scheduler code.


That code works with quite a bunch of different drivers, including the 
Intel XE which has similar requirements to your work here.


We can talk about gradually improving the common code, but as Luben 
already wrote as well this needs to be done very carefully.



  Or, drm/scheduler is not common code intended for drivers with our 
requirements, and then we need to be able to write our own scheduler.

AMD has NAK'd both options, effectively NAK'ing the driver.

I will ask a simple yes/no question: Should we use drm/sched?


Well, yes.



If yes, it will need patches like these,


No, you don't.

First of all you need to try to adjust your driver to match the 
requirements of drm/scheduler and *not* the other way around.



  and AMD needs to be ok with that and stop NAK'ing them on sight because they 
don't match the existing requirements.

If no, we will write our own scheduler in Rust, and AMD needs to be ok with 
that and not NAK it on sight because it's not drm/sched.

Which is it?

Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If we do that 
and AMD comes back and NAKs it -- as said in this thread would "probably" 
happen -- then it is impossible for us to upstream a driver regardless of whether we use 
drm/sched.

Lina has been 

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-15 Thread alyssa
15 July 2023 at 00:03, "Luben Tuikov"  wrote:


> 
> On 2023-07-14 05:57, Christian König wrote:
> 
> > 
> > Am 14.07.23 um 11:49 schrieb Asahi Lina:
> > 
> > > 
> > > On 14/07/2023 17.43, Christian König wrote:
> > > 
> > 
> >  Am 14.07.23 um 10:21 schrieb Asahi Lina:
> >  A signaled scheduler fence can outlive its scheduler, since fences are
> >  independently reference counted. Therefore, we can't reference the
> >  scheduler in the get_timeline_name() implementation.
> > 
> >  Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
> >  dma-bufs reference fences from GPU schedulers that no longer exist.
> > 
> >  Signed-off-by: Asahi Lina 
> >  ---
> >     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
> >     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
> >     include/drm/gpu_scheduler.h  | 5 +
> >     3 files changed, 14 insertions(+), 2 deletions(-)
> > 
> >  diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> >  b/drivers/gpu/drm/scheduler/sched_entity.c
> >  index b2bbc8a68b30..17f35b0b005a 100644
> >  --- a/drivers/gpu/drm/scheduler/sched_entity.c
> >  +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> >  @@ -389,7 +389,12 @@ static bool 
> >  drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> >        /*
> >      * Fence is from the same scheduler, only need to wait for
> >  - * it to be scheduled
> >  + * it to be scheduled.
> >  + *
> >  + * Note: s_fence->sched could have been freed and reallocated
> >  + * as another scheduler. This false positive case is okay, 
> >  as if
> >  + * the old scheduler was freed all of its jobs must have
> >  + * signaled their completion fences.
> > 
> >  This is outright nonsense. As long as an entity for a scheduler exists
> >  it is not allowed to free up this scheduler.
> > 
> >  So this function can't be called like this.
> > 
> > > 
> > > As I already explained, the fences can outlive their scheduler. That 
> > >  means *this* entity certainly exists for *this* scheduler, but the 
> > >  *dependency* fence might have come from a past scheduler that was 
> > >  already destroyed along with all of its entities, and its address reused.
> > > 
> > 
> >  
>  Well this function is not about fences, this function is a callback 
> >  for the entity.
> >  
> > 
> > > 
> > > Christian, I'm really getting tired of your tone. I don't appreciate 
> > >  being told my comments are "outright nonsense" when you don't even 
> > >  take the time to understand what the issue is and what I'm trying to 
> > >  do/document. If you aren't interested in working with me, I'm just 
> > >  going to give up on drm_sched, wait until Rust gets workqueue support, 
> > >  and reimplement it in Rust. You can keep your broken fence lifetime 
> > >  semantics and I'll do my own thing.
> > > 
> > 
> >  
> >  I'm certainly trying to help here, but you seem to have unrealistic 
> >  expectations.
> >  
> >  I perfectly understand what you are trying to do, but you don't seem to 
> >  understand that this functionality here isn't made for your use case.
> >  
> >  We can adjust the functionality to better match your requirements, but 
> >  you can't say it is broken because it doesn't work when you use it not 
> >  in the way it is intended to be used.
> > 
> 
> I believe "adjusting" functionality to fit some external requirements,
> may have unintended consequences, requiring yet more and more "adjustments".
> (Or may allow (new) drivers to do wild things which may lead to wild results. 
> :-) )
> 
> We need to be extra careful and wary of this.

Either drm/scheduler is common code that we should use for our driver, in which 
case we need to "adjust" it to fit the requirements of a safe Rust abstraction 
usable for AGX. Or, drm/scheduler is not common code intended for drivers with 
our requirements, and then we need to be able to write our own scheduler.

AMD has NAK'd both options, effectively NAK'ing the driver.

I will ask a simple yes/no question: Should we use drm/sched?

If yes, it will need patches like these, and AMD needs to be ok with that and 
stop NAK'ing them on sight because they don't match the existing requirements.

If no, we will write our own scheduler in Rust, and AMD needs to be ok with 
that and not NAK it on sight because it's not drm/sched.

Which is it?

Note if we write a Rust scheduler, drm/sched and amdgpu will be unaffected. If 
we do that and AMD comes back and NAKs it -- as said in this thread would 
"probably" happen -- then it is impossible for us to upstream a driver 
regardless of whether we use drm/sched.

Lina has been polite and accommodating while AMD calls her code "outright 
nonsense" and gets "outright NAK"s, and puts her into an impossible catch-22 
where no matter what she does it's NAK'd.

That's not ok.


Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Luben Tuikov
On 2023-07-14 05:57, Christian König wrote:
> Am 14.07.23 um 11:49 schrieb Asahi Lina:
>> On 14/07/2023 17.43, Christian König wrote:
>>> Am 14.07.23 um 10:21 schrieb Asahi Lina:
 A signaled scheduler fence can outlive its scheduler, since fences are
 independently reference counted. Therefore, we can't reference the
 scheduler in the get_timeline_name() implementation.

 Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
 dma-bufs reference fences from GPU schedulers that no longer exist.

 Signed-off-by: Asahi Lina 
 ---
    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
    include/drm/gpu_scheduler.h  | 5 +
    3 files changed, 14 insertions(+), 2 deletions(-)

 diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
 b/drivers/gpu/drm/scheduler/sched_entity.c
 index b2bbc8a68b30..17f35b0b005a 100644
 --- a/drivers/gpu/drm/scheduler/sched_entity.c
 +++ b/drivers/gpu/drm/scheduler/sched_entity.c
 @@ -389,7 +389,12 @@ static bool 
 drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
       /*
     * Fence is from the same scheduler, only need to wait for
 - * it to be scheduled
 + * it to be scheduled.
 + *
 + * Note: s_fence->sched could have been freed and reallocated
 + * as another scheduler. This false positive case is okay, 
 as if
 + * the old scheduler was freed all of its jobs must have
 + * signaled their completion fences.
>>>
>>> This is outright nonsense. As long as an entity for a scheduler exists
>>> it is not allowed to free up this scheduler.
>>>
>>> So this function can't be called like this.
>>
>> As I already explained, the fences can outlive their scheduler. That 
>> means *this* entity certainly exists for *this* scheduler, but the 
>> *dependency* fence might have come from a past scheduler that was 
>> already destroyed along with all of its entities, and its address reused.
> 
> Well this function is not about fences, this function is a callback 
> for the entity.
> 
>>
>> Christian, I'm really getting tired of your tone. I don't appreciate 
>> being told my comments are "outright nonsense" when you don't even 
>> take the time to understand what the issue is and what I'm trying to 
>> do/document. If you aren't interested in working with me, I'm just 
>> going to give up on drm_sched, wait until Rust gets workqueue support, 
>> and reimplement it in Rust. You can keep your broken fence lifetime 
>> semantics and I'll do my own thing.
> 
> I'm certainly trying to help here, but you seem to have unrealistic 
> expectations.
> 
> I perfectly understand what you are trying to do, but you don't seem to 
> understand that this functionality here isn't made for your use case.
> 
> We can adjust the functionality to better match your requirements, but 
> you can't say it is broken because it doesn't work when you use it not 
> in the way it is intended to be used.

I believe "adjusting" functionality to fit some external requirements,
may have unintended consequences, requiring yet more and more "adjustments".
(Or may allow (new) drivers to do wild things which may lead to wild results. 
:-) )

We need to be extra careful and wary of this.
-- 
Regards,
Luben



Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Asahi Lina

On 14/07/2023 19.18, Christian König wrote:

Am 14.07.23 um 12:06 schrieb Asahi Lina:

On 14/07/2023 18.57, Christian König wrote:

Am 14.07.23 um 11:49 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences
are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
     drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
     drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
     include/drm/gpu_scheduler.h  | 5 +
     3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
        /*
  * Fence is from the same scheduler, only need to wait
for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and
reallocated
+ * as another scheduler. This false positive case is okay,
as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That
means *this* entity certainly exists for *this* scheduler, but the
*dependency* fence might have come from a past scheduler that was
already destroyed along with all of its entities, and its address
reused.


Well this function is not about fences, this function is a callback
for the entity.


That deals with dependency fences, which could have come from any
arbitrary source, including another entity and another scheduler.


No, they can't. Signaling is certainly mandatory to happen before things
are released, even if we allow decoupling the dma_fence from its issuer.


That's exactly what I'm saying in my comment. That the fence must be 
signaled if its creator no longer exists, therefore it's okay to 
inadvertently wait on its scheduled fence instead of its finished fence 
(if that one was intended) since everything needs to be signaled at that 
point anyway.






Christian, I'm really getting tired of your tone. I don't appreciate
being told my comments are "outright nonsense" when you don't even
take the time to understand what the issue is and what I'm trying to
do/document. If you aren't interested in working with me, I'm just
going to give up on drm_sched, wait until Rust gets workqueue support,
and reimplement it in Rust. You can keep your broken fence lifetime
semantics and I'll do my own thing.


I'm certainly trying to help here, but you seem to have unrealistic
expectations.


I don't think expecting not to be told my changes are "outright
nonsense" is an unrealistic expectation. If you think it is, maybe
that's yet another indicator of the culture problems the kernel
community has...


Well I'm just pointing out that you don't seem to understand the
background of these things and just think this is a bug instead of
intentional behavior.


I made a change, I explained why that change works with a portion of the 
existing code by updating a comment, and you called that nonsense. It's 
not even a bug, I'm trying to explain why this part isn't a bug even 
with the expectation that fences don't outlive the scheduler. This is 
because I went through the code trying to find problems this approach 
would cause, ran into this tricky case, thought about it for a while, 
realized it wasn't a problem, and figured it needed a comment.



I perfectly understand what you are trying to do, but you don't seem to
understand that this functionality here isn't made for your use case.


I do, that's why I'm trying to change things. Right now, this
functionality isn't even properly documented, which is why I thought
it could be used for my use case, and slowly discovered otherwise.
Daniel suggested documenting it, then fixing the mismatches between
documentation and reality, which is what I'm doing here.


Well, I've known Daniel for something like 10-15 years or so, I'm pretty sure
that he meant that you document the existing state because otherwise
this goes against usual patch submission approaches.




We can adjust the functionality to better match your requirements, but
you can't say it is broken because it doesn't work when you use it not
in the way it is intended to be used.


I'm saying the idea that a 

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Christian König

Am 14.07.23 um 12:07 schrieb Asahi Lina:

On 14/07/2023 18.51, Christian König wrote:

Am 14.07.23 um 11:44 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:
A signaled scheduler fence can outlive its scheduler, since fences 
are

independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
    include/drm/gpu_scheduler.h  | 5 +
    3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
       /*
 * Fence is from the same scheduler, only need to wait 
for

- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and 
reallocated

+ * as another scheduler. This false positive case is okay,
as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


 */
    fence = dma_fence_get(&s_fence->scheduled);
    dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char
*drm_sched_fence_get_driver_name(struct dma_fence *fence)
    static const char *drm_sched_fence_get_timeline_name(struct
dma_fence *f)
    {
    struct drm_sched_fence *fence = to_drm_sched_fence(f);
-    return (const char *)fence->sched->name;
+    return (const char *)fence->sched_name;
    }
       static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
*fence,
    unsigned seq;
       fence->sched = entity->rq->sched;
+    strlcpy(fence->sched_name, entity->rq->sched->name,
+    sizeof(fence->sched_name));
    seq = atomic_inc_return(&entity->fence_seq);
    dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
                   &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h 
b/include/drm/gpu_scheduler.h

index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
 * @lock: the lock used by the scheduled and the finished
fences.
 */
    spinlock_t    lock;
+    /**
+ * @sched_name: the name of the scheduler that owns this
fence. We
+ * keep a copy here since fences can outlive their scheduler.
+ */
+    char sched_name[16];


This just mitigates the problem, but doesn't fix it.


Could you point out any remaining issues so we can fix them? Right now
this absolutely *is* broken and this fixes the breakage I observed. If
there are other bugs remaining, I'd like to know what they are so I
can fix them.


The real issue is that the hw fence is kept around much longer than
that.


As far as I know, the whole point of scheduler fences is to decouple
the hw fences from the consumers.


Well yes and no. The decoupling is for the signaling, it's not
decoupling the lifetime.


When I spoke with Daniel I understood the intent was also to decouple 
the lifetime.


Yes, we discussed that a long long time ago as well.

We *wanted* to decouple the dma_fence lifetime, it's just not done that 
way because of problems with that approach.





I already talked with Daniel about this. The current behavior is
broken. These fences can live forever. It is broken to require that
they outlive the driver that produced them.

In addition to that, I'm not willing to increase the scheduler fence size
like that just to decouple them from the scheduler.


Did you read my explanation on the cover letter as to how this is just
broken right now? We need to fix this. If you have a better suggestion
I'll take it. Doing nothing is not an option.


Well this isn't broken at all. This works exactly as intended, you
just want to use it for something it wasn't made for.

That scheduler fences could be changed to outlive the scheduler which
issued them is possible, but this is certainly a new requirement.


Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Christian König

Am 14.07.23 um 12:06 schrieb Asahi Lina:

On 14/07/2023 18.57, Christian König wrote:

Am 14.07.23 um 11:49 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:
A signaled scheduler fence can outlive its scheduler, since fences 
are

independencly reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
    include/drm/gpu_scheduler.h  | 5 +
    3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
       /*
 * Fence is from the same scheduler, only need to wait 
for

- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and 
reallocated

+ * as another scheduler. This false positive case is okay,
as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That
means *this* entity certainly exists for *this* scheduler, but the
*dependency* fence might have come from a past scheduler that was
already destroyed along with all of its entities, and its address 
reused.


Well this function is not about fences, this function is a callback
for the entity.


That deals with dependency fences, which could have come from any 
arbitrary source, including another entity and another scheduler.


No, they can't. Signaling is certainly mandatory to happen before things
are released, even if we allow decoupling the dma_fence from its issuer.






Christian, I'm really getting tired of your tone. I don't appreciate
being told my comments are "outright nonsense" when you don't even
take the time to understand what the issue is and what I'm trying to
do/document. If you aren't interested in working with me, I'm just
going to give up on drm_sched, wait until Rust gets workqueue support,
and reimplement it in Rust. You can keep your broken fence lifetime
semantics and I'll do my own thing.


I'm certainly trying to help here, but you seem to have unrealistic
expectations.


I don't think expecting not to be told my changes are "outright 
nonsense" is an unrealistic expectation. If you think it is, maybe 
that's yet another indicator of the culture problems the kernel 
community has...


Well I'm just pointing out that you don't seem to understand the 
background of the things and just think this is a bug instead of 
intentional behavior.





I perfectly understand what you are trying to do, but you don't seem to
understand that this functionality here isn't made for your use case.


I do, that's why I'm trying to change things. Right now, this 
functionality isn't even properly documented, which is why I thought 
it could be used for my use case, and slowly discovered otherwise. 
Daniel suggested documenting it, then fixing the mismatches between 
documentation and reality, which is what I'm doing here.


Well, I've known Daniel for something like 10-15 years or so, and I'm pretty
sure he meant that you document the existing state, because otherwise
this goes against the usual patch submission approach.





We can adjust the functionality to better match your requirements, but
you can't say it is broken because it doesn't work when you use it in a
way it isn't intended to be used.


I'm saying the idea that a random dma-buf holds onto a chain of 
references that prevents unloading a driver module that wrote into it 
(and keeps a bunch of random unrelated objects alive) is a broken 
state of affairs.


Well no, this is intentional design. Otherwise the module, and with it
the operations pointers the fences rely on, would go away. We already
discussed that over 10 years ago when Marten came up with the initial
dma_fence design.


The resulting problems are very well known and I completely agree that 
they are undesirable, but this is how the framework works and not just 
the scheduler but the rest of the DMA-buf framework as well.
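
The coupling can be illustrated with a minimal, purely hypothetical fence
provider; the names below are made up, but the point is that the ops table
and the strings it returns live in the issuing module:

	#include <linux/dma-fence.h>

	/* Hypothetical driver fence ops, for illustration only. */
	static const char *example_get_driver_name(struct dma_fence *f)
	{
		return "example";		/* lives in this module's .rodata */
	}

	static const char *example_get_timeline_name(struct dma_fence *f)
	{
		return "example timeline";	/* likewise module memory */
	}

	static const struct dma_fence_ops example_fence_ops = {
		.get_driver_name   = example_get_driver_name,
		.get_timeline_name = example_get_timeline_name,
	};

	/*
	 * Any holder of such a fence (e.g. the dma_buf debugfs dump mentioned
	 * in the patch) may later call fence->ops->get_timeline_name(fence).
	 * That call jumps into this module's text and may dereference this
	 * module's data, so unloading the module while any fence it issued is
	 * still referenced would be a use-after-free of code and data. This
	 * is why the current design requires the issuer to outlive its fences.
	 */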


It may or may not trickle down to actual problems for users (I would 
bet it does in some cases but I don't know for sure), but it's a 
situation that doesn't make any sense.


I know I'm triggering actual breakage with my new use case due to this,
which is why I'm trying to improve things.

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Asahi Lina

On 14/07/2023 18.51, Christian König wrote:

Am 14.07.23 um 11:44 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
    include/drm/gpu_scheduler.h  | 5 +
    3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
       /*
     * Fence is from the same scheduler, only need to wait for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and reallocated
+ * as another scheduler. This false positive case is okay, as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


     */
    fence = dma_fence_get(&s_fence->scheduled);
    dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char
*drm_sched_fence_get_driver_name(struct dma_fence *fence)
    static const char *drm_sched_fence_get_timeline_name(struct
dma_fence *f)
    {
    struct drm_sched_fence *fence = to_drm_sched_fence(f);
-    return (const char *)fence->sched->name;
+    return (const char *)fence->sched_name;
    }
       static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence
*fence,
    unsigned seq;
       fence->sched = entity->rq->sched;
+    strlcpy(fence->sched_name, entity->rq->sched->name,
+    sizeof(fence->sched_name));
    seq = atomic_inc_return(&entity->fence_seq);
    dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
                   &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
     * @lock: the lock used by the scheduled and the finished
fences.
     */
    spinlock_t    lock;
+    /**
+ * @sched_name: the name of the scheduler that owns this
fence. We
+ * keep a copy here since fences can outlive their scheduler.
+ */
+    char sched_name[16];


This just mitigates the problem, but doesn't fix it.


Could you point out any remaining issues so we can fix them? Right now
this absolutely *is* broken and this fixes the breakage I observed. If
there are other bugs remaining, I'd like to know what they are so I
can fix them.


The real issue is that the hw fence is kept around much longer than
that.


As far as I know, the whole point of scheduler fences is to decouple
the hw fences from the consumers.


Well yes and no. The decoupling is for the signaling, it's not
decoupling the lifetime.


When I spoke with Daniel I understood the intent was also to decouple 
the lifetime.



I already talked with Daniel about this. The current behavior is
broken. These fences can live forever. It is broken to require that
they outlive the driver that produced them.


In addition to that, I'm not willing to increase the scheduler fence size
like that just to decouple them from the scheduler.


Did you read my explanation on the cover letter as to how this is just
broken right now? We need to fix this. If you have a better suggestion
I'll take it. Doing nothing is not an option.


Well this isn't broken at all. This works exactly as intended; you
just want to use it for something it wasn't made for.

That scheduler fences could be changed to outlive the scheduler which
issued them is possible, but this is certainly a new requirement.

Especially since we need to grab additional references to make sure that
the module isn't unloaded in such a case.


Yes, that's a remaining issue. The fences need to grab a module
reference to make sure drm_sched doesn't get unloaded while any of its
fences are still alive.
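
A rough sketch of that idea, purely illustrative (this is not part of the
posted patch, and the helper names below are made up): each fence would pin
the drm_sched module for as long as the fence object exists, and drop the
pin in its final release path:

	#include <linux/module.h>
	#include <drm/gpu_scheduler.h>

	/* Hypothetical helpers, not upstream code. */
	static int sched_fence_pin_module_sketch(struct drm_sched_fence *fence)
	{
		/* Keep drm_sched's code (including the fence ops) loaded for
		 * as long as this fence object can still be dereferenced. */
		if (!try_module_get(THIS_MODULE))
			return -ENODEV;
		return 0;
	}

	static void sched_fence_unpin_module_sketch(struct drm_sched_fence *fence)
	{
		/* Matching put, called from the final fence release (after
		 * the RCU grace period), when nothing can reach the ops. */
		module_put(THIS_MODULE);
	}

This only covers the scheduler module itself; as noted above, the hw fence
and its issuing driver raise the same lifetime question.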

Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Asahi Lina

On 14/07/2023 18.57, Christian König wrote:

Am 14.07.23 um 11:49 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
    drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
    drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
    include/drm/gpu_scheduler.h  | 5 +
    3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
       /*
     * Fence is from the same scheduler, only need to wait for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and reallocated
+ * as another scheduler. This false positive case is okay, as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That
means *this* entity certainly exists for *this* scheduler, but the
*dependency* fence might have come from a past scheduler that was
already destroyed along with all of its entities, and its address reused.


Well this function is not about fences; it is a callback
for the entity.


That deals with dependency fences, which could have come from any 
arbitrary source, including another entity and another scheduler.




Christian, I'm really getting tired of your tone. I don't appreciate
being told my comments are "outright nonsense" when you don't even
take the time to understand what the issue is and what I'm trying to
do/document. If you aren't interested in working with me, I'm just
going to give up on drm_sched, wait until Rust gets workqueue support,
and reimplement it in Rust. You can keep your broken fence lifetime
semantics and I'll do my own thing.


I'm certainly trying to help here, but you seem to have unrealistic
expectations.


I don't think expecting not to be told my changes are "outright 
nonsense" is an unrealistic expectation. If you think it is, maybe 
that's yet another indicator of the culture problems the kernel 
community has...



I perfectly understand what you are trying to do, but you don't seem to
understand that this functionality here isn't made for your use case.


I do, that's why I'm trying to change things. Right now, this 
functionality isn't even properly documented, which is why I thought it 
could be used for my use case, and slowly discovered otherwise. Daniel 
suggested documenting it, then fixing the mismatches between 
documentation and reality, which is what I'm doing here.



We can adjust the functionality to better match your requirements, but
you can't say it is broken because it doesn't work when you use it in a
way it isn't intended to be used.


I'm saying the idea that a random dma-buf holds onto a chain of 
references that prevents unloading a driver module that wrote into it 
(and keeps a bunch of random unrelated objects alive) is a broken state 
of affairs. It may or may not trickle down to actual problems for users 
(I would bet it does in some cases but I don't know for sure), but it's 
a situation that doesn't make any sense.


I know I'm triggering actual breakage with my new use case due to this, 
which is why I'm trying to improve things. But the current state of 
affairs just doesn't make any sense even if it isn't causing kernel 
oopses today with other drivers.



You can go ahead and try to re-implement the functionality in Rust, but
then I would reject that pointing out that this should probably be an
extension to the existing code.


You keep rejecting my attempts at extending the existing code...

~~ Lina



Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Christian König

Am 14.07.23 um 11:49 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
   include/drm/gpu_scheduler.h  | 5 +
   3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c

index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool 
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)

      /*
    * Fence is from the same scheduler, only need to wait for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and reallocated
+ * as another scheduler. This false positive case is okay, as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That 
means *this* entity certainly exists for *this* scheduler, but the 
*dependency* fence might have come from a past scheduler that was 
already destroyed along with all of its entities, and its address reused.


Well this function is not about fences; it is a callback
for the entity.




Christian, I'm really getting tired of your tone. I don't appreciate 
being told my comments are "outright nonsense" when you don't even 
take the time to understand what the issue is and what I'm trying to 
do/document. If you aren't interested in working with me, I'm just 
going to give up on drm_sched, wait until Rust gets workqueue support, 
and reimplement it in Rust. You can keep your broken fence lifetime 
semantics and I'll do my own thing.


I'm certainly trying to help here, but you seem to have unrealistic 
expectations.


I perfectly understand what you are trying to do, but you don't seem to 
understand that this functionality here isn't made for your use case.


We can adjust the functionality to better match your requirements, but
you can't say it is broken because it doesn't work when you use it in a
way it isn't intended to be used.


You can go ahead and try to re-implement the functionality in Rust, but 
then I would reject that pointing out that this should probably be an 
extension to the existing code.


Christian.



~~ Lina





Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Christian König

Am 14.07.23 um 11:44 schrieb Asahi Lina:

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independencly reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
   include/drm/gpu_scheduler.h  | 5 +
   3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c

index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool 
drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)

      /*
    * Fence is from the same scheduler, only need to wait for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and reallocated
+ * as another scheduler. This false positive case is okay, as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


    */
   fence = dma_fence_get(&s_fence->scheduled);
   dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c

index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char 
*drm_sched_fence_get_driver_name(struct dma_fence *fence)
   static const char *drm_sched_fence_get_timeline_name(struct 
dma_fence *f)

   {
   struct drm_sched_fence *fence = to_drm_sched_fence(f);
-    return (const char *)fence->sched->name;
+    return (const char *)fence->sched_name;
   }
      static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence 
*fence,

   unsigned seq;
      fence->sched = entity->rq->sched;
+    strlcpy(fence->sched_name, entity->rq->sched->name,
+    sizeof(fence->sched_name));
   seq = atomic_inc_return(&entity->fence_seq);
   dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
                  &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
    * @lock: the lock used by the scheduled and the finished 
fences.

    */
   spinlock_t    lock;
+    /**
+ * @sched_name: the name of the scheduler that owns this 
fence. We

+ * keep a copy here since fences can outlive their scheduler.
+ */
+    char sched_name[16];


This just mitigates the problem, but doesn't fix it.


Could you point out any remaining issues so we can fix them? Right now 
this absolutely *is* broken and this fixes the breakage I observed. If 
there are other bugs remaining, I'd like to know what they are so I 
can fix them.


The real issue is that the hw fence is kept around much longer than 
that.


As far as I know, the whole point of scheduler fences is to decouple 
the hw fences from the consumers.


Well yes and no. The decoupling is for the signaling, it's not 
decoupling the lifetime.


I already talked with Daniel about this. The current behavior is 
broken. These fences can live forever. It is broken to require that 
they outlive the driver that produced them.



In addition to that, I'm not willing to increase the scheduler fence size
like that just to decouple them from the scheduler.


Did you read my explanation on the cover letter as to how this is just 
broken right now? We need to fix this. If you have a better suggestion 
I'll take it. Doing nothing is not an option.


Well this isn't broken at all. This works exactly as intended; you
just want to use it for something it wasn't made for.


That scheduler fences could be changed to outlive the scheduler which 
issued them is possible, but this is certainly a new requirement.


Especially since we need to grab additional references to make sure that 
the module isn't unloaded in such a case.


Christian.



~~ Lina





Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Asahi Lina

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
   include/drm/gpu_scheduler.h  | 5 +
   3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct 
drm_sched_entity *entity)
   
   		/*

 * Fence is from the same scheduler, only need to wait for
-* it to be scheduled
+* it to be scheduled.
+*
+* Note: s_fence->sched could have been freed and reallocated
+* as another scheduler. This false positive case is okay, as if
+* the old scheduler was freed all of its jobs must have
+* signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


As I already explained, the fences can outlive their scheduler. That 
means *this* entity certainly exists for *this* scheduler, but the 
*dependency* fence might have come from a past scheduler that was 
already destroyed along with all of its entities, and its address reused.


Christian, I'm really getting tired of your tone. I don't appreciate 
being told my comments are "outright nonsense" when you don't even take 
the time to understand what the issue is and what I'm trying to 
do/document. If you aren't interested in working with me, I'm just going 
to give up on drm_sched, wait until Rust gets workqueue support, and 
reimplement it in Rust. You can keep your broken fence lifetime 
semantics and I'll do my own thing.


~~ Lina



Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Asahi Lina

On 14/07/2023 17.43, Christian König wrote:

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
   drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
   drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
   include/drm/gpu_scheduler.h  | 5 +
   3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct 
drm_sched_entity *entity)
   
   		/*

 * Fence is from the same scheduler, only need to wait for
-* it to be scheduled
+* it to be scheduled.
+*
+* Note: s_fence->sched could have been freed and reallocated
+* as another scheduler. This false positive case is okay, as if
+* the old scheduler was freed all of its jobs must have
+* signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.

So this function can't be called like this.


 */
	fence = dma_fence_get(&s_fence->scheduled);
dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct 
dma_fence *fence)
   static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
   {
struct drm_sched_fence *fence = to_drm_sched_fence(f);
-   return (const char *)fence->sched->name;
+   return (const char *)fence->sched_name;
   }
   
   static void drm_sched_fence_free_rcu(struct rcu_head *rcu)

@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
unsigned seq;
   
   	fence->sched = entity->rq->sched;

+   strlcpy(fence->sched_name, entity->rq->sched->name,
+   sizeof(fence->sched_name));
	seq = atomic_inc_return(&entity->fence_seq);
	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
		       &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
* @lock: the lock used by the scheduled and the finished fences.
*/
spinlock_t  lock;
+/**
+ * @sched_name: the name of the scheduler that owns this fence. We
+* keep a copy here since fences can outlive their scheduler.
+ */
+   char sched_name[16];


This just mitigates the problem, but doesn't fix it.


Could you point out any remaining issues so we can fix them? Right now 
this absolutely *is* broken and this fixes the breakage I observed. If 
there are other bugs remaining, I'd like to know what they are so I can 
fix them.



The real issue is that the hw fence is kept around much longer than that.


As far as I know, the whole point of scheduler fences is to decouple the 
hw fences from the consumers. I already talked with Daniel about this. 
The current behavior is broken. These fences can live forever. It is 
broken to require that they outlive the driver that produced them.



In addition to that, I'm not willing to increase the scheduler fence size
like that just to decouple them from the scheduler.


Did you read my explanation on the cover letter as to how this is just 
broken right now? We need to fix this. If you have a better suggestion 
I'll take it. Doing nothing is not an option.


~~ Lina



Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Christian König

Am 14.07.23 um 10:21 schrieb Asahi Lina:

A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
  drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
  drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
  include/drm/gpu_scheduler.h  | 5 +
  3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct 
drm_sched_entity *entity)
  
  		/*

 * Fence is from the same scheduler, only need to wait for
-* it to be scheduled
+* it to be scheduled.
+*
+* Note: s_fence->sched could have been freed and reallocated
+* as another scheduler. This false positive case is okay, as if
+* the old scheduler was freed all of its jobs must have
+* signaled their completion fences.


This is outright nonsense. As long as an entity for a scheduler exists 
it is not allowed to free up this scheduler.


So this function can't be called like this.


 */
	fence = dma_fence_get(&s_fence->scheduled);
dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct 
dma_fence *fence)
  static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
  {
struct drm_sched_fence *fence = to_drm_sched_fence(f);
-   return (const char *)fence->sched->name;
+   return (const char *)fence->sched_name;
  }
  
  static void drm_sched_fence_free_rcu(struct rcu_head *rcu)

@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
unsigned seq;
  
  	fence->sched = entity->rq->sched;

+   strlcpy(fence->sched_name, entity->rq->sched->name,
+   sizeof(fence->sched_name));
	seq = atomic_inc_return(&entity->fence_seq);
	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
		       &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
   * @lock: the lock used by the scheduled and the finished fences.
   */
spinlock_t  lock;
+/**
+ * @sched_name: the name of the scheduler that owns this fence. We
+* keep a copy here since fences can outlive their scheduler.
+ */
+   char sched_name[16];


This just mitigates the problem, but doesn't fix it.

The real issue is that the hw fence is kept around much longer than that.

In addition to that, I'm not willing to increase the scheduler fence size
like that just to decouple them from the scheduler.


Regards,
Christian.


  /**
   * @owner: job owner for debugging
   */





[PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

2023-07-14 Thread Asahi Lina
A signaled scheduler fence can outlive its scheduler, since fences are
independently reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.

Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.

Signed-off-by: Asahi Lina 
---
 drivers/gpu/drm/scheduler/sched_entity.c | 7 ++-
 drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
 include/drm/gpu_scheduler.h  | 5 +
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct 
drm_sched_entity *entity)
 
/*
 * Fence is from the same scheduler, only need to wait for
-* it to be scheduled
+* it to be scheduled.
+*
+* Note: s_fence->sched could have been freed and reallocated
+* as another scheduler. This false positive case is okay, as if
+* the old scheduler was freed all of its jobs must have
+* signaled their completion fences.
 */
	fence = dma_fence_get(&s_fence->scheduled);
dma_fence_put(entity->dependency);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06a0eebcca10 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct 
dma_fence *fence)
 static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
 {
struct drm_sched_fence *fence = to_drm_sched_fence(f);
-   return (const char *)fence->sched->name;
+   return (const char *)fence->sched_name;
 }
 
 static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
@@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
unsigned seq;
 
fence->sched = entity->rq->sched;
+   strlcpy(fence->sched_name, entity->rq->sched->name,
+   sizeof(fence->sched_name));
	seq = atomic_inc_return(&entity->fence_seq);
	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
		       &fence->lock, entity->fence_context, seq);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..4fa9523bd47d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -305,6 +305,11 @@ struct drm_sched_fence {
  * @lock: the lock used by the scheduled and the finished fences.
  */
spinlock_t  lock;
+	/**
+	 * @sched_name: the name of the scheduler that owns this fence. We
+	 * keep a copy here since fences can outlive their scheduler.
+	 */
+	char sched_name[16];
 /**
  * @owner: job owner for debugging
  */

-- 
2.40.1
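
For reference, the oops this patch addresses is reached roughly along the
following path; this is a simplified sketch of what the commit message
describes, not the literal debugfs code:

	#include <linux/dma-fence.h>
	#include <linux/printk.h>

	/* Sketch: the bufinfo debugfs dump walks the fences attached to each
	 * exported dma-buf and prints their driver and timeline names. */
	static void bufinfo_print_fence_sketch(struct dma_fence *fence)
	{
		/*
		 * For a drm_sched fence, get_timeline_name() used to return
		 * fence->sched->name. If the scheduler was torn down after
		 * the fence signaled, the fence can still be reachable
		 * through a shared dma-buf, and that dereference hits freed
		 * memory; keeping a copy of the name in the fence itself
		 * avoids the use-after-free.
		 */
		pr_info("\t%s %s\n",
			fence->ops->get_driver_name(fence),
			fence->ops->get_timeline_name(fence));
	}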