Re: [git pull] drm for 6.1-rc1

2022-10-07 Thread Daniel Vetter
Forgot to add Andrey as scheduler maintainer.
-Daniel

On Fri, 7 Oct 2022 at 10:16, Daniel Vetter  wrote:
>
> On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
>  wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 
> > > 0088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> > struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL. The code is
> >
> >    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> >    5: 41 54                 push   %r12
> >    7: 55                    push   %rbp
> >    8: 53                    push   %rbx
> >    9: 48 89 fb              mov    %rdi,%rbx
> >    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
> >   13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
> >   1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> > atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> > if (kref_read(&job->s_fence->finished.refcount)) {
> > /* drm_sched_job_arm() has been called */
> > dma_fence_put(&job->s_fence->finished);
> > ...
> >
> > but then it does
> >
> > job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
>
> Just figured I'll clarify this, because it's indeed a bit wtf and the
> comment doesn't explain much. drm_sched_job_cleanup can be called both
> when a real job is being cleaned up (which holds a full reference on
> job->s_fence and needs to drop it) and to simplify error paths in job
> construction (the "is this refcount initialized already" check signals
> what exactly needs to be cleaned up or not). So no race, because the
> only time this check goes the other way is when job construction has
> failed before the job struct is visible to any other thread.
>
> But yeah the comment could actually explain what's going on here :-)
>
> And yeah the patch Dave reverted screws up the cascade of references
> that ensures this all stays alive until drm_sched_job_cleanup is
> called on active jobs, so the revert looks reasonable to me. Maybe
> some KUnit tests to exercise these corners? Not the first time pure
> scheduler code blew up, so probably worth the effort.
> -Daniel
>
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
> >
> >  Linus
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [git pull] drm for 6.1-rc1

2022-10-07 Thread Daniel Vetter
On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
 wrote:
>
> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
> >
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 
> > 0088
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>
> As far as I can tell, that's the line
>
> struct drm_gpu_scheduler *sched = s_fence->sched;
>
> where 's_fence' is NULL. The code is
>
>    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
>    5: 41 54                 push   %r12
>    7: 55                    push   %rbp
>    8: 53                    push   %rbx
>    9: 48 89 fb              mov    %rdi,%rbx
>    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
>   13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
>   1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax
>
> and that next 'lock decl' instruction would have been the
>
> atomic_dec(&sched->hw_rq_count);
>
> at the top of drm_sched_job_done().
>
> Now, as to *why* you'd have a NULL s_fence, it would seem that
> drm_sched_job_cleanup() was called with an active job. Looking at that
> code, it does
>
> if (kref_read(&job->s_fence->finished.refcount)) {
> /* drm_sched_job_arm() has been called */
> dma_fence_put(&job->s_fence->finished);
> ...
>
> but then it does
>
> job->s_fence = NULL;
>
> anyway, despite the job still being active. The logic of that kind of
> "fake refcount" escapes me. The above looks fundamentally racy, not to
> say pointless and wrong (a refcount is a _count_, not a flag, so there
> could be multiple references to it, what says that you can just
> decrement one of them and say "I'm done").

Just figured I'll clarify this, because it's indeed a bit wtf and the
comment doesn't explain much. drm_sched_job_cleanup can be called both
when a real job is being cleaned up (which holds a full reference on
job->s_fence and needs to drop it) and to simplify error paths in job
construction (the "is this refcount initialized already" check signals
what exactly needs to be cleaned up or not). So no race, because the
only time this check goes the other way is when job construction has
failed before the job struct is visible to any other thread.
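
For illustration, roughly the shape of the two paths (this paraphrases
the upstream drm_sched_job_cleanup(), so details may differ from the
exact source):

void drm_sched_job_cleanup(struct drm_sched_job *job)
{
	if (kref_read(&job->s_fence->finished.refcount)) {
		/* drm_sched_job_arm() ran: the job holds a full reference
		 * on the finished fence, so drop it here. */
		dma_fence_put(&job->s_fence->finished);
	} else {
		/* construction failed before arming: the fence was never
		 * published anywhere, so free it directly. */
		drm_sched_fence_free(job->s_fence);
	}
	job->s_fence = NULL;
}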

But yeah the comment could actually explain what's going on here :-)

And yeah the patch Dave reverted screws up the cascade of references
that ensures this all stays alive until drm_sched_job_cleanup is
called on active jobs, so the revert looks reasonable to me. Maybe
some KUnit tests to exercise these corners? Not the first time pure
scheduler code blew up, so probably worth the effort.
-Daniel
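
For reference, a minimal KUnit skeleton for such a test could look
roughly like this (the suite/case names and the test body are
placeholders, not existing code; a real test would need a dummy
scheduler/entity set up first):

#include <kunit/test.h>

static void drm_sched_job_cleanup_unarmed_test(struct kunit *test)
{
	/* placeholder: construct a job, skip drm_sched_job_arm(), then
	 * call drm_sched_job_cleanup() and check nothing leaks or oopses */
	KUNIT_EXPECT_TRUE(test, true);
}

static struct kunit_case drm_sched_test_cases[] = {
	KUNIT_CASE(drm_sched_job_cleanup_unarmed_test),
	{}
};

static struct kunit_suite drm_sched_test_suite = {
	.name = "drm_sched",
	.test_cases = drm_sched_test_cases,
};
kunit_test_suite(drm_sched_test_suite);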

>
> Now, _why_ any of that happens, I have no idea. I'm just looking at
> the immediate "that pointer is NULL" thing, and reacting to what looks
> like a completely bogus refcount pattern.
>
> But that odd refcount pattern isn't new, so it's presumably some user
> on the amd gpu side that changed.
>
> The problem hasn't happened again for me, but that's not saying a lot,
> since it was very random to begin with.
>
>  Linus



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [git pull] drm for 6.1-rc1

2022-10-07 Thread Christian König

On 07.10.22 at 04:45, Dave Airlie wrote:

On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
 wrote:

On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:


[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]

As far as I can tell, that's the line

 struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

   0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
   5: 41 54                 push   %r12
   7: 55                    push   %rbp
   8: 53                    push   %rbx
   9: 48 89 fb              mov    %rdi,%rbx
   c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
  13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
  1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

 atomic_dec(&sched->hw_rq_count);

at the top of drm_sched_job_done().

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

 if (kref_read(&job->s_fence->finished.refcount)) {
 /* drm_sched_job_arm() has been called */
 dma_fence_put(&job->s_fence->finished);
 ...

but then it does

 job->s_fence = NULL;

anyway, despite the job still being active. The logic of that kind of
"fake refcount" escapes me. The above looks fundamentally racy, not to
say pointless and wrong (a refcount is a _count_, not a flag, so there
could be multiple references to it, what says that you can just
decrement one of them and say "I'm done").

Now, _why_ any of that happens, I have no idea. I'm just looking at
the immediate "that pointer is NULL" thing, and reacting to what looks
like a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user
on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.

I chased down the culprit to a drm sched patch, I'll send you a pull
with a revert in it.

commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Author: Arvind Yadav 
Date:   Wed Sep 14 22:13:20 2022 +0530

 drm/sched: Use parent fence instead of finished

 Using the parent fence instead of the finished fence
 to get the job status. This change is to avoid GPU
 scheduler timeout error which can cause GPU reset.

 Signed-off-by: Arvind Yadav 
 Reviewed-by: Andrey Grodzovsky 
 Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
 Signed-off-by: Christian König 

I'll let Arvind and Christian maybe work out what is going wrong there.


That's a known issue that Arvind has already been investigating for a while.

Any idea how you triggered it on boot? We have only been able to
trigger it very sporadically.


Reverting the patch for now sounds like a good idea to me; it's only a
cleanup anyway.


Thanks,
Christian.



Dave.


  Linus




Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 12:54, Dave Airlie  wrote:
>
> On Fri, 7 Oct 2022 at 12:45, Dave Airlie  wrote:
> >
> > On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
> >  wrote:
> > >
> > > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
> > > >
> > > >
> > > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 
> > > > 0088
> > > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> > >
> > > As far as I can tell, that's the line
> > >
> > > struct drm_gpu_scheduler *sched = s_fence->sched;
> > >
> > > where 's_fence' is NULL. The code is
> > >
> > >    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> > >    5: 41 54                 push   %r12
> > >    7: 55                    push   %rbp
> > >    8: 53                    push   %rbx
> > >    9: 48 89 fb              mov    %rdi,%rbx
> > >    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
> > >   13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
> > >   1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax
> > >
> > > and that next 'lock decl' instruction would have been the
> > >
> > > atomic_dec(&sched->hw_rq_count);
> > >
> > > at the top of drm_sched_job_done().
> > >
> > > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > > drm_sched_job_cleanup() was called with an active job. Looking at that
> > > code, it does
> > >
> > > if (kref_read(&job->s_fence->finished.refcount)) {
> > > /* drm_sched_job_arm() has been called */
> > > dma_fence_put(&job->s_fence->finished);
> > > ...
> > >
> > > but then it does
> > >
> > > job->s_fence = NULL;
> > >
> > > anyway, despite the job still being active. The logic of that kind of
> > > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > > could be multiple references to it, what says that you can just
> > > decrement one of them and say "I'm done").
> > >
> > > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > > the immediate "that pointer is NULL" thing, and reacting to what looks
> > > like a completely bogus refcount pattern.
> > >
> > > But that odd refcount pattern isn't new, so it's presumably some user
> > > on the amd gpu side that changed.
> > >
> > > The problem hasn't happened again for me, but that's not saying a lot,
> > > since it was very random to begin with.
> >
> > I chased down the culprit to a drm sched patch, I'll send you a pull
> > with a revert in it.
> >
> > commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> > Author: Arvind Yadav 
> > Date:   Wed Sep 14 22:13:20 2022 +0530
> >
> > drm/sched: Use parent fence instead of finished
> >
> > Using the parent fence instead of the finished fence
> > to get the job status. This change is to avoid GPU
> > scheduler timeout error which can cause GPU reset.
> >
> > Signed-off-by: Arvind Yadav 
> > Reviewed-by: Andrey Grodzovsky 
> > Link: 
> > https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-arvind.ya...@amd.com
> > Signed-off-by: Christian König 
> >
> > I'll let Arvind and Christian maybe work out what is going wrong there.
>
> I do spy two changes queued for -next that might be relevant, so I
> might try just pulling those instead.
>
> I'll send a PR in the next hour once I test it.

Okay sent, let me know if you see any further problems.

Dave.


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 12:45, Dave Airlie  wrote:
>
> On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
>  wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 
> > > 0088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> > struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL. The code is
> >
> >    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
> >    5: 41 54                 push   %r12
> >    7: 55                    push   %rbp
> >    8: 53                    push   %rbx
> >    9: 48 89 fb              mov    %rdi,%rbx
> >    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
> >   13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
> >   1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> > atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> > if (kref_read(&job->s_fence->finished.refcount)) {
> > /* drm_sched_job_arm() has been called */
> > dma_fence_put(&job->s_fence->finished);
> > ...
> >
> > but then it does
> >
> > job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
>
> I chased down the culprit to a drm sched patch, I'll send you a pull
> with a revert in it.
>
> commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
> Author: Arvind Yadav 
> Date:   Wed Sep 14 22:13:20 2022 +0530
>
> drm/sched: Use parent fence instead of finished
>
> Using the parent fence instead of the finished fence
> to get the job status. This change is to avoid GPU
> scheduler timeout error which can cause GPU reset.
>
> Signed-off-by: Arvind Yadav 
> Reviewed-by: Andrey Grodzovsky 
> Link: 
> https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-arvind.ya...@amd.com
> Signed-off-by: Christian König 
>
> I'll let Arvind and Christian maybe work out what is going wrong there.

I do spy two changes queued for -next that might be relevant, so I
might try just pulling those instead.

I'll send a PR in the next hour once I test it.

Dave.


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
 wrote:
>
> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
> >
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 
> > 0088
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>
> As far as I can tell, that's the line
>
> struct drm_gpu_scheduler *sched = s_fence->sched;
>
> where 's_fence' is NULL. The code is
>
>    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
>    5: 41 54                 push   %r12
>    7: 55                    push   %rbp
>    8: 53                    push   %rbx
>    9: 48 89 fb              mov    %rdi,%rbx
>    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
>   13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
>   1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax
>
> and that next 'lock decl' instruction would have been the
>
> atomic_dec(&sched->hw_rq_count);
>
> at the top of drm_sched_job_done().
>
> Now, as to *why* you'd have a NULL s_fence, it would seem that
> drm_sched_job_cleanup() was called with an active job. Looking at that
> code, it does
>
> if (kref_read(&job->s_fence->finished.refcount)) {
> /* drm_sched_job_arm() has been called */
> dma_fence_put(&job->s_fence->finished);
> ...
>
> but then it does
>
> job->s_fence = NULL;
>
> anyway, despite the job still being active. The logic of that kind of
> "fake refcount" escapes me. The above looks fundamentally racy, not to
> say pointless and wrong (a refcount is a _count_, not a flag, so there
> could be multiple references to it, what says that you can just
> decrement one of them and say "I'm done").
>
> Now, _why_ any of that happens, I have no idea. I'm just looking at
> the immediate "that pointer is NULL" thing, and reacting to what looks
> like a completely bogus refcount pattern.
>
> But that odd refcount pattern isn't new, so it's presumably some user
> on the amd gpu side that changed.
>
> The problem hasn't happened again for me, but that's not saying a lot,
> since it was very random to begin with.

I chased down the culprit to a drm sched patch, I'll send you a pull
with a revert in it.

commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Author: Arvind Yadav 
Date:   Wed Sep 14 22:13:20 2022 +0530

drm/sched: Use parent fence instead of finished

Using the parent fence instead of the finished fence
to get the job status. This change is to avoid GPU
scheduler timeout error which can cause GPU reset.

Signed-off-by: Arvind Yadav 
Reviewed-by: Andrey Grodzovsky 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-arvind.ya...@amd.com
Signed-off-by: Christian König 

I'll let Arvind and Christian maybe work out what is going wrong there.

Dave.

>
>  Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie  wrote:
>
>
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]

As far as I can tell, that's the line

struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

   0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
   5: 41 54                 push   %r12
   7: 55                    push   %rbp
   8: 53                    push   %rbx
   9: 48 89 fb              mov    %rdi,%rbx
   c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp  <-- trapping instruction
  13: f0 ff 8d f0 00 00 00  lock decl 0xf0(%rbp)
  1a: 48 8b 85 80 01 00 00  mov    0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

atomic_dec(&sched->hw_rq_count);

at the top of drm_sched_job_done().
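
(Aside: this sort of disassembly can be reproduced by feeding the oops
"Code:" line to the kernel's scripts/decodecode helper, e.g.

    echo 'Code: 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 ...' | ./scripts/decodecode

where the byte in <...> marks the trapping instruction.)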

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

if (kref_read(&job->s_fence->finished.refcount)) {
/* drm_sched_job_arm() has been called */
dma_fence_put(&job->s_fence->finished);
...

but then it does

job->s_fence = NULL;

anyway, despite the job still being active. The logic of that kind of
"fake refcount" escapes me. The above looks fundamentally racy, not to
say pointless and wrong (a refcount is a _count_, not a flag, so there
could be multiple references to it, what says that you can just
decrement one of them and say "I'm done").
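
In generic, made-up kernel C (none of this is drm_sched code, the names
are invented purely to illustrate the shape of the complaint), the
pattern being objected to versus the more conventional one is roughly:

#include <linux/kref.h>
#include <linux/slab.h>

struct obj {
	struct kref ref;
	bool armed;		/* hypothetical explicit state bit */
};

static void obj_release(struct kref *kref)
{
	kfree(container_of(kref, struct obj, ref));
}

/* refcount-as-flag: "is the refcount non-zero?" is also true whenever
 * anyone else still holds a reference, so it answers the wrong question */
static void obj_cleanup_refcount_as_flag(struct obj *o)
{
	if (kref_read(&o->ref))
		kref_put(&o->ref, obj_release);
}

/* explicit state: says what is actually meant */
static void obj_cleanup_explicit(struct obj *o)
{
	if (o->armed)
		kref_put(&o->ref, obj_release);
}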

Now, _why_ any of that happens, I have no idea. I'm just looking at
the immediate "that pointer is NULL" thing, and reacting to what looks
like a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user
on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.

 Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 07:41, Dave Airlie  wrote:
>
> On Fri, 7 Oct 2022 at 06:24, Dave Airlie  wrote:
> >
> > On Fri, 7 Oct 2022 at 06:14, Alex Deucher  wrote:
> > >
> > > On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds
> > >  wrote:
> > > >
> > > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher  
> > > > wrote:
> > > > >
> > > > > Maybe you are seeing this issue with GPU TLB flushes, which is
> > > > > kind of sporadic:
> > > > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113
> > > >
> > > > Well, that seems to be 5.19, and while timing changes (or whatever
> > > > other software updates) could have made it start triggering, this
> > > > machine has been pretty solid otherwise.
> > > >
> > > > > Are you seeing any GPU page faults in your kernel log?
> > > >
> > > > Nothing even remotely like that "no-retry page fault" in that issue
> > > > report. Of course, if it happens just before the whole thing locks
> > > > up...
> > >
> > > Your chip is too old to support retry faults, so you are likely just
> > > seeing a GPU page fault followed by a hang. Your chip also lacks a
> > > paging queue, so you would be affected by the TLB issue.
> >
> >
> > Okay, I got my FIJI running Linus' tree with netconsole to blow up
> > like this: running the Fedora 36 desktop, Steam and Firefox, and then
> > I ran poweroff over ssh.
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 
> > 0088
> > [ 1234.778782] #PF: supervisor read access in kernel mode
> > [ 1234.778787] #PF: error_code(0x) - not-present page
> > [ 1234.778791] PGD 0 P4D 0
> > [ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
> > [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
> > [ 1234.778809] Hardware name: System manufacturer System Product
> > Name/PRIME X370-PRO, BIOS 5603 07/28/2020
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> > [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
> > ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
> > 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
> > 00 f0
> > [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
> > [ 1234.778839] RAX: c04e9230 RBX:  RCX: 
> > 0018
> > [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
> > 
> > [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 
> > 953fbd98b808
> > [ 1234.778905] R10:  R11: abe680380ff8 R12: 
> > abe680380e00
> > [ 1234.778908] R13: 0001 R14:  R15: 
> > 953fbd9ec458
> > [ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
> > knlGS:
> > [ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 
> > 003506e0
> > [ 1234.778924] Call Trace:
> > [ 1234.778981]  <IRQ>
> > [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
> > [ 1234.778999]  dma_fence_signal+0x2c/0x50
> > [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
> > [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
> > [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
> > [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
> > [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
> > [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
> > [ 1234.779946]  handle_irq_event+0x34/0x70
> > [ 1234.779949]  handle_edge_irq+0x9f/0x240
> > [ 1234.779954]  __common_interrupt+0x66/0x100
> > [ 1234.779960]  common_interrupt+0xa0/0xc0
> > [ 1234.779965]  </IRQ>
> > [ 1234.779968]  <TASK>
> > [ 1234.779971]  asm_common_interrupt+0x22/0x40
> > [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
> > [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
> > 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
> > 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
> > 83 ea
> > [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202
> >
> > I'll see if I can dig any further.
>
> I'm kicking the tires on the drm-next tree just prior to submission,
> and in an attempt to make myself look foolish and to tempt fate, it
> seems stable.

Yay, it worked, I crashed drm-next. Will start reverting down the rabbit hole.

Dave.


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 06:24, Dave Airlie  wrote:
>
> On Fri, 7 Oct 2022 at 06:14, Alex Deucher  wrote:
> >
> > On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds
> >  wrote:
> > >
> > > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher  
> > > wrote:
> > > >
> > > > Maybe you are seeing this issue with GPU TLB flushes, which is
> > > > kind of sporadic:
> > > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113
> > >
> > > Well, that seems to be 5.19, and while timing changes (or whatever
> > > other software updates) could have made it start triggering, this
> > > machine has been pretty solid otherwise.
> > >
> > > > Are you seeing any GPU page faults in your kernel log?
> > >
> > > Nothing even remotely like that "no-retry page fault" in that issue
> > > report. Of course, if it happens just before the whole thing locks
> > > up...
> >
> > Your chip is too old to support retry faults, so you are likely just
> > seeing a GPU page fault followed by a hang. Your chip also lacks a
> > paging queue, so you would be affected by the TLB issue.
>
>
> Okay, I got my FIJI running Linus' tree with netconsole to blow up
> like this: running the Fedora 36 desktop, Steam and Firefox, and then
> I ran poweroff over ssh.
>
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
> [ 1234.778782] #PF: supervisor read access in kernel mode
> [ 1234.778787] #PF: error_code(0x) - not-present page
> [ 1234.778791] PGD 0 P4D 0
> [ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
> [ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
> [ 1234.778809] Hardware name: System manufacturer System Product
> Name/PRIME X370-PRO, BIOS 5603 07/28/2020
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> [ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
> ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
> 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
> 00 f0
> [ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
> [ 1234.778839] RAX: c04e9230 RBX:  RCX: 
> 0018
> [ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
> 
> [ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 
> 953fbd98b808
> [ 1234.778905] R10:  R11: abe680380ff8 R12: 
> abe680380e00
> [ 1234.778908] R13: 0001 R14:  R15: 
> 953fbd9ec458
> [ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
> knlGS:
> [ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 
> 003506e0
> [ 1234.778924] Call Trace:
> [ 1234.778981]  <IRQ>
> [ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
> [ 1234.778999]  dma_fence_signal+0x2c/0x50
> [ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
> [ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
> [ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
> [ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
> [ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
> [ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
> [ 1234.779946]  handle_irq_event+0x34/0x70
> [ 1234.779949]  handle_edge_irq+0x9f/0x240
> [ 1234.779954]  __common_interrupt+0x66/0x100
> [ 1234.779960]  common_interrupt+0xa0/0xc0
> [ 1234.779965]  </IRQ>
> [ 1234.779968]  <TASK>
> [ 1234.779971]  asm_common_interrupt+0x22/0x40
> [ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
> [ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
> 54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
> 48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
> 83 ea
> [ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202
>
> I'll see if I can dig any further.

I'm kicking the tires on the drm-next tree just prior to submission,
and in an attempt to make myself look foolish and to tempt fate, it
seems stable.

This might mean a silent merge conflict/regression, I'll bash on the
drm-next tree a lot more and see if I can play spot the difference.

Dave.


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 06:14, Alex Deucher  wrote:
>
> On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds
>  wrote:
> >
> > On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher  wrote:
> > >
> > > Maybe you are seeing this issue with GPU TLB flushes, which is
> > > kind of sporadic:
> > > https://gitlab.freedesktop.org/drm/amd/-/issues/2113
> >
> > Well, that seems to be 5.19, and while timing changes (or whatever
> > other software updates) could have made it start triggering, this
> > machine has been pretty solid otherwise.
> >
> > > Are you seeing any GPU page faults in your kernel log?
> >
> > Nothing even remotely like that "no-retry page fault" in that issue
> > report. Of course, if it happens just before the whole thing locks
> > up...
>
> Your chip is too old to support retry faults, so you are likely just
> seeing a GPU page fault followed by a hang. Your chip also lacks a
> paging queue, so you would be affected by the TLB issue.


Okay, I got my FIJI running Linus' tree with netconsole to blow up
like this: running the Fedora 36 desktop, Steam and Firefox, and then
I ran poweroff over ssh.

[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0088
[ 1234.778782] #PF: supervisor read access in kernel mode
[ 1234.778787] #PF: error_code(0x) - not-present page
[ 1234.778791] PGD 0 P4D 0
[ 1234.778798] Oops:  [#1] PREEMPT SMP NOPTI
[ 1234.778803] CPU: 7 PID: 805 Comm: systemd-journal Not tainted 6.0.0+ #2
[ 1234.778809] Hardware name: System manufacturer System Product
Name/PRIME X370-PRO, BIOS 5603 07/28/2020
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
[ 1234.778828] Code: aa 0f 1d ce e9 57 ff ff ff 48 89 d7 e8 9d 8f 3f
ce e9 4a ff ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d f0 00 00 00 48 8b 85 80 01 00
00 f0
[ 1234.778834] RSP: :abe680380de0 EFLAGS: 00010087
[ 1234.778839] RAX: c04e9230 RBX:  RCX: 0018
[ 1234.778897] RDX: 0ba278e8977a RSI: 953fb288b460 RDI: 
[ 1234.778901] RBP: 953fb288b598 R08: 00e0 R09: 953fbd98b808
[ 1234.778905] R10:  R11: abe680380ff8 R12: abe680380e00
[ 1234.778908] R13: 0001 R14:  R15: 953fbd9ec458
[ 1234.778912] FS:  7f35e7008580() GS:95428ebc()
knlGS:
[ 1234.778916] CS:  0010 DS:  ES:  CR0: 80050033
[ 1234.778919] CR2: 0088 CR3: 00010147c000 CR4: 003506e0
[ 1234.778924] Call Trace:
[ 1234.778981]  <IRQ>
[ 1234.778989]  dma_fence_signal_timestamp_locked+0x6a/0xe0
[ 1234.778999]  dma_fence_signal+0x2c/0x50
[ 1234.779005]  amdgpu_fence_process+0xc8/0x140 [amdgpu]
[ 1234.779234]  sdma_v3_0_process_trap_irq+0x70/0x80 [amdgpu]
[ 1234.779395]  amdgpu_irq_dispatch+0xa9/0x1d0 [amdgpu]
[ 1234.779609]  amdgpu_ih_process+0x80/0x100 [amdgpu]
[ 1234.779783]  amdgpu_irq_handler+0x1f/0x60 [amdgpu]
[ 1234.779940]  __handle_irq_event_percpu+0x46/0x190
[ 1234.779946]  handle_irq_event+0x34/0x70
[ 1234.779949]  handle_edge_irq+0x9f/0x240
[ 1234.779954]  __common_interrupt+0x66/0x100
[ 1234.779960]  common_interrupt+0xa0/0xc0
[ 1234.779965]  </IRQ>
[ 1234.779968]  <TASK>
[ 1234.779971]  asm_common_interrupt+0x22/0x40
[ 1234.779976] RIP: 0010:finish_mkwrite_fault+0x22/0x110
[ 1234.779981] Code: 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41 55 41
54 55 48 89 fd 53 48 8b 07 f6 40 50 08 0f 84 eb 00 00 00 48 8b 45 30
48 8b 18 <48> 89 df e8 66 bd ff ff 48 85 c0 74 0d 48 89 c2 83 e2 01 48
83 ea
[ 1234.779985] RSP: :abe680bcfd78 EFLAGS: 0202

I'll see if I can dig any further.
Dave.


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Alex Deucher
On Thu, Oct 6, 2022 at 3:48 PM Linus Torvalds
 wrote:
>
> On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher  wrote:
> >
> > Maybe you are seeing this issue with GPU TLB flushes, which is
> > kind of sporadic:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/2113
>
> Well, that seems to be 5.19, and while timing changes (or whatever
> other software updates) could have made it start triggering, this
> machine has been pretty solid otherwise.
>
> > Are you seeing any GPU page faults in your kernel log?
>
> Nothing even remotely like that "no-retry page fault" in that issue
> report. Of course, if it happens just before the whole thing locks
> up...

Your chip is too old to support retry faults, so you are likely just
seeing a GPU page fault followed by a hang. Your chip also lacks a
paging queue, so you would be affected by the TLB issue.

Alex


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 12:28 PM Alex Deucher  wrote:
>
> Maybe you are seeing this issue with GPU TLB flushes, which is
> kind of sporadic:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2113

Well, that seems to be 5.19, and while timing changes (or whatever
other software updates) could have made it start triggering, this
machine has been pretty solid otherwise.

> Are you seeing any GPU page faults in your kernel log?

Nothing even remotely like that "no-retry page fault" in that issue
report. Of course, if it happens just before the whole thing locks
up...

   Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Thu, Oct 6, 2022 at 12:30 PM Dave Airlie  wrote:
>
> netconsole?

I've never been really successful with that in the past, and haven't
used it for decades. I guess I could try if nothing else works.

   Linus


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Dave Airlie
On Fri, 7 Oct 2022 at 04:48, Linus Torvalds
 wrote:
>
> On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
> >
> > Lots of stuff all over, some new AMD IP support and gang
> > submit support [..]
>
> Hmm.
>
> I have now had my main desktop lock up twice after pulling this.
> Nothing in the dmesg after a reboot, and nothing in particular that
> seems to trigger it, so I have a hard time even guessing what's up,
> but the drm changes are the primary suspect.
>
> I will try to see if I can get any information out of the machine, but
> with the symptom being just a dead machine ...
>
> This is the same (old) Radeon device:
>
>49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
>
> with dual 4k monitors, running on my good old Threadripper setup.
>
> Again, there's no explicit reason to blame the drm pull, except that
> it started after that merge (that machine ran the kernel with the
> networking pull for a day with no problems, and while there were other
> pull requests in between them, they seem to be fairly unrelated to the
> hardware I have).
>
> But the lockup is so sporadic (twice in the last day) that I really
> can't bisect it, so I'm afraid I have very very little info.
>
> Any suggestions?

netconsole?

I'll plug in my 480 and see if I can make it die.

Dave.
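
For anyone following along: netconsole streams printk output over UDP
to another machine, so a crash that never makes it to disk can still be
captured. A typical setup looks something like this (the addresses,
interface name and MAC here are placeholders):

    # on the crashing box, on the kernel command line
    # (or via: modprobe netconsole netconsole=...):
    netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/00:11:22:33:44:55

    # on the receiving box (10.0.0.1):
    nc -u -l 6666        # some netcat flavours want: nc -u -l -p 6666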


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Alex Deucher
On Thu, Oct 6, 2022 at 2:48 PM Linus Torvalds
 wrote:
>
> On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
> >
> > Lots of stuff all over, some new AMD IP support and gang
> > submit support [..]
>
> Hmm.
>
> I have now had my main desktop lock up twice after pulling this.
> Nothing in the dmesg after a reboot, and nothing in particular that
> seems to trigger it, so I have a hard time even guessing what's up,
> but the drm changes are the primary suspect.
>
> I will try to see if I can get any information out of the machine, but
> with the symptom being just a dead machine ...
>
> This is the same (old) Radeon device:
>
>49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
>
> with dual 4k monitors, running on my good old Threadripper setup.
>
> Again, there's no explicit reason to blame the drm pull, except that
> it started after that merge (that machine ran the kernel with the
> networking pull for a day with no problems, and while there were other
> pull requests in between them, they seem to be fairly unrelated to the
> hardware I have).
>
> But the lockup is so sporadic (twice in the last day) that I really
> can't bisect it, so I'm afraid I have very very little info.
>
> Any suggestions?

Maybe you are seeing this issue with GPU TLB flushes, which is
kind of sporadic:
https://gitlab.freedesktop.org/drm/amd/-/issues/2113
Are you seeing any GPU page faults in your kernel log?

Alex


Re: [git pull] drm for 6.1-rc1

2022-10-06 Thread Linus Torvalds
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
>
> Lots of stuff all over, some new AMD IP support and gang
> submit support [..]

Hmm.

I have now had my main desktop lock up twice after pulling this.
Nothing in the dmesg after a reboot, and nothing in particular that
seems to trigger it, so I have a hard time even guessing what's up,
but the drm changes are the primary suspect.

I will try to see if I can get any information out of the machine, but
with the symptom being just a dead machine ...

This is the same (old) Radeon device:

   49:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
[AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)

with dual 4k monitors, running on my good old Threadripper setup.

Again, there's no explicit reason to blame the drm pull, except that
it started after that merge (that machine ran the kernel with the
networking pull for a day with no problems, and while there were other
pull requests in between them, they seem to be fairly unrelated to the
hardware I have).

But the lockup is so sporadic (twice in the last day) that I really
can't bisect it, so I'm afraid I have very very little info.

Any suggestions?

  Linus


Re: [git pull] drm for 6.1-rc1

2022-10-05 Thread Dave Airlie
On Thu, 6 Oct 2022 at 04:38, Linus Torvalds
 wrote:
>
> On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
> >
> > This is very conflict heavy; mostly the correct answer is picking
> > the version from drm-next.
>
> Ugh, yes, that was a bit annoying.
>
> I get the same end result as you did, but I do wonder if the drm
> people should try to keep some kind of separate "fixes" branches for
> things that go both into the development tree and then get sent to me
> for fixes pulls?
>
> Hopefully this "lots of pointless noise" was a one-off, but it might
> be due to how you guys work..

In this case I think most of it was caused by a late set of fixes for
new AMD hardware that had to be backported and redone to fit into the
current kernel. I haven't seen it this bad in a long while. We also
maintain a rerere tree ourselves to avoid continuously seeing it.

The problem is a lot of developers don't have the insight that the
maintainers do into the current state of the tree/pipeline.

Stuff goes into next because that is where the patch it fixes
originally went, and it goes through CI there. Then at some point
someone else realises the change needs to be in fixes and it gets
backported.

The volume of patches and the company sign-off processes don't make it
trivial to decide upfront what needs to go in -next or -fixes,
unfortunately.

Dave.


Re: [git pull] drm for 6.1-rc1

2022-10-05 Thread pr-tracker-bot
The pull request you sent on Wed, 5 Oct 2022 13:41:47 +1000:

> git://anongit.freedesktop.org/drm/drm tags/drm-next-2022-10-05

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [git pull] drm for 6.1-rc1

2022-10-05 Thread Linus Torvalds
On Tue, Oct 4, 2022 at 8:42 PM Dave Airlie  wrote:
>
> This is very conflict heavy; mostly the correct answer is picking
> the version from drm-next.

Ugh, yes, that was a bit annoying.

I get the same end result as you did, but I do wonder if the drm
people should try to keep some kind of separate "fixes" branches for
things that go both into the development tree and then get sent to me
for fixes pulls?

Hopefully this "lots of pointless noise" was a one-off, but it might
be due to how you guys work..

  Linus