Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-17 Thread Daniel Vetter
On Thu, Jun 17, 2021 at 09:41:35AM +0200, Christian König wrote:
> 
> 
> Am 16.06.21 um 21:19 schrieb Dan Carpenter:
> > On Wed, Jun 16, 2021 at 01:00:38PM +0200, Christian König wrote:
> > > 
> > > Am 16.06.21 um 11:36 schrieb Dan Carpenter:
> > > > On Wed, Jun 16, 2021 at 10:47:14AM +0200, Christian König wrote:
> > > > > Am 16.06.21 um 10:37 schrieb Dan Carpenter:
> > > > > > On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:
> > > > > > > Sending the first message didn't worked, so let's try again.
> > > > > > > 
> > > > > > > Am 16.06.21 um 08:30 schrieb Dan Carpenter:
> > > > > > > > There are three bugs here:
> > > > > > > > 1) We need to call unpopulate() if ttm_tt_populate() succeeds.
> > > > > > > > 2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
> > > > > > > >    was wrong and it was really assigning "new_mem = old_mem;".  There
> > > > > > > >    is no need for this assignment anyway as we already have the value
> > > > > > > >    for "new_mem".
> > > > > > > > 3) The (!new_man->use_tt) condition is reversed.
> > > > > > > > 
> > > > > > > > Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
> > > > > > > > Signed-off-by: Dan Carpenter 
> > > > > > > > ---
> > > > > > > > This is from reading the code and I can't swear that I have understood
> > > > > > > > it correctly.  My nouveau driver is currently unusable and this patch
> > > > > > > > has not helped.  But hopefully if I fix enough bugs eventually it will
> > > > > > > > start to work.
> > > > > > > Well NAK, the code previously looked quite well and you are breaking it now.
> > > > > > > 
> > > > > > > What's the problem with nouveau?
> > > > > > > 
> > > > > > The new Firefox seems to excersize nouveau more than the old one so
> > > > > > when I start 10 firefox windows it just hangs the graphics.
> > > > > > 
> > > > > > I've added debug code and it seems like the problem is that
> > > > > > nv50_mem_new() is failing.
> > > > > Sounds like it is running out of memory to me.
> > > > > 
> > > > > Do you have a dmesg?
> > > > > 
> > > > At first there was a very straight forward use after free bug which I
> > > > fixed.
> > > > https://lore.kernel.org/nouveau/YMinJwpIei9n1Pn1@mwanda/T/#u
> > > > 
> > > > But now the use after free is gone the only thing in dmesg is:
> > > > "[TTM] Buffer eviction failed".  And I have some firmware missing.
> > > > 
> > > > [  205.489763] rfkill: input handler disabled
> > > > [  205.678292] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
> > > > [  205.678300] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
> > > > [  205.678302] nouveau :01:00.0: msvld: unable to load firmware data
> > > > [  205.678304] nouveau :01:00.0: msvld: init failed, -19
> > > > [  296.150632] [TTM] Buffer eviction failed
> > > > [  417.084265] [TTM] Buffer eviction failed
> > > > [  447.295961] [TTM] Buffer eviction failed
> > > > [  510.800231] [TTM] Buffer eviction failed
> > > > [  556.101384] [TTM] Buffer eviction failed
> > > > [  616.495790] [TTM] Buffer eviction failed
> > > > [  692.014007] [TTM] Buffer eviction failed
> > > > 
> > > > The eviction failed message only shows up a minute after the hang so it
> > > > seems more like a symptom than a root cause.
> > > Yeah, look at the timing. What happens is that the buffer eviction timed out
> > > because the hardware is locked up.
> > > 
> > > No idea what that could be. It might not even be kernel related at all.
> > I don't think it's hardware related...  Using an old version of firefox
> > "fixes" the problem.  I downloaded the firmware so that's not the issue.
> > Here's the dmesg load info with the new firmware.
> 
> Oh, I was not suggesting a hardware problem.
> 
> The most likely cause is a software issue in userspace, e.g. wrong order of
> doing thing, doing things to fast without waiting etc...
> 
> There are tons of things how userspace can crash GPU hardware you can't
> prevent in the kernel. Especially sending an endless loop is well known as
> Turing's halting problems and not even theoretically solvable.
> 
> I suggest to start digging in userspace instead.

I guess nouveau doesn't have reset when the fences time out? That would at
least paper over this, plus it makes debugging the bug in mesa3 easier.

Also, as Christian points out, because of the halting problem, lack of TDR
(timeout and device reset) is actually a security bug in itself.
-Daniel
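
To make the TDR idea concrete, here is a rough userspace toy, not nouveau or DRM code. Every name in it (toy_job, toy_engine, toy_wait_or_reset) is invented for illustration. The only point it shows is that the waiter cannot know whether a job will ever finish, so it bounds the wait and resets the engine once the bound is exceeded.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy GPU job (hypothetical). It signals its fence after `ticks_needed`
 * ticks of simulated GPU time; a negative value models an endless loop.
 */
struct toy_job {
	long ticks_needed;
	long ticks_run;
};

/* Toy engine (hypothetical); it only counts how often it was reset. */
struct toy_engine {
	unsigned int resets;
};

static bool toy_fence_signaled(const struct toy_job *job)
{
	return job->ticks_needed >= 0 && job->ticks_run >= job->ticks_needed;
}

/*
 * Wait for the job's fence, but for at most `timeout_ticks`.  If the
 * fence has not signaled by then, "reset" the engine and give up on the
 * job.  Returns true if the job completed, false if it was killed.
 */
bool toy_wait_or_reset(struct toy_engine *eng, struct toy_job *job,
		       long timeout_ticks)
{
	assert(timeout_ticks > 0);

	for (long t = 0; t < timeout_ticks; t++) {
		if (toy_fence_signaled(job))
			return true;
		job->ticks_run++;	/* one tick of simulated GPU time */
	}
	if (toy_fence_signaled(job))
		return true;

	eng->resets++;	/* timeout: reset instead of waiting forever */
	return false;
}
```

A job with ticks_needed = 3 and a timeout of 10 completes normally. A job with ticks_needed = -1 never signals, so the same call returns false and bumps the reset counter instead of hanging the caller.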

> 
> Christian.
> 
> > 
> > [1.412458] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel 
> > [1.412527] AMD-Vi: AMD IOMMUv2 functionality not available on this system
> > [1.412710] nouveau :01:00.0: vgaarb: deactivate vga console
> > [1.417213] Console: switching to colour dummy device 80x25

Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-17 Thread Christian König




Am 16.06.21 um 21:19 schrieb Dan Carpenter:

On Wed, Jun 16, 2021 at 01:00:38PM +0200, Christian König wrote:


Am 16.06.21 um 11:36 schrieb Dan Carpenter:

On Wed, Jun 16, 2021 at 10:47:14AM +0200, Christian König wrote:

Am 16.06.21 um 10:37 schrieb Dan Carpenter:

On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:

Sending the first message didn't worked, so let's try again.

Am 16.06.21 um 08:30 schrieb Dan Carpenter:

There are three bugs here:
1) We need to call unpopulate() if ttm_tt_populate() succeeds.
2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
   was wrong and it was really assigning "new_mem = old_mem;".  There
   is no need for this assignment anyway as we already have the value
   for "new_mem".
3) The (!new_man->use_tt) condition is reversed.

Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
Signed-off-by: Dan Carpenter 
---
This is from reading the code and I can't swear that I have understood
it correctly.  My nouveau driver is currently unusable and this patch
has not helped.  But hopefully if I fix enough bugs eventually it will
start to work.

Well NAK, the code previously looked quite well and you are breaking it now.

What's the problem with nouveau?


The new Firefox seems to excersize nouveau more than the old one so
when I start 10 firefox windows it just hangs the graphics.

I've added debug code and it seems like the problem is that
nv50_mem_new() is failing.

Sounds like it is running out of memory to me.

Do you have a dmesg?


At first there was a very straight forward use after free bug which I
fixed.
https://lore.kernel.org/nouveau/YMinJwpIei9n1Pn1@mwanda/T/#u

But now the use after free is gone the only thing in dmesg is:
"[TTM] Buffer eviction failed".  And I have some firmware missing.

[  205.489763] rfkill: input handler disabled
[  205.678292] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
[  205.678300] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
[  205.678302] nouveau :01:00.0: msvld: unable to load firmware data
[  205.678304] nouveau :01:00.0: msvld: init failed, -19
[  296.150632] [TTM] Buffer eviction failed
[  417.084265] [TTM] Buffer eviction failed
[  447.295961] [TTM] Buffer eviction failed
[  510.800231] [TTM] Buffer eviction failed
[  556.101384] [TTM] Buffer eviction failed
[  616.495790] [TTM] Buffer eviction failed
[  692.014007] [TTM] Buffer eviction failed

The eviction failed message only shows up a minute after the hang so it
seems more like a symptom than a root cause.

Yeah, look at the timing. What happens is that the buffer eviction timed out
because the hardware is locked up.

No idea what that could be. It might not even be kernel related at all.

I don't think it's hardware related...  Using an old version of firefox
"fixes" the problem.  I downloaded the firmware so that's not the issue.
Here's the dmesg load info with the new firmware.


Oh, I was not suggesting a hardware problem.

The most likely cause is a software issue in userspace, e.g. doing things
in the wrong order, doing things too fast without waiting, etc.

There are tons of ways userspace can crash GPU hardware that you can't
prevent in the kernel. In particular, deciding whether a submitted program
loops forever is Turing's halting problem, which is not even theoretically
solvable.

I suggest digging in userspace instead.

Christian.



[1.412458] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel 
[1.412527] AMD-Vi: AMD IOMMUv2 functionality not available on this system
[1.412710] nouveau :01:00.0: vgaarb: deactivate vga console
[1.417213] Console: switching to colour dummy device 80x25
[1.417272] nouveau :01:00.0: NVIDIA GT218 (0a8280b1)
[1.531565] nouveau :01:00.0: bios: nvkm_bios_new: version 70.18.6f.00.05
[1.531916] nouveau :01:00.0: fb: nvkm_ram_ctor: 1024 MiB DDR3
[2.248212] tsc: Refined TSC clocksource calibration: 3392.144 MHz
[2.248218] clocksource: tsc: mask: 0x max_cycles: 0x30e5517d4e4, max_idle_ns: 440795261668 ns
[2.252203] clocksource: Switched to clocksource tsc
[2.848138] nouveau :01:00.0: DRM: VRAM: 1024 MiB
[2.848142] nouveau :01:00.0: DRM: GART: 1048576 MiB
[2.848145] nouveau :01:00.0: DRM: TMDS table version 2.0
[2.848147] nouveau :01:00.0: DRM: DCB version 4.0
[2.848149] nouveau :01:00.0: DRM: DCB outp 00: 01000302 00020030
[2.848151] nouveau :01:00.0: DRM: DCB outp 01: 02000300 
[2.848154] nouveau :01:00.0: DRM: DCB outp 02: 02011362 00020010
[2.848155] nouveau :01:00.0: DRM: DCB outp 03: 01022310 
[2.848157] nouveau :01:00.0: DRM: DCB conn 00: 1030
[2.848159] nouveau :01:00.0: DRM: DCB conn 01: 2161
[2.848161] nouveau :01:00.0: DRM: DCB conn 02: 0200
[2.850214] nouveau :01:00.0: DRM: MM: using COPY for 

Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Dan Carpenter
On Wed, Jun 16, 2021 at 01:00:38PM +0200, Christian König wrote:
> 
> 
> Am 16.06.21 um 11:36 schrieb Dan Carpenter:
> > On Wed, Jun 16, 2021 at 10:47:14AM +0200, Christian König wrote:
> > > 
> > > Am 16.06.21 um 10:37 schrieb Dan Carpenter:
> > > > On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:
> > > > > Sending the first message didn't worked, so let's try again.
> > > > > 
> > > > > Am 16.06.21 um 08:30 schrieb Dan Carpenter:
> > > > > > There are three bugs here:
> > > > > > 1) We need to call unpopulate() if ttm_tt_populate() succeeds.
> > > > > > 2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
> > > > > >   was wrong and it was really assigning "new_mem = old_mem;".  There
> > > > > >   is no need for this assignment anyway as we already have the value
> > > > > >   for "new_mem".
> > > > > > 3) The (!new_man->use_tt) condition is reversed.
> > > > > > 
> > > > > > Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
> > > > > > Signed-off-by: Dan Carpenter 
> > > > > > ---
> > > > > > This is from reading the code and I can't swear that I have understood
> > > > > > it correctly.  My nouveau driver is currently unusable and this patch
> > > > > > has not helped.  But hopefully if I fix enough bugs eventually it will
> > > > > > start to work.
> > > > > Well NAK, the code previously looked quite well and you are breaking it now.
> > > > > 
> > > > > What's the problem with nouveau?
> > > > > 
> > > > The new Firefox seems to excersize nouveau more than the old one so
> > > > when I start 10 firefox windows it just hangs the graphics.
> > > > 
> > > > I've added debug code and it seems like the problem is that
> > > > nv50_mem_new() is failing.
> > > Sounds like it is running out of memory to me.
> > > 
> > > Do you have a dmesg?
> > > 
> > At first there was a very straight forward use after free bug which I
> > fixed.
> > https://lore.kernel.org/nouveau/YMinJwpIei9n1Pn1@mwanda/T/#u
> > 
> > But now the use after free is gone the only thing in dmesg is:
> > "[TTM] Buffer eviction failed".  And I have some firmware missing.
> > 
> > [  205.489763] rfkill: input handler disabled
> > [  205.678292] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
> > [  205.678300] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
> > [  205.678302] nouveau :01:00.0: msvld: unable to load firmware data
> > [  205.678304] nouveau :01:00.0: msvld: init failed, -19
> > [  296.150632] [TTM] Buffer eviction failed
> > [  417.084265] [TTM] Buffer eviction failed
> > [  447.295961] [TTM] Buffer eviction failed
> > [  510.800231] [TTM] Buffer eviction failed
> > [  556.101384] [TTM] Buffer eviction failed
> > [  616.495790] [TTM] Buffer eviction failed
> > [  692.014007] [TTM] Buffer eviction failed
> > 
> > The eviction failed message only shows up a minute after the hang so it
> > seems more like a symptom than a root cause.
> 
> Yeah, look at the timing. What happens is that the buffer eviction timed out
> because the hardware is locked up.
> 
> No idea what that could be. It might not even be kernel related at all.

I don't think it's hardware related...  Using an old version of firefox
"fixes" the problem.  I downloaded the firmware so that's not the issue.
Here's the dmesg load info with the new firmware.

[1.412458] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel 
[1.412527] AMD-Vi: AMD IOMMUv2 functionality not available on this system
[1.412710] nouveau :01:00.0: vgaarb: deactivate vga console
[1.417213] Console: switching to colour dummy device 80x25
[1.417272] nouveau :01:00.0: NVIDIA GT218 (0a8280b1)
[1.531565] nouveau :01:00.0: bios: nvkm_bios_new: version 70.18.6f.00.05
[1.531916] nouveau :01:00.0: fb: nvkm_ram_ctor: 1024 MiB DDR3
[2.248212] tsc: Refined TSC clocksource calibration: 3392.144 MHz
[2.248218] clocksource: tsc: mask: 0x max_cycles: 0x30e5517d4e4, max_idle_ns: 440795261668 ns
[2.252203] clocksource: Switched to clocksource tsc
[2.848138] nouveau :01:00.0: DRM: VRAM: 1024 MiB
[2.848142] nouveau :01:00.0: DRM: GART: 1048576 MiB
[2.848145] nouveau :01:00.0: DRM: TMDS table version 2.0
[2.848147] nouveau :01:00.0: DRM: DCB version 4.0
[2.848149] nouveau :01:00.0: DRM: DCB outp 00: 01000302 00020030
[2.848151] nouveau :01:00.0: DRM: DCB outp 01: 02000300 
[2.848154] nouveau :01:00.0: DRM: DCB outp 02: 02011362 00020010
[2.848155] nouveau :01:00.0: DRM: DCB outp 03: 01022310 
[2.848157] nouveau :01:00.0: DRM: DCB conn 00: 1030
[2.848159] nouveau :01:00.0: DRM: DCB conn 01: 2161
[2.848161] nouveau :01:00.0: DRM: DCB conn 02: 0200
[2.850214] nouveau 

Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Christian König




Am 16.06.21 um 11:36 schrieb Dan Carpenter:

On Wed, Jun 16, 2021 at 10:47:14AM +0200, Christian König wrote:


Am 16.06.21 um 10:37 schrieb Dan Carpenter:

On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:

Sending the first message didn't worked, so let's try again.

Am 16.06.21 um 08:30 schrieb Dan Carpenter:

There are three bugs here:
1) We need to call unpopulate() if ttm_tt_populate() succeeds.
2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
  was wrong and it was really assigning "new_mem = old_mem;".  There
  is no need for this assignment anyway as we already have the value
  for "new_mem".
3) The (!new_man->use_tt) condition is reversed.

Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
Signed-off-by: Dan Carpenter 
---
This is from reading the code and I can't swear that I have understood
it correctly.  My nouveau driver is currently unusable and this patch
has not helped.  But hopefully if I fix enough bugs eventually it will
start to work.

Well NAK, the code previously looked quite well and you are breaking it now.

What's the problem with nouveau?


The new Firefox seems to excersize nouveau more than the old one so
when I start 10 firefox windows it just hangs the graphics.

I've added debug code and it seems like the problem is that
nv50_mem_new() is failing.

Sounds like it is running out of memory to me.

Do you have a dmesg?


At first there was a very straight forward use after free bug which I
fixed.
https://lore.kernel.org/nouveau/YMinJwpIei9n1Pn1@mwanda/T/#u

But now the use after free is gone the only thing in dmesg is:
"[TTM] Buffer eviction failed".  And I have some firmware missing.

[  205.489763] rfkill: input handler disabled
[  205.678292] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
[  205.678300] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
[  205.678302] nouveau :01:00.0: msvld: unable to load firmware data
[  205.678304] nouveau :01:00.0: msvld: init failed, -19
[  296.150632] [TTM] Buffer eviction failed
[  417.084265] [TTM] Buffer eviction failed
[  447.295961] [TTM] Buffer eviction failed
[  510.800231] [TTM] Buffer eviction failed
[  556.101384] [TTM] Buffer eviction failed
[  616.495790] [TTM] Buffer eviction failed
[  692.014007] [TTM] Buffer eviction failed

The eviction failed message only shows up a minute after the hang so it
seems more like a symptom than a root cause.


Yeah, look at the timing. What happens is that the buffer eviction timed 
out because the hardware is locked up.


No idea what that could be. It might not even be kernel related at all.

Regards,
Christian.



regards,
dan carpenter





Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Dan Carpenter
On Wed, Jun 16, 2021 at 10:47:14AM +0200, Christian König wrote:
> 
> 
> Am 16.06.21 um 10:37 schrieb Dan Carpenter:
> > On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:
> > > Sending the first message didn't worked, so let's try again.
> > > 
> > > Am 16.06.21 um 08:30 schrieb Dan Carpenter:
> > > > There are three bugs here:
> > > > 1) We need to call unpopulate() if ttm_tt_populate() succeeds.
> > > > 2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
> > > >  was wrong and it was really assigning "new_mem = old_mem;".  There
> > > >  is no need for this assignment anyway as we already have the value
> > > >  for "new_mem".
> > > > 3) The (!new_man->use_tt) condition is reversed.
> > > > 
> > > > Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
> > > > Signed-off-by: Dan Carpenter 
> > > > ---
> > > > This is from reading the code and I can't swear that I have understood
> > > > it correctly.  My nouveau driver is currently unusable and this patch
> > > > has not helped.  But hopefully if I fix enough bugs eventually it will
> > > > start to work.
> > > Well NAK, the code previously looked quite well and you are breaking it now.
> > > 
> > > What's the problem with nouveau?
> > > 
> > The new Firefox seems to excersize nouveau more than the old one so
> > when I start 10 firefox windows it just hangs the graphics.
> > 
> > I've added debug code and it seems like the problem is that
> > nv50_mem_new() is failing.
> 
> Sounds like it is running out of memory to me.
> 
> Do you have a dmesg?
> 

At first there was a very straightforward use-after-free bug which I
fixed.
https://lore.kernel.org/nouveau/YMinJwpIei9n1Pn1@mwanda/T/#u

But now that the use-after-free is gone, the only thing in dmesg is:
"[TTM] Buffer eviction failed".  And I have some firmware missing.

[  205.489763] rfkill: input handler disabled
[  205.678292] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084 failed with error -2
[  205.678300] nouveau :01:00.0: Direct firmware load for nouveau/nva8_fuc084d failed with error -2
[  205.678302] nouveau :01:00.0: msvld: unable to load firmware data
[  205.678304] nouveau :01:00.0: msvld: init failed, -19
[  296.150632] [TTM] Buffer eviction failed
[  417.084265] [TTM] Buffer eviction failed
[  447.295961] [TTM] Buffer eviction failed
[  510.800231] [TTM] Buffer eviction failed
[  556.101384] [TTM] Buffer eviction failed
[  616.495790] [TTM] Buffer eviction failed
[  692.014007] [TTM] Buffer eviction failed

The eviction failed message only shows up a minute after the hang so it
seems more like a symptom than a root cause.
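
For reference, the negative numbers in the firmware lines above are kernel errno values. A minimal sketch for decoding them, assuming a Linux userspace toolchain; the helper names describe_kernel_err and print_log_errors are made up for this example:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map a kernel-style negative return code, as it
 * appears in dmesg, to the C library's description of that errno.
 */
const char *describe_kernel_err(int ret)
{
	assert(ret < 0);	/* kernel error returns are negative errno values */
	return strerror(-ret);
}

/* Print descriptions for the two codes seen in the log above. */
void print_log_errors(void)
{
	printf("error -2:  %s\n", describe_kernel_err(-2));
	printf("error -19: %s\n", describe_kernel_err(-19));
}
```

On Linux, 2 is ENOENT and 19 is ENODEV. That matches the log: the firmware file is not found, and msvld init then fails with -19.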

regards,
dan carpenter



Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Christian König




Am 16.06.21 um 10:37 schrieb Dan Carpenter:

On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:

Sending the first message didn't worked, so let's try again.

Am 16.06.21 um 08:30 schrieb Dan Carpenter:

There are three bugs here:
1) We need to call unpopulate() if ttm_tt_populate() succeeds.
2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
 was wrong and it was really assigning "new_mem = old_mem;".  There
 is no need for this assignment anyway as we already have the value
 for "new_mem".
3) The (!new_man->use_tt) condition is reversed.

Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
Signed-off-by: Dan Carpenter 
---
This is from reading the code and I can't swear that I have understood
it correctly.  My nouveau driver is currently unusable and this patch
has not helped.  But hopefully if I fix enough bugs eventually it will
start to work.

Well NAK, the code previously looked quite well and you are breaking it now.

What's the problem with nouveau?


The new Firefox seems to excersize nouveau more than the old one so
when I start 10 firefox windows it just hangs the graphics.

I've added debug code and it seems like the problem is that
nv50_mem_new() is failing.


Sounds like it is running out of memory to me.

Do you have a dmesg?





   drivers/gpu/drm/ttm/ttm_bo.c | 14 --
   1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index ebcffe794adb..72dde093f754 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -180,12 +180,12 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 */
ret = ttm_tt_create(bo, old_man->use_tt);
if (ret)
-   goto out_err;
+   return ret;
if (mem->mem_type != TTM_PL_SYSTEM) {
ret = ttm_tt_populate(bo->bdev, bo->ttm, ctx);
if (ret)
-   goto out_err;
+   goto err_destroy;
}
}
@@ -193,15 +193,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
if (ret) {
if (ret == -EMULTIHOP)
return ret;
-   goto out_err;
+   goto err_unpopulate;
}
ctx->bytes_moved += bo->base.size;
return 0;
-out_err:
-   new_man = ttm_manager_type(bdev, bo->mem.mem_type);

This here switches new and old manager. E.g. the new_man is now pointing to
the existing resource manager.

Why not just use "old_man" instead of basically the equivalent to
"new_man = old_man"?  Can the old_man change part way through the
function?


Good question :)

I don't think old_man can change, and yes, that would be much easier to
understand.
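
To illustrate the point being discussed: as long as the move has failed and bo->mem has not been updated, a lookup through bo->mem.mem_type gives back the same manager as old_man. The sketch below is a toy model, and all of its types and functions (toy_device, toy_bo, toy_manager_type and so on) are simplified stand-ins invented for this example, not the real TTM definitions.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_PL_SYSTEM, TOY_PL_TT, TOY_PL_VRAM, TOY_NUM_TYPES };

/* Simplified stand-ins for the manager, resource, device and BO. */
struct toy_manager  { bool use_tt; };
struct toy_resource { int mem_type; };
struct toy_device   { struct toy_manager man[TOY_NUM_TYPES]; };
struct toy_bo       { struct toy_resource mem; };

/* Look up the manager responsible for a given memory type. */
struct toy_manager *toy_manager_type(struct toy_device *bdev, int mem_type)
{
	assert(mem_type >= 0 && mem_type < TOY_NUM_TYPES);
	return &bdev->man[mem_type];
}

/*
 * Model of the error path: the move to new_mem failed, so bo->mem still
 * describes the old placement.  Returns true if re-deriving the manager
 * from bo->mem yields old_man.
 */
bool toy_error_path_lookup_is_old_man(struct toy_device *bdev,
				      struct toy_bo *bo,
				      const struct toy_resource *new_mem)
{
	struct toy_manager *old_man = toy_manager_type(bdev, bo->mem.mem_type);
	struct toy_manager *new_man = toy_manager_type(bdev, new_mem->mem_type);

	/* ... the move fails here, bo->mem is left untouched ... */

	new_man = toy_manager_type(bdev, bo->mem.mem_type);
	return new_man == old_man;
}
```

With a BO currently in TOY_PL_VRAM and a requested move to TOY_PL_SYSTEM, the function returns true, which is why using old_man directly on the error path reads more clearly.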


Regards,
Christian.



regards,
dan carpenter





Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Dan Carpenter
On Wed, Jun 16, 2021 at 08:46:33AM +0200, Christian König wrote:
> Sending the first message didn't worked, so let's try again.
> 
> Am 16.06.21 um 08:30 schrieb Dan Carpenter:
> > There are three bugs here:
> > 1) We need to call unpopulate() if ttm_tt_populate() succeeds.
> > 2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
> > was wrong and it was really assigning "new_mem = old_mem;".  There
> > is no need for this assignment anyway as we already have the value
> > for "new_mem".
> > 3) The (!new_man->use_tt) condition is reversed.
> > 
> > Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
> > Signed-off-by: Dan Carpenter 
> > ---
> > This is from reading the code and I can't swear that I have understood
> > it correctly.  My nouveau driver is currently unusable and this patch
> > has not helped.  But hopefully if I fix enough bugs eventually it will
> > start to work.
> 
> Well NAK, the code previously looked quite well and you are breaking it now.
> 
> What's the problem with nouveau?
> 

The new Firefox seems to exercise nouveau more than the old one, so
when I start 10 Firefox windows it just hangs the graphics.

I've added debug code and it seems like the problem is that
nv50_mem_new() is failing.


> >   drivers/gpu/drm/ttm/ttm_bo.c | 14 --
> >   1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > index ebcffe794adb..72dde093f754 100644
> > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > @@ -180,12 +180,12 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> >  */
> > ret = ttm_tt_create(bo, old_man->use_tt);
> > if (ret)
> > -   goto out_err;
> > +   return ret;
> > if (mem->mem_type != TTM_PL_SYSTEM) {
> > ret = ttm_tt_populate(bo->bdev, bo->ttm, ctx);
> > if (ret)
> > -   goto out_err;
> > +   goto err_destroy;
> > }
> > }
> > @@ -193,15 +193,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
> > if (ret) {
> > if (ret == -EMULTIHOP)
> > return ret;
> > -   goto out_err;
> > +   goto err_unpopulate;
> > }
> > ctx->bytes_moved += bo->base.size;
> > return 0;
> > -out_err:
> > -   new_man = ttm_manager_type(bdev, bo->mem.mem_type);
> 
> This here switches new and old manager. E.g. the new_man is now pointing to
> the existing resource manager.

Why not just use "old_man" instead of basically the equivalent to
"new_man = old_man"?  Can the old_man change part way through the
function?

regards,
dan carpenter



Re: [PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Christian König

Sending the first message didn't work, so let's try again.

Am 16.06.21 um 08:30 schrieb Dan Carpenter:

There are three bugs here:
1) We need to call unpopulate() if ttm_tt_populate() succeeds.
2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
was wrong and it was really assigning "new_mem = old_mem;".  There
is no need for this assignment anyway as we already have the value
for "new_mem".
3) The (!new_man->use_tt) condition is reversed.

Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
Signed-off-by: Dan Carpenter 
---
This is from reading the code and I can't swear that I have understood
it correctly.  My nouveau driver is currently unusable and this patch
has not helped.  But hopefully if I fix enough bugs eventually it will
start to work.


Well, NAK. The code previously looked fine and you are breaking it now.

What's the problem with nouveau?


  drivers/gpu/drm/ttm/ttm_bo.c | 14 ++++++++------
  1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index ebcffe794adb..72dde093f754 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -180,12 +180,12 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 		 */
 		ret = ttm_tt_create(bo, old_man->use_tt);
 		if (ret)
-			goto out_err;
+			return ret;
 
 		if (mem->mem_type != TTM_PL_SYSTEM) {
 			ret = ttm_tt_populate(bo->bdev, bo->ttm, ctx);
 			if (ret)
-				goto out_err;
+				goto err_destroy;
 		}
 	}
 
@@ -193,15 +193,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 	if (ret) {
 		if (ret == -EMULTIHOP)
 			return ret;
-		goto out_err;
+		goto err_unpopulate;
 	}
 
 	ctx->bytes_moved += bo->base.size;
 	return 0;
 
-out_err:
-	new_man = ttm_manager_type(bdev, bo->mem.mem_type);


This switches the new and old manager, i.e. new_man is now pointing to
the existing resource manager.



-	if (!new_man->use_tt)


So we should destroy the TT object only if the old manager is not using one.


+err_unpopulate:
+	if (new_man->use_tt)
+		ttm_tt_unpopulate(bo->bdev, bo->ttm);


Unpopulate is not necessary, destroying is sufficient.

Christian.


+err_destroy:
+	if (new_man->use_tt)
 		ttm_bo_tt_destroy(bo);
 
 	return ret;




[PATCH] drm/ttm: fix error handling in ttm_bo_handle_move_mem()

2021-06-16 Thread Dan Carpenter
There are three bugs here:
1) We need to call unpopulate() if ttm_tt_populate() succeeds but the
   move then fails.
2) The "new_man = ttm_manager_type(bdev, bo->mem.mem_type);" assignment
   was wrong and it was really assigning "new_man = old_man;".  There
   is no need for this assignment anyway as we already have the value
   for "new_man".
3) The (!new_man->use_tt) condition is reversed.

Fixes: ba4e7d973dd0 ("drm: Add the TTM GPU memory manager subsystem.")
Signed-off-by: Dan Carpenter 
---
This is from reading the code and I can't swear that I have understood
it correctly.  My nouveau driver is currently unusable and this patch
has not helped.  But hopefully if I fix enough bugs eventually it will
start to work.

 drivers/gpu/drm/ttm/ttm_bo.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index ebcffe794adb..72dde093f754 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -180,12 +180,12 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 		 */
 		ret = ttm_tt_create(bo, old_man->use_tt);
 		if (ret)
-			goto out_err;
+			return ret;
 
 		if (mem->mem_type != TTM_PL_SYSTEM) {
 			ret = ttm_tt_populate(bo->bdev, bo->ttm, ctx);
 			if (ret)
-				goto out_err;
+				goto err_destroy;
 		}
 	}
 
@@ -193,15 +193,17 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
 	if (ret) {
 		if (ret == -EMULTIHOP)
 			return ret;
-		goto out_err;
+		goto err_unpopulate;
 	}
 
 	ctx->bytes_moved += bo->base.size;
 	return 0;
 
-out_err:
-	new_man = ttm_manager_type(bdev, bo->mem.mem_type);
-	if (!new_man->use_tt)
+err_unpopulate:
+	if (new_man->use_tt)
+		ttm_tt_unpopulate(bo->bdev, bo->ttm);
+err_destroy:
+	if (new_man->use_tt)
 		ttm_bo_tt_destroy(bo);
 
 	return ret;
-- 
2.30.2
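
For readers following the label changes in the diff above: the patch turns the single out_err label into the common kernel "unwind ladder", where each failure jumps to the label that undoes only what has been set up so far. The sketch below is a standalone toy with made-up names (toy_tt, toy_handle_move and friends), not TTM code, and it does not settle the review question of whether an explicit unpopulate is needed before destroy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a TT object (hypothetical, not the TTM struct). */
struct toy_tt {
	bool created;
	bool populated;
};

/* Selects which step fails, so each unwind path can be exercised. */
enum toy_fail { TOY_FAIL_NONE, TOY_FAIL_CREATE, TOY_FAIL_POPULATE, TOY_FAIL_MOVE };

static int toy_create(struct toy_tt *tt, enum toy_fail f)
{
	if (f == TOY_FAIL_CREATE)
		return -ENOMEM;
	tt->created = true;
	return 0;
}

static int toy_populate(struct toy_tt *tt, enum toy_fail f)
{
	assert(tt->created);	/* populate only makes sense after create */
	if (f == TOY_FAIL_POPULATE)
		return -ENOMEM;
	tt->populated = true;
	return 0;
}

static int toy_do_move(enum toy_fail f)
{
	return f == TOY_FAIL_MOVE ? -EIO : 0;
}

/* Three setup steps, unwound in reverse order on failure. */
int toy_handle_move(struct toy_tt *tt, enum toy_fail f)
{
	int ret;

	ret = toy_create(tt, f);
	if (ret)
		return ret;		/* nothing acquired yet */

	ret = toy_populate(tt, f);
	if (ret)
		goto err_destroy;	/* undo create only */

	ret = toy_do_move(f);
	if (ret)
		goto err_unpopulate;	/* undo populate, then create */

	return 0;

err_unpopulate:
	tt->populated = false;
err_destroy:
	tt->created = false;
	return ret;
}
```

Calling toy_handle_move() with each toy_fail value shows that every failure leaves the toy object with both flags cleared, while the success path leaves both set.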