Re: ttm crash on init

2018-09-17 Thread Tom St Denis

I've sent a patch to the list that fixes the bug on my end.

Cheers,

Tom

On 2018-09-17 2:01 p.m., Tom St Denis wrote:

On 2018-09-17 1:55 p.m., Christian König wrote:

Am 17.09.2018 um 19:50 schrieb Tom St Denis:

On 2018-09-17 1:45 p.m., Christian König wrote:

Mhm, not the slightest idea.

That nearly looks like adev->stolen_vga_memory already contains 
something.


Nope,

[   51.564605] >>>adev->stolen_vga_memory == (null)
[   51.564619] kasan: CONFIG_KASAN_INLINE enabled
[   51.564877] kasan: GPF could be caused by NULL-ptr deref or user 
memory access
[   51.565071] general protection fault:  [#1] SMP 
DEBUG_PAGEALLOC KASAN NOPTI
[   51.565254] CPU: 6 PID: 3863 Comm: modprobe Not tainted 
4.19.0-rc1+ #30
[   51.565425] Hardware name: System manufacturer System Product 
Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018

[   51.565714] RIP: 0010:amdgpu_bo_create_kernel+0x59/0x1a0 [amdgpu]

That's me printing out the value of stolen_vga_memory before the call to 
allocate it.


What does amdgpu_bo_create_kernel+0x59 point to?


I've never really got line numbers to work with the kernel but if I had 
to guess I'd say right here


int amdgpu_bo_create_kernel(struct amdgpu_device *adev,
                            unsigned long size, int align,
                            u32 domain, struct amdgpu_bo **bo_ptr,
                            u64 *gpu_addr, void **cpu_addr)
{
        int r;

        r = amdgpu_bo_create_reserved(adev, size, align, domain, bo_ptr,
                                      gpu_addr, cpu_addr);

        if (r)
                return r;

*bo_ptr is NULL ===>    amdgpu_bo_unreserve(*bo_ptr);

        return 0;
}

Which then results in

static inline void amdgpu_bo_unreserve(struct amdgpu_bo *bo)
{
        ttm_bo_unreserve(&bo->tbo);
}

Which then passes the address NULL + offsetof(tbo) to ttm_bo_unreserve:

static inline void ttm_bo_unreserve(struct ttm_buffer_object *bo)
{
        if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
                spin_lock(&bo->bdev->glob->lru_lock);
                ttm_bo_add_to_lru(bo);
                spin_unlock(&bo->bdev->glob->lru_lock);
        }
        reservation_object_unlock(bo->resv);
}


Which likely faults on reading bo->mem.placement since the address is 
bogus.


The report is from amdgpu_bo_create_kernel because everything is a macro 
or inlined... :-)


Tom



Christian.



Tom




Christian.

Am 17.09.2018 um 18:47 schrieb Tom St Denis:

On 2018-09-17 12:21 p.m., Tom St Denis wrote:
(attached).  I'll try to bisect in a second.  Is anyone aware of 
this?


Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx








Re: ttm crash on init

2018-09-17 Thread Christian König

Am 17.09.2018 um 20:01 schrieb Tom St Denis:

On 2018-09-17 1:55 p.m., Christian König wrote:

Am 17.09.2018 um 19:50 schrieb Tom St Denis:

On 2018-09-17 1:45 p.m., Christian König wrote:

Mhm, not the slightest idea.

That nearly looks like adev->stolen_vga_memory already contains 
something.


Nope,

[   51.564605] >>>adev->stolen_vga_memory == (null)
[   51.564619] kasan: CONFIG_KASAN_INLINE enabled
[   51.564877] kasan: GPF could be caused by NULL-ptr deref or user 
memory access
[   51.565071] general protection fault:  [#1] SMP 
DEBUG_PAGEALLOC KASAN NOPTI
[   51.565254] CPU: 6 PID: 3863 Comm: modprobe Not tainted 
4.19.0-rc1+ #30
[   51.565425] Hardware name: System manufacturer System Product 
Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018

[   51.565714] RIP: 0010:amdgpu_bo_create_kernel+0x59/0x1a0 [amdgpu]

That's me printing out the value of stolen_vga_memory before the call to 
allocate it.


What does amdgpu_bo_create_kernel+0x59 point to?


I've never really got line numbers to work with the kernel but if I 
had to guess I'd say right here


int amdgpu_bo_create_kernel(struct amdgpu_device *adev,
                            unsigned long size, int align,
                            u32 domain, struct amdgpu_bo **bo_ptr,
                            u64 *gpu_addr, void **cpu_addr)
{
        int r;

        r = amdgpu_bo_create_reserved(adev, size, align, domain, bo_ptr,
                                      gpu_addr, cpu_addr);

        if (r)
                return r;

*bo_ptr is NULL ===>    amdgpu_bo_unreserve(*bo_ptr);


Ah, of course! Thanks for pointing out the obvious, totally forgot that 
there is still another function in the call chain.


Patch to fix is on the list,
Christian.



        return 0;
}

Which then results in

static inline void amdgpu_bo_unreserve(struct amdgpu_bo *bo)
{
        ttm_bo_unreserve(&bo->tbo);
}

Which then passes the address NULL + offsetof(tbo) to ttm_bo_unreserve:

static inline void ttm_bo_unreserve(struct ttm_buffer_object *bo)
{
        if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
                spin_lock(&bo->bdev->glob->lru_lock);
                ttm_bo_add_to_lru(bo);
                spin_unlock(&bo->bdev->glob->lru_lock);
        }
        reservation_object_unlock(bo->resv);
}


Which likely faults on reading bo->mem.placement since the address is 
bogus.


The report is from amdgpu_bo_create_kernel because everything is a 
macro or inlined... :-)


Tom



Christian.



Tom




Christian.

Am 17.09.2018 um 18:47 schrieb Tom St Denis:

On 2018-09-17 12:21 p.m., Tom St Denis wrote:
(attached).  I'll try to bisect in a second.  Is anyone aware of 
this?


Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom






Re: ttm crash on init

2018-09-17 Thread Tom St Denis

On 2018-09-17 1:55 p.m., Christian König wrote:

Am 17.09.2018 um 19:50 schrieb Tom St Denis:

On 2018-09-17 1:45 p.m., Christian König wrote:

Mhm, not the slightest idea.

That nearly looks like adev->stolen_vga_memory already contains 
something.


Nope,

[   51.564605] >>>adev->stolen_vga_memory == (null)
[   51.564619] kasan: CONFIG_KASAN_INLINE enabled
[   51.564877] kasan: GPF could be caused by NULL-ptr deref or user 
memory access
[   51.565071] general protection fault:  [#1] SMP DEBUG_PAGEALLOC 
KASAN NOPTI
[   51.565254] CPU: 6 PID: 3863 Comm: modprobe Not tainted 4.19.0-rc1+ 
#30
[   51.565425] Hardware name: System manufacturer System Product 
Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018

[   51.565714] RIP: 0010:amdgpu_bo_create_kernel+0x59/0x1a0 [amdgpu]

That's me printing out the value of stolen_vga_memory before the call to 
allocate it.


What does amdgpu_bo_create_kernel+0x59 point to?


I've never really got line numbers to work with the kernel but if I had 
to guess I'd say right here


int amdgpu_bo_create_kernel(struct amdgpu_device *adev,
                            unsigned long size, int align,
                            u32 domain, struct amdgpu_bo **bo_ptr,
                            u64 *gpu_addr, void **cpu_addr)
{
        int r;

        r = amdgpu_bo_create_reserved(adev, size, align, domain, bo_ptr,
                                      gpu_addr, cpu_addr);

        if (r)
                return r;

*bo_ptr is NULL ===>    amdgpu_bo_unreserve(*bo_ptr);

        return 0;
}

Which then results in

static inline void amdgpu_bo_unreserve(struct amdgpu_bo *bo)
{
        ttm_bo_unreserve(&bo->tbo);
}

Which then passes the address NULL + offsetof(tbo) to ttm_bo_unreserve:

static inline void ttm_bo_unreserve(struct ttm_buffer_object *bo)
{
        if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
                spin_lock(&bo->bdev->glob->lru_lock);
                ttm_bo_add_to_lru(bo);
                spin_unlock(&bo->bdev->glob->lru_lock);
        }
        reservation_object_unlock(bo->resv);
}


Which likely faults on reading bo->mem.placement since the address is bogus.

The report is from amdgpu_bo_create_kernel because everything is a macro 
or inlined... :-)


Tom



Christian.



Tom




Christian.

Am 17.09.2018 um 18:47 schrieb Tom St Denis:

On 2018-09-17 12:21 p.m., Tom St Denis wrote:

(attached).  I'll try to bisect in a second.  Is anyone aware of this?

Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom






Re: ttm crash on init

2018-09-17 Thread Christian König

Am 17.09.2018 um 19:50 schrieb Tom St Denis:

On 2018-09-17 1:45 p.m., Christian König wrote:

Mhm, not the slightest idea.

That nearly looks like adev->stolen_vga_memory already contains 
something.


Nope,

[   51.564605] >>>adev->stolen_vga_memory == (null)
[   51.564619] kasan: CONFIG_KASAN_INLINE enabled
[   51.564877] kasan: GPF could be caused by NULL-ptr deref or user 
memory access
[   51.565071] general protection fault:  [#1] SMP DEBUG_PAGEALLOC 
KASAN NOPTI
[   51.565254] CPU: 6 PID: 3863 Comm: modprobe Not tainted 4.19.0-rc1+ 
#30
[   51.565425] Hardware name: System manufacturer System Product 
Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018

[   51.565714] RIP: 0010:amdgpu_bo_create_kernel+0x59/0x1a0 [amdgpu]

That's me printing out the value of stolen_vga_memory before the call to 
allocate it.


What does amdgpu_bo_create_kernel+0x59 point to?

Christian.



Tom




Christian.

Am 17.09.2018 um 18:47 schrieb Tom St Denis:

On 2018-09-17 12:21 p.m., Tom St Denis wrote:

(attached).  I'll try to bisect in a second.  Is anyone aware of this?

Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom






Re: ttm crash on init

2018-09-17 Thread Tom St Denis

On 2018-09-17 1:45 p.m., Christian König wrote:

Mhm, not the slightest idea.

That nearly looks like adev->stolen_vga_memory already contains something.


Nope,

[   51.564605] >>>adev->stolen_vga_memory ==   (null)
[   51.564619] kasan: CONFIG_KASAN_INLINE enabled
[   51.564877] kasan: GPF could be caused by NULL-ptr deref or user 
memory access
[   51.565071] general protection fault:  [#1] SMP DEBUG_PAGEALLOC 
KASAN NOPTI

[   51.565254] CPU: 6 PID: 3863 Comm: modprobe Not tainted 4.19.0-rc1+ #30
[   51.565425] Hardware name: System manufacturer System Product 
Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018

[   51.565714] RIP: 0010:amdgpu_bo_create_kernel+0x59/0x1a0 [amdgpu]

That's me printing out the value of stolen_vga_memory before the call to 
allocate it.


Tom




Christian.

Am 17.09.2018 um 18:47 schrieb Tom St Denis:

On 2018-09-17 12:21 p.m., Tom St Denis wrote:

(attached).  I'll try to bisect in a second.  Is anyone aware of this?

Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom






Re: ttm crash on init

2018-09-17 Thread Christian König

Mhm, not the slightest idea.

That nearly looks like adev->stolen_vga_memory already contains something.

Christian.

Am 17.09.2018 um 18:47 schrieb Tom St Denis:

On 2018-09-17 12:21 p.m., Tom St Denis wrote:

(attached).  I'll try to bisect in a second.  Is anyone aware of this?

Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom




Re: ttm crash on init

2018-09-17 Thread Tom St Denis

On 2018-09-17 12:21 p.m., Tom St Denis wrote:

(attached).  I'll try to bisect in a second.  Is anyone aware of this?

Tom


Bisection led to:

a327772a5655ff4fb104c8aae6515faa461df466 is the first bad commit
commit a327772a5655ff4fb104c8aae6515faa461df466
Author: Christian König 
Date:   Fri Sep 14 21:06:50 2018 +0200

    drm/amdgpu: drop size check

    We no don't allocate zero sized kernel BOs any longer.

    Signed-off-by: Christian König 
    Reviewed-by: Alex Deucher 

:04 04 265e4fa231d367d354e4c66600b8f98a4d2f04c4 
3702baaeb2423361dcd7eac8c533edace760ae3e M  drivers



As the culprit.

Cheers,
Tom


ttm crash on init

2018-09-17 Thread Tom St Denis

(attached).  I'll try to bisect in a second.  Is anyone aware of this?

Tom
[0.00] Linux version 4.19.0-rc1+ (root@raven) (gcc version 8.1.1 20180712 (Red Hat 8.1.1-5) (GCC)) #29 SMP Fri Sep 14 07:30:30 EDT 2018
[0.00] Command line: BOOT_IMAGE=/vmlinuz-4.19.0-rc1+ root=UUID=66163c80-0ca1-4beb-aeba-5cc130b813e6 ro rhgb quiet modprobe.blacklist=amdgpu,radeon LANG=en_CA.UTF-8
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009d3ff] usable
[0.00] BIOS-e820: [mem 0x0009d400-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x03ff] usable
[0.00] BIOS-e820: [mem 0x0400-0x04009fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x0400a000-0x09bf] usable
[0.00] BIOS-e820: [mem 0x09c0-0x09ff] reserved
[0.00] BIOS-e820: [mem 0x0a00-0x0aff] usable
[0.00] BIOS-e820: [mem 0x0b00-0x0b01] reserved
[0.00] BIOS-e820: [mem 0x0b02-0x73963fff] usable
[0.00] BIOS-e820: [mem 0x73964000-0x7397cfff] ACPI data
[0.00] BIOS-e820: [mem 0x7397d000-0x7a5aafff] usable
[0.00] BIOS-e820: [mem 0x7a5ab000-0x7a6c2fff] reserved
[0.00] BIOS-e820: [mem 0x7a6c3000-0x7a6cefff] ACPI data
[0.00] BIOS-e820: [mem 0x7a6cf000-0x7a7d1fff] usable
[0.00] BIOS-e820: [mem 0x7a7d2000-0x7ab89fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x7ab8a000-0x7b942fff] reserved
[0.00] BIOS-e820: [mem 0x7b943000-0x7dff] usable
[0.00] BIOS-e820: [mem 0x7e00-0xbfff] reserved
[0.00] BIOS-e820: [mem 0xf800-0xfbff] reserved
[0.00] BIOS-e820: [mem 0xfd80-0xfdff] reserved
[0.00] BIOS-e820: [mem 0xfea0-0xfea0] reserved
[0.00] BIOS-e820: [mem 0xfeb8-0xfec01fff] reserved
[0.00] BIOS-e820: [mem 0xfec1-0xfec10fff] reserved
[0.00] BIOS-e820: [mem 0xfec3-0xfec30fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed00fff] reserved
[0.00] BIOS-e820: [mem 0xfed4-0xfed44fff] reserved
[0.00] BIOS-e820: [mem 0xfed8-0xfed8] reserved
[0.00] BIOS-e820: [mem 0xfedc2000-0xfedc] reserved
[0.00] BIOS-e820: [mem 0xfedd4000-0xfedd5fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfeef] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00023f33] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 3.1.1 present.
[0.00] DMI: System manufacturer System Product Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018
[0.00] tsc: Fast TSC calibration failed
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] last_pfn = 0x23f340 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B write-through
[0.00]   C-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base  mask 8000 write-back
[0.00]   1 base 8000 mask C000 write-back
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] TOM2: 00024000 aka 9216M
[0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[0.00] e820: update [mem 0xc000-0x] usable ==> reserved
[0.00] last_pfn = 0x7e000 max_arch_pfn = 0x4
[0.00] Scanning 1 areas for low memory corruption
[0.00] Base memory trampoline at [(ptrval)] 97000 size 24576
[0.00] BRK [0x1f83bb000, 0x1f83bbfff] PGTABLE
[0.00] BRK [0x1f83bc000, 0x1f83bcfff] PGTABLE
[0.00] BRK [0x1f83bd000, 0x1f83bdfff]