[Desktop-packages] [Bug 1863390] [NEW] GPU lockup ring 0 stalled for more than X msec

Jamie Bainbridge Fri, 14 Feb 2020 16:11:48 -0800

Public bug reported:

Since the update:

 xserver-xorg-video-ati-hwe-18.04 (1:19.0.1-1ubuntu1~18.04.1) bionic;

which resulted from:

 https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-ati/+bug/1841718

I've experienced GPU freezes where all video output becomes
unresponsive, including both Xorg and Ctrl+Alt terminal switching, and
the GPU fan spins up to full speed. I am still able to access the
system via SSH.

Sometimes dmesg ends up full of this message repeating over and over:

 radeon 0000:01:00.0: ring 0 stalled for more than 24040msec
 radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000009e44 last fence id 0x0000000000009e49 on ring 0)
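When dmesg fills up with these two messages, a small script can reduce them to a per-ring summary. The following is a minimal Python sketch, assuming each message occupies a single line of saved dmesg output, and reading "current fence id" as the last fence the GPU has signalled and "last fence id" as the last fence the driver has emitted (an interpretation, not something the log itself states). Under that reading, the excerpt above shows 0x9e49 - 0x9e44 = 5 fences still pending on ring 0. The helper name summarize is hypothetical.

```python
import re

# One regex per message quoted above; each is assumed to occupy a single
# line of saved dmesg output (the email wraps the second one).
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-f]+) "
    r"last fence id (0x[0-9a-f]+) on ring (\d+)\)"
)


def summarize(lines):
    """Return ({ring: longest stall in ms}, {ring: fences still pending})."""
    longest_stall = {}
    pending = {}
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            ring, ms = int(m.group(1)), int(m.group(2))
            longest_stall[ring] = max(longest_stall.get(ring, 0), ms)
        m = LOCKUP_RE.search(line)
        if m:
            current = int(m.group(1), 16)
            last = int(m.group(2), 16)
            ring = int(m.group(3))
            # Assumption: "current" = last fence signalled by the GPU,
            # "last" = last fence emitted by the driver.
            pending[ring] = last - current
    return longest_stall, pending


log = [
    "radeon 0000:01:00.0: ring 0 stalled for more than 24040msec",
    "radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000009e44 "
    "last fence id 0x0000000000009e49 on ring 0)",
]
print(summarize(log))  # ({0: 24040}, {0: 5})
```

Since a file object iterates line by line, saved output can be passed directly, e.g. summarize(open("dmesg.txt")); the file name is only illustrative.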

I sometimes get a few GPU soft resets, which seem to fail in drm(?):

 radeon 0000:01:00.0: Saved 110839 dwords of commands on ring 0.
 radeon 0000:01:00.0: GPU softreset: 0x00000008
 ...
 radeon 0000:01:00.0: Wait for MC idle timedout !
 radeon 0000:01:00.0: Wait for MC idle timedout !
 [drm] PCIE GART of 1024M enabled (table at 0x0000000000162000).
 radeon 0000:01:00.0: WB enabled 
 radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0x00000000725651ad
 radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0x00000000c3678ed8
 radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0x00000000dbd9e01b
 [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
 [drm:evergreen_resume [radeon]] *ERROR* evergreen startup failed on resume
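Two values in this excerpt can be decoded by hand. The following is a hedged Python sketch. The bit names in RESET_BITS are assumed to mirror the RADEON_RESET_* flags in the kernel's radeon.h (check them against the source of the running kernel); under that assumption, 0x00000008 requests a reset of the command processor (CP) alone. The scratch check reflects my understanding of r600_ring_test: it writes 0xCAFEDEAD to a scratch register, then submits a packet asking the CP to overwrite it with 0xDEADBEEF, so reading back 0xCAFEDEAD means the CP never processed the ring after the reset. decode_reset_mask and interpret_scratch are hypothetical helpers.

```python
# Assumed bit order of the RADEON_RESET_* flags (bit 0 first); verify
# against radeon.h for the kernel in use.
RESET_BITS = [
    "GFX", "COMPUTE", "DMA", "CP", "GRBM", "DMA1",
    "RLC", "SEM", "IH", "VMC", "MC", "DISPLAY",
]

# Sentinel written to the scratch register before the ring test, and the
# value the CP is asked to write over it.
RING_TEST_SENTINEL = 0xCAFEDEAD
RING_TEST_PASS = 0xDEADBEEF


def decode_reset_mask(mask):
    """Return the names of the blocks whose reset bits are set in mask."""
    names = [name for bit, name in enumerate(RESET_BITS) if mask & (1 << bit)]
    unknown = mask & ~((1 << len(RESET_BITS)) - 1)
    if unknown:
        names.append(f"UNKNOWN(0x{unknown:x})")
    return names


def interpret_scratch(value):
    """Explain the scratch value reported by a failed ring test."""
    if value == RING_TEST_SENTINEL:
        return "sentinel untouched: the CP never executed the test packet"
    if value == RING_TEST_PASS:
        return "test value written: the CP did execute the test packet"
    return "unexpected value"


print(decode_reset_mask(0x00000008))  # ['CP']
print(interpret_scratch(0xCAFEDEAD))
```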

Even if the above reset doesn't happen, this freeze always results in an
"unable to handle page fault" BUG in radeon_ring_backup, entered from
various call paths, e.g.:

 BUG: unable to handle page fault for address: ffffbc2d80574ffc
 ...
 Oops: 0000 [#1] SMP PTI 
 CPU: 2 PID: 11243 Comm: kworker/2:1H Not tainted 5.5.0-050500-generic #202001262030
 Workqueue: radeon-crtc radeon_flip_work_func [radeon]
 RIP: 0010:radeon_ring_backup+0xc9/0x140 [radeon]
 Call Trace:
  radeon_gpu_reset+0xc3/0x2f0 [radeon]
  radeon_flip_work_func+0x1f3/0x250 [radeon]
  ? __schedule+0x2e0/0x760
  process_one_work+0x1b5/0x370
  worker_thread+0x50/0x3d0
  kthread+0x104/0x140
  ? process_one_work+0x370/0x370
  ? kthread_park+0x90/0x90
  ret_from_fork+0x35/0x40

or:

 BUG: unable to handle page fault for address: ffffc03901000ffc
 ...
 Oops: 0000 [#1] SMP PTI
 CPU: 3 PID: 2227 Comm: compton Not tainted 5.3.0-28-generic #30~18.04.1-Ubuntu
 RIP: 0010:radeon_ring_backup+0xd3/0x140 [radeon]
 Call Trace:
  radeon_gpu_reset+0xb9/0x340 [radeon]
  ? dma_fence_wait_timeout+0x48/0x110
  ? reservation_object_wait_timeout_rcu+0x19d/0x340
  radeon_gem_handle_lockup.part.4+0xe/0x20 [radeon]
  radeon_gem_wait_idle_ioctl+0xa6/0x110 [radeon]
  ? radeon_gem_busy_ioctl+0x80/0x80 [radeon]
  drm_ioctl_kernel+0xb0/0x100 [drm]
  drm_ioctl+0x389/0x450 [drm]
  ? radeon_gem_busy_ioctl+0x80/0x80 [radeon]
  ? __switch_to_asm+0x40/0x70
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70
  ? __switch_to_asm+0x40/0x70
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70
  radeon_drm_ioctl+0x4f/0x80 [radeon]
  do_vfs_ioctl+0xa9/0x640
  ? __schedule+0x2b0/0x670
  ksys_ioctl+0x75/0x80
  __x64_sys_ioctl+0x1a/0x20
  do_syscall_64+0x5a/0x130
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
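To compare the call paths into radeon_gpu_reset, frames prefixed with "?" can be dropped, since the x86 stack dumper prints that prefix for addresses it found on the stack but could not confirm as part of the actual call chain. The following is a minimal Python sketch using shortened copies of the two traces above; reliable_frames is a hypothetical helper.

```python
import re

# A call-trace frame looks like "name+0xOFFSET/0xSIZE [module]". A leading
# "? " marks an address the unwinder could not confirm as a real caller.
FRAME_RE = re.compile(r"^\s*(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(trace):
    """Return the function names of frames without a "?" prefix, innermost first."""
    names = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m and m.group(1) is None:
            names.append(m.group(2))
    return names


# Shortened copies of the two call traces above.
flip_trace = """
  radeon_gpu_reset+0xc3/0x2f0 [radeon]
  radeon_flip_work_func+0x1f3/0x250 [radeon]
  ? __schedule+0x2e0/0x760
  process_one_work+0x1b5/0x370
"""
ioctl_trace = """
  radeon_gpu_reset+0xb9/0x340 [radeon]
  ? dma_fence_wait_timeout+0x48/0x110
  radeon_gem_handle_lockup.part.4+0xe/0x20 [radeon]
  radeon_gem_wait_idle_ioctl+0xa6/0x110 [radeon]
"""

print(reliable_frames(flip_trace))
# ['radeon_gpu_reset', 'radeon_flip_work_func', 'process_one_work']
print(reliable_frames(ioctl_trace))
# ['radeon_gpu_reset', 'radeon_gem_handle_lockup.part.4', 'radeon_gem_wait_idle_ioctl']
```

In both traces the reset, and with it radeon_ring_backup, is reached from a different caller: the page-flip worker in the first and the GEM wait-idle ioctl in the second.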

I've tried both 5.3.0-28-generic and 5.5.0-050500-generic from the
kernel-ppa, but neither made a difference. It appears to be a bug in
radeon.

Nothing specific triggers this, just regular usage with a compositing
window manager. I'm not playing games or otherwise exercising the GPU
much. The last two times I was just reading in a web browser. It has
also happened in the middle of the night while I was asleep. Sometimes
I get a few days of uptime; sometimes it happens less than 24 hours
after boot.

This never happened before the radeon update mentioned on the first
line.

I'll attach two files of dmesg output. As per
https://wiki.ubuntu.com/X/Troubleshooting/Freeze I've installed and
started apport for next time it happens.

** Affects: xserver-xorg-video-ati (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to xserver-xorg-video-ati in Ubuntu.
https://bugs.launchpad.net/bugs/1863390

Title:
  GPU lockup ring 0 stalled for more than X msec

Status in xserver-xorg-video-ati package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-ati/+bug/1863390/+subscriptions

-- 
Mailing list: https://launchpad.net/~desktop-packages
Post to     : desktop-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~desktop-packages
More help   : https://help.launchpad.net/ListHelp
