On 2022-12-15 08:17, Harry Wentland wrote:


On 12/15/22 05:29, Michel Dänzer wrote:
On 12/15/22 09:09, Christian König wrote:
Am 15.12.22 um 00:33 schrieb Alex Hung:
On 2022-12-14 16:06, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.h...@amd.com> wrote:
On 2022-12-14 15:35, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.h...@amd.com> wrote:
On 2022-12-14 14:54, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.h...@amd.com> wrote:
On 2022-12-14 13:48, Alex Deucher wrote:
On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
<aurabindo.pil...@amd.com> wrote:

From: Alex Hung <alex.h...@amd.com>

[Why]
When running IGT kms_bw test with DP monitor, some systems crash from
msleep no matter how long or short the time is.

[How]
Replace msleep with mdelay.

Can you provide a bit more info about the crash?  A lot of platforms
don't support delay larger than 2-4ms so this change will generate
errors on ARM and possibly other platforms.

Alex

The msleep was introduced in eec3303de3378 for non-compliant display
port monitors, but IGT's kms_bw test can cause recent h/w to hang at
msleep(60) when calling "igt_remove_fb" in IGT
(https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197).
It is possible to work around this by reversing the order of the
igt_remove_fb(&buffer[i]) calls, as in the following example:

       igt_create_color_fb with the order buffer[0], buffer[1], buffer[2]

Hangs:
       igt_remove_fb with the order buffer[0], buffer[1], buffer[2]

No hangs:
       igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0]
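On the IGT side, the workaround would be roughly the following change to the
teardown loop in kms_bw.c (a sketch only; "fd", "count" and the loop shape are
placeholders, the actual names in the test may differ):

-       for (int i = 0; i < count; i++)
-               igt_remove_fb(fd, &buffer[i]);
+       for (int i = count - 1; i >= 0; i--)
+               igt_remove_fb(fd, &buffer[i]);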

However, IGT simply exposes the problem and it makes more sense to stop
the hang from occurring.

I also tried removing the msleep completely and that works too, but I
didn't want to break the fix for the original problematic hardware
configuration.

Why does sleep vs delay make a difference?  Is there some race that we
are not locking against?

Alex


That was my original thought too, but I did not find any race previously. I
will investigate it again.

If mdelay(>4) isn't usable on other platforms, is it an option to use
mdelay on x86_64 only and to keep msleep (or just remove the msleep) on
other platforms? Something like:

-                       msleep(60);
+#ifdef CONFIG_X86_64
+                       mdelay(60);
+#endif

That's pretty ugly.  I'd rather try and resolve the root cause.  How
important is the IGT test?  What does it do?  Is the test itself
correct?


Agreed, and I didn't want to add conditions around the mdelay for the
same reason. I will assume this is not an option now.

As in the previous comment, IGT can be modified to avoid the crash by
reversing the order fb is removed - though I suspect I will receive
questions why this is not fixed in kernel.

I wanted to fix this in the kernel because nothing stops other user-space
applications from crashing the kernel the same way, so fixing IGT is the
second option.

Apparently causing problems on other platforms isn't an option at all, so
I will try to figure out a non-mdelay solution, and then maybe an IGT
solution instead.

What hangs?  The test or the kernel or the hardware?

The system becomes completely unresponsive - no keyboard, mouse, or remote
access.

I agree with Alex that changing this is extremely questionable and not 
justified at all.

My educated guess is that by using mdelay() instead of msleep() we keep the CPU 
core busy, preventing something from happening at the same time as something 
else.

This clearly points to missing locking or similar to protect concurrent 
execution of things.
Might another possibility be that this code gets called from an atomic context 
which can't sleep?



It can come through handle_hpd_rx_irq but we're using a workqueue
to queue interrupt handling so this shouldn't come from an atomic
context. I currently don't see where else it might be used in an
atomic context. Alex Hung, can you do a dump_stack() in this function
to see where the problematic call is coming from?
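For reference, the instrumentation could be as simple as the following next to
the sleep in dce110_blank_stream() (a sketch; the exact file and surrounding
context may differ):

+                       dump_stack();   /* temporary: show who called us */
                        msleep(60);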


IGT's kms_bw executes as below (when passing):

IGT-Version: 1.26-gf4067678 (x86_64) (Linux: 5.19.0-99-custom x86_64)
Starting subtest: linear-tiling-1-displays-1920x1080p
Subtest linear-tiling-1-displays-1920x1080p: SUCCESS (0.225s)
Starting subtest: linear-tiling-1-displays-2560x1440p
Subtest linear-tiling-1-displays-2560x1440p: SUCCESS (0.111s)
Starting subtest: linear-tiling-1-displays-3840x2160p
Subtest linear-tiling-1-displays-3840x2160p: SUCCESS (0.118s)
Starting subtest: linear-tiling-2-displays-1920x1080p
Subtest linear-tiling-2-displays-1920x1080p: SUCCESS (0.409s)
Starting subtest: linear-tiling-2-displays-2560x1440p
Subtest linear-tiling-2-displays-2560x1440p: SUCCESS (0.417s)
Starting subtest: linear-tiling-2-displays-3840x2160p
Subtest linear-tiling-2-displays-3840x2160p: SUCCESS (0.444s)
Starting subtest: linear-tiling-3-displays-1920x1080p
Subtest linear-tiling-3-displays-1920x1080p: SUCCESS (0.547s)
Starting subtest: linear-tiling-3-displays-2560x1440p
Subtest linear-tiling-3-displays-2560x1440p: SUCCESS (0.555s)
Starting subtest: linear-tiling-3-displays-3840x2160p
Subtest linear-tiling-3-displays-3840x2160p: SUCCESS (0.586s)
Starting subtest: linear-tiling-4-displays-1920x1080p
Subtest linear-tiling-4-displays-1920x1080p: SUCCESS (0.734s)
Starting subtest: linear-tiling-4-displays-2560x1440p
Subtest linear-tiling-4-displays-2560x1440p: SUCCESS (0.742s)
Starting subtest: linear-tiling-4-displays-3840x2160p
Subtest linear-tiling-4-displays-3840x2160p: SUCCESS (0.778s)
Starting subtest: linear-tiling-5-displays-1920x1080p
Subtest linear-tiling-5-displays-1920x1080p: SUCCESS (0.734s)
Starting subtest: linear-tiling-5-displays-2560x1440p
Subtest linear-tiling-5-displays-2560x1440p: SUCCESS (0.743s)
Starting subtest: linear-tiling-5-displays-3840x2160p
Subtest linear-tiling-5-displays-3840x2160p: SUCCESS (0.781s)
Starting subtest: linear-tiling-6-displays-1920x1080p
Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
Test requirement: !(pipe > num_pipes)
ASIC does not have 5 pipes
Subtest linear-tiling-6-displays-1920x1080p: SKIP (0.000s)
Starting subtest: linear-tiling-6-displays-2560x1440p
Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
Test requirement: !(pipe > num_pipes)
ASIC does not have 5 pipes
Subtest linear-tiling-6-displays-2560x1440p: SKIP (0.000s)
Starting subtest: linear-tiling-6-displays-3840x2160p
Test requirement not met in function run_test_linear_tiling, file ../tests/kms_bw.c:156:
Test requirement: !(pipe > num_pipes)
ASIC does not have 5 pipes
Subtest linear-tiling-6-displays-3840x2160p: SKIP (0.000s)

The crash usually occurs when executing "linear-tiling-3-displays-1920x1080p" most of the time, but it can also occur at "linear-tiling-3-displays-2560x1440p".

============
This is the dump_stack output right before the failing msleep:

[IGT] kms_bw: starting subtest linear-tiling-3-displays-1920x1080p
CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126
Workqueue: events drm_mode_rmfb_work_fn [drm]
Call Trace:
 <TASK>
 dump_stack_lvl+0x49/0x63
 dump_stack+0x10/0x16
 dce110_blank_stream.cold+0x5/0x14 [amdgpu]
 core_link_disable_stream+0xe0/0x6b0 [amdgpu]
 ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
 dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
 dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
 ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
 ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
 ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
 dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
 ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
 dc_commit_state+0x10f/0x150 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]
 ? dcn30_internal_validate_bw+0x349/0xa00 [amdgpu]
 ? slab_post_alloc_hook+0x53/0x270
 ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
 ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
 ? dc_validate_global_state+0x27a/0x330 [amdgpu]
 ? slab_post_alloc_hook+0x53/0x270
 ? __kmalloc+0x18c/0x300
 ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
 ? preempt_count_add+0x7c/0xc0
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? wait_for_completion_timeout+0x114/0x140
 ? preempt_count_add+0x7c/0xc0
 ? _raw_spin_unlock_irq+0x1f/0x40
 commit_tail+0x99/0x140 [drm_kms_helper]
 drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
 drm_atomic_commit+0x79/0x100 [drm]
 ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
 drm_framebuffer_remove+0x499/0x510 [drm]
 drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
 process_one_work+0x21d/0x3f0
 worker_thread+0x1fa/0x3c0
 ? process_one_work+0x3f0/0x3f0
 kthread+0xff/0x130
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>


============
If msleep is replaced by mdelay, the dump_stack is almost the same:

$ diff mdelay.log msleep.log
<  dce110_blank_stream.cold+0x5/0x1f [amdgpu]
---
>  dce110_blank_stream.cold+0x5/0x14 [amdgpu]


============
If the IGT fix is applied (i.e. no crash when removing buffer[i] in reverse order with "igt_remove_fb"), the dump_stack outputs are the same:

$ diff msleep_igt.log msleep.log
2c2
< CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
---
> CPU: 1 PID: 76 Comm: kworker/1:1 Not tainted 5.19.0-99-custom #126

============
Note that the msleep doesn't always trigger crashes. One of the passing dump_stack outputs is as below:

[IGT] kms_bw: starting subtest linear-tiling-2-displays-1920x1080p
CPU: 6 PID: 78 Comm: kworker/6:1 Not tainted 5.19.0-99-custom #126
Workqueue: events drm_mode_rmfb_work_fn [drm]
Call Trace:
 <TASK>
 dump_stack_lvl+0x49/0x63
 dump_stack+0x10/0x16
 dce110_blank_stream.cold+0x5/0x14 [amdgpu]
 core_link_disable_stream+0xe0/0x6b0 [amdgpu]
 ? optc1_set_vtotal_min_max+0x6b/0x80 [amdgpu]
 dcn31_reset_hw_ctx_wrap+0x229/0x410 [amdgpu]
 dce110_apply_ctx_to_hw+0x6e/0x6c0 [amdgpu]
 ? dcn20_plane_atomic_disable+0xb2/0x160 [amdgpu]
 ? dcn20_disable_plane+0x2c/0x60 [amdgpu]
 ? dcn20_post_unlock_program_front_end+0x77/0x2c0 [amdgpu]
 dc_commit_state_no_check+0x39a/0xcd0 [amdgpu]
 ? dc_validate_global_state+0x2ba/0x330 [amdgpu]
 dc_commit_state+0x10f/0x150 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0x631/0x2d30 [amdgpu]
 ? debug_smp_processor_id+0x17/0x20
 ? dc_fpu_end+0x4e/0xd0 [amdgpu]
 ? dcn316_populate_dml_pipes_from_context+0x69/0x2a0 [amdgpu]
 ? dcn31_calculate_wm_and_dlg_fp+0x50/0x530 [amdgpu]
 ? kernel_fpu_end+0x26/0x50
 ? dc_fpu_end+0x4e/0xd0 [amdgpu]
 ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
 ? dcn31_validate_bandwidth+0x12c/0x2b0 [amdgpu]
 ? dc_validate_global_state+0x27a/0x330 [amdgpu]
 ? slab_post_alloc_hook+0x53/0x270
 ? __kmalloc+0x18c/0x300
 ? drm_dp_mst_atomic_setup_commit+0x8a/0x1a0 [drm_display_helper]
 ? preempt_count_add+0x7c/0xc0
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? wait_for_completion_timeout+0x114/0x140
 ? preempt_count_add+0x7c/0xc0
 ? _raw_spin_unlock_irq+0x1f/0x40
 commit_tail+0x99/0x140 [drm_kms_helper]
 drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
 drm_atomic_commit+0x79/0x100 [drm]
 ? drm_plane_get_damage_clips.cold+0x1d/0x1d [drm]
 drm_framebuffer_remove+0x499/0x510 [drm]
 drm_mode_rmfb_work_fn+0x4b/0x90 [drm]
 process_one_work+0x21d/0x3f0
 worker_thread+0x1fa/0x3c0
 ? process_one_work+0x3f0/0x3f0
 kthread+0xff/0x130
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>




Fixing IGT will only mask the issue. Userspace should never be able
to put the system in a state where it stops responding entirely. This
will need some sort of fix in the kernel.

Harry

