Emily Deng
Best Wishes
*From:* amd-gfx <[email protected]> *On Behalf Of* Deng, Emily
*Sent:* Wednesday, January 8, 2025 8:34 AM
*To:* Yang, Philip <[email protected]>; Kuehling, Felix <[email protected]>; [email protected]; Koenig, Christian <[email protected]>
*Subject:* RE: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue
[AMD Official Use Only - AMD Internal Distribution Only]
*From:* Yang, Philip <[email protected]>
*Sent:* Tuesday, January 7, 2025 11:19 PM
*To:* Deng, Emily <[email protected]>; Kuehling, Felix <[email protected]>; [email protected]; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
*Subject:* Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue
On 2025-01-07 07:30, Deng, Emily wrote:
[AMD Official Use Only - AMD Internal Distribution Only]
Hi Felix,
You are right, it is easy to hit a deadlock; I don't know why LOCKDEP
doesn't catch this. We need to find another solution.
Hi Philip,
Do you have a solution for this delayed freeing of page tables?
Thanks for debugging this case. I had a patch to not free the PTB bo
when unmapping from GPU, but it wastes VRAM. My test case also passed
with the TLB flush fence fix; I don't see the no-retry fault generated
any more.

The deadlock is probably from svm_range_unmap_from_gpu ->
flush_workqueue(adev->wq), which runs from the MMU notifier callback.
Actually we only need to flush pt_free_work before mapping to GPU, so
please remove the flush_workqueue in the unmap-from-GPU path. If the
deadlock still happens, please post the backtrace.
[Emily] Yes, you are right; I will try removing the flush_workqueue
from the unmap path. Will send a v3.
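Philip's suggestion above could look roughly like the sketch below (hypothetical placement with abbreviated arguments, not the actual patch): wait for the pending page-table free work only on the map path, which runs in process context, and never in the MMU-notifier-driven unmap path.

```c
/*
 * Sketch only: function names follow the thread, the body is
 * simplified. Waiting here is safe because svm_range_map_to_gpu runs
 * in process context, outside the MMU notifier, so it cannot create
 * a lock dependency against memory reclaim.
 */
static int svm_range_map_to_gpu_sketch(struct amdgpu_device *adev,
				       struct amdgpu_vm *vm)
{
	flush_work(&vm->pt_free_work);	/* wait for delayed PT frees */

	return amdgpu_vm_update_range(adev, vm, /* ... */);
}

/*
 * In svm_range_unmap_from_gpu (MMU notifier path): no flush at all,
 * since it may be called under memory-reclaim locks.
 */
```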
I think you don't need to add a new adev->wq; use the default
system_wq and flush_work.
[Emily] No, a driver is not allowed to flush system_wq; it triggers a
kernel warning, as lots of other work is queued on system_wq. I have
tried this.
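For what it's worth, the two positions may not actually conflict. My reading (an assumption, not stated explicitly in the thread) is that Philip proposed flush_work() on the single work item, which is allowed even when the item was queued on system_wq, whereas the warning comes from flushing the whole system workqueue:

```c
/* Fine: waits only for this vm's pt_free_work, wherever it was queued. */
flush_work(&vm->pt_free_work);

/*
 * Warns: newer kernels WARN when a driver flushes a system-wide
 * workqueue, because that waits on unrelated work queued by the
 * whole kernel.
 */
flush_workqueue(system_wq);
```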
Regards,
Philip
Emily Deng
Best Wishes
-----Original Message-----
From: Deng, Emily <[email protected]>
Sent: Tuesday, January 7, 2025 10:34 AM
To: Deng, Emily <[email protected]>; Kuehling, Felix <[email protected]>; [email protected]; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
Subject: RE: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue
[AMD Official Use Only - AMD Internal Distribution Only]
Ping....
How about this? Currently the issue is easy to reproduce in our
environment, and we need this change to fix it.
Emily Deng
Best Wishes
-----Original Message-----
From: amd-gfx <[email protected]> On Behalf Of Deng, Emily
Sent: Monday, January 6, 2025 9:52 AM
To: Kuehling, Felix <[email protected]>; [email protected]; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
Subject: RE: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue
[AMD Official Use Only - AMD Internal Distribution Only]
-----Original Message-----
From: Kuehling, Felix <[email protected]>
Sent: Saturday, January 4, 2025 7:18 AM
To: Deng, Emily <[email protected]>; [email protected]; Yang, Philip <[email protected]>; Koenig, Christian <[email protected]>
Subject: Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue
On 2025-01-02 21:26, Emily Deng wrote:
As the page table is freed with a delay, a bo that is about to be
freed can be reused, which causes an unexpected page fault that in
turn calls svm_range_restore_pages. Details below:

1. The following code wants to free the pt, but the bo is not freed
immediately; instead it uses "schedule_work(&vm->pt_free_work);".

[   92.276838] Call Trace:
[   92.276841]  dump_stack+0x63/0xa0
[   92.276887]  amdgpu_vm_pt_free_list+0xfb/0x120 [amdgpu]
[   92.276932]  amdgpu_vm_update_range+0x69c/0x8e0 [amdgpu]
[   92.276990]  svm_range_unmap_from_gpus+0x112/0x310 [amdgpu]
[   92.277046]  svm_range_cpu_invalidate_pagetables+0x725/0x780 [amdgpu]
[   92.277050]  ? __alloc_pages_nodemask+0x19f/0x3e0
[   92.277051]  mn_itree_invalidate+0x72/0xc0
[   92.277052]  __mmu_notifier_invalidate_range_start+0x48/0x60
[   92.277054]  migrate_vma_collect+0xf6/0x100
[   92.277055]  migrate_vma_setup+0xcf/0x120
[   92.277109]  svm_migrate_ram_to_vram+0x256/0x6b0 [amdgpu]

2. svm_range_map_to_gpu -> amdgpu_vm_update_range updates the page
table; at this point it reuses the same entry bo that step 1 wants to
free.

3. Then pt_free_work runs and frees the bo. Now GPU memory accesses
page-fault because the pt bo has been freed, and
svm_range_restore_pages is called again.

How to fix?
Add a workqueue, and flush the workqueue each time before updating the
page table.
I think this is kind of a known issue in the GPUVM code. Philip was
looking at it before.

Just flushing a workqueue may seem like a simple and elegant solution,
but I'm afraid it introduces lock dependency issues. By flushing the
workqueue, you're effectively creating a lock dependency of the MMU
notifier on any locks held inside the worker function. You now get a
circular lock dependency with any of those locks and memory reclaim.
I think LOCKDEP would be able to catch that if you compile your kernel
with that feature enabled.
The proper solution is to prevent delayed freeing of page tables if
they happened to get reused, or prevent reuse of page tables if they
are flagged for delayed freeing.
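The second alternative Felix mentions could be sketched as follows. This is purely illustrative: the struct and helper names are invented for this example and do not exist in amdgpu.

```c
#include <stdbool.h>

/* Illustrative only: pt_entry_sketch and pt_entry_reusable() are invented. */
struct pt_entry_sketch {
	bool pending_free;	/* set when the entry is queued for delayed free */
};

static bool pt_entry_reusable(const struct pt_entry_sketch *e)
{
	/* Never hand back a page table the worker is about to free. */
	return !e->pending_free;
}
```

Under this scheme the allocation path would simply skip flagged entries, so the MMU notifier never has to wait on the worker and no flush-induced lock dependency arises.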
Regards,
Felix
Thanks, I already enabled LOCKDEP while compiling the kernel. The
delayed work seems to exist for other reasons; I am not sure whether
it could be deleted completely.
Emily Deng
Best Wishes
Signed-off-by: Emily Deng <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h              | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 7 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c        | 6 +++++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c             | 3 +++
 5 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 93c352b08969..cbf68ad1c8d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1188,6 +1188,7 @@ struct amdgpu_device {
 	struct mutex			enforce_isolation_mutex;

 	struct amdgpu_init_level *init_lvl;
+	struct workqueue_struct *wq;
 };
 static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index f30548f4c3b3..5b4835bc81b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2069,6 +2069,7 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
 		if (ret)
 			goto out;
 	}
+	flush_workqueue(adev->wq);

 	ret = reserve_bo_and_vm(mem, avm, &ctx);
 	if (unlikely(ret))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9d6ffe38b48a..500d97cd9114 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2607,7 +2607,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);
 	flush_work(&vm->pt_free_work);
-
+	cancel_work_sync(&vm->pt_free_work);
 	root = amdgpu_bo_ref(vm->root.bo);
 	amdgpu_bo_reserve(root, true);
 	amdgpu_vm_put_task_info(vm->task_info);
@@ -2708,6 +2708,8 @@ void amdgpu_vm_manager_init(struct amdgpu_device *adev)
 #endif
 	xa_init_flags(&adev->vm_manager.pasids, XA_FLAGS_LOCK_IRQ);
+	adev->wq = alloc_workqueue("amdgpu_recycle",
+				   WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16);
 }

 /**
@@ -2721,7 +2723,8 @@ void amdgpu_vm_manager_fini(struct amdgpu_device *adev)
 {
 	WARN_ON(!xa_empty(&adev->vm_manager.pasids));
 	xa_destroy(&adev->vm_manager.pasids);
-
+	flush_workqueue(adev->wq);
+	destroy_workqueue(adev->wq);
 	amdgpu_vmid_mgr_fini(adev);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
index f78a0434a48f..1204406215ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
@@ -554,15 +554,19 @@ void amdgpu_vm_pt_free_work(struct work_struct *work)

 	vm = container_of(work, struct amdgpu_vm, pt_free_work);
+	printk("Emily:%s\n", __func__);
 	spin_lock(&vm->status_lock);
 	list_splice_init(&vm->pt_freed, &pt_freed);
 	spin_unlock(&vm->status_lock);
+	printk("Emily:%s 1\n", __func__);

 	/* flush_work in amdgpu_vm_fini ensure vm->root.bo is valid. */
 	amdgpu_bo_reserve(vm->root.bo, true);
+	printk("Emily:%s 2\n", __func__);

 	list_for_each_entry_safe(entry, next, &pt_freed, vm_status)
 		amdgpu_vm_pt_free(entry);
+	printk("Emily:%s 3\n", __func__);
 	amdgpu_bo_unreserve(vm->root.bo);
 }

@@ -589,7 +593,7 @@ void amdgpu_vm_pt_free_list(struct amdgpu_device *adev,
 	spin_lock(&vm->status_lock);
 	list_splice_init(&params->tlb_flush_waitlist, &vm->pt_freed);
 	spin_unlock(&vm->status_lock);
-	schedule_work(&vm->pt_free_work);
+	queue_work(adev->wq, &vm->pt_free_work);
 	return;
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3e2911895c74..55edf96d5a95 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1314,6 +1314,7 @@ svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	uint64_t init_pte_value = 0;

 	pr_debug("[0x%llx 0x%llx]\n", start, last);
+	flush_workqueue(adev->wq);
 	return amdgpu_vm_update_range(adev, vm, false, true, true, false, NULL,
 				      start, last, init_pte_value, 0, 0, NULL,
 				      NULL,
@@ -1422,6 +1423,8 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
 	 * different memory partition based on fpfn/lpfn, we should use
 	 * same vm_manager.vram_base_offset regardless memory partition.
 	 */
+	flush_workqueue(adev->wq);
+
 	r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, true,
 				   NULL, last_start, prange->start + i,
 				   pte_flags,