On 05.03.2018 at 09:08, Liu, Monk wrote:
To better approach this issue, I suggest the following:
1. Revert the original patch.

2. Stop waiting too long for writes. E.g. use a separate timeout (20ms
maybe?) to wait for the write, then do a WARN_ON_ONCE when we time out.
We cannot do that; 20ms is not enough. Sometimes you need 10 seconds, since
other VFs may be doing bad things like occupying the GPU intentionally, or
they are doing TDR. So I don't think separating read and write is a good
idea; they should be treated equally.

Well, the question is whether separating reads & writes would actually help.
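
For illustration, suggestion 2 could look roughly like the sketch below
(untested; MAX_KIQ_WRITE_WAIT is a made-up name and 20ms is just the value
floated above):

#define MAX_KIQ_WRITE_WAIT	20000 /* in usecs, 20ms -- assumed value */

void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
{
	signed long r;

	/* ... ring submission exactly as in the current function ... */

	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_WRITE_WAIT);

	/* Writes return no data to the caller, so instead of retrying we
	 * warn once; a stuck KIQ then shows up in dmesg without stalling
	 * the writer for seconds.
	 */
	WARN_ON_ONCE(r < 1);
}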


3. To the read function add an "if (!in_interrupt()) might_sleep();" and then
retest. That should at least print a nice warning when called from atomic context.
Sorry, what is might_sleep()?

4. Test the whole thing and try to fix all warnings about atomic contexts from
the might_sleep();

5. Reapply the original patch, but this time only for the read function, not 
the write function.


From the current LKG code, the only spinlock that may wrap kiq_rreg/wreg()
is pcie_idx_lock, and that lock is only used during init().
Since init() runs in exclusive mode under SRIOV, this means:
1) register access does not go through the KIQ (see amdgpu_mm_rreg), and
2) those functions only appear in the bif_medium_grain_xxx parts (vi.c and
nbio_v6.c), and they won't be hit under SRIOV (we return at the head if
SRIOV is detected).
So I don't think this spinlock can cause trouble...

Ok, in this case let's keep the patch for now, but please provide a new patch which adds "if (!in_interrupt()) might_sleep();" to both the read and the write function.

This way we should at least catch problems early on.
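
As a reference, a minimal sketch of that guard at the top of
amdgpu_virt_kiq_rreg (the same two lines would go into the write function):

uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
{
	/* Outside of IRQ context this function may retry with msleep(),
	 * so assert up front that sleeping is legal here. might_sleep()
	 * prints a warning with a backtrace when called with preemption
	 * disabled (e.g. under a spinlock), which catches bad callers
	 * early.
	 */
	if (!in_interrupt())
		might_sleep();

	/* ... existing KIQ read path unchanged ... */
}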

Christian.


/Monk



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: March 5, 2018 15:57
To: Liu, Monk <monk....@amd.com>; Koenig, Christian <christian.koe...@amd.com>; 
Kuehling, Felix <felix.kuehl...@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in IRQ(v2)

otherwise I don't see how reverting it makes things better
Well, it's better to revert it for now because it seems to create more
problems than it solves.

To better approach this issue, I suggest the following:
1. Revert the original patch.

2. Stop waiting too long for writes. E.g. use a separate timeout (20ms
maybe?) to wait for the write, then do a WARN_ON_ONCE when we time out.

3. To the read function add an "if (!in_interrupt()) might_sleep();" and then
retest. That should at least print a nice warning when called from atomic context.

4. Test the whole thing and try to fix all warnings about atomic contexts from
the might_sleep();

5. Reapply the original patch, but this time only for the read function, not 
the write function.

Regards,
Christian.

On 05.03.2018 at 05:20, Liu, Monk wrote:
When there are 16 VFs/VMs on one GPU, we can easily hit the "soft lockup -
CPU stuck for 22s" kernel error/warning introduced by the kiq_rreg/wreg
routines. That's why I need this patch to let the thread sleep a while and
try again.

If you insist on reverting this patch, please give me a solution;
otherwise I don't see how reverting it makes things better.

/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: March 3, 2018 21:38
To: Kuehling, Felix <felix.kuehl...@amd.com>; Liu, Monk
<monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
IRQ(v2)

On 02.03.2018 at 21:47, Felix Kuehling wrote:
On 2018-03-02 04:29 AM, Liu, Monk wrote:
in_atomic() is not encouraged for judging whether sleeping is possible;
see its macro definition:

#define in_atomic() (preempt_count() != 0)
OK. But my point is still that you're not testing the right thing
when you check in_interrupt(). The comment before the in_atomic macro
definition states the limitations and says "do not use in driver code".
Unfortunately it doesn't suggest any alternative. I think
in_interrupt is actually worse, because it misses even more cases than 
in_atomic.
Thinking about this, Felix seems to be absolutely right.

So we need to revert this patch, since you can't reliably detect in a driver
whether sleeping is allowed or not.
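
A hypothetical caller makes the problem concrete: with CONFIG_PREEMPT_COUNT
disabled, spin_lock() doesn't touch preempt_count at all, so neither check
can tell that sleeping is forbidden:

	unsigned long flags;
	uint32_t val;

	/* Process context, no IRQ involved: */
	spin_lock_irqsave(&adev->pcie_idx_lock, flags);

	/* in_interrupt() is false here, and on !CONFIG_PREEMPT_COUNT
	 * kernels in_atomic() is false as well -- yet the msleep() in
	 * the KIQ retry loop would be a sleeping-while-atomic bug.
	 */
	val = amdgpu_virt_kiq_rreg(adev, reg);

	spin_unlock_irqrestore(&adev->pcie_idx_lock, flags);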

Regards,
Christian.

Regards,
     Felix

/Monk

-----Original Message-----
From: Kuehling, Felix
Sent: 2018年3月1日 23:50
To: amd-gfx@lists.freedesktop.org; Liu, Monk <monk....@amd.com>
Subject: Re: [PATCH] drm/amdgpu: try again kiq access if not in
IRQ(v2)

On 2018-02-28 02:27 AM, Monk Liu wrote:
sometimes the GPU is switched to other VFs and won't switch back soon,
so the KIQ register access will not signal within a short period.
Instead of busy-waiting for a long time (MAX_KIQ_REG_WAIT) and returning
a timeout, we can sleep 5ms and try again later (in non-IRQ context).

And since the waiting in kiq_rreg/wreg is a busy wait, MAX_KIQ_REG_WAIT
shouldn't be set to a long time; 10ms is more appropriate.

If the GPU is already in the reset state, don't retry the KIQ register
access; otherwise it would always hang, because the KIQ is usually
already dead.

v2:
replace schedule() with msleep() for the wait

Change-Id: I8fc807ce85a8d30d2b50153f3f3a6eda344ef994
Signed-off-by: Monk Liu <monk....@amd.com>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
    1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index b832651..1672f5b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -22,7 +22,7 @@
     */
#include "amdgpu.h"
-#define MAX_KIQ_REG_WAIT       100000000 /* in usecs */
+#define MAX_KIQ_REG_WAIT       10000 /* in usecs, 10ms */
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
@@ -152,9 +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
        amdgpu_ring_commit(ring);
        spin_unlock_irqrestore(&kiq->ring_lock, flags);
+retry_read:
        r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
        if (r < 1) {
                DRM_ERROR("wait for kiq fence error: %ld\n", r);
+               if (!in_interrupt() && !adev->in_gpu_reset) {
You should check in_atomic() here, because it's invalid to sleep in atomic
context (e.g. while holding a spin lock) even when not in an interrupt.
This seems to happen a lot for indirect register access, e.g.
soc15_pcie_rreg.

Regards,
     Felix

+                       msleep(5);
+                       goto retry_read;
+               }
                return ~0;
        }
        val = adev->wb.wb[adev->virt.reg_val_offs];
@@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
        amdgpu_ring_commit(ring);
        spin_unlock_irqrestore(&kiq->ring_lock, flags);
+retry_write:
        r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
-       if (r < 1)
+       if (r < 1) {
                DRM_ERROR("wait for kiq fence error: %ld\n", r);
+               if (!in_interrupt() && !adev->in_gpu_reset) {
+                       msleep(5);
+                       goto retry_write;
+               }
+       }
    }
/**
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx