Re: [PATCH] drm/amdgpu: only WARN freeing buffers when DMA is unavailable

Christian König Fri, 10 Feb 2023 03:56:35 -0800

Am 10.02.23 um 12:30 schrieb Xiao, Jack:

[AMD Official Use Only - General]
>> The driver are resumed before the core Linux memory management isready to serve allocations. E.g. swap for example isn't turned on yet.
>> This means that this stuff only worked because we were able toallocate memory from the pool which isn't guaranteed in any way.
Memory allocation failure can happen at any time, every programmershould correctly handle it.

We are not talking about memory allocation failure, we are talking aboutthe kernel calling panic() because it can't properly resume.


Regards,
Christian.

If memory allocation failure is not critical error and can gracefullycontinue to run, it should be acceptable.

The memory allocation failure during mes self test should be theacceptable one. It will not make system hang up and


driver can gracefully continue to run.

Regards,

Jack

*From:* Koenig, Christian <[email protected]>
*Sent:* Friday, February 10, 2023 6:25 PM

*To:* Xiao, Jack <[email protected]>; Quan, Evan <[email protected]>;Christian König <[email protected]>;[email protected]; Deucher, Alexander<[email protected]>*Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers when DMAis unavailable


Hi Jack,

Am 10.02.23 um 10:51 schrieb Xiao, Jack:

    [AMD Official Use Only - General]

    Hi Christian,

    >> Allocating buffers temporary for stuff like that is illegal
    during resume.

    Can you **deeply** explain why it is illegal during ip late_init
    stage which is a part stage of resume?


Well no, I don't have the time to explain this to everybody individually.

[Jack] …

    In my understanding, after gmc ready, driver can allocate/free
    kernel bo, and after SDMA ready,

    the eviction should be ready. What else prevent driver doing that
    during resume?

The driver are resumed before the core Linux memory management isready to serve allocations. E.g. swap for example isn't turned on yet.

This means that this stuff only worked because we were able toallocate memory from the pool which isn't guaranteed in any way.


    >> I strongly suggest to just remove the MES test. It's abusing
    the kernel ring interface in a way we didn't want anyway and is
    currently replaced by Shahanks work.

    The kernel mes unit test is very meaningful and important to catch
    MES firmware issue at first time before

    issue went spread to other components like kfd/umd to avoid the
    problem complicated, Otherwise, the issue

    would become hard to catch and debug.

    Secondly, for mes unit test is self-containing and no dependency,
    it is a part of milestone to qualify MES ready,

    indicating that it can deliver to other component especially
    during brinup. It is likely ring test and ib test indicating

    gfx is ready to go. After totally transitioning to gfx user queue,
    mes unit test may be the only one unit test which

    can indicate gfx is ready at the very early stage of bringup when
    UMD is not ready.

Alex and I are the maintainers of the driver who are deciding stufflike that and at least I don't really buy that argument. The ring, IBand benchmark tests are in the kernel module because they are simple.

If we have a complicated unit test like simulating creating an MESuser queue to test the firmware functionality than that's reallyoverkill. Especially when you need to allocate memory for it.

We previously had people requesting to add shader code and othercomplicated testing and rejected that as well because it just bloat upthe kernel driver unnecessarily.

If we can modify the MES test to not abuse the amdgpu_ring structureonly work with memory from the SA for example we could keep this, butnot really in the current state.


Regards,
Christian.

    Regards,

    Jack

    *From:* Koenig, Christian <[email protected]>
    <mailto:[email protected]>
    *Sent:* Friday, February 10, 2023 4:08 PM
    *To:* Quan, Evan <[email protected]> <mailto:[email protected]>;
    Christian König <[email protected]>
    <mailto:[email protected]>; Xiao, Jack
    <[email protected]> <mailto:[email protected]>;
    [email protected]; Deucher, Alexander
    <[email protected]> <mailto:[email protected]>
    *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers when
    DMA is unavailable

    Hi Evan,

    yeah, exactly that's what this warning should prevent. Allocating
    buffers temporary for stuff like that is illegal during resume.

    I strongly suggest to just remove the MES test. It's abusing the
    kernel ring interface in a way we didn't want anyway and is
    currently replaced by Shahanks work.

    Regards,
    Christian.

    Am 10.02.23 um 05:12 schrieb Quan, Evan:

        [AMD Official Use Only - General]

        Hi Jack,

        Are you trying to fix the call trace popped up on resuming below?

        It seems mes created some bo for its self test and freed it up
        later at the final stage of the resuming process.

        All these happened before the in_suspend flag cleared. And
        that triggered the call trace.

        Is my understanding correct?

        [74084.799260] WARNING: CPU: 2 PID: 2891 at
        drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:425
        amdgpu_bo_free_kernel+0xfc/0x110 [amdgpu]

        [74084.811019] Modules linked in: nls_iso8859_1 amdgpu(OE)
        iommu_v2 gpu_sched drm_buddy drm_ttm_helper ttm
        drm_display_helper drm_kms_helper i2c_algo_bit fb_sys_fops
        syscopyarea sysfillrect sysimgblt snd_sm

        [74084.811042]  ip_tables x_tables autofs4 hid_logitech_hidpp
        hid_logitech_dj hid_generic e1000e usbhid ptp uas hid video
        i2c_i801 ahci pps_core crc32_pclmul i2c_smbus usb_storage
        libahci wmi

        [74084.914519] CPU: 2 PID: 2891 Comm: kworker/u16:38 Tainted:
        G        W IOE      6.0.0-custom #1

        [74084.923146] Hardware name: ASUS System Product Name/PRIME
        Z390-A, BIOS 2004 11/02/2021

        [74084.931074] Workqueue: events_unbound async_run_entry_fn

        [74084.936393] RIP: 0010:amdgpu_bo_free_kernel+0xfc/0x110 [amdgpu]

        [74084.942422] Code: 00 4d 85 ed 74 08 49 c7 45 00 00 00 00 00
        4d 85 e4 74 08 49 c7 04 24 00 00 00 00 5b 41 5c 41 5d 41 5e 41
        5f 5d c3 cc cc cc cc <0f> 0b e9 39 ff ff ff 3d 00 fe ff ff 0f
        85 75 96 47 00 ebf

        [74084.961199] RSP: 0000:ffffbed6812ebb90 EFLAGS: 00010202

        [74084.966435] RAX: 0000000000000000 RBX: ffffbed6812ebc50
        RCX: 0000000000000000

        [74084.973578] RDX: ffffbed6812ebc70 RSI: ffffbed6812ebc60
        RDI: ffffbed6812ebc50

        [74084.980725] RBP: ffffbed6812ebbb8 R08: 0000000000000000
        R09: 00000000000001ff

        [74084.987869] R10: ffffbed6812ebb40 R11: 0000000000000000
        R12: ffffbed6812ebc70

        [74084.995015] R13: ffffbed6812ebc60 R14: ffff963a2945cc00
        R15: ffff9639c7da5630

        [74085.002160] FS: 0000000000000000(0000)
        GS:ffff963d1dc80000(0000) knlGS:0000000000000000

        [74085.010262] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

        [74085.016016] CR2: 0000000000000000 CR3: 0000000377c0a001
        CR4: 00000000003706e0

        [74085.023164] DR0: 0000000000000000 DR1: 0000000000000000
        DR2: 0000000000000000

        [74085.030307] DR3: 0000000000000000 DR6: 00000000fffe0ff0
        DR7: 0000000000000400

        [74085.037453] Call Trace:

        [74085.039911]  <TASK>

        [74085.042023] amdgpu_mes_self_test+0x385/0x460 [amdgpu]

        [74085.047293] mes_v11_0_late_init+0x44/0x50 [amdgpu]

        [74085.052291] amdgpu_device_ip_late_init+0x50/0x270 [amdgpu]

        [74085.058032] amdgpu_device_resume+0xb0/0x2d0 [amdgpu]

        [74085.063187] amdgpu_pmops_resume+0x37/0x70 [amdgpu]

        [74085.068162] pci_pm_resume+0x68/0x100

        [74085.071836]  ? pci_legacy_resume+0x80/0x80

        [74085.075943] dpm_run_callback+0x4c/0x160

        [74085.079873] device_resume+0xad/0x210

        [74085.083546]  async_resume+0x1e/0x40

        [74085.087046] async_run_entry_fn+0x30/0x120

        [74085.091152] process_one_work+0x21a/0x3f0

        [74085.095173] worker_thread+0x50/0x3e0

        [74085.098845]  ? process_one_work+0x3f0/0x3f0

        [74085.103039]  kthread+0xfa/0x130

        [74085.106189]  ? kthread_complete_and_exit+0x20/0x20

        [74085.110993]  ret_from_fork+0x1f/0x30

        [74085.114576]  </TASK>

        [74085.116773] ---[ end trace 0000000000000000 ]---

        BR

        Evan

        *From:* amd-gfx <[email protected]>
        <mailto:[email protected]> *On Behalf Of
        *Christian König
        *Sent:* Monday, February 6, 2023 5:00 PM
        *To:* Xiao, Jack <[email protected]>
        <mailto:[email protected]>; Koenig, Christian
        <[email protected]> <mailto:[email protected]>;
        [email protected]; Deucher, Alexander
        <[email protected]> <mailto:[email protected]>
        *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing buffers
        when DMA is unavailable

        Am 06.02.23 um 09:28 schrieb Xiao, Jack:

            [AMD Official Use Only - General]

                           >> >> It's simply not allowed to free up
            resources during suspend since those can't be acquired
            again during resume.

            >> The in_suspend flag is set at the beginning of suspend
            and unset at the end of resume. It can’t filter the case
            you mentioned.


                           Why not? This is exactly what it should do.

            [Jack] If freeing up resources during resume, it should
            not hit the issue you described. But only checking
            in_suspend flag would take these cases as warning.


        No, once more: Freeing up or allocating resources between
        suspend and resume is illegal!

        If you free up a resource during resume you should absolutely
        hit that, this is intentional!

        Regards,
        Christian.


            Regards,

            Jack

            *From:* Koenig, Christian <[email protected]>
            <mailto:[email protected]>
            *Sent:* Monday, February 6, 2023 4:06 PM
            *To:* Xiao, Jack <[email protected]>
            <mailto:[email protected]>; Christian König
            <[email protected]>
            <mailto:[email protected]>;
            [email protected]; Deucher, Alexander
            <[email protected]> <mailto:[email protected]>
            *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing
            buffers when DMA is unavailable

            Am 06.02.23 um 08:23 schrieb Xiao, Jack:

                [AMD Official Use Only - General]

                >> Nope, that is not related to any hw state.

                can use other flag.

                >> It's simply not allowed to free up resources during
                suspend since those can't be acquired again during resume.

                The in_suspend flag is set at the beginning of suspend
                and unset at the end of resume. It can’t filter the
                case you mentioned.


            Why not? This is exactly what it should do.


                Do you know the root cause of these cases hitting the
                issue? So that we can get an exact point to warn the
                freeing up behavior.


            Well the root cause are programming errors. See between
            suspending and resuming you should not allocate nor free
            memory.

            Otherwise we can run into trouble. And this check here is
            one part of that, we should probably add another warning
            during allocation of memory. But this here is certainly
            correct.

            Regards,
            Christian.


                Thanks,

                Jack

                *From:* Christian König
                <[email protected]>
                <mailto:[email protected]>
                *Sent:* Friday, February 3, 2023 9:20 PM
                *To:* Xiao, Jack <[email protected]>
                <mailto:[email protected]>; Koenig, Christian
                <[email protected]>
                <mailto:[email protected]>;
                [email protected]; Deucher, Alexander
                <[email protected]>
                <mailto:[email protected]>
                *Subject:* Re: [PATCH] drm/amdgpu: only WARN freeing
                buffers when DMA is unavailable

                Nope, that is not related to any hw state.

                It's simply not allowed to free up resources during
                suspend since those can't be acquired again during resume.

                We had a couple of cases now where this was wrong. If
                you get a warning from that please fix the code which
                tried to free something during suspend instead.

                Regards,
                Christian.

                Am 03.02.23 um 07:04 schrieb Xiao, Jack:

                    [AMD Official Use Only - General]

                    >> It's simply illegal to free up memory during
                    suspend.

                    Why? In my understanding, the limit was caused by
                    DMA shutdown.

                    Regards,

                    Jack

                    *From:* Koenig, Christian
                    <[email protected]>
                    <mailto:[email protected]>
                    *Sent:* Thursday, February 2, 2023 7:43 PM
                    *To:* Xiao, Jack <[email protected]>
                    <mailto:[email protected]>;
                    [email protected]; Deucher, Alexander
                    <[email protected]>
                    <mailto:[email protected]>
                    *Subject:* AW: [PATCH] drm/amdgpu: only WARN
                    freeing buffers when DMA is unavailable

                    Big NAK to this! This warning is not related in
                    any way to the hw state.

                    It's simply illegal to free up memory during suspend.

                    Regards,

                    Christian.

                    
------------------------------------------------------------------------

                    *Von:*Xiao, Jack <[email protected]>
                    *Gesendet:* Donnerstag, 2. Februar 2023 10:54
                    *An:*
                    
[email protected]<[email protected]>;
                    Deucher, Alexander <[email protected]>;
                    Koenig, Christian <[email protected]>
                    *Cc:* Xiao, Jack <[email protected]>
                    *Betreff:* [PATCH] drm/amdgpu: only WARN freeing
                    buffers when DMA is unavailable

                    Reduce waringings, only warn when DMA is unavailable.

                    Signed-off-by: Jack Xiao <[email protected]>
                    ---
                     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
                     1 file changed, 2 insertions(+), 1 deletion(-)

                    diff --git
                    a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
                    b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
                    index 2d237f3d3a2e..e3e3764ea697 100644
                    --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
                    +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
                    @@ -422,7 +422,8 @@ void
                    amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64
                    *gpu_addr,
                             if (*bo == NULL)
                                     return;

                    -
                    WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend);
                    +
                    WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend
                    &&
                    +
                    
!amdgpu_ttm_adev((*bo)->tbo.bdev)->ip_blocks[AMD_IP_BLOCK_TYPE_SDMA].status.hw);

                             if (likely(amdgpu_bo_reserve(*bo, true)
                    == 0)) {
                                     if (cpu_addr)

--2.37.3

Re: [PATCH] drm/amdgpu: only WARN freeing buffers when DMA is unavailable

Reply via email to