That would be a possibility, but I don't think reading the register once during driver startup is a real issue.

See you need to hammer on the problematic registers to trigger the misbehavior, e.g. not read it once or twice but thousands of times.

Additional to that as I said before, VM flushes are not really the problem. While investigating this I've worked around the VM flush issue by just writing the register value multiple times.

That works around the problem pretty reliable as well and would be trivial to implement, but it just doesn't fix the real issue that the hardware is dropping register writes.

And dropping register writes can have all kind of problematic effects, from performance problems because of not acknowledged interrupts, over incorrect calculation results to complete hangs.

Regards,
Christian.

Am 09.03.2018 um 04:29 schrieb Deucher, Alexander:

Can GIM read the value and pass it to the VF's via the vbios or the shared area?


Alex

------------------------------------------------------------------------
*From:* Liu, Monk
*Sent:* Thursday, March 8, 2018 9:59:34 PM
*To:* Koenig, Christian; Deucher, Alexander; Mao, David
*Cc:* Jin, Jian-Rong; [email protected]
*Subject:* RE: deprecated register issues

>Although KMD now use a cache way to give libdrm the CC_RB_BACKEND_DISABLE value, but KMD still need to read it at least one shot during loading driver right ?

Correct, but we do this while having exclusive access.

The problem is although for VF0 it is reading this register in exclusive mode, but meanwhile another VF may doing the VM flush, thus

The issue is still there as long as any VF is allowed to read this register,

It’s not related if in exclusive or not

/Monk

*From:*Koenig, Christian
*Sent:* 2018年3月8日18:51
*To:* Liu, Monk <[email protected]>; Deucher, Alexander <[email protected]>; Mao, David <[email protected]> *Cc:* Jin, Jian-Rong <[email protected]>; [email protected]
*Subject:* Re: deprecated register issues

    Although KMD now use a cache way to give libdrm the
    CC_RB_BACKEND_DISABLE value, but KMD still need to read it at
    least one shot during loading driver right ?

Correct, but we do this while having exclusive access.


    Not only tlb flush, other ring’s vm flush of VF2 can all be ruined
    by this register’s reading from VF1, right ?

Not sure about that. But as I said that problem is not limited to VM flushes, e.g. when you access this register any other register write or read can be affected.

May theory is that accessing those registers sometimes times out when the MC is to busy or something like this and this timeout in turn results in a register bus reset which affects all other ongoing operations as well.


    How you solve that ?

Stop accessing the affected registers as much as possible, there is no really any other known workaround.

The risk by accessing it when the driver loads is minimal, we just need to make sure that we don't touch them during world switch.

Regards,
Christian.

Am 08.03.2018 um 11:11 schrieb Liu, Monk:

    To be more specific let me give you an example we face with now:

    Although KMD now use a cache way to give libdrm the
    CC_RB_BACKEND_DISABLE value, but KMD still need to read it at
    least one shot during loading driver right ?

    For SR-IOV there is still chance that VF2 is doing CPU based tlb
    flush while VF1 is doing driver loading, so there are chance that
    tlb flush of VF2 was ruined by

    The driver loading (one shot CC_RB_BACKEND_DISABLE reading).

    Not only tlb flush, other ring’s vm flush of VF2 can all be ruined
    by this register’s reading from VF1, right ?

    How you solve that ?

    /Monk

    *From:*amd-gfx [mailto:[email protected]] *On
    Behalf Of *Liu, Monk
    *Sent:* 2018年3月8日18:03
    *To:* Koenig, Christian <[email protected]>
    <mailto:[email protected]>; Deucher, Alexander
    <[email protected]> <mailto:[email protected]>;
    Mao, David <[email protected]> <mailto:[email protected]>
    *Cc:* Jin, Jian-Rong <[email protected]>
    <mailto:[email protected]>; [email protected]
    <mailto:[email protected]>
    *Subject:* RE: deprecated register issues

    Yeah, I agree with you we’d better find all those registers

    Stitch together the REQ and ACK part only avoid this issue for vm
    flush only, and we may still hit the issue in other place, I knew it.

    The frustrating job is how can we find all those registers ?

    And more is since this CC_RB_BACKEND_DISABLE register is not
    deprecated (confirmed with David M), and driver indeed nee to read it

    How could we avoid this reading cause vm hub broken ?

    I believe like you said there are a bunch of registers (not really
    deprecated ) reading would cause vm hub broken, how could we still
    read

    Them when we want but also not to trigger the world switch issue ?

    looks to me there is no way to do that, even if you already find
    out all of those registers, we still need to access them, so the
    world switch (or other issues)

    are still going to fail, and that’s why I think the plan B at
    least have its reason to stand there.

    any thought ?

    /Monk

    *From:*Christian König [mailto:[email protected]]
    *Sent:* 2018年3月8日17:41
    *To:* Liu, Monk <[email protected] <mailto:[email protected]>>;
    Deucher, Alexander <[email protected]
    <mailto:[email protected]>>; Koenig, Christian
    <[email protected] <mailto:[email protected]>>; Mao,
    David <[email protected] <mailto:[email protected]>>
    *Cc:* Jin, Jian-Rong <[email protected]
    <mailto:[email protected]>>; [email protected]
    <mailto:[email protected]>
    *Subject:* Re: deprecated register issues

    Hi Monk,


        While we can avoid such vm flush failure by stitch together of
        the sending REQ and reading ACK part, at least for compute
        ring this is confirmed.

    Well there are two misunderstanding here.

    First of all this solution doesn't really work, it just hides the
    problem because we don't do a world switch in between those two
    packets any more. And while we could change the SDMA, UVD and VCE
    firmware do to something similar you can't apply this solution to
    CPU based flushes.

    The second issue is that this isn't related to VMHUB flushing at
    all, it's just that VMHUB flushing is the first thing where you
    notice that something is wrong.

    The real problem is that when you access CC_RB_BACKEND_DISABLE and
    a bunch of other registers the bus on Vega10 sometimes gets a
    hickup and drops other reads and writes.

    So we need to identify those registers and removes all accesses to
    them, otherwise working with the hardware will just be horrible
    unreliable in general.

    Regards,
    Christian.

    Am 08.03.2018 um 04:05 schrieb Liu, Monk:

        Hi Alex

        While we can avoid such vm flush failure by stitch together of
        the sending REQ and reading ACK part, at least for compute
        ring this is confirmed.

        And I believe for SDMA ring (even UVD/VCE ring) it could also
        be achieved.

        But @Koenig, Christian <mailto:[email protected]>
        insist stitching together the REQ AND ACK part is not a formal
        way to fix the issue, instead just a walkaround and I cannot
        debate that

        What make me worry more is what if there are more registers
        like Alex said that behaves like this CC_RB_BACKEND_DISABLE,

        since we don’t know their names(too hard to filter them out!)
        so we couldn’t remove them all from SR list,

        So I still think we need plan B to handle above case,  A.K.A
        use one package for the REQ and ACK job

        /Monk

        *From:*Deucher, Alexander
        *Sent:* 2018年3月8日10:53
        *To:* Liu, Monk <[email protected]> <mailto:[email protected]>;
        Koenig, Christian <[email protected]>
        <mailto:[email protected]>; Mao, David
        <[email protected]> <mailto:[email protected]>
        *Cc:* [email protected]
        <mailto:[email protected]>; Jin, Jian-Rong
        <[email protected]> <mailto:[email protected]>
        *Subject:* Re: deprecated register issues

        I think there are more than just CC_RB_BACKEND_DISABLE that
        could cause this problem.  IIRC, some entire class of gfx
        registers could cause it, it just happened that this was one
        of the only ones we readback via mmio.  Also for the save and
        restore list, I think the RLC uses a different interface to
        read back the registers so it may not be affected the same way.

        Alex

        ------------------------------------------------------------------------

        *From:*Liu, Monk
        *Sent:* Wednesday, March 7, 2018 9:42:41 PM
        *To:* Deucher, Alexander; Koenig, Christian; Mao, David
        *Cc:* [email protected]
        <mailto:[email protected]>; Jin, Jian-Rong
        *Subject:* RE: deprecated register issues

        Hi guys

        According to Christian’s found, reading this register would
        make vm hub failed to finish the vm flush request , e.g.: sdma
        is doing vm flush which first write data to vm_invalidat_req
        and read result from vm_invalidate_ack, but found driver will
        forever failed to get the correct value from vm_invalidate_ack
        if the meantime BIF is reading this CC_RB_BACKEND_DISABLE
        register.

        Now SR-IOV world switch also may get such similar trouble, see
        below save_restore_list ( during world_switch, RLCV will save
        current VF’s register according to this list and restore all
        those registers when loading back this VF)

        uint32 register_restore[] = {

        (uint32)((0x3000 << 18) | mmPA_SC_FIFO_SIZE),   /* SC   */

               0x00000001,

        *       (uint32)((0x3000 << 18) | mmCC_RB_BACKEND_DISABLE),  
        /* SC SC PER_SE  */*

        *       0x00000000,*

        *       (uint32)((0x3400 << 18) | mmCC_RB_BACKEND_DISABLE),  
        /* SC SC PER_SE  */*

        *       0x00000000,*

        *       (uint32)((0x3800 << 18) | mmCC_RB_BACKEND_DISABLE),  
        /* SC SC PER_SE  */*

        *       0x00000000,*

        *       (uint32)((0x3c00 << 18) | mmCC_RB_BACKEND_DISABLE),  
        /* SC SC PER_SE  */*

        *       0x00000000,*

        (uint32)((0x3000 << 18) | mmVGT_VTX_VECT_EJECT_REG),

               0x00000001,

        (uint32)((0x3000 << 18) | mmVGT_DMA_DATA_FIFO_DEPTH),   /* IA
        WD  */

               0x00000001,

        (uint32)((0x3000 << 18) | mmVGT_DMA_REQ_FIFO_DEPTH),   /* WD   */

               0x00000001,

        (uint32)((0x3000 << 18) | mmVGT_DRAW_INIT_FIFO_DEPTH),   /*
        WD   */

               0x00000001,

        (uint32)((0x3000 << 18) | mmVGT_CACHE_INVALIDATION),   /*  IA  */

               0x00000001,

        (uint32)((0x3000 << 18) | mmVGT_RESET_DEBUG),   /*  WD  */

               0x00000001,

        (uint32)((0x3000 << 18) | mmVGT_FIFO_DEPTHS),

        I will do some test against this CC_RB_BACKEND_DISABLE
        register, see if vm flush failure issue could be avoided by
        removing those four register from SR list

        Thanks

        /Monk

        *From:*Deucher, Alexander
        *Sent:* 2018年3月7日23:13
        *To:* Koenig, Christian <[email protected]
        <mailto:[email protected]>>; Mao, David
        <[email protected] <mailto:[email protected]>>; Liu, Monk
        <[email protected] <mailto:[email protected]>>
        *Cc:* [email protected]
        <mailto:[email protected]>; Jin, Jian-Rong
        <[email protected] <mailto:[email protected]>>
        *Subject:* Re: deprecated register issues

        Right.  We ran into issues with reading back that register at
        runtime when UMDs queried it when other stuff was in flight,
        so we just read it once at startup and cache the results. Now
        when UMDs request it, we return the cached value.

        Alex

        ------------------------------------------------------------------------

        *From:*Koenig, Christian
        *Sent:* Wednesday, March 7, 2018 9:31:13 AM
        *To:* Mao, David; Liu, Monk
        *Cc:* Deucher, Alexander; [email protected]
        <mailto:[email protected]>; Jin, Jian-Rong
        *Subject:* Re: deprecated register issues

        Hi David,

        well I just figured that this is a misunderstanding.

        Accessing this register and some other deprecated registers
        can cause problem when invalidating VMHUBs.

        This register itself isn't deprecated, the wording in a patch
        fixing things is just a bit unclear.

        Question is is that register still accessed regularly or is it
        value cached after startup?

        Regards,
        Christian.

        Am 07.03.2018 um 15:25 schrieb Mao, David:

            We requires base driver to provide the mask of disabled RB.

            This is why kernel read the CC_RB_BACKEND_DISABLE to
            collect the harvest configuration.

            Where did you get to know that the register is deprecated?

            I think it should still be there.

            Best Regards,

            David

                On Mar 7, 2018, at 9:49 PM, Liu, Monk
                <[email protected] <mailto:[email protected]>> wrote:

                + UMD guys

                Hi David

                Do you know if*GC_USER_RB_BACKEND_DISABLE is still
                exist for gfx9/vega10 ?*

                **

                *We found*CC_RB_BACKEND_DISABLE was deprecated but
                looks it is still in use in kmd, so

                I want to check with you both of above registers

                Thanks

                /Monk

                *From:*amd-gfx
                [mailto:[email protected]]*On
                Behalf Of*Christian K?nig
                *Sent:*2018年3月7日20:26
                *To:*Liu, Monk <[email protected]
                <mailto:[email protected]>>; Deucher, Alexander
                <[email protected]
                <mailto:[email protected]>>
                *Cc:*[email protected]
                <mailto:[email protected]>
                *Subject:*Re: deprecated register issues

                Hi Monk,

                I honestly don't have the slightest idea why we are
                still accessing CC_RB_BACKEND_DISABLE. Maybe it still
                contains some useful values?

                Key point was that we needed to stop accessing it all
                the time to avoid triggering problems.

                Regards,
                Christian.

                Am 07.03.2018 um 13:11 schrieb Liu, Monk:

                    Hi Christian

                    I remember you and AlexD mentioned that a handful
                    registers are deprecated for greenland (gfx9)

                    e.g. CC_RB_BACKEND_DISABLE

                    do you know why we still have this routine ?

                    staticu32

                    gfx_v9_0_get_rb_active_bitmap(structamdgpu_device
                    *adev)

                    {

                        u32 data, mask;

                        data =RREG32_SOC15(GC,

                    0, mmCC_RB_BACKEND_DISABLE);

                        data |=RREG32_SOC15(GC,

                    0, mmGC_USER_RB_BACKEND_DISABLE);

                        data &=
                    CC_RB_BACKEND_DISABLE__BACKEND_DISABLE_MASK;

                        data >>=
                    GC_USER_RB_BACKEND_DISABLE__BACKEND_DISABLE__SHIFT;

                        mask
                    
=amdgpu_gfx_create_bitmask(adev->gfx.config.max_backends_per_se/

                    adev->gfx.config.max_sh_per_se);

                        return(~data) & mask;

                    }

                    see that it still read CC_RB_BACKEND_DISABLE

                    thanks

                    /Monk




        _______________________________________________

        amd-gfx mailing list

        [email protected]
        <mailto:[email protected]>

        https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to