On 25/02/2021 01.49, Waldek Kozaczuk wrote:


On Wednesday, February 24, 2021 at 5:07:36 PM UTC-5 [email protected] wrote:

    On Wednesday, February 24, 2021 at 4:38:02 PM UTC-5
    [email protected] wrote:

        On Wednesday, February 24, 2021 at 11:42:33 AM UTC-5
        [email protected] wrote:

            On Tuesday, February 23, 2021 at 3:20:04 PM UTC-5 Stewart
            Hildebrand wrote:

                On Tuesday, February 23, 2021 at 11:50:09 AM UTC-5
                Nadav Har'El wrote:

                    On Tue, Feb 23, 2021 at 2:00 AM Waldek Kozaczuk
                    <[email protected]> wrote:



                        On Monday, February 22, 2021 at 1:36:12 AM
                        UTC-5 Nadav Har'El wrote:

                            On Mon, Feb 22, 2021 at 7:30 AM Waldek
                            Kozaczuk <[email protected]> wrote:
                                asm volatile
                                        ("mov %%rbp, %c[rbp](%0) \n\t"
                                         "movq $1f, %c[rip](%0) \n\t"
                                         "mov %%rsp, %c[rsp](%0) \n\t"
                                         "mov %c[rsp](%1), %%rsp \n\t"
                                         "mov %c[rbp](%1), %%rbp \n\t"
                                         "jmpq *%c[rip](%1) \n\t"
                                         "1: \n\t"
                                         :
                                         : "a"(&old->_state), "c"(&this->_state),
                                           [rsp]"i"(offsetof(thread_state, rsp)),
                                           [rbp]"i"(offsetof(thread_state, rbp)),
                                           [rip]"i"(offsetof(thread_state, rip))
                                         : "rbx", "rdx", "rsi", "rdi", "r8", "r9",
                                           "r10", "r11", "r12", "r13", "r14", "r15",
                                           "memory");


                            Note this list of registers here! I think
                            (but it's been years since I looked at
                            this, so I'm rusty...) that the idea
                            is exactly to have the compiler save the
                            callee-saved registers, if it didn't
                            already. It tells the compiler to
                            pretend that this assembly statement
                            just modified all the registers in the
                            list. So now we have a function
                            switch_to() which thinks it modifies r15
                            et al., so it needs to save and restore
                            these registers.

                            Perhaps exactly the same solution would
                            work for aarch64 as well.

                            This would be better than explicit
                            copying, because as you noted, in some
                            builds reschedule_from_interrupt() already
                            saves these registers so there is no need
                            to do it again.

                            If this is the solution, we need a comment
                            next to this list of registers explaining
                            its raison d'être :-(


                        Indeed when you disassemble the x64 version of
                        switch_to() (this one is from a release
                        build), it looks like this:

                        Dump of assembler code for function _ZN5sched6thread9switch_toEv:
                           0x00000000403f8140 <+0>:     push   %rbp
                           0x00000000403f8141 <+1>:     mov    %rsp,%rbp
                           0x00000000403f8144 <+4>:     push   %r15
                           0x00000000403f8146 <+6>:     push   %r14
                           0x00000000403f8148 <+8>:     push   %r13
                           0x00000000403f814a <+10>:    push   %r12
                           0x00000000403f814c <+12>:    push   %rbx
                        ...
                           0x00000000403f81e1 <+161>:   pop    %rbx
                           0x00000000403f81e2 <+162>:   pop    %r12
                           0x00000000403f81e4 <+164>:   pop    %r13
                           0x00000000403f81e6 <+166>:   pop    %r14
                           0x00000000403f81e8 <+168>:   pop    %r15
                           0x00000000403f81ea <+170>:   pop    %rbp
                           0x00000000403f81eb <+171>:   ret

                        So all the callee-saved registers are indeed
                        pushed to and popped from the stack. Some of
                        them, like r12, are used by the body of
                        switch_to(), but others - r13, r14, r15 - are
                        not. This is interesting because the
                        documentation of extended inline assembly -
                        https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
                        - does not mention callee-saved registers. The
                        section "6.47.2.6 Clobbers and Scratch
                        Registers" has this statement:

                        "*When the compiler selects which registers to
                        use to represent input and output operands, it
                        does not use any of the clobbered registers.
                        As a result, clobbered registers are available
                        for any use in the assembler code*."


                        But based on how it works on x64, the
                        clobbered-registers list is treated somewhat
                        differently than that statement suggests.
                        Maybe, in addition, it also means that any
                        callee-saved registers in that list will be
                        pushed to and restored from the stack around
                        this inline assembly (more or less - the
                        pushes actually happen in the function
                        prologue).


                    I think this statement only partially describes
                    what "clobber registers" means.
                    In my experience, this list tells the C compiler
                    (which doesn't understand assembly language -
                    it's a C compiler, not an assembler!) which
                    registers this assembly-language code may be
                    using. The C compiler then needs to do two things:

                     1. If the C compiler used one of these registers
                        for its own purposes in the enclosing function
                        before the assembly code, and wants to use it
                        after the assembly code, it needs to save it.
                     2. If it is one of the *callee-saved* registers,
                        the ABI says the function *must* restore this
                        register when it returns. So if the function
                        now includes this (assembly) code which ruins
                        this register, the C compiler must add code to
                        save and restore it.

                    I think this is exactly what we need to save the
                    callee-saved registers.


                        Anyway, when I follow the same idea and modify
                        the inline assembly in the aarch64 version of
                        switch_to() to contain the callee-save
                        registers x19-x28 like so:

                        diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
                        index dff7467c..f0ec61f2 100644
                        --- a/arch/aarch64/arch-switch.hh
                        +++ b/arch/aarch64/arch-switch.hh
                        @@ -42,8 +42,9 @@ void thread::switch_to()
                                          : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                                            "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                                          : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
                        -                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
                        -                   "x16", "x17", "x18", "x30", "memory");
                        +                   "x9", "x10", "x11", "x12", "x13", "x14",
                        +                   "x19", "x20", "x21", "x22", "x23", "x24",
                        +                  "x25", "x26", "x27", "x28", "x29", "x30", "memory");
                         }

                        then the assembly of switch_to() in a debug
                        build shows the new pushes and pops, as we
                        would expect:

                        Dump of assembler code for function sched::thread::switch_to():
                           0x000000004043717c <+0>:     stp   x29, x30, [sp, #-128]!
                           0x0000000040437180 <+4>:     mov   x29, sp
                           0x0000000040437184 <+8>:     stp   x19, x20, [sp, #16]
                           0x0000000040437188 <+12>:    stp   x21, x22, [sp, #32]
                           0x000000004043718c <+16>:    stp   x23, x24, [sp, #48]
                           0x0000000040437190 <+20>:    stp   x25, x26, [sp, #64]
                           0x0000000040437194 <+24>:    stp   x27, x28, [sp, #80]
                        ...
                           0x00000000404371ec <+112>:   ldp   x19, x20, [sp, #16]
                           0x00000000404371f0 <+116>:   ldp   x21, x22, [sp, #32]
                           0x00000000404371f4 <+120>:   ldp   x23, x24, [sp, #48]
                           0x00000000404371f8 <+124>:   ldp   x25, x26, [sp, #64]
                           0x00000000404371fc <+128>:   ldp   x27, x28, [sp, #80]
                           0x0000000040437200 <+132>:   ldp   x29, x30, [sp], #128
                           0x0000000040437204 <+136>:   ret

                        And indeed, when I run the same tests that
                        were broken in both debug builds (-O0, -O1),
                        they now all pass, which is great.


                    Excellent.


                        Please note that I added x19-x28 registers to
                        the list in the inline assembly, but then I
                        also removed some from the original list -
                        x15-x18. Otherwise, the compiler would
                        complain with this error:
                        "arch/aarch64/arch-switch.hh: In member
                        function ‘void sched::thread::switch_to()’:
                        arch/aarch64/arch-switch.hh:28:5: error: ‘asm’
                        operand has impossible constraints
                           28 |     asm volatile("\n"
                              |     ^~~"
                        This makes sense, as the compiler needs some
                        registers for inputs and outputs, as I
                        understand it. But I also wonder why the
                        original list had all the registers x0-x18,
                        when only 3 of those - x0, x1, and x2 - would
                        be used in the inline assembly (shouldn't it
                        be enough to list only x0, x1, and x2 besides
                        x29 and x30?):


                    I don't know who wrote this list originally or
                    why it contains that specific set of registers.
                    It should list all the callee-saved registers,
                    plus the registers actually used in the
                    instructions themselves. Not more (I think).


                Wouldn't we NOT want to have x29 (the frame pointer)
                in the clobber list? If I understand the clobber list
                correctly, the compiler uses it to determine which
                registers to save/restore before/after the "asm
                volatile" statement, but only if the compiler happens
                to need those registers elsewhere in the function
                (wild guess: does that become more likely if the
                function gets inlined?). By listing x29 in the
                clobber list, would we be undoing what we're trying
                to accomplish with the inline assembly?


            Unsure if this is related or not, but could there be an
            issue with saving/restoring the FPU state (i.e. q0-q31)?
            I took a look at the disassembly
            for fpu_state_save/fpu_state_load
            (arch/aarch64/processor.hh), and I noticed the compiler
            adds a few unwanted saves/restores of the registers
            d8-d15, which are the 64-bit views of the 128-bit
            FPU/SIMD registers q8-q15. Unless I misread the code, it
            appears that this is not actually restoring q8-q15 during
            a context switch. Here's the disassembly of fpu_state_load
            with mode=debug:

        This is weird. I see the same thing. These d8-d15 registers do
        NOT show up in the processor::fpu_state_save() assembly.


    Makes sense, since they're not in the clobber lists in
    fpu_state_save(), so the compiler /shouldn't/ be aware
    that we're modifying those registers. And in fact, in the
    _save() function we're not modifying them - we're just
    reading them.

        They do not show up in page_fault in the release build:


    Actually they do

        gdb -batch -ex 'file build/release/loader.elf' -ex
        'disassemble page_fault'
        Dump of assembler code for function page_fault(exception_frame*):
           0x000000004020b994 <+0>:     sub   sp, sp, #0x270
           0x000000004020b998 <+4>:     stp   x29, x30, [sp]
           0x000000004020b99c <+8>:     mov   x29, sp
           0x000000004020b9a0 <+12>:    stp   x19, x20, [sp, #16]
           0x000000004020b9a4 <+16>:    mov   x19, x0
           0x000000004020b9a8 <+20>:    stp   d8, d9, [sp, #32]
           0x000000004020b9ac <+24>:    stp   d10, d11, [sp, #48]
           0x000000004020b9b0 <+28>:    stp   d12, d13, [sp, #64]
           0x000000004020b9b4 <+32>:    stp   d14, d15, [sp, #80]


    Here

           0x000000004020b9b8 <+36>:    stp   q0, q1, [sp, #96]
           0x000000004020b9bc <+40>:    stp   q2, q3, [sp, #128]
           0x000000004020b9c0 <+44>:    stp   q4, q5, [sp, #160]
           0x000000004020b9c4 <+48>:    stp   q6, q7, [sp, #192]
           0x000000004020b9c8 <+52>:    stp   q8, q9, [sp, #224]
           0x000000004020b9cc <+56>:    stp   q10, q11, [sp, #256]
           0x000000004020b9d0 <+60>:    stp   q12, q13, [sp, #288]
           0x000000004020b9d4 <+64>:    stp   q14, q15, [sp, #320]
           0x000000004020b9d8 <+68>:    stp   q16, q17, [sp, #352]
           0x000000004020b9dc <+72>:    stp   q18, q19, [sp, #384]
           0x000000004020b9e0 <+76>:    stp   q20, q21, [sp, #416]
           0x000000004020b9e4 <+80>:    stp   q22, q23, [sp, #448]
           0x000000004020b9e8 <+84>:    stp   q24, q25, [sp, #480]
           0x000000004020b9ec <+88>:    add   x1, sp, #0x200
           0x000000004020b9f0 <+92>:    stp   q26, q27, [x1]
           0x000000004020b9f4 <+96>:    stp   q28, q29, [x1, #32]
           0x000000004020b9f8 <+100>:   add   x1, sp, #0x200
           0x000000004020b9fc <+104>:   stp   q30, q31, [x1, #64]
           0x000000004020ba00 <+108>:   mrs   x1, fpsr
           0x000000004020ba04 <+112>:   str   w1, [sp, #608]
           0x000000004020ba08 <+116>:   mrs   x1, fpcr
           0x000000004020ba0c <+120>:   str   w1, [sp, #612]
           0x000000004020ba10 <+124>:   mrs   x20, far_el1
           0x000000004020ba14 <+128>:   bl    0x4020cb70 <fixup_fault(exception_frame*)>
           0x000000004020ba18 <+132>:   tst   w0, #0xff
           0x000000004020ba1c <+136>:   b.ne  0x4020ba60 <page_fault(exception_frame*)+204>  // b.any
           0x000000004020ba20 <+140>:   ldr   x0, [x19, #256]
           0x000000004020ba24 <+144>:   cbz   x0, 0x4020bad8 <page_fault(exception_frame*)+324>
           0x000000004020ba28 <+148>:   mrs   x0, tpidr_el0
           0x000000004020ba2c <+152>:   add   x0, x0, #0x0, lsl #12
           0x000000004020ba30 <+156>:   add   x0, x0, #0x50
           0x000000004020ba34 <+160>:   ldr   w0, [x0]
           0x000000004020ba38 <+164>:   cbnz  w0, 0x4020bb04 <page_fault(exception_frame*)+368>
           0x000000004020ba3c <+168>:   ldr   x0, [x19, #264]
           0x000000004020ba40 <+172>:   tbnz  w0, #7, 0x4020bae4 <page_fault(exception_frame*)+336>
           0x000000004020ba44 <+176>:   msr   daifclr, #0x2
           0x000000004020ba48 <+180>:   isb
           0x000000004020ba4c <+184>:   mov   x1, x19
           0x000000004020ba50 <+188>:   mov   x0, x20
           0x000000004020ba54 <+192>:   bl    0x401da290 <mmu::vm_fault(unsigned long, exception_frame*)>
           0x000000004020ba58 <+196>:   msr   daifset, #0x2
           0x000000004020ba5c <+200>:   isb
           0x000000004020ba60 <+204>:   ldp   q0, q1, [sp, #96]
           0x000000004020ba64 <+208>:   ldp   q2, q3, [sp, #128]
           0x000000004020ba68 <+212>:   ldp   q4, q5, [sp, #160]
           0x000000004020ba6c <+216>:   ldp   q6, q7, [sp, #192]
           0x000000004020ba70 <+220>:   ldp   q8, q9, [sp, #224]
           0x000000004020ba74 <+224>:   ldp   q10, q11, [sp, #256]
           0x000000004020ba78 <+228>:   ldp   q12, q13, [sp, #288]
           0x000000004020ba7c <+232>:   ldp   q14, q15, [sp, #320]
           0x000000004020ba80 <+236>:   ldp   q16, q17, [sp, #352]
           0x000000004020ba84 <+240>:   ldp   q18, q19, [sp, #384]
           0x000000004020ba88 <+244>:   ldp   q20, q21, [sp, #416]
           0x000000004020ba8c <+248>:   ldp   q22, q23, [sp, #448]
           0x000000004020ba90 <+252>:   ldp   q24, q25, [sp, #480]
           0x000000004020ba94 <+256>:   add   x0, sp, #0x200
           0x000000004020ba98 <+260>:   ldp   q26, q27, [x0]
           0x000000004020ba9c <+264>:   ldp   q28, q29, [x0, #32]
           0x000000004020baa0 <+268>:   add   x0, sp, #0x200
           0x000000004020baa4 <+272>:   ldp   q30, q31, [x0, #64]
           0x000000004020baa8 <+276>:   ldr   w0, [sp, #608]
           0x000000004020baac <+280>:   msr   fpsr, x0
           0x000000004020bab0 <+284>:   ldr   w0, [sp, #612]
           0x000000004020bab4 <+288>:   msr   fpcr, x0
           0x000000004020bab8 <+292>:   ldp   x29, x30, [sp]
           0x000000004020babc <+296>:   ldp   x19, x20, [sp, #16]
           0x000000004020bac0 <+300>:   ldp   d8, d9, [sp, #32]
           0x000000004020bac4 <+304>:   ldp   d10, d11, [sp, #48]
           0x000000004020bac8 <+308>:   ldp   d12, d13, [sp, #64]
           0x000000004020bacc <+312>:   ldp   d14, d15, [sp, #80]


    And here

           0x000000004020bad0 <+316>:   add   sp, sp, #0x270
           0x000000004020bad4 <+320>:   ret
           0x000000004020bad8 <+324>:   adrp  x0, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
           0x000000004020badc <+328>:   add   x0, x0, #0x248
           0x000000004020bae0 <+332>:   bl    0x400e8e50 <abort(char const*, ...)>
           0x000000004020bae4 <+336>:   adrp  x3, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
           0x000000004020bae8 <+340>:   adrp  x1, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
           0x000000004020baec <+344>:   adrp  x0, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
           0x000000004020baf0 <+348>:   add   x3, x3, #0x268
           0x000000004020baf4 <+352>:   add   x1, x1, #0x278
           0x000000004020baf8 <+356>:   add   x0, x0, #0x290
           0x000000004020bafc <+360>:   mov   w2, #0x2f // #47
           0x000000004020bb00 <+364>:   bl    0x400e8f50 <__assert_fail(char const*, char const*, unsigned int, char const*)>
           0x000000004020bb04 <+368>:   adrp  x3, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
           0x000000004020bb08 <+372>:   adrp  x1, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
           0x000000004020bb0c <+376>:   adrp  x0, 0x40552000
           0x000000004020bb10 <+380>:   add   x3, x3, #0x268
           0x000000004020bb14 <+384>:   add   x1, x1, #0x278
           0x000000004020bb18 <+388>:   add   x0, x0, #0x450
           0x000000004020bb1c <+392>:   mov   w2, #0x2e // #46
           0x000000004020bb20 <+396>:   bl    0x400e8f50 <__assert_fail(char const*, char const*, unsigned int, char const*)>
           0x000000004020bb24 <+400>:   ldp   q0, q1, [sp, #96]
           0x000000004020bb28 <+404>:   ldp   q2, q3, [sp, #128]
           0x000000004020bb2c <+408>:   ldp   q4, q5, [sp, #160]
           0x000000004020bb30 <+412>:   ldp   q6, q7, [sp, #192]
           0x000000004020bb34 <+416>:   ldp   q8, q9, [sp, #224]
           0x000000004020bb38 <+420>:   ldp   q10, q11, [sp, #256]
           0x000000004020bb3c <+424>:   ldp   q12, q13, [sp, #288]
           0x000000004020bb40 <+428>:   ldp   q14, q15, [sp, #320]
           0x000000004020bb44 <+432>:   ldp   q16, q17, [sp, #352]
           0x000000004020bb48 <+436>:   ldp   q18, q19, [sp, #384]
           0x000000004020bb4c <+440>:   ldp   q20, q21, [sp, #416]
           0x000000004020bb50 <+444>:   ldp   q22, q23, [sp, #448]
           0x000000004020bb54 <+448>:   ldp   q24, q25, [sp, #480]
           0x000000004020bb58 <+452>:   add   x1, sp, #0x200
           0x000000004020bb5c <+456>:   ldp   q26, q27, [x1]
           0x000000004020bb60 <+460>:   ldp   q28, q29, [x1, #32]
           0x000000004020bb64 <+464>:   add   x1, sp, #0x200
           0x000000004020bb68 <+468>:   ldp   q30, q31, [x1, #64]
           0x000000004020bb6c <+472>:   ldr   w1, [sp, #608]
           0x000000004020bb70 <+476>:   msr   fpsr, x1
           0x000000004020bb74 <+480>:   ldr   w1, [sp, #612]
           0x000000004020bb78 <+484>:   msr   fpcr, x1
           0x000000004020bb7c <+488>:   bl    0x40475940 <_Unwind_Resume>
           0x000000004020bb80 <+492>:   msr   daifset, #0x2
           0x000000004020bb84 <+496>:   isb
           0x000000004020bb88 <+500>:   b     0x4020bb24 <page_fault(exception_frame*)+400>
        End of assembler dump.


You are correct. I missed them. It looks like I need a new pair of glasses :-)


        Compiler bug in debug mode?


    No, the compiler is doing the right thing. Per the AArch64
    Procedure Call Standard: "Registers v8-v15 must be preserved by a
    callee across subroutine calls; ... <snip>...  only the bottom
    64-bits of each value stored in v8-v15 need to be preserved." I
    think the best path forward here is to move fpu_state_save/restore
    to an assembly file (*.S). I'll work on a patch to do this. I'm
    thinking I can add them to entry.S.


Interesting. But given that this is inline assembly, which the C compiler knows nothing about (in terms of which registers are used), how does it know it needs to push registers d8-d15 in processor::fpu_state_load() to follow the Procedure Call Standard? I don't think this happens in "normal" functions, does it?



They're in the clobber lists of fpu_state_load(), so the compiler saves them.


Managing the fpu in C++ for aarch64 doesn't work, because the compiler uses those registers freely. It can happen on x86 too, as the compiler is allowed to use those registers for moving data around, but it is rarer. I think it's overall safer to move fpu handling to entry.S for exceptions.



        These functions are used by fpu_lock, which we use as an RAII
        construct letting us save/restore those FPU registers when we
        handle interrupts and page faults (see the interrupt() and
        page_fault() methods). So we save/restore those for preemptive
        switches but not for non-preemptive (cooperative) ones, which
        I think is correct. But maybe someone can complain.

I meant "explain" instead of "complain" :-)


"complain" works well in the area of cpu/asm integration arcana. But you're correct.



    I can see the thought process behind this... But if a
    non-preemptive (cooperative) context switch is essentially
    considered a subroutine call, then we need to at least
    save/restore d8-d15 to fully adhere to the Procedure Call
    Standard, because other threads might clobber those. Am I correct
    in guessing that we're not doing this? I guess I can investigate
    and try to come up with a patch for this too...

I think you are correct. This would be similar to what we have to do with the x19-x28 registers in switch_to(), which I am describing in one of the 30 replies ;-) So to sum up: in switch_to() we need to save/restore both the x19-x28 AND d8-d15 registers to make sure that in a cooperative switch they are restored properly.


Yes. And fpu_lock should be migrated out of C++, and probably be part of exception_frame.


--
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/e7750a75-53c3-aa44-24c0-d7683639c3b4%40scylladb.com.
