On Wednesday, February 24, 2021 at 5:07:36 PM UTC-5
[email protected] wrote:
On Wednesday, February 24, 2021 at 4:38:02 PM UTC-5
[email protected] wrote:
On Wednesday, February 24, 2021 at 11:42:33 AM UTC-5
[email protected] wrote:
On Tuesday, February 23, 2021 at 3:20:04 PM UTC-5 Stewart
Hildebrand wrote:
On Tuesday, February 23, 2021 at 11:50:09 AM UTC-5
Nadav Har'El wrote:
On Tue, Feb 23, 2021 at 2:00 AM Waldek Kozaczuk
<[email protected]> wrote:
On Monday, February 22, 2021 at 1:36:12 AM
UTC-5 Nadav Har'El wrote:
On Mon, Feb 22, 2021 at 7:30 AM Waldek
Kozaczuk <[email protected]> wrote:
asm volatile
("mov %%rbp, %c[rbp](%0) \n\t"
"movq $1f, %c[rip](%0) \n\t"
"mov %%rsp, %c[rsp](%0) \n\t"
"mov %c[rsp](%1), %%rsp \n\t"
"mov %c[rbp](%1), %%rbp \n\t"
"jmpq *%c[rip](%1) \n\t"
"1: \n\t"
:
: "a"(&old->_state),
"c"(&this->_state),
[rsp]"i"(offsetof(thread_state, rsp)),
[rbp]"i"(offsetof(thread_state, rbp)),
[rip]"i"(offsetof(thread_state, rip))
: "rbx", "rdx", "rsi", "rdi",
"r8", "r9",
"r10", "r11", "r12", "r13",
"r14", "r15", "memory");
Note this list of registers here! I think
(but it's been years since I looked at
this, so I'm rusty...), that the idea
is exactly to have the compiler save the
callee-saved registers, if it didn't
already. It tells the compiler to
pretend that this assembly instruction
just modified all the registers in the
list. So now we have a function
switch_to() which thinks it modifies r15
et al., so it needs to save and restore
these registers.
Perhaps exactly the same solution would
work for aarch64 as well.
This would be better than explicit
copying, because as you noted, in some
builds reschedule_from_interrupt() already
saves these registers so there is no need
to do it again.
If this is the solution, we need a comment
next to this list of registers explaining
its raison d'etre :-(
Indeed when you disassemble the x64 version of
switch_to() (this one is from a release
build), it looks like this:
Dump of assembler code for function _ZN5sched6thread9switch_toEv:
   0x00000000403f8140 <+0>:     push %rbp
   0x00000000403f8141 <+1>:     mov  %rsp,%rbp
   0x00000000403f8144 <+4>:     push %r15
   0x00000000403f8146 <+6>:     push %r14
   0x00000000403f8148 <+8>:     push %r13
   0x00000000403f814a <+10>:    push %r12
   0x00000000403f814c <+12>:    push %rbx
   ...
   0x00000000403f81e1 <+161>:   pop  %rbx
   0x00000000403f81e2 <+162>:   pop  %r12
   0x00000000403f81e4 <+164>:   pop  %r13
   0x00000000403f81e6 <+166>:   pop  %r14
   0x00000000403f81e8 <+168>:   pop  %r15
   0x00000000403f81ea <+170>:   pop  %rbp
   0x00000000403f81eb <+171>:   ret
So all callee-saved registers indeed are pushed to and popped from the stack. Some of them, like r12, are used by the body of switch_to(), but others - r13, r14, r15 - are not. This is interesting because the documentation of inline assembly - https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html - does not mention callee-saved registers. The section "6.47.2.6 Clobbers and Scratch Registers" has this statement:
"*When the compiler selects which registers to
use to represent input and output operands, it
does not use any of the clobbered registers.
As a result, clobbered registers are available
for any use in the assembler code*."
But based on how it works with x64, the clobbered-registers list is treated somewhat differently from the statement above. Maybe, in addition, it also means that any callee-saved registers in that list will be pushed to and restored from the stack around this inline assembly (more or less - the pushes actually happen in the prologue).
I think this statement only partially describes
what "clobber registers" means.
In my experience, what this list does is that it
tells the C compiler (which doesn't understand
assembly language - it's a C compiler, not an
assembler!) which registers this assembly-language
code may be using. The C compiler then needs to do
two things:
1. If the C compiler used one of these registers
for its own purposes in the enclosing function
before the assembly code, and wants to use it
after the assembly code, it will need to save it.
2. If this is one of the *callee-saved* registers, the ABI says the function *must* restore this register when it returns. So if the function now includes this (assembly) code which ruins this register, the C compiler must add code to save and restore this register.
I think this is exactly what we need to save the
callee-saved registers.
Anyway, when I follow the same idea and modify
the inline assembly in the aarch64 version of
switch_to() to contain the callee-save
registers x19-x28 like so:
diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..f0ec61f2 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -42,8 +42,9 @@ void thread::switch_to()
         : "Q"(old->_state.fp), "Ump"(old->_state.sp),
           "Ump"(this->_state.fp), "Ump"(this->_state.sp)
         : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-          "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-          "x16", "x17", "x18", "x30", "memory");
+          "x9", "x10", "x11", "x12", "x13", "x14",
+          "x19", "x20", "x21", "x22", "x23", "x24",
+          "x25", "x26", "x27", "x28", "x29", "x30", "memory");
 }
then the assembly of switch_to() in a debug build shows the new pushes and pops, as we would expect:
Dump of assembler code for function sched::thread::switch_to():
   0x000000004043717c <+0>:     stp  x29, x30, [sp, #-128]!
   0x0000000040437180 <+4>:     mov  x29, sp
   0x0000000040437184 <+8>:     stp  x19, x20, [sp, #16]
   0x0000000040437188 <+12>:    stp  x21, x22, [sp, #32]
   0x000000004043718c <+16>:    stp  x23, x24, [sp, #48]
   0x0000000040437190 <+20>:    stp  x25, x26, [sp, #64]
   0x0000000040437194 <+24>:    stp  x27, x28, [sp, #80]
   ...
   0x00000000404371ec <+112>:   ldp  x19, x20, [sp, #16]
   0x00000000404371f0 <+116>:   ldp  x21, x22, [sp, #32]
   0x00000000404371f4 <+120>:   ldp  x23, x24, [sp, #48]
   0x00000000404371f8 <+124>:   ldp  x25, x26, [sp, #64]
   0x00000000404371fc <+128>:   ldp  x27, x28, [sp, #80]
   0x0000000040437200 <+132>:   ldp  x29, x30, [sp], #128
   0x0000000040437204 <+136>:   ret
And indeed when I run the same tests that were broken in both debug builds (-O0, -O1), they now all pass, which is great.
Excellent.
Please note that I added x19-x28 registers to
the list in the inline assembly, but then I
also removed some from the original list -
x15-x18. Otherwise, the compiler would
complain with this error:
"arch/aarch64/arch-switch.hh: In member
function ‘void sched::thread::switch_to()’:
arch/aarch64/arch-switch.hh:28:5: error: ‘asm’
operand has impossible constraints
28 | asm volatile("\n"
| ^~~"
This makes sense, as the compiler needs some registers for inputs and outputs, as I understand it. But I also wonder why the original list had all the registers x0-x18, when only three of those - x0, x1, and x2 - are used in the inline assembly (shouldn't it be enough to list only x0, x1, and x2 besides x29 and x30?):
I don't know who wrote this list originally or why
that specific list.
It should list all the callee-saved registers,
plus the registers you actually use in the
instructions themselves. Not more (I think).
Wouldn't we NOT want to have x29 (frame pointer) in
the clobber list? If I understand the clobber list
correctly, the compiler uses it to determine which
registers to save/restore before/after the "asm
volatile" statement, but only if the compiler happens
to need those registers elsewhere in the
function (wild guess: does that become more likely if
the function gets inlined?). By listing x29 in the
clobber list, would we be undoing what we're trying to
accomplish with the inline assembly?
Unsure if this is related or not, but could there be an
issue with saving/restoring the FPU state (i.e. Q0-Q31)? I
took a look at the disassembly
for fpu_state_save/fpu_state_load
(arch/aarch64/processor.hh), and I noticed the compiler adds a few unwanted save/restore operations for the registers d8-d15, which are the 64-bit lower halves of the 128-bit FPU/SIMD registers q8-q15. Unless I misread the code, it appears that this is not actually restoring q8-q15 during context switch. Here's the disassembly of fpu_state_load with mode=debug:
This is weird. I see the same thing. These d8-15 registers do
NOT show up in processor::fpu_state_save() assembly.
Makes sense, since they're not in the clobber lists in
fpu_state_save(), so the compiler /shouldn't/ have an awareness
that we're modifying those registers. And in fact, we're not in
the _save() function, we're just reading them.
They do not show up in the page_fault in release build:
Actually they do
gdb -batch -ex 'file build/release/loader.elf' -ex 'disassemble page_fault'
Dump of assembler code for function page_fault(exception_frame*):
   0x000000004020b994 <+0>:     sub  sp, sp, #0x270
   0x000000004020b998 <+4>:     stp  x29, x30, [sp]
   0x000000004020b99c <+8>:     mov  x29, sp
   0x000000004020b9a0 <+12>:    stp  x19, x20, [sp, #16]
   0x000000004020b9a4 <+16>:    mov  x19, x0
   0x000000004020b9a8 <+20>:    stp  d8, d9, [sp, #32]
   0x000000004020b9ac <+24>:    stp  d10, d11, [sp, #48]
   0x000000004020b9b0 <+28>:    stp  d12, d13, [sp, #64]
   0x000000004020b9b4 <+32>:    stp  d14, d15, [sp, #80]
Here
   0x000000004020b9b8 <+36>:    stp  q0, q1, [sp, #96]
   0x000000004020b9bc <+40>:    stp  q2, q3, [sp, #128]
   0x000000004020b9c0 <+44>:    stp  q4, q5, [sp, #160]
   0x000000004020b9c4 <+48>:    stp  q6, q7, [sp, #192]
   0x000000004020b9c8 <+52>:    stp  q8, q9, [sp, #224]
   0x000000004020b9cc <+56>:    stp  q10, q11, [sp, #256]
   0x000000004020b9d0 <+60>:    stp  q12, q13, [sp, #288]
   0x000000004020b9d4 <+64>:    stp  q14, q15, [sp, #320]
   0x000000004020b9d8 <+68>:    stp  q16, q17, [sp, #352]
   0x000000004020b9dc <+72>:    stp  q18, q19, [sp, #384]
   0x000000004020b9e0 <+76>:    stp  q20, q21, [sp, #416]
   0x000000004020b9e4 <+80>:    stp  q22, q23, [sp, #448]
   0x000000004020b9e8 <+84>:    stp  q24, q25, [sp, #480]
   0x000000004020b9ec <+88>:    add  x1, sp, #0x200
   0x000000004020b9f0 <+92>:    stp  q26, q27, [x1]
   0x000000004020b9f4 <+96>:    stp  q28, q29, [x1, #32]
   0x000000004020b9f8 <+100>:   add  x1, sp, #0x200
   0x000000004020b9fc <+104>:   stp  q30, q31, [x1, #64]
   0x000000004020ba00 <+108>:   mrs  x1, fpsr
   0x000000004020ba04 <+112>:   str  w1, [sp, #608]
   0x000000004020ba08 <+116>:   mrs  x1, fpcr
   0x000000004020ba0c <+120>:   str  w1, [sp, #612]
   0x000000004020ba10 <+124>:   mrs  x20, far_el1
   0x000000004020ba14 <+128>:   bl   0x4020cb70 <fixup_fault(exception_frame*)>
   0x000000004020ba18 <+132>:   tst  w0, #0xff
   0x000000004020ba1c <+136>:   b.ne 0x4020ba60 <page_fault(exception_frame*)+204>  // b.any
   0x000000004020ba20 <+140>:   ldr  x0, [x19, #256]
   0x000000004020ba24 <+144>:   cbz  x0, 0x4020bad8 <page_fault(exception_frame*)+324>
   0x000000004020ba28 <+148>:   mrs  x0, tpidr_el0
   0x000000004020ba2c <+152>:   add  x0, x0, #0x0, lsl #12
   0x000000004020ba30 <+156>:   add  x0, x0, #0x50
   0x000000004020ba34 <+160>:   ldr  w0, [x0]
   0x000000004020ba38 <+164>:   cbnz w0, 0x4020bb04 <page_fault(exception_frame*)+368>
   0x000000004020ba3c <+168>:   ldr  x0, [x19, #264]
   0x000000004020ba40 <+172>:   tbnz w0, #7, 0x4020bae4 <page_fault(exception_frame*)+336>
   0x000000004020ba44 <+176>:   msr  daifclr, #0x2
   0x000000004020ba48 <+180>:   isb
   0x000000004020ba4c <+184>:   mov  x1, x19
   0x000000004020ba50 <+188>:   mov  x0, x20
   0x000000004020ba54 <+192>:   bl   0x401da290 <mmu::vm_fault(unsigned long, exception_frame*)>
   0x000000004020ba58 <+196>:   msr  daifset, #0x2
   0x000000004020ba5c <+200>:   isb
   0x000000004020ba60 <+204>:   ldp  q0, q1, [sp, #96]
   0x000000004020ba64 <+208>:   ldp  q2, q3, [sp, #128]
   0x000000004020ba68 <+212>:   ldp  q4, q5, [sp, #160]
   0x000000004020ba6c <+216>:   ldp  q6, q7, [sp, #192]
   0x000000004020ba70 <+220>:   ldp  q8, q9, [sp, #224]
   0x000000004020ba74 <+224>:   ldp  q10, q11, [sp, #256]
   0x000000004020ba78 <+228>:   ldp  q12, q13, [sp, #288]
   0x000000004020ba7c <+232>:   ldp  q14, q15, [sp, #320]
   0x000000004020ba80 <+236>:   ldp  q16, q17, [sp, #352]
   0x000000004020ba84 <+240>:   ldp  q18, q19, [sp, #384]
   0x000000004020ba88 <+244>:   ldp  q20, q21, [sp, #416]
   0x000000004020ba8c <+248>:   ldp  q22, q23, [sp, #448]
   0x000000004020ba90 <+252>:   ldp  q24, q25, [sp, #480]
   0x000000004020ba94 <+256>:   add  x0, sp, #0x200
   0x000000004020ba98 <+260>:   ldp  q26, q27, [x0]
   0x000000004020ba9c <+264>:   ldp  q28, q29, [x0, #32]
   0x000000004020baa0 <+268>:   add  x0, sp, #0x200
   0x000000004020baa4 <+272>:   ldp  q30, q31, [x0, #64]
   0x000000004020baa8 <+276>:   ldr  w0, [sp, #608]
   0x000000004020baac <+280>:   msr  fpsr, x0
   0x000000004020bab0 <+284>:   ldr  w0, [sp, #612]
   0x000000004020bab4 <+288>:   msr  fpcr, x0
   0x000000004020bab8 <+292>:   ldp  x29, x30, [sp]
   0x000000004020babc <+296>:   ldp  x19, x20, [sp, #16]
   0x000000004020bac0 <+300>:   ldp  d8, d9, [sp, #32]
   0x000000004020bac4 <+304>:   ldp  d10, d11, [sp, #48]
   0x000000004020bac8 <+308>:   ldp  d12, d13, [sp, #64]
   0x000000004020bacc <+312>:   ldp  d14, d15, [sp, #80]
And here
   0x000000004020bad0 <+316>:   add  sp, sp, #0x270
   0x000000004020bad4 <+320>:   ret
   0x000000004020bad8 <+324>:   adrp x0, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020badc <+328>:   add  x0, x0, #0x248
   0x000000004020bae0 <+332>:   bl   0x400e8e50 <abort(char const*, ...)>
   0x000000004020bae4 <+336>:   adrp x3, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020bae8 <+340>:   adrp x1, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020baec <+344>:   adrp x0, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020baf0 <+348>:   add  x3, x3, #0x268
   0x000000004020baf4 <+352>:   add  x1, x1, #0x278
   0x000000004020baf8 <+356>:   add  x0, x0, #0x290
   0x000000004020bafc <+360>:   mov  w2, #0x2f  // #47
   0x000000004020bb00 <+364>:   bl   0x400e8f50 <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x000000004020bb04 <+368>:   adrp x3, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020bb08 <+372>:   adrp x1, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020bb0c <+376>:   adrp x0, 0x40552000
   0x000000004020bb10 <+380>:   add  x3, x3, #0x268
   0x000000004020bb14 <+384>:   add  x1, x1, #0x278
   0x000000004020bb18 <+388>:   add  x0, x0, #0x450
   0x000000004020bb1c <+392>:   mov  w2, #0x2e  // #46
   0x000000004020bb20 <+396>:   bl   0x400e8f50 <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x000000004020bb24 <+400>:   ldp  q0, q1, [sp, #96]
   0x000000004020bb28 <+404>:   ldp  q2, q3, [sp, #128]
   0x000000004020bb2c <+408>:   ldp  q4, q5, [sp, #160]
   0x000000004020bb30 <+412>:   ldp  q6, q7, [sp, #192]
   0x000000004020bb34 <+416>:   ldp  q8, q9, [sp, #224]
   0x000000004020bb38 <+420>:   ldp  q10, q11, [sp, #256]
   0x000000004020bb3c <+424>:   ldp  q12, q13, [sp, #288]
   0x000000004020bb40 <+428>:   ldp  q14, q15, [sp, #320]
   0x000000004020bb44 <+432>:   ldp  q16, q17, [sp, #352]
   0x000000004020bb48 <+436>:   ldp  q18, q19, [sp, #384]
   0x000000004020bb4c <+440>:   ldp  q20, q21, [sp, #416]
   0x000000004020bb50 <+444>:   ldp  q22, q23, [sp, #448]
   0x000000004020bb54 <+448>:   ldp  q24, q25, [sp, #480]
   0x000000004020bb58 <+452>:   add  x1, sp, #0x200
   0x000000004020bb5c <+456>:   ldp  q26, q27, [x1]
   0x000000004020bb60 <+460>:   ldp  q28, q29, [x1, #32]
   0x000000004020bb64 <+464>:   add  x1, sp, #0x200
   0x000000004020bb68 <+468>:   ldp  q30, q31, [x1, #64]
   0x000000004020bb6c <+472>:   ldr  w1, [sp, #608]
   0x000000004020bb70 <+476>:   msr  fpsr, x1
   0x000000004020bb74 <+480>:   ldr  w1, [sp, #612]
   0x000000004020bb78 <+484>:   msr  fpcr, x1
   0x000000004020bb7c <+488>:   bl   0x40475940 <_Unwind_Resume>
   0x000000004020bb80 <+492>:   msr  daifset, #0x2
   0x000000004020bb84 <+496>:   isb
   0x000000004020bb88 <+500>:   b    0x4020bb24 <page_fault(exception_frame*)+400>
End of assembler dump.
You are correct. I missed them. It looks like I need a new pair of
glasses :-)
Compiler bug in debug mode?
No, the compiler is doing the right thing. Per the AArch64
Procedure Call Standard: "Registers v8-v15 must be preserved by a
callee across subroutine calls; ... <snip>... only the bottom
64-bits of each value stored in v8-v15 need to be preserved." I
think the best path forward here is to move fpu_state_save/restore
to an assembly file (*.S). I'll work on a patch to do this. I'm
thinking I can add them to entry.S.
Interesting. But given that this is inline assembly, which the C compiler knows nothing about (including which registers it uses), how does the compiler know it needs to push registers d8-d15 in processor::fpu_state_load() to follow the Procedure Call Standard? I think this does not happen in "normal" functions, does it?