On 9/15/25 10:55, Michael Kelley wrote: > From: Mukesh Rathor <mrat...@linux.microsoft.com> Sent: Tuesday, September 9, > 2025 5:10 PM >> >> Introduce a small asm stub to transition from the hypervisor to linux > > I'd argue for capitalizing "Linux" here and in other places in commit > text and code comments throughout this patch set.
I'd argue against it. A quick grep indicates it is a common practice, and in the code world goes easy on the eyes :). >> upon devirtualization. > > In this patch and subsequent patches, you've used the phrase "upon > devirtualization", which seems a little vague to me. Does this mean > "when devirtualization is complete" or perhaps "when the hypervisor > completes devirtualization"? Since there's no spec on any of this, > being as precise as possible will help future readers. since control comes back to linux at the callback here, i fail to understand what is vague about it. when hyp completes devirt, devirt is complete. >> >> At a high level, during panic of either the hypervisor or the dom0 (aka >> root), the nmi handler asks hypervisor to devirtualize. > > Suggest: > > At a high level, during panic of either the hypervisor or Linux running > in dom0 (a.k.a. the root partition), the Linux NMI handler asks the > hypervisor to devirtualize. > >> As part of that, >> the arguments include an entry point to return back to linux. This asm >> stub implements that entry point. >> >> The stub is entered in protected mode, uses temporary gdt and page table >> to enable long mode and get to kernel entry point which then restores full >> kernel context to resume execution to kexec. >> >> Signed-off-by: Mukesh Rathor <mrat...@linux.microsoft.com> >> --- >> arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++ >> 1 file changed, 105 insertions(+) >> create mode 100644 arch/x86/hyperv/hv_trampoline.S >> >> diff --git a/arch/x86/hyperv/hv_trampoline.S >> b/arch/x86/hyperv/hv_trampoline.S >> new file mode 100644 >> index 000000000000..27a755401a42 >> --- /dev/null >> +++ b/arch/x86/hyperv/hv_trampoline.S >> @@ -0,0 +1,105 @@ >> +/* SPDX-License-Identifier: GPL-2.0-only */ >> +/* >> + * X86 specific Hyper-V kdump/crash related code. > > Add a qualification that this is for root partition only, and not for > general guests? i don't think it is needed, it would be odd for guests to collect hyp core. besides makefile/kconfig shows this is root vm only >> + * >> + * Copyright (C) 2025, Microsoft, Inc. >> + * >> + */ >> +#include <linux/linkage.h> >> +#include <asm/alternative.h> >> +#include <asm/msr.h> >> +#include <asm/processor-flags.h> >> +#include <asm/nospec-branch.h> >> + >> +/* >> + * void noreturn hv_crash_asm32(arg1) >> + * arg1 == edi == 32bit PA of struct hv_crash_trdata > > I think this is "struct hv_crash_tramp_data". correct >> + * >> + * The hypervisor jumps here upon devirtualization in protected mode. This >> + * code gets copied to a page in the low 4G ie, 32bit space so it can run >> + * in the protected mode. Hence we cannot use any compile/link time offsets >> or >> + * addresses. It restores long mode via temporary gdt and page tables and >> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry. >> + * >> + * PreCondition (ie, Hypervisor call back ABI): >> + * o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled >> + * o CR4 is set to 0x0 >> + * o IA32_EFER is set to 0x901 (SCE and NXE are set) >> + * o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX. >> + * o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit >> 0xFFFF >> + * o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF >> + * o LDTR is initialized as invalid (limit of 0) >> + * o MSR PAT is power on default. >> + * o Other state/registers are cleared. All TLBs flushed. > > Clarification about "Other state/registers are cleared": What about > processor features that Linux may have enabled or disabled during its > initial boot? Are those still in the states Linux set? Or are they reset to > power-on defaults? For example, if Linux enabled x2apic, is x2apic > still enabled when the stub is entered? correct, if linux set x2apic, x2apic would still be enabled. >> + * >> + * See Intel SDM 10.8.5 > > Hmmm. I downloaded the latest combined SDM, and section 10.8.5 > in Volume 3A is about Microcode Update Resources, which doesn't > seem applicable here. Other volumes don't have a section 10.8.5. google ai found it right away upon searching: intel sdm 10.8.5 ia-32e >> + */ >> + >> +#define HV_CRASHDATA_OFFS_TRAMPCR3 0x0 /* 0 */ >> +#define HV_CRASHDATA_OFFS_KERNCR3 0x8 /* 8 */ >> +#define HV_CRASHDATA_OFFS_GDTRLIMIT 0x12 /* 18 */ >> +#define HV_CRASHDATA_OFFS_CS_JMPTGT 0x28 /* 40 */ >> +#define HV_CRASHDATA_OFFS_C_entry 0x30 /* 48 */ > > It seems like these offsets should go in a #include file along > with the definition of struct hv_crash_tramp_data. Then the > BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could > check against these symbolic names instead of hardcoding > numbers that must match these. yeah, that works too and was the first cut. but given the small number of these, and that they are not used/needed anywhere else, and that they will almost never change, creating another tiny header in a non-driver directory didn't seem worth it.. but i could go either way. >> +#define HV_CRASHDATA_TRAMPOLINE_CS 0x8 > > This #define isn't used anywhere. removed >> + >> + .text >> + .code32 >> + >> +SYM_CODE_START(hv_crash_asm32) >> + UNWIND_HINT_UNDEFINED >> + ANNOTATE_NOENDBR > > No ENDBR here, presumably because this function is entered via other > than an indirect CALL or JMP instruction. Right? > >> + movl $X86_CR4_PAE, %ecx >> + movl %ecx, %cr4 >> + >> + movl %edi, %ebx >> + add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx >> + movl %cs:(%ebx), %eax >> + movl %eax, %cr3 >> + >> + # Setup EFER for long mode now. >> + movl $MSR_EFER, %ecx >> + rdmsr >> + btsl $_EFER_LME, %eax >> + wrmsr >> + >> + # Turn paging on using the temp 32bit trampoline page table. >> + movl %cr0, %eax >> + orl $(X86_CR0_PG), %eax >> + movl %eax, %cr0 >> + >> + /* since kernel cr3 could be above 4G, we need to be in the long mode >> + * before we can load 64bits of the kernel cr3. We use a temp gdt for >> + * that with CS.L=1 and CS.D=0 */ >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax >> + lgdtl %cs:(%eax) >> + >> + /* not done yet, restore CS now to switch to CS.L=1 */ >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax >> + ljmp %cs:*(%eax) >> +SYM_CODE_END(hv_crash_asm32) >> + >> + /* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */ >> + .code64 >> + .balign 8 >> +SYM_CODE_START(hv_crash_asm64) >> + UNWIND_HINT_UNDEFINED >> + ANNOTATE_NOENDBR > > But this *is* entered via an indirect JMP, right? So back to my > earlier question about the state of processor feature enablement. > If Linux enabled IBT, is it still enabled after devirtualization and > the hypervisor invokes this entry point? Linux guests on Hyper-V > have historically not enabled IBT, but patches that enable it are > now in linux-next, and will go into the 6.18 kernel. So maybe > this needs an ENDBR64. IBT would be disabled in the transition here.... so doesn't really matter. ENDBR ok too.. >> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL) >> + /* restore kernel page tables so we can jump to kernel code */ >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_KERNCR3, %eax >> + movq %cs:(%eax), %rbx >> + movq %rbx, %cr3 >> + >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_C_entry, %eax >> + movq %cs:(%eax), %rbx >> + ANNOTATE_RETPOLINE_SAFE >> + jmp *%rbx >> + >> + int $3 >> + >> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL) >> +SYM_CODE_END(hv_crash_asm64) >> -- >> 2.36.1.vfs.0.0 >> >