On Sun, 4 May 2025, Marcus Glocker wrote:
> On Sun, May 04, 2025 at 09:36:48AM GMT, Marcus Glocker wrote:
> 
> > On Sun, May 04, 2025 at 09:09:37AM GMT, Marcus Glocker wrote:
> > 
> > > On Sun, May 04, 2025 at 08:28:31AM GMT, Marcus Glocker wrote:
> > > 
> > > > On Sat, May 03, 2025 at 05:34:17PM GMT, Philip Guenther wrote:
> > > > 
> > > > > On Sat, 3 May 2025, Marcus Glocker wrote:
> > > > > > On Sat, May 03, 2025 at 10:08:15PM GMT, Marcus Glocker wrote:
> > > > > > 
> > > > > > > On Sat, May 03, 2025 at 09:53:11PM GMT, Marcus Glocker wrote:
> > > > > > > 
> > > > > > > > On Sat, May 03, 2025 at 02:42:09PM GMT, George Koehler wrote:
> > > > > > > > > On Sat, 3 May 2025 08:02:29 +0200
> > > > > > > > > Marcus Glocker <mar...@nazgul.ch> wrote:
> > > > > ...
> > > > > > > > > I don't see a panic message.  I guess that you entered ddb 
> > > > > > > > > from the
> > > > > > > > > db_ktrap call at /sys/arch/amd64/amd64/trap.c:323 (below 
> > > > > > > > > we_re_toast:
> > > > > > > > > in kerntrap), but I don't know the kind of trap.  It might 
> > > > > > > > > help to
> > > > > > > > > move the trap_print call above the db_ktrap call, then build 
> > > > > > > > > a kernel
> > > > > > > > > (without your workaround patch) and reproduce.
> > > > > > > > 
> > > > > > > > Yes.  That's the output when I move trap_print(), and panic() 
> > > > > > > > (converted
> > > > > > > > to an printf) before "if (db_ktrap(type, frame->tf_err, 
> > > > > > > > frame))":
> > > > > > > > 
> > > > > > > > trashcan# halt -p
> > > > > > > > syncing disks... done
> > > > > > > > fatal trace trap in supervisor mode
> > > > > > > > trap type 5 code 0 rip ffffffff8217727d cs 8 rflags 2 cr2 
> > > > > > > > ffff80003c16fa38 cpl d rsp ffff80003c083430
> > > > > > > 
> > > > > > > "rflags 2" means that the Trap Flag (TF) is set I guess.
> > > > > 
> > > > > Hmm?  The trap flag aka PSL_T is 0x100.  0x2 is a must-be-one bit.
> > > > > 
> > > > > Yes, trap 5 is the debugging trap, but the way to see what caused
> > > > > it is to examine %dr6.  Perhaps try this on top of your diff moving
> > > > > up the trap_print() (but without your ignore-T_TRCTRAP-in-kernel
> > > > > diff), to clear %dr6 during boot and show it if its a trace-trap:
> > > > 
> > > > Thanks for the diff!  Attached the complete diff which I've applied to
> > > > test.  And this is the result:
> > > > 
> > > > trashcan# halt -p
> > > > syncing disks... done
> > > > fatal trace trap in supervisor mode
> > > > trap type 5 code 0 rip ffffffff813cb7cd cs 8 rflags 6 cr2 
> > > > ffff80003c088ec8 cpl d rsp ffff80003c17aed0
> > > > gsbase 0xffffffff829cdff0  kgsbase 0x0
> > > > dr6 ffff0ff8
> > > > Stopped at      x86_bus_space_io_write_4+0x1d:  leave
> > > > ddb{0}>
> > > > 
> > > > When I try to interpret the DR6 register value correctly, according to
> > > > the documentation, it would mean that Bit 3 B3 (Breakpoint #3 Condition
> > > > Detected) was set.  Bit 11 BLD (Bus Lock Detection) would be cleared if
> > > > detected, which doesn't seem to be the case here since it's set to 1.
> > > 
> > > And that's the content of the DR7 register (Debug Control Register)
> > > during the trap:
> > > 
> > > dr7 200004c0
> > > 
> > > Which means that local and global breakpoints #3 are enabled (bit 6 and
> > > 7).  When I clear the DR7 register as well during boot, the trap doesn't
> > > happen anymore during "halt -p".
> > 
> > And the breakpoint #3 condition (bit 29:28) is set to 10b which means:
> > 
> > +----------------------------------+
> > | Value     Break on                   |
> > +----------------------------------+
> > | 00b       Instruction execution only |
> > | 01b       Data writes only           |
> > | 10b       I/O reads and writes       | <-
> > |   (only defined if CR4.DE=1) |
> > | 11b       Data reads and writes      |
> > +----------------------------------+
> > 
> > But then it should happen for any read/write I/O, which we should have
> > plenty before the I/O write to port 0x80.  So I'm not sure why it only
> > happens at that point ...
> 
> And that's the value of the CR4 register during the trap:
> 
> cr4 1406f8
> 
> Perhaps the debugging extension bit (DE bit 3) is only set during the
> transition to the S5 state, and that's why the trap during I/O write
> only gets triggered here.
> 
> Anyway, would it be sensible to clear the DR7 register during boot in
> general to salvage machines like this one?

Yes, but not just boot: we should also reset them when coming out
of S3/S4 suspend.

My previous diff made an incorrect stab at settings the BLD bit in
%dr6, per recommendation in the SDM.  If/when someone implements
bus lock detection support they can fix up this ASM to support that;
no point in doing in now.

The latter two chunks add cpufunc.h wrappers for reading %dr[67]
to make the trap.c bit nice.

If this works for your box, Marcus, I think this is ready to go it.


Philip

Index: amd64/acpi_wakecode.S
===================================================================
RCS file: /data/src/openbsd/src/sys/arch/amd64/amd64/acpi_wakecode.S,v
diff -u -p -r1.50 acpi_wakecode.S
--- amd64/acpi_wakecode.S       25 Feb 2024 22:33:09 -0000      1.50
+++ amd64/acpi_wakecode.S       4 May 2025 23:07:50 -0000
@@ -324,6 +324,11 @@ _ACPI_TRMP_LABEL(.Lacpi_long_mode_trampo
        andb    $0xF9, 5(%rax,%rcx)
        ltr     %cx
 
+       /* Reset debug control registers */
+       xorl    %eax,%eax
+       movq    %rax,%dr6
+       movq    %rax,%dr7
+
        pushq   .Lacpi_saved_fl
        popfq
 
Index: amd64/locore0.S
===================================================================
RCS file: /data/src/openbsd/src/sys/arch/amd64/amd64/locore0.S,v
diff -u -p -r1.26 locore0.S
--- amd64/locore0.S     4 Oct 2024 21:15:52 -0000       1.26
+++ amd64/locore0.S     4 May 2025 23:08:02 -0000
@@ -193,7 +193,12 @@ bi_size_ok:
        pushl   $PSL_MBO
        popfl
 
-       xorl    %eax,%eax
+       /* Reset debug control registers */
+       xorl    %eax,%eax
+       movl    %eax,%dr6
+       movl    %eax,%dr7
+
+       /* %eax still zero from above */
        cpuid
        movl    %eax,RELOC(cpuid_level)
        movl    $RELOC(cpu_vendor),%ebp
Index: amd64/trap.c
===================================================================
RCS file: /data/src/openbsd/src/sys/arch/amd64/amd64/trap.c,v
diff -u -p -r1.106 trap.c
--- amd64/trap.c        4 Sep 2024 07:54:51 -0000       1.106
+++ amd64/trap.c        4 May 2025 23:15:58 -0000
@@ -465,6 +465,8 @@ trap_print(struct trapframe *frame, int 
            frame->tf_rflags, rcr2(), curcpu()->ci_ilevel, frame->tf_rsp);
        printf("gsbase %p  kgsbase %p\n",
            (void *)rdmsr(MSR_GSBASE), (void *)rdmsr(MSR_KERNELGSBASE));
+       if (type == T_TRCTRAP)
+               printf("dr6 %llx dr7 %llx\n", rdr6(), rdr7());
 }
 
 
Index: include/cpufunc.h
===================================================================
RCS file: /data/src/openbsd/src/sys/arch/amd64/include/cpufunc.h,v
diff -u -p -r1.43 cpufunc.h
--- include/cpufunc.h   8 Nov 2024 12:08:22 -0000       1.43
+++ include/cpufunc.h   4 May 2025 23:04:06 -0000
@@ -159,6 +159,25 @@ rcr4(void)
        return (u_int) val64;
 }
 
+/*
+ * DR6 and DR7 debug registers
+ */
+static inline uint64_t
+rdr6(void)
+{
+       u_int64_t val;
+       __asm volatile("movq %%dr6,%0" : "=r" (val));
+       return val;
+}
+
+static inline uint64_t
+rdr7(void)
+{
+       u_int64_t val;
+       __asm volatile("movq %%dr7,%0" : "=r" (val));
+       return val;
+}
+
 static __inline void
 tlbflush(void)
 {

Reply via email to