On Fri, Jan 09, 2026 at 12:15:23AM +0100, Mark Kettenis wrote:
> > Date: Thu, 8 Jan 2026 14:37:43 -0800
> > From: Mike Larkin <[email protected]>
> >
> > On Sun, Jan 04, 2026 at 02:01:07PM +0100, Mark Kettenis wrote:
> > > > Date: Sat, 3 Jan 2026 20:50:23 -0800
> > > > From: Mike Larkin <[email protected]>
> > > >
> > > > On Tue, Dec 30, 2025 at 05:20:46PM +0100, Mark Kettenis wrote:
> > > > > > Date: Tue, 30 Dec 2025 07:46:16 +0100
> > > > > > From: Rafael Sadowski <[email protected]>
> > > > > >
> > > > > > On Mon Dec 29, 2025 at 06:17:16PM -0800, [email protected] wrote:
> > > > > > >    I have the same machine and it works fine also, or at least it
> > > > > > >    did last time I tried.
> > > > > > >    Does it work if you ZZZ from the text console, right after boot?
> > > > > > >    -ml
> > > > > >
> > > > > > Yes and no. Instead of getting stuck in the kernel boot, I end up at
> > > > > > a weird white artefact screen, and then the only thing that helps is
> > > > > > a hard reset.
> > > > > >
> > > > > > I also reset my BIOS settings to factory defaults. No changes except
> > > > > > that my OpenBSD EFI boot entry was gone.
> > > > > >
> > > > > > Perhaps something with the GPU:
> > > > > >
> > > > > > dmesg| grep amd
> > > > > >     
> > > > > > [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > > > > amdpmc0 at acpi0: PEP_
> > > > > > amdpmc0: SMU program 0 version 76.93.0
> > > > > > amdgpio0 at acpi0 GPIO uid 0 addr 0xfed81500/0x400 irq 7, 184 pins
> > > > > > amdgpu0 at pci6 dev 0 function 0 "ATI Hawk Point" rev 0xd0
> > > > > > drm0 at amdgpu0
> > > > > > amdgpu0: msi
> > > > > > amdgpu0: IP DISCOVERY GC 11.0.1 12 CU rev 0x0c
> > > > > > amdgpu0: 1920x1200, 32bpp
> > > > > > wsdisplay0 at amdgpu0 mux 1: console (std, vt100 emulation), using wskbd0
> > > > >
> > > > > My X13 Gen 4 AMD has essentially the same GPU:
> > > > >
> > > > > amdgpu0 at pci5 dev 0 function 0 "ATI Phoenix" rev 0xdd
> > > > > drm0 at amdgpu0
> > > > > amdgpu0: msi
> > > > > amdgpu0: IP DISCOVERY GC 11.0.1 12 CU rev 0x09
> > > > > amdgpu0: 1920x1200, 32bpp
> > > > >
> > > > > Hibernate "works" on this machine but:
> > > > >
> > > > > * After unhibernate, the framebuffer is filled with random crap; we
> > > > >   probably need to clear it in the driver somewhere.
> > > > >
> > > > > * After unhibernate, qwx(4) is somewhat hosed.  It works, but if you
> > > > >   try to down the interface, it hangs.  It seems that the "head
> > > > >   pointer" for one of the rings gets corrupted and this makes the
> > > > >   driver go into an infinite loop.  I can break into that loop using
> > > > >   CTRL-ALT-ESC though (sysctl ddb.console=1).  I'm investigating this
> > > > >   issue.
> > > > >
> > > > > * Sometimes I get a kernel that always produces a
> > > > >
> > > > >     "unhibernate failed: original kernel changed"
> > > > >
> > > > >   message.
> > > > >
> > > >
> > > > Some comments -
> > > >
> > > > 1. if unhibernate tries to unhibernate but fails (wrong kernel, etc),
> > > >    you are certainly going to have a hosed machine. This is because the
> > > >    unhibernating kernel is booting in a neutered mode where a bunch of
> > > >    devices are disabled, as well as all the APs. At best, this leads to
> > > >    a weird experience; at worst, things hang or crash later. Theo and I
> > > >    have discussed what we should do in this case, since there is no way
> > > >    to rewind autoconf and "retry". I suggested just rebooting; Theo
> > > >    suggested maybe some informational panic message. I'm not sure if
> > > >    this is what you are seeing in any of the above cases, but I wanted
> > > >    to point that out.
> > >
> > > I'm obviously seeing this when I get the "original kernel changed"
> > > failure.  I was somewhat confused why I couldn't ssh into the machine
> > > at first, but yes, I realized that we booted without qwx(4) and from
> > > then on just reboot when I end up in this case.
> > >
> > > > 2. regarding the "original kernel changed" - the only way this happens
> > > >    is if you booted, changed your kernel, then hibernated. There is no
> > > >    other way I can see this happening. I have done this in the past and
> > > >    seen the same:
> > > >
> > > >     a. boot machine
> > > >     b. someone asks me to test a diff, or I'm testing a diff of my own
> > > >     c. build and install new kernel, but not ready for reboot yet
> > > >        (doing other things)
> > > >     d. forget I installed a new kernel and ZZZ
> > > >     e. reboot, unhibernate prints that message.
> > > >
> > > > The code that does this signature check is:
> > > >
> > > >         SHA256Init(&ctx);
> > > >         SHA256Update(&ctx, version, strlen(version));
> > > >         fn = printf;
> > > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > > >         fn = malloc;
> > > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > > >         fn = km_alloc;
> > > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > > >         fn = strlen;
> > > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > > >         SHA256Final((u_int8_t *)&hib->kern_hash, &ctx);
> > > >
> > > > ... so it just fingerprints a bunch of things and then does a sha256
> > > > compare on unpack.
> > > >
> > > > I don't know how to prevent that footgun however, aside from moving all
> > > > the signature checking up into the bootloader and not even attempting
> > > > the unhibernate if we see this situation. That doesn't "fix" the problem
> > > > but at least you aren't running on some halfway-autoconf'ed kernel when
> > > > it fails. Moving this stuff into the bootloader is not trivial; I tried
> > > > this in 2010-2011 and gave up.
> > > >
> > > > Regarding the other things (device issues, hangs, etc), I have some
> > > > ideas on how to potentially print more information but it needs to be
> > > > coded.
> > >
> > > I'm 100% sure that I am booting the correct kernel.  The checksum
> > > calculated by that code above is the same.  But for some reason the
> > > checksum that we read back from the hibernation info on disk is
> > > all-zeroes.  So something is going wrong.  Will dig deeper when I have
> > > time.
> >
> > just following up here -
> >
> > so you obviously checked this. are you saying that the *checksum* is zero
> > but somehow the magic number at the start of the signature block is still
> > valid, as well as the memory range data/etc?
> >
> > I'm trying to understand if the entire signature block is zeros or *just*
> > the kernel checksum.
> >
> > If it's entirely zero, including the field for the magic number,
> > then the problem lies in the bootloader somehow thinking it's an
> > unhibernate when it really isn't.
> >
> > If the signature block is properly "there" but with an all-zero
> > kernel checksum, then the problem is in the code that calculated
> > that and wrote it out when the ZZZ happened.
> >
> > Ideas?
>
> I'm trying to narrow things down a bit further.  But after adding some
> debug printfs, I haven't been able to reproduce the issue :(.
>

We could put in some "can't happen" checks in the bootloader, since we have
the signature block. We could do some checks like:

1. are any of the memory ranges nonsense? (we can't check exact matches
   at that point but we could check for all 0s, etc)
2. is the checksum all 0s?

etc.
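
Something like the following, as a rough sketch only. The struct layout,
field names, constants and magic value here are all made up for
illustration; they are not the real hibernate signature block, and the
real code would have to be adapted to whatever the bootloader actually
has in hand at that point:

```c
#include <stddef.h>
#include <stdint.h>

/* Everything below is hypothetical; names and values are placeholders. */
#define SIG_MAGIC	0x48494221u	/* placeholder, not the real magic */
#define SIG_MAXRANGES	8
#define SIG_HASHLEN	32		/* SHA-256 digest length in bytes */

struct sig_range {
	uint64_t base;
	uint64_t end;			/* exclusive */
};

struct sig_block {
	uint32_t magic;
	uint32_t nranges;
	struct sig_range ranges[SIG_MAXRANGES];
	uint8_t kern_hash[SIG_HASHLEN];
};

enum sig_verdict {
	SIG_OK,
	SIG_BAD_MAGIC,		/* not a hibernate signature at all */
	SIG_BAD_RANGES,		/* range count or a range is nonsense */
	SIG_ZERO_HASH		/* block looks valid, checksum is all 0s */
};

/* Return 1 if every byte of buf is zero, 0 otherwise. */
static int
all_zero(const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/* Cheap "can't happen" checks; no exact matching is attempted. */
enum sig_verdict
sig_sanity_check(const struct sig_block *sb)
{
	uint32_t i;

	if (sb->magic != SIG_MAGIC)
		return SIG_BAD_MAGIC;

	if (sb->nranges == 0 || sb->nranges > SIG_MAXRANGES)
		return SIG_BAD_RANGES;
	for (i = 0; i < sb->nranges; i++) {
		/* Catches all-zero entries as well as inverted ranges. */
		if (sb->ranges[i].end <= sb->ranges[i].base)
			return SIG_BAD_RANGES;
	}

	if (all_zero(sb->kern_hash, SIG_HASHLEN))
		return SIG_ZERO_HASH;

	return SIG_OK;
}
```

The bootloader would then skip the unhibernate on anything other than
SIG_OK. The separate verdicts map onto the two cases discussed earlier:
SIG_BAD_MAGIC would point at the bootloader wrongly thinking it's an
unhibernate, while SIG_ZERO_HASH would point at the code that calculated
the checksum and wrote it out when the ZZZ happened.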

