Re: Unhibernate gets stuck on ThinkPad T14 Gen 5 AMD

Mark Kettenis Thu, 08 Jan 2026 15:15:51 -0800

> Date: Thu, 8 Jan 2026 14:37:43 -0800
> From: Mike Larkin <[email protected]>
> 
> On Sun, Jan 04, 2026 at 02:01:07PM +0100, Mark Kettenis wrote:
> > > Date: Sat, 3 Jan 2026 20:50:23 -0800
> > > From: Mike Larkin <[email protected]>
> > >
> > > On Tue, Dec 30, 2025 at 05:20:46PM +0100, Mark Kettenis wrote:
> > > > > Date: Tue, 30 Dec 2025 07:46:16 +0100
> > > > > From: Rafael Sadowski <[email protected]>
> > > > >
> > > > > On Mon Dec 29, 2025 at 06:17:16PM -0800, [email protected] wrote:
> > > > > > >    I have the same machine and it works fine also, or at least it
> > > > > > >    did last time I tried.
> > > > > >    Does it work if you ZZZ from the text console, right after boot?
> > > > > >    -ml
> > > > >
> > > > > > Yes and no. Instead of getting stuck in the kernel boot, I end up at a
> > > > > > weird white artefact screen, and then the only thing that helps is a
> > > > > > hard reset.
> > > > >
> > > > > I also reset my BIOS settings to factory defaults. No changes except
> > > > > that my OpenBSD EFI boot entry was gone.
> > > > >
> > > > > Perhaps something with the GPU:
> > > > >
> > > > > dmesg| grep amd
> > > > >     
> > > > > [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > > > amdpmc0 at acpi0: PEP_
> > > > > amdpmc0: SMU program 0 version 76.93.0
> > > > > amdgpio0 at acpi0 GPIO uid 0 addr 0xfed81500/0x400 irq 7, 184 pins
> > > > > amdgpu0 at pci6 dev 0 function 0 "ATI Hawk Point" rev 0xd0
> > > > > drm0 at amdgpu0
> > > > > amdgpu0: msi
> > > > > amdgpu0: IP DISCOVERY GC 11.0.1 12 CU rev 0x0c
> > > > > amdgpu0: 1920x1200, 32bpp
> > > > > > wsdisplay0 at amdgpu0 mux 1: console (std, vt100 emulation), using wskbd0
> > > >
> > > > My X13 Gen 4 AMD has essentially the same GPU:
> > > >
> > > > amdgpu0 at pci5 dev 0 function 0 "ATI Phoenix" rev 0xdd
> > > > drm0 at amdgpu0
> > > > amdgpu0: msi
> > > > amdgpu0: IP DISCOVERY GC 11.0.1 12 CU rev 0x09
> > > > amdgpu0: 1920x1200, 32bpp
> > > >
> > > > Hibernate "works" on this machine but:
> > > >
> > > > * After unhibernate, the framebuffer is filled with random crap; we
> > > >   probably need to clear it in the driver somewhere.
> > > >
> > > > * After unhibernate, qwx(4) is somewhat hosed.  It works, but if you
> > > >   try to down the interface, it hangs.  It seems that the "head
> > > > >   pointer" for one of the rings gets corrupted and this makes the
> > > >   driver go into an infinite loop.  I can break into that loop using
> > > >   CTRL-ALT-ESC though (sysctl ddb.console=1).  I'm investigating this
> > > >   issue.
> > > >
> > > > * Sometimes I get a kernel that always produces a
> > > >
> > > >     "unhibernate failed: original kernel changed"
> > > >
> > > >   message.
> > > >
> > >
> > > Some comments -
> > >
> > > > 1. if unhibernate tries to unhibernate but fails (wrong kernel, etc),
> > > >    you are certainly going to have a hosed machine. This is because the
> > > >    unhibernating kernel is booting in a neutered mode where a bunch of
> > > >    devices are disabled, as well as all the APs. At best, this leads to
> > > >    a weird experience; at worst, things hang or crash later. Theo and I
> > > >    have discussed what we should do in this case, since there is no way
> > > >    to rewind autoconf and "retry". I suggested just rebooting; theo
> > > >    suggested maybe some informational panic message. I'm not sure if
> > > >    this is what you are seeing in any of the above cases, but I wanted
> > > >    to point that out.
> >
> > I'm obviously seeing this when I get the "original kernel changed"
> > failure.  I was somewhat confused why I couldn't ssh into the machine
> > at first, but yes, I realized that we booted without qwx(4) and from
> > then on just reboot when I end up in this case.
> >
> > > > 2. regarding the "original kernel changed" - the only way this happens
> > > >    is if you booted, changed your kernel, then hibernated. there is no
> > > >    other way I can see this happening. I have done this in the past and
> > > >    seen the same:
> > >
> > >     a. boot machine
> > >     b. someone asks me to test a diff, or I'm testing a diff of my own
> > > >     c. build and install new kernel, but not ready for reboot yet (doing
> > > >        other things)
> > >     d. forget I installed a new kernel and ZZZ
> > >     e. reboot, unhibernate prints that message.
> > >
> > > The code that does this signature check is:
> > >
> > >         SHA256Init(&ctx);
> > >         SHA256Update(&ctx, version, strlen(version));
> > >         fn = printf;
> > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > >         fn = malloc;
> > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > >         fn = km_alloc;
> > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > >         fn = strlen;
> > >         SHA256Update(&ctx, &fn, sizeof(fn));
> > >         SHA256Final((u_int8_t *)&hib->kern_hash, &ctx);
> > >
> > > > ... so it just fingerprints a bunch of things and then does a sha256
> > > > compare on unpack.
> > >
> > > > I don't know how to prevent that footgun however, aside from moving all
> > > > the signature checking up into the bootloader and not even attempting
> > > > the unhibernate if we see this situation. That doesn't "fix" the problem
> > > > but at least you aren't running on some halfway-autoconf'ed kernel when
> > > > it fails. Moving this stuff into the bootloader is not trivial; I tried
> > > > this in 2010-2011 and gave up.
> > >
> > > > Regarding the other things (device issues, hangs, etc), I have some
> > > > ideas on how to potentially print more information but it needs to be
> > > > coded.
> >
> > I'm 100% sure that I am booting the correct kernel.  The checksum
> > calculated by that code above is the same.  But for some reason the
> > checksum that we read back from the hibernation info on disk is
> > all-zeroes.  So something is going wrong.  Will dig deeper when I have
> > time.
> 
> just following up here -
> 
> so you obviously checked this. are you saying that the *checksum* is zero but
> somehow the magic number at the start of the signature block is still valid,
> as well as the memory range data/etc?
> 
> I'm trying to understand if the entire signature block is zeros or *just* the
> kernel checksum.
> 
> If it's entirely zero, including the field for the magic number,
> then the problem lies in the bootloader somehow thinking it's an
> unhibernate when it really isn't.
> 
> If the signature block is properly "there" but with an all-zero
> kernel checksum, then the problem is in the code that calculated
> that and wrote it out when the ZZZ happened.
> 
> Ideas?


I'm trying to narrow things down a bit further.  But after adding some
debug printfs, I haven't been able to reproduce the issue :(.
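
For reference, the two cases Mike distinguishes above (an entirely zero
signature block versus a valid block whose kernel checksum alone is zero)
could be told apart with a check along these lines.  This is only a
sketch, not the kernel's code: struct sig_block, SIG_MAGIC, HASH_LEN and
classify_sig() are made-up names for illustration, and the real on-disk
hibernate info layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values and layout, for illustration only. */
#define SIG_MAGIC 0x48494231u
#define HASH_LEN  32

struct sig_block {
	uint32_t magic;               /* magic number at the start of the block */
	uint8_t  kern_hash[HASH_LEN]; /* SHA-256 kernel fingerprint */
	uint8_t  rest[64];            /* stand-in for memory ranges, etc. */
};

enum sig_state {
	SIG_ALL_ZERO,  /* whole block is zero: nothing was written, or wrong block read */
	SIG_BAD_MAGIC, /* non-zero garbage without a valid magic number */
	SIG_HASH_ZERO, /* block looks valid, but only the kernel hash is zero */
	SIG_OK         /* magic valid and hash non-zero */
};

/* Return 1 if all len bytes at p are zero, 0 otherwise. */
static int
all_zero(const void *p, size_t len)
{
	const uint8_t *b = p;
	size_t i;

	for (i = 0; i < len; i++)
		if (b[i] != 0)
			return 0;
	return 1;
}

/* Classify a signature block that was read back from disk. */
static enum sig_state
classify_sig(const struct sig_block *blk)
{
	assert(blk != NULL);

	if (all_zero(blk, sizeof(*blk)))
		return SIG_ALL_ZERO;
	if (blk->magic != SIG_MAGIC)
		return SIG_BAD_MAGIC;
	if (all_zero(blk->kern_hash, sizeof(blk->kern_hash)))
		return SIG_HASH_ZERO;
	return SIG_OK;
}
```

In this sketch, SIG_ALL_ZERO would point at the bootloader deciding to
unhibernate when it should not, while SIG_HASH_ZERO would point at the
code that computes and writes the checksum at ZZZ time.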
