> Date: Mon, 14 Aug 2023 18:16:49 +0200
> From: Thomas Klausner <w...@netbsd.org>
>
> On Mon, Aug 14, 2023 at 12:41:06PM +0200, Thomas Klausner wrote:
> > I had followed your suggestion and bumped the heartbeat limit from 15
> > to 300, but today it paniced again.
> >
> > panic: cpu8: found cpu9 heart stopped beating and unresponsive
> >
> > I have a core dump in case you want any particular details.
> >
> > I've now switched set it to 0.
>
> and had a hard hang less than half a day later.
>
> This hasn't been happening in 10.99.5 (at least not with that
> frequency), which had uptimes of weeks, so either the heartbeat code
> introduced additional problems (even if disabled this way) or
> something else got worse, or I am really really unlucky right now.

Welp.

I don't think simply having the heartbeat(9) code around would cause a
hang -- it's new code, which is higher-risk, but the design of the code
is very low-risk (all loops are bounded; the interrupt handler and soft
interrupt handler are short and easy to audit for bounded latency; each
CPU only writes to its own per-CPU state).

I think it's more likely something else changed. Looks like it's time
to bisect over the commits since your last good build, checking at each
step whether the machine can make it a whole day without panicking or
hanging.

There have been 874 commits since I bumped 10.99.5 (which was
incidentally when I introduced heartbeat(9)), which is about ten
bisection steps, so...it should only take a week or two if the problem
takes half a day to reproduce!
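
For a rough sanity check of the "week or two" figure, here is a minimal
sketch of the arithmetic. It assumes a plain binary search over a linear
history containing exactly one bad commit; bisect_steps and
estimate_days are hypothetical helper names, and the per-test durations
(half a day to reproduce, a full day to call a build good) are the
figures mentioned in this thread.

```python
def bisect_steps(commits: int) -> int:
    """Worst-case number of builds to test when binary-searching
    `commits` candidates for the first bad one: ceil(log2(commits))."""
    if commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using
    # integer arithmetic only (no floating-point rounding).
    return (commits - 1).bit_length()


def estimate_days(commits: int, days_per_test: float) -> float:
    """Total wall-clock days if every bisection step costs `days_per_test`."""
    return bisect_steps(commits) * days_per_test


if __name__ == "__main__":
    commits = 874  # commits since the 10.99.5 bump, per this message
    print("steps:", bisect_steps(commits))
    print("days at 0.5 day/test:", estimate_days(commits, 0.5))
    print("days at 1 day/test:", estimate_days(commits, 1.0))
```

With 874 commits this works out to 10 steps, so roughly 5 days if every
test fails within half a day and 10 days if every test has to run a
full day, not counting build time.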