On Sun, 27 Dec 2020 at 08:54, Jesus Cea <j...@jcea.es> wrote:
> In the last 8 days I had four crashes of a machine with
> "joyent_20201217T173522Z". The machine hangs, no indication on screen
> (the screen shows the last content, no errors neither panic). The
> machine hangs hard, I need to reset it pressing the button.

I'm sorry to hear that!  That's a relatively recent SmartOS platform --
was the machine stable on previous versions of the SmartOS platform, and
you're only now seeing these issues since the upgrade?  If so, what was
the older version, and does reverting to the previous version return the
machine to a stable state?

> Not being able to boot the machine in this situation should be
> considered a bug. Please, fix it.

That does seem unfortunate, but is probably SmartOS specific.  I would
try filing a bug at https://github.com/joyent/smartos-live for that
specific issue.

> srvfs3 wcons login: 2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore:
> [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex,
> lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
> 2020-12-25T18:53:35.688665+00:00 srvzf3 savecore: [ID 570001 auth.error]
> reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08
> owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
> """

That message comes from savecore and suggests a crash dump may now be
available for analysis, which I think is a good next step.  If the
machine is not stable, it'd be good to try and copy it out to another
machine first.

> After that, the machine hangs. No automatic reboot, it need a hard reset.
>
> (talking with the operator, this picture was send yesterday after the
> server crash, but it showing errors from the 25th, maybe it is referring
> to a PREVIOUS crash).
>
> I am quite surprised about the "auth.error" messages. This machine is a
> NFS server not connected to internet. I don't now if it is relevant.

I wouldn't worry about the "auth.error" business.  It seems that
savecore(1M) has chosen that level for perhaps dubious reasons:

https://code.illumos.org/plugins/gitiles/illumos-gate/+/refs/heads/master/usr/src/cmd/savecore/savecore.c#1782

(see the comment a little further up that talks about the syslog
facility.)

> I hope this is somewhat useful to anybody.
> Please, let me know how to go deeper debugging this.

Getting the crash dump out seems like a good start, as well as figuring
out whether this is new instability as a result of an OS upgrade or if
it might be a hardware issue.  You might also want to see if there are
hardware errors being logged by FMA, with something like "fmdump -e".

Cheers.

-- 
Joshua M. Clulow
http://blog.sysmgr.org

------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/T5267a2cbac027109-M0dbcf51a178f4ce166524925
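
As a rough aid while collecting evidence, the savecore lines quoted
above could be parsed to check whether repeated console messages
describe the same panic or different ones.  This is only a sketch, not
part of savecore or any illumos tool: the names PANIC_RE, Panic and
parse_panics are made up for illustration, and the pattern assumes
nothing beyond the "reason, lp=... owner=... thread=..." layout visible
in the quoted message.

```python
import re
from typing import List, NamedTuple

# Hypothetical pattern: matches only the layout seen in the quoted
# savecore message ("reboot after panic: <reason>, lp=.. owner=.. thread=..").
PANIC_RE = re.compile(
    r"reboot after panic:\s+(?P<reason>.+?),\s+"
    r"lp=(?P<lp>[0-9a-f]+)\s+"
    r"owner=(?P<owner>[0-9a-f]+)\s+"
    r"thread=(?P<thread>[0-9a-f]+)"
)


class Panic(NamedTuple):
    """One panic summary; the addresses are stored as integers."""
    reason: str
    lp: int
    owner: int
    thread: int


def parse_panics(text: str) -> List[Panic]:
    """Return every panic summary found in text, in order of appearance."""
    return [
        Panic(
            reason=m["reason"],
            lp=int(m["lp"], 16),
            owner=int(m["owner"], 16),
            thread=int(m["thread"], 16),
        )
        for m in PANIC_RE.finditer(text)
    ]


# The two console lines from the quoted report, joined back onto one line each.
SAMPLE = (
    "2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore: "
    "[ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, "
    "lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0\n"
    "2020-12-25T18:53:35.688665+00:00 srvzf3 savecore: "
    "[ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, "
    "lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0\n"
)

if __name__ == "__main__":
    panics = parse_panics(SAMPLE)
    # Identical tuples collapse in a set, so this counts distinct panics.
    print(f"{len(panics)} messages, {len(set(panics))} distinct panic(s)")
    for p in sorted(set(panics)):
        print(f"{p.reason}: lp={p.lp:x} owner={p.owner:x} thread={p.thread:x}")
```

Collapsing identical summaries this way only helps tell a repeated
report of one panic apart from several different panics; the actual
diagnosis still needs the crash dump itself.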