On Sun, 27 Dec 2020 at 08:54, Jesus Cea <j...@jcea.es> wrote:
> In the last 8 days I had four crashes of a machine with
> "joyent_20201217T173522Z". The machine hangs, no indication on screen
> (the screen shows the last content, no errors neither panic). The
> machine hangs hard, I need to reset it pressing the button.

I'm sorry to hear that!  That's a relatively recent SmartOS platform
-- was the machine stable on previous versions of the SmartOS
platform, and you're only now seeing these issues since the upgrade?
If so, what was the older version, and does reverting to the previous
version return the machine to a stable state?

> Not being able to boot the machine in this situation should be
> considered a bug. Please, fix it.

That does seem unfortunate, but it is probably SmartOS-specific.  I
would try filing a bug at https://github.com/joyent/smartos-live for
that specific issue.

> srvfs3 wcons login: 2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore:
> [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex,
> lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
> 2020-12-25T18:53:35.688665+00:00 srvzf3 savecore: [ID 570001 auth.error]
> reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08
> owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
> """

That message comes from savecore and suggests a crash dump may now be
available for analysis, which I think is a good next step.  If the
machine is not stable, it'd be good to copy the dump out to another
machine first.
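
If it helps to compare the panic strings across the four crashes, a
small script can pull the fields out of the savecore lines.  This is
only a sketch in Python: the regular expression and the parse_panic
name are my own assumptions based on the two lines quoted above, not a
documented log format.

```python
import re

# Assumed layout, taken from the two quoted savecore lines only.
PANIC_RE = re.compile(
    r"reboot after panic: (?P<reason>[^,]+),\s+"
    r"lp=(?P<lp>[0-9a-f]+)\s+"
    r"owner=(?P<owner>[0-9a-f]+)\s+"
    r"thread=(?P<thread>[0-9a-f]+)"
)


def parse_panic(line):
    """Return the panic reason and the three addresses, or None if no match."""
    match = PANIC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the hexadecimal addresses to integers so they compare cleanly.
    for key in ("lp", "owner", "thread"):
        fields[key] = int(fields[key], 16)
    return fields


# The message from the quoted console output, joined onto one line.
sample = (
    "savecore: [ID 570001 auth.error] reboot after panic: "
    "mutex_enter: bad mutex, lp=ffffff07174ffc08 "
    "owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0"
)
info = parse_panic(sample)
```

Both quoted messages carry the same lp, owner and thread values, so
they most likely describe a single panic rather than two.  The crash
dump itself is still the thing to look at.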

> After that, the machine hangs. No automatic reboot, it need a hard reset.
>
> (talking with the operator, this picture was send yesterday after the
> server crash, but it showing errors from the 25th, maybe it is referring
> to a PREVIOUS crash).
>
> I am quite surprised about the "auth.error" messages. This machine is a
> NFS server not connected to internet. I don't now if it is relevant.

I wouldn't worry about the "auth.error" business.  It seems that
savecore(1M) has chosen that level for perhaps dubious reasons:

    
https://code.illumos.org/plugins/gitiles/illumos-gate/+/refs/heads/master/usr/src/cmd/savecore/savecore.c#1782

    (see the comment a little further up that talks about the syslog facility.)

> I hope this is somewhat useful to anybody. Please, let me know how to go
> deeper debugging this.

Getting the crash dump out seems like a good start, as well as
figuring out whether this is new instability as a result of an OS
upgrade or if it might be a hardware issue.  You might also want to
see if there are hardware errors being logged by FMA, with something
like "fmdump -e".
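
If you want to collect that output after each crash so it can be
attached to a bug report, something like the following would do.  It
is a minimal sketch in Python; the fma_error_log name is hypothetical,
and it assumes fmdump is on the PATH and that you have sufficient
privileges to read the FMA error log.

```python
import shutil
import subprocess

# "fmdump -e" reads the FMA error log rather than the fault log.
FMDUMP_CMD = ["fmdump", "-e"]


def fma_error_log():
    """Return the text printed by fmdump -e, or None if fmdump is not found."""
    if shutil.which(FMDUMP_CMD[0]) is None:
        return None
    result = subprocess.run(
        FMDUMP_CMD, capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = fma_error_log()
    print(output if output is not None else "fmdump not found on this system")
```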


Cheers.

-- 
Joshua M. Clulow
http://blog.sysmgr.org

------------------------------------------
illumos: illumos-discuss
Permalink: 
https://illumos.topicbox.com/groups/discuss/T5267a2cbac027109-M0dbcf51a178f4ce166524925
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription