In the last 8 days I have had four crashes on a machine running "joyent_20201217T173522Z". The machine hangs hard with no indication on screen (it keeps showing the last content, with no errors and no panic message), and I have to reset it by pressing the button.

The first crash was dramatic because the machine was unable to boot. Checking the logs in single-user mode, I saw this:

"""
+ dumpadm -y -d /dev/zvol/dsk/zones/dump
dumpadm: dump device /dev/zvol/dsk/zones/dump is too small to hold a system dump
dump size 2121340928 bytes, device size 1220542464 bytes
+ fatal 'failed to configure dump device'
+ echo 'Error: failed to configure dump device'
Error: failed to configure dump device
+ exit 95
"""

After googling around, I increased "/dev/zvol/dsk/zones/dump" size to 30GB (overkill, but I don't want this to happen ever again).
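
For reference, the two sizes in the dumpadm error above can be compared with a few lines of Python. This is only a sketch: the regular expression assumes the exact message wording shown above, and the function name "dump_shortfall" is made up for illustration, not part of any tool.

```python
import re

# Error text copied from the dumpadm output above.
MESSAGE = "dump size 2121340928 bytes, device size 1220542464 bytes"

def dump_shortfall(message):
    """Return (dump_size, device_size, shortfall) in bytes.

    Assumes the 'dump size N bytes, device size M bytes' wording
    shown above; other platform versions may word it differently.
    """
    match = re.search(r"dump size (\d+) bytes, device size (\d+) bytes", message)
    if match is None:
        raise ValueError("unrecognized dumpadm message")
    dump_size, device_size = (int(g) for g in match.groups())
    return dump_size, device_size, max(0, dump_size - device_size)

dump_size, device_size, shortfall = dump_shortfall(MESSAGE)
print(f"device is short by {shortfall} bytes ({shortfall / 2**20:.0f} MiB)")
```

In other words, the old dump volume was short by less than 1 GiB, so 30GB leaves a very wide margin.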

Not being able to boot the machine in this situation should be considered a bug. Please fix it.



After that, I hoped to get crash dumps somewhere. My configuration is:

[root@srvzfs3 /var/crash/volatile]# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/zones/dump (dedicated)
Savecore directory: /var/crash/volatile
  Savecore enabled: yes
   Save compressed: on
    Dump encrypted: no

[root@srvzfs3 /var/crash/volatile]# zfs get volsize zones/dump
NAME        PROPERTY  VALUE    SOURCE
zones/dump  volsize   30G      local

But there is nothing in "/var/crash/volatile"; it is empty (there was a dump there from a 2017 crash, which I deleted). Nevertheless, the boot takes forever. Running "savecore" manually, I got this:

"""
[root@srvzfs3 /var/crash/volatile]# savecore -v
savecore: bad magic number e16aa54a
savecore: bad summary magic bdec9c78
"""
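
A small sketch to compare the values savecore read against the expected header magic. The constant below is an assumption: it is the DUMP_MAGIC value as I recall it from the illumos sys/dumphdr.h header, and it should be verified against the header for the running platform. The helper names are made up for illustration.

```python
import re

# Assumption: expected dump header magic, as recalled from illumos
# sys/dumphdr.h (DUMP_MAGIC). Verify against the header for your platform.
ASSUMED_DUMP_MAGIC = 0xDEFEC8ED

# The two lines printed by "savecore -v" above.
SAVECORE_OUTPUT = """\
savecore: bad magic number e16aa54a
savecore: bad summary magic bdec9c78
"""

def parse_bad_magics(output):
    """Map each 'bad ...' label printed by savecore to the value it read."""
    found = {}
    for label, value in re.findall(r"savecore: bad (.+?) ([0-9a-f]+)$", output, re.M):
        found[label] = int(value, 16)
    return found

def byteswap32(value):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")

magics = parse_bad_magics(SAVECORE_OUTPUT)
header = magics["magic number"]
print(f"read {header:#x}, expected {ASSUMED_DUMP_MAGIC:#x}")
print("matches when byte-swapped:", byteswap32(header) == ASSUMED_DUMP_MAGIC)
```

If the assumed constant is right, the value read does not match in either byte order, which would suggest that the dump device does not contain a dump header at all rather than a damaged one.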

During the first three crashes the machine was resilvering after a hard disk replacement (the hardware replacement window was also used to upgrade the platform to joyent_20201217T173522Z). This morning the machine crashed again, even though the resilvering had already finished.

The third crash showed something on the screen. I am transcribing the (bad quality) photo that the operator sent me; I hope there are no typos:

"""
srvfs3 wcons login: 2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore: [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
2020-12-25T18:53:35.688665+00:00 srvzf3 savecore: [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
"""

After that, the machine hangs. There is no automatic reboot; it needs a hard reset.

(Talking with the operator, I learned that this picture was sent yesterday after the server crash, but it shows errors from the 25th, so it may refer to a PREVIOUS crash.)

I am quite surprised by the "auth.error" messages. This machine is an NFS server not connected to the Internet. I don't know if it is relevant.
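
To make the transcribed panic line easier to read, here is a minimal sketch that splits it into fields. It assumes the bracketed tag follows the usual syslog layout "[ID <msgid> <facility>.<level>]"; the function name "parse_panic_line" is made up for illustration.

```python
import re

# One of the two log lines transcribed above (hostname as transcribed).
LOG_LINE = (
    "2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore: "
    "[ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, "
    "lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0"
)

def parse_panic_line(line):
    """Split the log line into its tag fields and the key=value addresses.

    Assumes the bracketed tag has the form '[ID <msgid> <facility>.<level>]'.
    """
    tag = re.search(r"\[ID (\d+) (\w+)\.(\w+)\]", line)
    if tag is None:
        raise ValueError("no [ID ...] tag found")
    # lp=, owner= and thread= are hexadecimal kernel addresses.
    fields = {key: int(value, 16) for key, value in re.findall(r"(\w+)=([0-9a-f]+)", line)}
    return {
        "msgid": int(tag.group(1)),
        "facility": tag.group(2),
        "level": tag.group(3),
        **fields,
    }

info = parse_panic_line(LOG_LINE)
print(info["facility"], info["level"])
print("owner == thread:", info["owner"] == info["thread"])
```

If that assumption holds, "auth.error" would just be the syslog facility and severity that savecore logged with, not an authentication failure. The transcription also shows the reported mutex owner and the panicking thread as the same address.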


Checking "zpool history", the replacement was done the right way:

"""
[root@srvzfs3 /var/crash/volatile]# zpool history
[...]
2020-12-19.23:46:48 zpool replace zones 8990018183995816436 c1t0d0
[...]

[root@srvzfs3 /var/crash/volatile]# zpool status
  pool: zones
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: resilvered 2.14T in 5 days 09:59:58 with 0 errors on Sun Dec 27 10:12:52 2020
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
        logs
          mirror-2    ONLINE       0     0     0
            c1t4d0s0  ONLINE       0     0     0
            c1t5d0s0  ONLINE       0     0     0
        cache
          c1t4d0s1    ONLINE       0     0     0
          c1t5d0s1    ONLINE       0     0     0

errors: No known data errors
"""
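
As a rough sanity check on the resilver time, the "scan:" line above can be turned into an average rate. This is a sketch only: it assumes that "T" means TiB and that the duration always has the "N days HH:MM:SS" form shown above. The function name "resilver_stats" is made up for illustration.

```python
import re

# The 'scan:' line from the zpool status output above.
SCAN_LINE = "scan: resilvered 2.14T in 5 days 09:59:58 with 0 errors on Sun Dec 27 10:12:52 2020"

def resilver_stats(scan_line):
    """Return (bytes_resilvered, elapsed_seconds) parsed from the scan line.

    Assumes 'T' means TiB (2**40 bytes) and the 'N days HH:MM:SS' duration
    format shown above.
    """
    match = re.search(r"resilvered ([\d.]+)T in (\d+) days (\d+):(\d+):(\d+)", scan_line)
    if match is None:
        raise ValueError("unrecognized scan line")
    size_tib = float(match.group(1))
    days, hours, minutes, seconds = (int(g) for g in match.groups()[1:])
    elapsed = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return size_tib * 2**40, elapsed

size_bytes, elapsed = resilver_stats(SCAN_LINE)
print(f"{elapsed} s elapsed, about {size_bytes / elapsed / 2**20:.1f} MiB/s on average")
```

Under those assumptions the average works out to a little under 5 MiB/s. I do not know whether the reported duration includes the time lost to the crashes and reboots, so the real rate may have been higher.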

"zdb" shows this (the "ashift" is 9 because this is quite an old pool):

"""
zones:
    version: 5000
    name: 'zones'
    state: 0
    txg: 37335958
    pool_guid: 2807429990997653683
    errata: 0
    hostid: 542799372
    hostname: ''
    com.delphix:has_per_vdev_zaps
    vdev_children: 3
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2807429990997653683
        children[0]:
            type: 'mirror'
            id: 0
            guid: 8841657624222278566
            metaslab_array: 39
            metaslab_shift: 34
            ashift: 9
            asize: 2999985635328
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 55
            children[0]:
                type: 'disk'
                id: 0
                guid: 8956384447561865843
                path: '/dev/dsk/c1t0d0s0'
                devid: 'id1,sd@n60030480008a7d2027714acd11cdc60e/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@0,0:a'
                whole_disk: 1
                DTL: 1235
                create_txg: 4
                com.delphix:vdev_zap_leaf: 773
            children[1]:
                type: 'disk'
                id: 1
                guid: 301314384901939396
                path: '/dev/dsk/c1t1d0s0'
                devid: 'id1,sd@n60030480008a7d201ea8de8a42098cb9/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@1,0:a'
                whole_disk: 1
                DTL: 8621
                create_txg: 4
                com.delphix:vdev_zap_leaf: 107
        children[1]:
            type: 'mirror'
            id: 1
            guid: 4227076483237831215
            metaslab_array: 36
            metaslab_shift: 34
            ashift: 9
            asize: 2999985635328
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 108
            children[0]:
                type: 'disk'
                id: 0
                guid: 6021145974097762388
                path: '/dev/dsk/c1t2d0s0'
                devid: 'id1,sd@n60030480008a7d201ea8de8c42214e32/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@2,0:a'
                whole_disk: 1
                DTL: 8620
                create_txg: 4
                com.delphix:vdev_zap_leaf: 109
            children[1]:
                type: 'disk'
                id: 1
                guid: 9695570681430649539
                path: '/dev/dsk/c1t3d0s0'
                devid: 'id1,sd@n60030480008a7d201ea8de8d4239f176/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@3,0:a'
                whole_disk: 1
                DTL: 8619
                create_txg: 4
                com.delphix:vdev_zap_leaf: 110
        children[2]:
            type: 'mirror'
            id: 2
            guid: 1877341096729848291
            metaslab_array: 120
            metaslab_shift: 24
            ashift: 9
            asize: 2150105088
            is_log: 1
            create_txg: 12527708
            com.delphix:vdev_zap_top: 117
            children[0]:
                type: 'disk'
                id: 0
                guid: 6693173462782706499
                path: '/dev/dsk/c1t4d0s0'
                devid: 'id1,sd@n60030480008a7d2022260a60148b2236/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@4,0:a'
                whole_disk: 0
                DTL: 1238
                create_txg: 12527708
                com.delphix:vdev_zap_leaf: 118
            children[1]:
                type: 'disk'
                id: 1
                guid: 87265357747160889
                path: '/dev/dsk/c1t5d0s0'
                devid: 'id1,sd@n60030480008a7d2022260a60148b6e95/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@5,0:a'
                whole_disk: 0
                DTL: 1237
                create_txg: 12527708
                com.delphix:vdev_zap_leaf: 119
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
"""
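
For readers not used to zdb output: "ashift" and "metaslab_shift" are base-2 exponents. A short sketch of what the values above translate to, using only numbers copied from the output (the function name "vdev_geometry" is made up for illustration):

```python
# Values copied from the zdb output above: (ashift, metaslab_shift, asize).
VDEVS = {
    "mirror-0": (9, 34, 2999985635328),
    "mirror-1": (9, 34, 2999985635328),
    "mirror-2 (log)": (9, 24, 2150105088),
}

def vdev_geometry(ashift, metaslab_shift, asize):
    """Derive sector size, metaslab size and whole-metaslab count.

    Both shifts are base-2 exponents, so the sizes are 1 << shift bytes.
    """
    sector_size = 1 << ashift
    metaslab_size = 1 << metaslab_shift
    return sector_size, metaslab_size, asize >> metaslab_shift

for name, params in VDEVS.items():
    sector, metaslab, count = vdev_geometry(*params)
    print(f"{name}: {sector}-byte sectors, {count} metaslabs of {metaslab // 2**20} MiB")
```

So an "ashift" of 9 means 512-byte sectors on all three top-level vdevs.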

I hope this is useful to somebody. Please let me know how to dig deeper into debugging this.

Thanks!

--
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
j...@jcea.es - https://www.jcea.es/    _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:j...@jabber.org  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
