The first crash was dramatic, because the machine was unable to boot. Checking the logs in single user mode, I saw this:
"""
+ dumpadm -y -d /dev/zvol/dsk/zones/dump
dumpadm: dump device /dev/zvol/dsk/zones/dump is too small to hold a system dump
dump size 2121340928 bytes, device size 1220542464 bytes
+ fatal 'failed to configure dump device'
+ echo 'Error: failed to configure dump device'
Error: failed to configure dump device
+ exit 95
"""

After googling around, I increased "/dev/zvol/dsk/zones/dump" size to 30GB (overkill, but I don't want this to happen ever again).
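To put numbers on that message, here is a minimal Python sketch that computes how far short the old dump device was and how much headroom 30G gives. The regular expression and variable names are only illustrative (they are not part of dumpadm), and it assumes that "30G" means 30 GiB, as ZFS reports sizes in powers of two.

```python
import re

# Error line copied from the boot log above.
message = "dump size 2121340928 bytes, device size 1220542464 bytes"

# Illustrative pattern written for this one message; not a dumpadm interface.
match = re.search(r"dump size (\d+) bytes, device size (\d+) bytes", message)
dump_size, device_size = (int(group) for group in match.groups())

GIB = 1024 ** 3

# How many bytes the old dump zvol was missing.
shortfall = dump_size - device_size

# Assumption: the new volsize of "30G" is 30 GiB.
new_device_size = 30 * GIB
headroom = new_device_size / dump_size

print(f"estimated dump size: {dump_size / GIB:.2f} GiB")
print(f"old device size:     {device_size / GIB:.2f} GiB")
print(f"shortfall:           {shortfall / GIB:.2f} GiB")
print(f"30G device is {headroom:.1f}x the estimated dump size")
```

The estimate is for "kernel pages" dump content only and can change as memory usage changes, so the ratio is just a rough indication of the margin.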
Not being able to boot the machine in this situation should be considered a bug. Please, fix it.
After that, I hoped to get crash dumps somewhere. My configuration is:

"""
[root@srvzfs3 /var/crash/volatile]# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/zones/dump (dedicated)
Savecore directory: /var/crash/volatile
  Savecore enabled: yes
   Save compressed: on
    Dump encrypted: no
[root@srvzfs3 /var/crash/volatile]# zfs get volsize zones/dump
NAME        PROPERTY  VALUE  SOURCE
zones/dump  volsize   30G    local
"""

But there is nothing in "/var/crash/volatile"; it is empty (there was a dump there from a 2017 crash that I deleted). Nevertheless, the boot takes forever. Doing a "savecore" manually I got this:
"""
[root@srvzfs3 /var/crash/volatile]# savecore -v
savecore: bad magic number e16aa54a
savecore: bad summary magic bdec9c78
"""

During the first three crashes the machine was resilvering after a hard disk replacement (the hardware replacement window was also used to upgrade the platform to joyent_20201217T173522Z), but this morning the machine crashed again, and by then the resilvering had already finished.
The third crash showed some messages on the screen. I transcribe the (bad quality) photo that the operator sent me; I hope there are no typos:
"""
srvfs3 wcons login: 2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore: [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
2020-12-25T18:53:35.688665+00:00 srvzf3 savecore: [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
"""

After that, the machine hangs. There is no automatic reboot; it needs a hard reset. (Talking with the operator, this picture was sent yesterday after the server crashed, but it shows errors from the 25th, so maybe it refers to a PREVIOUS crash.)
I am quite surprised about the "auth.error" messages. This machine is an NFS server not connected to the internet. I don't know if it is relevant.
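For anyone who wants to pick those console lines apart, here is a minimal Python sketch that splits one of them into fields. The pattern is hypothetical and written only for this message. As far as I can tell, the bracketed tag follows the usual "[ID msgid facility.level]" layout of syslog messages on this platform, so "auth.error" would just be the facility and severity the message was logged with; treat that reading as an assumption.

```python
import re

# One of the console lines transcribed above, leading login prompt omitted.
line = (
    "2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore: "
    "[ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, "
    "lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0"
)

# Hypothetical pattern for this specific panic message only.
pattern = re.compile(
    r"\[ID (?P<msgid>\d+) (?P<facility>\w+)\.(?P<level>\w+)\] "
    r"reboot after panic: (?P<reason>.+?), "
    r"lp=(?P<lp>[0-9a-f]+) owner=(?P<owner>[0-9a-f]+) thread=(?P<thread>[0-9a-f]+)"
)

fields = pattern.search(line).groupdict()

for name, value in fields.items():
    print(f"{name:>8}: {value}")

# In the transcribed lines the reported owner and thread addresses are identical.
same_owner_and_thread = fields["owner"] == fields["thread"]
print("owner == thread:", same_owner_and_thread)
```

The only thing this establishes is what the transcription already says: the lock address, and the fact that the reported owner and the panicking thread are the same address. It does not explain why the mutex was considered bad.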
Checking the "zpool history", the replacement was done the right way:

"""
[root@srvzfs3 /var/crash/volatile]# zpool history
[...]
2020-12-19.23:46:48 zpool replace zones 8990018183995816436 c1t0d0
[...]

[root@srvzfs3 /var/crash/volatile]# zpool status
  pool: zones
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: resilvered 2.14T in 5 days 09:59:58 with 0 errors on Sun Dec 27 10:12:52 2020
config:

        NAME          STATE     READ WRITE CKSUM
        zones         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
        logs
          mirror-2    ONLINE       0     0     0
            c1t4d0s0  ONLINE       0     0     0
            c1t5d0s0  ONLINE       0     0     0
        cache
          c1t4d0s1    ONLINE       0     0     0
          c1t5d0s1    ONLINE       0     0     0

errors: No known data errors
"""

"zdb" shows this (the "ashift" is "9" because this is a quite old ZPOOL):

"""
zones:
    version: 5000
    name: 'zones'
    state: 0
    txg: 37335958
    pool_guid: 2807429990997653683
    errata: 0
    hostid: 542799372
    hostname: ''
    com.delphix:has_per_vdev_zaps
    vdev_children: 3
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2807429990997653683
        children[0]:
            type: 'mirror'
            id: 0
            guid: 8841657624222278566
            metaslab_array: 39
            metaslab_shift: 34
            ashift: 9
            asize: 2999985635328
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 55
            children[0]:
                type: 'disk'
                id: 0
                guid: 8956384447561865843
                path: '/dev/dsk/c1t0d0s0'
                devid: 'id1,sd@n60030480008a7d2027714acd11cdc60e/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@0,0:a'
                whole_disk: 1
                DTL: 1235
                create_txg: 4
                com.delphix:vdev_zap_leaf: 773
            children[1]:
                type: 'disk'
                id: 1
                guid: 301314384901939396
                path: '/dev/dsk/c1t1d0s0'
                devid: 'id1,sd@n60030480008a7d201ea8de8a42098cb9/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@1,0:a'
                whole_disk: 1
                DTL: 8621
                create_txg: 4
                com.delphix:vdev_zap_leaf: 107
        children[1]:
            type: 'mirror'
            id: 1
            guid: 4227076483237831215
            metaslab_array: 36
            metaslab_shift: 34
            ashift: 9
            asize: 2999985635328
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 108
            children[0]:
                type: 'disk'
                id: 0
                guid: 6021145974097762388
                path: '/dev/dsk/c1t2d0s0'
                devid: 'id1,sd@n60030480008a7d201ea8de8c42214e32/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@2,0:a'
                whole_disk: 1
                DTL: 8620
                create_txg: 4
                com.delphix:vdev_zap_leaf: 109
            children[1]:
                type: 'disk'
                id: 1
                guid: 9695570681430649539
                path: '/dev/dsk/c1t3d0s0'
                devid: 'id1,sd@n60030480008a7d201ea8de8d4239f176/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@3,0:a'
                whole_disk: 1
                DTL: 8619
                create_txg: 4
                com.delphix:vdev_zap_leaf: 110
        children[2]:
            type: 'mirror'
            id: 2
            guid: 1877341096729848291
            metaslab_array: 120
            metaslab_shift: 24
            ashift: 9
            asize: 2150105088
            is_log: 1
            create_txg: 12527708
            com.delphix:vdev_zap_top: 117
            children[0]:
                type: 'disk'
                id: 0
                guid: 6693173462782706499
                path: '/dev/dsk/c1t4d0s0'
                devid: 'id1,sd@n60030480008a7d2022260a60148b2236/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@4,0:a'
                whole_disk: 0
                DTL: 1238
                create_txg: 12527708
                com.delphix:vdev_zap_leaf: 118
            children[1]:
                type: 'disk'
                id: 1
                guid: 87265357747160889
                path: '/dev/dsk/c1t5d0s0'
                devid: 'id1,sd@n60030480008a7d2022260a60148b6e95/a'
                phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@5,0:a'
                whole_disk: 0
                DTL: 1237
                create_txg: 12527708
                com.delphix:vdev_zap_leaf: 119
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
"""

I hope this is somewhat useful to somebody. Please let me know how to dig deeper into debugging this.
Thanks!

-- 
Jesús Cea Avión
j...@jcea.es - https://www.jcea.es/
Twitter: @jcea
jabber / xmpp:j...@jabber.org
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz