Thanks, Robert, for the feedback. Honestly, I do not know Lustre well at all, and I am trying to reach the engineer who built this Lustre system for more details about the drives. To my knowledge, the LustreMDT pool sits on a single four-SSD disk group (presented as /dev/mapper/SSD) using hardware RAID5.
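To confirm the layout while I wait to hear back, I plan to run something like the following on mds1 (this assumes /dev/mapper/SSD is a device-mapper target; the exact controller tooling may differ):

zpool status -v LustreMDT      (the vdev layout as ZFS sees it)
lsblk                          (how /dev/mapper/SSD maps onto block devices)
dmsetup table SSD              (the device-mapper definition behind /dev/mapper/SSD)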
I can mount LustreMDT/mdt0-work manually with the following steps:

pcs cluster standby --all              (stops the MDS and OSS resources)
zpool import LustreMDT
zfs set canmount=on LustreMDT/mdt0-work
zfs mount LustreMDT/mdt0-work

With the dataset mounted, listing /LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0 returns an I/O error, but other files look fine:

[root@mds1 mdt0-work]# ls -ahlt "/LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0"
ls: reading directory /LustreMDT/mdt0-work/oi.3/0x200000003:0x2:0x0: Input/output error
total 23M
drwxr-xr-x 2 root root 2 Jan  1  1970 .
drwxr-xr-x 0 root root 0 Jan  1  1970 ..

Is this the drive-failure situation you were referring to?
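For completeness, after these manual checks I hand the dataset back to the cluster with roughly the reverse sequence. The canmount=off step is my assumption that the property was off before I changed it, since Lustre normally mounts the target itself rather than through ZFS:

zfs umount LustreMDT/mdt0-work
zfs set canmount=off LustreMDT/mdt0-work   (assumes it was previously off)
zpool export LustreMDT
pcs cluster unstandby --all                (returns MDS/OSS to Pacemaker control)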
Best,
Ian

On Wed, Sep 21, 2022 at 9:32 PM Robert Anderson <[email protected]> wrote:

> I could be reading your zpool status output wrong, but it looks like you
> had 2 drives in that pool. Not mirrored, so no fault tolerance. Any drive
> failure would lose half of the pool data.
>
> Unless you can get that drive working, you are missing half of your data
> and have no resilience to errors, nothing to recover from.
>
> However you proceed, you should ensure that you have a mirrored zfs pool,
> or more drives and raidz (I like raidz2).
>
> On September 20, 2022 11:57:09 PM Ian Yi-Feng Chang via lustre-discuss <
> [email protected]> wrote:
>
>> Dear All,
>> I think this problem is more related to ZFS, but I would like to ask
>> for help from experts in all fields.
>> Our MDT stopped working properly after the IB switch was accidentally
>> rebooted (power issue). Everything looks fine except that the MDT
>> cannot be started. Our MDT's ZFS pool has no backup or snapshot.
>> I would like to ask: can this problem be fixed, and if so, how?
>>
>> Thanks for your help in advance.
>>
>> Best,
>> Ian
>>
>> Lustre: Build Version: 2.10.4
>> OS: CentOS Linux release 7.5.1804 (Core)
>> uname -r: 3.10.0-862.el7.x86_64
>>
>>
>> [root@mds1 etc]# pcs status
>> Cluster name: mdsgroup01
>> Stack: corosync
>> Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
>> Last updated: Wed Sep 21 11:46:25 2022
>> Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1
>>
>> 2 nodes configured
>> 9 resources configured
>>
>> Online: [ mds1 mds2 ]
>>
>> Full list of resources:
>>
>>  Resource Group: group-MDS
>>      zfs-LustreMDT   (ocf::heartbeat:ZFS):      Started mds1
>>      MGT             (ocf::lustre:Lustre):      Started mds1
>>      MDT             (ocf::lustre:Lustre):      Stopped
>>  ipmi-fencingMDS1    (stonith:fence_ipmilan):   Started mds2
>>  ipmi-fencingMDS2    (stonith:fence_ipmilan):   Started mds2
>>  Clone Set: healthLUSTRE-clone [healthLUSTRE]
>>      Started: [ mds1 mds2 ]
>>  Clone Set: healthLNET-clone [healthLNET]
>>      Started: [ mds1 mds2 ]
>>
>> Failed Actions:
>> * MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete, exitreason='',
>>     last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
>> * MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete, exitreason='',
>>     last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms
>>
>>
>> Daemon Status:
>>   corosync: active/enabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>>
>> After running zpool scrub on the MDT pool, zpool status -v reported:
>>
>>   pool: LustreMDT
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>>         entire pool from backup.
>>    see: http://zfsonlinux.org/msg/ZFS-8000-8A
>>   scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         LustreMDT   ONLINE       0     0     2
>>           SSD       ONLINE       0     0     8
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         LustreMDT/mdt0-work:/oi.3/0x200000003:0x2:0x0
>>
>>
>> # dmesg -T
>> [Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4
>> [Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration
>> [Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21@o2ib [8/256/0/180]
>> [Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
>> [Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup()) MGC172.29.32.21@o2ib: failed to open log work-MDT0000: rc = -5
>> [Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21@o2ib: failed to copy remote log work-MDT0000: rc = -5
>> [Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log work-MDT0000 and no local copy.
>> [Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21@o2ib: The configuration from log 'work-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>> [Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server work-MDT0000: -2
>> [Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -2
>> [Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT0000
>> [Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT0000 complete
>> [Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-2)
>> [Tue Sep 20 15:01:56 2022] Lustre: 4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1663657311/real 1663657311] req@ffff8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>> [Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete
>> [Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
>> [Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to 28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1@o2ib)
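P.S. Regarding the resilience suggestion above: my understanding is that when we rebuild, a raidz2 or mirrored pool would be created roughly like this (the device paths are hypothetical placeholders, not our actual disks):

# raidz2 across four devices: survives any two disk failures
zpool create LustreMDT raidz2 /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2 /dev/disk/by-id/ssd3

# or two mirrored pairs (striped mirrors)
zpool create LustreMDT mirror /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1 mirror /dev/disk/by-id/ssd2 /dev/disk/by-id/ssd3

Either way, ZFS would then see redundant vdevs instead of a single opaque RAID5 device, so checksum errors like the ones above could self-heal instead of becoming permanent.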
