Can you say more about these networking issues? Good to make a note of them in case anyone sees similar in the future.
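(For the archives, since it shows up all through the logs below: the rc = -110 in the Lustre mount errors is just a negated Linux errno, i.e. ETIMEDOUT / "Connection timed out" — consistent with the MGS being unreachable while the network was broken. A quick sketch for looking such codes up, assuming a Linux errno table:)

```python
import errno
import os

# Lustre/kernel messages report failures as negated Linux errnos,
# so rc = -110 corresponds to errno 110.
rc = -110
name = errno.errorcode.get(-rc, "unknown")  # 'ETIMEDOUT' on Linux
print(f"rc = {rc} -> {name}: {os.strerror(-rc)}")
```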
On Fri, 12 May 2023, 20:40 Jane Liu via lustre-discuss, <[email protected]> wrote:

> Hi Jeff,
>
> Thanks for your response. We discovered later that network issues
> originating from the iDRAC IP were causing the SAS driver to hang or
> time out when trying to access the drives, which resulted in the
> drives being kicked out.
>
> Once we resolved this issue, both the mkfs and mount operations
> started working fine.
>
> Thanks,
> Jane
>
> On 2023-05-10 12:43, Jeff Johnson wrote:
> > Jane,
> >
> > You're having hardware errors. The codes in those mpt3sas errors
> > decode as "PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT", or in
> > other words, your SAS HBA cannot open a command dialogue with your
> > disk. I'd suspect backplane or cabling issues, as an internal disk
> > failure would be reported by the target disk with its own error
> > code. In this case your HBA can't even talk to it properly.
> >
> > Is sdah the partner mpath device to sdef? Or is sdah a second
> > failing disk interface?
> >
> > Looking at this, I don't think your hardware is deploy-ready.
> >
> > --Jeff
> >
> > On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss
> > <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> We recently attempted to add several new OSS servers (RHEL 8.7 and
> >> Lustre 2.15.2). While creating new OSTs, I noticed that mdstat
> >> reported some disk failures after the mkfs, even though the disks
> >> were functional before the mkfs command. Our hardware admins
> >> managed to resolve the mdstat issue and restore the disks to
> >> normal operation. However, when I ran the mount OST command (while
> >> the network had a problem and the mount command timed out),
> >> similar problems occurred, and several disks were kicked out. The
> >> relevant /var/log/messages entries are provided below.
> >>
> >> This problem was consistent across all our OSS servers. Any
> >> insights into the possible cause would be appreciated.
> >>
> >> Jane
> >>
> >> -----------------------------
> >>
> >> May 9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro
> >> May 9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
> >> May 9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 72, npartitions: 2
> >> May 9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 (adler32-zlib)
> >> May 9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
> >> May 9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
> >> May 9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
> >> May 9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp [8/256/0/180]
> >> May 9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
> >> May 9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> >> May 9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 'large_dir' feature on device /dev/md0
> >> May 9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user root.
> >> May 9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
> >> May 9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: cannot register this server with the MGS: rc = -110. Is the MGS running?
> >> May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
> >> May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0244
> >> May 9 13:34:36 sphnxoss47 kernel: LustreError: 45314:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0244 not registered
> >> May 9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 complete
> >> May 9 13:34:39 sphnxoss47 kernel: LustreError: 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
> >> May 9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
> >> May 9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
> >> May 9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> >> May 9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 'large_dir' feature on device /dev/md1
> >> May 9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: cannot register this server with the MGS: rc = -110. Is the MGS running?
> >> May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -110
> >> May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd sphnx01-OST0245
> >> May 9 13:36:00 sphnxoss47 kernel: LustreError: 46127:0:(obd_mount_server.c:131:server_deregister_mount()) sphnx01-OST0245 not registered
> >> May 9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 complete
> >> May 9 13:36:08 sphnxoss47 kernel: LustreError: 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
> >> May 9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
> >> May 9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
> >> May 9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> >>
> >> -----------------------------
> >>
> >> It just repeats for all of the md raids; then the errors start, and the drive fails and is disabled:
> >>
> >> May 9 13:44:31 sphnxoss47 kernel: LustreError: 48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -110
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> ....
> >> ....
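(For anyone decoding these log_info values later: the 32-bit word packs bus type, originator, code, and sub-code into fixed fields. A minimal sketch — the field boundaries are my reading of the mpt3sas driver's loginfo layout, but they reproduce the kernel's own printed breakdown of 0x3112011a above, which is a decent sanity check:)

```python
def decode_mpt3sas_loginfo(loginfo: int) -> dict:
    """Split an mpt3sas log_info word into its fields.

    Assumed layout (per the driver's loginfo union):
    bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub-code.
    """
    originators = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}
    origin = (loginfo >> 24) & 0xF
    return {
        "originator": originators.get(origin, hex(origin)),
        "code": (loginfo >> 16) & 0xFF,
        "sub_code": loginfo & 0xFFFF,
    }

# The value flooding the logs above: originator PL, code 0x12, sub_code 0x011a
info = decode_mpt3sas_loginfo(0x3112011A)
```

That sub_code 0x011a is the PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT Jeff identified.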
> >> May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
> >> May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#1102 CDB: Read(10) 28 00 00 00 87 79 00 00 01 00
> >> May 9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 277448 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
> >> May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s
> >> May 9 13:44:33 sphnxoss47 kernel: sd 16:0:31:0: [sdef] tag#6800 CDB: Read(10) 28 00 00 00 87 dd 00 00 01 00
> >> May 9 13:44:33 sphnxoss47 kernel: blk_update_request: I/O error, dev sdef, sector 278248 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0
> >> May 9 13:44:33 sphnxoss47 kernel: device-mapper: multipath: 253:52: Failing path 128:112.
> >> May 9 13:44:33 sphnxoss47 multipathd[6051]: sdef: mark as failed
> >> May 9 13:44:33 sphnxoss47 multipathd[6051]: mpathae: remaining active paths: 1
> >> ...
> >> ...
> >> May 9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:34 sphnxoss47 kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
> >> May 9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
> >> May 9 13:44:34 sphnxoss47 kernel: md/raid:md8: Disk failure on dm-55, disabling device.
> >> May 9 13:44:34 sphnxoss47 kernel: md: super_written gets error=-5
> >> May 9 13:44:34 sphnxoss47 kernel: md/raid:md8: Operation continuing on 9 devices.
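(A quick aside: when drives get kicked like this, the fastest way to see which arrays went degraded is to scan /proc/mdstat for [total/active] member counts where active < total. A minimal sketch, run here against a made-up mdstat excerpt rather than Jane's actual system:)

```python
import re

# Illustrative sample in /proc/mdstat format; device names are invented.
SAMPLE = """\
md8 : active raid6 dm-55[10](F) dm-54[9] dm-53[8]
      93756181504 blocks super 1.2 level 6, 512k chunk [10/9] [UUUUU_UUUU]
md0 : active raid6 dm-2[1] dm-1[0]
      93756181504 blocks super 1.2 level 6, 512k chunk [10/10] [UUUUUUUUUU]
"""

def degraded_arrays(mdstat_text: str) -> list:
    """Return md device names whose [total/active] counts show active < total."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current and int(counts.group(2)) < int(counts.group(1)):
            degraded.append(current)
    return degraded

print(degraded_arrays(SAMPLE))  # -> ['md8']
```

On a real box you'd feed it `open("/proc/mdstat").read()` instead of the sample text.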
> >> May 9 13:44:34 sphnxoss47 multipathd[6051]: sdah: mark as failed
> >>
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> [email protected]
> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
> > --
> > ------------------------------
> > Jeff Johnson
> > Co-Founder
> > Aeon Computing
> >
> > [email protected]
> > www.aeoncomputing.com
> > t: 858-412-3810 x1001  f: 858-412-3845
> > m: 619-204-9061
> >
> > 4170 Morena Boulevard, Suite C - San Diego, CA 92117
> >
> > High-Performance Computing / Lustre Filesystems / Scale-out Storage
