Mike,

Have you made sure that the o2ib interfaces on all of your Lustre servers (MDS & OSS) are functioning properly? Are you able to `lctl ping x.x.x.x@o2ib` successfully between the MDS and OSS nodes?
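For example, running something along these lines on the MDS should quickly confirm basic reachability (the OSS NID below is taken from the timeout messages in your log; substitute the NIDs of your other servers as needed):

    # lctl list_nids
    # lctl ping 172.16.100.4@o2ib
    # ping -c 3 172.16.100.4

If `lctl ping` hangs or times out the same way the mount does, the problem is most likely at the LNet/InfiniBand layer rather than with the MDT itself. If infiniband-diags is installed, `ibstat` on each server will also show whether the HCA ports are Active/LinkUp.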
--Jeff

On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> Rick,
>
> 172.16.100.4 is the IB address of one of the OSS servers. I believe the mgt and mdt0 are the same target. My understanding is that we have a single instance of the MGT, which is on the first MDT server, i.e. it was created via a command similar to:
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
> Does that make sense?
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>> Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target, or are they two separate targets just on the same host?
>>
>> --Rick
>>
>> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>>
>> Hi Rick,
>>
>> The MGS/MDS are combined. The output I posted is from the primary.
>>
>> Thanks,
>>
>> Mike
>>
>> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>>
>> Mike,
>>
>> It looks like the MDS server is having a problem contacting the MGS server. I'm guessing the MGS is a separate host? I would start by looking for possible network problems that might explain the LNet timeouts. You can try using "lctl ping" to test the LNet connection between nodes, and you can also try regular "ping" between the IP addresses on the IB interfaces.
>>
>> --Rick
>>
>> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via lustre-discuss" <lustre-discuss-boun...@lists.lustre.org> on behalf of lustre-discuss@lists.lustre.org wrote:
>>
>> Greetings,
>>
>> We have experienced some type of issue that is causing both of our MDS servers to only be able to mount the mdt device in read-only mode. Here are some of the error messages we are seeing in the log files below. We lost our Lustre expert a while back, and we are not sure how to troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>>
>> Thanks,
>>
>> Mike
>>
>> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more than 120 seconds.
>> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123 1 0x00000086
>> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb585da9>] schedule+0x29/0x70
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb5838b1>] schedule_timeout+0x221/0x2d0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6b8e5>] ? tracing_is_on+0x15/0x30
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6f5bd>] ? tracing_record_cmdline+0x1d/0x120
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf77d9b>] ? probe_sched_wakeup+0x2b/0xa0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaed7d15>] ? ttwu_do_wakeup+0xb5/0xe0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb58615d>] wait_for_completion+0xfd/0x140
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaedb990>] ? wake_up_state+0x20/0x20
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f529a4>] llog_process_or_fork+0x244/0x450 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f52bc4>] llog_process+0x14/0x20 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f85d05>] class_config_parse_llog+0x125/0x350 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a69fc0>] mgc_process_cfg_log+0x790/0xc40 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6d4cc>] mgc_process_log+0x3dc/0x8f0 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6e15f>] ? config_recover_log_add+0x13f/0x280 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6eb2b>] mgc_process_config+0x88b/0x13f0 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f91b58>] lustre_process_log+0x2d8/0xad0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0e5a177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f7c8b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc08f4>] server_start_targets+0x13a4/0x2a20 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f94bb0>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc303c>] server_fill_super+0x10cc/0x1890 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f97a08>] lustre_fill_super+0x468/0x960 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f975a0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0510cf>] mount_nodev+0x4f/0xb0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8f9a8>] lustre_mount+0x38/0x60 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb051c4e>] mount_fs+0x3e/0x1b0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0707a7>] vfs_kern_mount+0x67/0x110
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb072edf>] do_mount+0x1ef/0xd00
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb049d7a>] ? __check_object_size+0x1ca/0x250
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0288ec>] ? kmem_cache_alloc_trace+0x3c/0x200
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb073d33>] SyS_mount+0x83/0xd0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb592ed2>] system_call_fastpath+0x25/0x2a
>> Jun 20 15:13:14 hyd-mds1 kernel: LNet: 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.16.100.4@o2ib: 9 seconds
>> Jun 20 15:13:14 hyd-mds1 kernel: LNet: 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous similar messages
>> Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more than 120 seconds.
>> Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123 1 0x00000086
>>
>> dumpe2fs seems to show that the file systems are clean, i.e.:
>> dumpe2fs 1.45.6.wc1 (20-Mar-2020)
>> Filesystem volume name: hydra-MDT0000
>> Last mounted on: /
>> Filesystem UUID: 3ae09231-7f2a-43b3-a4ee-7f36080b5a66
>> Filesystem magic number: 0xEF53
>> Filesystem revision #: 1 (dynamic)
>> Filesystem features: has_journal ext_attr resize_inode dir_index filetype mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
>> Filesystem flags: signed_directory_hash
>> Default mount options: user_xattr acl
>> Filesystem state: clean
>> Errors behavior: Continue
>> Filesystem OS type: Linux
>> Inode count: 2247671504
>> Block count: 1404931944
>> Reserved block count: 70246597
>> Free blocks: 807627552
>> Free inodes: 2100036536
>> First block: 0
>> Block size: 4096
>> Fragment size: 4096
>> Reserved GDT blocks: 1024
>> Blocks per group: 20472
>> Fragments per group: 20472
>> Inodes per group: 32752
>> Inode blocks per group: 8188
>> Flex block group size: 16
>> Filesystem created: Thu Aug 8 14:21:01 2019
>> Last mount time: Tue Jun 20 15:19:03 2023
>> Last write time: Wed Jun 21 10:43:51 2023
>> Mount count: 38
>> Maximum mount count: -1
>> Last checked: Thu Aug 8 14:21:01 2019
>> Check interval: 0 (<none>)
>> Lifetime writes: 219 TB
>> Reserved blocks uid: 0 (user root)
>> Reserved blocks gid: 0 (group root)
>> First inode: 11
>> Inode size: 1024
>> Required extra isize: 32
>> Desired extra isize: 32
>> Journal inode: 8
>> Default directory hash: half_md4
>> Directory Hash Seed: 2e518531-82d9-4652-9acd-9cf9ca09c399
>> Journal backup: inode blocks
>> MMP block number: 1851467
>> MMP update interval: 5
>> User quota inode: 3
>> Group quota inode: 4
>> Journal features: journal_incompat_revoke
>> Journal size: 4096M
>> Journal length: 1048576
>> Journal sequence: 0x0a280713
>> Journal start: 0
>> MMP_block:
>>     mmp_magic: 0x4d4d50
>>     mmp_check_interval: 6
>>     mmp_sequence: 0xff4d4d50
>>     mmp_update_date: Wed Jun 21 10:43:51 2023
>>     mmp_update_time: 1687358631
>>     mmp_node_name: hyd-mds1.uncc.edu
>>     mmp_device_name: dm-0

--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage