Mike,

Have you made sure the o2ib interfaces on all of your Lustre servers
(MDS & OSS) are functioning properly? Are you able to `lctl ping
x.x.x.x@o2ib` successfully between MDS and OSS nodes?
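
For example, something along these lines (the NID below is the one from the
timeout messages in your log; lnetctl is only present on newer Lustre
releases):

# lctl list_nids
# lnetctl net show
# lctl ping 172.16.100.4@o2ib

If the LNet ping fails, checking the IB interfaces and fabric underneath it
would be the next step.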

--Jeff


On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Rick,
> 172.16.100.4 is the IB address of one of the OSS servers. I believe the
> mgt and mdt0 are the same target. My understanding is that we have a
> single instance of the MGT, which is on the first MDT server, i.e. it was
> created via a command similar to:
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
> Does that make sense?
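>
> If it helps to double-check that, I believe the target flags can be read
> without touching the disk with something like:
>
> # tunefs.lustre --dryrun /dev/sdb
>
> which should list both MDT and MGS in the flags if the MGS really is
> co-located on that device.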
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same
>> target or are they two separate targets just on the same host?
>>
>> --Rick
>>
>>
>> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>>
>>
>> Hi Rick,
>>
>>
>> The MGS/MDS are combined. The output I posted is from the primary.
>>
>> Thanks,
>>
>> Mike
>>
>> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>>
>>
>> Mike,
>>
>>
>> It looks like the mds server is having a problem contacting the mgs
>> server. I'm guessing the mgs is a separate host? I would start by looking
>> for possible network problems that might explain the LNet timeouts. You can
>> try using "lctl ping" to test the LNet connection between nodes, and you
>> can also try regular "ping" between the IP addresses on the IB interfaces.
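>>
>> For example, from the MDS, something like the following (using the NID
>> from your log):
>>
>> # lctl ping 172.16.100.4@o2ib
>> # ping 172.16.100.4
>>
>> If the plain ping works but the LNet ping times out, that narrows it down
>> to the LNet/o2ib layer rather than basic IP connectivity.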
>>
>>
>> --Rick
>>
>> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
>> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org> on behalf of
>> lustre-discuss@lists.lustre.org wrote:
>>
>> Greetings,
>>
>> We have experienced some type of issue that is causing both of our MDS
>> servers to only be able to mount the mdt device in read only mode. Here are
>> some of the error messages we are seeing in the log files below. We lost
>> our Lustre expert a while back and we are not sure how to proceed to
>> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>>
>> Thanks,
>>
>> Mike
>>
>> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
>> more than 120 seconds.
>> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123 1
>> 0x00000086
>> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb585da9>] schedule+0x29/0x70
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb5838b1>]
>> schedule_timeout+0x221/0x2d0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6b8e5>] ?
>> tracing_is_on+0x15/0x30
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6f5bd>] ?
>> tracing_record_cmdline+0x1d/0x120
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf77d9b>] ?
>> probe_sched_wakeup+0x2b/0xa0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaed7d15>] ?
>> ttwu_do_wakeup+0xb5/0xe0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb58615d>]
>> wait_for_completion+0xfd/0x140
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaedb990>] ?
>> wake_up_state+0x20/0x20
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f529a4>]
>> llog_process_or_fork+0x244/0x450 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f52bc4>]
>> llog_process+0x14/0x20 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f85d05>]
>> class_config_parse_llog+0x125/0x350 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a69fc0>]
>> mgc_process_cfg_log+0x790/0xc40 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6d4cc>]
>> mgc_process_log+0x3dc/0x8f0 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6e15f>] ?
>> config_recover_log_add+0x13f/0x280 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ?
>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6eb2b>]
>> mgc_process_config+0x88b/0x13f0 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f91b58>]
>> lustre_process_log+0x2d8/0xad0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0e5a177>] ?
>> libcfs_debug_msg+0x57/0x80 [libcfs]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f7c8b9>] ?
>> lprocfs_counter_add+0xf9/0x160 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc08f4>]
>> server_start_targets+0x13a4/0x2a20 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f94bb0>] ?
>> lustre_start_mgc+0x260/0x2510 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ?
>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc303c>]
>> server_fill_super+0x10cc/0x1890 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f97a08>]
>> lustre_fill_super+0x468/0x960 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f975a0>] ?
>> lustre_common_put_super+0x270/0x270 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0510cf>]
>> mount_nodev+0x4f/0xb0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8f9a8>]
>> lustre_mount+0x38/0x60 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb051c4e>] mount_fs+0x3e/0x1b0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0707a7>]
>> vfs_kern_mount+0x67/0x110
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb072edf>] do_mount+0x1ef/0xd00
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb049d7a>] ?
>> __check_object_size+0x1ca/0x250
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0288ec>] ?
>> kmem_cache_alloc_trace+0x3c/0x200
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb073d33>] SyS_mount+0x83/0xd0
>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb592ed2>]
>> system_call_fastpath+0x25/0x2a
>> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
>> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for
>> 172.16.100.4@o2ib: 9 seconds
>> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
>> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous
>> similar messages
>> Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
>> more than 120 seconds.
>> Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123 1
>> 0x00000086
>>
>> dumpe2fs seems to show that the file systems are clean i.e.
>>
>> dumpe2fs 1.45.6.wc1 (20-Mar-2020)
>> Filesystem volume name: hydra-MDT0000
>> Last mounted on: /
>> Filesystem UUID: 3ae09231-7f2a-43b3-a4ee-7f36080b5a66
>> Filesystem magic number: 0xEF53
>> Filesystem revision #: 1 (dynamic)
>> Filesystem features: has_journal ext_attr resize_inode dir_index filetype
>> mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink
>> quota
>> Filesystem flags: signed_directory_hash
>> Default mount options: user_xattr acl
>> Filesystem state: clean
>> Errors behavior: Continue
>> Filesystem OS type: Linux
>> Inode count: 2247671504
>> Block count: 1404931944
>> Reserved block count: 70246597
>> Free blocks: 807627552
>> Free inodes: 2100036536
>> First block: 0
>> Block size: 4096
>> Fragment size: 4096
>> Reserved GDT blocks: 1024
>> Blocks per group: 20472
>> Fragments per group: 20472
>> Inodes per group: 32752
>> Inode blocks per group: 8188
>> Flex block group size: 16
>> Filesystem created: Thu Aug 8 14:21:01 2019
>> Last mount time: Tue Jun 20 15:19:03 2023
>> Last write time: Wed Jun 21 10:43:51 2023
>> Mount count: 38
>> Maximum mount count: -1
>> Last checked: Thu Aug 8 14:21:01 2019
>> Check interval: 0 (<none>)
>> Lifetime writes: 219 TB
>> Reserved blocks uid: 0 (user root)
>> Reserved blocks gid: 0 (group root)
>> First inode: 11
>> Inode size: 1024
>> Required extra isize: 32
>> Desired extra isize: 32
>> Journal inode: 8
>> Default directory hash: half_md4
>> Directory Hash Seed: 2e518531-82d9-4652-9acd-9cf9ca09c399
>> Journal backup: inode blocks
>> MMP block number: 1851467
>> MMP update interval: 5
>> User quota inode: 3
>> Group quota inode: 4
>> Journal features: journal_incompat_revoke
>> Journal size: 4096M
>> Journal length: 1048576
>> Journal sequence: 0x0a280713
>> Journal start: 0
>> MMP_block:
>> mmp_magic: 0x4d4d50
>> mmp_check_interval: 6
>> mmp_sequence: 0xff4d4d50
>> mmp_update_date: Wed Jun 21 10:43:51 2023
>> mmp_update_time: 1687358631
>> mmp_node_name: hyd-mds1.uncc.edu
>> mmp_device_name: dm-0
>>


-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
