Mike,

On the off chance that the recovery process is causing the issue, you could try mounting the MDT with the "abort_recov" option and see if the behavior changes.
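For example (just a sketch, not tested against your setup; the device and mount point below are placeholders for your actual MDT device and mount point):

# mount -t lustre -o abort_recov /dev/sdb /mnt/mdt

abort_recov tells the target to skip waiting for recovery on startup, so if the hang is recovery-related the mount behavior should change.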
--Rick

On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson" <[email protected] on behalf of [email protected]> wrote:

Maybe someone else on the list can add clarity, but I don't believe a recovery process on mount would keep the MDS read-only or trigger that trace. Something else may be going on.

I would start from the ground up. Bring your servers up, unmounted. Ensure LNet is loaded and configured properly. Test LNet using ping or lnet_selftest from your MDS to all of your OSS nodes. Then mount your combined MGS/MDT volume on the MDS and see what happens.

Is your MDS in a high-availability pair? What version of Lustre are you running? ...just a few things readers on the list might want to know.

--Jeff

On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley <[email protected]> wrote:

Jeff,

At this point we have the OSS nodes shut down. We were coming back from a full outage, so we are trying to get the MDS up before starting to bring up the OSS nodes.

Mike

On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <[email protected]> wrote:

Mike,

Have you made sure that the o2ib interfaces on all of your Lustre servers (MDS & OSS) are functioning properly? Are you able to `lctl ping x.x.x.x@o2ib` successfully between MDS and OSS nodes?

--Jeff

On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <[email protected]> wrote:

Rick,

172.16.100.4 is the IB address of one of the OSS servers. I believe the MGT and MDT0 are the same target. My understanding is that we have a single instance of the MGT, which is on the first MDS server, i.e. it was created via a command similar to:

# mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb

Does that make sense?

On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <[email protected]> wrote:

Which host is 172.16.100.4? Also, are the MGT and MDT0 on the same target, or are they two separate targets just on the same host?

--Rick

On 6/21/23, 12:52 PM, "Mike Mosley" <[email protected]> wrote:

Hi Rick,

The MGS/MDS are combined. The output I posted is from the primary.

Thanks,

Mike

On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <[email protected]> wrote:

Mike,

It looks like the MDS server is having a problem contacting the MGS server. I'm guessing the MGS is a separate host? I would start by looking for possible network problems that might explain the LNet timeouts. You can try using "lctl ping" to test the LNet connection between nodes, and you can also try regular "ping" between the IP addresses on the IB interfaces.
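A rough sketch of those checks, using the standard lctl/lnetctl tools (172.16.100.4 is the OSS address taken from the timeout messages; adjust the NIDs/IPs and interface names for your own nodes):

Confirm LNet is loaded and the o2ib NID is configured on each server:

# lnetctl net show
# lctl list_nids

Then test connectivity from the MDS to each OSS, both at the LNet level and with a plain IP ping over the IB interface:

# lctl ping 172.16.100.4@o2ib
# ping -c 3 172.16.100.4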
--Rick

On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via lustre-discuss" <[email protected] on behalf of [email protected]> wrote:

Greetings,

We have experienced some type of issue that is causing both of our MDS servers to only be able to mount the MDT device in read-only mode. Here are some of the error messages we are seeing in the log files below. We lost our Lustre expert a while back, and we are not sure how to proceed to troubleshoot this issue. Can anybody provide us guidance on how to proceed?

Thanks,

Mike

Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more than 120 seconds.
Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123 1 0x00000086
Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb585da9>] schedule+0x29/0x70
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb5838b1>] schedule_timeout+0x221/0x2d0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6b8e5>] ? tracing_is_on+0x15/0x30
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6f5bd>] ? tracing_record_cmdline+0x1d/0x120
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf77d9b>] ? probe_sched_wakeup+0x2b/0xa0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaed7d15>] ? ttwu_do_wakeup+0xb5/0xe0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb58615d>] wait_for_completion+0xfd/0x140
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaedb990>] ? wake_up_state+0x20/0x20
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f529a4>] llog_process_or_fork+0x244/0x450 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f52bc4>] llog_process+0x14/0x20 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f85d05>] class_config_parse_llog+0x125/0x350 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a69fc0>] mgc_process_cfg_log+0x790/0xc40 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6d4cc>] mgc_process_log+0x3dc/0x8f0 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6e15f>] ? config_recover_log_add+0x13f/0x280 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6eb2b>] mgc_process_config+0x88b/0x13f0 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f91b58>] lustre_process_log+0x2d8/0xad0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0e5a177>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f7c8b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc08f4>] server_start_targets+0x13a4/0x2a20 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f94bb0>] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc303c>] server_fill_super+0x10cc/0x1890 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f97a08>] lustre_fill_super+0x468/0x960 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f975a0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0510cf>] mount_nodev+0x4f/0xb0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8f9a8>] lustre_mount+0x38/0x60 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb051c4e>] mount_fs+0x3e/0x1b0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0707a7>] vfs_kern_mount+0x67/0x110
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb072edf>] do_mount+0x1ef/0xd00
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb049d7a>] ? __check_object_size+0x1ca/0x250
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0288ec>] ? kmem_cache_alloc_trace+0x3c/0x200
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb073d33>] SyS_mount+0x83/0xd0
Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb592ed2>] system_call_fastpath+0x25/0x2a
Jun 20 15:13:14 hyd-mds1 kernel: LNet: 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.16.100.4@o2ib: 9 seconds
Jun 20 15:13:14 hyd-mds1 kernel: LNet: 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous similar messages
Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more than 120 seconds.
Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123 1 0x00000086

dumpe2fs seems to show that the file systems are clean, i.e.:

dumpe2fs 1.45.6.wc1 (20-Mar-2020)
Filesystem volume name:   hydra-MDT0000
Last mounted on:          /
Filesystem UUID:          3ae09231-7f2a-43b3-a4ee-7f36080b5a66
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              2247671504
Block count:              1404931944
Reserved block count:     70246597
Free blocks:              807627552
Free inodes:              2100036536
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         20472
Fragments per group:      20472
Inodes per group:         32752
Inode blocks per group:   8188
Flex block group size:    16
Filesystem created:       Thu Aug  8 14:21:01 2019
Last mount time:          Tue Jun 20 15:19:03 2023
Last write time:          Wed Jun 21 10:43:51 2023
Mount count:              38
Maximum mount count:      -1
Last checked:             Thu Aug  8 14:21:01 2019
Check interval:           0 (<none>)
Lifetime writes:          219 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               1024
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      2e518531-82d9-4652-9acd-9cf9ca09c399
Journal backup:           inode blocks
MMP block number:         1851467
MMP update interval:      5
User quota inode:         3
Group quota inode:        4
Journal features:         journal_incompat_revoke
Journal size:             4096M
Journal length:           1048576
Journal sequence:         0x0a280713
Journal start:            0
MMP_block:
    mmp_magic: 0x4d4d50
    mmp_check_interval: 6
    mmp_sequence: 0xff4d4d50
    mmp_update_date: Wed Jun 21 10:43:51 2023
    mmp_update_time: 1687358631
    mmp_node_name: hyd-mds1.uncc.edu
    mmp_device_name: dm-0
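(For anyone reproducing this, the superblock dump above is the kind of output produced by a header-only dumpe2fs run, e.g. something like:

# dumpe2fs -h /dev/sdb

where /dev/sdb stands in for whatever block device the MDT actually sits on; dumpe2fs only reads the filesystem, and -h restricts it to the superblock information.)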
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing
[email protected]
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845   m: 619-204-9061
4170 Morena Boulevard, Suite C - San Diego, CA 92117
High-Performance Computing / Lustre Filesystems / Scale-out Storage
