This is fantastic output, and you guessed correctly: we use MD RAID on ldiskfs. 
We're getting ext4 I/O errors on boot, but we were able to reboot enough times 
to reach a state where the mount is readable, so we are moving data off that 
mount onto another and rebuilding (which is the intended design). Sometimes we 
could restore the mount easily; other times we had to run fsck.ext4 to repair 
or relocate bad blocks. That back and forth is what got us to the point where 
we can move data and rebuild as we are doing now.
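While cycling through fsck and rebuilds, it can help to check /proc/mdstat for 
a degraded array before trusting a mount. A minimal sketch; the helper name and 
the optional file argument are my own (the argument only exists so the logic 
can be exercised offline):

```shell
# Hypothetical helper: succeeds (exit 0) if any md array in an
# mdstat-style file shows a missing member ("_" in the [UU] bitmap).
# Reads /proc/mdstat by default; the file argument is for offline testing.
md_degraded() {
    local src="${1:-/proc/mdstat}"
    # mdstat status lines end with a member bitmap such as [UU] (healthy)
    # or [U_] (one member missing/failed).
    grep -Eo '\[[U_]+\]' "$src" | grep -q '_'
}

# Example use on a live system:
#   md_degraded && echo "array degraded - check mdadm --detail /dev/mdX"
```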

I will definitely be able to use your input in testing further.

Thanks,
JC



Internal
From: lustre-discuss <[email protected]> On Behalf Of 
Cameron Harr via lustre-discuss
Sent: Tuesday, August 9, 2022 5:41 PM
To: [email protected]
Subject: Re: [lustre-discuss] Lustre 2.12.6 on RHEL 7.9 not able to mount disks 
after reboot




JC,

The message asking if the MGS is running is a pretty common error that you'll 
see when something isn't right. There's not a lot of detail in your message, 
but the first step is to make sure your OST device is present on the OSS 
server. You mentioned remounting the RAID directories; is this software/MD 
RAID? Are you using ldiskfs or ZFS for the backend storage? (I'll guess ldiskfs 
if you're using MD RAID.)

If you've already verified the OST volume is present, see if you can 'lctl 
ping' between the MDS and OSS nodes. I'm not sure what your background is, so 
forgive me if this is too elementary, but on each node, run 'lctl list_nids' 
to get the Lustre node identifier (NID), then run 'lctl ping <NID>' to make 
sure the nodes can talk Lustre/LNet to each other:

[root@tin1:~]# lctl list_nids
192.168.101.1@o2ib1

[root@tin6:~]# lctl list_nids
192.168.101.6@o2ib1

[root@tin6:~]# lctl ping 192.168.101.1@o2ib1
12345-0@lo
12345-192.168.101.1@o2ib1

[root@tin1:~]# lctl ping 192.168.101.6@o2ib1
12345-0@lo
12345-192.168.101.6@o2ib1

If you get a failure (like I/O Error), then you have a communications problem 
and you'll want to make sure all the correct interfaces are up. If the pings do 
work, then you'll want to look for messages in /var/log/lustre and dmesg.
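If you have several peers to verify, the ping check above can be wrapped in a 
small loop. A sketch; the function name is my own, and the NIDs in the example 
are the addresses from the session above:

```shell
# Hypothetical wrapper: 'lctl ping' each peer NID given as an argument,
# print OK/FAIL per peer, and return nonzero if any peer is unreachable.
check_lnet_peers() {
    local rc=0 nid
    for nid in "$@"; do
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "OK   $nid"
        else
            echo "FAIL $nid"
            rc=1
        fi
    done
    return $rc
}

# Example: check_lnet_peers 192.168.101.1@o2ib1 192.168.101.6@o2ib1
```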

Cameron
On 8/9/22 06:45, Crowder, Jonathan via lustre-discuss wrote:
Hello, this is my first post here, so I may need some guidance on how this 
list works.

I am on a small team supporting some 36TB Lustre servers for a business unit. 
Our configuration per mount point is one Lustre master node and three Lustre 
object stores. We lost one of the object stores to an unidentified reboot, and 
after the Azure cloud teams got it booted back into the Lustre kernel, we could 
not get it to remount the RAID directories for storage to the local file paths 
we have set up for them. I can obtain the output soon; it knows the MGS node, 
but asks if it's running. I am having difficulty investigating why this is 
happening, as the other object stores are working without issue.

Thanks,
JC




_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
