Hi Rick,

Thanks for your suggestions. It turns out I was able to get the filesystem started this morning and restore access to the critical data. It was a long round of troubleshooting, but here are the steps I ended up taking to fix the issue.

- stop the Lustre filesystem (umount the OSTs and the MDT/MGT)
- mount the ldiskfs filesystem for the problematic OST (/dev/mapper/ost5 to /mnt/ost5 in this case)
- back up the CONFIGS/lfs1-client file
    # cp -a /mnt/ost5/CONFIGS/lfs1-client /mnt/ost5/CONFIGS/lfs1-client.ORIG
- copy a working, non-corrupted 'lfs1-client' file from the MGS (taken from the mounted ldiskfs filesystem on the MGS) over the bad copy on the OST
    (the bad lfs1-client file showed signs of corruption: llog_reader gave unexpected output when I ran it against that file)
- umount all ldiskfs filesystems
- run a writeconf to the MDS and all OSTs
    # tunefs.lustre --verbose --writeconf /dev/mapper/ostX
- restart the filesystem
    (this is where lfs1-OST0006 finally mounted!)
- mount the filesystem on a client
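
For anyone who wants to script the config-log replacement part of the list above (mount ldiskfs, back up, copy, umount), here is a rough dry-run sketch. It only prints the commands unless DRY_RUN=0 is set. The device and mount point are the ones from my steps; GOOD_LOG is a made-up staging path and assumes the good lfs1-client file has already been copied over from the MGS by whatever means you prefer. Treat it as an outline under those assumptions, not a tested recovery tool.

```shell
#!/bin/sh
# Dry-run sketch: prints each command with a leading "# " instead of running it.
# Set DRY_RUN=0 only after checking every path against your own system.
DRY_RUN="${DRY_RUN:-1}"

OST_DEV=/dev/mapper/ost5          # from the steps above
OST_MNT=/mnt/ost5                 # from the steps above
LOG=CONFIGS/lfs1-client           # config log that was corrupted
GOOD_LOG=/root/lfs1-client.good   # ASSUMPTION: hypothetical path of the good copy from the MGS

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "# $*"
    else
        "$@"
    fi
}

replace_config_log() {
    run mount -t ldiskfs "$OST_DEV" "$OST_MNT"
    # Keep the suspect file so it can still be inspected later.
    run cp -a "$OST_MNT/$LOG" "$OST_MNT/$LOG.ORIG"
    # Overwrite it with the known-good copy taken from the MGS.
    run cp -a "$GOOD_LOG" "$OST_MNT/$LOG"
    run umount "$OST_MNT"
}

replace_config_log
```

The Lustre filesystem has to be stopped before any of this, as in the first step of the list.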

Our setup has two OSS nodes (oss1 and oss2), each serving three OSTs:
oss1:
/mnt/ost0
/mnt/ost1
/mnt/ost2

oss2:
/mnt/ost3
/mnt/ost4
/mnt/ost5
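
With that layout, the writeconf step covers the MDT plus six OSTs. A dry-run sketch of the loop is below. The device names are assumptions: only /dev/mapper/ost5 appears in my notes, so ost0 through ost4 follow the same pattern and /dev/mapper/mdt is a placeholder for the MDT device. As I understand the Lustre manual's writeconf procedure, the MDT is done before the OSTs, so the sketch keeps that order. Each command has to run on the server that can see the device, so split the list per server if the storage is not shared.

```shell
#!/bin/sh
# Dry-run sketch: prints each command with a leading "# " instead of running it.
DRY_RUN="${DRY_RUN:-1}"

MDT_DEV=/dev/mapper/mdt   # ASSUMPTION: placeholder name for the MDT device
# ASSUMPTION: OST device names follow the /dev/mapper/ostX pattern.
OST_DEVS="/dev/mapper/ost0 /dev/mapper/ost1 /dev/mapper/ost2 /dev/mapper/ost3 /dev/mapper/ost4 /dev/mapper/ost5"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "# $*"
    else
        "$@"
    fi
}

writeconf_all() {
    # MDT first, then every OST; all targets must be unmounted.
    run tunefs.lustre --verbose --writeconf "$MDT_DEV"
    for dev in $OST_DEVS; do
        run tunefs.lustre --verbose --writeconf "$dev"
    done
}

writeconf_all
```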

I'm sending this out for reference.

Thanks again,
Rafael


On 11/23/2015 10:57 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
On Nov 22, 2015, at 6:12 PM, Perez, Rafael <[email protected]> wrote:

LustreError: 10476:0:(mgc_request.c:1707:mgc_llog_local_copy()) 
MGC172.31.11.121@o2ib: failed to copy remote log lfs1-client: rc = -5
LustreError: 13a-8: Failed to get MGS log lfs1-client and no local copy.
LustreError: 15c-8: MGC172.31.11.121@o2ib: The configuration from log 
'lfs1-client' failed (-2). This may be the result of communication errors 
between this node and the MGS, a bad configuration, or other errors. See the 
syslog for more information.
LustreError: 10476:0:(obd_mount_server.c:1285:server_start_targets()) 
lfs1-OST0006: failed to start LWP: -2

Does this server have other OSTs that mount? Or is this the only OST on this 
OSS server?  You can use tune2fs to list the OST config parameters and verify 
that they are correct.  I have also seen this kind of error when there are 
network problems.  I would look for IB errors or other signs of problems.  
(Maybe even do a bandwidth test to see if it is performing as expected.)  You 
can also run “lctl ping” to test LNet connectivity between the OSS server and 
the MGS server.

If the network checks out and it really is the llog that is the problem, you 
can try doing a writeconf to fix things up.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


--
Rafael Perez
[email protected]
ITD HPC Support, Sr Technology Engineer
(631) 344-4426

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
