Hi Rick,
Thanks for your suggestions. It turns out I was able to get the filesystem
started this morning and restore access to the critical data. It was a
long troubleshooting journey, but here are the steps I ended up taking
to fix the issue:
- stop the lustre filesystem (umount the osts and mdt/mgt)
- mount the ldiskfs filesystem for the problematic ost (/dev/mapper/ost5
to /mnt/ost5 in this case)
- backup the CONFIGS/lfs1-client file
# cp -a /mnt/ost5/CONFIGS/lfs1-client /mnt/ost5/CONFIGS/lfs1-client.ORIG
- copy a working, non-corrupted 'lfs1-client' file from the MGS (taken
from the mounted ldiskfs filesystem on the MGS)
(llog_reader gave unexpected output when I ran it against the bad
lfs1-client file, which pointed to corruption in that file)
- umount all ldiskfs filesystems
- run a writeconf on the MDT and all OSTs
# tunefs.lustre --verbose --writeconf /dev/mapper/ostX
- restart the filesystem
(this is where lfs1-OST0006 finally mounted!)
- mount the filesystem on a client
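The per-OST part of the steps above can be sketched as a small shell
script. This is only a sketch under assumptions, not necessarily the exact
commands that were run: the `run` helper, the function names, and the
/tmp/lfs1-client.good path (a known-good copy already fetched from the MGS)
are hypothetical. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Dry run by default: commands are printed with a "+ " prefix, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Replace the suspect lfs1-client config log on one OST.
# $1 = OST block device, $2 = ldiskfs mount point,
# $3 = known-good lfs1-client copy (hypothetical local path)
replace_config_log() {
    dev="$1"; mnt="$2"; good="$3"
    run mount -t ldiskfs "$dev" "$mnt"
    # Inspect the existing log first; odd output here suggested corruption.
    run llog_reader "$mnt/CONFIGS/lfs1-client"
    # Keep the original before overwriting it.
    run cp -a "$mnt/CONFIGS/lfs1-client" "$mnt/CONFIGS/lfs1-client.ORIG"
    run cp -a "$good" "$mnt/CONFIGS/lfs1-client"
    run umount "$mnt"
}

# Run writeconf on every device passed as an argument.
writeconf_all() {
    for target in "$@"; do
        run tunefs.lustre --verbose --writeconf "$target"
    done
}

replace_config_log /dev/mapper/ost5 /mnt/ost5 /tmp/lfs1-client.good
writeconf_all /dev/mapper/ost5
```

Only set DRY_RUN=0 once all Lustre targets are unmounted. The MDT device
and the remaining OST devices would have to be added to the writeconf_all
call; they are left out here because their device names are not given above.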
Our setup has 2 OSS servers (oss1 and oss2), each serving 3 OSTs:
oss1:
/mnt/ost0
/mnt/ost1
/mnt/ost2
oss2:
/mnt/ost3
/mnt/ost4
/mnt/ost5
I'm sending this out for reference.
Thanks again,
Rafael
On 11/23/2015 10:57 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
> On Nov 22, 2015, at 6:12 PM, Perez, Rafael <[email protected]> wrote:
>> LustreError: 10476:0:(mgc_request.c:1707:mgc_llog_local_copy())
>> MGC172.31.11.121@o2ib: failed to copy remote log lfs1-client: rc = -5
>> LustreError: 13a-8: Failed to get MGS log lfs1-client and no local copy.
>> LustreError: 15c-8: MGC172.31.11.121@o2ib: The configuration from log
>> 'lfs1-client' failed (-2). This may be the result of communication errors
>> between this node and the MGS, a bad configuration, or other errors. See the
>> syslog for more information.
>> LustreError: 10476:0:(obd_mount_server.c:1285:server_start_targets())
>> lfs1-OST0006: failed to start LWP: -2
>
> Does this server have other OSTs that mount? Or is this the only OST on this
> OSS server? You can use tune2fs to list the OST config parameters and verify
> that they are correct. I have also seen this kind of error when there are
> network problems. I would look for IB errors or other signs of problems.
> (Maybe even do a bandwidth test to see if it is performing as expected.) You
> can also run “lctl ping” to test LNet connectivity between the OSS server and
> the MGS server.
>
> If the network checks out and it really is the llog that is the problem, you
> can try doing a writeconf to fix things up.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
--
Rafael Perez
[email protected]
ITD HPC Support, Sr Technology Engineer
(631) 344-4426