Thanks, guys. Looks like unmounting the "unhealthy" OST filesystem and running an fsck on it (which found several errors) solved the problem! I still don't understand why it looked different from different clients...
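For the archives, the repair boiled down to a sequence like the one below. The device, mountpoint, and fsck flags are placeholders rather than my exact ones, and the script only echoes each command as a dry run, so adapt before using:

```shell
#!/bin/sh
# Dry-run sketch of the OST repair sequence. DEV and MNT are placeholders;
# on a real system you would use the backing device and mountpoint of the
# unhealthy OST, and the Lustre-patched e2fsck.
DEV=/dev/sdX
MNT=/mnt/ost5

# run: echo the command instead of executing it (swap in "$@" when ready).
run() { echo "+ $*"; }

run umount "$MNT"                  # take the OST offline
run e2fsck -f "$DEV"               # full check of the backing ldiskfs
run mount -t lustre "$DEV" "$MNT"  # bring the OST back into the filesystem
```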
Cheers,

  Herbert

Oleg Drokin wrote:
> Hello!
>
> So are there any other complaints on the OSS node when you mount that OST?
> Did you try to run e2fsck on the OST disk itself (while unmounted)? I
> assume one of the possible problems is just on-disk fs corruption
> (and it might show unhealthy due to that right after mount too).
>
> Bye,
>    Oleg
>
> On Nov 18, 2010, at 1:47 PM, Herbert Fruchtl wrote:
>
>> Sorry, I had meant to cc this to the list.
>>
>>   Herbert
>>
>> From: Herbert Fruchtl <[email protected]>
>> Date: November 18, 2010 12:56:53 PM EST
>> To: Kevin Van Maren <[email protected]>
>> Subject: Re: [Lustre-discuss] Broken client
>>
>> Hi Kevin,
>>
>> That didn't change anything. Unmounting one of the OSTs hung (yes, with an
>> LBUG), and I did a hard reboot. It came up again, and the status is as
>> before: on the MDT server, I can see all files (well, I assume it's all); on
>> the client in question some files appear broken. The OST is still "not
>> healthy". I am running another lfsck, without much hope. Here's the LBUG:
>>
>> Nov 18 17:05:16 oss1-fs kernel: LustreError:
>> 8125:0:(lprocfs_status.c:865:lprocfs_free_client_stats()) LBU
>>
>>   Herbert
>>
>> Kevin Van Maren wrote:
>>> Reboot the server with the unhealthy OST.
>>>
>>> If you look at the logs, there is likely an LBUG that is causing the
>>> problems.
>>>
>>> Kevin
>>>
>>> On Nov 18, 2010, at 9:51 AM, Herbert Fruchtl <[email protected]> wrote:
>>>>> It looks like you may have corruption on the MDT or an OST, where the
>>>>> objects on an OST can't be found for the directory entry. Have you
>>>>> had a crash recently or run Lustre fsck? You might need to do fsck and
>>>>> delete (unlink) the "broken" files.
>>>>>
>>>> The files do exist (I can see them on the MDT server) and I don't want to
>>>> delete them. There was a crash lately, and I have run an lfsck afterwards
>>>> (repeatedly, actually).
>>>>
>>>>> I suppose it's also possible you're seeing fallout from an earlier LBUG or
>>>>> something. Try 'cat /proc/fs/lustre/health_check' on all the servers.
>>>>>
>>>> There seems to be a problem:
>>>>
>>>> [r...@master ~]# cat /proc/fs/lustre/health_check
>>>> healthy
>>>> [r...@master ~]# ssh oss1 'cat /proc/fs/lustre/health_check'
>>>> device home-OST0005 reported unhealthy
>>>> NOT HEALTHY
>>>> [r...@master ~]# ssh oss2 'cat /proc/fs/lustre/health_check'
>>>> healthy
>>>> [r...@master ~]# ssh oss3 'cat /proc/fs/lustre/health_check'
>>>> healthy
>>>>
>>>> What do I do about the unhealthy OST?
>>>>
>>>>   Herbert

--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland:
No SC013532

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
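For anyone scripting the per-server check shown above: health_check prints "healthy" when all devices are fine, and otherwise lists each bad device ("device <name> reported unhealthy") followed by "NOT HEALTHY", so the parsing can be reduced to a small helper. This is a sketch, not from the thread itself; the function name is illustrative, and in real use the input would come from `ssh "$host" 'cat /proc/fs/lustre/health_check'`:

```shell
#!/bin/sh
# classify_health: read health_check output on stdin and print a
# one-word verdict. An unhealthy server's output ends with "NOT HEALTHY";
# a healthy one prints just "healthy".
classify_health() {
    if grep -q 'NOT HEALTHY'; then
        echo unhealthy
    else
        echo healthy
    fi
}

# Replay the oss1 output quoted in the thread:
printf 'device home-OST0005 reported unhealthy\nNOT HEALTHY\n' | classify_health
# prints "unhealthy"
```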
