> On Aug 11, 2016, at 5:42 AM, E.S. Rosenberg <[email protected]> 
> wrote:
> 
> Our MDT suffered a kernel panic (which I will post separately), the OSSs 
> stayed alive but the MDT was out for some time while nodes still tried to 
> interact with lustre.
> 
> So I have several questions:
> a. what happens to processes reading/writing during such an event (if they 
> already have handles on the OSS, for instance, does that make a difference)? I 
> noticed several of our compute-nodes ended up filling their swap/RAM so I 
> assume some level of caching is happening until the MDT returns….

In theory, the processes should just hang until the client can contact the 
server again.  In my experience, this works a large fraction of the time (I 
have occasionally done server reboots on a production file system that was in 
use in order to fix some problems), but I wouldn’t say it is 100% guaranteed.
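
If you want to see which processes on a client are stuck while the server is 
away, here is a minimal sketch. It assumes a Linux client and assumes that 
processes blocked on Lustre I/O show up in uninterruptible sleep (state "D" in 
/proc/<pid>/stat), which is typical but not something specific to Lustre. The 
function names parse_stat_line and blocked_processes are just illustrative, 
not part of any Lustre tool.

```python
import os


def parse_stat_line(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    # Assumption: run on the compute node itself while the MDS is down.
    for pid, comm in blocked_processes():
        print(pid, comm)
```

A "D" state only says the process is waiting in the kernel, not what it is 
waiting on, so treat the output as a hint rather than proof that Lustre is the 
cause.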

> b. what is the best/proper procedure now to ensure filesystem integrity?
> Should I take the filesystem offline and run an lfsck first on the MDT then 
> on the OSS?

If the MDS crashed, then you may want to check the MDT.  But if the OSS was 
still up, I don’t think there should be any problem with the OSTs that would 
require a fsck.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
