> On Aug 11, 2016, at 5:42 AM, E.S. Rosenberg <[email protected]> wrote:
>
> Our MDT suffered a kernel panic (which I will post separately), the OSSs
> stayed alive but the MDT was out for some time while nodes still tried to
> interact with lustre.
>
> So I have several questions:
> a. what happens to processes/reading writing during such an event (if they
> already have handles on the OSS for instance that makes a difference)? I
> noticed several of our compute-nodes ended up filling their swap/RAM so I
> assume some level of caching is happening until the MDT returns….

In theory, the processes should just hang until the client can contact the server again. In my experience, this works a large fraction of the time (I have occasionally rebooted servers on a production file system that was in use in order to fix problems), but I wouldn’t say it is 100% guaranteed.

> b. what is the best/proper procedure now to ensure filesystem integrity?
> Should I take the filesystem offline and run an lfsck first on the MDT then
> on the OSS?

If the MDS crashed, then you may want to check the MDT. But if the OSS was still up, I don’t think there should be any problem with the OSTs that would require a fsck.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
