> After a power spike this weekend that crashed several machines
> (not the OSS'es...) and/or possibly hitting 100% file space
> usage on one of them (we have been dangerously close for a
> while), it hung this morning.

That's fairly clear, but did you check whether all the drives involved are entirely error free? How do you know your storage system is still good to use? Also, did you have battery backup for at least the storage HAs?

> After restarting, it showed many files as missing. [ ... ]

> Now I am afraid that if I carry on (probably just cycling the
> power, since "reboot" also hangs), it will come back in the
> same state, i.e. 95% of the data gone. Is this already
> irreparably the case, or am I just paranoid? Any suggestions
> would be appreciated (in other words: HELP!!!!).

There is one simple solution: restore from backups. Situations like this are what they are for. It is probably much faster than any attempt at recovery, if the backups are on suitable media. I think that in many cases restoring from backup is faster than running 'fsck' over damaged filesystems.

As to that, I reckon it is little appreciated that the most cost-effective way to back up a large Lustre storage pool efficiently may be another Lustre storage pool, and Lustre can make pretty good backup servers (excellent sequential write rates from cheap low-IOPS drives, over Ethernet).

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
