Hi all: We have a Linux cluster (CentOS 6.5, Lustre 1.8.9-wcl) which mounts a Lustre FS from a CentOS-based server appliance (Lustre 2.1.0).
The Lustre cluster has 4 OSSes arranged as two failover pairs. Due to bad luck, one OSS is unbootable, and replacing it will require taking its live partner down too (though not any of the other Lustre servers).

We can prevent I/O to the Lustre FS by suspending (kill -STOP) the user processes on the cluster compute nodes before the maintenance work, and resuming them (kill -CONT) afterwards. I don't know what would happen, though, in those cases where the STOP'd process has an open file descriptor on the Lustre FS. If the relevant OSS/OSTs become unavailable, and then available again, during the STOP'd time, what would happen when the process is CONT'd?

I tried a Web search on this, but the best I could find either assumed that one member of a failover pair would remain available, or was specifically about evictions (which I guess are a risk of this maintenance procedure anyway). I did find one doc ( http://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency ) which suggested that silent data corruption was a possibility in the event of evictions. But what about non-evicted clients with open filehandles?

Thanks for any insight!

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center (GACRC)
Enterprise IT Svcs, the University of Georgia

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
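For concreteness, the suspend/resume step described above could be sketched roughly as follows. This is only an illustration of the kill -STOP / kill -CONT idea, not a tested maintenance procedure: the helper names `pids_with_open_files_under` and `set_suspended` and the mount point `/lustre` are hypothetical, and the scan assumes a Linux `/proc` filesystem and enough privileges to read other users' `/proc/<pid>/fd` entries. It says nothing about how the Lustre client itself behaves while the OSTs are away.

```python
import os
import signal


def pids_with_open_files_under(mount_point):
    """Return PIDs holding an open file descriptor below mount_point.

    Hypothetical helper: scans /proc/<pid>/fd on Linux. Processes whose
    fd directory cannot be read (permissions, already exited) are skipped.
    """
    prefix = os.path.realpath(mount_point).rstrip("/") + "/"
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target.startswith(prefix):
                pids.append(int(entry))
                break
    return pids


def set_suspended(pids, suspended):
    """Send SIGSTOP (suspended=True) or SIGCONT (False) to each PID.

    Returns the PIDs that were actually signalled; PIDs that no longer
    exist are silently skipped.
    """
    sig = signal.SIGSTOP if suspended else signal.SIGCONT
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass
    return signalled


# Example use on a compute node (the "/lustre" path is an assumption):
#   stopped = set_suspended(pids_with_open_files_under("/lustre"), True)
#   ... maintenance window ...
#   set_suspended(stopped, False)
```

Recording the list returned by the STOP call gives the set of processes that held open descriptors on the filesystem going into the maintenance window, which are the ones this question is about.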
