I've just gotten done recovering from a weird problem on our test system, and
wondered if anybody else has seen this kind of thing.

Lustre 1.6b5
Kernel vanilla (more or less) 2.6.15
Opteron smp clients and servers

The clients boot diskless, using lustre as their rootfs.  The servers have
over the last few weeks been bounced numerous times due to me playing around
with hardware while evaluating interface cards and stuff.  Some of the time I
didn't restart the clients when rebooting the servers, either because I forgot
or because something crashed unexpectedly.

Up until today, it seemed like everything was behaving itself pretty well.
The clients didn't work, of course, when their rootfs was down, but when the
servers came back up, the clients seemed to reconnect and carry on, so I
didn't worry overly much about it.

Today I tripped over evidence that the rootfs was corrupted, and that the
corruption wasn't limited to in-memory structures; it was on the disks.  I
rebuilt the fs (i.e. reformatted all the OSTs and things), installed a new
rootfs, and then discovered that it was *still* corrupted.  Looking at server
logs, there were a number of errors from one OSS where one of the old clients
had been trying to reconnect to it and replay transactions.

So:  Is it an incredibly bad idea to allow an old stale client to try to
reconnect to a freshly-reconstituted server?  I had the impression that lustre
had sufficient protocol in place to avoid that kind of skewage causing
problems, but if that's not the case, it would certainly account for the
lossage.  If that is supposed to be safe, I guess this means I've probably
found a bug, and should try to characterize it further.
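For what it's worth, the kind of guard I assumed was in place can be sketched as a
generation check at connect time.  This is purely illustrative; the names here
(fs_generation, handle_connect) are hypothetical and not actual Lustre protocol
fields:

```python
# Illustrative sketch only: reject reconnects from clients that were talking
# to a previous incarnation of the filesystem.  These names are hypothetical,
# not real Lustre protocol fields.
import uuid

class Server:
    def __init__(self):
        # A new random identity every time the filesystem is reformatted.
        self.fs_generation = uuid.uuid4().hex

    def handle_connect(self, client_generation):
        # A fresh client with no recorded generation is welcome; it adopts ours.
        if client_generation is None:
            return ("connected", self.fs_generation)
        # A stale client from an older incarnation must not be allowed to
        # replay transactions against the rebuilt filesystem.
        if client_generation != self.fs_generation:
            return ("stale-refused", self.fs_generation)
        return ("reconnected", self.fs_generation)

old_server = Server()
status, gen = old_server.handle_connect(None)  # client's first mount
new_server = Server()                          # filesystem reformatted
status2, _ = new_server.handle_connect(gen)    # stale client tries to replay
print(status, status2)                         # connected stale-refused
```

If the servers don't do something along these lines, the stale replay I saw
would be exactly the failure mode you'd expect.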

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss