On Wed, 2008-08-06 at 09:29 -0600, Chris Worley wrote:
> On Wed, Aug 6, 2008 at 9:15 AM, Brian J. Murrell <[EMAIL PROTECTED]> wrote:
> >
> > So, now what does the MDS serving lfs-MDT0000 say about this? Why did
> > it evict? What version of Lustre is this? Perhaps you said so already
> > and I have just forgotten.
>
> 1.6.5.1 clients w/ 1.6.4.3 OSS's.
>
> The MDS is very verbose. I get these all the time, even prior to the error:
>
> Lustre: lfs-OST0000: haven't heard from client
> 12f00621-096c-b331-8774-abfc72dfd822 (at [EMAIL PROTECTED]) in 92 seconds.
> I think it's dead, and I am evicting it.

Yup. If you can correlate those kinds of messages (they have the client IP address in them) with the errors on the client, you have your eviction events.

I notice that you are getting messages out of dmesg rather than syslog. Syslog makes correlation easier and more definite because of the timestamps.

But this kind of eviction is simply due to clients that are unresponsive from the POV of the MDS. They are neither making filesystem RPCs nor "ping"ing (sending keepalives), so the MDS assumes they have died and evicts them, both to get back any locks they could be holding and to keep a dead client from holding up other, living clients.

So you need to investigate why the clients are dying, or appear to be dead (i.e. going silent), to the MDS.

b.
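
For what it's worth, the correlation described above could be scripted roughly as follows. This is only a sketch, not anything shipped with Lustre: it assumes the messages land in syslog with the classic "Mon DD HH:MM:SS host kernel:" prefix, and the host name (oss1), the NID (10.0.0.5@tcp), the timestamps and the helper names (parse_evictions, errors_near) are all made up for illustration. The message wording itself is taken from the log line quoted above.

```python
import re
from datetime import datetime, timedelta

# Matches a syslog-style line carrying the server's eviction notice.
# The "Mon DD HH:MM:SS host kernel:" prefix is an assumed syslog layout.
EVICT_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Lustre: (?P<target>\S+): haven't heard from client\s+"
    r"(?P<uuid>[0-9a-f-]+)\s+\(at (?P<nid>[^)]+)\)\s+in (?P<secs>\d+) seconds"
)


def parse_evictions(lines, year=2008):
    """Return one dict per eviction notice found in the given syslog lines."""
    events = []
    for line in lines:
        m = EVICT_RE.search(line)
        if m is None:
            continue
        # Classic syslog stamps omit the year, so the caller supplies it.
        stamp = " ".join(m.group("ts").split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append({
            "time": when,
            "target": m.group("target"),
            "uuid": m.group("uuid"),
            "nid": m.group("nid"),
            "silent_for": int(m.group("secs")),
        })
    return events


def errors_near(eviction, client_errors, window=timedelta(minutes=5)):
    """Pick client errors (time, nid, text) from the evicted NID within the window."""
    return [e for e in client_errors
            if e[1] == eviction["nid"] and abs(e[0] - eviction["time"]) <= window]


# Hypothetical sample input: host, NID and timestamps are invented.
SAMPLE = [
    "Aug  6 09:12:01 oss1 kernel: Lustre: lfs-OST0000: haven't heard from client "
    "12f00621-096c-b331-8774-abfc72dfd822 (at 10.0.0.5@tcp) in 92 seconds. "
    "I think it's dead, and I am evicting it.",
    "Aug  6 09:12:05 oss1 kernel: some unrelated message",
]
```

Feeding the server's syslog through parse_evictions() and then calling errors_near() with the client-side errors (collected however you like, as (time, nid, text) tuples) should give the pairs worth looking at. The five-minute window is an arbitrary choice and only means anything if the clocks on the clients and servers are reasonably in sync.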
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
