Hi Bernd,

> this message just happened on a rather fresh customer system
> and is rather annoying, since it fills the logs...

Your comment that "it fills the logs" suggests to me that the client is
retrying the operation indefinitely, similar to the description in CFS bug
11211 comment 8 (https://bugzilla.clusterfs.com/show_bug.cgi?id=11211#c8). At
that time Andreas Dilger said "the fact that it is retrying on ENOENT is
wrong... should probably only happen for -ETIMEDOUT and -EIO and not other
errors", but I don't see any record of this ever being changed. Bug 11211
records the fix for a memory leak associated with the message, but no change
to the looping behavior.
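
To illustrate the retry policy Andreas described, here is a minimal sketch in
Python. It is not Lustre code: the names RETRYABLE_ERRNOS, should_retry and
run_with_retry are hypothetical, and so is the bounded attempt count. It only
shows the idea of retrying on -ETIMEDOUT and -EIO while giving up immediately
on other errors such as -ENOENT.

```python
import errno

# Hypothetical set of errors worth retrying, following the suggestion quoted
# above: only timeouts and I/O errors, nothing else.
RETRYABLE_ERRNOS = frozenset({errno.ETIMEDOUT, errno.EIO})


def should_retry(rc: int) -> bool:
    """Return True if a kernel-style negative errno code is worth retrying."""
    return rc < 0 and -rc in RETRYABLE_ERRNOS


def run_with_retry(op, max_attempts: int = 5) -> int:
    """Call op() until it succeeds, fails permanently, or attempts run out.

    op is any callable returning 0 (or a positive value) on success and a
    negative errno value on failure. The attempt limit is an assumption of
    this sketch, added so that the loop can never run forever.
    """
    rc = 0
    for _ in range(max_attempts):
        rc = op()
        if rc >= 0 or not should_retry(rc):
            break
    return rc
```

With a policy like this, an operation that fails with -ENOENT is attempted
once and the error is returned to the caller, instead of being retried and
logged on the server each time.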

> After some time there are evictions

In Lustre versions that do not have the fix for bug 11211 (before 1.4.9,
according to bugzilla), the client retry loop will quite quickly consume all of
the memory on the OSS node (about 40 minutes on a server with 2GB RAM in our
experience), and the server will go down. Later Lustre versions do not leak
memory, so the server will stay up, but the looping client will place a
considerable load on it. I would not be surprised if this is the cause of your
evictions.

Joe.

-----Original Message-----
From: Bernd Schubert [mailto:[EMAIL PROTECTED]
Sent: 26 November 2007 18:01
To: Oleg Drokin; [email protected]
Subject: Re: [Lustre-discuss] trying to BRW to non-existent file xyz

On Monday 26 November 2007 18:33:02 you wrote:
> Hello!
>
> On Nov 26, 2007, at 12:08 PM, Bernd Schubert wrote:
> > when an OST reports "trying to BRW to non-existent file xyz", how
> > can I find
> > out which file the inode xyz belongs to?
>
> Usually there is none. Can you tell us more about situations where you
> see this?
> Were there any evictions?

I think so, this message just happened on a rather fresh customer system and
is rather annoying, since it fills the logs...
I can reproduce this rather soon here, I just need to run fsstress for some
hours. After some time there are evictions.

> One common scenario for this kind of errors is this:
> Client opens a file. File gets unlinked. Client is evicted from mds,
> mds notices
> it held last reference to a file and issue destroy request for file
> objects (effectively
> removing file objects on OSTs).
> Now if client would continue to access the file (because it was not
> evicted from ost
> or if it reconnected), you will get these errors.
>
> You can do e.g. lfs find /mountpoint -v for your fs (which I guess
> would take quite a
> while if it's big) and then grep the output for interesting objectid
> (just pay
> attention that ost index should also match).

Thanks, I will try over night, don't want to disturb the people there now.



Thanks a lot,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH


_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
