Hi All,
I was just wondering if any of you guys had a chance to validate the
hypothesis and the proposed fix.
Thanks,
Rahul
-----Original Message-----
From: Srivastava, Rahul
Sent: Tuesday, August 02, 2005 8:32 AM
To: 'Marcelo Tosatti'; 'Ernie Petrides'; 'Larry Woodman'
Subject: RE: oops in 2.4.25 prune_icache() called from kswapd
Hi,
Thanks for reviewing the mail. I was wondering whether the change below
in clear_inode() would close the race window:
In clear_inode(), replace the line:
inode->i_state = I_CLEAR;
with the following piece of code:
*****
	spin_lock(&inode_lock);
	while (inode->i_state & I_LOCK) {
		spin_unlock(&inode_lock);
		__wait_on_inode(inode);
		spin_lock(&inode_lock);
	}
	inode->i_state = I_CLEAR;
	spin_unlock(&inode_lock);
*****
I feel the race is between __sync_one() and iput()/clear_inode()
(also suggested by Albert), which goes as follows:
**************** race condition *******************************************

engine 0:
    calls iput(), which takes inode_lock. iput() removes the inode from
    the i_list and drops inode_lock.

engine 1:
    grabs inode_lock and calls __sync_one().

engine 0:
    calls clear_inode() and gets past the call to wait_on_inode(), which
    checks whether I_LOCK is set.
    /* From this point onwards, clear_inode() and the remainder of
       iput() do not care about I_LOCK or inode_lock. */

engine 1:
    sets I_LOCK.

engine 0:
    sets i_state = I_CLEAR.
    iput() calls destroy_inode().
    kmem_cache_free() returns the inode to the free list of the inode
    cache.

engine 1:
    goes ahead and inserts the freed inode into one of the four
    possible lists.

And we end up with a corrupted inode on the inode list.
Your thoughts please.
Thanks,
Rahul
-----Original Message-----
From: Marcelo Tosatti [mailto:[EMAIL PROTECTED]
Sent: Monday, August 01, 2005 6:08 AM
To: Srivastava, Rahul; Ernie Petrides; Larry Woodman
Subject: Re: oops in 2.4.25 prune_icache() called from kswapd
On Thu, Jul 28, 2005 at 05:57:43PM -0500, Srivastava, Rahul wrote:
> Hi Marcelo,
>
> Thanks a lot for your detailed response.
>
> However, I still don't see how adding I_CLEAR in __refile_inode()
> avoids the race mentioned below.
>
> I understand that the I_CLEAR flag addition is done to ensure that
> __refile_inode() doesn't insert the inode in one of the four lists and
> should just return (in this particular scenario). Please correct me if
> I am wrong in my assumption.
>
> Now if we look at the code without the I_CLEAR flag (unpatched code),
> we are checking for the I_FREEING flag in __refile_inode(). And if an
> inode is marked for freeing (i.e., I_FREEING is set), the current code
> also returns without adding it to one of the four lists.
>
> And there is no scenario, in the code, wherein we can have an inode
> with the I_CLEAR flag set but I_FREEING unset.
I don't disagree: the I_FREEING flag will always be set when an attempt
is made to set I_CLEAR (there are asserts to guarantee that).
So I also can't understand the addition of the I_CLEAR check in
__refile_inode() and its purpose.
Larry, can you clarify for us?
> In a nutshell: if I_CLEAR is set, that means I_FREEING will also be
> set. And since we are already checking for I_FREEING in
> __refile_inode(), wouldn't checking for I_CLEAR be a kind of
> duplication?
>
> Hope I have not misinterpreted the complete story.
>
> Thanks,
> Rahul
>
>
> -----Original Message-----
> From: Marcelo Tosatti [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 28, 2005 5:38 AM
> To: Srivastava, Rahul
> Subject: Re: oops in 2.4.25 prune_icache() called from kswapd
>
>
> On Thu, Jul 28, 2005 at 10:39:12AM -0500, Srivastava, Rahul wrote:
> > Hi Marcelo,
> >
> > I was looking at your fix in inode.c and need a clarification.
> >
> > In the patch you have added the I_CLEAR flag. What is confusing me
> > is that the "I_CLEAR" flag is set only in "clear_inode()". And in
> > this same function we have:
> >
> > **********
> > 	if (!(inode->i_state & I_FREEING))
> > 		BUG();
> > **********
> >
> > So we are setting I_CLEAR only if I_FREEING is set. If that is the
> > case, shouldn't just a check for I_FREEING be enough in
> > __refile_inode()?
>
> The problem is that it is a race between two processors.
>
> > Basically, I am not able to make out the significance of adding
> > I_CLEAR in __refile_inode().
> >
> > + if (inode->i_state & (I_FREEING|I_CLEAR))
>
> That will prevent __refile_inode() from putting a freed inode onto
> the lists...
>
> > Thanks for your time and sorry to bother you,
> > Rahul
>
> No problem.
>
> The description goes like:
>
> The first scenario is:
>
> 1.) cpu0 is in __sync_one() just about to call __refile_inode() after
> taking the inode_lock and clearing I_LOCK.
>
> 	spin_lock(&inode_lock);
> 	inode->i_state &= ~I_LOCK;
> 	if (!(inode->i_state & I_FREEING))
> 		__refile_inode(inode);
> 	wake_up(&inode->i_wait);
>
> 2.) cpu1 is in [dispose_list()], where it has dropped the inode_lock
> and called clear_inode(). It doesn't block, because I_LOCK is clear,
> so it sets the inode state.
>
> void clear_inode(struct inode *inode)
> {
> 	...
> 	wait_on_inode(inode);
> 	...
> 	inode->i_state = I_CLEAR;
> 	...
> }
>
> 3.) cpu0 calls __refile_inode(), which places it on one of the four
> possible inode lists:
>
> static inline void __refile_inode(struct inode *inode)
> {
> 	struct list_head *to;
>
> 	if (inode->i_state & I_DIRTY)
> 		to = &inode->i_sb->s_dirty;
> 	else if (atomic_read(&inode->i_count))
> 		to = &inode_in_use;
> 	else if (inode->i_data.nrpages)
> 		to = &inode_unused_pagecache;
> 	else
> 		to = &inode_unused;
>
> 	list_del(&inode->i_list);
> 	list_add(&inode->i_list, to);
> }
>
> 4.) cpu1 returns from clear_inode() then calls destroy_inode() which
> kmem_cache_free()s it.
>
> static void destroy_inode(struct inode *inode)
> {
> 	if (inode->i_sb->s_op->destroy_inode)
> 		inode->i_sb->s_op->destroy_inode(inode);
> 	else
> 		kmem_cache_free(inode_cachep, inode);
> }
>
> 5.) at this point we have an inode that has been kmem_cache_free()'d
> but is also sitting on one of the lists chosen by __refile_inode();
> that can't be good!!! Also, the code looks the same in RHEL4.
>
>
> >
> >
> >
> >
> > ============================
> >
> > FYI
> >
> > commit cc54d1333e409f714aa9c7db63f7f9ed07cc57a9
> > tree f301f581dd4389028f8b2588940d456904e552f1
> > parent 2e8f68c45925123d33d476ce369b570bd989dd9a
> > author Larry Woodman <[EMAIL PROTECTED]> Fri, 15 Jul 2005 11:32:08
> > -0400 committer Marcelo Tosatti <[EMAIL PROTECTED]> Tue, 26 Jul 2005
> > 07:52:46 -0300
> >
> > [PATCH] workaround inode cache (prune_icache/__refile_inode) SMP
> > races
> >
> > Over the past couple of weeks we have seen two races in the inode
> > cache code. The first is between [dispose_list()] and
> > __refile_inode() and the second is between prune_icache() and
> > truncate_inodes(). I posted both of these patches but wanted to
> > make sure they got properly reviewed and included in RHEL3-U6.
> >
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -297,7 +297,7 @@ static inline void __refile_inode(struct
> >  {
> >  	struct list_head *to;
> >
> > -	if (inode->i_state & I_FREEING)
> > +	if (inode->i_state & (I_FREEING|I_CLEAR))
> >  		return;
> >  	if (list_empty(&inode->i_hash))
> >  		return;
> > @@ -634,7 +634,9 @@ void clear_inode(struct inode *inode)
> >  		cdput(inode->i_cdev);
> >  		inode->i_cdev = NULL;
> >  	}
> > +	spin_lock(&inode_lock);
> >  	inode->i_state = I_CLEAR;
> > +	spin_unlock(&inode_lock);
> >  }
> >
> > /*
> >
> >
> > On Sun, Jun 19, 2005 at 11:07:44PM +0000, Chris Caputo wrote:
> > > My basic repro method was:
> > >
> > > --
> > > 0) start irqbalance
> > > 1) run loop_dbench, which is the following dbench script which uses
> > > client_plain.txt:
> > > client_plain.txt:
> > >
> > > #!/bin/sh
> > >
> > > while [ 1 ]
> > > do
> > >     date
> > >     dbench 2
> > > done
> > >
> > > 2) wait for oops
> > > --
> > >
> > > I think I was using dbench-2.1:
> > >
> > > http://samba.org/ftp/tridge/dbench/dbench-2.1.tar.gz
> > >
> > > In my case irqbalance was key. If I didn't run it, I never got the
> > > problem. I think irqbalance just did a good job of exacerbating the
> > > race condition in some way.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html