Hi,

Thanks a lot. The second race mentioned below explains it all.

Now I understand the significance of adding I_CLEAR. I had actually never
noticed that I_CLEAR is directly assigned to i_state. Since that assignment
clears the I_FREEING flag, the addition of the I_CLEAR check in
__refile_inode() makes sense.

Thanks,
Rahul

-----Original Message-----
From: Ernie Petrides [mailto:[EMAIL PROTECTED]
Sent: Monday, August 08, 2005 3:20 PM
To: Srivastava, Rahul
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [email protected]
Subject: Re: FW: oops in 2.4.25 prune_icache() called from kswapd

On Monday, 8-Aug-2005 at 11:43 CDT, "Rahul Srivastava" wrote:

> I was just wondering if any of you guys had a chance to validate the
> hypothesis and the proposed fix.

Larry Woodman is the one who worked on this problem, and he agreed (last
week) to follow up on this discussion. Unfortunately, he's away this week.

I've forwarded his mail (posted to an internal Red Hat patch review mailing
list 3 months ago) in the interim; it contains the patch that has been
committed to Red Hat Enterprise Linux version 3 (of which Update 6 is
currently in beta). Hopefully, Larry will follow up when he gets back.

Cheers. -ernie

------- Forwarded Message

From: Larry Woodman <[EMAIL PROTECTED]>
Date: Mon, 09 May 2005 11:39:11 -0400
Subject: [Taroon Patch] fix inode cache deadlock/race...
------------------------------------------------------------------------

Over the past couple of weeks we have seen two races in the inode cache
code. The first is between dispose_list() and __refile_inode(), and the
second is between prune_icache() and invalidate_inodes(). I posted both of
these patches earlier but wanted to make sure they got properly reviewed
and included in RHEL3-U6. Fixes RHEL3 bugzillas 149636 and 155289.

The first scenario is:

1.) cpu0 is in __sync_one(), just about to call __refile_inode() after
taking the inode_lock and clearing I_LOCK.
---------------------------------------------------------
	spin_lock(&inode_lock);
	inode->i_state &= ~I_LOCK;
	if (!(inode->i_state & I_FREEING))
		__refile_inode(inode);
	wake_up(&inode->i_wait);
---------------------------------------------------------

2.) cpu1 is in dispose_list(), where it has dropped the inode_lock and
calls clear_inode(). It doesn't block, because I_LOCK is clear, so it
assigns the inode state directly:

---------------------------------------------------------
void clear_inode(struct inode *inode)
{
	...
	wait_on_inode(inode);
	...
	inode->i_state = I_CLEAR;
	...
}
---------------------------------------------------------

3.) cpu0 calls __refile_inode(), which places the inode on one of the four
possible inode lists:

---------------------------------------------------------
static inline void __refile_inode(struct inode *inode)
{
	struct list_head *to;

	if (inode->i_state & I_DIRTY)
		to = &inode->i_sb->s_dirty;
	else if (atomic_read(&inode->i_count))
		to = &inode_in_use;
	else if (inode->i_data.nrpages)
		to = &inode_unused_pagecache;
	else
		to = &inode_unused;
	list_del(&inode->i_list);
	list_add(&inode->i_list, to);
}
---------------------------------------------------------

4.) cpu1 returns from clear_inode() and then calls destroy_inode(), which
kmem_cache_free()s the inode:

---------------------------------------------------------
static void destroy_inode(struct inode *inode)
{
	if (inode->i_sb->s_op->destroy_inode)
		inode->i_sb->s_op->destroy_inode(inode);
	else
		kmem_cache_free(inode_cachep, inode);
}
---------------------------------------------------------

5.) At this point we have an inode that has been kmem_cache_free()'d but is
also sitting on one of the lists chosen by __refile_inode(). That can't be
good!!! Also, the code looks the same in RHEL4.

The second scenario is: CPU0 is in prune_icache() called by kswapd, and
CPU1 is in invalidate_inodes() called by the auto-mount daemon.

1.)
CPU0: prune_icache() sets the I_LOCK bit in an inode on the
inode_unused_pagecache list, releases the inode_lock, and calls
invalidate_inode_pages().

2.)
CPU1: invalidate_inodes() calls invalidate_list() for the
inode_unused_pagecache list with the inode_lock held and sets the I_FREEING
bit in inode->i_state.

3.)
CPU0: prune_icache() reacquires the inode_lock and clears the I_LOCK bit in
inode->i_state.

4.)
CPU1: dispose_list() calls clear_inode() without the inode_lock held. Since
the I_LOCK bit is clear, clear_inode() sets inode->i_state = I_CLEAR,
clearing the I_FREEING bit.

5.)
CPU0: prune_icache() calls __refile_inode(), because clear_inode() cleared
I_FREEING without holding the inode_lock. The inode is no longer on the
inode_unused_pagecache list, so it ends up being placed on the inode_unused
list.

6.)
CPU1: dispose_list() calls destroy_inode(), which kmem_cache_free()s an
inode that is still on the inode_unused list.

At this point there is an inode that has been kmem_cache_free()'d and is
also on the inode_unused list.

This patch makes clear_inode() acquire the inode_lock before manipulating
the inode->i_state field; this was the only place in the kernel that
manipulated i_state without holding the inode_lock. It also makes
__refile_inode() treat I_CLEAR the same way as I_FREEING.

--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
 {
 	struct list_head *to;
 
-	if (inode->i_state & I_FREEING)
+	if (inode->i_state & (I_FREEING|I_CLEAR))
 		return;
 	if (list_empty(&inode->i_hash))
 		return;
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
 		cdput(inode->i_cdev);
 		inode->i_cdev = NULL;
 	}
+	spin_lock(&inode_lock);
 	inode->i_state = I_CLEAR;
+	spin_unlock(&inode_lock);
 }
 
 /*

------- End of Forwarded Message
