On Tue 10-03-15 15:45:21, Josef Bacik wrote:
> From: Dave Chinner <[email protected]>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <[email protected]>
  The code looks mostly fine. Two comments below.
...
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must 
> have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +     BUG_ON(!(inode->i_state & I_FREEING));
> +
>       spin_lock(&inode->i_lock);
>       __inode_wait_for_writeback(inode);
>       spin_unlock(&inode->i_lock);
> +
> +     /*
> +      * bd_inode's will have already had the bdev free'd so don't bother
> +      * doing the bdi_clear_inode_writeback.
> +      */
> +     if (!sb_is_blkdev_sb(inode->i_sb))
> +             bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
  Umm, I don't get this even though I've read the comment several times.
Why don't we want to remove bdev inode from writeback list of
blockdev_super? Is the comment suggesting that bdev inode cannot be on
writeback list?

> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
>        */
>       WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -     mutex_lock(&sb->s_sync_lock);
> -     spin_lock(&sb->s_inode_list_lock);
> -
>       /*
> -      * Data integrity sync. Must wait for all pages under writeback,
> -      * because there may have been pages dirtied before our sync
> -      * call, but which had writeout started before we write it out.
> -      * In which case, the inode may not be on the dirty list, but
> -      * we still have to wait for that writeout.
> +      * Data integrity sync. Must wait for all pages under writeback, because
> +      * there may have been pages dirtied before our sync call, but which had
> +      * writeout started before we write it out.  In which case, the inode
> +      * may not be on the dirty list, but we still have to wait for that
> +      * writeout.
> +      *
> +      * To avoid syncing inodes put under IO after we have started here,
> +      * splice the io list to a temporary list head and walk that. Newly
> +      * dirtied inodes will go onto the primary list so we won't wait for
> +      * them. This is safe to do as we are serialised by the s_sync_lock,
> +      * so we'll complete processing the complete list before the next
> +      * sync operation repeats the splice-and-walk process.
> +      *
> +      * Inodes that have pages under writeback after we've finished the wait
> +      * may or may not be on the primary list. They had pages under put IO
> +      * after we started our wait, so we need to make sure the next sync or
> +      * IO completion treats them correctly, Move them back to the primary
> +      * list and restart the walk.
> +      *
> +      * Inodes that are clean after we have waited for them don't belong on
> +      * any list, so simply remove them from the sync list and move onwards.
>        */
> -     list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +     mutex_lock(&sb->s_sync_lock);
> +     spin_lock(&bdi->wb.list_lock);
> +     list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +     while (!list_empty(&sync_list)) {
> +             struct inode *inode = list_first_entry(&sync_list, struct inode,
> +                                                    i_wb_list);
>               struct address_space *mapping = inode->i_mapping;
>  
> +             /*
> +              * We are lazy on IO completion and don't remove inodes from the
> +              * list when they are clean. Detect that immediately and skip
> +              * inodes we don't ahve to wait on.
> +              */
> +             if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +                     list_del_init(&inode->i_wb_list);
> +                     cond_resched_lock(&bdi->wb.list_lock);
> +                     continue;
> +             }
> +
>               spin_lock(&inode->i_lock);
> -             if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -                 (mapping->nrpages == 0)) {
> +             if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
  Indenting looks damaged here...

>                       spin_unlock(&inode->i_lock);
> +                     cond_resched_lock(&bdi->wb.list_lock);
>                       continue;
  Why do we put freeing inodes back to b_writeback list? For I_FREEING and
I_WILL_FREE we could as well just delete them. For I_NEW we would start
waiting for them once I_NEW gets cleared but still I_NEW inodes under
writeback look somewhat worrying (flusher thread skips them so the
semantics isn't clear there). Anyway, probably lets just keep code as is
for I_NEW but moving back to b_writeback for I_FREEING | I_WILL_FREE looks
just dumb.

                                                                Honza
-- 
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to