Hi Jaegeuk,

> -----Original Message-----
> From: Jaegeuk Kim [mailto:jaeg...@kernel.org]
> Sent: Friday, September 25, 2015 2:50 AM
> To: Marc Lehmann
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Subject: Re: [f2fs-dev] sync/umount hang on 3.18.21, 1.4TB gone after crash
> 
> On Wed, Sep 23, 2015 at 11:58:51PM +0200, Marc Lehmann wrote:
> > Hi!
> >
> > I moved one of the SMR disks to another box with a 3.18.21 kernel.
> >
> > I formatted and mounted like this:
> >
> >    /opt/f2fs-tools/sbin/mkfs.f2fs -lTEST -s90 -t0 -a0 /dev/vg_test/test
> >    mount -t f2fs -onoatime,flush_merge,no_heap /dev/vg_test/test /mnt
> >
> > I then copied (tar | tar) 2.1TB of data to the disk, which took about 6
> > hours, which is about the read speed of this data set (so the speed was very
> > good).
> >
> > When I came back after ~10 hours, I found a number of hung task messages
> > in syslog, and when I entered sync, sync was consuming 100% system time.
> 
> Hmm, at this time, it would be good to check what process is stuck through
> sysrq.
> 
> > I took a snapshot of /sys/kernel/debug/f2fs/status before sync, and the
> > values are "frozen", i.e. they didn't change.
> >
> > I was able to read from the mounted filesystem normally, and I was able to
> > read and write the block device itself, so the disk is responsive.
> >
> > After ~1h in this state, I tried to umount, which made the filesystem
> > mountpoint go away, but umount hangs, and /sys/kernel/debug/f2fs/status
> > still doesn't change.
> >
> > This is the output of /sys/kernel/debug/f2fs/status:
> >
> > http://ue.tst.eu/d88ce0e21a7ca0fb74b1ecadfa475df0.txt
> >
> > I then deleted the device, but the echo 1 >/sys/block/sde/device/delete was
> > also hanging.
> >
> > Here are /proc/.../stack outputs of sync, umount and bash(echo):
> >
> >    sync:
> >    [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >    umount:
> >    [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
> >    [<ffffffff811e7ee6>] deactivate_super+0x46/0x70
> >    [<ffffffff81204733>] cleanup_mnt+0x43/0x90
> >    [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
> >    [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
> >    [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
> >    [<ffffffff8178896f>] int_signal+0x12/0x17
> >    [<ffffffffffffffff>] 0xffffffffffffffff
> >
> >    bash (delete):
> >    [<ffffffff810d8917>] msleep+0x37/0x50
> >    [<ffffffff8135d686>] __blk_drain_queue+0xa6/0x1a0
> >    [<ffffffff8135da05>] blk_cleanup_queue+0x1b5/0x1c0
> >    [<ffffffff8152082a>] __scsi_remove_device+0x5a/0xe0
> >    [<ffffffff815208d6>] scsi_remove_device+0x26/0x40
> >    [<ffffffff81520917>] sdev_store_delete+0x27/0x30
> >    [<ffffffff814bf748>] dev_attr_store+0x18/0x30
> >    [<ffffffff8125bc4d>] sysfs_kf_write+0x3d/0x50
> >    [<ffffffff8125b154>] kernfs_fop_write+0xe4/0x160
> >    [<ffffffff811e51a7>] vfs_write+0xb7/0x1f0
> >    [<ffffffff811e5c26>] SyS_write+0x46/0xb0
> >    [<ffffffff817886cd>] system_call_fastpath+0x16/0x1b
> >    [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > After a forced reboot, I did a fsck, and got this, which looks good except
> > for the "Wrong segment type" message, which hopefully is harmless.
> >
> > http://ue.tst.eu/4c750d2301a581cb07249d607aa0e6d0.txt
> >
> > After mounting, status was this (and was changing):
> >
> > http://ue.tst.eu/6462606ac3aa85bde0d6674365c86318.txt
> >
> > Note that 1.4TB of data are missing(!)
> >
> > This large amount of missing data was certainly unexpected. I assume f2fs
> > stopped checkpointing earlier, and only after a checkpoint the data is
> > safe, but being able to write 1.4TB of data without it ever reaching the
> > disk is very unexpected behaviour for a filesystem (which normally loses
> > about half a minute of data at most).
> 
> It seems there was no fsync after sync at all. That's why f2fs recovered back
> to the latest checkpoint. Anyway, I'm thinking that it's worth to add a kind
> of periodic checkpoints.

Agreed, I have had that in mind for a long time. Since Yunlei said that they
may lose all the data of newly generated photos after an abnormal poweroff, I
wrote the patch below, but I have not had much time to test and tune it.

I hope that, if you have time, we can discuss the implementation of periodic
cp (checkpoint). Maybe in another thread. :)
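
Until something like this is in the kernel, a userspace stopgap might be to
force a checkpoint from time to time. As far as I understand, syncfs(2) on an
f2fs mount ends up in f2fs_sync_fs(), which writes a checkpoint, so calling it
periodically should bound how much data can be lost. Below is a minimal,
untested sketch of that idea. The helper name periodic_syncfs() and its
parameters are made up for illustration, and it would not help if sync itself
hangs as in Marc's report.

```c
#define _GNU_SOURCE		/* for syncfs() in glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Call syncfs() on the filesystem that contains `path`, `rounds` times,
 * sleeping `interval_sec` seconds between two calls.
 * Returns 0 on success, or a negative errno value on the first failure.
 * (Hypothetical helper, not part of any existing tool.)
 */
int periodic_syncfs(const char *path, unsigned int interval_sec,
		    unsigned int rounds)
{
	unsigned int i;
	int ret = 0;
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	for (i = 0; i < rounds; i++) {
		/* Flush only the filesystem behind fd, not every mount. */
		if (syncfs(fd) < 0) {
			ret = -errno;
			break;
		}
		/* No need to sleep after the last round. */
		if (i + 1 < rounds)
			sleep(interval_sec);
	}

	close(fd);
	return ret;
}
```

For example, periodic_syncfs("/mnt", 30, 120) would sync /mnt every 30 seconds
for about an hour. syncfs() is used instead of sync() so that other mounted
filesystems are left alone.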

From c81c03fb69612350b12a14bccc07a1fd95cf606b Mon Sep 17 00:00:00 2001
From: Chao Yu <chao2...@samsung.com>
Date: Wed, 5 Aug 2015 22:58:54 +0800
Subject: [PATCH] f2fs: support background data flush

Signed-off-by: Chao Yu <chao2...@samsung.com>
---
 fs/f2fs/data.c  | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/f2fs/f2fs.h  |  15 +++++++++
 fs/f2fs/inode.c |  16 +++++++++
 fs/f2fs/namei.c |   7 ++++
 fs/f2fs/super.c |  50 ++++++++++++++++++++++++++--
 5 files changed, 186 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index a82abe9..39b6339 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -20,6 +20,8 @@
 #include <linux/prefetch.h>
 #include <linux/uio.h>
 #include <linux/cleancache.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 
 #include "f2fs.h"
 #include "node.h"
@@ -27,6 +29,104 @@
 #include "trace.h"
 #include <trace/events/f2fs.h>
 
+static void f2fs_do_data_flush(struct f2fs_sb_info *sbi)
+{
+       struct list_head *inode_list = &sbi->inode_list;
+       struct f2fs_inode_info *fi, *tmp;
+       struct inode *inode;
+       unsigned int number;
+
+       spin_lock(&sbi->inode_lock);
+       number = sbi->inode_num;
+       list_for_each_entry_safe(fi, tmp, inode_list, i_flush) {
+
+               if (number-- == 0)
+                       break;
+
+               inode = &fi->vfs_inode;
+
+               /*
+                * If the inode is in evicting path, we will fail to igrab
+                * inode since I_WILL_FREE or I_FREEING should be set in
+                * inode, so after grab valid inode, it's safe to flush
+                * dirty page after unlock inode_lock.
+                */
+               inode = igrab(inode);
+               if (!inode)
+                       continue;
+
+               spin_unlock(&sbi->inode_lock);
+
+               if (!get_dirty_pages(inode))
+                       goto next;
+
+               filemap_flush(inode->i_mapping);
+next:
+               iput(inode);
+               spin_lock(&sbi->inode_lock);
+       }
+       spin_unlock(&sbi->inode_lock);
+}
+
+static int f2fs_data_flush_thread(void *data)
+{
+       struct f2fs_sb_info *sbi = data;
+       wait_queue_head_t *wq = &sbi->dflush_wait_queue;
+       struct cp_control cpc;
+       unsigned long wait_time;
+
+       wait_time = sbi->wait_time;
+
+       do {
+               if (try_to_freeze())
+                       continue;
+               else
+                       wait_event_interruptible_timeout(*wq,
+                                               kthread_should_stop(),
+                                               msecs_to_jiffies(wait_time));
+               if (kthread_should_stop())
+                       break;
+
+               if (sbi->sb->s_writers.frozen >= SB_FREEZE_WRITE)
+                       continue;
+
+               mutex_lock(&sbi->gc_mutex);
+
+               f2fs_do_data_flush(sbi);
+
+               cpc.reason = __get_cp_reason(sbi);
+               write_checkpoint(sbi, &cpc);
+
+               mutex_unlock(&sbi->gc_mutex);
+
+       } while (!kthread_should_stop());
+       return 0;
+}
+
+int start_data_flush_thread(struct f2fs_sb_info *sbi)
+{
+       dev_t dev = sbi->sb->s_bdev->bd_dev;
+       int err = 0;
+
+       init_waitqueue_head(&sbi->dflush_wait_queue);
+       sbi->data_flush_thread = kthread_run(f2fs_data_flush_thread, sbi,
+                       "f2fs_flush-%u:%u", MAJOR(dev), MINOR(dev));
+       if (IS_ERR(sbi->data_flush_thread)) {
+               err = PTR_ERR(sbi->data_flush_thread);
+               sbi->data_flush_thread = NULL;
+       }
+
+       return err;
+}
+
+void stop_data_flush_thread(struct f2fs_sb_info *sbi)
+{
+       if (!sbi->data_flush_thread)
+               return;
+       kthread_stop(sbi->data_flush_thread);
+       sbi->data_flush_thread = NULL;
+}
+
 static void f2fs_read_end_io(struct bio *bio)
 {
        struct bio_vec *bvec;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index f1a90ff..b6790c9 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -52,6 +52,7 @@
 #define F2FS_MOUNT_NOBARRIER           0x00000800
 #define F2FS_MOUNT_FASTBOOT            0x00001000
 #define F2FS_MOUNT_EXTENT_CACHE                0x00002000
+#define F2FS_MOUNT_DATA_FLUSH          0x00004000
 
 #define clear_opt(sbi, option) (sbi->mount_opt.opt &= ~F2FS_MOUNT_##option)
 #define set_opt(sbi, option)   (sbi->mount_opt.opt |= F2FS_MOUNT_##option)
@@ -322,6 +323,8 @@ enum {
                                         */
 };
 
+#define DEF_DATA_FLUSH_DELAY_TIME      5000    /* delay time of data flush */
+
 #define F2FS_LINK_MAX  0xffffffff      /* maximum link count per file */
 
 #define MAX_DIR_RA_PAGES       4       /* maximum ra pages of dir */
@@ -436,6 +439,8 @@ struct f2fs_inode_info {
 
        struct extent_tree *extent_tree;        /* cached extent_tree entry */
 
+       struct list_head i_flush;       /* link in inode_list of sbi */
+
 #ifdef CONFIG_F2FS_FS_ENCRYPTION
        /* Encryption params */
        struct f2fs_crypt_info *i_crypt_info;
@@ -808,6 +813,14 @@ struct f2fs_sb_info {
        struct list_head s_list;
        struct mutex umount_mutex;
        unsigned int shrinker_run_no;
+
+       /* For data flush support */
+       struct task_struct *data_flush_thread;  /* data flush task */
+       wait_queue_head_t dflush_wait_queue;    /* data flush wait queue */
+       unsigned long wait_time;                /* wait time for flushing */
+       struct list_head inode_list;            /* link all inmem inode */
+       spinlock_t inode_lock;                  /* protect inode list */
+       unsigned int inode_num;                 /* inode number in inode_list */
 };
 
 /*
@@ -1780,6 +1793,8 @@ void destroy_checkpoint_caches(void);
 /*
  * data.c
  */
+int start_data_flush_thread(struct f2fs_sb_info *);
+void stop_data_flush_thread(struct f2fs_sb_info *);
 void f2fs_submit_merged_bio(struct f2fs_sb_info *, enum page_type, int);
 int f2fs_submit_page_bio(struct f2fs_io_info *);
 void f2fs_submit_page_mbio(struct f2fs_io_info *);
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 35aae65..6bf22ad 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -158,6 +158,13 @@ static int do_read_inode(struct inode *inode)
        stat_inc_inline_inode(inode);
        stat_inc_inline_dir(inode);
 
+       if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
+               spin_lock(&sbi->inode_lock);
+               list_add_tail(&fi->i_flush, &sbi->inode_list);
+               sbi->inode_num++;
+               spin_unlock(&sbi->inode_lock);
+       }
+
        return 0;
 }
 
@@ -335,6 +342,15 @@ void f2fs_evict_inode(struct inode *inode)
 
        f2fs_destroy_extent_tree(inode);
 
+       if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
+               spin_lock(&sbi->inode_lock);
+               if (!list_empty(&fi->i_flush)) {
+                       list_del(&fi->i_flush);
+                       sbi->inode_num--;
+               }
+               spin_unlock(&sbi->inode_lock);
+       }
+
        if (inode->i_nlink || is_bad_inode(inode))
                goto no_delete;
 
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index a680bf3..f639e96 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -71,6 +71,13 @@ static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode)
        stat_inc_inline_inode(inode);
        stat_inc_inline_dir(inode);
 
+       if (S_ISREG(inode->i_mode) || S_ISLNK(inode->i_mode)) {
+               spin_lock(&sbi->inode_lock);
+               list_add_tail(&F2FS_I(inode)->i_flush, &sbi->inode_list);
+               sbi->inode_num++;
+               spin_unlock(&sbi->inode_lock);
+       }
+
        trace_f2fs_new_inode(inode, 0);
        mark_inode_dirty(inode);
        return inode;
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index f794781..286cdb4 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -67,6 +67,7 @@ enum {
        Opt_extent_cache,
        Opt_noextent_cache,
        Opt_noinline_data,
+       Opt_data_flush,
        Opt_err,
 };
 
@@ -91,6 +92,7 @@ static match_table_t f2fs_tokens = {
        {Opt_extent_cache, "extent_cache"},
        {Opt_noextent_cache, "noextent_cache"},
        {Opt_noinline_data, "noinline_data"},
+       {Opt_data_flush, "data_flush"},
        {Opt_err, NULL},
 };
 
@@ -215,6 +217,7 @@ F2FS_RW_ATTR(SM_INFO, f2fs_sm_info, min_fsync_blocks, min_fsync_blocks);
 F2FS_RW_ATTR(NM_INFO, f2fs_nm_info, ram_thresh, ram_thresh);
 F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, max_victim_search, max_victim_search);
 F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, dir_level, dir_level);
+F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, wait_time, wait_time);
 
 #define ATTR_LIST(name) (&f2fs_attr_##name.attr)
 static struct attribute *f2fs_attrs[] = {
@@ -231,6 +234,7 @@ static struct attribute *f2fs_attrs[] = {
        ATTR_LIST(max_victim_search),
        ATTR_LIST(dir_level),
        ATTR_LIST(ram_thresh),
+       ATTR_LIST(wait_time),
        NULL,
 };
 
@@ -397,6 +401,9 @@ static int parse_options(struct super_block *sb, char *options)
                case Opt_noinline_data:
                        clear_opt(sbi, INLINE_DATA);
                        break;
+               case Opt_data_flush:
+                       set_opt(sbi, DATA_FLUSH);
+                       break;
                default:
                        f2fs_msg(sb, KERN_ERR,
                                "Unrecognized mount option \"%s\" or missing value",
@@ -434,6 +441,8 @@ static struct inode *f2fs_alloc_inode(struct super_block *sb)
        /* Will be used by directory only */
        fi->i_dir_level = F2FS_SB(sb)->dir_level;
 
+       INIT_LIST_HEAD(&fi->i_flush);
+
 #ifdef CONFIG_F2FS_FS_ENCRYPTION
        fi->i_crypt_info = NULL;
 #endif
@@ -514,6 +523,8 @@ static void f2fs_put_super(struct super_block *sb)
        }
        kobject_del(&sbi->s_kobj);
 
+       stop_data_flush_thread(sbi);
+
        stop_gc_thread(sbi);
 
        /* prevent remaining shrinker jobs */
@@ -742,6 +753,8 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
        int err, active_logs;
        bool need_restart_gc = false;
        bool need_stop_gc = false;
+       bool need_restart_df = false;
+       bool need_stop_df = false;
 
        sync_filesystem(sb);
 
@@ -785,6 +798,19 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
                need_stop_gc = true;
        }
 
+       if ((*flags & MS_RDONLY) || !test_opt(sbi, DATA_FLUSH)) {
+               if (sbi->data_flush_thread) {
+                       stop_data_flush_thread(sbi);
+                       f2fs_sync_fs(sb, 1);
+                       need_restart_df = true;
+               }
+       } else if (!sbi->data_flush_thread) {
+               err = start_data_flush_thread(sbi);
+               if (err)
+                       goto restore_gc;
+               need_stop_df = true;
+       }
+
        /*
         * We stop issue flush thread if FS is mounted as RO
         * or if flush_merge is not passed in mount option.
@@ -794,13 +820,21 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
        } else if (!SM_I(sbi)->cmd_control_info) {
                err = create_flush_cmd_control(sbi);
                if (err)
-                       goto restore_gc;
+                       goto restore_df;
        }
 skip:
        /* Update the POSIXACL Flag */
         sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
                (test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
        return 0;
+restore_df:
+       if (need_restart_df) {
+               if (start_data_flush_thread(sbi))
+                       f2fs_msg(sbi->sb, KERN_WARNING,
+                               "background data flush thread has stopped");
+       } else if (need_stop_df) {
+               stop_data_flush_thread(sbi);
+       }
 restore_gc:
        if (need_restart_gc) {
                if (start_gc_thread(sbi))
@@ -1216,6 +1250,11 @@ try_onemore:
        INIT_LIST_HEAD(&sbi->dir_inode_list);
        spin_lock_init(&sbi->dir_inode_lock);
 
+       sbi->wait_time = DEF_DATA_FLUSH_DELAY_TIME;
+       INIT_LIST_HEAD(&sbi->inode_list);
+       spin_lock_init(&sbi->inode_lock);
+       sbi->inode_num = 0;
+
        init_extent_cache_info(sbi);
 
        init_ino_entry_info(sbi);
@@ -1324,6 +1363,12 @@ try_onemore:
                if (err)
                        goto free_kobj;
        }
+
+       if (test_opt(sbi, DATA_FLUSH) && !f2fs_readonly(sb)) {
+               err = start_data_flush_thread(sbi);
+               if (err)
+                       goto stop_gc;
+       }
        kfree(options);
 
        /* recover broken superblock */
@@ -1333,7 +1378,8 @@ try_onemore:
        }
 
        return 0;
-
+stop_gc:
+       stop_gc_thread(sbi);
 free_kobj:
        kobject_del(&sbi->s_kobj);
 free_proc:
-- 
2.4.2

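For anyone who wants to try the patch: it adds a "data_flush" mount option and
a "wait_time" tunable (in milliseconds, default 5000). I assume the tunable
shows up next to the other f2fs attributes under /sys/fs/f2fs/<device>/, but I
have not verified the exact path. Note that, as posted, the flush thread reads
wait_time only once when it starts, so a new value would presumably take
effect only after the thread is restarted (e.g. by a remount). A small sketch
of setting it from C follows. The helper name set_flush_wait_time() is made
up, and the attribute path is passed in rather than hard-coded.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write a millisecond value to a sysfs-style attribute file, e.g. the
 * wait_time attribute added by the patch above (path is an assumption).
 * Returns 0 on success, or a negative errno value on failure.
 */
int set_flush_wait_time(const char *attr_path, unsigned long msecs)
{
	FILE *f;
	int ret = 0;

	assert(attr_path != NULL);

	f = fopen(attr_path, "w");
	if (!f)
		return -errno;

	/* sysfs store handlers parse a decimal string. */
	if (fprintf(f, "%lu\n", msecs) < 0)
		ret = -EIO;

	/* The write may only be flushed, and rejected, at fclose(). */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

Usage would be something like set_flush_wait_time("/sys/fs/f2fs/<device>/wait_time", 2000)
with <device> replaced by the actual block device name, after mounting with
-o data_flush.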


