Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
On 16.11.2017 09:38, Qu Wenruo wrote:
>
>
> On 16.11.2017 14:54, Nikolay Borisov wrote:
>>
>>
>> On 16.11.2017 04:18, Qu Wenruo wrote:
>>> Hi all,
>>>
>>> [Background]
>>> Recently I'm considering the possibility of using checksums from the
>>> filesystem to enhance device-mapper RAID.
>>>
>>> The idea behind it is quite simple: most modern filesystems have
>>> checksums for their metadata, and some (btrfs) have checksums for
>>> data as well.
>>>
>>> For btrfs RAID1/10 (just ignore RAID5/6 for now), at read time it can
>>> use the checksum to determine which copy is correct, so it can return
>>> the correct data even if one copy gets corrupted.
>>>
>>> [Objective]
>>> The final objective is to allow device mapper to do the checksum
>>> verification (and repair, if possible).
>>>
>>> If it were only for verification, it wouldn't be much different from
>>> the current endio hook method used by most filesystems.
>>> However, if we can move the repair part out of the filesystem (well,
>>> only btrfs supports it so far), it would benefit all filesystems.
>>>
>>> [What we have]
>>> The nearest infrastructure I found in the kernel is
>>> bio_integrity_payload.
>>>
>>> However, I found it's bound to the device, as it's designed to
>>> support the SCSI/SATA integrity protocol.
>>> For this use case it's more bound to the filesystem, as the fs (or a
>>> higher-layer dm device) is the source of the integrity data, and the
>>> device (dm-raid) only does the verification and possible repair.
>>>
>>> I'm not sure whether it's a good idea to reuse (or abuse)
>>> bio_integrity_payload for this purpose.
>>>
>>> Should we use some new infrastructure or enhance the existing
>>> bio_integrity_payload?
>>>
>>> (Or is this a valid idea, or just another crazy dream?)
>>>
>>
>> This sounds good in principle; however, I think there is one crucial
>> point which needs to be considered:
>>
>> All filesystems with checksums store those checksums in some specific
>> way, and when they fetch data from disk they also know how to acquire
>> the respective checksum.
>
> Just like the integrity payload, we generate the READ bio with an
> attached checksum hook function and checksum data.

So how is this checksum data acquired in the first place?

> So for a data read, we read the checksum first and attach it to the
> data READ bio, then submit it.
>
> And for a metadata read, in most cases the checksum is integrated into
> the metadata header, like what we did in btrfs.
>
> In that case we attach empty checksum data to the bio, but use a
> metadata-specific hook function to handle it.
>
>> What you suggest might be doable, but it will require the lower
>> layers (dm) to be aware of how to acquire the specific checksum for
>> some data.
>
> In the above case, dm only needs to call the verification hook function.
> If verification passes, that's good.
> If not, try the other copy if we have one.
>
> In this case, I don't think the dm layer needs any extra interface to
> communicate with the higher layer.

Well, that verification function is the interface I meant; you are
essentially communicating the checksum out of band (notwithstanding the
metadata case, since you said the checksum is in the actual metadata
header).

In the end, which problem are you trying to solve: allowing for a
generic checksumming layer which filesystems may use if they decide to?

> Thanks,
> Qu
>
>> I don't think at this point there is such infra, and frankly I cannot
>> even envision how it would work elegantly. Sure, you can create a
>> dm-checksum target (which I believe dm-verity is very similar to)
>> that stores checksums alongside data, but at that point the fs is
>> really out of the picture.
>>
>>
>>> Thanks,
>>> Qu
>>>
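To make the proposed read path concrete, here is a minimal userspace sketch of the scheme described above: the fs attaches a checksum and a verification hook to a read request, and the mirror layer tries each copy until the hook passes, rewriting bad copies from a good one. All names (`read_req`, `mirror_read`, `toy_csum`) are made up for illustration; this is not kernel code, and the checksum is a toy.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NR_COPIES  2

/* Hypothetical stand-in for a READ bio carrying fs-supplied checksum
 * data plus a verification hook, as in the proposal above. */
struct read_req {
	uint8_t  buf[BLOCK_SIZE];
	uint32_t csum;                                       /* from the fs */
	int    (*verify)(const uint8_t *buf, uint32_t csum); /* fs hook */
};

/* Toy mirror: two copies of one block, as in dm-raid1. */
struct mirror {
	uint8_t copy[NR_COPIES][BLOCK_SIZE];
};

static uint32_t toy_csum(const uint8_t *buf)
{
	uint32_t c = 0;
	for (int i = 0; i < BLOCK_SIZE; i++)
		c = c * 31 + buf[i];
	return c;
}

static int toy_verify(const uint8_t *buf, uint32_t csum)
{
	return toy_csum(buf) == csum;   /* 1 = copy is good */
}

/*
 * The dm side of the proposal: try each copy, call the fs-supplied
 * verification hook, and repair bad copies from the first good one.
 * Returns 0 on success, -1 if every copy fails verification.
 */
static int mirror_read(struct mirror *m, struct read_req *req)
{
	for (int i = 0; i < NR_COPIES; i++) {
		memcpy(req->buf, m->copy[i], BLOCK_SIZE);
		if (req->verify(req->buf, req->csum)) {
			for (int j = 0; j < NR_COPIES; j++)   /* repair */
				if (j != i && !req->verify(m->copy[j], req->csum))
					memcpy(m->copy[j], req->buf, BLOCK_SIZE);
			return 0;
		}
	}
	return -1;
}
```

Note that dm never interprets the checksum itself; it only calls the opaque hook, which is exactly the out-of-band interface being discussed.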
Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
On 16.11.2017 04:18, Qu Wenruo wrote:
> Hi all,
>
> [Background]
> Recently I'm considering the possibility of using checksums from the
> filesystem to enhance device-mapper RAID.
>
> The idea behind it is quite simple: most modern filesystems have
> checksums for their metadata, and some (btrfs) have checksums for data
> as well.
>
> For btrfs RAID1/10 (just ignore RAID5/6 for now), at read time it can
> use the checksum to determine which copy is correct, so it can return
> the correct data even if one copy gets corrupted.
>
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair, if possible).
>
> If it were only for verification, it wouldn't be much different from
> the current endio hook method used by most filesystems.
> However, if we can move the repair part out of the filesystem (well,
> only btrfs supports it so far), it would benefit all filesystems.
>
> [What we have]
> The nearest infrastructure I found in the kernel is
> bio_integrity_payload.
>
> However, I found it's bound to the device, as it's designed to support
> the SCSI/SATA integrity protocol.
> For this use case it's more bound to the filesystem, as the fs (or a
> higher-layer dm device) is the source of the integrity data, and the
> device (dm-raid) only does the verification and possible repair.
>
> I'm not sure whether it's a good idea to reuse (or abuse)
> bio_integrity_payload for this purpose.
>
> Should we use some new infrastructure or enhance the existing
> bio_integrity_payload?
>
> (Or is this a valid idea, or just another crazy dream?)
>

This sounds good in principle; however, I think there is one crucial
point which needs to be considered:

All filesystems with checksums store those checksums in some specific
way, and when they fetch data from disk they also know how to acquire
the respective checksum. What you suggest might be doable, but it will
require the lower layers (dm) to be aware of how to acquire the specific
checksum for some data.
I don't think at this point there is such infra, and frankly I cannot
even envision how it would work elegantly. Sure, you can create a
dm-checksum target (which I believe dm-verity is very similar to) that
stores checksums alongside data, but at that point the fs is really out
of the picture.

> Thanks,
> Qu
>
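The dm-checksum-target alternative mentioned above can be sketched in a few lines: the target itself computes and stores a checksum next to each block on write and verifies it on read, so the fs is indeed out of the picture. This is an illustrative userspace toy (all names hypothetical, checksum trivial), not a proposal for an actual target implementation.

```c
#include <stdint.h>
#include <string.h>

#define BLK  16
#define NBLK 4

/* Toy dm-checksum target: the target owns the checksums, stored
 * alongside the data; the filesystem never sees them. */
struct csum_target {
	uint8_t  data[NBLK][BLK];
	uint32_t csum[NBLK];       /* one checksum per block */
};

static uint32_t blk_csum(const uint8_t *b)
{
	uint32_t c = 0;
	for (int i = 0; i < BLK; i++)
		c = c * 31 + b[i];
	return c;
}

/* Write path: store the data and record its checksum. */
static void ct_write(struct csum_target *t, int blk, const uint8_t *buf)
{
	memcpy(t->data[blk], buf, BLK);
	t->csum[blk] = blk_csum(buf);
}

/* Read path: verify against the stored checksum; returns -1 on a
 * mismatch (a real target would fail the bio with -EIO here). */
static int ct_read(struct csum_target *t, int blk, uint8_t *buf)
{
	if (blk_csum(t->data[blk]) != t->csum[blk])
		return -1;
	memcpy(buf, t->data[blk], BLK);
	return 0;
}
```

The trade-off versus the hook approach is visible even in the toy: detection works for any fs, but there is no second copy here, so without a RAID layer underneath the target can only report corruption, not repair it.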
Re: [PATCH] bcache: lock in btree_flush_write() to avoid races
On 24.01.2018 08:54, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui
>
> In btree_flush_write(), two places need to take a lock to avoid races:
>
> Firstly, we need to take the RCU read lock to protect the bucket_hash
> traversal, since hlist_for_each_entry_rcu() must be called under the
> protection of the RCU read lock.
>
> Secondly, we need to take the b->write_lock lock to protect the journal
> of the btree node; otherwise, the btree node may have been written and
> the journal assigned to NULL.
>
> Signed-off-by: Tang Junhui
> ---
>  drivers/md/bcache/journal.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 02a98dd..505f9f3 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -375,7 +375,9 @@ static void btree_flush_write(struct cache_set *c)
>  retry:
>  	best = NULL;
>
> -	for_each_cached_btree(b, c, i)
> +	rcu_read_lock();
> +	for_each_cached_btree(b, c, i) {
> +		mutex_lock(&b->write_lock);

You can't sleep in an RCU read-side critical section, yet here you take
a mutex, which can sleep, under rcu_read_lock().

>  		if (btree_current_write(b)->journal) {
>  			if (!best)
>  				best = b;
> @@ -385,6 +387,9 @@ static void btree_flush_write(struct cache_set *c)
>  			best = b;
>  		}
>  	}
> +		mutex_unlock(&b->write_lock);
> +	}
> +	rcu_read_unlock();
>
>  	b = best;
>  	if (b) {
>
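One conventional way around this constraint, sketched below, is to only *select* the candidate node under `rcu_read_lock()` (no sleeping locks inside the critical section) and take the mutex after leaving it, re-checking the state under the lock. This is an untested, illustrative kernel-style sketch, not the actual bcache fix; `journal_cmp()` stands in for the comparison elided from the quoted hunk.

```c
/* Sketch only: pick 'best' under RCU without sleeping, then take the
 * mutex once, outside the read-side critical section. */
rcu_read_lock();
for_each_cached_btree(b, c, i)
	if (btree_current_write(b)->journal &&
	    (!best || journal_cmp(b, best)))	/* hypothetical comparison */
		best = b;
rcu_read_unlock();

if (best) {
	mutex_lock(&best->write_lock);
	/* Re-check under the lock: the node may have been written and
	 * its journal pointer cleared while we held no lock. */
	if (btree_current_write(best)->journal)
		__bch_btree_node_write(best, NULL);	/* illustrative */
	mutex_unlock(&best->write_lock);
}
```

The key invariant is that nothing between `rcu_read_lock()` and `rcu_read_unlock()` may block; the sleeping `mutex_lock()` happens only after the traversal is finished.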
Re: [PATCH 1/2] direct-io: Remove unused DIO_ASYNC_EXTEND flag
On 23.02.2018 13:45, Nikolay Borisov wrote:
> This flag was added by 6039257378e4 ("direct-io: add flag to allow aio
> writes beyond i_size") to support XFS. However, with the rework of
> XFS' DIO path to use iomap in acdda3aae146 ("xfs: use iomap_dio_rw")
> it became redundant. So let's remove it.
>
> Signed-off-by: Nikolay Borisov <nbori...@suse.com>

Jens,

On second look, I think you are the more appropriate person to take
these patches. So do you have any objections to merging them via the
block tree? (I did CC you, but didn't CC linux-block.)

> ---
>  fs/direct-io.c     | 3 +--
>  include/linux/fs.h | 3 ---
>  2 files changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index a0ca9e48e993..99a81c49bce9 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1252,8 +1252,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 */
>  	if (is_sync_kiocb(iocb))
>  		dio->is_async = false;
> -	else if (!(dio->flags & DIO_ASYNC_EXTEND) &&
> -	    iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
> +	else if (iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
>  		dio->is_async = false;
>  	else
>  		dio->is_async = true;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2a815560fda0..260c233e7375 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2977,9 +2977,6 @@ enum {
>  	/* filesystem does not support filling holes */
>  	DIO_SKIP_HOLES	= 0x02,
>
> -	/* filesystem can handle aio writes beyond i_size */
> -	DIO_ASYNC_EXTEND = 0x04,
> -
>  	/* inode/fs/bdev does not need truncate protection */
>  	DIO_SKIP_DIO_COUNT = 0x08,
> };
>
Re: GPF in wb_congested due to null bdi_writeback
On 27.02.2018 18:05, Nikolay Borisov wrote:
> Hello Tejun,
>
> So while running some fs tests I hit the following GPF. Btw, the
> warning taint flag was due to a debugging WARN_ON in btrfs 100 or so
> tests ago, so it is unrelated to this GPF:
>
> [ 4255.628110] general protection fault: [#1] SMP PTI
> [ 4255.628303] Modules linked in:
> [ 4255.628446] CPU: 4 PID: 58 Comm: kswapd0 Tainted: G W 4.16.0-rc3-nbor #488
> [ 4255.628666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [ 4255.628928] RIP: 0010:shrink_page_list+0x320/0x1180
> [ 4255.629072] RSP: 0018:c9b2fb38 EFLAGS: 00010287
> [ 4255.629220] RAX: 26c74ca226c74ca2 RBX: ea000444aea0 RCX:
> [ 4255.629394] RDX: RSI: RDI: 880136761450
> [ 4255.629568] RBP: c9b2fea0 R08: 880136761640 R09:
> [ 4255.629742] R10: R11: R12: c9b2fc68
> [ 4255.629913] R13: ea000444ae80 R14: c9b2fba8 R15: 0001
> [ 4255.630125] FS: () GS:88013fd0() knlGS:
> [ 4255.630339] CS: 0010 DS: ES: CR0: 80050033
> [ 4255.630494] CR2: 7fb16b3955f8 CR3: 000135108000 CR4: 06a0
> [ 4255.630667] Call Trace:
> [ 4255.630790]  shrink_inactive_list+0x27b/0x800
> [ 4255.630951]  shrink_node_memcg+0x3b0/0x7e0
> [ 4255.631181]  ? mem_cgroup_iter+0xe3/0x730
> [ 4255.631374]  ? mem_cgroup_iter+0xe3/0x730
> [ 4255.631509]  ? shrink_node+0xcc/0x350
> [ 4255.631651]  shrink_node+0xcc/0x350
> [ 4255.631780]  kswapd+0x307/0x910
> [ 4255.631913]  kthread+0x103/0x140
> [ 4255.632033]  ? mem_cgroup_shrink_node+0x2f0/0x2f0
> [ 4255.632201]  ? kthread_create_on_node+0x40/0x40
> [ 4255.632348]  ret_from_fork+0x3a/0x50
> [ 4255.632499] Code: 85 c0 74 59 49 8b 38 48 c7 c0 60 2f 16 82 48 85 ff 74 18 48 8b 47 28 48 3b 05 75 b6 0f 01 0f 84 42 0a 00 00 48 8b 80 28 01 00 00 <48> 8b 48 58 48 8b 51 20 48 85 d2 0f 84 69 04 00 00 4c 89 04 24
> [ 4255.633055] RIP: shrink_page_list+0x320/0x1180 RSP: c9b2fb38
> [ 4255.633456] ---[ end trace 5c1558c67347a58d ]---
>
> shrink_page_list+0x320/0x1180 is:
> wb_congested at include/linux/backing-dev.h:170
> (inlined by) inode_congested at include/linux/backing-dev.h:456
> (inlined by) inode_write_congested at include/linux/backing-dev.h:468
> (inlined by) shrink_page_list at mm/vmscan.c:957
>
> So the actual faulting code is in wb_congested's first line:
>
> struct backing_dev_info *bdi = wb->bdi;
>
> So this means wb_congested is called with a null bdi_writeback.
> This is the first time I've seen it, so it's likely new.
> I haven't tried bisecting. FWIW, I triggered it with xfstest
> generic/176 running on btrfs. But from the looks of it, the filesystem
> wasn't at play here.

I should read more carefully - it's not due to a null wb but rather to
having garbage in rax. The actual (annotated) disassembly:

All code
========
   0:	85 c0                	test   %eax,%eax
   2:	74 59                	je     0x5d
   4:	49 8b 38             	mov    (%r8),%rdi
   7:	48 c7 c0 60 2f 16 82 	mov    $0x82162f60,%rax
   e:	48 85 ff             	test   %rdi,%rdi
  11:	74 18                	je     0x2b
  13:	48 8b 47 28          	mov    0x28(%rdi),%rax	; rax = inode->i_sb (in inode_to_bdi)
  17:	48 3b 05 75 b6 0f 01 	cmp    0x10fb675(%rip),%rax	# 0x10fb693
  1e:	0f 84 42 0a 00 00    	je     0xa66
  24:	48 8b 80 28 01 00 00 	mov    0x128(%rax),%rax	; rax = sb->s_bdi
  2b:*	48 8b 48 58          	mov    0x58(%rax),%rcx	<-- trapping instruction
  2f:	48 8b 51 20          	mov    0x20(%rcx),%rdx
  33:	48 85 d2             	test   %rdx,%rdx
  36:	0f 84 69 04 00 00    	je     0x4a5
  3c:	4c 89 04 24          	mov    %r8,(%rsp)

So somehow the inode's i_sb is bogus.
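To make the annotated pointer chase concrete, here is a userspace toy model of what those two loads do after inlining: `wb->bdi` resolves through `inode->i_sb->s_bdi`, so garbage in `i_sb` means the subsequent load faults. The struct layouts and the function name `inode_to_bdi_model` are simplified stand-ins, not the kernel's real definitions or offsets.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structs involved in the chase. */
struct backing_dev_info { long congested_state; };
struct super_block      { struct backing_dev_info *s_bdi; };
struct inode            { struct super_block *i_sb; };

/* Models the inlined chase:
 *   mov 0x28(%rdi),%rax    ; rax = inode->i_sb
 *   mov 0x128(%rax),%rax   ; rax = sb->s_bdi
 *   mov 0x58(%rax),%rcx    ; <-- faults if rax (s_bdi chain) is garbage
 */
static struct backing_dev_info *inode_to_bdi_model(struct inode *inode)
{
	return inode->i_sb->s_bdi;  /* second load faults when i_sb is bogus */
}
```

With `RAX: 26c74ca226c74ca2` (a poison-like repeating pattern) at the trapping instruction, the value loaded from `i_sb` was already garbage, which is why the fault lands on the `0x58(%rax)` dereference rather than on a clean NULL check.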
GPF in wb_congested due to null bdi_writeback
Hello Tejun,

So while running some fs tests I hit the following GPF. Btw, the warning
taint flag was due to a debugging WARN_ON in btrfs 100 or so tests ago,
so it is unrelated to this GPF:

[ 4255.628110] general protection fault: [#1] SMP PTI
[ 4255.628303] Modules linked in:
[ 4255.628446] CPU: 4 PID: 58 Comm: kswapd0 Tainted: G W 4.16.0-rc3-nbor #488
[ 4255.628666] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 4255.628928] RIP: 0010:shrink_page_list+0x320/0x1180
[ 4255.629072] RSP: 0018:c9b2fb38 EFLAGS: 00010287
[ 4255.629220] RAX: 26c74ca226c74ca2 RBX: ea000444aea0 RCX:
[ 4255.629394] RDX: RSI: RDI: 880136761450
[ 4255.629568] RBP: c9b2fea0 R08: 880136761640 R09:
[ 4255.629742] R10: R11: R12: c9b2fc68
[ 4255.629913] R13: ea000444ae80 R14: c9b2fba8 R15: 0001
[ 4255.630125] FS: () GS:88013fd0() knlGS:
[ 4255.630339] CS: 0010 DS: ES: CR0: 80050033
[ 4255.630494] CR2: 7fb16b3955f8 CR3: 000135108000 CR4: 06a0
[ 4255.630667] Call Trace:
[ 4255.630790]  shrink_inactive_list+0x27b/0x800
[ 4255.630951]  shrink_node_memcg+0x3b0/0x7e0
[ 4255.631181]  ? mem_cgroup_iter+0xe3/0x730
[ 4255.631374]  ? mem_cgroup_iter+0xe3/0x730
[ 4255.631509]  ? shrink_node+0xcc/0x350
[ 4255.631651]  shrink_node+0xcc/0x350
[ 4255.631780]  kswapd+0x307/0x910
[ 4255.631913]  kthread+0x103/0x140
[ 4255.632033]  ? mem_cgroup_shrink_node+0x2f0/0x2f0
[ 4255.632201]  ? kthread_create_on_node+0x40/0x40
[ 4255.632348]  ret_from_fork+0x3a/0x50
[ 4255.632499] Code: 85 c0 74 59 49 8b 38 48 c7 c0 60 2f 16 82 48 85 ff 74 18 48 8b 47 28 48 3b 05 75 b6 0f 01 0f 84 42 0a 00 00 48 8b 80 28 01 00 00 <48> 8b 48 58 48 8b 51 20 48 85 d2 0f 84 69 04 00 00 4c 89 04 24
[ 4255.633055] RIP: shrink_page_list+0x320/0x1180 RSP: c9b2fb38
[ 4255.633456] ---[ end trace 5c1558c67347a58d ]---

shrink_page_list+0x320/0x1180 is:
wb_congested at include/linux/backing-dev.h:170
(inlined by) inode_congested at include/linux/backing-dev.h:456
(inlined by) inode_write_congested at include/linux/backing-dev.h:468
(inlined by) shrink_page_list at mm/vmscan.c:957

So the actual faulting code is in wb_congested's first line:

struct backing_dev_info *bdi = wb->bdi;

So this means wb_congested is called with a null bdi_writeback. This is
the first time I've seen it, so it's likely new. I haven't tried
bisecting. FWIW, I triggered it with xfstest generic/176 running on
btrfs. But from the looks of it, the filesystem wasn't at play here.