Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote: > [cc btrfs, since afaict that's where most of the dedupe tool authors hang out] > > On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote: > > Theodore Ts'owrites: > > > > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: > > >> 1.) Privacy implications. Say the filesystem is being shared between > > >> multiple > > >> users, and one user unpacks foo.tar.gz into their home directory, > > >> which > > >> they've set to mode 700 to hide from other users. Because of this > > >> new > > >> ioctl, all users will be able to see every (inode number, size in > > >> blocks) > > >> pair that was added to the filesystem, as well as the exact layout > > >> of the > > >> physical block allocations which might hint at how the files were > > >> created. > > >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this > > >> regard, its presence on the filesystem will be revealed to all > > >> users. And > > >> if any filesystems happen to prefer allocating blocks near the > > >> containing > > >> directory, the directory the files are in would likely be revealed > > >> too. > > Frankly, why are container users even allowed to make unrestricted ioctl > calls? I thought we had a bunch of security infrastructure to constrain > what userspace can do to a system, so why don't ioctls fall under these > same protections? If your containers are really that adversarial, you > ought to be blacklisting as much as you can. > Personally I don't find the presence of sandboxing features to be a very good excuse for introducing random insecure ioctls. Not everyone has everything perfectly "sandboxed" all the time, for obvious reasons. It's easy to forget about the filesystem ioctls, too, since they can be executed on any regular file, without having to open some device node in /dev. (And this actually does happen; the SELinux policy in Android, for example, still allows apps to call any ioctl on their data files, despite all the effort that has gone into whitelisting other types of ioctls. Which should be fixed, of course, but it shows that this kind of mistake is very easy to make.) > > > Unix/Linux has historically not been terribly concerned about trying > > > to protect this kind of privacy between users. So for example, in > > > order to do this, you would have to call GETFSMAP continously to track > > > this sort of thing. Someone who wanted to do this could probably get > > > this information (and much, much more) by continuously running "ps" to > > > see what processes are running. > > > > > > (I will note. wryly, that in the bad old days, when dozens of users > > > were sharing a one MIPS Vax/780, it was considered a *good* thing > > > that social pressure could be applied when it was found that someone > > > was running a CPU or memory hogger on a time sharing system. The > > > privacy right of someone running "xtrek" to be able to hide this from > > > other users on the system was never considered important at all. :-) > > Not to mention someone running GETFSMAP in a loop will be pretty obvious > both from the high kernel cpu usage and the huge number of metadata > operations. Well, only if that someone running GETFSMAP actually wants to watch things in real-time (it's not necessary for all scenarios that have been mentioned), *and* there is monitoring in place which actually detects it and can do something about it. Yes, PIDs have traditionally been global, but today we have PID namespaces, and many other isolation features such as mount namespaces. Nothing is perfect, of course, and containers are a lot worse than VMs, but it seems weird to use that as an excuse to knowingly make things worse... > > > > Fortunately, the days of timesharing seem to well behind us. For > > > those people who think that containers are as secure as VM's (hah, > > > hah, hah), it might be that best way to handle this is to have a mount > > > option that requires root access to this functionality. For those > > > people who really care about this, they can disable access. > > Or use separate filesystems for each container so that exploitable bugs > that shut down the filesystem can't be used to kill the other > containers. You could use a torrent of metadata-heavy operations > (fallocate a huge file, punch every block, truncate file, repeat) to DoS > the other containers. > > > What would be the reason for not putting this behind > > capable(CAP_SYS_ADMIN)? > > > > What possible legitimate function could this functionality serve to > > users who don't own your filesystem? > > As I've said before, it's to enable dedupe tools to decide, given a set > of files with shareable blocks, roughly how many other times each of > those shareable blocks are shared so that they can make better decisions > about which file keeps its shareable blocks, and which file
Re: Question on compression unit
At 05/11/2017 04:11 AM, Xiaochu Liu wrote: Hi there, I'm trying to tune compression options for btrfs. Specifically, I want to know the performance impact on the system under different compression unit (block) sizes. Compression unit size is fixed in btrfs. It's sectorsize, determined at mkfs time, and only 4K (page size) is supported for x86 yet. I'm aware of '--nodesize' parameter which sets the block size of metadata tree. Does that also set the block size in an extent? (from my understanding, file data are mostly stored in extent unless small enough to be inline-d in metadata leaf node?) nodesize only affects metadata, nothing to do with data size. Also from btrfs's wikipedia page: In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent. Is that still true? Yes. For example if there is one continuous range represents 0~1M data of one file, and all this data is dirty (not written to disk). Then compress will happen when trying to writing them to disk. And since the maximum uncompressed size for compressed extent is 128K (fixed), the 0~1M will be split into 8 extents (if compression ratio is acceptable). And then each 128K extent will be compressed then compressed data will be written to disk. (compressed extent still meet sectorsize alignment). So the wiki page is still right and we must read out the whole (compressed) extent to get its content. And since both uncompressed data and compressed extent must meet sectorsize alignment, data smaller than or equal to sectorsize won't go through compression since it will just waste CPU time and no space saving. Thanks, Qu Thanks, Xiaochu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qgroup: direct writes returns -EDQUOT too soon
At 05/11/2017 08:19 AM, Goldwyn Rodrigues wrote: Here is a sample script to recreate the issue: mkfs.btrfs -f /dev/vdb mount /dev/vdb /mnt btrfs quota enable /mnt btrfs sub create /mnt/tmp btrfs qgroup limit 200M /mnt/tmp btrfs quota rescan -w /mnt cd /mnt/tmp for i in {1..5}; do sync dd if=/dev/zero of=/mnt/tmp/file-$i oflag=direct sync done btrfs qgroup show -pcref /mnt/tmp Output: Create subvolume '/mnt/tmp' quota rescan started dd: writing to '/mnt/tmp/file-1': Disk quota exceeded 11991+0 records in 11990+0 records out 6138880 bytes (6.1 MB, 5.9 MiB) copied, 2.40459 s, 2.6 MB/s dd: writing to '/mnt/tmp/file-2': Disk quota exceeded 11807+0 records in 11806+0 records out 6044672 bytes (6.0 MB, 5.8 MiB) copied, 2.11256 s, 2.9 MB/s dd: writing to '/mnt/tmp/file-3': Disk quota exceeded 11628+0 records in 11627+0 records out 5953024 bytes (6.0 MB, 5.7 MiB) copied, 2.53767 s, 2.3 MB/s dd: writing to '/mnt/tmp/file-4': Disk quota exceeded 11080+0 records in 11079+0 records out 5672448 bytes (5.7 MB, 5.4 MiB) copied, 2.3697 s, 2.4 MB/s dd: writing to '/mnt/tmp/file-5': Disk quota exceeded 11358+0 records in 11357+0 records out 5814784 bytes (5.8 MB, 5.5 MiB) copied, 2.10354 s, 2.8 MB/s qgroupid rfer excl max_rfer max_excl parent child -- - 0/25728.84MiB 28.84MiB200.00MiB none --- --- The files created are only 5-6MB when the subvolume size is 200m. Each of the attempts, including the first attempt, returns EDQUOT at around 5-6 MB. IIRC that's the problem of metadata reservation. We're always over-reserving metadata, not only for qgroup. We reserve one full tree block even if we're only trying to insert one item. So I'm afraid the problem is that the default block size for dd is too small (512) and direct makes it always try to reserve a new tree block for it. It's already over 10K file extents, which will take up more than 160M if using 16K nodesize. At least this can explain the problem. Try to change dd blocksize to 1M or more should avoid the problem. However the true solution is to find a method to calculate metadata reservation more correctly, which I didn't see any good method yet. Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
qgroup: direct writes returns -EDQUOT too soon
Here is a sample script to recreate the issue: mkfs.btrfs -f /dev/vdb mount /dev/vdb /mnt btrfs quota enable /mnt btrfs sub create /mnt/tmp btrfs qgroup limit 200M /mnt/tmp btrfs quota rescan -w /mnt cd /mnt/tmp for i in {1..5}; do sync dd if=/dev/zero of=/mnt/tmp/file-$i oflag=direct sync done btrfs qgroup show -pcref /mnt/tmp Output: Create subvolume '/mnt/tmp' quota rescan started dd: writing to '/mnt/tmp/file-1': Disk quota exceeded 11991+0 records in 11990+0 records out 6138880 bytes (6.1 MB, 5.9 MiB) copied, 2.40459 s, 2.6 MB/s dd: writing to '/mnt/tmp/file-2': Disk quota exceeded 11807+0 records in 11806+0 records out 6044672 bytes (6.0 MB, 5.8 MiB) copied, 2.11256 s, 2.9 MB/s dd: writing to '/mnt/tmp/file-3': Disk quota exceeded 11628+0 records in 11627+0 records out 5953024 bytes (6.0 MB, 5.7 MiB) copied, 2.53767 s, 2.3 MB/s dd: writing to '/mnt/tmp/file-4': Disk quota exceeded 11080+0 records in 11079+0 records out 5672448 bytes (5.7 MB, 5.4 MiB) copied, 2.3697 s, 2.4 MB/s dd: writing to '/mnt/tmp/file-5': Disk quota exceeded 11358+0 records in 11357+0 records out 5814784 bytes (5.8 MB, 5.5 MiB) copied, 2.10354 s, 2.8 MB/s qgroupid rfer excl max_rfer max_excl parent child -- - 0/25728.84MiB 28.84MiB200.00MiB none --- --- The files created are only 5-6MB when the subvolume size is 200m. Each of the attempts, including the first attempt, returns EDQUOT at around 5-6 MB. -- Goldwyn -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
errno=-28 No space left, with kernel backtrace (blocking bug)
Hi, this bug is very blocking for me: https://bugzilla.kernel.org/show_bug.cgi?id=195257 The server is backup server, I btrfs receive (with and without -p), and of course btrfs subvolume delete The volume is 70TB, then I use space_cache=v2 Cheers, -- alpha_one_x86/BRULE HermanMain developer of Supercopier/Ultracopier/CatchChallenger, Esourcing and server management IT, OS, technologies, research & development, security and business department -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
[cc btrfs, since afaict that's where most of the dedupe tool authors hang out] On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote: > Theodore Ts'owrites: > > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: > >> 1.) Privacy implications. Say the filesystem is being shared between > >> multiple > >> users, and one user unpacks foo.tar.gz into their home directory, which > >> they've set to mode 700 to hide from other users. Because of this new > >> ioctl, all users will be able to see every (inode number, size in > >> blocks) > >> pair that was added to the filesystem, as well as the exact layout of > >> the > >> physical block allocations which might hint at how the files were > >> created. > >> If there is a known "fingerprint" for the unpacked foo.tar.gz in this > >> regard, its presence on the filesystem will be revealed to all users. > >> And > >> if any filesystems happen to prefer allocating blocks near the > >> containing > >> directory, the directory the files are in would likely be revealed too. Frankly, why are container users even allowed to make unrestricted ioctl calls? I thought we had a bunch of security infrastructure to constrain what userspace can do to a system, so why don't ioctls fall under these same protections? If your containers are really that adversarial, you ought to be blacklisting as much as you can. > > Unix/Linux has historically not been terribly concerned about trying > > to protect this kind of privacy between users. So for example, in > > order to do this, you would have to call GETFSMAP continously to track > > this sort of thing. Someone who wanted to do this could probably get > > this information (and much, much more) by continuously running "ps" to > > see what processes are running. > > > > (I will note. wryly, that in the bad old days, when dozens of users > > were sharing a one MIPS Vax/780, it was considered a *good* thing > > that social pressure could be applied when it was found that someone > > was running a CPU or memory hogger on a time sharing system. The > > privacy right of someone running "xtrek" to be able to hide this from > > other users on the system was never considered important at all. :-) Not to mention someone running GETFSMAP in a loop will be pretty obvious both from the high kernel cpu usage and the huge number of metadata operations. > > Fortunately, the days of timesharing seem to well behind us. For > > those people who think that containers are as secure as VM's (hah, > > hah, hah), it might be that best way to handle this is to have a mount > > option that requires root access to this functionality. For those > > people who really care about this, they can disable access. Or use separate filesystems for each container so that exploitable bugs that shut down the filesystem can't be used to kill the other containers. You could use a torrent of metadata-heavy operations (fallocate a huge file, punch every block, truncate file, repeat) to DoS the other containers. > What would be the reason for not putting this behind > capable(CAP_SYS_ADMIN)? > > What possible legitimate function could this functionality serve to > users who don't own your filesystem? As I've said before, it's to enable dedupe tools to decide, given a set of files with shareable blocks, roughly how many other times each of those shareable blocks are shared so that they can make better decisions about which file keeps its shareable blocks, and which file gets remapped. Dedupe is not a privileged operation, nor are any of the tools. > I have seen several people speak up how this is a concern I don't see > anyone saying here is a legitimate use for a non-system administrator. /I/ said that a few emails ago. --D > This doesn't seem like something where abuses of time-sharing systems > can be observed. > > Eric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Question on compression unit
Hi there, I'm trying to tune compression options for btrfs. Specifically, I want to know the performance impact on the system under different compression unit (block) sizes. I'm aware of '--nodesize' parameter which sets the block size of metadata tree. Does that also set the block size in an extent? (from my understanding, file data are mostly stored in extent unless small enough to be inline-d in metadata leaf node?) Also from btrfs's wikipedia page: In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent. Is that still true? Thanks, Xiaochu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
On 05/09/2017 09:36 PM, Qu Wenruo wrote: > Introduce a new parameter, struct extent_changeset for > btrfs_qgroup_reserved_data() and its callers. > > Such extent_changeset was used in btrfs_qgroup_reserve_data() to record > which range it reserved in current reserve, so it can free it at error > path. > > The reason we need to export it to callers is, at buffered write error > path, without knowing what exactly which range we reserved in current > allocation, we can free space which is not reserved by us. > > This will lead to qgroup reserved space underflow. > > Reviewed-by: Chandan Rajendra> Signed-off-by: Qu Wenruo > --- > fs/btrfs/ctree.h | 6 -- > fs/btrfs/extent-tree.c | 16 +++- > fs/btrfs/extent_io.h | 34 ++ > fs/btrfs/file.c| 12 +--- > fs/btrfs/inode-map.c | 4 +++- > fs/btrfs/inode.c | 18 ++ > fs/btrfs/ioctl.c | 5 - > fs/btrfs/qgroup.c | 41 + > fs/btrfs/qgroup.h | 3 ++- > fs/btrfs/relocation.c | 4 +++- > 10 files changed, 113 insertions(+), 30 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 1e82516fe2d8..52a0147cd612 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -2704,8 +2704,9 @@ enum btrfs_flush_state { > COMMIT_TRANS= 6, > }; > > -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len); > int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); > +int btrfs_check_data_free_space(struct inode *inode, > + struct extent_changeset **reserved, u64 start, u64 len); > void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len); > void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start, > u64 len); > @@ -2723,7 +2724,8 @@ void btrfs_subvolume_release_metadata(struct > btrfs_fs_info *fs_info, > struct btrfs_block_rsv *rsv); > int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 > num_bytes); > void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 > num_bytes); > -int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len); > +int btrfs_delalloc_reserve_space(struct inode *inode, > + struct extent_changeset **reserved, u64 start, u64 len); > void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len); > void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); > struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 4f62696131a6..782e0f5feb69 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -3364,6 +3364,7 @@ static int cache_save_setup(struct > btrfs_block_group_cache *block_group, > struct btrfs_fs_info *fs_info = block_group->fs_info; > struct btrfs_root *root = fs_info->tree_root; > struct inode *inode = NULL; > + struct extent_changeset *data_reserved = NULL; > u64 alloc_hint = 0; > int dcs = BTRFS_DC_ERROR; > u64 num_pages = 0; > @@ -3483,7 +3484,7 @@ static int cache_save_setup(struct > btrfs_block_group_cache *block_group, > num_pages *= 16; > num_pages *= PAGE_SIZE; > > - ret = btrfs_check_data_free_space(inode, 0, num_pages); > + ret = btrfs_check_data_free_space(inode, _reserved, 0, num_pages); > if (ret) > goto out_put; > > @@ -3514,6 +3515,7 @@ static int cache_save_setup(struct > btrfs_block_group_cache *block_group, > block_group->disk_cache_state = dcs; > spin_unlock(_group->lock); > > + extent_changeset_free(data_reserved); > return ret; > } > > @@ -4282,7 +4284,8 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode > *inode, u64 bytes) > * Will replace old btrfs_check_data_free_space(), but for patch split, > * add a new function first and then replace it. > */ > -int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len) > +int btrfs_check_data_free_space(struct inode *inode, > + struct extent_changeset **reserved, u64 start, u64 len) > { > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > int ret; > @@ -4297,9 +4300,11 @@ int btrfs_check_data_free_space(struct inode *inode, > u64 start, u64 len) > return ret; > > /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */ > - ret = btrfs_qgroup_reserve_data(inode, start, len); > + ret = btrfs_qgroup_reserve_data(inode, reserved, start, len); > if (ret < 0) > btrfs_free_reserved_data_space_noquota(inode, start, len); > + else > + ret = 0; > return ret; > } > > @@ -6140,11 +6145,12 @@ void
Re: [PATCH 0/6 RFC] utilize bio_clone_fast to clean up
On Tue, May 09, 2017 at 03:49:13PM -0700, Liu Bo wrote: > > This patch is too big, can you split it to smaller chunks? I was not > > able to review it, it seems to touch several things at once, it's hard > > to keep the context. > > Oh I see, the diff does look scary but the changes are in fact not > intrusive, I'll try to do something. On second read, I think it's not that bad as it looked. Some of the code is just moved/indented differently and I see the core of the change now. I'm concerned about the used types for length/size, but this can be done as a followup patch. If you have split the patch already, please send it anyway. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 23/27] gfs2: clean up some filemap_* calls
- Original Message - | In some places, it's trying to reset the mapping error after calling | filemap_fdatawait. That's no longer required. Also, turn several | filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait. | That will at least return writeback errors that occur during the write | phase. | | Signed-off-by: Jeff Layton| --- Hi Jeff, This version looks better. ACK. Regards, Bob Peterson Red Hat File Systems -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix incorrect error return ret being passed to mapping_set_error
On Tue, May 09, 2017 at 11:58:55AM -0700, Liu Bo wrote: > On Tue, May 09, 2017 at 06:14:01PM +0100, Colin King wrote: > > From: Colin Ian King> > > > The setting of return code ret should be based on the error code > > passed into function end_extent_writepage and not on ret. Thanks > > to Liu Bo for spotting this mistake in the original fix I submitted. > > > > Detected by CoverityScan, CID#1414312 ("Logically dead code") > > Looks good. > > Reviewed-by: Liu Bo Thanks, patch replaced. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it
On Wed, 2017-05-10 at 07:18 -0700, Matthew Wilcox wrote: > On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote: > > +++ b/lib/errseq.c > > @@ -0,0 +1,199 @@ > > +#include > > +#include > > +#include > > +#include > > + > > +/* > > + * An errseq_t is a way of recording errors in one place, and allowing any > > + * number of "subscribers" to tell whether it has changed since an > > arbitrary > > + * time of their choosing. > > You use the word "time" in several places in the documentation, but I think > it's clearer to say "sampling point" or "sample", since you're not using > jiffies > or nanoseconds. For example, I'd phrase this paragraph this way: > > * An errseq_t is a way of recording errors in one place, and allowing any > * number of "subscribers" to tell whether it has changed since they last > * sampled it. > > > + * The general idea is for consumers to sample an errseq_t value at a > > + * particular point in time. Later, that value can be used to tell whether > > any > > + * new errors have occurred since that time. > > * The general idea is for consumers to sample an errseq_t value. That > * value can be used to tell whether any new errors have occurred since > * the last time it was sampled. > > > +/* The "ones" bit for the counter */ > > Maybe "The lowest bit of the counter"? > > > +/** > > + * errseq_check - has an error occurred since a particular point in time? > > "has an error occurred since the last time it was sampled" > > > +/** > > + * errseq_check_and_advance - check an errseq_t and advance it to the > > current value > > + * @eseq: pointer to value being checked reported > > "value being checked reported"? > Thanks. I'm cleaning up the comments like you suggest. > > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > > +{ > > + int err = 0; > > + errseq_t old, new; > > + > > + /* > > +* Most callers will want to use the inline wrapper to check this, > > +* so that the common case of no error is handled without needing > > +* to lock. > > +*/ > > + old = READ_ONCE(*eseq); > > + if (old != *since) { > > + /* > > +* Set the flag and try to swap it into place if it has > > +* changed. > > +* > > +* We don't care about the outcome of the swap here. If the > > +* swap doesn't occur, then it has either been updated by a > > +* writer who is bumping the seq count anyway, or another > > +* reader who is just setting the "seen" flag. Either outcome > > +* is OK, and we can advance "since" and return an error based > > +* on what we have. > > +*/ > > + new = old | ERRSEQ_SEEN; > > + if (new != old) > > + cmpxchg(eseq, old, new); > > + *since = new; > > + err = -(new & MAX_ERRNO); > > + } > > I probably need to read through the patchset some more to understand this. > Naively, surely "since" should be updated to the current value of 'eseq' > if we failed the cmpxchg()? I don't think so. If we want to do that, then we'll need to redrive the cmpxchg to set the SEEN flag if it's now clear. Storing the value in "since" is effectively sampling it, so you do have to mark it seen. The good news is that I think that "new" is just as valid a value to store here as *eseq would be. It ends up representing an errseq_t value that never actually got stored in eseq, but that's OK with the way this works. -- Jeff Layton-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it
On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote: > +++ b/lib/errseq.c > @@ -0,0 +1,199 @@ > +#include > +#include > +#include > +#include > + > +/* > + * An errseq_t is a way of recording errors in one place, and allowing any > + * number of "subscribers" to tell whether it has changed since an arbitrary > + * time of their choosing. You use the word "time" in several places in the documentation, but I think it's clearer to say "sampling point" or "sample", since you're not using jiffies or nanoseconds. For example, I'd phrase this paragraph this way: * An errseq_t is a way of recording errors in one place, and allowing any * number of "subscribers" to tell whether it has changed since they last * sampled it. > + * The general idea is for consumers to sample an errseq_t value at a > + * particular point in time. Later, that value can be used to tell whether > any > + * new errors have occurred since that time. * The general idea is for consumers to sample an errseq_t value. That * value can be used to tell whether any new errors have occurred since * the last time it was sampled. > +/* The "ones" bit for the counter */ Maybe "The lowest bit of the counter"? > +/** > + * errseq_check - has an error occurred since a particular point in time? "has an error occurred since the last time it was sampled" > +/** > + * errseq_check_and_advance - check an errseq_t and advance it to the > current value > + * @eseq: pointer to value being checked reported "value being checked reported"? > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > +{ > + int err = 0; > + errseq_t old, new; > + > + /* > + * Most callers will want to use the inline wrapper to check this, > + * so that the common case of no error is handled without needing > + * to lock. > + */ > + old = READ_ONCE(*eseq); > + if (old != *since) { > + /* > + * Set the flag and try to swap it into place if it has > + * changed. > + * > + * We don't care about the outcome of the swap here. If the > + * swap doesn't occur, then it has either been updated by a > + * writer who is bumping the seq count anyway, or another > + * reader who is just setting the "seen" flag. Either outcome > + * is OK, and we can advance "since" and return an error based > + * on what we have. > + */ > + new = old | ERRSEQ_SEEN; > + if (new != old) > + cmpxchg(eseq, old, new); > + *since = new; > + err = -(new & MAX_ERRNO); > + } I probably need to read through the patchset some more to understand this. Naively, surely "since" should be updated to the current value of 'eseq' if we failed the cmpxchg()? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting
On Wed 10-05-17 08:19:50, Jeff Layton wrote: > On Wed, 2017-05-10 at 13:48 +0200, Jan Kara wrote: > > On Tue 09-05-17 11:49:17, Jeff Layton wrote: > > > diff --git a/fs/file_table.c b/fs/file_table.c > > > index 954d510b765a..d6138b6411ff 100644 > > > --- a/fs/file_table.c > > > +++ b/fs/file_table.c > > > @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, > > > fmode_t mode, > > > file->f_path = *path; > > > file->f_inode = path->dentry->d_inode; > > > file->f_mapping = path->dentry->d_inode->i_mapping; > > > + file->f_wb_err = filemap_sample_wb_error(file->f_mapping); > > > > Why do you sample here when you also sample in do_dentry_open()? I didn't > > find any alloc_file() callers that would possibly care about writeback > > errors... > > > > Honza > > I basically used the setting of f_mapping as a guideline as to where to > sample it for initialization. My thinking was that if f_mapping ever > ended up different then you'd probably also want f_wb_err to be > resampled anyway. OK, makes sense. > I can drop this hunk if you think we don't need it. I don't really care. I was just wondering whether I'm missing something... Honza -- Jan KaraSUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 06/27] fs: check for writeback errors after syncing out buffers in generic_file_fsync
On Tue, May 09, 2017 at 11:49:09AM -0400, Jeff Layton wrote: > ext2 currently does a test+clear of the AS_EIO flag, which is > is problematic for some coming changes. > > What we really need to do instead is call filemap_check_errors > in __generic_file_fsync after syncing out the buffers. That > will be sufficient for this case, and help other callers detect > these errors properly as well. > > With that, we don't need to twiddle it in ext2. > > Suggested-by: Jan Kara> Signed-off-by: Jeff Layton > Reviewed-by: Christoph Hellwig > Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting
On Wed, 2017-05-10 at 13:48 +0200, Jan Kara wrote: > On Tue 09-05-17 11:49:17, Jeff Layton wrote: > > Most filesystems currently use mapping_set_error and > > filemap_check_errors for setting and reporting/clearing writeback errors > > at the mapping level. filemap_check_errors is indirectly called from > > most of the filemap_fdatawait_* functions and from > > filemap_write_and_wait*. These functions are called from all sorts of > > contexts to wait on writeback to finish -- e.g. mostly in fsync, but > > also in truncate calls, getattr, etc. > > > > The non-fsync callers are problematic. We should be reporting writeback > > errors during fsync, but many places spread over the tree clear out > > errors before they can be properly reported, or report errors at > > nonsensical times. > > > > If I get -EIO on a stat() call, there is no reason for me to assume that > > it is because some previous writeback failed. The fact that it also > > clears out the error such that a subsequent fsync returns 0 is a bug, > > and a nasty one since that's potentially silent data corruption. > > > > This patch adds a small bit of new infrastructure for setting and > > reporting errors during address_space writeback. While the above was my > > original impetus for adding this, I think it's also the case that > > current fsync semantics are just problematic for userland. Most > > applications that call fsync do so to ensure that the data they wrote > > has hit the backing store. > > > > In the case where there are multiple writers to the file at the same > > time, this is really hard to determine. The first one to call fsync will > > see any stored error, and the rest get back 0. The processes with open > > fds may not be associated with one another in any way. They could even > > be in different containers, so ensuring coordination between all fsync > > callers is not really an option. > > > > One way to remedy this would be to track what file descriptor was used > > to dirty the file, but that's rather cumbersome and would likely be > > slow. However, there is a simpler way to improve the semantics here > > without incurring too much overhead. > > > > This set adds an errseq_t to struct address_space, and a corresponding > > one is added to struct file. Writeback errors are recorded in the > > mapping's errseq_t, and the one in struct file is used as the "since" > > value. > > > > This changes the semantics of the Linux fsync implementation such that > > applications can now use it to determine whether there were any > > writeback errors since fsync(fd) was last called (or since the file was > > opened in the case of fsync having never been called). > > > > Note that those writeback errors may have occurred when writing data > > that was dirtied via an entirely different fd, but that's the case now > > with the current mapping_set_error/filemap_check_error infrastructure. > > This will at least prevent you from getting a false report of success. > > > > The new behavior is still consistent with the POSIX spec, and is more > > reliable for application developers. This patch just adds some basic > > infrastructure for doing this. Later patches will change the existing > > code to use this new infrastructure. > > > > Signed-off-by: Jeff Layton> > Just one nit below. Otherwise the patch looks good to me. You can add: > > Reviewed-by: Jan Kara > > > diff --git a/fs/file_table.c b/fs/file_table.c > > index 954d510b765a..d6138b6411ff 100644 > > --- a/fs/file_table.c > > +++ b/fs/file_table.c > > @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, > > fmode_t mode, > > file->f_path = *path; > > file->f_inode = path->dentry->d_inode; > > file->f_mapping = path->dentry->d_inode->i_mapping; > > + file->f_wb_err = filemap_sample_wb_error(file->f_mapping); > > Why do you sample here when you also sample in do_dentry_open()? I didn't > find any alloc_file() callers that would possibly care about writeback > errors... > > Honza I basically used the setting of f_mapping as a guideline as to where to sample it for initialization. My thinking was that if f_mapping ever ended up different then you'd probably also want f_wb_err to be resampled anyway. I can drop this hunk if you think we don't need it. -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 14/27] fs: new infrastructure for writeback error handling and reporting
On Tue 09-05-17 11:49:17, Jeff Layton wrote: > Most filesystems currently use mapping_set_error and > filemap_check_errors for setting and reporting/clearing writeback errors > at the mapping level. filemap_check_errors is indirectly called from > most of the filemap_fdatawait_* functions and from > filemap_write_and_wait*. These functions are called from all sorts of > contexts to wait on writeback to finish -- e.g. mostly in fsync, but > also in truncate calls, getattr, etc. > > The non-fsync callers are problematic. We should be reporting writeback > errors during fsync, but many places spread over the tree clear out > errors before they can be properly reported, or report errors at > nonsensical times. > > If I get -EIO on a stat() call, there is no reason for me to assume that > it is because some previous writeback failed. The fact that it also > clears out the error such that a subsequent fsync returns 0 is a bug, > and a nasty one since that's potentially silent data corruption. > > This patch adds a small bit of new infrastructure for setting and > reporting errors during address_space writeback. While the above was my > original impetus for adding this, I think it's also the case that > current fsync semantics are just problematic for userland. Most > applications that call fsync do so to ensure that the data they wrote > has hit the backing store. > > In the case where there are multiple writers to the file at the same > time, this is really hard to determine. The first one to call fsync will > see any stored error, and the rest get back 0. The processes with open > fds may not be associated with one another in any way. They could even > be in different containers, so ensuring coordination between all fsync > callers is not really an option. > > One way to remedy this would be to track what file descriptor was used > to dirty the file, but that's rather cumbersome and would likely be > slow. However, there is a simpler way to improve the semantics here > without incurring too much overhead. > > This set adds an errseq_t to struct address_space, and a corresponding > one is added to struct file. Writeback errors are recorded in the > mapping's errseq_t, and the one in struct file is used as the "since" > value. > > This changes the semantics of the Linux fsync implementation such that > applications can now use it to determine whether there were any > writeback errors since fsync(fd) was last called (or since the file was > opened in the case of fsync having never been called). > > Note that those writeback errors may have occurred when writing data > that was dirtied via an entirely different fd, but that's the case now > with the current mapping_set_error/filemap_check_error infrastructure. > This will at least prevent you from getting a false report of success. > > The new behavior is still consistent with the POSIX spec, and is more > reliable for application developers. This patch just adds some basic > infrastructure for doing this. Later patches will change the existing > code to use this new infrastructure. > > Signed-off-by: Jeff LaytonJust one nit below. Otherwise the patch looks good to me. You can add: Reviewed-by: Jan Kara > diff --git a/fs/file_table.c b/fs/file_table.c > index 954d510b765a..d6138b6411ff 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t > mode, > file->f_path = *path; > file->f_inode = path->dentry->d_inode; > file->f_mapping = path->dentry->d_inode->i_mapping; > + file->f_wb_err = filemap_sample_wb_error(file->f_mapping); Why do you sample here when you also sample in do_dentry_open()? I didn't find any alloc_file() callers that would possibly care about writeback errors... Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it
On Tue 09-05-17 11:49:16, Jeff Layton wrote: > An errseq_t is a way of recording errors in one place, and allowing any > number of "subscribers" to tell whether an error has been set again > since a previous time. > > It's implemented as an unsigned 32-bit value that is managed with atomic > operations. The low order bits are designated to hold an error code > (max size of MAX_ERRNO). The upper bits are used as a counter. > > The API works with consumers sampling an errseq_t value at a particular > point in time. Later, that value can be used to tell whether new errors > have been set since that time. > > Note that there is a 1 in 512k risk of collisions here if new errors > are being recorded frequently, since we have so few bits to use as a > counter. To mitigate this, one bit is used as a flag to tell whether the > value has been sampled since a new value was recorded. That allows > us to avoid bumping the counter if no one has sampled it since it > was last bumped. > > Later patches will build on this infrastructure to change how writeback > errors are tracked in the kernel. > > Signed-off-by: Jeff LaytonThe patch looks good to me. Feel free to add: Reviewed-by: Jan Kara Just two nits below: ... > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > +{ > + int err = 0; > + errseq_t old, new; > + > + /* > + * Most callers will want to use the inline wrapper to check this, > + * so that the common case of no error is handled without needing > + * to lock. > + */ I'm not sure which locking you are speaking about here. Is the comment stale? > + old = READ_ONCE(*eseq); > + if (old != *since) { > + /* > + * Set the flag and try to swap it into place if it has > + * changed. > + * > + * We don't care about the outcome of the swap here. If the > + * swap doesn't occur, then it has either been updated by a > + * writer who is bumping the seq count anyway, or another "bumping the seq count anyway" part is not quite true. Writer may see ERRSEQ_SEEN not set and so just update the error code and leave seq count as is. But since you compare full errseq_t for equality, this works out as designed... > + * reader who is just setting the "seen" flag. Either outcome > + * is OK, and we can advance "since" and return an error based > + * on what we have. > + */ > + new = old | ERRSEQ_SEEN; > + if (new != old) > + cmpxchg(eseq, old, new); > + *since = new; > + err = -(new & MAX_ERRNO); > + } > + return err; > +} > +EXPORT_SYMBOL(errseq_check_and_advance); Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it
On Wed, 2017-05-10 at 08:03 +1000, NeilBrown wrote: > On Tue, May 09 2017, Jeff Layton wrote: > > > An errseq_t is a way of recording errors in one place, and allowing any > > number of "subscribers" to tell whether an error has been set again > > since a previous time. > > > > It's implemented as an unsigned 32-bit value that is managed with atomic > > operations. The low order bits are designated to hold an error code > > (max size of MAX_ERRNO). The upper bits are used as a counter. > > > > The API works with consumers sampling an errseq_t value at a particular > > point in time. Later, that value can be used to tell whether new errors > > have been set since that time. > > > > Note that there is a 1 in 512k risk of collisions here if new errors > > are being recorded frequently, since we have so few bits to use as a > > counter. To mitigate this, one bit is used as a flag to tell whether the > > value has been sampled since a new value was recorded. That allows > > us to avoid bumping the counter if no one has sampled it since it > > was last bumped. > > > > Later patches will build on this infrastructure to change how writeback > > errors are tracked in the kernel. > > > > Signed-off-by: Jeff Layton> > I like that this is a separate lib/*.c - nicely structured too. > > Reviewed-by: NeilBrown > > Thanks, yeah...it occurred to me that this scheme is not really specific to writeback errors. While I can't think of another use-case for errseq_t's right offhand, I think this makes for cleaner layering and should make it easy to use in other ways should they arise. -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
Hello, thanks. But is there any way to recover from this error? Like removing the item or so? Data loss isn't a problem. Just reconstructing the whole FS will take quite a long time. Stefan Am 10.05.2017 um 11:54 schrieb Hugo Mills: > On Wed, May 10, 2017 at 11:20:58AM +0200, Stefan Priebe - Profihost AG wrote: >> Hello, >> >> here's the output: >> # for block in 163316514816 163322413056 163325722624; do echo $block; >> btrfs-debug-tree -b $block /dev/mapper/crypt_md0|sed -re 's/(\t| )name: >> .*/\1name: HIDDEN/'; done >> >> 163316514816 >> btrfs-progs v4.8.5 >> leaf 163316514816 items 188 free space 1387 generation 86739 owner 3892 >> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 >> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f > [...] >> item 37 key (23760 DIR_INDEX 36) itemoff 14278 itemsize 58 >> location key (28124232 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 28 >> name: HIDDEN >> item 38 key (23760 DIR_INDEX 37) itemoff 14220 itemsize 58 >> location key (28124233 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 28 >> name: HIDDEN >> item 39 key (23760 DIR_INDEX 38) itemoff 14165 itemsize 55 >> location key (28124234 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 25 >> name: HIDDEN >> item 40 key (23760 DIR_INDEX 22) itemoff 14115 itemsize 50 >> location key (26923383 INODE_ITEM 0) type FILE >> transid 74009 data_len 0 name_len 20 >> name: HIDDEN >> item 41 key (23760 DIR_INDEX 23) itemoff 14067 itemsize 48 >> location key (26923384 INODE_ITEM 0) type FILE >> transid 74009 data_len 0 name_len 18 >> name: HIDDEN > [...] >> 163322413056 >> btrfs-progs v4.8.5 >> leaf 163322413056 items 113 free space 934 generation 86739 owner 3892 >> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 >> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f > [...] >> item 73 key (24016 DIR_INDEX 19) itemoff 9651 itemsize 62 >> location key (28124251 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 32 >> name: HIDDEN >> item 74 key (24016 DIR_INDEX 20) itemoff 9592 itemsize 59 >> location key (28124252 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 29 >> name: HIDDEN >> item 75 key (24016 DIR_INDEX 4) itemoff 9538 itemsize 54 >> location key (26923401 INODE_ITEM 0) type FILE >> transid 74009 data_len 0 name_len 24 >> name: HIDDEN >> item 76 key (24016 DIR_INDEX 5) itemoff 9486 itemsize 52 >> location key (26923402 INODE_ITEM 0) type FILE >> transid 74009 data_len 0 name_len 22 >> name: HIDDEN > [...] >> 163325722624 >> btrfs-progs v4.8.5 >> leaf 163325722624 items 78 free space 6563 generation 86739 owner 3892 >> fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 >> chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f > [...] >> item 62 key (24189 DIR_INDEX 16) itemoff 9409 itemsize 64 >> location key (28124267 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 34 >> name: HIDDEN >> item 63 key (24189 DIR_INDEX 17) itemoff 9349 itemsize 60 >> location key (28124268 INODE_ITEM 0) type FILE >> transid 86739 data_len 0 name_len 30 >> name: HIDDEN >> item 64 key (24189 DIR_INDEX 4) itemoff 9296 itemsize 53 >> location key (26923421 INODE_ITEM 0) type FILE >> transid 74010 data_len 0 name_len 23 >> name: HIDDEN >> item 65 key (24189 DIR_INDEX 5) itemoff 9236 itemsize 60 >> location key (26923422 INODE_ITEM 0) type FILE >> transid 74010 data_len 0 name_len 30 >> name: HIDDEN > [...] > >In each case, the tree node keys have simply been sorted > incorrectly -- the ordering is otherwise correct, but jumps backwards > at some point in the sequence. At least in the first instance, some of > the keys appear to have been duplicated: there are two (23760 > DIR_INDEX 22) keys in the list. (I didn't check in detail with the > other two whether there are duplicates, but I wouldn't be surprised). > >I note that the jump in the keys in the first two cases is 16, and > the jump in the third case is 13. That _might_ suggest some kind of > bit-level error, but it's probably not in the RAM that was used to > store the blocks, as the error appears in a different place in each > block. > >I'm tentatively going to point the finger at your hardware for > this. It's probably going to need something really long and/or > stressful to pick it up, though (one
Re: [PATCH v4 12/27] cifs: set mapping error when page writeback fails in writepage or launder_pages
On Tue 09-05-17 11:49:15, Jeff Layton wrote: > Signed-off-by: Jeff Layton> Reviewed-by: Christoph Hellwig Looks good to me. You can add: Reviewed-by: Jan Kara Honza > --- > fs/cifs/file.c | 12 +++- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > index 21d404535739..0bee7f8d91ad 100644 > --- a/fs/cifs/file.c > +++ b/fs/cifs/file.c > @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct > writeback_control *wbc) > set_page_writeback(page); > retry_write: > rc = cifs_partialpagewrite(page, 0, PAGE_SIZE); > - if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) > - goto retry_write; > - else if (rc == -EAGAIN) > + if (rc == -EAGAIN) { > + if (wbc->sync_mode == WB_SYNC_ALL) > + goto retry_write; > redirty_page_for_writepage(wbc, page); > - else if (rc != 0) > + } else if (rc != 0) { > SetPageError(page); > - else > + mapping_set_error(page->mapping, rc); > + } else { > SetPageUptodate(page); > + } > end_page_writeback(page); > put_page(page); > free_xid(xid); > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 11/27] fuse: set mapping error in writepage_locked when it fails
On Tue 09-05-17 11:49:14, Jeff Layton wrote: > This ensures that we see errors on fsync when writeback fails. > > Signed-off-by: Jeff Layton> Reviewed-by: Christoph Hellwig Looks good to me. You can add: Reviewed-by: Jan Kara Honza > --- > fs/fuse/file.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index ec238fb5a584..07d0efcb050c 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page) > err_free: > fuse_request_free(req); > err: > + mapping_set_error(page->mapping, error); > end_page_writeback(page); > return error; > } > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 05/27] btrfs: btrfs_wait_tree_block_writeback can be void return
On Tue 09-05-17 11:49:08, Jeff Layton wrote: > Nothing checks its return value. > > Signed-off-by: Jeff LaytonLooks good to me. You can add: Reviewed-by: Jan Kara Honza > --- > fs/btrfs/disk-io.c | 6 +++--- > fs/btrfs/disk-io.h | 2 +- > 2 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index eb1ee7b6f532..8c479bd5534a 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -1222,10 +1222,10 @@ int btrfs_write_tree_block(struct extent_buffer *buf) > buf->start + buf->len - 1); > } > > -int btrfs_wait_tree_block_writeback(struct extent_buffer *buf) > +void btrfs_wait_tree_block_writeback(struct extent_buffer *buf) > { > - return filemap_fdatawait_range(buf->pages[0]->mapping, > -buf->start, buf->start + buf->len - 1); > + filemap_fdatawait_range(buf->pages[0]->mapping, > + buf->start, buf->start + buf->len - 1); > } > > struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 > bytenr, > diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h > index 2e0ec29bfd69..9cc87835abb5 100644 > --- a/fs/btrfs/disk-io.h > +++ b/fs/btrfs/disk-io.h > @@ -127,7 +127,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, > struct inode *inode, > extent_submit_bio_hook_t *submit_bio_done); > unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info); > int btrfs_write_tree_block(struct extent_buffer *buf); > -int btrfs_wait_tree_block_writeback(struct extent_buffer *buf); > +void btrfs_wait_tree_block_writeback(struct extent_buffer *buf); > int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, >struct btrfs_fs_info *fs_info); > int btrfs_add_log_tree(struct btrfs_trans_handle *trans, > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 01/27] fs: remove unneeded forward definition of mm_struct from fs.h
On Tue 09-05-17 11:49:04, Jeff Layton wrote: > Signed-off-by: Jeff LaytonLooks good. You can add: Reviewed-by: Jan Kara Honza > --- > include/linux/fs.h | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 7251f7bb45e8..38adefd8e2a0 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1252,8 +1252,6 @@ extern void f_delown(struct file *filp); > extern pid_t f_getown(struct file *filp); > extern int send_sigurg(struct fown_struct *fown); > > -struct mm_struct; > - > /* > * Umount options > */ > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/6] fstests: regression test for nocsum buffered read's repair
On Tue, May 09, 2017 at 11:56:11AM -0600, Liu Bo wrote: > This is to test whether buffered read retry-repair code is able to work in > raid1 case as expected. > > Please note that without checksum, btrfs doesn't know if the data used to > repair is correct, so repair is more of resync which makes sure that both > of the copy has the same content. > > Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and > drop > checks") introduced the regression. > > The upstream fix is > Btrfs: bring back repair during read btrfs/14[1-3] all could refer to the upstream patch along with its commit id. > > Signed-off-by: Liu Bo> --- > tests/btrfs/143 | 155 > > tests/btrfs/143.out | 39 + > tests/btrfs/group | 1 + > 3 files changed, 195 insertions(+) > create mode 100755 tests/btrfs/143 > create mode 100644 tests/btrfs/143.out > > diff --git a/tests/btrfs/143 b/tests/btrfs/143 > new file mode 100755 > index 000..5263e78 > --- /dev/null > +++ b/tests/btrfs/143 > @@ -0,0 +1,155 @@ > +#! /bin/bash > +# FS QA Test 143 > +# > +# Regression test for btrfs buffered read's repair during read without > checksum. > +# > +# This is to test whether buffered read retry-repair code is able to work in > +# raid1 case as expected. > +# > +# Please note that without checksum, btrfs doesn't know if the data used to > +# repair is correct, so repair is more of resync which makes sure that both > +# of the copy has the same content. > +# > +# Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and > drop > +# checks") introduced the regression. > +# > +# The upstream fix is > +#Btrfs: bring back repair during read > +# > +#--- > +# Copyright (c) 2017 Liu Bo. All Rights Reserved. > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#--- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +# Modify as appropriate. > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 btrfs/142 and btrfs/143 need a "_require_fail_make_request" > + > +_require_btrfs_command inspect-internal dump-tree > +_require_command "$FILEFRAG_PROG" filefrag > + > +_check_repair() > +{ > + filter=${1:-cat} > + dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | > $filter | grep -q -e "read error corrected" > + if [ $? -eq 0 ]; then > + echo 1 > + else > + echo 0 > + fi > +} This function can be removed. Thanks, Eryu > + > +get_physical() > +{ > +# $1 is logical address > +# print chunk tree and find devid 2 which is $SCRATCH_DEV > +$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | \ > + grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print > $6 }' > +} > + > +SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV` > + > +start_fail() > +{ > + echo 100 > $DEBUGFS_MNT/fail_make_request/probability > + echo 4 > $DEBUGFS_MNT/fail_make_request/times > + echo 0 > $DEBUGFS_MNT/fail_make_request/verbose > + echo 1 > $SYSFS_BDEV/make-it-fail > +} > + > +stop_fail() > +{ > + echo 0 > $DEBUGFS_MNT/fail_make_request/probability > + echo 0 > $DEBUGFS_MNT/fail_make_request/times > + echo 0 > $SYSFS_BDEV/make-it-fail > +} > + > +_scratch_dev_pool_get 2 > +# step 1, create a raid1 btrfs which contains one 128k file. > +echo "step 1..mkfs.btrfs" >>$seqres.full > + > +mkfs_opts="-d raid1 -b 1G" > +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1 > + > +# -o nospace_cache makes sure data is written to the start position of the > data > +# chunk > +_scratch_mount -o nospace_cache,nodatasum > + > +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" > | _filter_xfs_io > + > +#
Re: [PATCH 3/6] fstests: regression test for btrfs dio read repair
On Tue, May 09, 2017 at 11:56:08AM -0600, Liu Bo wrote: > This case tests whether dio read can repair the bad copy if we have > a good copy. > > Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks") > introduced the regression. > > The upstream fix is > Btrfs: fix invalid dereference in btrfs_retry_endio I noticed this is in upstream now, you can refer to it along with hash tag too. > > Signed-off-by: Liu Bo> --- > tests/btrfs/140 | 115 > > tests/btrfs/140.out | 39 ++ > tests/btrfs/group | 1 + > 3 files changed, 155 insertions(+) > create mode 100755 tests/btrfs/140 > create mode 100644 tests/btrfs/140.out > > diff --git a/tests/btrfs/140 b/tests/btrfs/140 > new file mode 100755 > index 000..09a9939 > --- /dev/null > +++ b/tests/btrfs/140 > @@ -0,0 +1,115 @@ > +#! /bin/bash > +# FS QA Test 140 > +# > +# Regression test for btrfs DIO read's repair during read. > +# > +# Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks") > +# introduced the regression. > +# The upstream fix is > +#Btrfs: fix invalid dereference in btrfs_retry_endio > +# > +#--- > +# Copyright (c) 2017 Liu Bo. All Rights Reserved. > +# > +# This program is free software; you can redistribute it and/or > +# modify it under the terms of the GNU General Public License as > +# published by the Free Software Foundation. > +# > +# This program is distributed in the hope that it would be useful, > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +# GNU General Public License for more details. > +# > +# You should have received a copy of the GNU General Public License > +# along with this program; if not, write the Free Software Foundation, > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > +#--- > +# > + > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > + > +here=`pwd` > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > + > +# remove previous $seqres.full before test > +rm -f $seqres.full > + > +# real QA test starts here > + > +# Modify as appropriate. > +_supported_fs btrfs > +_supported_os Linux > +_require_scratch_dev_pool 2 > + > +_require_btrfs_command inspect-internal dump-tree > +_require_command "$FILEFRAG_PROG" filefrag > +_require_odirect > + > +get_physical() > +{ > + # $1 is logical address > + # print chunk tree and find devid 2 which is $SCRATCH_DEV > + $BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | \ > + grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print > $6 }' > +} > + > +_scratch_dev_pool_get 2 > +# step 1, create a raid1 btrfs which contains one 128k file. > +echo "step 1..mkfs.btrfs" >>$seqres.full > + > +mkfs_opts="-d raid1 -b 1G" > +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1 > + > +# -o nospace_cache makes sure data is written to the start position of the > data > +# chunk > +_scratch_mount -o nospace_cache > + > +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" > | _filter_xfs_io > + > +# step 2, corrupt the first 64k of one copy (on SCRATCH_DEV which is the > first > +# one in $SCRATCH_DEV_POOL > +echo "step 2..corrupt file extent" >>$seqres.full > + > +${FILEFRAG_PROG} -v $SCRATCH_MNT/foobar >> $seqres.full > +logical_in_btrfs=`${FILEFRAG_PROG} -v $SCRATCH_MNT/foobar | _filter_filefrag > | cut -d '#' -f 1` > +physical_on_scratch=`get_physical ${logical_in_btrfs}` > + > +_scratch_unmount > +$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 64K $physical_on_scratch 64K" > $SCRATCH_DEV | _filter_xfs_io > + > +_scratch_mount > + > +# step 3, 128k dio read (this read can repair bad copy) > +echo "step 3..repair the bad copy" >>$seqres.full > + > +# since raid1 consists of two copies, and the following read may read the > good > +# copy directly, so lets loop 10 times here and discard output that dio reads > +# give > +for i in `seq 1 10`; do > + $XFS_IO_PROG -d -c "pread -b 128K 0 128K" "$SCRATCH_MNT/foobar" > > /dev/null > + _get_current_dmesg | grep -q -e "csum failed" && break > +done Half of the time I got test failure because pread from SCRATCH_DEV read 0xbb instead of 0xaa on v4.11 kernel (bug should be fixed there), tested on two different hosts and could hit failure on both hosts. Similar failure happened to all the 4 tests randomly. I thought it was because "csum failed" was never hit, so I tried a "while true; do" loop, and that did fix the btrfs/140 failure for me,
Re: [PATCH 2/6] fstests: add _get_current_dmesg
On Tue, May 09, 2017 at 11:56:07AM -0600, Liu Bo wrote: > _get_current_dmesg can be used to grep customized pattern. > > Signed-off-by: Liu BoI can't apply this patch on top of current master, perhaps it needs a rebase :) > --- > common/rc | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/common/rc b/common/rc > index 78a2101..111ed69 100644 > --- a/common/rc > +++ b/common/rc > @@ -3215,6 +3215,12 @@ _get_device_size() > grep `_short_dev $1` /proc/partitions | awk '{print $3}' > } > > +# get current dmesg log > +_get_current_dmesg() > +{ > + dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac > +} > + > # check dmesg log for WARNING/Oops/etc. > _check_dmesg() > { > @@ -3230,8 +3236,7 @@ _check_dmesg() > > # search the dmesg log of last run of $seqnum for possible failures > # use sed \cregexpc address type, since $seqnum contains "/" This second line of comment should go to _get_current_dmesg. Thanks, Eryu > - dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | \ > - tac | $filter >$seqres.dmesg > + _get_current_dmesg | $filter >$seqres.dmesg > grep -q -e "kernel BUG at" \ >-e "WARNING:" \ >-e "BUG:" \ > -- > 2.5.0 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
On Wed, May 10, 2017 at 11:20:58AM +0200, Stefan Priebe - Profihost AG wrote: > Hello, > > here's the output: > # for block in 163316514816 163322413056 163325722624; do echo $block; > btrfs-debug-tree -b $block /dev/mapper/crypt_md0|sed -re 's/(\t| )name: > .*/\1name: HIDDEN/'; done > > 163316514816 > btrfs-progs v4.8.5 > leaf 163316514816 items 188 free space 1387 generation 86739 owner 3892 > fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 > chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f [...] > item 37 key (23760 DIR_INDEX 36) itemoff 14278 itemsize 58 > location key (28124232 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 28 > name: HIDDEN > item 38 key (23760 DIR_INDEX 37) itemoff 14220 itemsize 58 > location key (28124233 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 28 > name: HIDDEN > item 39 key (23760 DIR_INDEX 38) itemoff 14165 itemsize 55 > location key (28124234 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 25 > name: HIDDEN > item 40 key (23760 DIR_INDEX 22) itemoff 14115 itemsize 50 > location key (26923383 INODE_ITEM 0) type FILE > transid 74009 data_len 0 name_len 20 > name: HIDDEN > item 41 key (23760 DIR_INDEX 23) itemoff 14067 itemsize 48 > location key (26923384 INODE_ITEM 0) type FILE > transid 74009 data_len 0 name_len 18 > name: HIDDEN [...] > 163322413056 > btrfs-progs v4.8.5 > leaf 163322413056 items 113 free space 934 generation 86739 owner 3892 > fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 > chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f [...] > item 73 key (24016 DIR_INDEX 19) itemoff 9651 itemsize 62 > location key (28124251 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 32 > name: HIDDEN > item 74 key (24016 DIR_INDEX 20) itemoff 9592 itemsize 59 > location key (28124252 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 29 > name: HIDDEN > item 75 key (24016 DIR_INDEX 4) itemoff 9538 itemsize 54 > location key (26923401 INODE_ITEM 0) type FILE > transid 74009 data_len 0 name_len 24 > name: HIDDEN > item 76 key (24016 DIR_INDEX 5) itemoff 9486 itemsize 52 > location key (26923402 INODE_ITEM 0) type FILE > transid 74009 data_len 0 name_len 22 > name: HIDDEN [...] > 163325722624 > btrfs-progs v4.8.5 > leaf 163325722624 items 78 free space 6563 generation 86739 owner 3892 > fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 > chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f [...] > item 62 key (24189 DIR_INDEX 16) itemoff 9409 itemsize 64 > location key (28124267 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 34 > name: HIDDEN > item 63 key (24189 DIR_INDEX 17) itemoff 9349 itemsize 60 > location key (28124268 INODE_ITEM 0) type FILE > transid 86739 data_len 0 name_len 30 > name: HIDDEN > item 64 key (24189 DIR_INDEX 4) itemoff 9296 itemsize 53 > location key (26923421 INODE_ITEM 0) type FILE > transid 74010 data_len 0 name_len 23 > name: HIDDEN > item 65 key (24189 DIR_INDEX 5) itemoff 9236 itemsize 60 > location key (26923422 INODE_ITEM 0) type FILE > transid 74010 data_len 0 name_len 30 > name: HIDDEN [...] In each case, the tree node keys have simply been sorted incorrectly -- the ordering is otherwise correct, but jumps backwards at some point in the sequence. At least in the first instance, some of the keys appear to have been duplicated: there are two (23760 DIR_INDEX 22) keys in the list. (I didn't check in detail with the other two whether there are duplicates, but I wouldn't be surprised). I note that the jump in the keys in the first two cases is 16, and the jump in the third case is 13. That _might_ suggest some kind of bit-level error, but it's probably not in the RAM that was used to store the blocks, as the error appears in a different place in each block. I'm tentatively going to point the finger at your hardware for this. It's probably going to need something really long and/or stressful to pick it up, though (one of the CPU stress tests, for example, and also a good long run with a RAM tester -- 24 hours or longer). Hugo. > Stefan > Am 10.05.2017 um 11:08 schrieb Hugo Mills: > > On Wed, May 10, 2017 at 10:40:31AM +0200, Stefan Priebe - Profihost AG > > wrote: > >> Am 10.05.2017 um 09:40 schrieb Hugo Mills: > >>> On Wed, May 10, 2017 at 09:36:30AM +0200,
Re: runtime btrfsck
Hello, here's the output: # for block in 163316514816 163322413056 163325722624; do echo $block; btrfs-debug-tree -b $block /dev/mapper/crypt_md0|sed -re 's/(\t| )name: .*/\1name: HIDDEN/'; done 163316514816 btrfs-progs v4.8.5 leaf 163316514816 items 188 free space 1387 generation 86739 owner 3892 fs uuid 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 chunk uuid b86efe94-ab40-4344-ac6b-46ec59c41b8f item 0 key (23760 DIR_ITEM 2479948887) itemoff 16229 itemsize 54 location key (26923382 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 24 name: HIDDEN item 1 key (23760 DIR_ITEM 2652742785) itemoff 16170 itemsize 59 location key (28124230 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 29 name: HIDDEN item 2 key (23760 DIR_ITEM 2688971413) itemoff 16119 itemsize 51 location key (26923386 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 21 name: HIDDEN item 3 key (23760 DIR_ITEM 2764880658) itemoff 16064 itemsize 55 location key (26923399 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 25 name: HIDDEN item 4 key (23760 DIR_ITEM 2805527189) itemoff 16006 itemsize 58 location key (28124233 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 28 name: HIDDEN item 5 key (23760 DIR_ITEM 2876464375) itemoff 15957 itemsize 49 location key (26923393 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 19 name: HIDDEN item 6 key (23760 DIR_ITEM 2951059296) itemoff 15907 itemsize 50 location key (28124218 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 20 name: HIDDEN item 7 key (23760 DIR_ITEM 3058144963) itemoff 15859 itemsize 48 location key (26923384 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 18 name: HIDDEN item 8 key (23760 DIR_ITEM 3095440808) itemoff 15804 itemsize 55 location key (26923394 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 25 name: HIDDEN item 9 key (23760 DIR_ITEM 3124573416) itemoff 15748 itemsize 56 location key (26923387 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 26 name: HIDDEN item 10 key (23760 DIR_ITEM 3194204932) itemoff 15690 itemsize 58 location key (26923397 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 28 name: HIDDEN item 11 key (23760 DIR_ITEM 3281114395) itemoff 15637 itemsize 53 location key (26923388 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 23 name: HIDDEN item 12 key (23760 DIR_ITEM 3353597736) itemoff 15588 itemsize 49 location key (24944 INODE_ITEM 0) type FILE transid 10694 data_len 0 name_len 19 name: HIDDEN item 13 key (23760 DIR_ITEM 3389003195) itemoff 15539 itemsize 49 location key (28124226 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 19 name: HIDDEN item 14 key (23760 DIR_ITEM 3461310858) itemoff 15473 itemsize 66 location key (26923392 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 36 name: HIDDEN item 15 key (23760 DIR_ITEM 3660173809) itemoff 15422 itemsize 51 location key (28124225 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 21 name: HIDDEN item 16 key (23760 DIR_ITEM 3678308711) itemoff 15371 itemsize 51 location key (28124220 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 21 name: HIDDEN item 17 key (23760 DIR_ITEM 3708519009) itemoff 15316 itemsize 55 location key (28124224 INODE_ITEM 0) type FILE transid 86739 data_len 0 name_len 25 name: HIDDEN item 18 key (23760 DIR_ITEM 3716314603) itemoff 15258 itemsize 58 location key (26923396 INODE_ITEM 0) type FILE transid 74009 data_len 0 name_len 28 name: HIDDEN item 19 key (23760 DIR_ITEM 3958443109) itemoff 15224 itemsize 34 location key (24016 INODE_ITEM 0) type DIR transid 10693 data_len 0 name_len 4 name: HIDDEN item 20 key (23760 DIR_INDEX 2) itemoff 15190 itemsize 34 location key (24016 INODE_ITEM 0) type DIR transid 10693 data_len 0 name_len 4 name: HIDDEN item 21 key (23760 DIR_INDEX
Re: runtime btrfsck
On Wed, May 10, 2017 at 10:40:31AM +0200, Stefan Priebe - Profihost AG wrote: > Am 10.05.2017 um 09:40 schrieb Hugo Mills: > > On Wed, May 10, 2017 at 09:36:30AM +0200, Stefan Priebe - Profihost AG > > wrote: > >> Hello Roman, > >> > >> the FS is mountable. It just goes readonly when trying to write some data. > >> > >> The kernel msgs are: > >> BTRFS critical (device dm-2): corrupt leaf, bad key order: > >> block=163316514816,root=1, slot=39 > >> BTRFS critical (device dm-2): corrupt leaf, bad key order: > >> block=163322413056,root=1, slot=74 > >> BTRFS critical (device dm-2): corrupt leaf, bad key order: > >> block=163325722624,root=1, slot=63 > >> BTRFS critical (device dm-2): corrupt leaf, bad key order: > >> block=163316514816,root=1, slot=39 > >> BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure > >> BTRFS info (device dm-2): forced readonly > >> BTRFS info (device dm-2): delayed_refs has NO entry > > > >Can you show the output of btrfs-debug-tree -b , where > > is each of the three "block=" values above? > > Can do that. But the lists are very long - should i upload them to > pastebin? Is it ok to remove the name atribute - which provides filenames? On the mailing list would be preferable, as then the conversation is completely preserved in archives. They shouldn't be so long that vger rejects them. And yes, if you want to filter out the filenames, do so (but _just_ the filename -- don't remove the whole line). Hugo. > Stefan > > > >Hugo. > > > >> Greets, > >> Stefan > >> Am 10.05.2017 um 09:18 schrieb Roman Mamedov: > >>> On Wed, 10 May 2017 09:02:46 +0200 > >>> Stefan Priebe - Profihost AGwrote: > >>> > how to fix bad key ordering? > >>> > >>> You should clarify does the FS in question mount (read-write? read-only?) > >>> and what are the kernel messages if it does not. > >>> > > -- Hugo Mills | Alert status mauve ocelot: Slight chance of hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea. http://carfax.org.uk/ | PGP: E2AB1DE4 | signature.asc Description: Digital signature
Re: runtime btrfsck
On Wed, 10 May 2017 09:48:07 +0200 Martin Steigerwaldwrote: > Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. > Indeed it is. It may or may not be possible to build a perfect Fsck, but IMO for the time being, what's most sorely missing, is some sort of a knowingly destructive repair mode, as in "I don't care about partial user data loss, just whack the FS metadata into full logical consistency at any means necessary". Also feels like it doesn't currently deal with the majority of actual in-real-world corruptions, notably the "parent transid failure" (even by a few dozens increments) which it can only helpfully "Ignore" during repair. So even with a minor corruption (something wonky in just ONE block of a multi-terabyte FS) the answer is way too often "nuke the entire thing and restore from backups". -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
Hi, Am 10.05.2017 um 09:48 schrieb Martin Steigerwald: > Stefan Priebe - Profihost AG - 10.05.17, 09:02: >> I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me >> something about the expected runtime or how to fix bad key ordering? > > I had a similar issue which remained unresolved. > But I clearly saw that btrfs check was running in a loop, see thread: > [4.9] btrfs check --repair looping over file extent discount errors > > So it would be interesting to see the exact output of btrfs check, maybe > there > is something like repeated numbers that also indicate a loop. Output is just: enabling repair mode Checking filesystem on /dev/mapper/crypt_md0 UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 bad key ordering 39 40 checking extents [.] even after 2,5 weeks running. Stefan > I was about to say that BTRFS is production ready before this issue happened. > I still think it for a lot of setup mostly is, as at least the "I get stuck > on > the CPU while searching for free space" issue seems to be gone since about > anything between 4.5/4.6 kernels. I also think so regarding absence of data > loss. I was able to copy over all of the data I needed of the broken > filesystem. > > Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. > > So unless someone has a clever idea here and shares it with you, it may be > needed to backup anything you can from this filesystem and then start over > from > scratch. As to my past experience something like xfs_repair surpasses btrfs > check in the ability to actually fix broken filesystem by a great extent. > > Ciao, > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
Am 10.05.2017 um 09:40 schrieb Hugo Mills: > On Wed, May 10, 2017 at 09:36:30AM +0200, Stefan Priebe - Profihost AG wrote: >> Hello Roman, >> >> the FS is mountable. It just goes readonly when trying to write some data. >> >> The kernel msgs are: >> BTRFS critical (device dm-2): corrupt leaf, bad key order: >> block=163316514816,root=1, slot=39 >> BTRFS critical (device dm-2): corrupt leaf, bad key order: >> block=163322413056,root=1, slot=74 >> BTRFS critical (device dm-2): corrupt leaf, bad key order: >> block=163325722624,root=1, slot=63 >> BTRFS critical (device dm-2): corrupt leaf, bad key order: >> block=163316514816,root=1, slot=39 >> BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure >> BTRFS info (device dm-2): forced readonly >> BTRFS info (device dm-2): delayed_refs has NO entry > >Can you show the output of btrfs-debug-tree -b , where > is each of the three "block=" values above? Can do that. But the lists are very long - should i upload them to pastebin? Is it ok to remove the name atribute - which provides filenames? Stefan >Hugo. > >> Greets, >> Stefan >> Am 10.05.2017 um 09:18 schrieb Roman Mamedov: >>> On Wed, 10 May 2017 09:02:46 +0200 >>> Stefan Priebe - Profihost AGwrote: >>> how to fix bad key ordering? >>> >>> You should clarify does the FS in question mount (read-write? read-only?) >>> and what are the kernel messages if it does not. >>> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
Stefan Priebe - Profihost AG - 10.05.17, 09:02: > I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me > something about the expected runtime or how to fix bad key ordering? I had a similar issue which remained unresolved. But I clearly saw that btrfs check was running in a loop, see thread: [4.9] btrfs check --repair looping over file extent discount errors So it would be interesting to see the exact output of btrfs check, maybe there is something like repeated numbers that also indicate a loop. I was about to say that BTRFS is production ready before this issue happened. I still think it for a lot of setup mostly is, as at least the "I get stuck on the CPU while searching for free space" issue seems to be gone since about anything between 4.5/4.6 kernels. I also think so regarding absence of data loss. I was able to copy over all of the data I needed of the broken filesystem. Yet, when it comes to btrfs check? Its still quite rudimentary if you ask me. So unless someone has a clever idea here and shares it with you, it may be needed to backup anything you can from this filesystem and then start over from scratch. As to my past experience something like xfs_repair surpasses btrfs check in the ability to actually fix broken filesystem by a great extent. Ciao, -- Martin -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
On Wed, May 10, 2017 at 09:36:30AM +0200, Stefan Priebe - Profihost AG wrote: > Hello Roman, > > the FS is mountable. It just goes readonly when trying to write some data. > > The kernel msgs are: > BTRFS critical (device dm-2): corrupt leaf, bad key order: > block=163316514816,root=1, slot=39 > BTRFS critical (device dm-2): corrupt leaf, bad key order: > block=163322413056,root=1, slot=74 > BTRFS critical (device dm-2): corrupt leaf, bad key order: > block=163325722624,root=1, slot=63 > BTRFS critical (device dm-2): corrupt leaf, bad key order: > block=163316514816,root=1, slot=39 > BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure > BTRFS info (device dm-2): forced readonly > BTRFS info (device dm-2): delayed_refs has NO entry Can you show the output of btrfs-debug-tree -b , where is each of the three "block=" values above? Hugo. > Greets, > Stefan > Am 10.05.2017 um 09:18 schrieb Roman Mamedov: > > On Wed, 10 May 2017 09:02:46 +0200 > > Stefan Priebe - Profihost AGwrote: > > > >> how to fix bad key ordering? > > > > You should clarify does the FS in question mount (read-write? read-only?) > > and what are the kernel messages if it does not. > > -- Hugo Mills | Welcome to Hollywood, a land just off the coast of hugo@... carfax.org.uk | Planet Earth http://carfax.org.uk/ | PGP: E2AB1DE4 |The Cat's Meow signature.asc Description: Digital signature
Re: runtime btrfsck
Hello Roman, the FS is mountable. It just goes readonly when trying to write some data. The kernel msgs are: BTRFS critical (device dm-2): corrupt leaf, bad key order: block=163316514816,root=1, slot=39 BTRFS critical (device dm-2): corrupt leaf, bad key order: block=163322413056,root=1, slot=74 BTRFS critical (device dm-2): corrupt leaf, bad key order: block=163325722624,root=1, slot=63 BTRFS critical (device dm-2): corrupt leaf, bad key order: block=163316514816,root=1, slot=39 BTRFS: error (device dm-2) in btrfs_drop_snapshot:8839: errno=-5 IO failure BTRFS info (device dm-2): forced readonly BTRFS info (device dm-2): delayed_refs has NO entry Greets, Stefan Am 10.05.2017 um 09:18 schrieb Roman Mamedov: > On Wed, 10 May 2017 09:02:46 +0200 > Stefan Priebe - Profihost AGwrote: > >> how to fix bad key ordering? > > You should clarify does the FS in question mount (read-write? read-only?) > and what are the kernel messages if it does not. > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
Hello, (Warning: I'm a user, not a developer here.) In my experience (on kernel 4.4) it processed larger and slower devices within a day, BUT according to some recent topics the runaway fragmentation (meaning in this case large number of small extents regardless of their relative location) can significantly slow down BTRFS operations to the point of making them infeasible. Possible reasons for fragmentations are snapshotting volumes too often and/or running VM images from BTRFS without taking some precautions. On top of this, mount device name makes one suspicious there's another layer between BTRFS and hardware. Are you sure it's not the bottleneck in this case? -- With Best Regards, Marat Khalili On 10/05/17 10:02, Stefan Priebe - Profihost AG wrote: I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me something about the expected runtime or how to fix bad key ordering? Greets, Stefan Am 06.05.2017 um 07:56 schrieb Stefan Priebe - Profihost AG: It's still running. Is this the normal behaviour? Is there any other way to fix the bad key ordering? Greets, Stefan Am 02.05.2017 um 08:29 schrieb Stefan Priebe - Profihost AG: Hello list, i wanted to check an fs cause it has bad key ordering. But btrfscheck is now running since 7 days. Current output: # btrfsck -p --repair /dev/mapper/crypt_md0 enabling repair mode Checking filesystem on /dev/mapper/crypt_md0 UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 bad key ordering 39 40 checking extents [O] FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should btrfsck run and is there any way to speed it up? btrfs tools is 4.8.5 Thanks! Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
On Wed, 10 May 2017 09:02:46 +0200 Stefan Priebe - Profihost AGwrote: > how to fix bad key ordering? You should clarify does the FS in question mount (read-write? read-only?) and what are the kernel messages if it does not. -- With respect, Roman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me something about the expected runtime or how to fix bad key ordering? Greets, Stefan Am 06.05.2017 um 07:56 schrieb Stefan Priebe - Profihost AG: > It's still running. Is this the normal behaviour? Is there any other way > to fix the bad key ordering? > > Greets, > Stefan > > Am 02.05.2017 um 08:29 schrieb Stefan Priebe - Profihost AG: >> Hello list, >> >> i wanted to check an fs cause it has bad key ordering. >> >> But btrfscheck is now running since 7 days. Current output: >> # btrfsck -p --repair /dev/mapper/crypt_md0 >> enabling repair mode >> Checking filesystem on /dev/mapper/crypt_md0 >> UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 >> bad key ordering 39 40 >> checking extents [O] >> >> FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should >> btrfsck run and is there any way to speed it up? btrfs tools is 4.8.5 >> >> Thanks! >> >> Greets, >> Stefan >> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: record_orphan_data_extent: Fix memory leak
The nodes of the list at btrfs_root->orphan_data_extents are never freed causing memory to be leaked. This commit fixes the bug by freeing the nodes on all the btrfs_root->orphan_data_extents list. Signed-off-by: Praveen K Pandey--- cmds-check.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index 0014bc2..cac94a8 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -9286,6 +9286,9 @@ static int check_extent_refs(struct btrfs_root *root, { struct extent_record *rec; struct cache_extent *cache; + struct btrfs_fs_info *fs_info; + struct btrfs_root *root_orphan_extent; + struct rb_node *rb_node; int ret = 0; int had_dups = 0; @@ -9439,6 +9442,26 @@ static int check_extent_refs(struct btrfs_root *root, rec->start + rec->max_size - 1); free(rec); } + /* +* Release memory of oprhan_data_extent +* which allocated while traversing in orphan_data_extents +*/ + + fs_info = root->fs_info; + rb_node = rb_first(_info->fs_root_tree); + while (rb_node) { + root_orphan_extent = container_of(rb_node, + struct btrfs_root, rb_node); + free_orphan_data_extents(_orphan_extent->orphan_data_extents); + rb_node = rb_next(rb_node); + } + free_orphan_data_extents(_info->extent_root->orphan_data_extents); + free_orphan_data_extents(_info->tree_root->orphan_data_extents); + free_orphan_data_extents(_info->chunk_root->orphan_data_extents); + free_orphan_data_extents(_info->dev_root->orphan_data_extents); + free_orphan_data_extents(_info->csum_root->orphan_data_extents); + if (fs_info->quota_enabled) + free_orphan_data_extents(_info->quota_root->orphan_data_extents); repair_abort: if (repair) { if (ret && ret != -EAGAIN) { -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html